DKPro Core™ Format Reference

This document describes the input and output formats supported by DKPro Core, together with their reader and writer components.

Overview

Table 1. Formats (62)
Format	Reader	Writer
AclAnthology	AclAnthologyReader	none
Ancora	AncoraReader	none
BinaryCas	BinaryCasReader	BinaryCasWriter
BlikiWikipedia	BlikiWikipediaReader	none
Bnc	BncReader	none
Brat	BratReader	BratWriter
Combination	CombinationReader	none
Conll2000	Conll2000Reader	Conll2000Writer
Conll2002	Conll2002Reader	Conll2002Writer
Conll2003	Conll2003Reader	Conll2003Writer
Conll2006	Conll2006Reader	Conll2006Writer
Conll2008	Conll2008Reader	Conll2008Writer
Conll2009	Conll2009Reader	Conll2009Writer
Conll2012	Conll2012Reader	Conll2012Writer
ConllU	ConllUReader	ConllUWriter
DiTop	none	DiTopWriter
Html	HtmlReader	none
ImsCwb	ImsCwbReader	ImsCwbWriter
InlineXml	none	InlineXmlWriter
Jdbc	JdbcReader	none
Json	none	JsonWriter
Lif	LifReader	LifWriter
Lxf	LxfReader	LxfWriter
MalletLdaTopicProportions	none	MalletLdaTopicProportionsWriter
MalletLdaTopicsProportionsSorted	none	MalletLdaTopicsProportionsSortedWriter
NYTCollection	NYTCollectionReader	none
NegraExport	NegraExportReader	none
Nif	NifReader	NifWriter
Pdf	PdfReader	none
PennTreebankChunked	PennTreebankChunkedReader	none
PennTreebankCombined	PennTreebankCombinedReader	PennTreebankCombinedWriter
RTF	RTFReader	none
Reuters21578Sgml	Reuters21578SgmlReader	none
Reuters21578Txt	Reuters21578TxtReader	none
SerializedCas	SerializedCasReader	SerializedCasWriter
Solr	none	SolrWriter
String	StringReader	none
TGrep	none	TGrepWriter
Tcf	TcfReader	TcfWriter
Tei	TeiReader	TeiWriter
Text	TextReader	TextWriter
TigerXml	TigerXmlReader	TigerXmlWriter
Tika	TikaReader	none
TokenizedText	none	TokenizedTextWriter
TuebaDZ	TuebaDZReader	none
Tuepp	TueppReader	none
Web1T	none	Web1TWriter
WikipediaArticle	WikipediaArticleReader	none
WikipediaArticleInfo	WikipediaArticleInfoReader	none
WikipediaDiscussion	WikipediaDiscussionReader	none
WikipediaLink	WikipediaLinkReader	none
WikipediaPage	WikipediaPageReader	none
WikipediaQuery	WikipediaQueryReader	none
WikipediaRevision	WikipediaRevisionReader	none
WikipediaRevisionPair	WikipediaRevisionPairReader	none
WikipediaTemplateFilteredArticle	WikipediaTemplateFilteredArticleReader	none
XcesBasicXml	XcesBasicXmlReader	XcesBasicXmlWriter
XcesXml	XcesXmlReader	XcesXmlWriter
Xmi	XmiReader	XmiWriter
Xml	XmlReader	none
XmlText	XmlTextReader	none
XmlXPath	XmlXPathReader	none

I/O components

ACL Anthology

AclAnthology

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.aclanthology-asl

Known corpora in this format

ACL Anthology Reference Corpus (ACL ARC)

AclAnthologyReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.aclanthology.AclAnthologyReader

Description

Reads the ACL anthology corpus and outputs CASes with plain text documents.

The reader tries to strip out hyphenation and replace problematic characters to produce a cleaned text. Otherwise, it is a plain text reader.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
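
The include/exclude convention of the `patterns` parameter can be illustrated with a minimal sketch. It assumes Ant-style semantics in which `**` spans directory levels and `*` matches within a single name, and it uses Java's built-in glob matcher as a stand-in; the class `PatternFilterSketch` is hypothetical and is not how DKPro Core resolves patterns internally.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core: applies [+]/[-] prefixed
// patterns to relative paths, using Java glob syntax as an approximation
// of Ant-style matching.
final class PatternFilterSketch {
    private final List<PathMatcher> includes = new ArrayList<>();
    private final List<PathMatcher> excludes = new ArrayList<>();

    PatternFilterSketch(String... patterns) {
        for (String p : patterns) {
            if (p.startsWith("[+]")) {
                includes.add(glob(p.substring(3)));
            } else if (p.startsWith("[-]")) {
                excludes.add(glob(p.substring(3)));
            } else {
                throw new IllegalArgumentException("Pattern needs a [+] or [-] prefix: " + p);
            }
        }
    }

    private static PathMatcher glob(String pattern) {
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    // A path is accepted if some include pattern matches and no exclude pattern does.
    boolean accepts(String relativePath) {
        Path path = Paths.get(relativePath);
        boolean included = includes.stream().anyMatch(m -> m.matches(path));
        boolean excluded = excludes.stream().anyMatch(m -> m.matches(path));
        return included && !excluded;
    }

    public static void main(String[] args) {
        PatternFilterSketch filter = new PatternFilterSketch("[+]**/*.txt", "[-]**/draft/**");
        System.out.println(filter.accepts("corpus/a/P00-1001.txt"));
        System.out.println(filter.accepts("corpus/draft/P00-1002.txt"));
    }
}
```

Note that Java globs and Ant patterns differ in corner cases (for example, the glob `**/*.txt` requires at least one directory separator), so the sketch only conveys the general idea.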

Table 2. Capabilities
Media types	text/plain
Outputs	DocumentMetaData

AnCora

Ancora

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.ancora-asl

AncoraReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.ancora.AncoraReader

Description

Reads the AnCora XML format.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
dropSentencesMissingPosTags	Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readLemma	Write lemma annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceLocation	Location from which the input is read. Optional — Type: String
splitMultiWordTokens	Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 3. Capabilities
Media types	application/x.org.dkpro.ancora+xml application/xml
Outputs	POS DocumentMetaData Lemma Sentence Token

brat file format

Brat

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.brat-asl

BratReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.brat.BratReader

Description

Reader for the brat format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
relationTypes	Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. `de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}`. Additionally, a subcategorization feature may be specified. Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}]`
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
textAnnotationTypes	Types that are text annotations. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Using this parameter is only necessary to specify a subcategorization feature. Otherwise, text annotation types are automatically detected. Type: String[] — Default value: `[]`
typeMappings	Optional — Type: String[] — Default value: `[]`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 4. Capabilities
Media types	none specified
Outputs	none specified

BratWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.brat.BratWriter

Description

Writer for the brat annotation format.

Known issues:

Brat is unable to read relation attributes created by this writer.
PARAM_TYPE_MAPPINGS not implemented yet

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
enableTypeMappings	Enable type mappings. Type: Boolean — Default value: `false`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
excludeTypes	Types that will not be written to the exported file. Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence]`
filenameExtension	Specify the suffix of output files. Default value `.ann`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.ann`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
palette	Colors to be used for the visual configuration that is generated for brat. Optional — Type: String[] — Default value: `[#8dd3c7, #ffffb3, #bebada, #fb8072, #80b1d3, #fdb462, #b3de69, #fccde5, #d9d9d9, #bc80bd, #ccebc5, #ffed6f]`
relationTypes	Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. `de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent`. Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent]`
shortAttributeNames	Whether to render attributes by their short name or by their qualified name. Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
spanTypes	Types that are text annotations (aka entities or spans). Type: String[] — Default value: `[]`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
textFilenameExtension	Specify the suffix of text output files. Default value `.txt`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.txt`
typeMappings	FIXME Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.ner.type.(\\w+) → $1]`
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeNullAttributes	Enable writing of features with null values. Type: Boolean — Default value: `false`
writeRelationAttributes	The brat web application can currently not handle attributes on relations, thus they are disabled by default. Here they can be enabled again. Type: Boolean — Default value: `false`
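
The `relationTypes` values follow the shape `TypeName:Arg1Feature:Arg2Feature`. The following is a minimal sketch of how such a value can be split into its three parts. The class `RelationSpecSketch` and its field names are hypothetical; this is not the parser used by the BratWriter, and it deliberately does not handle any optional suffixes.

```java
// Hypothetical value class, not DKPro Core's own parser: splits a relation
// type specification of the form "TypeName:Arg1Feature:Arg2Feature".
final class RelationSpecSketch {
    final String typeName;
    final String arg1Feature;
    final String arg2Feature;

    private RelationSpecSketch(String typeName, String arg1Feature, String arg2Feature) {
        this.typeName = typeName;
        this.arg1Feature = arg1Feature;
        this.arg2Feature = arg2Feature;
    }

    static RelationSpecSketch parse(String spec) {
        // Type names contain dots but no colons, so a plain split is sufficient here.
        String[] parts = spec.split(":");
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty() || parts[2].isEmpty()) {
            throw new IllegalArgumentException("Expected TypeName:Arg1:Arg2 but got: " + spec);
        }
        return new RelationSpecSketch(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        RelationSpecSketch spec = parse(
                "de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent");
        System.out.println(spec.typeName + " / " + spec.arg1Feature + " / " + spec.arg2Feature);
    }
}
```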

Table 5. Capabilities
Media types	none specified
Inputs	none specified

British National Corpus

Bnc

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.bnc-asl

Known corpora in this format

British National Corpus

BncReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bnc.BncReader

Description

Reader for the British National Corpus (XML version).

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 6. Capabilities
Media types	application/xml
Outputs	POS DocumentMetaData Lemma Sentence Token

Combination

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.combination-asl

CombinationReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.combination.CombinationReader

Description

Combines multiple readers into a single reader.

Parameters

readers

Type: String[]

Table 7. Capabilities
Media types	none specified
Outputs	none specified

CoNLL

Conll2000

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2000 format represents POS and chunk tags. Fields in a line are separated by spaces. Sentences are separated by a blank line.

Table 8. Columns
Column	Type	Description
FORM	Token	token
POSTAG	POS	part-of-speech tag
CHUNK	Chunk	chunk (IOB2 encoded, i.e. every chunk starts with a B prefix, as in the example below)

Example

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O
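
The layout described above (three whitespace-separated fields per line, blank line between sentences) can be read with a few lines of plain Java. This is a minimal sketch for illustration only; the class `Conll2000Sketch` and its `Row` type are hypothetical and unrelated to the Conll2000Reader implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the CoNLL 2000 layout, not the DKPro Core reader.
final class Conll2000Sketch {
    // One line of the format: FORM, POSTAG, CHUNK.
    static final class Row {
        final String form;
        final String posTag;
        final String chunk;

        Row(String form, String posTag, String chunk) {
            this.form = form;
            this.posTag = posTag;
            this.chunk = chunk;
        }
    }

    // Splits the input into sentences at blank lines; every other line
    // must consist of exactly three whitespace-separated fields.
    static List<List<Row>> parse(String text) {
        List<List<Row>> sentences = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            current.add(new Row(fields[0], fields[1], fields[2]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<List<Row>> sentences = parse("He PRP B-NP\nreckons VBZ B-VP\n\n. . O\n");
        for (List<Row> sentence : sentences) {
            for (Row row : sentence) {
                System.out.println(row.form + " -> " + row.posTag + ", " + row.chunk);
            }
        }
    }
}
```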

Table 9. Known corpora in this format
Corpus	Language
CoNLL 2000 Chunking Corpus	English
CoNLL 2000 Chunking Corpus (NLTK)	English

Conll2000Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Reader

Description

Reads the CoNLL 2000 chunking format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: `true`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 10. Capabilities
Media types	text/x.org.dkpro.conll-2000
Outputs	DocumentMetaData Sentence Token Chunk

Conll2000Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Writer

Description

Writes the CoNLL 2000 chunking format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeChunk	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`
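
As a rough illustration of how `stripExtension`, `escapeDocumentId` and `filenameExtension` could interact when an output file name is derived from a document ID, consider the following sketch. The class `TargetNameSketch` and the exact order of the steps are assumptions made for illustration; the writer's actual naming logic may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not the DKPro Core implementation: derives an output
// file name from a document ID under the assumptions stated above.
final class TargetNameSketch {
    static String targetName(String documentId, boolean stripExtension,
            boolean escapeDocumentId, String filenameExtension) {
        String name = documentId;
        if (stripExtension) {
            // Drop everything from the last dot onwards, e.g. "doc1.txt" becomes "doc1".
            int dot = name.lastIndexOf('.');
            if (dot > 0) {
                name = name.substring(0, dot);
            }
        }
        if (escapeDocumentId) {
            // URL-encode characters that are illegal in file names, such as ':' or '\'.
            try {
                name = URLEncoder.encode(name, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always available", e);
            }
        }
        return name + filenameExtension;
    }

    public static void main(String[] args) {
        System.out.println(targetName("news:2020.txt", true, true, ".conll"));
        System.out.println(targetName("doc1.txt", false, false, ".conll"));
    }
}
```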

Table 11. Capabilities
Media types	text/x.org.dkpro.conll-2000
Inputs	DocumentMetaData Sentence Token Chunk

Conll2002

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2002 format encodes named entity spans. Fields are separated by a single space. Sentences are separated by a blank line.

Table 12. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
NER	NamedEntity	named entity (IOB2 encoded)

Example

Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O

For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
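
To make the IOB2 encoding of the NER column concrete, the following minimal sketch turns a sequence of tags into entity spans over token indices. It assumes the usual IOB2 convention (every span starts with `B-`, continuation tokens carry `I-`, tokens outside any span carry `O`); the class `Iob2DecoderSketch` is hypothetical and not the decoder used by the reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of IOB2 decoding, not DKPro Core code.
final class Iob2DecoderSketch {
    // A labelled span over token indices; 'end' is exclusive.
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return label + "[" + begin + "," + end + ")";
        }
    }

    static List<Span> decode(List<String> tags) {
        List<Span> spans = new ArrayList<>();
        String label = null;
        int begin = -1;
        // Iterate one step past the end; the virtual "O" closes any open span.
        for (int i = 0; i <= tags.size(); i++) {
            String tag = i < tags.size() ? tags.get(i) : "O";
            boolean continues = label != null && tag.equals("I-" + label);
            if (label != null && !continues) {
                spans.add(new Span(label, begin, i));
                label = null;
            }
            // Leniently treat a stray I- tag without an open span as a span start.
            if (label == null && (tag.startsWith("B-") || tag.startsWith("I-"))) {
                label = tag.substring(2);
                begin = i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tags for "in Real Madrid ." from the example above.
        System.out.println(decode(Arrays.asList("O", "B-ORG", "I-ORG", "O")));
    }
}
```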

Table 13. Known corpora in this format
Corpus	Language
AQMAR Arabic Wikipedia Named Entity Corpus	Arabic
CoNLL 2002 dataset	Spanish
CoNLL 2002 dataset	Dutch

Conll2002Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Reader

Description

By default, reads the CoNLL 2002 named entity format.

The reader is also compatible with the CoNLL-based GermEval 2014 named entity format, in which the columns are separated by a tab, the token number is placed in the first column, and there is an extra column for embedded named entities (see below). Additional parameters control the column separator, whether there is an additional first column for token numbers, and whether the extra column for embedded named entities is present. (Note: Currently, the reader only reads the outer named entities, not the embedded ones.)


The following snippet shows an example of the TSV format:
# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1  Aufgrund          O           O
2  seiner            O           O
3  Initiative        O           O
4  fand              O           O
5  2001/2002         O           O
6  in                O           O
7  Stuttgart         B-LOC       O
8  ,                 O           O
9  Braunschweig      B-LOC       O
10 und               O           O
11 Bonn              B-LOC       O
12 eine              O           O
13 große             O           O
14 und               O           O
15 publizistisch     O           O
16 vielbeachtete     O           O
17 Troia-Ausstellung B-LOCpart   O
18 statt             O           O
19 ,                 O           O
20 „                 O           O
21 Troia             B-OTH       B-LOC
22 -                 I-OTH       O
23 Traum             I-OTH       O
24 und               I-OTH       O
25 Wirklichkeit      I-OTH       O
26 “                 O           O
27 .                 O           O

WORD_NUMBER - token number
FORM - token
NER1 - outer named entity (BIO encoded)
NER2 - embedded named entity (BIO encoded)

The sentence is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token. Name spans are encoded in the BIO-scheme. Outer spans are encoded in the third column, embedded spans in the fourth column.
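
The column layout described above can be read with a small amount of plain Java. The sketch below is for illustration only: the class `GermEvalSketch` and its nested types are hypothetical, it expects real tab characters between columns, and it is not the Conll2002Reader implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the GermEval 2014 TSV layout, not DKPro Core code.
final class GermEvalSketch {
    static final class Token {
        final int number;
        final String form;
        final String outerNer;
        final String embeddedNer;

        Token(int number, String form, String outerNer, String embeddedNer) {
            this.number = number;
            this.form = form;
            this.outerNer = outerNer;
            this.embeddedNer = embeddedNer;
        }
    }

    static final class Sentence {
        String header; // source and retrieval date from the '#' line, or null
        final List<Token> tokens = new ArrayList<>();
    }

    static List<Sentence> parse(String text) {
        List<Sentence> result = new ArrayList<>();
        Sentence current = null;
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                current = null; // a blank line ends the current sentence
                continue;
            }
            if (current == null) {
                current = new Sentence();
                result.add(current);
            }
            if (line.startsWith("#")) {
                current.header = line.substring(1).trim();
                continue;
            }
            String[] f = line.split("\t");
            if (f.length != 4) {
                throw new IllegalArgumentException("Expected 4 tab-separated fields: " + line);
            }
            current.tokens.add(new Token(Integer.parseInt(f[0]), f[1], f[2], f[3]));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Sentence> sentences = parse("#\thttp://example.org [2009-10-17]\n"
                + "1\tTroia\tB-OTH\tB-LOC\n2\t-\tI-OTH\tO\n");
        for (Token t : sentences.get(0).tokens) {
            System.out.println(t.number + " " + t.form + " " + t.outerNer + " " + t.embeddedNer);
        }
    }
}
```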

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
columnSeparator	Column separator parameter. Acceptable input values come from ColumnSeparators. Example usage: if you want to define 'tab' as the column separator the following value should be input for this parameter Conll2002Reader.ColumnSeparators.TAB.getName() Optional — Type: String — Default value: `space`
hasEmbeddedNamedEntity	Has embedded named entity extra column. Default: false Optional — Type: Boolean — Default value: `false`
hasHeader	Indicates that there is a header line before the sentence Optional — Type: Boolean — Default value: `false`
hasTokenNumber	Token number flag. When true, the first column contains the token number inside the sentence (as in GermEval 2014 format) Optional — Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: `true`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readNamedEntity	Read named entity information. Default: true Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 14. Capabilities
Media types	text/x.org.dkpro.conll-2002 text/x.org.dkpro.germeval-2014
Outputs	DocumentMetaData NamedEntity Sentence Token

Conll2002Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Writer

Description

Writes the CoNLL 2002 named entity format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeNamedEntity	Type: Boolean — Default value: `true`

Table 15. Capabilities
Media types	text/x.org.dkpro.conll-2002
Inputs	DocumentMetaData NamedEntity Sentence Token

Conll2003

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2003 format encodes named entity spans and chunk spans. Fields are separated by a single space. Sentences are separated by a blank line. Named entities and chunks are encoded in the IOB1 format, i.e. a B prefix is only used if a span immediately follows another span of the same category.

Table 16. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
POSTAG	POS	part-of-speech tag
CHUNK	Chunk	chunk (IOB1 encoded)
NER	NamedEntity	named entity (IOB1 encoded)

Example

U.N.         NNP  I-NP  I-ORG
official     NN   I-NP  O
Ekeus        NNP  I-NP  I-PER
heads        VBZ  I-VP  O
for          IN   I-PP  O
Baghdad      NNP  I-NP  I-LOC
.            .    O     O

For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
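
The difference between IOB1 and the more common IOB2 encoding is easiest to see in code. The following minimal sketch decodes IOB1 tags into spans over token indices, assuming the usual IOB1 convention: a span normally starts with an `I-` tag, and `B-` is only needed to separate a span from an immediately preceding span of the same category. The class `Iob1DecoderSketch` is hypothetical, expects well-formed tags, and is not the decoder used by the reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of IOB1 decoding, not DKPro Core code.
final class Iob1DecoderSketch {
    // A labelled span over token indices; 'end' is exclusive.
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return label + "[" + begin + "," + end + ")";
        }
    }

    static List<Span> decode(List<String> tags) {
        List<Span> spans = new ArrayList<>();
        String label = null;
        int begin = -1;
        // Iterate one step past the end; the virtual "O" closes any open span.
        for (int i = 0; i <= tags.size(); i++) {
            String tag = i < tags.size() ? tags.get(i) : "O";
            String type = tag.equals("O") ? null : tag.substring(2);
            // Only an I- tag of the same category continues the open span.
            boolean continues = label != null && tag.startsWith("I-") && label.equals(type);
            if (label != null && !continues) {
                spans.add(new Span(label, begin, i));
                label = null;
            }
            // Both I- and B- tags open a new span when none is open.
            if (label == null && type != null) {
                label = type;
                begin = i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // NER column of the example above.
        System.out.println(decode(Arrays.asList("I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O")));
        // B- separates two adjacent spans of the same category.
        System.out.println(decode(Arrays.asList("I-NP", "I-NP", "B-NP", "I-VP", "O")));
    }
}
```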

Table 17. Known corpora in this format
Corpus	Language
AQMAR Arabic Wikipedia Named Entity Corpus	Arabic
CoNLL 2002 dataset	Spanish
CoNLL 2002 dataset	Dutch

Conll2003Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2003Reader

Description

Reads the CoNLL 2003 format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: `true`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Type: Boolean — Default value: `true`
readNamedEntity	Read named entity information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 18. Capabilities
Media types	text/x.org.dkpro.conll-2003
Outputs	DocumentMetaData NamedEntity Sentence Token Chunk

Conll2003Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2003Writer

Description

Writes the CoNLL 2003 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeChunk	Type: Boolean — Default value: `true`
writeNamedEntity	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`

Table 19. Capabilities
Media types	text/x.org.dkpro.conll-2003
Inputs	DocumentMetaData NamedEntity Sentence Token Chunk

Conll2006

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2006 (aka CoNLL-X) format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank line.

Table 20. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
CPOSTAG	POS coarseValue	Coarse-grained part-of-speech tag, where the tagset depends on the language.
POSTAG	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
FEATS	MorphologicalFeatures	Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (`|`), or an underscore if not available.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PHEAD	ignored	Projective head of current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but are always available).
PDEPREL	ignored	Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Example

Heutzutage heutzutage ADV _ _ ADV _ _
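
As a rough illustration of the column layout in Table 20, the following Python sketch maps one tab-separated row onto the ten column names and splits FEATS on the vertical bar. The function name `parse_row` and the sample row are made up for this example and are not taken from any corpus or from the reader's implementation.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_row(line):
    """Map one tab-separated CoNLL-X row onto the column names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    # FEATS is a bar-separated set; an underscore means "not available".
    row["FEATS"] = [] if row["FEATS"] == "_" else row["FEATS"].split("|")
    return row


# Hypothetical row for illustration only.
row = parse_row("1\tHouses\thouse\tN\tNNS\tnum=pl|case=nom\t0\tROOT\t_\t_")
print(row["FORM"], row["FEATS"], row["HEAD"])
```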

Table 21. Known corpora in this format
Corpus	Language
CoNLL-X Shared Task free data	Danish, Dutch, Portuguese, and Swedish
Copenhagen Dependency Treebanks	Danish
FinnTreeBank (in recent versions with additional pseudo-XML metadata)	Finnish
Floresta Sintá(c)tica (Bosque-CoNLL)	Portuguese
Sequoia corpus	French
SETimes.HR corpus and dependency treebank of Croatian	Croatian
Składnica zależnościowa	Polish
Slovene Dependency Treebank	Slovene
Swedish Treebank	Swedish
Talbanken05	Swedish
Uppsala Persian Dependency Treebank	Persian (Farsi)
Norwegian Dependency Treebank (NDT)	Norwegian
IULA Resources. Corpus & Tools. IULA Spanish LSP Treebank	Spanish
Turin University Treebank	Italian

Conll2006Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Reader

Description

Reads a file in the CoNLL-2006 format (aka CoNLL-X).

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified that lacks such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readCPOS	Type: Boolean — Default value: `true`
readDependency	Type: Boolean — Default value: `true`
readLemma	Type: Boolean — Default value: `true`
readMorph	Type: Boolean — Default value: `true`
readPOS	Type: Boolean — Default value: `true`
sourceEncoding	Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useCPosAsPos	Enable to use CPOSTAG (column 4) as the part-of-speech tag. Otherwise POSTAG (column 5) is used. Type: Boolean — Default value: `false`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 22. Capabilities
Media types	text/x.org.dkpro.conll-2006
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2006Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Writer

Description

Writes a file in the CoNLL-2006 format (aka CoNLL-X).

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCPOS	Type: Boolean — Default value: `true`
writeDependency	Type: Boolean — Default value: `true`
writeLemma	Type: Boolean — Default value: `true`
writeMorph	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`

Table 23. Capabilities
Media types	text/x.org.dkpro.conll-2006
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2008

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2008 format targets syntactic and semantic dependencies. Columns are tab-separated. Sentences are separated by a blank line.

Table 24. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
GPOS	POS PosValue	Gold fine-grained part-of-speech tag, where the tagset depends on the language.
PPOS	ignored	Automatically predicted major POS by a language-specific tagger.
SPLIT_FORM	ignored	Tokens split at hyphens and slashes.
SPLIT_LEMMA	ignored	Predicted lemma of SPLIT_FORM.
PPOSS	ignored	Predicted POS tags of the split forms.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (`0`). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply `ROOT`.
PRED	SemPred	Sense identifier of the semantic predicate evoked by the current token.
APREDs	SemArg	Columns with argument labels for each semantic predicate (in the ID order).

Example

1   Some    some    DT  _   Some    some    DT  10  SBJ _   _   _   _   A1  _   _   _
2   of  of  IN  _   of  of  IN  1   NMOD    _   _   _   _   _   _   _   _
3   the the DT  _   the the DT  5   NMOD    _   _   _   _   _   _   _   _
4   strongest   strongest   JJS _   strongest   strong  JJS 5   NMOD    _   _   _   _   _   _   _   _
5   critics critics NNS _   critics critic  NNS 2   PMOD    critic.01   A0  _   _   _   _   _   _
6   of  of  IN  _   of  of  IN  5   NMOD    _   A1  _   _   _   _   _   _
7   our our PRP$    _   our our PRP$    9   NMOD    _   _   A1  A0  _   _   _   _
8   welfare welfare NN  _   welfare welfare NN  9   NMOD    welfare.01  _   A2  _   _   _   _   _
9   system  system  NN  _   system  system  NN  6   PMOD    system.01   _   _   _   _   _   _   _
10  are are VBP _   are be  VBP 0   ROOT    be.01   _   _   _   _   _   _   _
11  the the DT  _   the the DT  12  NMOD    _   _   _   _   _   _   _   _
12  people  people  NNS _   people  people  NNS 10  PRD person.02   _   _   _   A2  A0  A0  A1
13  who who WP  _   who who WP  14  SBJ _   _   _   _   _   _   _   _
14  have    have    VBP _   have    have    VBP 12  NMOD    have.04 _   _   _   _   SU  _   _
15  become  become  VBN _   become  become  VBN 14  VC  become.01   _   _   _   _   A1  A1  _
16  dependent   dependent   JJ  _   dependent   dependent   JJ  15  PRD _   _   _   _   _   _   _   _
17  on  on  IN  _   on  on  IN  16  AMOD    _   _   _   _   _   _   _   _
18  it  it  PRP _   it  it  PRP 17  PMOD    _   _   _   _   _   _   _   _
19  .   .   .   _   .   .   .   10  P   _   _   _   _   _   _   _   _
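
The APRED columns are easiest to read once they are paired with their predicates: the k-th APRED column belongs to the k-th token that has a PRED value. The Python sketch below shows that pairing under stated assumptions. The four-token sentence and its labels are invented for illustration, the function name `predicate_arguments` is hypothetical, and rows are split on whitespace only because the sample is aligned with spaces (real files are tab-separated).

```python
PRED = 10         # zero-based index of the PRED column
FIRST_APRED = 11  # APRED columns start right after PRED


def predicate_arguments(sentence):
    """Return {(token_id, predicate): [(label, argument_form), ...]}."""
    rows = [line.split() for line in sentence.strip().splitlines()]
    # Predicates in token order; the k-th APRED column refers to the k-th one.
    predicates = [(row[0], row[PRED]) for row in rows if row[PRED] != "_"]
    result = {key: [] for key in predicates}
    for row in rows:
        for k, label in enumerate(row[FIRST_APRED:]):
            if label != "_":
                result[predicates[k]].append((label, row[1]))
    return result


# Invented sentence with two predicates (tokens 2 and 4).
SAMPLE = """
1 Investors investor NNS _ Investors investor NNS 2 SBJ  _       A0 A0
2 want      want     VBP _ want      want     VBP 0 ROOT want.01 _  _
3 to        to       TO  _ to        to       TO  2 OPRD _       A1 _
4 sell      sell     VB  _ sell      sell     VB  3 IM   sell.01 _  _
"""

for key, args in predicate_arguments(SAMPLE).items():
    print(key, args)
```

Keying the result by token ID and sense keeps two predicates with the same sense identifier apart.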

Table 25. Known corpora in this format
Corpus	Language
MASC-CONLL	English

Conll2008Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2008Reader

Description

Reads a file in the CoNLL-2008 format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified that lacks such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Type: Boolean — Default value: `true`
readLemma	Type: Boolean — Default value: `true`
readPOS	Type: Boolean — Default value: `true`
readSemanticPredicate	Type: Boolean — Default value: `true`
sourceEncoding	Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 26. Capabilities
Media types	text/x.org.dkpro.conll-2008
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2008Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2008Writer

Description

Writes a file in the CoNLL-2008 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeDependency	Type: Boolean — Default value: `true`
writeLemma	Type: Boolean — Default value: `true`
writeMorph	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`
writeSemanticPredicate	Type: Boolean — Default value: `true`

Table 27. Capabilities
Media types	text/x.org.dkpro.conll-2008
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2009

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2009 format targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank line.

Table 28. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
PLEMMA	ignored	Automatically predicted lemma of FORM.
POS	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language.
PPOS	ignored	Automatically predicted major POS by a language-specific tagger.
FEATS	MorphologicalFeatures	Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (`|`), or an underscore if not available.
PFEAT	ignored	Automatically predicted morphological features (if applicable).
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (`0`). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
PHEAD	ignored	Automatically predicted syntactic head.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply `ROOT`.
PDEPREL	ignored	Automatically predicted dependency relation to PHEAD.
FILLPRED	ignored	Contains `Y` for argument-bearing tokens.
PRED	SemPred	Sense identifier of the semantic predicate evoked by the current token.
APREDs	SemArg	Columns with argument labels for each semantic predicate (in the ID order).

Example

1   The the the DT  DT  _   _   4   4   NMOD    NMOD    _   _   _   _
2   most    most    most    RBS RBS _   _   3   3   AMOD    AMOD    _   _   _   _
3   troublesome troublesome troublesome JJ  JJ  _   _   4   4   NMOD    NMOD    _   _   _   _
4   report  report  report  NN  NN  _   _   5   5   SBJ SBJ _   _   _   _
5   may may may MD  MD  _   _   0   0   ROOT    ROOT    _   _   _   _
6   be  be  be  VB  VB  _   _   5   5   VC  VC  _   _   _   _
7   the the the DT  DT  _   _   11  11  NMOD    NMOD    _   _   _   _
8   August  august  august  NNP NNP _   _   11  11  NMOD    NMOD    _   _   _   AM-TMP
9   merchandise merchandise merchandise NN  NN  _   _   10  10  NMOD    NMOD    _   _   A1  _
10  trade   trade   trade   NN  NN  _   _   11  11  NMOD    NMOD    Y   trade.01    _   A1
11  deficit deficit deficit NN  NN  _   _   6   6   PRD PRD Y   deficit.01  _   A2
12  due due due JJ  JJ  _   _   13  11  AMOD    APPO    _   _   _   _
13  out out out IN  IN  _   _   11  12  APPO    AMOD    _   _   _   _
14  tomorrow    tomorrow    tomorrow    NN  NN  _   _   13  12  TMP TMP _   _   _   _
15  .   .   .   .   .   _   _   5   5   P   P   _   _   _   _
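
The format carries gold and predicted dependency columns side by side (HEAD/PHEAD and DEPREL/PDEPREL). As a small sketch of how the two can be compared, the Python snippet below lists the tokens whose predicted head or relation differs from the gold one. It uses rows 11 to 14 of the example above; the function name `disagreements` is hypothetical, and whitespace splitting is used only because the sample is space-aligned.

```python
HEAD, PHEAD, DEPREL, PDEPREL = 8, 9, 10, 11  # zero-based column indices


def disagreements(rows):
    """Return (ID, FORM) of tokens where predicted head or relation differs from gold."""
    differing = []
    for line in rows:
        cols = line.split()
        if cols[HEAD] != cols[PHEAD] or cols[DEPREL] != cols[PDEPREL]:
            differing.append((cols[0], cols[1]))
    return differing


ROWS = [
    "11 deficit deficit deficit NN NN _ _ 6 6 PRD PRD Y deficit.01 _ A2",
    "12 due due due JJ JJ _ _ 13 11 AMOD APPO _ _ _ _",
    "13 out out out IN IN _ _ 11 12 APPO AMOD _ _ _ _",
    "14 tomorrow tomorrow tomorrow NN NN _ _ 13 12 TMP TMP _ _ _ _",
]

print(disagreements(ROWS))
# [('12', 'due'), ('13', 'out'), ('14', 'tomorrow')]
```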

Table 29. Known corpora in this format
Corpus	Language
CoNLL 2009 Shared Task	Catalan, German, Japanese, Spanish

Conll2009Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Reader

Description

Reads a file in the CoNLL-2009 format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified that lacks such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Type: Boolean — Default value: `true`
readLemma	Type: Boolean — Default value: `true`
readMorph	Type: Boolean — Default value: `true`
readPOS	Type: Boolean — Default value: `true`
readSemanticPredicate	Type: Boolean — Default value: `true`
sourceEncoding	Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 30. Capabilities
Media types	text/x.org.dkpro.conll-2009
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2009Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Writer

Description

Writes a file in the CoNLL-2009 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeDependency	Type: Boolean — Default value: `true`
writeLemma	Type: Boolean — Default value: `true`
writeMorph	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`
writeSemanticPredicate	Type: Boolean — Default value: `true`

Table 31. Capabilities
Media types	text/x.org.dkpro.conll-2009
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2012

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2012 format targets semantic role labeling and coreference. Columns are tab-separated. Sentences are separated by a blank line.

Table 32. Columns
Column	Type/Feature	Description
Document ID	ignored	This is a variation on the document filename.
Part number	ignored	Some files are divided into multiple parts numbered as 000, 001, 002, … etc.
Word number	ignored	
Word itself	document text	This is the token as segmented/tokenized in the Treebank. Initially, the `*_skel` files contain the placeholder `[WORD]`, which is replaced by the actual token from the Treebank that is part of the OntoNotes release.
Part-of-Speech	POS
Parse bit	Constituent	This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a `*`. The full parse can be created by substituting the asterisk with the `([pos] [word])` string (or leaf) and concatenating the items in the rows of that column.
Predicate lemma	Lemma	The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-".
Predicate Frameset ID	SemPred	This is the PropBank frameset ID of the predicate in Column 7.
Word sense	ignored	This is the word sense of the word in Column 4.
Speaker/Author	ignored	This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
Named Entities	NamedEntity	This column identifies the spans representing various named entities.
Predicate Arguments	SemPred	There is one column each of predicate argument structure information for the predicate mentioned in Column 7.
Coreference	CoreferenceChain	Coreference chain information encoded in a parenthesis structure.
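
The parenthesis structure of the Coreference column can be decoded token by token: `(1)` marks a one-token mention of chain 1, `(2` opens a mention of chain 2, `2)` closes it, and `-` means no mention boundary. The Python sketch below turns such a column into token spans per chain. The function name `coreference_spans` is hypothetical, and the handling of several mentions on one token, separated by `|`, is an assumption that is not stated in the table above.

```python
import re
from collections import defaultdict


def coreference_spans(column):
    """Turn per-token coreference cells into {chain_id: [(start, end), ...]}."""
    open_starts = defaultdict(list)  # chain id -> stack of open start indices
    spans = defaultdict(list)
    for index, cell in enumerate(column):
        if cell == "-":
            continue
        # Assumption: several mentions on one token are separated by "|".
        for part in cell.split("|"):
            chain = re.sub(r"[()]", "", part)
            if part.startswith("("):
                open_starts[chain].append(index)
            if part.endswith(")"):
                spans[chain].append((open_starts[chain].pop(), index))
    return dict(spans)


# Coreference cells for "John went to the market ."
print(coreference_spans(["(1)", "-", "-", "(2", "2)", "-"]))
# {'1': [(0, 0)], '2': [(3, 4)]}
```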

Example

en-orig.conll   0   0       John   NNP   (TOP(S(NP*)      john   -   -          -   (PERSON)       (A0) (1)
en-orig.conll   0   1       went   VBD         (VP*         go go.02   -          -         *        (V*) -
en-orig.conll   0   2         to    TO         (PP*         to   -   -          -         *          *  -
en-orig.conll   0   3        the    DT         (NP*        the   -   -          -         *          *  (2
en-orig.conll   0   4     market    NN          *)))    market   -   -          -         *        (A1) 2)
en-orig.conll   0   5          .     .           *))         .   -   -          -         *          *  -
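
The reconstruction described for the Parse bit column can be sketched in a few lines of Python: each asterisk is replaced with the `([pos] [word])` leaf and the pieces are concatenated in row order. The function name `rebuild_parse` is hypothetical; the input values are the Word, Part-of-Speech and Parse bit cells of the example above.

```python
def rebuild_parse(tokens):
    """tokens: list of (word, pos, parse_bit) triples for one sentence."""
    return "".join(
        bit.replace("*", f"({pos} {word})") for word, pos, bit in tokens
    )


SENTENCE = [
    ("John", "NNP", "(TOP(S(NP*)"),
    ("went", "VBD", "(VP*"),
    ("to", "TO", "(PP*"),
    ("the", "DT", "(NP*"),
    ("market", "NN", "*)))"),
    (".", ".", "*))"),
]

print(rebuild_parse(SENTENCE))
# (TOP(S(NP(NNP John))(VP(VBD went)(PP(TO to)(NP(DT the)(NN market))))(. .)))
```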

Conll2012Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Reader

Description

Reads a file in the CoNLL-2012 format.

Parameters

ConstituentMappingLocation	Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ConstituentTagSet	Use this constituent tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified that lacks such metadata, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified that lacks such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid filling the heap with thousands of strings that represent only a few distinct tags. Optional — Type: Boolean — Default value: `true`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readConstituent	Type: Boolean — Default value: `true`
readCoreference	Type: Boolean — Default value: `true`
readLemma	Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates. Type: Boolean — Default value: `false`
readNamedEntity	Type: Boolean — Default value: `true`
readPOS	Type: Boolean — Default value: `true`
readSemanticPredicate	Type: Boolean — Default value: `true`
readWordSense	Type: Boolean — Default value: `true`
sourceEncoding	Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
useHeaderMetadata	Use the document ID declared in the file header instead of using the filename. Type: Boolean — Default value: `true`
writeTracesToText	Optional — Type: Boolean — Default value: `false`

Table 33. Capabilities
Media types	text/x.org.dkpro.conll-2012
Outputs	POS DocumentMetaData Lemma Sentence Token SemArg SemPred

Conll2012Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Writer

Description

Writer for the CoNLL-2012 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeLemma	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`
writeSemanticPredicate	Type: Boolean — Default value: `true`

Table 34. Capabilities
Media types	text/x.org.dkpro.conll-2012
Inputs	POS DocumentMetaData Lemma Sentence Token SemArg SemPred

ConllU

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL-U format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank line.

Table 35. Columns
Column	Type/Feature	Description
ID	ignored	Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma or stem of word form.
CPOSTAG	POS coarseValue	Part-of-speech tag from the universal POS tag set.
POSTAG	POS PosValue	Language-specific part-of-speech tag; underscore if not available.
FEATS	MorphologicalFeatures	List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (0).
DEPREL	Dependency	Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
DEPS	Dependency	List of secondary dependencies (head-deprel pairs).
MISC	unused	Any other annotation.

Example

1   They    they    PRON    PRP Case=Nom|Number=Plur    2   nsubj   4:nsubj _
2   buy buy VERB    VB  Number=Plur|Person=3|Tense=Pres 0   root    _   _
3   and and CONJ    CC  _   2   cc  _   _
4   sell    sell    VERB    VB  Number=Plur|Person=3|Tense=Pres 2   conj    0:root  _
5   books   book    NOUN    NNS Number=Plur 2   dobj    4:dobj  SpaceAfter=No
6   .   .   PUNCT   .   _   2   punct   _   _
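
The column layout from the table above can be illustrated with a minimal sketch in plain Java. This is not the ConllUReader implementation: the class name `ConllULineSketch` and its methods are hypothetical, and the sketch assumes that the fields of a token line are separated by single tab characters.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DKPro Core: splits one CoNLL-U token line
// into its ten tab-separated columns.
final class ConllULineSketch {
    static final List<String> COLUMNS = Arrays.asList("ID", "FORM", "LEMMA", "CPOSTAG",
            "POSTAG", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC");

    // Returns the ten column values; the -1 limit keeps trailing empty fields.
    static String[] split(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != COLUMNS.size()) {
            throw new IllegalArgumentException(
                    "Expected " + COLUMNS.size() + " columns but found " + fields.length);
        }
        return fields;
    }

    // Looks up a value by its column name.
    static String get(String[] fields, String column) {
        int index = COLUMNS.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return fields[index];
    }

    // An underscore marks a value that is not available.
    static boolean isUnset(String value) {
        return "_".equals(value);
    }

    public static void main(String[] args) {
        String[] fields = split("3\tand\tand\tCONJ\tCC\t_\t2\tcc\t_\t_");
        System.out.println(get(fields, "FORM") + " -> head " + get(fields, "HEAD"));
    }
}
```

Addressing columns by name keeps the sketch readable; a real reader would additionally have to handle blank lines between sentences and ID ranges for multi-word tokens.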

Table 36. Known corpora in this format
Corpus	Language
Universal Dependency Treebank	Ancient Greek (to 1453), Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Modern Greek (1453-), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Norwegian, Church Slavic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil, Catalan, Chinese, Galician, Kazakh, Latvian, Russian, Turkish

ConllUReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.ConllUReader

Description

Reads a file in the CoNLL-U format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readCPOS	Type: Boolean — Default value: `true`
readDependency	Type: Boolean — Default value: `true`
readLemma	Type: Boolean — Default value: `true`
readMorph	Type: Boolean — Default value: `true`
readPOS	Type: Boolean — Default value: `true`
sourceEncoding	Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useCPosAsPos	Type: Boolean — Default value: `false`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
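
The prefix convention of the patterns parameter can be sketched in plain Java as follows. The class `PatternSplitSketch` is hypothetical and not DKPro Core code. Treating a pattern without a prefix as an include pattern is an assumption made for this sketch; the parameter description does not state what happens in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not DKPro Core code: sorts pattern strings into
// include and exclude lists based on their prefix.
final class PatternSplitSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    final List<String> includes = new ArrayList<String>();
    final List<String> excludes = new ArrayList<String>();

    PatternSplitSketch(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.startsWith(EXCLUDE_PREFIX)) {
                excludes.add(pattern.substring(EXCLUDE_PREFIX.length()));
            } else if (pattern.startsWith(INCLUDE_PREFIX)) {
                includes.add(pattern.substring(INCLUDE_PREFIX.length()));
            } else {
                // Assumption: a pattern without a prefix counts as an include.
                includes.add(pattern);
            }
        }
    }

    public static void main(String[] args) {
        PatternSplitSketch split = new PatternSplitSketch("[+]*.conll", "[-]*.bak");
        System.out.println("includes=" + split.includes + " excludes=" + split.excludes);
    }
}
```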

Table 37. Capabilities
Media types	text/x.org.dkpro.conll-u
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

ConllUWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.ConllUWriter

Description

Writes a file in the CoNLL-U format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCPOS	Type: Boolean — Default value: `true`
writeDependency	Type: Boolean — Default value: `true`
writeLemma	Type: Boolean — Default value: `true`
writeMorph	Type: Boolean — Default value: `true`
writePOS	Type: Boolean — Default value: `true`
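
What escapeDocumentId does to the characters mentioned above can be illustrated with the standard `java.net.URLEncoder`. This is a sketch under assumptions: the class `TargetNameSketch` and its `fileName` method are hypothetical, and how the writer actually combines the document ID with the extension is not described here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not DKPro Core code: derives a file name from a
// document ID and an extension.
final class TargetNameSketch {
    // URL-encodes the document ID if requested, then appends the extension.
    static String fileName(String documentId, String extension, boolean escape) {
        try {
            String base = escape ? URLEncoder.encode(documentId, "UTF-8") : documentId;
            return base + extension;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The colon and the backslash are replaced by percent escapes.
        System.out.println(fileName("doc:1\\a", ".conll", true));
    }
}
```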

Table 38. Capabilities
Media types	text/x.org.dkpro.conll-u
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Ditop

DiTop

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.ditop-asl

DiTopWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.ditop.DiTopWriter

Description

This annotator (consumer) writes output files as required by DiTop. It requires JCas input annotated by de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer using the same model.

Parameters

appendConfig	If set to true, the new corpus will be appended to an existing config file. If false, the existing file is overwritten. Default: true. Type: Boolean — Default value: `true`
collectionValues	If set, only documents with one of the listed collection IDs are written, all others are ignored. If this is empty (null), all documents are written. Optional — Type: String[]
collectionValuesExactMatch	If true (default), only write documents whose collection ID matches one of the collection values exactly. If false, write documents whose collection ID contains any of the collection values, ignoring case. Type: Boolean — Default value: `true`
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
corpusName	The corpus name is used to name the corresponding sub-directory and will be set in the configuration file. Type: String
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
maxTopicWords	The maximum number of topic words to extract. Default: 15 Type: Integer — Default value: `15`
modelLocation	A Mallet file storing a serialized ParallelTopicModel. Type: String
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Directory in which to store output files. Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 39. Capabilities
Media types	application/x.org.dkpro.ditop
Inputs	DocumentMetaData TopicDistribution

HTML

Html

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.html-asl

HtmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.html.HtmlReader

Description

Reads the contents of a given URL and strips the HTML. Returns the textual contents. Also recognizes headings and paragraphs.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 40. Capabilities
Media types	application/xhtml+xml text/html
Outputs	DocumentMetaData Heading Paragraph

IMS Corpus Workbench

ImsCwb

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl

The IMS Open Corpus Workbench is a linguistic search engine. It uses a tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees). If a local installation of the corpus workbench is available, it can be used by this module to immediately generate the corpus workbench index format. Search is not supported by this module.

See also

IMS Open Corpus Workbench

Known corpora in this format

WaCky - The Web-As-Corpus Kool Yinitiative - corpora crawled from the world wide web in several different languages (DeWaC, UkWaC, ItWaC, etc.)

ImsCwbReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbReader

Description

Reads a tab-separated format including pseudo-XML tags.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Specify which tag set should be used to locate the mapping file. Optional — Type: String
generateNewIds	If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false) Type: Boolean — Default value: `false`
idIsUrl	If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS (Default: false) Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readLemma	Read lemmas. Default: true Type: Boolean — Default value: `true`
readPOS	Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used. Default: true Type: Boolean — Default value: `true`
readSentence	Read sentences. Default: true Type: Boolean — Default value: `true`
readToken	Read tokens and generate Token annotations. Default: true Type: Boolean — Default value: `true`
replaceNonXml	Replace non-XML characters with spaces. (Default: true) Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 41. Capabilities
Media types	text/x.org.dkpro.imscwb
Outputs	POS DocumentMetaData Lemma Sentence Token

ImsCwbWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter

Description

This Consumer outputs the content of all CASes into the IMS workbench format. This writer produces a text file which needs to be converted to the binary IMS CWB index files using the command line tools that come with the CWB. It is possible to set the parameter #PARAM_CQP_HOME to directly create output in the native binary CQP format via the original CWB command line tools.

Parameters

additionalFeatures	Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames. Optional — Type: String[]
corpusName	The name of the generated corpus. Type: String — Default value: `corpus`
cqpCompress	Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. For large data sets, it is best to test which setting works better. (default: false) Type: Boolean — Default value: `false`
cqpHome	Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to encode directly into the CQP binary format. Optional — Type: String
cqpwebCompatibility	Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore. Type: Boolean — Default value: `false`
sentenceTag	Type: String — Default value: `s`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Location to which the output is written. Type: String
writeCPOS	Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag. Type: Boolean — Default value: `false`
writeDocId	Write the document ID for each token. It is usually a better idea to generate a #PARAM_WRITE_DOCUMENT_TAG document tag or a #PARAM_WRITE_TEXT_TAG text tag which also contain the document ID that can be queried in CQP. Type: Boolean — Default value: `false`
writeDocumentTag	Write a pseudo-XML tag with the name document to mark the start and end of a document. Type: Boolean — Default value: `false`
writeLemma	Write lemmata. Type: Boolean — Default value: `true`
writeOffsets	Write the start and end position of each token. Type: Boolean — Default value: `false`
writePOS	Write part-of-speech tags. Type: Boolean — Default value: `true`
writeTextTag	Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb. Type: Boolean — Default value: `true`
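
The constraint behind cqpwebCompatibility (IDs consisting only of letters, numbers and underscore) can be illustrated with a short sketch. The class `CqpWebIdSketch` is hypothetical, and replacing every other character with an underscore is an assumption made for illustration; the parameter description does not say how the writer makes IDs compatible. The sketch also assumes that "letters" means ASCII letters.

```java
// Hypothetical helper, not DKPro Core code: turns an arbitrary document ID
// into one that contains only ASCII letters, digits and underscores.
final class CqpWebIdSketch {
    // Assumption: each disallowed character is replaced by one underscore.
    static String sanitize(String documentId) {
        return documentId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    // Checks whether an ID already satisfies the constraint.
    static boolean isCompatible(String documentId) {
        return documentId.matches("[A-Za-z0-9_]+");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("doc-1.txt"));
    }
}
```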

Table 42. Capabilities
Media types	text/x.org.dkpro.imscwb
Inputs	POS DocumentMetaData Lemma Sentence Token

JDBC

Jdbc

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jdbc-asl

JdbcReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jdbc.JdbcReader

Description

Collection reader for JDBC databases. The obtained data is written into the CAS document text as well as into fields of the DocumentMetaData annotation.

The field names are available as constants and begin with CAS_. Please specify the mapping of the columns and the field names in the query. For example,

SELECT text AS cas_text, title AS cas_metadata_title FROM test_table

will create a CAS for each record, write the content of "text" column into CAS document text and that of "title" column into the document title field of the DocumentMetaData annotation.
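
The column-to-field mapping in the query can also be assembled programmatically. The following is a minimal sketch in plain Java; the class `JdbcQuerySketch` and its `buildQuery` method are hypothetical and not part of the JdbcReader. The aliases `cas_text` and `cas_metadata_title` are taken from the example above. The sketch does not quote or validate identifiers, so it should only be fed trusted column and table names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not DKPro Core code: builds a SELECT statement that
// aliases database columns to CAS field names.
final class JdbcQuerySketch {
    // Keys are database columns, values are the CAS field aliases.
    static String buildQuery(String table, Map<String, String> columnToField) {
        StringJoiner select = new StringJoiner(", ");
        for (Map.Entry<String, String> entry : columnToField.entrySet()) {
            select.add(entry.getKey() + " AS " + entry.getValue());
        }
        return "SELECT " + select + " FROM " + table;
    }

    public static void main(String[] args) {
        // A LinkedHashMap keeps the columns in insertion order.
        Map<String, String> mapping = new LinkedHashMap<String, String>();
        mapping.put("text", "cas_text");
        mapping.put("title", "cas_metadata_title");
        System.out.println(buildQuery("test_table", mapping));
    }
}
```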

Parameters

connection	Specifies the URL to the database. If used with uimaFIT and the value is not given, `jdbc:mysql://127.0.0.1/` will be taken. Do not use this parameter to add additional parameters, but use #PARAM_CONNECTION_PARAMS instead. Type: String — Default value: `jdbc:mysql://127.0.0.1/`
connectionParams	Add additional parameters for the connection URL here in a single string: [&propertyName1=propertyValue1[&propertyName2=propertyValue2]...]. Type: String — Default value: ``
database	Specifies name of the database to be accessed. Type: String
driver	Specify the class name of the JDBC driver. If used with uimaFIT and the value is not given, `com.mysql.cj.jdbc.Driver` will be taken. Type: String — Default value: `com.mysql.cj.jdbc.Driver`
language	Specifies the language. Optional — Type: String
password	Specifies the password for database access. Type: String
query	Specifies the query. Type: String
user	Specifies the user name for database access. Type: String

Table 43. Capabilities
Media types	none specified
Outputs	DocumentMetaData

LIF

Lif

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.lif-asl

LifReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.lif.LifReader

Description

Reader for the LIF format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 44. Capabilities
Media types	application/x.org.dkpro.lif+json
Outputs	DocumentMetaData NamedEntity Paragraph Sentence Token Constituent Dependency

LifWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.lif.LifWriter

Description

Writer for the LIF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.json`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.json`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 45. Capabilities
Media types	application/x.org.dkpro.lif+json
Inputs	DocumentMetaData NamedEntity Paragraph Sentence Token Constituent Dependency

LXF

Lxf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-lxf-asl

LxfReader

Implementation

org.dkpro.core.io.lxf.LxfReader

Description

Reader for the LXF format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 46. Capabilities
Media types	application/x.org.dkpro.lxf+json
Outputs	POS DocumentMetaData Lemma Sentence Token Dependency

LxfWriter

Implementation

org.dkpro.core.io.lxf.LxfWriter

Description

Writer for the LXF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
delta	Type: Boolean — Default value: `false`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.lxf`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.lxf`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 47. Capabilities
Media types	application/x.org.dkpro.lxf+json
Inputs	POS DocumentMetaData Lemma Sentence Token Dependency

Mallet

MalletLdaTopicProportions

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletLdaTopicProportionsWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.io.MalletLdaTopicProportionsWriter

Description

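Write topic proportions to a file, one line per document, in the shape `[<docId>\t]<proportion_1>\t<proportion_2>\t...`, where the leading document ID is optional.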

This writer depends on the TopicDistribution annotation which needs to be created by MalletLdaTopicModelInferencer before.
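
The line layout can be sketched in plain Java as follows. The class `TopicLineSketch` is hypothetical and not the writer's implementation. It assumes that the proportions are tab-separated and that the document ID, when written, comes first; the number formatting used by the actual writer is not specified here, so the sketch simply uses `Double.toString`.

```java
import java.util.StringJoiner;

// Hypothetical helper, not DKPro Core code: formats one output line from a
// document ID and its topic proportions.
final class TopicLineSketch {
    static String formatLine(String docId, double[] proportions, boolean writeDocId) {
        StringJoiner line = new StringJoiner("\t");
        if (writeDocId) {
            // The document ID is the optional first field.
            line.add(docId);
        }
        for (double proportion : proportions) {
            line.add(Double.toString(proportion));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("doc1", new double[] { 0.7, 0.2, 0.1 }, true));
    }
}
```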

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	If #PARAM_SINGULAR_TARGET is set to false (default), this extension will be appended to the output files. Default: .topics. Type: String — Default value: `.topics`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeDocid	If set to true (default), each output line is preceded by the document id. Type: Boolean — Default value: `true`

Table 48. Capabilities
Media types	none specified
Inputs	none specified

MalletLdaTopicsProportionsSorted

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletLdaTopicsProportionsSortedWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.io.MalletLdaTopicsProportionsSortedWriter

Description

Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletLdaTopicModelInferencer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
nTopics	Type: Integer — Default value: `3`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 49. Capabilities
Media types	none specified
Inputs	none specified

NEGRA

NegraExport

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.negra-asl

NegraExportReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader

Description

This CollectionReader reads a file in the NEGRA export format. The texts and additional information, such as the constituent structure, are reproduced in CASes, one CAS per text (article).

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
collectionId	The collection ID to be written to the document meta data. (Default: none) Optional — Type: String
documentUnit	Determines when a new CAS is started. E.g., if set to ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. (Default: ORIGIN_NAME) Type: String — Default value: `ORIGIN_NAME`
generateNewIds	If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false) Type: Boolean — Default value: `false`
language	The language. Optional — Type: String
readLemma	Read lemma information. Default: true Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Default: true Type: Boolean — Default value: `true`
readPennTree	Generate Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set. Default: false Type: Boolean — Default value: `false`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Type: String
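
The behaviour described for documentUnit (a new CAS starts whenever the origin name changes from one sentence to the next) can be sketched in plain Java. The class `DocumentUnitSketch` is hypothetical and not the NegraExportReader implementation. Representing a sentence as a two-element array of origin name and text is an assumption made only for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not DKPro Core code: groups consecutive sentences
// that share an origin name into one document.
final class DocumentUnitSketch {
    // Each entry is {originName, sentenceText}. A new group starts whenever
    // the origin name differs from that of the previous sentence.
    static List<List<String>> group(List<String[]> sentences) {
        List<List<String>> documents = new ArrayList<List<String>>();
        String lastOrigin = null;
        for (String[] sentence : sentences) {
            if (documents.isEmpty() || !sentence[0].equals(lastOrigin)) {
                documents.add(new ArrayList<String>());
            }
            documents.get(documents.size() - 1).add(sentence[1]);
            lastOrigin = sentence[0];
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String[]> sentences = Arrays.asList(
                new String[] { "article1", "First sentence." },
                new String[] { "article1", "Second sentence." },
                new String[] { "article2", "Third sentence." });
        System.out.println(group(sentences).size() + " documents");
    }
}
```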

Table 50. Capabilities
Media types	application/x.org.dkpro.negra3 application/x.org.dkpro.negra4
Outputs	POS DocumentMetaData Lemma Sentence Token Constituent

New York Times Corpus

NYTCollection

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	dkpro-core-io-nyt-asl

NYTCollectionReader

Implementation

org.dkpro.core.io.nyt.NYTCollectionReader

Description

Reader for the New York Times corpus.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
offset	A number of documents which will be skipped at the beginning. Optional — Type: Integer
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 51. Capabilities
Media types	none specified
Outputs	none specified
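
The include/exclude semantics of the patterns parameter can be illustrated without any DKPro Core dependency. The following plain-Java sketch assumes the usual Ant conventions (`**/` spans any number of directories, `*` matches within one path segment) and assumes that a file is accepted when at least one include pattern matches and no exclude pattern matches. The class and method names are hypothetical, and the actual reader implementation may differ in details.

```java
import java.util.regex.Pattern;

class AntPatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    // Translates a simplified Ant-style pattern into a regular expression.
    static Pattern toRegex(String antPattern) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < antPattern.length()) {
            if (antPattern.startsWith("**/", i)) {
                regex.append("(?:.*/)?"); // zero or more directories
                i += 3;
            } else if (antPattern.startsWith("**", i)) {
                regex.append(".*"); // anything, including slashes
                i += 2;
            } else if (antPattern.charAt(i) == '*') {
                regex.append("[^/]*"); // part of a single name
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(antPattern.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Accepts a path if some include pattern matches and no exclude pattern matches.
    static boolean accept(String relativePath, String... patterns) {
        boolean included = false;
        boolean excluded = false;
        for (String pattern : patterns) {
            if (pattern.startsWith(INCLUDE_PREFIX)) {
                String body = pattern.substring(INCLUDE_PREFIX.length());
                included |= toRegex(body).matcher(relativePath).matches();
            } else if (pattern.startsWith(EXCLUDE_PREFIX)) {
                String body = pattern.substring(EXCLUDE_PREFIX.length());
                excluded |= toRegex(body).matcher(relativePath).matches();
            }
        }
        return included && !excluded;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]**/*.txt", "[-]**/draft*" };
        for (String path : new String[] { "doc.txt", "a/b/doc.txt", "a/draft1.txt", "a/doc.xml" }) {
            System.out.println(path + " accepted: " + accept(path, patterns));
        }
    }
}
```

In an actual pipeline the same strings would be passed as the values of the patterns parameter, with sourceLocation naming the base directory they are resolved against.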

NIF

Nif

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-nif-asl

The NLP Interchange Format (NIF) provides a way of representing NLP information using semantic web technology, specifically RDF and OWL.

NifReader

Implementation

org.dkpro.core.io.nif.NifReader

Description

Reader for the NLP Interchange Format (NIF). The file format (e.g. Turtle) is chosen automatically based on the name of the file(s) being read. Compressed files are supported.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 52. Capabilities
Media types	application/x.org.dkpro.nif+turtle
Outputs	POS DocumentMetaData NamedEntity Heading Lemma Paragraph Sentence Stem Token

NifWriter

Implementation

org.dkpro.core.io.nif.NifWriter

Description

Writer for the NLP Interchange Format (NIF).

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.ttl`. The file format will be chosen depending on the file suffix. Type: String — Default value: `.ttl`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 53. Capabilities
Media types	application/x.org.dkpro.nif+turtle
Inputs	POS DocumentMetaData NamedEntity Heading Lemma Paragraph Sentence Stem Token
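
To make the effect of escapeDocumentId concrete, the sketch below URL-encodes a document ID before it is used as a file name. It is a plain-Java illustration that assumes the escaping behaves like java.net.URLEncoder with UTF-8; the class EscapeDocumentIdSketch, the method toFileName and the sample ID are hypothetical and not part of the writer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EscapeDocumentIdSketch {
    // Builds a file name from a document ID, optionally URL-encoding the ID
    // so that characters such as ':' or '\' cannot break the target path.
    static String toFileName(String documentId, boolean escape, String extension) {
        try {
            String base = escape ? URLEncoder.encode(documentId, "UTF-8") : documentId;
            return base + extension;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toFileName("corpus:doc\\1", true, ".ttl"));
        System.out.println(toFileName("corpus:doc\\1", false, ".ttl"));
    }
}
```

Leaving the option enabled is the safer choice when document IDs come from URIs or database keys rather than from file names.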

PDF

Pdf

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.pdf-asl

PdfReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.pdf.PdfReader

Description

Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.

Parameters

endPage	The last page to be extracted from the PDF. Optional — Type: Integer — Default value: `-1`
headingType	The type used to annotate headings. Optional — Type: String — Default value: `<built-in>`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
paragraphType	The type used to annotate paragraphs. Optional — Type: String — Default value: `<built-in>`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
startPage	The first page to be extracted from the PDF. Optional — Type: Integer — Default value: `-1`
substitutionTableLocation	The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters. Optional — Type: String — Default value: `<built-in>`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 54. Capabilities
Media types	application/pdf
Outputs	DocumentMetaData Heading Paragraph

Penn Treebank Format

PennTreebankChunked

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

PennTreebankChunkedReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankChunkedReader

Description

Penn Treebank chunked format reader.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Write chunk annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 55. Capabilities
Media types	text/x.org.dkpro.ptb-chunked
Outputs	POS DocumentMetaData Sentence Token Chunk

PennTreebankCombined

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

Known corpora in this format

Floresta Sintá(c)tica (Bosque) - Portuguese

PennTreebankCombinedReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedReader

Description

Penn Treebank combined format reader.

Parameters

ConstituentMappingLocation	Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ConstituentTagSet	Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Optional — Type: Boolean — Default value: `true`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readPOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `true`
removeTraces	Optional — Type: Boolean — Default value: `true`
sourceEncoding	Character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
writeTracesToText	Optional — Type: Boolean — Default value: `false`

Table 56. Capabilities
Media types	text/x.org.dkpro.ptb-combined
Outputs	POS DocumentMetaData Sentence Token Constituent
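
The rationale behind internTags can be seen in a few lines of plain Java. The sketch below is not taken from the reader; the class InternTagsSketch and the method readTag are hypothetical. It shows that tags parsed independently from the input are distinct String objects unless they are interned, in which case equal tags share a single instance.

```java
class InternTagsSketch {
    // Simulates creating a tag string from parsed input characters.
    // Each call allocates a new String unless interning is requested.
    static String readTag(char[] buffer, boolean intern) {
        String tag = new String(buffer);
        return intern ? tag.intern() : tag;
    }

    public static void main(String[] args) {
        char[] raw = { 'N', 'P' };

        String a = readTag(raw, false);
        String b = readTag(raw, false);
        // Equal content, but two separate objects on the heap.
        System.out.println("without intern, same object: " + (a == b));

        String c = readTag(raw, true);
        String d = readTag(raw, true);
        // Equal content and one shared object.
        System.out.println("with intern, same object: " + (c == d));
    }
}
```

For a treebank with millions of tokens and only a few dozen distinct tags, sharing instances in this way keeps the number of tag strings on the heap close to the size of the tag set.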

PennTreebankCombinedWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedWriter

Description

Penn Treebank combined format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
emptyRootLabel	Type: Boolean — Default value: `false`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.mrg`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.mrg`
noRootLabel	Type: Boolean — Default value: `false`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 57. Capabilities
Media types	text/x.org.dkpro.ptb-combined
Inputs	POS DocumentMetaData Sentence Token Constituent

Reuters-21578

Reuters21578Sgml

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578SgmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578SgmlReader

Description

Read a Reuters-21578 corpus in SGML format.

Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 58. Capabilities
Media types	application/x.org.dkpro.reuters21578+sgml
Outputs	DocumentMetaData

Reuters21578Txt

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578TxtReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578TxtReader

Description

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

The #PARAM_SOURCE_LOCATION parameter should typically point to the file name pattern reut2-*.txt, preceded by the corpus root directory.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 59. Capabilities
Media types	text/x.org.dkpro.reuters21578
Outputs	DocumentMetaData

RTF

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.rtf-asl

RTFReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.rtf.RTFReader

Description

Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 60. Capabilities
Media types	application/rtf text/rtf
Outputs	DocumentMetaData

Solr

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.solr-asl

SolrWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.solr.SolrWriter

Description

A simple implementation of SolrWriter_ImplBase

Parameters

numThreads	The number of background threads used to empty the queue. Default: 1. Type: Integer — Default value: `1`
optimizeIndex	If set to true, the index is optimized once all documents are uploaded. Default is false. Type: Boolean — Default value: `false`
queueSize	The buffer size before the documents are sent to the server (default: 10000). Type: Integer — Default value: `10000`
solrIdField	The name of the id field in the Solr schema (default: "id"). Type: String — Default value: `id`
targetLocation	Solr server URL string in the form `<protocol>://<host>:<port>/<path>`, e.g. `http://localhost:8983/solr/collection1`. Type: String
textField	The name of the text field in the Solr schema (default: "text"). Type: String — Default value: `text`
update	Defines whether existing documents with the same ID are updated (true) or overwritten (false). Default: true (update). Type: Boolean — Default value: `true`
waitFlush	When committing to the index, i.e. when all documents are processed, block until index changes are flushed to disk. Default: true. Type: Boolean — Default value: `true`
waitSearcher	When committing to the index, i.e. when all documents are processed, block until a new searcher is opened and registered as the main query searcher, making the changes visible. Default: true. Type: Boolean — Default value: `true`

Table 61. Capabilities
Media types	none specified
Inputs	none specified
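
The example value given for targetLocation, http://localhost:8983/solr/collection1, can be split into its protocol, host, port and path components with the standard java.net.URI class. The sketch below is only a way to sanity-check a URL before handing it to the writer; the class SolrUrlSketch and the method describe are hypothetical and not part of SolrWriter.

```java
import java.net.URI;

class SolrUrlSketch {
    // Splits a server URL into the components that make up the target location.
    static String describe(String location) {
        URI uri = URI.create(location);
        return "protocol=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("http://localhost:8983/solr/collection1"));
    }
}
```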

TCF

Tcf

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tcf-asl

The TCF (Text Corpus Format) was created in the context of the CLARIN project. It is mainly used to exchange data between the different web-services that are part of the WebLicht platform.

TcfReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfReader

Description

Reader for the WebLicht TCF format. It reads all available annotation layers from the TCF file and converts them to CAS annotations. TCF data does not provide begin/end offsets for all of its annotations, but CAS annotations require them. Hence, offsets are calculated per token and stored in a map from token ID to token (CAS object), from which the offsets for other annotations are looked up later.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 62. Capabilities
Media types	text/tcf+xml
Outputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency
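
The offset bookkeeping mentioned in the TcfReader description can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the reader's code: it assumes that each token can be located in the document text by searching forward from the end of the previous token, and the class TokenOffsetSketch, the method computeOffsets and the token IDs are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TokenOffsetSketch {
    // Computes begin/end character offsets for each token and stores them
    // under the token ID, so that other layers can look them up later.
    static Map<String, int[]> computeOffsets(String text, String[] ids, String[] tokens) {
        Map<String, int[]> offsets = new LinkedHashMap<>();
        int cursor = 0;
        for (int i = 0; i < tokens.length; i++) {
            int begin = text.indexOf(tokens[i], cursor);
            if (begin < 0) {
                throw new IllegalArgumentException("Token not found in text: " + tokens[i]);
            }
            int end = begin + tokens[i].length();
            offsets.put(ids[i], new int[] { begin, end });
            cursor = end; // continue searching after the current token
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        String[] ids = { "t1", "t2", "t3", "t4" };
        String[] tokens = { "Hello", ",", "world", "!" };
        for (Map.Entry<String, int[]> e : computeOffsets(text, ids, tokens).entrySet()) {
            System.out.println(e.getKey() + " [" + e.getValue()[0] + ", " + e.getValue()[1] + ")");
        }
    }
}
```

An annotation on another layer that refers to tokens by ID, such as a named entity spanning t1 to t3, could then take its begin offset from the first referenced token and its end offset from the last one.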

TcfWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfWriter

Description

Writer for the WebLicht TCF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.tcf`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.tcf`
merge	Merge with source TCF file if one is available. Default: true Type: Boolean — Default value: `true`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
preserveIfEmpty	If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF. Default: false Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 63. Capabilities
Media types	text/tcf+xml
Inputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency

TEI

Tei

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tei-asl

TeiReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tei.TeiReader

Description

Reader for TEI XML.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
omitIgnorableWhitespace	Do not write ignorable whitespace from the XML file to the CAS. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readConstituent	Write constituent annotations to the CAS. Type: Boolean — Default value: `true`
readLemma	Write lemma annotations to the CAS. Type: Boolean — Default value: `true`
readNamedEntity	Write named entity annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readParagraph	Write paragraph annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
useFilenameId	When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case. Type: Boolean — Default value: `false`
useXmlId	Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique. Type: Boolean — Default value: `false`
utterancesAsSentences	Interpret utterances "u" as sentences "s". (EXPERIMENTAL) Type: Boolean — Default value: `false`

Table 64. Capabilities
Media types	application/tei+xml
Outputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent

TeiWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tei.TeiWriter

Description

UIMA CAS consumer writing the CAS document text in TEI format.

Parameters

cTextPattern	A token matching this pattern is rendered as a TEI "c" element instead of a "w" element. Type: String — Default value: [,.:;()]|(``)|('')|(--)
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.xml`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xml`
indent	Indent the XML. Type: Boolean — Default value: `false`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeConstituent	Write constituent annotations from the CAS to the TEI output. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens). Type: Boolean — Default value: `false`
writeNamedEntity	Write named entity annotations from the CAS to the TEI output. Overlapping named entities are not supported. Type: Boolean — Default value: `true`

Table 65. Capabilities
Media types	application/tei+xml
Inputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent
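
How cTextPattern separates "c" from "w" elements can be tried out with java.util.regex alone. The sketch below assumes that the default value is the Java regular expression [,.:;()]|(``)|('')|(--) and that the whole token text must match the pattern; both points are assumptions about the writer's behaviour, and the class CTextPatternSketch and the method elementFor are hypothetical.

```java
import java.util.regex.Pattern;

class CTextPatternSketch {
    // Assumed default value of cTextPattern, written as a Java regular expression.
    static final Pattern C_TEXT = Pattern.compile("[,.:;()]|(``)|('')|(--)");

    // Returns the name of the TEI element a token would be rendered as.
    static String elementFor(String token) {
        return C_TEXT.matcher(token).matches() ? "c" : "w";
    }

    public static void main(String[] args) {
        for (String token : new String[] { "Hello", ",", "``", "world", "''", "--", "..." }) {
            System.out.println(elementFor(token) + "\t" + token);
        }
    }
}
```

Under these assumptions a multi-character token such as an ellipsis is not covered by the default pattern, so a custom pattern would be needed to render it as a "c" element.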

Text

String

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.text-asl

StringReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.StringReader

Description

Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().

Parameters

collectionId	The collection ID to set in the DocumentMetaData. Type: String — Default value: `COLLECTION_ID`
documentBaseUri	The document base URI to set in the DocumentMetaData. Optional — Type: String
documentId	The document ID to set in the DocumentMetaData. Type: String — Default value: `DOCUMENT_ID`
documentText	The document text. Type: String
documentUri	The document URI to set in the DocumentMetaData. Type: String — Default value: `STRING`
language	Set this as the language of the produced documents. Type: String

Table 66. Capabilities
Media types	text/plain
Outputs	DocumentMetaData

Text

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.text-asl

TextReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TextReader

Description

UIMA collection reader for plain text files.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (#INCLUDE_PREFIX) if it is an include pattern and with `[-]` (#EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 67. Capabilities
Media types	text/plain
Outputs	DocumentMetaData
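
The include/exclude semantics of the patterns parameter can be illustrated with a small standard-library sketch. It is not the DKPro Core implementation: the class name PatternSketch, the regex translation, and the rule "selected if at least one include matches and no exclude matches" are assumptions that follow the usual Ant convention. Only the wildcards `**/` and `*` are handled.

```java
import java.util.regex.Pattern;

/** Stdlib-only sketch of Ant-like include/exclude filtering (hypothetical helper, not DKPro Core code). */
final class PatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    /** Translates an Ant-like pattern into a regular expression over '/'-separated relative paths. */
    static Pattern toRegex(String antPattern) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < antPattern.length()) {
            if (antPattern.startsWith("**/", i)) {
                sb.append("(?:.*/)?"); // any number of sub-directories, including none
                i += 3;
            } else if (antPattern.charAt(i) == '*') {
                sb.append("[^/]*"); // part of a single file or directory name
                i++;
            } else {
                sb.append(Pattern.quote(String.valueOf(antPattern.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Assumed rule: at least one include pattern matches and no exclude pattern matches. */
    static boolean isSelected(String relativePath, String... patterns) {
        boolean included = false;
        for (String p : patterns) {
            if (p.startsWith(EXCLUDE_PREFIX)) {
                if (toRegex(p.substring(EXCLUDE_PREFIX.length())).matcher(relativePath).matches()) {
                    return false;
                }
            } else {
                // A pattern without prefix is treated as an include pattern in this sketch.
                String body = p.startsWith(INCLUDE_PREFIX) ? p.substring(INCLUDE_PREFIX.length()) : p;
                included |= toRegex(body).matcher(relativePath).matches();
            }
        }
        return included;
    }
}
```

For example, the patterns `[+]**/*.txt` and `[-]**/draft/*` together select every .txt file below the source location except those that sit directly in a directory named draft.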

TextWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TextWriter

Description

UIMA CAS consumer writing the CAS document text as plain text file.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.txt`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.txt`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 68. Capabilities
Media types	text/plain
Inputs	DocumentMetaData
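
The effect of escapeDocumentId and filenameExtension on the output file name can be sketched as follows. This is an illustration under assumptions, not the writer's actual code: the class DocumentIdEscapingSketch is hypothetical, and the use of java.net.URLEncoder with UTF-8 is assumed from the phrase "URL-encode" in the parameter description.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper showing how a document ID could be turned into a safe file name. */
final class DocumentIdEscapingSketch {
    static String toFileName(String documentId, String extension, boolean escapeDocumentId) {
        String name = documentId;
        if (escapeDocumentId) {
            try {
                // Characters such as ':' and '\' become percent-escapes.
                name = URLEncoder.encode(documentId, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always available", e);
            }
        }
        // An empty extension leaves the name without a suffix.
        return name + extension;
    }
}
```

With escaping enabled, an ID such as `wiki:Main\Page` yields a file name without characters that are illegal on common file systems.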

TokenizedText

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.text-asl

TokenizedTextWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TokenizedTextWriter

Description

This class writes a set of pre-processed documents into a large text file containing one sentence per line, with tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written as specified by #PARAM_FEATURE_PATH.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
coveringType	In the output file, each unit of the covering type is written into a separate line. The default (set in #DEFAULT_COVERING_TYPE) is sentences, so that each sentence is written to a line. If no line breaks within a document are desired, set this value to null. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
extension	Set the output file extension. Default: .txt. Type: String — Default value: `.txt`
featurePath	The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value for lemmas. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token (i.e. token texts). Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
numberRegex	Type: String — Default value: ``
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stopwordsFile	All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored. Type: String — Default value: ``
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Encoding for the target file. Default is UTF-8. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 69. Capabilities
Media types	text/plain
Inputs	DocumentMetaData
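
The output layout described above (one covering unit per line, tokens separated by whitespace, stop words replaced by STOP) can be sketched without UIMA. The class TokenizedTextSketch and its methods are hypothetical, and sentences are modelled as plain lists of token strings rather than CAS annotations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Stdlib-only sketch of the tokenized text layout (hypothetical, not the writer's code). */
final class TokenizedTextSketch {
    /** Reads stop words: one token per line, empty lines and lines starting with '#' are ignored. */
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> result = new HashSet<>();
        for (String line : lines) {
            String token = line.trim();
            if (token.isEmpty() || token.startsWith("#")) {
                continue;
            }
            result.add(token.toLowerCase(Locale.ROOT)); // casing is ignored
        }
        return result;
    }

    /** Writes one sentence per line, tokens separated by a single space. */
    static String render(List<List<String>> sentences, Set<String> stopwords) {
        StringBuilder out = new StringBuilder();
        for (List<String> sentence : sentences) {
            List<String> cells = new ArrayList<>();
            for (String token : sentence) {
                boolean isStopword = stopwords.contains(token.toLowerCase(Locale.ROOT));
                cells.add(isStopword ? "STOP" : token);
            }
            out.append(String.join(" ", cells)).append('\n');
        }
        return out.toString();
    }
}
```

Rendering the sentences "The cat sleeps ." and "A dog barks ." with the stop words "the" and "a" gives two lines, each starting with STOP.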

TGrep2

TGrep

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl

TGrep and TGrep2 are tools to search over syntactic parse trees represented as bracketed structures. This module supports TGrep2 in particular and makes it convenient to generate TGrep2 indexes, which can then be searched. Searching itself is not supported by this module.

See also

TGrep2

TGrepWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tgrep.TGrepWriter

Description

TGrep2 corpus file writer. Requires PennTrees to be annotated before.

Parameters

compression	Method to compress the tgrep file (only used if PARAM_WRITE_T2C is true). Only NONE, GZIP and BZIP2 are supported. Default: CompressionMethod#NONE Type: String — Default value: `NONE`
dropMalformedTrees	If true, silently drops malformed Penn Trees instead of throwing an exception. Default: false Type: Boolean — Default value: `false`
targetLocation	Path to which the output is written. Type: String
writeComments	Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Default: true Type: Boolean — Default value: `true`
writeT2c	Set this parameter to true if you want to encode directly into the tgrep2 binary format. Default: true Type: Boolean — Default value: `true`

Table 70. Capabilities
Media types	application/x.org.dkpro.tgrep2
Inputs	PennTree

TIGER-XML

TigerXml

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tiger-asl

The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format has later been extended to also support semantic frame annotations.

Known corpora in this format

Floresta Sintá(c)tica (Bosque) - Portuguese
Semeval-2 Task 10 - (extended format)
Składnica frazowa - Polish
Swedish Treebank - Swedish
Talbanken05 - Swedish
TIGER - German

TigerXmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlReader

Description

UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
ignoreIllegalSentences	If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences. Default: false Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readPennTree	Write Penn Treebank bracketed structure information. Mind this may not work with all tagsets, in particular not with such that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not using the mapped tagset! Default: false Type: Boolean — Default value: `false`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 71. Capabilities
Media types	application/x.org.dkpro.semeval-2010+xml application/x.org.dkpro.tiger+xml
Outputs	POS DocumentMetaData Lemma Sentence Token SemArg SemPred Constituent

TigerXmlWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlWriter

Description

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.xml`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 72. Capabilities
Media types	application/x.org.dkpro.tiger+xml
Inputs	POS DocumentMetaData Lemma Sentence Token Constituent

Tika

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tika-asl

TikaReader

Implementation

org.dkpro.core.io.tika.TikaReader

Description

Reader for many file formats based on Apache Tika.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 73. Capabilities
Media types	none specified
Outputs	none specified

TUEBADZ

TuebaDZ

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tuebadz-asl

The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).

Sentences have a header line and are followed by a blank new line.

Table 74. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
POSTAG	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language.
CHUNK	Chunk	chunk (BIO encoded) - For named entities, it can also include its type, e.g., B-NX=ORG

Example

%% sent no. 1
Veruntreute  VVFIN   B-VXFIN
die          ART     B-NX=ORG
AWO          NN      I-NX=ORG
Spendengeld  NN      B-NX
?   $.  O
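
The column layout and the example above can be read with a few lines of plain Java. The sketch below is not the TuebaDZReader. The class TuebaDzSketch is hypothetical, and it assumes that a sentence starts at a line beginning with `%%`, ends at a blank line, and that columns are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Stdlib-only sketch of parsing the three-column chunking format (hypothetical helper). */
final class TuebaDzSketch {
    static final String EXAMPLE = "%% sent no. 1\n"
            + "Veruntreute  VVFIN   B-VXFIN\n"
            + "die          ART     B-NX=ORG\n"
            + "AWO          NN      I-NX=ORG\n"
            + "Spendengeld  NN      B-NX\n"
            + "?   $.  O\n"
            + "\n";

    static final class Row {
        final String form, pos, bio, chunk, neType;

        Row(String form, String pos, String chunkColumn) {
            this.form = form;
            this.pos = pos;
            if (chunkColumn.equals("O")) {
                // Token outside of any chunk.
                bio = "O";
                chunk = null;
                neType = null;
            } else {
                // Layout assumed here: <B|I>-<chunk>[=<named entity type>]
                int dash = chunkColumn.indexOf('-');
                int eq = chunkColumn.indexOf('=');
                bio = chunkColumn.substring(0, dash);
                chunk = eq < 0 ? chunkColumn.substring(dash + 1) : chunkColumn.substring(dash + 1, eq);
                neType = eq < 0 ? null : chunkColumn.substring(eq + 1);
            }
        }
    }

    static List<List<Row>> parse(String text) {
        List<List<Row>> sentences = new ArrayList<>();
        List<Row> current = null;
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("%%")) {
                current = new ArrayList<>(); // header line opens a new sentence
                sentences.add(current);
            } else if (trimmed.isEmpty()) {
                current = null; // blank line closes the sentence
            } else if (current != null) {
                String[] cols = trimmed.split("\\s+");
                current.add(new Row(cols[0], cols[1], cols[2]));
            }
        }
        return sentences;
    }
}
```

Parsing EXAMPLE gives one sentence with five rows; the rows for "die" and "AWO" carry the chunk NX together with the named entity type ORG.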

Known corpora in this format

TüBa-D/Z - German

TuebaDZReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tuebadz.TuebaDZReader

Description

Reads the Tüba-D/Z chunking format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
internTags	Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: `true`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Default: true Type: Boolean — Default value: `true`
readNamedEntity	Read named entity information. Default: false Type: Boolean — Default value: `false`
readPOS	Read part-of-speech information. Default: true Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 75. Capabilities
Media types	application/x.org.dkpro.tuebadz-chunk
Outputs	DocumentMetaData Sentence Token Chunk

TüPP-D/Z

Tuepp

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.tuepp-asl

TüPP-D/Z is a collection of articles from the German newspaper taz (die tageszeitung) annotated and encoded in an XML format.

Known corpora in this format

TüPP-D/Z - German

TueppReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tuepp.TueppReader

Description

UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.

Only the part-of-speech with the best rank (rank 1) is read. If there is a tie between multiple tags, the first one from the XML file is read.
Only the first lemma (baseform) from the XML file is read.
Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
Article boundaries are not read.
Paragraph boundaries are not read.
Lemma information is read, but morphological information is not read.
Chunk, field, and clause information is not read.
Meta data headers are not read.
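
The part-of-speech selection rule from the list above (best rank wins, ties go to the first candidate in the file) can be sketched as follows. The class BestRankSketch and its Candidate type are hypothetical stand-ins for the tag candidates of one token. The sketch assumes that a lower rank number means a better rank and does not model the actual XML structure.

```java
import java.util.List;

/** Stdlib-only sketch of the "best rank, first on ties" rule (hypothetical helper). */
final class BestRankSketch {
    static final class Candidate {
        final String tag;
        final int rank;

        Candidate(String tag, int rank) {
            this.tag = tag;
            this.rank = rank;
        }
    }

    /** Returns the tag of the first candidate, in document order, with the lowest rank number. */
    static String pickTag(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            // Strict '<' keeps the earlier candidate when two ranks are equal.
            if (best == null || c.rank < best.rank) {
                best = c;
            }
        }
        return best == null ? null : best.tag;
    }
}
```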

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 76. Capabilities
Media types	application/x.org.dkpro.tuepp+xml
Outputs	POS DocumentMetaData Lemma Sentence Token

UIMA Binary CAS

BinaryCas

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

The CAS is the native data model used by UIMA. There are various ways of saving CAS data, using XMI, XCAS, or binary formats. This module supports the binary formats.

See also

Compressed Binary CASes

BinaryCasReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasReader

Description

UIMA Binary CAS formats reader.

Parameters

addDocumentMetadata	Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: `true`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mergeTypeSystem	Determines whether the type system from a currently read file should be merged with the current type system. Type: Boolean — Default value: `false`
overrideDocumentMetadata	Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
typeSystemLocation	The location from which to obtain the type system when the CAS is stored in form 0. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 77. Capabilities
Media types	application/x.org.dkpro.uima+binary
Outputs	none specified

BinaryCasWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasWriter

Description

Write CAS in one of the UIMA binary formats.

All the supported formats except 6+ can also be loaded and saved via the UIMA CasIOUtils.

Supported formats
Format	Description	Type system on load	CAS Addresses preserved
`SERIALIZED` or `S`	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS.	must be the same	yes
`SERIALIZED_TSI` or `S+`	CAS structures are dumped to disc as they are using Java serialization as in form 0, but now using the CASCompleteSerializer which includes CAS metadata like type system and index repositories.	is reinitialized	yes
`BINARY` or 0	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS.	must be the same	yes
`BINARY_TSI` or 0	The same as `BINARY`, except that the type system and index configuration are also stored in the file. However, lenient loading or reinitializing the CAS with this information is presently not supported.	must be the same	yes
`COMPRESSED` or `4`	UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0.	must be the same	yes
`COMPRESSED_FILTERED` or `6`	UIMA binary serialization as format 4, but saving only reachable feature structures.	must be the same	no
6+	This is a legacy format specific to DKPro Core. Since UIMA 2.9.0, `COMPRESSED_FILTERED_TSI` is supported and should be used instead of this format. UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
`COMPRESSED_FILTERED_TS`	Same as `COMPRESSED_FILTERED`, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
`COMPRESSED_FILTERED_TSI`	Default. UIMA binary serialization as format 6, but also contains the type system definition and index definitions. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
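
When choosing a value for the format parameter, the two rightmost columns of the table above are usually what matters. The following sketch merely restates those columns as data so that they can be queried. The class BinaryFormatSketch is hypothetical and not part of DKPro Core or UIMA, and the legacy format 6+ and the short aliases are left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

/** Restates the "Supported formats" table as data (hypothetical helper, long format names only). */
final class BinaryFormatSketch {
    enum TypeSystemOnLoad { MUST_BE_SAME, REINITIALIZED, LENIENT }

    enum Format {
        SERIALIZED(TypeSystemOnLoad.MUST_BE_SAME, true),
        SERIALIZED_TSI(TypeSystemOnLoad.REINITIALIZED, true),
        BINARY(TypeSystemOnLoad.MUST_BE_SAME, true),
        BINARY_TSI(TypeSystemOnLoad.MUST_BE_SAME, true),
        COMPRESSED(TypeSystemOnLoad.MUST_BE_SAME, true),
        COMPRESSED_FILTERED(TypeSystemOnLoad.MUST_BE_SAME, false),
        COMPRESSED_FILTERED_TS(TypeSystemOnLoad.LENIENT, false),
        COMPRESSED_FILTERED_TSI(TypeSystemOnLoad.LENIENT, false); // default of the writer

        final TypeSystemOnLoad onLoad;
        final boolean addressesPreserved;

        Format(TypeSystemOnLoad onLoad, boolean addressesPreserved) {
            this.onLoad = onLoad;
            this.addressesPreserved = addressesPreserved;
        }
    }

    /** Lists the formats that satisfy the given requirements. */
    static List<Format> candidates(boolean needLenientLoading, boolean needPreservedAddresses) {
        List<Format> result = new ArrayList<>();
        for (Format f : Format.values()) {
            boolean lenientOk = !needLenientLoading || f.onLoad == TypeSystemOnLoad.LENIENT;
            boolean addressOk = !needPreservedAddresses || f.addressesPreserved;
            if (lenientOk && addressOk) {
                result.add(f);
            }
        }
        return result;
    }
}
```

According to the table, no format offers both lenient loading and preserved CAS addresses, so `candidates(true, true)` returns an empty list.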

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	The file extension. If this is set to AUTO, then the extension will be chosen based on the default extension specified by the UIMA SerialFormat class. However, this only works when using the new long format names (e.g. `COMPRESSED_FILTERED_TSI`). When using the old short names (e.g. `6`), the default extension .bin is used. Type: String — Default value: `AUTO`
format	Type: String — Default value: `COMPRESSED_FILTERED_TSI`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemLocation	Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz"). If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader can currently not read such files. Use this only if you really know what you are doing. This parameter has no effect if formats S+ or 6+ are used as the type system information is embedded in each individual file. Otherwise, it is recommended that this parameter be set unless some other mechanism is used to initialize the CAS with the same type system and index repository during reading that was used during writing. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 78. Capabilities
Media types	application/x.org.dkpro.uima+binary
Inputs	DocumentMetaData

SerializedCas

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

SerializedCasReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasReader

Description

Reads CAS data that was saved using Java serialization, e.g. by the SerializedCasWriter.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
typeSystemLocation	The file from which to obtain the type system if it is not embedded in the serialized CAS. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 79. Capabilities
Media types	none specified
Outputs	none specified

SerializedCasWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter

Description

Writes the CAS to disk using Java serialization.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Type: String — Default value: `.ser`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemLocation	Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz"). If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader can currently not read such files. Use this only if you really know what you are doing. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 80. Capabilities
Media types	none specified
Inputs	DocumentMetaData

UIMA JSON

Json

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.json-asl

JsonWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.json.JsonWriter

Description

UIMA JSON format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
jsonContextFormat	Type: String — Default value: `omitExpandedTypeNames`
omitDefaultValues	Type: Boolean — Default value: `true`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
prettyPrint	Type: Boolean — Default value: `true`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemFile	Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file. If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
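
To make the effect of escapeDocumentId concrete, the following sketch URL-encodes a document ID with the JDK's URLEncoder. Whether the writer uses exactly this call is an assumption; the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of DKPro Core.
final class DocumentIdEscapeSketch {

    /** URL-encodes a document ID so that it can be used as a file name. */
    static String escape(String documentId) {
        try {
            return URLEncoder.encode(documentId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Under this sketch, an ID such as `C:\corpus\doc1` becomes `C%3A%5Ccorpus%5Cdoc1`, while an ID consisting only of letters, digits, `.`, `-` and `_` is left unchanged.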

Table 81. Capabilities
Media types	application/x.org.dkpro.uima+json
Inputs	DocumentMetaData

UIMA XMI

Xmi

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.xmi-asl

One of the official formats supported by UIMA is the XMI format. It is an XML-based format and therefore cannot represent the few characters that are invalid in XML. Apart from that, it is able to capture all the information contained in the CAS. The XMI format is the de-facto standard for exchanging data in the UIMA world, and most UIMA-related tools support it.

The XMI format does not include type system information. It is therefore recommended to always configure the XmiWriter component to also write out the type system to a file.

If you wish to view annotated documents using the UIMA CAS Editor in Eclipse, you can, for example, set up your XmiWriter in the following way to write out XMIs and a type system file:

AnalysisEngineDescription xmiWriter =
  AnalysisEngineFactory.createEngineDescription(
      XmiWriter.class,
      XmiWriter.PARAM_TARGET_LOCATION, ".",
      XmiWriter.PARAM_TYPE_SYSTEM_FILE, "typesystem.xml");

XmiReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader

Description

Reader for UIMA XMI files.

Parameters

addDocumentMetadata	Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: `true`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
lenient	In lenient mode, unknown types are ignored and do not cause an exception to be thrown. Type: Boolean — Default value: `false`
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
overrideDocumentMetadata	Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
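
The include/exclude semantics of the patterns parameter can be sketched as follows. This is a simplified approximation, not the matcher the reader actually uses: it assumes the usual Ant conventions (`**/` spans any number of directories, `*` matches part of a single name), treats patterns without a prefix as includes, and uses hypothetical class and method names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of DKPro Core.
final class PatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    /** Translates a simplified Ant-like glob into a regular expression. */
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            if (glob.startsWith("**/", i)) {
                sb.append("(?:.*/)?");   // any number of sub-directories, including none
                i += 3;
            } else if (glob.charAt(i) == '*') {
                sb.append("[^/]*");      // part of a single file or directory name
                i++;
            } else {
                sb.append(Pattern.quote(String.valueOf(glob.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** A path is accepted if it matches at least one include and no exclude pattern. */
    static boolean accepts(String[] patterns, String path) {
        boolean included = false;
        for (String p : patterns) {
            if (p.startsWith(EXCLUDE_PREFIX)) {
                String glob = p.substring(EXCLUDE_PREFIX.length());
                if (toRegex(glob).matcher(path).matches()) {
                    return false;
                }
            } else {
                String glob = p.startsWith(INCLUDE_PREFIX)
                        ? p.substring(INCLUDE_PREFIX.length()) : p;
                if (toRegex(glob).matcher(path).matches()) {
                    included = true;
                }
            }
        }
        return included;
    }
}
```

For example, the pattern set `[+]**/*.xmi`, `[-]**/draft/*` accepts `corpus/a/doc.xmi` but rejects `corpus/draft/doc.xmi` and `corpus/doc.txt`.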

Table 82. Capabilities
Media types	application/vnd.xmi+xml application/x.org.dkpro.uima+xmi
Outputs	DocumentMetaData

XmiWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter

Description

UIMA XMI format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.xmi`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xmi`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
prettyPrint	Type: Boolean — Default value: `true`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemFile	Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file. If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
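
How stripExtension and filenameExtension interact can be illustrated with a minimal sketch. It assumes that the output name is the source-relative path with the configured suffix appended, and that stripExtension first removes the last extension; the class and method names are hypothetical and the real writer may handle edge cases differently.

```java
// Hypothetical helper, not part of DKPro Core.
final class TargetNameSketch {

    /** Derives an output name from a relative path, an optional strip step, and a suffix. */
    static String targetName(String relativePath, boolean stripExtension, String filenameExtension) {
        String base = relativePath;
        if (stripExtension) {
            int slash = base.lastIndexOf('/');
            int dot = base.lastIndexOf('.');
            // Only strip a dot that belongs to the file name and is not its first character.
            if (dot > slash + 1) {
                base = base.substring(0, dot);
            }
        }
        return base + filenameExtension;
    }
}
```

Under these assumptions, `news/doc1.txt` is written as `news/doc1.txt.xmi` by default and as `news/doc1.xmi` when stripExtension is enabled; an empty filenameExtension adds no suffix at all.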

Table 83. Capabilities
Media types	application/vnd.xmi+xml application/x.org.dkpro.uima+xmi
Inputs	DocumentMetaData

Web1T n-grams

Web1T

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.web1t-asl

The Web1T n-gram corpus is a huge collection of n-grams collected from the internet. The jweb1t library provides efficient access to this corpus. This module supports the file format used by the Web1T n-gram corpus and makes it convenient to create jweb1t indexes.

Web1TWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.web1t.Web1TWriter

Description

Web1T n-gram index format writer.

Parameters

contextType	The type being used for segments. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence`
createIndexes	Create the indexes that jWeb1T needs to operate. (default: true) Optional — Type: Boolean — Default value: `true`
inputTypes	Types to generate n-grams from. Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams Type: String[]
lowercase	Create a lower case index. Optional — Type: Boolean — Default value: `false`
maxNgramLength	Maximum n-gram length. Default: 3 Optional — Type: Integer — Default value: `3`
minFreq	Specifies the minimum frequency an n-gram must have to be written to the final index. The value is inclusive and defaults to 1, so by default all n-grams with a frequency of at least 1 are written. Optional — Type: Integer — Default value: `1`
minNgramLength	Minimum n-gram length. Default: 1 Optional — Type: Integer — Default value: `1`
splitFileTreshold	The input file(s) is/are split into smaller files for quick access. A separate file is created if the first two letters of a word (or the first letter, if the word is only one character long) account for at least x% of all starting letters in the input file(s). The default threshold is 1.0%. Words whose starting characters do not meet the threshold are written together into a separate file for miscellaneous words. A high threshold leads to only a few large files and most likely a very large miscellaneous file. A low threshold results in many small files. Use zero or a negative value to write everything to one file. Optional — Type: Float — Default value: `1.0`
targetEncoding	Character encoding of the output data. Optional — Type: String — Default value: `UTF-8`
targetLocation	Location to which the output is written. Type: String
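
The interplay of minNgramLength, maxNgramLength, minFreq and lowercase can be sketched with plain Java collections. This is only an illustration of the counting idea, not the writer's implementation: the class name is hypothetical, tokens are joined with a single space, and n-grams are assumed not to cross the boundaries of the context type (here: sentences).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of DKPro Core.
final class NgramCountSketch {

    /** Counts n-grams per sentence and keeps those occurring at least minFreq times. */
    static Map<String, Integer> count(List<List<String>> sentences, int minN, int maxN,
            int minFreq, boolean lowercase) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> tokens : sentences) {
            for (int n = minN; n <= maxN; n++) {
                for (int start = 0; start + n <= tokens.size(); start++) {
                    String ngram = String.join(" ", tokens.subList(start, start + n));
                    if (lowercase) {
                        ngram = ngram.toLowerCase();
                    }
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        // minFreq is inclusive: drop only n-grams that occur fewer than minFreq times.
        counts.values().removeIf(c -> c < minFreq);
        return counts;
    }
}
```

For the single sentence `a b a b` with lengths 1 to 2 and minFreq 2, the sketch keeps `a`, `b` and `a b` (each seen twice) and drops `b a` (seen once).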

Table 84. Capabilities
Media types	text/x.org.dkpro.ngram
Inputs	Sentence
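
The splitting rule behind the splitFileTreshold parameter of the Web1TWriter can be sketched as a simple share computation. The code below is an illustration under stated assumptions, not the actual index writer: the class name is hypothetical, and the share is computed over a flat list of words.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of DKPro Core.
final class SplitThresholdSketch {

    /** First two characters of a word, or the whole word if it is shorter. */
    static String prefixOf(String word) {
        return word.length() < 2 ? word : word.substring(0, 2);
    }

    /**
     * Returns the prefixes that would get their own file. Words with any other
     * prefix would go to a shared file for miscellaneous words. An empty result
     * means that everything ends up in one file.
     */
    static Set<String> ownFilePrefixes(List<String> words, double thresholdPercent) {
        Set<String> result = new TreeSet<>();
        if (thresholdPercent <= 0 || words.isEmpty()) {
            return result;
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(prefixOf(w), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double share = 100.0 * e.getValue() / words.size();
            if (share >= thresholdPercent) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A high threshold leaves few prefixes with their own file and pushes most words into the miscellaneous file, which matches the trade-off described for the parameter.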

Wikipedia via Bliki Engine

BlikiWikipedia

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.bliki-asl

Access the online Wikipedia and extract its contents using the Bliki engine.

See also

Java Wikipedia API (Bliki engine)

BlikiWikipediaReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bliki.BlikiWikipediaReader

Description

Bliki-based Wikipedia reader.

Parameters

language	The language of the wiki installation. Type: String
outputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
pageTitles	Which page titles should be retrieved. Type: String[]
sourceLocation	Wiki API URL (e.g. http://en.wikipedia.org/w/api.php for the English Wikipedia). Type: String

Table 85. Capabilities
Media types	none specified
Outputs	DocumentMetaData

Wikipedia via JWPL

WikipediaArticle

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleReader

Description

Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String
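
The "line-separated list" expected by PageIdsFromFile and PageTitleFromFile is simply one entry per line. The sketch below parses such content from a string; trimming whitespace and skipping blank lines are assumptions made for the illustration, and the class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class PageIdListSketch {

    /** Splits file content into one entry per non-empty line. */
    static List<String> parse(String fileContent) {
        List<String> ids = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(fileContent))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }
}
```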

Table 86. Capabilities
Media types	none specified
Outputs	none specified

WikipediaArticleInfo

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleInfoReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleInfoReader

Description

Reads all general article infos without retrieving the whole Page objects.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 87. Capabilities
Media types	none specified
Outputs	DocumentMetaData ArticleInfo

WikipediaDiscussion

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaDiscussionReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaDiscussionReader

Description

Reads all discussion pages.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 88. Capabilities
Media types	none specified
Outputs	DBConfig

WikipediaLink

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaLinkReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaLinkReader

Description

Read links from Wikipedia.

Parameters

AllowedLinkTypes	Which types of links are allowed? Type: String[]
CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 89. Capabilities
Media types	none specified
Outputs	DBConfig WikipediaLink

WikipediaPage

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaPageReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaPageReader

Description

Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are not considered.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 90. Capabilities
Media types	none specified
Outputs	DBConfig

WikipediaQuery

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaQueryReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaQueryReader

Description

Reads all article pages that match a query created by the numerous parameters of this class.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
MaxCategories	Maximum number of categories. Articles with a higher number of categories will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxInlinks	Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxOutlinks	Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxRedirects	Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxTokens	Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinCategories	Minimum number of categories. Articles with a lower number of categories will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinInlinks	Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinOutlinks	Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinRedirects	Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinTokens	Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query. Optional — Type: Integer — Default value: `-1`
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
TitlePattern	SQL-style title pattern. Only articles that match the pattern will be returned by the query. Optional — Type: String — Default value: ``
User	The username of the database account. Type: String
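
The Min*/Max* bounds and the SQL-style TitlePattern can be read as a per-article predicate. The sketch below is an in-memory illustration only: the reader presumably evaluates these constraints in the database, the interpretation of `-1` as "not set" is an assumption based on the default values, and the class and method names are hypothetical. Case sensitivity of SQL patterns depends on the database and is not modelled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of DKPro Core.
final class QueryFilterSketch {
    /** Assumed meaning of the default value -1: the bound is not applied. */
    static final int UNSET = -1;

    /** True if value respects the inclusive lower and upper bounds that are set. */
    static boolean inRange(int value, int min, int max) {
        return (min == UNSET || value >= min) && (max == UNSET || value <= max);
    }

    /** Translates an SQL LIKE pattern: '%' matches any sequence, '_' any single character. */
    static Pattern likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                sb.append(".*");
            } else if (c == '_') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** An empty pattern (the default) accepts every title. */
    static boolean titleMatches(String title, String titlePattern) {
        return titlePattern.isEmpty() || likeToRegex(titlePattern).matcher(title).matches();
    }
}
```

For instance, an article with 5 inlinks passes `MinInlinks = 3`, `MaxInlinks = 10`, and the title `List of rivers` matches the pattern `List of %`.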

Table 91. Capabilities
Media types	none specified
Outputs	none specified

WikipediaRevision

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionReader

Description

Reads Wikipedia page revisions.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
RevisionIdFromArray	Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[]
RevisionIdsFromFile	Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String
User	The username of the database account. Type: String

Table 92. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig WikipediaRevision

WikipediaRevisionPair

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionPairReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionPairReader

Description

Reads pairs of adjacent revisions of all articles.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
MaxChange	Restrict revision pairs to cases where the lengths of the two revisions do not differ by more than this value (counted in characters). Type: Integer — Default value: `10000`
MinChange	Restrict revision pairs to cases where the lengths of the two revisions differ by more than this value (counted in characters). Type: Integer — Default value: `0`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
RevisionIdFromArray	Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[]
RevisionIdsFromFile	Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String
SkipFirstNPairs	The number of revision pairs that should be skipped in the beginning. Optional — Type: Integer
User	The username of the database account. Type: String
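
The MinChange and MaxChange restrictions can be sketched as a filter over adjacent revision texts. The boundary handling below follows the wording of the parameter descriptions (difference greater than MinChange, not greater than MaxChange); the real reader may treat the boundaries differently, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class RevisionPairSketch {

    /** Compares the character lengths of two revisions against the configured bounds. */
    static boolean withinChange(String older, String newer, int minChange, int maxChange) {
        int diff = Math.abs(newer.length() - older.length());
        return diff > minChange && diff <= maxChange;
    }

    /** Builds pairs of adjacent revisions and keeps those within the change bounds. */
    static List<String[]> adjacentPairs(List<String> revisions, int minChange, int maxChange) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < revisions.size(); i++) {
            String older = revisions.get(i);
            String newer = revisions.get(i + 1);
            if (withinChange(older, newer, minChange, maxChange)) {
                pairs.add(new String[] { older, newer });
            }
        }
        return pairs;
    }
}
```

With MinChange 0 and MaxChange 10, a pair whose lengths differ by one character is kept, while a pair of equal length and a pair differing by 21 characters are both skipped.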

Table 93. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig

WikipediaTemplateFilteredArticle

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaTemplateFilteredArticleReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaTemplateFilteredArticleReader

Description

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

It is possible to define just a whitelist OR just a blacklist. If both a whitelist and a blacklist are provided, the articles chosen are those that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").

This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.

NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
DoubleCheckAssociatedPages	If this option is set, discussion pages are rejected that are associated with a blacklisted article. Analogously, articles are rejected that are associated with a blacklisted discussion page. This check is rather expensive and could take a long time. This option is not active if only a whitelist is used. Type: Boolean — Default value: `false`
ExactTemplateMatching	Defines whether to match the templates exactly or whether to match all templates that start with the String given in the respective parameter list. Type: Boolean — Default value: `true`
Host	The host server. Type: String
IncludeDiscussions	Whether the reader should also include talk pages. Type: Boolean — Default value: `true`
Language	The language of the Wikipedia that should be connected to. Type: String
LimitNUmberOfArticlesToRead	Optional parameter that defines the maximum number of articles that should be delivered by the reader. This avoids unnecessary filtering if only a small number of articles is needed. Optional — Type: Integer
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
TemplateBlacklist	Defines templates that the articles MUST NOT contain. If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[]
TemplateWhitelist	Defines templates that the articles MUST contain. If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[]
User	The username of the database account. Type: String
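
The combination of TemplateWhitelist, TemplateBlacklist and ExactTemplateMatching can be sketched as a set-based predicate. This is an in-memory illustration, not the reader's database-backed implementation. It assumes that a page satisfies a list if it contains at least one of the listed templates, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of DKPro Core.
final class TemplateFilterSketch {

    /** True if any template on the page matches any wanted template (exactly or by prefix). */
    static boolean matchesAny(Set<String> pageTemplates, List<String> wanted, boolean exact) {
        for (String template : pageTemplates) {
            for (String w : wanted) {
                if (exact ? template.equals(w) : template.startsWith(w)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Accepts pages that satisfy the whitelist (if any) and avoid the blacklist (if any). */
    static boolean accept(Set<String> pageTemplates, List<String> whitelist,
            List<String> blacklist, boolean exact) {
        boolean whiteOk = whitelist.isEmpty() || matchesAny(pageTemplates, whitelist, exact);
        boolean blackOk = blacklist.isEmpty() || !matchesAny(pageTemplates, blacklist, exact);
        return whiteOk && blackOk;
    }
}
```

With exact matching disabled, a whitelist entry such as `Infobox` also selects pages that use `Infobox_person`; with exact matching enabled it does not.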

Table 94. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig

XCES-XML

XcesBasicXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xces-asl

XcesBasicXmlReader

Implementation

org.dkpro.core.io.xces.XcesBasicXmlReader

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 95. Capabilities
Media types	none specified
Outputs	Paragraph

XcesBasicXmlWriter

Implementation

org.dkpro.core.io.xces.XcesBasicXmlWriter

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameSuffix	Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 96. Capabilities
Media types	none specified
Inputs	Paragraph

XcesXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xces-asl

XcesXmlReader

Implementation

org.dkpro.core.io.xces.XcesXmlReader

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
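
The prefix convention described for `patterns` can be illustrated with a small Java sketch. The class `PatternSketch` and its methods are hypothetical and not part of DKPro Core. The sketch uses the JDK glob syntax as a stand-in for Ant-style patterns, which behaves similarly for the cases shown but is not identical, and it rejects patterns without a prefix, which is an assumption of this example only.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of DKPro Core.
final class PatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    private final List<PathMatcher> includes = new ArrayList<>();
    private final List<PathMatcher> excludes = new ArrayList<>();

    PatternSketch(String... patterns) {
        for (String p : patterns) {
            if (p.startsWith(INCLUDE_PREFIX)) {
                includes.add(glob(p.substring(INCLUDE_PREFIX.length())));
            } else if (p.startsWith(EXCLUDE_PREFIX)) {
                excludes.add(glob(p.substring(EXCLUDE_PREFIX.length())));
            } else {
                // Assumption of this sketch: every pattern must carry a prefix.
                throw new IllegalArgumentException("Pattern lacks a prefix: " + p);
            }
        }
    }

    // JDK glob syntax stands in for Ant-style patterns here.
    private static PathMatcher glob(String pattern) {
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    // A path is accepted if some include matches and no exclude matches.
    boolean accepts(String relativePath) {
        Path path = Paths.get(relativePath);
        boolean included = includes.stream().anyMatch(m -> m.matches(path));
        boolean excluded = excludes.stream().anyMatch(m -> m.matches(path));
        return included && !excluded;
    }
}
```

With the patterns `[+]**/*.xml` and `[-]**/draft-*.xml`, this sketch accepts `corpus/sub/a.xml` and rejects `corpus/draft-1.xml`.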

Table 97. Capabilities
Media types	none specified
Outputs	Lemma Paragraph Sentence Token

XcesXmlWriter

Implementation

org.dkpro.core.io.xces.XcesXmlWriter

Description

No description available.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. `\` or `:`). Type: Boolean — Default value: `true`
filenameSuffix	Suffix of the output file names. Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
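
How `escapeDocumentId`, `stripExtension` and `filenameSuffix` might combine into a target file name can be sketched as follows. This is an illustration under stated assumptions, not the writer's actual implementation: the class `TargetNameSketch`, the method `targetName` and the order in which the steps are applied are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of DKPro Core.
final class TargetNameSketch {
    static String targetName(String documentId, boolean escapeDocumentId,
            boolean stripExtension, String filenameSuffix) {
        String name = documentId;
        if (stripExtension) {
            // Remove the last extension, but only within the last path segment.
            int dot = name.lastIndexOf('.');
            int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
            if (dot > slash) {
                name = name.substring(0, dot);
            }
        }
        if (escapeDocumentId) {
            try {
                // URL-encoding replaces characters such as ':' and '\'.
                name = URLEncoder.encode(name, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
        }
        return name + filenameSuffix;
    }
}
```

Under these assumptions, `targetName("doc:1.txt", true, true, ".xml")` yields `doc%3A1.xml`, while disabling both options yields `doc:1.txt.xml`.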

Table 98. Capabilities
Media types	none specified
Inputs	POS Lemma Paragraph Sentence Token

XML

InlineXml

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.xml-asl

InlineXmlWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.InlineXmlWriter

Description

Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.

Note that this component inherits the following restrictions from CasToInlineXml:

Features whose values are FeatureStructures are not represented.
Feature values which are strings longer than 64 characters are truncated.
Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ]
The Subject of analysis is presumed to be a text string.
Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters are not valid in XML documents.
It does not work for overlapping annotations, because these cannot be represented as properly nested XML.
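
Three of the restrictions listed above can be pictured with a short Java sketch. It does not reproduce the CasToInlineXml code; the class and method names are hypothetical, and details such as whether a truncation marker is appended are left out.

```java
// Hypothetical illustration of the listed restrictions; not the UIMA implementation.
final class InlineXmlRestrictionsSketch {
    static final int MAX_LEN = 64;

    // String feature values longer than 64 characters are cut off.
    static String truncate(String value) {
        return value.length() > MAX_LEN ? value.substring(0, MAX_LEN) : value;
    }

    // Arrays of primitives are rendered as "[ x, y ]".
    static String formatArray(int[] values) {
        StringBuilder sb = new StringBuilder("[ ");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(values[i]);
        }
        return sb.append(" ]").toString();
    }

    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Invalid characters become blanks, so the text length is preserved.
    static String blankInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' ');
            }
        });
        return sb.toString();
    }
}
```

Replacing invalid characters with a blank rather than deleting them keeps the text length unchanged, so annotation offsets remain valid.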

Parameters

Xslt	XSLT stylesheet to apply. Optional — Type: String
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. `\` or `:`). Type: Boolean — Default value: `true`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
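
What applying a stylesheet to inline XML amounts to can be sketched with the standard JDK transformation API. This is not the writer's own code: the class `XsltSketch` and the method `apply` are hypothetical, and both the XML and the stylesheet are passed as strings to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; not part of DKPro Core.
final class XsltSketch {
    // Applies the given XSLT stylesheet to the given XML and returns the result.
    static String apply(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A stylesheet with `<xsl:output method="text"/>` could, for example, reduce the inline XML to the text covered by selected annotation elements.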

Table 99. Capabilities
Media types	application/xml text/xml
Inputs	DocumentMetaData

Xml

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlReader

Description

Reader for XML files.

Parameters

DocIdTag	Tag which contains the document ID. Optional — Type: String
ExcludeTag	Tags that should not be processed. No text is extracted from them and no annotations are produced for them. Type: String[] — Default value: `[]`
IncludeTag	Tags that should be processed. If empty, all tags except those listed in ExcludeTag are processed. Type: String[] — Default value: `[]`
collectionId	The collection ID to set in the DocumentMetaData. Optional — Type: String
language	Set this as the language of the produced documents. Optional — Type: String
sourceLocation	Location from which the input is read. Type: String

Table 100. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData Field

XmlText

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlTextReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlTextReader

Description

No description available.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 101. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData

XmlXPath

Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlXPathReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlXPathReader

Description

A component reader for XML files implemented with XPath.

This is currently optimized for the TREC format, that is, the style in which TREC topics are presented. Provide an XPath expression that selects the parent nodes; the child nodes of each parent node are then stored in that parent's own CAS.

If your expression evaluates to leaf nodes, empty CASes will be created.
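
The parent/child behaviour described above can be sketched with the JDK XPath API. This is an illustration under assumptions, not the reader's implementation: the class `XPathSegmentsSketch` is hypothetical, and a map of child tag name to text stands in for the content of one CAS.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of DKPro Core.
final class XPathSegmentsSketch {
    // Returns one map per node matched by rootXPath; each map holds the
    // element children of that node (tag name -> trimmed text content).
    static List<Map<String, String>> segments(String xml, String rootXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList parents = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(rootXPath, doc, XPathConstants.NODESET);
            List<Map<String, String>> result = new ArrayList<>();
            for (int i = 0; i < parents.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = parents.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
                // A leaf node has no element children, so its map stays empty.
                result.add(fields);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a topic file with two `top` elements, the expression `/topics/top` yields two populated segments, while an expression that selects leaf nodes such as `/topics/top/num` yields only empty ones, mirroring the empty CASes mentioned above.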

Parameters

caseSensitive	Whether matching is case-sensitive. Optional — Type: Boolean — Default value: `true`
docIdTag	Tag which contains the document ID. If given, the reader ensures that there is only one ID tag within the same document and that it is not empty. Optional — Type: String
excludeTags	Tags which should be ignored. If empty then all tags will be processed. If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: `[]`
includeTags	Tags which should be worked on. If empty then all tags will be processed. If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: `[]`
language	Language of the documents. If given, it will be set in each CAS. Optional — Type: String
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Type: String[]
rootXPath	XPath expression that selects all nodes to be processed. Different segments are separated via PARAM_ID_TAG, and each segment is stored in a separate CAS. Type: String
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
workingDir	Tag name substitutions to apply in the CAS. Give the substitutions as consecutive before/after pairs. For example, to substitute "foo" with "bar" and "hey" with "ho", provide { "foo", "bar", "hey", "ho" }. Optional — Type: String[]
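
The tag selection rule (include set minus exclude set) and the before/after substitution pairs can be sketched as follows. The class `TagFilterSketch` and its methods are hypothetical and not part of DKPro Core. Treating an empty include set as "all tags" before subtracting the excludes is an assumption based on the parameter descriptions.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of DKPro Core.
final class TagFilterSketch {
    // Tags to process: the include set (or all tags if it is empty)
    // minus the exclude set.
    static Set<String> effectiveTags(Set<String> allTags, Set<String> include,
            Set<String> exclude) {
        Set<String> result = new LinkedHashSet<>(include.isEmpty() ? allTags : include);
        result.removeAll(exclude);
        return result;
    }

    // Turns { "foo", "bar", "hey", "ho" } into the mapping foo->bar, hey->ho.
    static Map<String, String> substitutions(String... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Substitutions must come in pairs");
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            mapping.put(pairs[i], pairs[i + 1]);
        }
        return mapping;
    }
}
```

An odd number of substitution entries is rejected in this sketch because a trailing "before" value would have no "after" partner.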

Table 102. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData Field