The document provides detailed information about the DKPro Core input and output formats.

Overview

Table 1. Formats (49)
Format	Reader	Writer
AclAnthology	AclAnthologyReader	none
BinaryCas	BinaryCasReader	BinaryCasWriter
BlikiWikipedia	BlikiWikipediaReader	none
Bnc	BncReader	none
Brat	BratReader	BratWriter
Combination	CombinationReader	none
Conll2000	Conll2000Reader	Conll2000Writer
Conll2002	Conll2002Reader	Conll2002Writer
Conll2006	Conll2006Reader	Conll2006Writer
Conll2009	Conll2009Reader	Conll2009Writer
Conll2012	Conll2012Reader	Conll2012Writer
Html	HtmlReader	none
ImsCwb	ImsCwbReader	ImsCwbWriter
InlineXml	none	InlineXmlWriter
Jdbc	JdbcReader	none
Json	none	JsonWriter
MalletTopicProportions	none	MalletTopicProportionsWriter
MalletTopicsProportionsSorted	none	MalletTopicsProportionsSortedWriter
NegraExport	NegraExportReader	none
Pdf	PdfReader	none
PennTreebankChunked	PennTreebankChunkedReader	none
PennTreebankCombined	PennTreebankCombinedReader	PennTreebankCombinedWriter
RTF	RTFReader	none
Reuters21578Sgml	Reuters21578SgmlReader	none
Reuters21578Txt	Reuters21578TxtReader	none
SerializedCas	SerializedCasReader	SerializedCasWriter
Solr	none	SolrWriter
String	StringReader	none
TGrep	none	TGrepWriter
Tcf	TcfReader	TcfWriter
Tei	TeiReader	TeiWriter
Text	TextReader	TextWriter
TigerXml	TigerXmlReader	TigerXmlWriter
TokenizedText	none	TokenizedTextWriter
Tuepp	TueppReader	none
Web1T	none	Web1TWriter
WikipediaArticle	WikipediaArticleReader	none
WikipediaArticleInfo	WikipediaArticleInfoReader	none
WikipediaDiscussion	WikipediaDiscussionReader	none
WikipediaLink	WikipediaLinkReader	none
WikipediaPage	WikipediaPageReader	none
WikipediaQuery	WikipediaQueryReader	none
WikipediaRevision	WikipediaRevisionReader	none
WikipediaRevisionPair	WikipediaRevisionPairReader	none
WikipediaTemplateFilteredArticle	WikipediaTemplateFilteredArticleReader	none
Xmi	XmiReader	XmiWriter
Xml	XmlReader	none
XmlText	XmlTextReader	none
XmlXPath	XmlXPathReader	none

I/O components

ACL Anthology

AclAnthology

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.aclanthology-asl

Known corpora in this format:

ACL Anthology Reference Corpus (ACL ARC)

AclAnthologyReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.aclanthology.AclAnthologyReader

Reads the ACL anthology corpus and outputs CASes with plain text documents.

Parameters

Encoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
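
The include/exclude convention of the patterns parameter can be sketched as follows. This is a minimal standalone Python illustration, not the DKPro Core implementation: the helper names split_patterns, glob_to_regex and accepts are hypothetical, the wildcard handling is a simplified reading of the description above, and rejecting patterns without a prefix is an assumption of this sketch.

```python
import re

INCLUDE_PREFIX = "[+]"
EXCLUDE_PREFIX = "[-]"


def split_patterns(patterns):
    """Separate prefixed patterns into (includes, excludes), prefixes removed."""
    includes, excludes = [], []
    for pattern in patterns:
        if pattern.startswith(INCLUDE_PREFIX):
            includes.append(pattern[len(INCLUDE_PREFIX):])
        elif pattern.startswith(EXCLUDE_PREFIX):
            excludes.append(pattern[len(EXCLUDE_PREFIX):])
        else:
            # Assumption of this sketch: every pattern carries a prefix.
            raise ValueError(f"pattern without [+]/[-] prefix: {pattern!r}")
    return includes, excludes


def glob_to_regex(glob):
    """Translate a simplified Ant-like glob into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append("(?:[^/]+/)*")   # any number of sub-directories
            i += 3
        elif glob[i] == "*":
            parts.append("[^/]*")         # part of a single name
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def accepts(path, patterns):
    """True if path matches an include pattern and no exclude pattern."""
    includes, excludes = split_patterns(patterns)
    included = any(glob_to_regex(g).match(path) for g in includes)
    excluded = any(glob_to_regex(g).match(path) for g in excludes)
    return included and not excluded
```

With the hypothetical pattern set ["[+]**/*.txt", "[-]**/draft*.txt"], this sketch accepts a/b/doc.txt and rejects a/draft1.txt.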

Outputs

Outputs	DocumentMetaData

brat file format

Brat

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.brat-asl

BratReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.brat.BratReader

Reader for the brat format.

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
relationTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}]: Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}. Additionally, a subcategorization feature may be specified.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
sourceLocation (String) [optional]: Location from which the input is read.
textAnnotationTypes (String[]) = []: Types that are text annotations. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Using this parameter is only necessary to specify a subcategorization feature. Otherwise, text annotation types are automatically detected.
typeMappings (String[]) = [] [optional]
useDefaultExcludes (Boolean) = true: Use the default excludes.

BratWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.brat.BratWriter

Writer for the brat annotation format.

Known issues:

Brat is unable to read relation attributes created by this writer.
PARAM_TYPE_MAPPINGS not implemented yet

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
enableTypeMappings (Boolean) = false: Enable type mappings.
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
excludeTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence]: Types that will not be written to the exported file.
filenameSuffix (String) = .ann: Specify the suffix of output files. Default value .ann. If the suffix is not needed, provide an empty string as value.
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
palette (String[]) = [#8dd3c7, #ffffb3, #bebada, #fb8072, #80b1d3, #fdb462, #b3de69, #fccde5, #d9d9d9, #bc80bd, #ccebc5, #ffed6f] [optional]: Colors to be used for the visual configuration that is generated for brat.
relationTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent]: Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent.
shortAttributeNames (Boolean) = false: Whether to render attributes by their short name or by their qualified name.
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
spanTypes (String[]) = []: Types that are text annotations (aka entities or spans).
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
textFilenameSuffix (String) = .txt: Specify the suffix of text output files. Default value .txt. If the suffix is not needed, provide an empty string as value.
typeMappings (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.ner.type.(\\w+) → $1] [optional]: FIXME
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeNullAttributes (Boolean) = false: Enable writing of features with null values.
writeRelationAttributes (Boolean) = false: The brat web application can currently not handle attributes on relations, thus they are disabled by default. Here they can be enabled again.
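
The relationTypes entries of the writer follow a "type name, Arg1 feature, Arg2 feature" layout separated by colons. The following standalone Python sketch shows how such an entry could be taken apart. The names RelationSpec and parse_relation_type are hypothetical and are not part of DKPro Core; the sketch does not cover the optional subcategorization feature mentioned for the reader.

```python
from typing import NamedTuple


class RelationSpec(NamedTuple):
    type_name: str      # fully qualified annotation type
    arg1_feature: str   # feature holding Arg1
    arg2_feature: str   # feature holding Arg2


def parse_relation_type(spec: str) -> RelationSpec:
    """Split 'Type:Arg1:Arg2' into its three mandatory parts."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'Type:Arg1:Arg2', got {spec!r}")
    return RelationSpec(*parts)


DEFAULT = ("de.tudarmstadt.ukp.dkpro.core.api.syntax.type."
           "dependency.Dependency:Governor:Dependent")
spec = parse_relation_type(DEFAULT)
```

For the default value, spec.arg1_feature is "Governor" and spec.arg2_feature is "Dependent".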

British National Corpus

Bnc

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.bnc-asl

Known corpora in this format:

British National Corpus

BncReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.bnc.BncReader

Reader for the British National Corpus (XML version).

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	POS DocumentMetaData Lemma Sentence Token

Combination

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.combination-asl

CombinationReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.combination.CombinationReader

Combines multiple readers into a single reader.

Parameters

readers (String[])

CoNLL

Conll2000

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.conll-asl

Known corpora in this format:

CoNLL 2000 Chunking Corpus - English (CoNLL 2000 format)

Conll2000Reader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Reader

Reads the Conll 2000 chunking format.


He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

FORM - token
POSTAG - part-of-speech tag
CHUNK - chunk (BIO encoded)

Sentences are separated by a blank new line.
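
The three-column layout and the BIO chunk encoding described above can be illustrated with a small standalone Python sketch. It is not the reader's actual implementation; the function names read_conll2000 and chunk_spans are hypothetical, columns are split on arbitrary whitespace (an assumption), and the sample is abridged from the example above with an artificial sentence break added.

```python
def read_conll2000(text):
    """Return a list of sentences, each a list of (FORM, POSTAG, CHUNK)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        form, postag, chunk = line.split()
        current.append((form, postag, chunk))
    if current:
        sentences.append(current)
    return sentences


def chunk_spans(sentence):
    """Decode BIO tags into (label, start, end) token spans, end exclusive."""
    spans, start, label = [], None, None
    for i, (_, _, tag) in enumerate(sentence):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label)
        if label is not None and (tag == "O" or starts_new):
            spans.append((label, start, i))   # close the open chunk
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(sentence)))
    return spans


SAMPLE = """\
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP

will      MD   B-VP
narrow    VB   I-VP
"""

sentences = read_conll2000(SAMPLE)
first_spans = chunk_spans(sentences[0])
```

Here first_spans holds one NP chunk for "He", one VP chunk for "reckons" and one NP chunk covering "the current account deficit".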

Parameters

ChunkMappingLocation (String) [optional]: Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.
ChunkTagSet (String) [optional]: Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
POSMappingLocation (String) [optional]: Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
internTags (Boolean) = true [optional]: Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readChunk (Boolean) = true: Read chunk information. Default: true
readPOS (Boolean) = true: Read part-of-speech information. Default: true
sourceEncoding (String) = UTF-8: Character encoding of the input data.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	DocumentMetaData Sentence Token Chunk

Conll2000Writer

Writer class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Writer

Writes the CoNLL 2000 chunking format. The columns are separated by spaces.


He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

FORM - token
POSTAG - part-of-speech tag
CHUNK - chunk (BIO encoded)

Sentences are separated by a blank new line.
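
Going the other way, chunk spans have to be turned back into BIO tags before the columns are written. The following standalone Python sketch illustrates that step under stated assumptions: the names bio_tags and write_conll2000 are hypothetical, spans are (label, start, end) token ranges with an exclusive end, and a single space is used as the column separator.

```python
def bio_tags(n_tokens, spans):
    """Encode (label, start, end) spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for label, start, end in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def write_conll2000(sentences):
    """sentences: list of (tokens, spans); tokens are (FORM, POSTAG) pairs."""
    blocks = []
    for tokens, spans in sentences:
        tags = bio_tags(len(tokens), spans)
        lines = [f"{form} {pos} {tag}"
                 for (form, pos), tag in zip(tokens, tags)]
        blocks.append("\n".join(lines))
    # A blank line separates sentences.
    return "\n\n".join(blocks) + "\n"


tokens = [("He", "PRP"), ("reckons", "VBZ"), (".", ".")]
spans = [("NP", 0, 1), ("VP", 1, 2)]
output = write_conll2000([(tokens, spans)])
```

Tokens not covered by any span, such as the final period in this example, receive the tag O.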

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .conll
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeChunk (Boolean) = true
writePOS (Boolean) = true
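
How escapeDocumentId, stripExtension and filenameSuffix could interact when a target file name is derived can be sketched in a few lines of standalone Python. This is only an illustration: the function target_name is hypothetical, the order of the steps is an assumption, and the exact escaping applied by the writer may differ from Python's urllib.parse.quote.

```python
import posixpath
from urllib.parse import quote


def target_name(doc_id, suffix=".conll",
                escape_document_id=True, strip_extension=False):
    """Derive an output file name from a document ID (illustrative only)."""
    name = doc_id
    if strip_extension:
        # Drop the original extension, e.g. ".txt".
        name = posixpath.splitext(name)[0]
    if escape_document_id:
        # URL-encode characters such as ":" or "\" that are illegal
        # in file names on some platforms.
        name = quote(name, safe="")
    return name + suffix


plain = target_name("a:b.txt")
stripped = target_name("a:b.txt", strip_extension=True)
```

In this sketch the colon in the hypothetical document ID a:b.txt is encoded as %3A, and the suffix is appended after escaping.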

Inputs

Inputs	DocumentMetaData Sentence Token Chunk

Conll2002

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.conll-asl

Conll2002Reader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Reader

Reads the CoNLL 2002 named entity format. The columns are separated by a single space; the example below is padded with extra spaces for readability.


Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O

FORM - token
NER - named entity (BIO encoded)

Sentences are separated by a blank new line.
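
The two-column layout with BIO-encoded named entities can be decoded as in the following standalone Python sketch. It is not the reader's actual code; the function names read_entities and _entities are hypothetical, columns are split on arbitrary whitespace (an assumption), and the sample is abridged from the example above.

```python
def _entities(tokens):
    """Collect (label, text) pairs from one sentence of (FORM, NER) tokens."""
    entities, words, label = [], [], None
    # A sentinel "O" token closes an entity that runs to the sentence end.
    for form, tag in tokens + [("", "O")]:
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            entities.append((label, " ".join(words)))
            words, label = [], None
        if tag != "O" and not continues:
            label, words = tag[2:], [form]   # open a new entity
        elif continues:
            words.append(form)
    return entities


def read_entities(text):
    """Return one entity list per sentence."""
    sentences, tokens = [], []
    # The extra empty line flushes the last sentence.
    for line in text.splitlines() + [""]:
        if line.strip():
            form, tag = line.split()
            tokens.append((form, tag))
        elif tokens:
            sentences.append(_entities(tokens))
            tokens = []
    return sentences


SAMPLE = """\
Wolff      B-PER
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
Real       B-ORG
Madrid     I-ORG
.          O
"""

result = read_entities(SAMPLE)
```

For this sample, result contains a single sentence with the entities Wolff (PER), Del Bosque (PER) and Real Madrid (ORG).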

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
internTags (Boolean) = true [optional]: Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language (String) [optional]: The language.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readNamedEntity (Boolean) = true: Read named entity information. Default: true
sourceEncoding (String) = UTF-8: Character encoding of the input data.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	DocumentMetaData NamedEntity Sentence Token

Conll2002Writer

Writer class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Writer

Writes the CoNLL 2002 named entity format. The columns are separated by a single space; the example below is padded with extra spaces for readability.


Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O

FORM - token
NER - named entity (BIO encoded)

Sentences are separated by a blank new line.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .conll
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeNamedEntity (Boolean) = true

Inputs

Inputs	DocumentMetaData NamedEntity Sentence Token

Conll2006

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.conll-asl

Known corpora in this format:

CoNLL-X Shared Task free data - Danish, Dutch, Portuguese, and Swedish
Copenhagen Dependency Treebanks - Danish
FinnTreeBank - Finnish (in recent versions with additional pseudo-XML metadata)
Floresta Sintá(c)tica (Bosque-CoNLL) - Portuguese
Sequoia corpus - French
SETimes.HR corpus and dependency treebank of Croatian - Croatian
Składnica zależnościowa - Polish
Slovene Dependency Treebank - Slovene
Swedish Treebank - Swedish
Talbanken05 - Swedish
Uppsala Persian Dependency Treebank - Persian (Farsi)

Conll2006Reader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Reader

Reads a file in the CoNLL-2006 format (aka CoNLL-X).


Heutzutage heutzutage ADV _ _ ADV _ _

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Lemma or stem (depending on the particular data set) of the word form, or an underscore if not available.
CPOSTAG - (unused)
POSTAG - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
FEATS - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PHEAD - (ignored) Projective head of current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).
PDEPREL - (ignored) Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Sentences are separated by a blank new line.
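
The ten columns listed above can be mapped to named fields, and HEAD and DEPREL can be resolved into dependency triples, as in the following standalone Python sketch. It is not the reader's implementation: the names parse_token and dependencies are hypothetical, columns are split on arbitrary whitespace (an assumption), and the two sample rows use invented values purely for illustration.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_token(line):
    """Map one CoNLL-X line to a dict keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    # FEATS is a |-separated set, or "_" if not available.
    row["FEATS"] = [] if row["FEATS"] == "_" else row["FEATS"].split("|")
    return row


def dependencies(rows):
    """Return (dependent FORM, DEPREL, governor FORM) per token.

    A HEAD of "0" has no governor token and is reported as "ROOT".
    """
    form_by_id = {row["ID"]: row["FORM"] for row in rows}
    triples = []
    for row in rows:
        governor = "ROOT" if row["HEAD"] == "0" else form_by_id[row["HEAD"]]
        triples.append((row["FORM"], row["DEPREL"], governor))
    return triples


# Hypothetical sample rows (values invented for this sketch).
SAMPLE = [
    "1 Heutzutage heutzutage ADV ADV _ 2 MO _ _",
    "2 regnet regnen V VVFIN number=sg|person=3 0 ROOT _ _",
]
rows = [parse_token(line) for line in SAMPLE]
triples = dependencies(rows)
```

The PHEAD and PDEPREL fields are parsed but otherwise unused here, matching their "(ignored)" status in the column list.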

Parameters

POSMappingLocation (String) [optional]: Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readDependency (Boolean) = true
readLemma (Boolean) = true
readMorph (Boolean) = true
readPOS (Boolean) = true
sourceEncoding (String) = UTF-8
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2006Writer

Writer class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Writer

Writes a file in the CoNLL-2006 format (aka CoNLL-X).


Heutzutage heutzutage ADV _ _ ADV _ _

ID - token number in sentence
FORM - token
LEMMA - lemma
CPOSTAG - part-of-speech tag (coarse grained)
POSTAG - part-of-speech tag
FEATS - unused
HEAD - target token for a dependency parsing
DEPREL - function of the dependency parsing
PHEAD - unused
PDEPREL - unused

Sentences are separated by a blank new line.
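
A line in this layout can be assembled as in the following standalone Python sketch, which emits an underscore for the columns marked as unused above. The names format_token and format_sentences are hypothetical, and the tab separator is an assumption based on the usual CoNLL-X convention rather than something stated in this section.

```python
def format_token(index, form, lemma=None, cpostag=None, postag=None,
                 head=None, deprel=None):
    """Build one 10-column line; missing or unused values become "_"."""
    columns = [
        index, form, lemma, cpostag, postag,
        None,          # FEATS   (unused)
        head, deprel,
        None,          # PHEAD   (unused)
        None,          # PDEPREL (unused)
    ]
    return "\t".join("_" if c is None else str(c) for c in columns)


def format_sentences(sentences):
    """Join the lines of each sentence; a blank line separates sentences."""
    return "\n\n".join("\n".join(lines) for lines in sentences) + "\n"


# Hypothetical values, for illustration only.
line = format_token(1, "Heutzutage", "heutzutage", "ADV", "ADV", 2, "MO")
```

Passing None for a value, for example when lemmas are not written, yields an underscore in the corresponding column.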

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .conll
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeDependency (Boolean) = true
writeLemma (Boolean) = true
writeMorph (Boolean) = true
writePOS (Boolean) = true

Inputs

Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2009

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.conll-asl

Conll2009Reader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Reader

Reads a file in the CoNLL-2009 format.

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Gold-standard lemma of FORM
PLEMMA - (ignored) Automatically predicted lemma of FORM
POS - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
PPOS - (ignored) Automatically predicted major POS by a language-specific tagger
FEAT - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
PFEAT - (ignored) Automatically predicted morphological features (if applicable)
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
PHEAD - (ignored) Automatically predicted syntactic head
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PDEPREL - (ignored) Automatically predicted dependency relation to PHEAD
FILLPRED - (ignored) Contains 'Y' for argument-bearing tokens
PRED - (SemanticPredicate) (sense) identifier of a semantic 'predicate' coming from a current token
APREDs - (SemanticArgument) Columns with argument labels for each semantic predicate (in the ID order)

Sentences are separated by a blank new line.
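
The relationship between the PRED column and the APRED columns can be illustrated with a standalone Python sketch: the k-th predicate in the sentence owns the k-th APRED column. This is not the reader's implementation; the name predicate_arguments is hypothetical, the rows are assumed to be already split into columns, and the sample sentence uses invented values.

```python
PRED = 13        # zero-based index of the PRED column
FIRST_APRED = 14  # APRED columns start after the 14 fixed columns


def predicate_arguments(rows):
    """Return (predicate sense, [(argument label, FORM), ...]) per predicate.

    rows: one list of column values per token of a single sentence.
    """
    predicates = [row for row in rows if row[PRED] != "_"]
    frames = []
    for k, predicate in enumerate(predicates):
        column = FIRST_APRED + k   # APRED column of the k-th predicate
        arguments = [(row[column], row[1])
                     for row in rows if row[column] != "_"]
        frames.append((predicate[PRED], arguments))
    return frames


# Hypothetical sample sentence (values invented for this sketch).
SAMPLE = """\
1 He he he PRP PRP _ _ 2 2 SBJ SBJ _ _ A0
2 eats eat eat VBZ VBZ _ _ 0 0 ROOT ROOT Y eat.01 _
3 apples apple apple NNS NNS _ _ 2 2 OBJ OBJ _ _ A1
"""
rows = [line.split() for line in SAMPLE.splitlines()]
frames = predicate_arguments(rows)
```

For the sample, frames contains one predicate, eat.01, with the arguments A0 (He) and A1 (apples).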

Parameters

POSMappingLocation (String) [optional]: Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readDependency (Boolean) = true
readLemma (Boolean) = true
readMorph (Boolean) = true
readPOS (Boolean) = true
readSemanticPredicate (Boolean) = true
sourceEncoding (String) = UTF-8
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemanticArgument SemanticPredicate Dependency

Conll2009Writer

Writer class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Writer

Writes a file in the CoNLL-2009 format.

ID - (ignored) Token counter, starting at 1 for each new sentence.
FORM - (Token) Word form or punctuation symbol.
LEMMA - (Lemma) Gold-standard lemma of FORM
PLEMMA - (ignored) Automatically predicted lemma of FORM
POS - (POS) Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
PPOS - (ignored) Automatically predicted major POS by a language-specific tagger
FEAT - (MorphologicalFeatures) Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
PFEAT - (ignored) Automatically predicted morphological features (if applicable)
HEAD - (Dependency) Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.
PHEAD - (ignored) Automatically predicted syntactic head
DEPREL - (Dependency) Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PDEPREL - (ignored) Automatically predicted dependency relation to PHEAD
FILLPRED - (auto-generated) Contains 'Y' for argument-bearing tokens
PRED - (SemanticPredicate) (sense) identifier of a semantic 'predicate' coming from a current token
APREDs - (SemanticArgument) Columns with argument labels for each semantic predicate (in the ID order)

Sentences are separated by a blank new line.
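
How the FILLPRED, PRED and APRED columns could be generated from predicate-argument structures is sketched below in standalone Python. This is an illustration under assumptions, not the writer's code: the name srl_columns is hypothetical, token positions are zero-based indices, and FILLPRED is set to 'Y' on the rows that carry a PRED value, following the usual CoNLL-2009 convention.

```python
def srl_columns(n_tokens, frames):
    """Build [FILLPRED, PRED, APRED1, ...] for every token of a sentence.

    frames: list of (predicate_index, sense, {argument_index: label}).
    """
    # APRED columns follow the order of the predicates in the sentence.
    frames = sorted(frames, key=lambda frame: frame[0])
    rows = [["_", "_"] + ["_"] * len(frames) for _ in range(n_tokens)]
    for k, (pred_index, sense, arguments) in enumerate(frames):
        rows[pred_index][0] = "Y"       # FILLPRED (auto-generated)
        rows[pred_index][1] = sense     # PRED
        for arg_index, label in arguments.items():
            rows[arg_index][2 + k] = label   # k-th APRED column
    return rows


# Hypothetical frame for a three-token sentence such as "He eats apples".
columns = srl_columns(3, [(1, "eat.01", {0: "A0", 2: "A1"})])
```

A sentence without predicates yields only the two underscore columns FILLPRED and PRED for each token in this sketch.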

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .conll
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeDependency (Boolean) = true
writeLemma (Boolean) = true
writeMorph (Boolean) = true
writePOS (Boolean) = true
writeSemanticPredicate (Boolean) = true
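The effect of escapeDocumentId together with filenameSuffix can be sketched as follows. This is an illustration only: it assumes that URL-encoding behaves like `java.net.URLEncoder` with UTF-8, which may differ in detail from what the writer does internally. The class name, method name, and sample document ID are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DocumentIdEscaping {
    /** URL-encodes a document ID and appends the file name suffix. */
    static String toFileName(String documentId, String suffix) {
        try {
            return URLEncoder.encode(documentId, "UTF-8") + suffix;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The colon and the backslash would be illegal in file names on some systems.
        System.out.println(toFileName("news:2014\\doc1", ".conll"));
        // prints "news%3A2014%5Cdoc1.conll"
    }
}
```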

Inputs

Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemanticArgument SemanticPredicate Dependency

Conll2012

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.conll-asl

Conll2012Reader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Reader

Reads a file in the CoNLL-2012 format.

Document ID - (ignored) This is a variation on the document filename.
Part number - (ignored) Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
Word number - (ignored)
Word itself - (document text) This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank which is part of the OntoNotes release.
Part-of-Speech - (POS)
Parse bit - (Constituent) This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
Predicate lemma - (Lemma) The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-".
Predicate Frameset ID - (SemanticPredicate) This is the PropBank frameset ID of the predicate in Column 7.
Word sense - (ignored) This is the word sense of the word in Column 3.
Speaker/Author - (ignored) This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
Named Entities - (NamedEntity) This column identifies the spans representing various named entities.
Predicate Arguments - (SemanticPredicate) There is one column each of predicate argument structure information for the predicate mentioned in Column 7.
Coreference - (CoreferenceChain) Coreference chain information encoded in a parenthesis structure.

Sentences are separated by a blank new line.
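The reconstruction of the full parse from the parse bit column, as described above, can be sketched like this: each asterisk is replaced with a "([pos] [word])" leaf and the results are concatenated row by row. The class name and the three-token sample sentence are hypothetical and not taken from OntoNotes.

```java
class ParseBitSketch {
    /** Rebuilds the bracketed parse from the per-token parse bits. */
    static String assemble(String[] words, String[] pos, String[] parseBits) {
        StringBuilder tree = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            // Build the leaf first, then substitute it for the asterisk in the parse bit.
            String leaf = "(" + pos[i] + " " + words[i] + ")";
            tree.append(parseBits[i].replace("*", leaf));
        }
        return tree.toString();
    }

    public static void main(String[] args) {
        String[] words = {"The", "cat", "sleeps"};
        String[] pos = {"DT", "NN", "VBZ"};
        String[] bits = {"(TOP(S(NP*", "*)", "(VP*)))"};
        System.out.println(assemble(words, pos, bits));
        // prints "(TOP(S(NP(DT The)(NN cat))(VP(VBZ sleeps))))"
    }
}
```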

Parameters

ConstituentMappingLocation (String) [optional]: Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.
ConstituentTagSet (String) [optional]: Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
POSMappingLocation (String) [optional]: Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
internTags (Boolean) = true [optional]: Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readConstituent (Boolean) = true
readCoreference (Boolean) = true
readLemma (Boolean) = false: Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates.
readNamedEntity (Boolean) = true
readPOS (Boolean) = true
readSemanticPredicate (Boolean) = true
readWordSense (Boolean) = true
sourceEncoding (String) = UTF-8
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
useHeaderMetadata (Boolean) = true: Use the document ID declared in the file header instead of using the filename.
writeTracesToText (Boolean) = false [optional]
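How the [+] and [-] prefixes of the patterns parameter separate include patterns from exclude patterns can be sketched as below. This only illustrates the prefix convention; it does not perform any file matching and is not the reader's own implementation. The class name, method name, and sample patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PatternPrefixSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    /** Returns the patterns carrying the given prefix, with the prefix removed. */
    static List<String> select(String[] patterns, String prefix) {
        List<String> result = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.startsWith(prefix)) {
                result.add(pattern.substring(prefix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] patterns = {"[+]**/*.conll", "[-]**/broken/*"};
        System.out.println("includes: " + select(patterns, INCLUDE_PREFIX));
        System.out.println("excludes: " + select(patterns, EXCLUDE_PREFIX));
    }
}
```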

Outputs

Outputs	POS DocumentMetaData Lemma Sentence Token SemanticArgument SemanticPredicate

Conll2012Writer

Writer class: de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Writer

Writer for the CoNLL-2012 format.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .conll
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
writeLemma (Boolean) = true
writePOS (Boolean) = true
writeSemanticPredicate (Boolean) = true

Inputs

Inputs	POS DocumentMetaData Lemma Sentence Token SemanticArgument SemanticPredicate

HTML

Html

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.html-asl

HtmlReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.html.HtmlReader

Reads the contents of a given URL and strips the HTML. Returns only the textual contents.

Parameters

language (String) [optional]: Set this as the language of the produced documents.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
sourceLocation (String): URL from which the input is read.

Outputs

Outputs	DocumentMetaData

IMS Corpus Workbench

ImsCwb

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl

The IMS Open Corpus Workbench is a linguistic search engine. It uses a tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees). If a local installation of the corpus workbench is available, it can be used by this module to immediately generate the corpus workbench index format. Search is not supported by this module.

See also:

IMS Open Corpus Workbench

Known corpora in this format:

WaCky - The Web-As-Corpus Kool Yinitiative - corpora crawled from the world wide web in several different languages (DeWaC, UkWaC, ItWaC, etc.)

ImsCwbReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbReader

Reads a tab-separated format including pseudo-XML tags.
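The general shape of such input, tab-separated token lines interleaved with pseudo-XML tag lines, can be sketched as follows. The sketch assumes that the first column holds the word form and that any line enclosed in angle brackets is a tag; the sample column order (word, POS, lemma) and the tag names are assumptions for illustration, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CwbLineSketch {
    /** Treats every line enclosed in angle brackets as a pseudo-XML tag. */
    static boolean isTag(String line) {
        return line.startsWith("<") && line.endsWith(">");
    }

    /** Collects the first column of every token line, skipping tag lines. */
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.isEmpty() || isTag(line)) {
                continue;
            }
            words.add(line.split("\t")[0]);
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "<text id=\"d1\">\n<s>\nDogs\tNNS\tdog\nbark\tVBP\tbark\n</s>\n</text>\n";
        System.out.println(words(sample));
        // prints "[Dogs, bark]"
    }
}
```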

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Specify which tag set should be used to locate the mapping file.
generateNewIds (Boolean) = false: If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)
idIsUrl (Boolean) = false: If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS (Default: false)
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readLemma (Boolean) = true: Read lemmas. Default: true
readPOS (Boolean) = true: Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used. Default: true
readSentence (Boolean) = true: Read sentences. Default: true
readToken (Boolean) = true: Read tokens and generate Token annotations. Default: true
replaceNonXml (Boolean) = true: Replace non-XML characters with spaces. (Default: true)
sourceEncoding (String) = UTF-8
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	POS DocumentMetaData Lemma Sentence Token

ImsCwbWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter

This Consumer outputs the content of all CASes into the IMS workbench format. This writer produces a text file which needs to be converted to the binary IMS CWB index files using the command line tools that come with the CWB. It is possible to set the parameter #PARAM_CQP_HOME to directly create output in the native binary CQP format via the original CWB command line tools.

Parameters

additionalFeatures (String[]) [optional]: Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames.
corpusName (String) = corpus: The name of the generated corpus.
cqpCompress (Boolean) = false: Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. If you have large data sets, it is best to test which setting works for you. (default: false)
cqpHome (String) [optional]: Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to directly encode into the CQP binary format.
cqpwebCompatibility (Boolean) = false: Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore.
sentenceTag (String) = s
targetEncoding (String) = UTF-8: Character encoding of the output data.
targetLocation (String): Location to which the output is written.
writeCPOS (Boolean) = false: Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag.
writeDocId (Boolean) = false: Write the document ID for each token. It is usually a better idea to generate a #PARAM_WRITE_DOCUMENT_TAG document tag or a #PARAM_WRITE_TEXT_TAG text tag which also contain the document ID that can be queried in CQP.
writeDocumentTag (Boolean) = false: Write a pseudo-XML tag with the name document to mark the start and end of a document.
writeLemma (Boolean) = true: Write lemmata.
writeOffsets (Boolean) = false: Write the start and end position of each token.
writePOS (Boolean) = true: Write part-of-speech tags.
writeTextTag (Boolean) = true: Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb.
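The constraint behind cqpwebCompatibility (IDs made up of only letters, numbers and underscore) can be illustrated with a small sketch. Replacing every other character with an underscore is an assumption made here for illustration; the writer may normalize IDs differently. The class and method names are hypothetical.

```java
class CqpwebIdSketch {
    /** Replaces every character that is not an ASCII letter, digit, or underscore. */
    static String toCompatibleId(String documentId) {
        return documentId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toCompatibleId("doc-1.txt"));
        // prints "doc_1_txt"
    }
}
```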

Inputs

Inputs	POS DocumentMetaData Lemma Sentence Token

JDBC

Jdbc

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jdbc-asl

JdbcReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jdbc.JdbcReader

Collection reader for JDBC database. The obtained data will be written into CAS DocumentText as well as fields of the DocumentMetaData annotation.

The field names are available as constants and begin with CAS_. Please specify the mapping of the columns and the field names in the query. For example,

SELECT text AS cas_text, title AS cas_metadata_title FROM test_table

will create a CAS for each record, write the content of the "text" column into the CAS document text and that of the "title" column into the document title field of the DocumentMetaData annotation.
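Building such a query from a column-to-field mapping can be sketched as follows. The aliases cas_text and cas_metadata_title are the ones used in the example query above; the class, the helper methods, and the idea of assembling the query programmatically are hypothetical. The sketch only constructs the SQL string and does not open a database connection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class JdbcQuerySketch {
    /** Builds a SELECT statement that aliases each column to a CAS field name. */
    static String buildQuery(Map<String, String> columnToCasField, String table) {
        StringJoiner select = new StringJoiner(", ");
        for (Map.Entry<String, String> entry : columnToCasField.entrySet()) {
            select.add(entry.getKey() + " AS " + entry.getValue());
        }
        return "SELECT " + select + " FROM " + table;
    }

    /** Reproduces the example query from the reader description. */
    static String example() {
        // LinkedHashMap keeps the columns in insertion order.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("text", "cas_text");
        mapping.put("title", "cas_metadata_title");
        return buildQuery(mapping, "test_table");
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```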

Parameters

connection (String) = jdbc:mysql://127.0.0.1/: Specifies the URL to the database.
If used with uimaFIT and the value is not given, jdbc:mysql://127.0.0.1/ will be taken.
database (String): Specifies name of the database to be accessed.
driver (String) = com.mysql.jdbc.Driver: Specify the class name of the JDBC driver.
If used with uimaFIT and the value is not given, com.mysql.jdbc.Driver will be taken.
language (String) [optional]: Specifies the language.
password (String): Specifies the password for database access.
query (String): Specifies the query.
user (String): Specifies the user name for database access.

Outputs

Outputs	DocumentMetaData

Mallet

MalletTopicProportions

Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletTopicProportionsWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.io.MalletTopicProportionsWriter

Write topic proportions to a file. This depends on the TopicDistribution annotation, which should have been created by MalletTopicModelInferencer before.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String)
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.

MalletTopicsProportionsSorted

Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletTopicsProportionsSortedWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.io.MalletTopicsProportionsSortedWriter

Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletTopicModelInferencer.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
nTopics (Integer) = 3
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String)
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
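One plausible reading of the nTopics parameter is that only the nTopics topics with the highest proportions are written, in descending order. The sketch below illustrates that selection under this assumption; it is not the writer's implementation, and the class name, method name, and sample proportions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TopTopicsSketch {
    /** Returns the indices of the nTopics largest proportions, largest first. */
    static List<Integer> topTopics(double[] proportions, int nTopics) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < proportions.length; i++) {
            ids.add(i);
        }
        // Sort topic indices by descending proportion.
        ids.sort((a, b) -> Double.compare(proportions[b], proportions[a]));
        return new ArrayList<>(ids.subList(0, Math.min(nTopics, ids.size())));
    }

    public static void main(String[] args) {
        double[] proportions = {0.10, 0.50, 0.05, 0.35};
        System.out.println(topTopics(proportions, 3));
        // prints "[1, 3, 0]"
    }
}
```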

NEGRA

NegraExport

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.negra-asl

NegraExportReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader

This CollectionReader reads a file which is formatted in the NEGRA export format. The texts and additional information such as constituent structure are reproduced in CASes, one CAS per text (article).

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
collectionId (String) [optional]: The collection ID to be written to the document meta data. (Default: none)
documentUnit (String) = ORIGIN_NAME: What indicates if a new CAS should be started. E.g., if set to DocumentUnit#ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. (Default: ORIGIN_NAME)
generateNewIds (Boolean) = false: If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)
language (String) [optional]: The language.
readLemma (Boolean) = true: Write lemma information. Default: true
readPOS (Boolean) = true: Write part-of-speech information. Default: true
readPennTree (Boolean) = false: Write Penn Treebank bracketed structure information. Mind this may not work with all tagsets, in particular not with such that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not using the mapped tagset! Default: false
sourceEncoding (String) = UTF-8: Character encoding of the input data.
sourceLocation (String): Location from which the input is read.
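The documentUnit behaviour for ORIGIN_NAME, starting a new CAS whenever the origin name of the current sentence differs from that of the previous one, can be sketched as a simple grouping step. The data layout (pairs of origin name and sentence text) and all names in the sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class OriginNameGroupingSketch {
    /**
     * Groups sentences into documents. Each input entry is {originName, sentenceText}.
     * A new document starts whenever the origin name changes between consecutive sentences.
     */
    static List<List<String>> group(String[][] sentences) {
        List<List<String>> documents = new ArrayList<>();
        String lastOrigin = null;
        for (String[] sentence : sentences) {
            if (documents.isEmpty() || !sentence[0].equals(lastOrigin)) {
                documents.add(new ArrayList<>());
            }
            documents.get(documents.size() - 1).add(sentence[1]);
            lastOrigin = sentence[0];
        }
        return documents;
    }

    public static void main(String[] args) {
        String[][] sentences = {{"a", "s1"}, {"a", "s2"}, {"b", "s3"}, {"a", "s4"}};
        System.out.println(group(sentences));
        // prints "[[s1, s2], [s3], [s4]]"
    }
}
```

Note that in this sketch a later sentence with an earlier origin name ("a" after "b") opens a new document rather than rejoining the first one, since only consecutive sentences are compared.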

Outputs

Outputs	POS DocumentMetaData Lemma Sentence Token Constituent

PDF

Pdf

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.pdf-asl

PdfReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.pdf.PdfReader

Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.

Parameters

endPage (Integer) = -1 [optional]: The last page to be extracted from the PDF.
headingType (String) = <built-in> [optional]: The type used to annotate headings.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
paragraphType (String) = <built-in> [optional]: The type used to annotate paragraphs.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
startPage (Integer) = -1 [optional]: The first page to be extracted from the PDF.
substitutionTableLocation (String) = <built-in> [optional]: The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters.
useDefaultExcludes (Boolean) = true: Use the default excludes.
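The kind of post-processing a substitution table performs, such as converting ligatures to separate characters, can be sketched as below. The in-memory table with two ligature entries is a stand-in chosen for illustration; the format and content of the built-in table are not shown here. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LigatureSubstitutionSketch {
    // Hypothetical two-entry table: Unicode ligature code points and their expansions.
    static final Map<String, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("\uFB01", "fi"); // LATIN SMALL LIGATURE FI
        TABLE.put("\uFB02", "fl"); // LATIN SMALL LIGATURE FL
    }

    /** Applies every substitution in the table to the extracted text. */
    static String substitute(String text) {
        for (Map.Entry<String, String> entry : TABLE.entrySet()) {
            text = text.replace(entry.getKey(), entry.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(substitute("\uFB01nal \uFB02ow"));
        // prints "final flow"
    }
}
```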

Outputs

Outputs	DocumentMetaData

Penn Treebank Format

PennTreebankChunked

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

PennTreebankChunkedReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankChunkedReader

Penn Treebank chunked format reader.

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readChunk (Boolean) = true: Write chunk annotations to the CAS.
readPOS (Boolean) = true: Write part-of-speech annotations to the CAS.
readSentence (Boolean) = true: Write sentence annotations to the CAS.
readToken (Boolean) = true: Write token annotations to the CAS.
sourceEncoding (String) = UTF-8: Character encoding of the input data.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	POS DocumentMetaData Sentence Token Chunk

PennTreebankCombined

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

Known corpora in this format:

Floresta Sintá(c)tica (Bosque) - Portuguese

PennTreebankCombinedReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedReader

Penn Treebank combined format reader.
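As a rough illustration of the input, the sketch below assumes that the combined format consists of bracketed trees whose leaves have the form "(POS word)", and extracts those leaves with a regular expression. It is not the reader's parsing logic and does not build constituents. The class name and sample tree are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PennLeafSketch {
    // A leaf is an innermost bracket holding exactly a tag and a word.
    private static final Pattern LEAF = Pattern.compile("\\(([^()\\s]+)\\s+([^()\\s]+)\\)");

    /** Returns the leaves of a bracketed tree as "word/POS" strings. */
    static List<String> leaves(String tree) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LEAF.matcher(tree);
        while (matcher.find()) {
            result.add(matcher.group(2) + "/" + matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(leaves("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))"));
        // prints "[The/DT, cat/NN, sleeps/VBZ]"
    }
}
```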

Parameters

ConstituentMappingLocation (String) [optional]: Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.
ConstituentTagSet (String) [optional]: Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
POSMappingLocation (String) [optional]: Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
internTags (Boolean) = true [optional]: Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.
Default: true
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readPOS (Boolean) = true: Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.
Default: true
removeTraces (Boolean) = true [optional]
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
writeTracesToText (Boolean) = false [optional]
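The motivation for internTags can be illustrated with a small sketch that counts how many distinct String objects remain when the same tag is created many times, with and without String#intern(). The class and method names are hypothetical; the sketch only demonstrates the standard behaviour of `String.intern()` and does not involve the reader.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class InternSketch {
    /** Creates the tag "NN" several times and counts the distinct object instances. */
    static int distinctInstances(int copies, boolean intern) {
        // An identity-based set distinguishes objects by reference, not by equals().
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        for (int i = 0; i < copies; i++) {
            String tag = new String("NN"); // deliberately a fresh object each time
            seen.add(intern ? tag.intern() : tag);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("without intern: " + distinctInstances(1000, false)); // 1000
        System.out.println("with intern:    " + distinctInstances(1000, true));  // 1
    }
}
```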

Outputs

Outputs	POS DocumentMetaData Sentence Token Constituent

PennTreebankCombinedWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedWriter

Penn Treebank combined format writer.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
emptyRootLabel (Boolean) = false
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .penn: Specify the suffix of output files. Default value .penn. If the suffix is not needed, provide an empty string as value.
noRootLabel (Boolean) = false
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
sourceEncoding (String) = UTF-8: Name of configuration parameter that contains the character encoding used by the input files.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.

Inputs

Inputs	POS DocumentMetaData Sentence Token Constituent

Reuters-21578

Reuters21578Sgml

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578SgmlReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578SgmlReader

Read a Reuters-21578 corpus in SGML format.

Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.

Parameters

sourceLocation (String): The directory that contains the Reuters-21578 SGML files.

Reuters21578Txt

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578TxtReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578TxtReader

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

Parameters

sourceLocation (String): The directory that contains the Reuters-21578 text files, named according to the pattern #FILE_PATTERN.

RTF

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.rtf-asl

RTFReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.rtf.RTFReader

Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Solr

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.solr-asl

SolrWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.solr.SolrWriter

A simple implementation of SolrWriter_ImplBase

Parameters

optimizeIndex (Boolean) = false: If set to true, the index is optimized once all documents are uploaded. Default is false.
queueSize (Integer) = 10000: The buffer size before the documents are sent to the server (default: 10000).
solrIdField (String) = id: The name of the id field in the Solr schema (default: "id").
targetLocation (String): Solr server URL string in the form <protocol>://<host>:<port>/<path>, e.g. http://localhost:8983/solr/collection1.
textField (String) = text: The name of the text field in the Solr schema (default: "text").
threads (Integer) = 1: The number of background threads used to empty the queue. Default: 1.
update (Boolean) = true: Define whether existing documents with the same ID are updated (true) or overwritten (false). Default: true (update).
waitFlush (Boolean) = true: When committing to the index, i.e. when all documents are processed, block until index changes are flushed to disk? Default: true.
waitSearcher (Boolean) = true: When committing to the index, i.e. when all documents are processed, block until a new searcher is opened and registered as the main query searcher, making the changes visible? Default: true.

TCF

Tcf

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.tcf-asl

The TCF (Text Corpus Format) was created in the context of the CLARIN project. It is mainly used to exchange data between the different web-services that are part of the WebLicht platform.

TcfReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfReader

Reader for the WebLicht TCF format. It reads all available annotation layers from the TCF file and converts them to CAS annotations. TCF data does not carry begin/end offsets for all of its annotations, which CAS annotations require. Hence, offsets are calculated per token and stored in a map (token_id, token (CAS object)), from which the offset can later be obtained via the token.
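The offset bookkeeping described above can be sketched as follows: token offsets are computed while the tokens are laid out in sequence and are stored in a map keyed by token ID for later lookup. The sketch assumes that tokens are joined by a single space, which is a simplification, and it stores plain begin/end pairs instead of CAS Token objects. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TokenOffsetSketch {
    /** Maps each token ID to its {begin, end} character offsets in the joined text. */
    static Map<String, int[]> offsets(String[] ids, String[] tokens) {
        Map<String, int[]> map = new LinkedHashMap<>();
        int position = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                position++; // assumed single-space separator between tokens
            }
            int begin = position;
            position += tokens[i].length();
            map.put(ids[i], new int[] {begin, position});
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, int[]> map = offsets(new String[] {"t1", "t2"}, new String[] {"Hello", "world"});
        int[] span = map.get("t2");
        System.out.println("t2: " + span[0] + "-" + span[1]);
        // prints "t2: 6-11"
    }
}
```

Annotations on other layers that refer to token IDs can then look up their begin and end offsets through this map.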

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
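As a rough illustration of the patterns parameter, the following plain-Java sketch separates include from exclude patterns by their prefix. It assumes the literal prefixes [+] and [-] shown above. The class is hypothetical, and rejecting patterns without a prefix is a simplification of this sketch, not documented reader behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of DKPro Core.
final class PatternPrefixSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    // Returns all patterns that carry the given prefix, with the prefix removed.
    static List<String> select(String[] patterns, String prefix) {
        List<String> result = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.startsWith(prefix)) {
                result.add(pattern.substring(prefix.length()));
            } else if (!pattern.startsWith(INCLUDE_PREFIX) && !pattern.startsWith(EXCLUDE_PREFIX)) {
                // Assumption of this sketch only: unprefixed patterns are rejected.
                throw new IllegalArgumentException("Pattern without prefix: " + pattern);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]data/**/*.tcf", "[-]data/**/draft-*.tcf" };
        System.out.println("include: " + select(patterns, INCLUDE_PREFIX));
        System.out.println("exclude: " + select(patterns, EXCLUDE_PREFIX));
    }
}
```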

Outputs

Outputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency


TcfWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfWriter

Writer for the WebLicht TCF format.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .tcf: Specify the suffix of output files. Default value .tcf. If the suffix is not needed, provide an empty string as value.
merge (Boolean) = true: Merge with source TCF file if one is available. Default: true
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
preserveIfEmpty (Boolean) = false: If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF. Default: false
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.
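To make the effect of escapeDocumentId more concrete, here is a minimal sketch of URL-encoding a document ID before using it as a file name. It uses java.net.URLEncoder from the Java standard library. Whether the writer uses exactly this routine is an assumption, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration; not part of DKPro Core.
final class EscapeDocumentIdSketch {

    // URL-encodes the document ID and appends the file name suffix.
    static String toFileName(String documentId, String suffix) {
        try {
            return URLEncoder.encode(documentId, "UTF-8") + suffix;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Backslash and colon are replaced by percent escapes.
        System.out.println(toFileName("corpus\\doc:1", ".tcf"));
    }
}
```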

Inputs

Inputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency


TEI

Tei

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.tei-asl

Known corpora in this format:

TeiReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.tei.TeiReader

Reader for the TEI XML format.

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
omitIgnorableWhitespace (Boolean) = false: Do not write ignorable whitespace from the XML file to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readConstituent (Boolean) = true: Write constituent annotations to the CAS.
readLemma (Boolean) = true: Write lemma annotations to the CAS.
readNamedEntity (Boolean) = true: Write named entity annotations to the CAS.
readPOS (Boolean) = true: Write part-of-speech annotations to the CAS.
readParagraph (Boolean) = true: Write paragraph annotations to the CAS.
readSentence (Boolean) = true: Write sentence annotations to the CAS.
readToken (Boolean) = true: Write token annotations to the CAS.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
useFilenameId (Boolean) = false: When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case.
useXmlId (Boolean) = false: Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique.
utterancesAsSentences (Boolean) = false: Interpret utterances "u" as sentences "s". (EXPERIMENTAL)

Outputs

Outputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent


TeiWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.tei.TeiWriter

UIMA CAS consumer writing the CAS document text in TEI format.

Parameters

cTextPattern (String) = [,.:;()]|(``)|('')|(--): A token matching this pattern is rendered as a TEI "c" element instead of a "w" element.
compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .xml: Specify the suffix of output files. Default value .xml. If the suffix is not needed, provide an empty string as value.
indent (Boolean) = false: Indent the XML.
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.
writeConstituent (Boolean) = false: Write constituent annotations to the CAS. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens).
writeNamedEntity (Boolean) = true: Write named entity annotations to the CAS. Overlapping named entities are not supported.
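The default cTextPattern above can be tried out with java.util.regex. The sketch below decides between a "c" and a "w" element for a token. It assumes that the whole token must match the pattern, which the parameter description does not state explicitly, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of DKPro Core.
final class CTextPatternSketch {
    // Default value of the cTextPattern parameter.
    static final Pattern C_TEXT = Pattern.compile("[,.:;()]|(``)|('')|(--)");

    // Returns the TEI element name that would be used for the token.
    static String elementFor(String token) {
        return C_TEXT.matcher(token).matches() ? "c" : "w";
    }

    public static void main(String[] args) {
        for (String token : new String[] { "Hello", ",", "world", "--", "." }) {
            System.out.println(token + " -> " + elementFor(token));
        }
    }
}
```

Characters that are not covered by the default pattern, such as an exclamation mark, end up as "w" elements in this sketch unless a custom pattern is configured.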

Inputs

Inputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent


Text

String

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.text-asl

StringReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.text.StringReader

Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().

Parameters

collectionId (String) = COLLECTION_ID: The collection ID to set in the DocumentMetaData.
documentBaseUri (String) [optional]: The document base URI to set in the DocumentMetaData.
documentId (String) = DOCUMENT_ID: The document ID to set in the DocumentMetaData.
documentText (String): The document text.
documentUri (String) = STRING: The document URI to set in the DocumentMetaData.
language (String): Set this as the language of the produced documents.

Outputs

Outputs	DocumentMetaData


Text

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.text-asl

TextReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.text.TextReader

UIMA collection reader for plain text files.

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceEncoding (String) = UTF-8: Character encoding used by the input files.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	DocumentMetaData


TextWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.text.TextWriter

UIMA CAS consumer writing the CAS document text as a plain text file.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .txt: Specify the suffix of output files. Default value .txt. If the suffix is not needed, provide an empty string as value.
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.

Inputs

Inputs	DocumentMetaData


TokenizedText

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.text-asl

TokenizedTextWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.text.TokenizedTextWriter

This class writes a set of pre-processed documents into a large text file containing one sentence per line with tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written as specified by #PARAM_FEATURE_PATH.

Parameters

compression (String) = NONE [optional]

Choose a compression method. (default: CompressionMethod#NONE)

escapeDocumentId (Boolean) = true

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

featurePath (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma/value for lemmas. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token (i.e. token texts).

In order to specify a different annotation use the annotation class' type name (e.g. Token.class.getTypeName()) and optionally append a field, e.g. /value to specify the feature path. If you do not specify a field, the covered text is used.

numberRegex (String) [optional]

All tokens that match this regex are replaced by NUM. Examples:

^[0-9]+$
^[0-9,\.]+$
^[0-9]+(\.[0-9]*)?$

Make sure that these regular expressions fit the segmentation, e.g. if you work on tokens, your tokenizer might split prefixes such as + and - from the rest of the number.
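The three example expressions differ in which tokens they treat as numbers. The following sketch, using only java.util.regex, shows the replacement by NUM for a few sample tokens. The class and method names are hypothetical, and the sketch assumes the whole token is matched against the expression.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of DKPro Core.
final class NumberRegexSketch {

    // Replaces the token by NUM if it matches the given expression.
    static String normalize(String token, Pattern numberRegex) {
        return numberRegex.matcher(token).matches() ? "NUM" : token;
    }

    public static void main(String[] args) {
        String[] regexes = { "^[0-9]+$", "^[0-9,\\.]+$", "^[0-9]+(\\.[0-9]*)?$" };
        String[] tokens = { "42", "1,000.5", "3.14", "-7" };
        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            StringBuilder line = new StringBuilder(regex).append(" :");
            for (String token : tokens) {
                line.append(' ').append(normalize(token, pattern));
            }
            System.out.println(line);
        }
    }
}
```

None of the three expressions matches "-7", which illustrates why the handling of sign prefixes by the tokenizer matters.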

overwrite (Boolean) = false

Allow overwriting target files (ignored when writing to ZIP archives).

singularTarget (Boolean) = false

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

stopwordsFile (String) [optional]

All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored.
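The stopword file rules stated above (one token per line, empty lines and lines starting with # ignored, casing ignored) can be sketched as follows. This is a plain-Java illustration with hypothetical names. Trimming lines and lower-casing with Locale.ROOT are choices of this sketch, not documented behaviour of the writer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not part of DKPro Core.
final class StopwordSketch {

    // Builds a stopword set from the lines of a stopword file.
    static Set<String> parse(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip empty lines and comments
            }
            stopwords.add(trimmed.toLowerCase(Locale.ROOT));
        }
        return stopwords;
    }

    // Replaces the token by STOP if it is a stopword, ignoring case.
    static String replace(String token, Set<String> stopwords) {
        return stopwords.contains(token.toLowerCase(Locale.ROOT)) ? "STOP" : token;
    }

    public static void main(String[] args) {
        Set<String> stopwords = parse(Arrays.asList("# English stopwords", "", "the", "of"));
        for (String token : new String[] { "The", "history", "of", "Darmstadt" }) {
            System.out.print(replace(token, stopwords) + " ");
        }
        System.out.println();
    }
}
```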

stripExtension (Boolean) = false

Remove the original extension.

targetEncoding (String) = UTF-8

Encoding for the target file. Default is UTF-8.

targetLocation (String) [optional]

Target location. If this parameter is not set, data is written to stdout.

useDocumentId (Boolean) = false

Use the document ID as file name even if relative path information is present.

TGrep2

TGrep

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl

TGrep and TGrep2 are tools to search over syntactic parse trees represented as bracketed structures. This module supports TGrep2 in particular and makes it convenient to generate TGrep2 indexes, which can then be searched. Search itself is not supported by this module.

See also:

TGrep2

TGrepWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.tgrep.TGrepWriter

TGrep2 corpus file writer. Requires PennTrees to be annotated before.

Parameters

compression (String) = NONE: Method to compress the tgrep file (only used if PARAM_WRITE_T2C is true). Only NONE, GZIP and BZIP2 are supported. Default: CompressionMethod#NONE
dropMalformedTrees (Boolean) = false: If true, silently drops malformed Penn Trees instead of throwing an exception. Default: false
targetLocation (String): Path to which the output is written.
writeComments (Boolean) = true: Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Default: true
writeT2c (Boolean) = true: Set this parameter to true if you want to encode directly into the tgrep2 binary format. Default: true

TIGER-XML

TigerXml

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.tiger-asl

The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format was later extended to also support semantic frame annotations.

Known corpora in this format:

Floresta Sintá(c)tica (Bosque) - Portuguese
Semeval-2 Task 10 - (extended format)
Składnica frazowa - Polish
Swedish Treebank - Swedish
Talbanken05 - Swedish
TIGER - German

TigerXmlReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlReader

UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
ignoreIllegalSentences (Boolean) = false: If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences. Default: false
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
readPennTree (Boolean) = false: Write Penn Treebank bracketed structure information. Mind this may not work with all tagsets, in particular not with such that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not using the mapped tagset! Default: false
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	POS Lemma Sentence Token SemanticArgument SemanticPredicate Constituent


TigerXmlWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlWriter

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameSuffix (String) = .xml: Specify the suffix of output files. Default value .xml. If the suffix is not needed, provide an empty string as value.
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.

Inputs

Inputs	POS DocumentMetaData Lemma Sentence Token Constituent


TüPP-D/Z

Tuepp

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.tuepp-asl

TüPP-D/Z is a collection of articles from the German newspaper taz (die tageszeitung) annotated and encoded in an XML format.

Known corpora in this format:

TüPP-D/Z - German

TueppReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.tuepp.TueppReader

UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.

Only the part-of-speech with the best rank (rank 1) is read. If there is a tie between multiple tags, the first one from the XML file is read.
Only the first lemma (baseform) from the XML file is read.
Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
Article boundaries are not read.
Paragraph boundaries are not read.
Lemma information is read, but morphological information is not read.
Chunk, field, and clause information is not read.
Meta data headers are not read.
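The part-of-speech selection rule from the list above (take rank 1, and on a tie the first tag in file order) can be sketched in plain Java. The RankedTag type and the method name are hypothetical, and returning null when no rank-1 tag exists is an assumption of this sketch, not documented reader behaviour.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of DKPro Core.
final class BestRankSketch {

    // A part-of-speech candidate together with its rank (1 is best).
    static final class RankedTag {
        final String tag;
        final int rank;

        RankedTag(String tag, int rank) {
            this.tag = tag;
            this.rank = rank;
        }
    }

    // Returns the first candidate with rank 1 in file order, or null if there is none.
    static String firstBestTag(List<RankedTag> candidatesInFileOrder) {
        for (RankedTag candidate : candidatesInFileOrder) {
            if (candidate.rank == 1) {
                return candidate.tag;
            }
        }
        return null; // assumption: no tag is produced without a rank-1 candidate
    }

    public static void main(String[] args) {
        List<RankedTag> candidates = Arrays.asList(
                new RankedTag("NN", 2), new RankedTag("ADJA", 1), new RankedTag("VVFIN", 1));
        System.out.println(firstBestTag(candidates));
    }
}
```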

Parameters

POSMappingLocation (String) [optional]: Location of the mapping file for part-of-speech tags to UIMA types.
POSTagSet (String) [optional]: Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.
includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceEncoding (String) = UTF-8: Character encoding of the input data.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	POS Lemma Sentence Token


UIMA Binary CAS

BinaryCas

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

The CAS is the native data model used by UIMA. There are various ways of saving CAS data, using XMI, XCAS, or binary formats. This module supports the binary formats.

See also:

Compressed Binary CASes

BinaryCasReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasReader

UIMA Binary CAS formats reader.

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
typeSystemLocation (String) [optional]: The location from which to obtain the type system when the CAS is stored in form 0.
useDefaultExcludes (Boolean) = true: Use the default excludes.

BinaryCasWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasWriter

Write CAS in one of the UIMA binary formats.

Supported formats
Format	Description	Type system on load	CAS Addresses preserved
S	CAS structures are dumped to disc as they are using Java serialization (CASSerializer ). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS.	must be the same	yes
S+	CAS structures are dumped to disc as they are using Java serialization as in form 0, but now using the CASCompleteSerializer which includes CAS metadata like type system and index repositories.	is reinitialized	yes
0	CAS structures are dumped to disc as they are using Java serialization (CASSerializer ). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS.	must be the same	yes
4	UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0.	must be the same	yes
6	UIMA binary serialization as format 4, but saving only reachable feature structures.	must be the same	no
6+	UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameExtension (String) = .bin
format (String) = 6+
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
typeSystemLocation (String) [optional]: Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser.
The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz").
If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader currently cannot read such files. Use this only if you really know what you are doing.
This parameter has no effect if formats S+ or 6+ are used as the type system information is embedded in each individual file. Otherwise, it is recommended that this parameter be set unless some other mechanism is used to initialize the CAS with the same type system and index repository during reading that was used during writing.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.
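The extension-based compression mentioned for typeSystemLocation can be pictured with a small plain-Java sketch that wraps an output stream in a GZIPOutputStream when the file name ends in ".gz". Only the ".gz" case is taken from the description above. The class and method names are hypothetical, and the case-sensitive check is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not part of DKPro Core.
final class TypeSystemStreamSketch {

    // Wraps the raw stream in gzip compression if the file name ends in ".gz".
    static OutputStream wrap(String fileName, OutputStream raw) {
        if (!fileName.endsWith(".gz")) {
            return raw;
        }
        try {
            return new GZIPOutputStream(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1000]; // stand-in for serialized type system data
        for (String name : new String[] { "typesystem.ser", "typesystem.ser.gz" }) {
            ByteArrayOutputStream target = new ByteArrayOutputStream();
            try (OutputStream out = wrap(name, target)) {
                out.write(payload);
            }
            System.out.println(name + ": " + target.size() + " bytes");
        }
    }
}
```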

Inputs

Inputs	DocumentMetaData


SerializedCas

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

SerializedCasReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasReader


Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: The language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
typeSystemLocation (String) [optional]: The file from which to obtain the type system if it is not embedded in the serialized CAS.
useDefaultExcludes (Boolean) = true: Use the default excludes.

SerializedCasWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter


Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
filenameExtension (String) = .ser
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
typeSystemLocation (String) [optional]: Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser.
The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz").
If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader currently cannot read such files. Use this only if you really know what you are doing.
useDocumentId (Boolean) = false: Use the document ID as file name even if relative path information is present.

Inputs

Inputs	DocumentMetaData


UIMA JSON

Json

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.json-asl

JsonWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.json.JsonWriter

UIMA JSON format writer.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
jsonContextFormat (String) = omitExpandedTypeNames
omitDefaultValues (Boolean) = true
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
prettyPrint (Boolean) = true
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
typeSystemFile (String) [optional]: Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file.
If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.

Inputs

Inputs	DocumentMetaData

DocumentMetaData

UIMA XMI

Xmi

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.xmi-asl

One of the official formats supported by UIMA is the XMI format. It is an XML-based format and therefore cannot represent a few very specific characters that are invalid in XML. Apart from that, it is able to capture all the information contained in the CAS. The XMI format is the de-facto standard for exchanging data in the UIMA world, and most UIMA-related tools support it.

The XMI format does not include type system information. It is therefore recommended to always configure the XmiWriter component to also write out the type system to a file.

If you wish to view annotated documents using the UIMA CAS Editor in Eclipse, you can, for example, set up your XmiWriter in the following way to write out XMIs and a type system file:

AnalysisEngineDescription xmiWriter =
  AnalysisEngineFactory.createEngineDescription(
      XmiWriter.class,
      XmiWriter.PARAM_TARGET_LOCATION, ".",
      XmiWriter.PARAM_TYPE_SYSTEM_FILE, "typesystem.xml");

XmiReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader

Reader for UIMA XMI files.

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Language of the documents in the input directory. If specified, this information will be added to the CAS.
lenient (Boolean) = false: In lenient mode, unknown types are ignored and do not cause an exception to be thrown.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
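
The include/exclude semantics of the patterns parameter can be illustrated with a small, self-contained sketch. It is not the reader's actual implementation: the class AntPatternSketch, its methods, and the simplified wildcard translation are hypothetical, and treating a pattern without prefix as an include pattern is an assumption.

```java
import java.util.regex.Pattern;

class AntPatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    // Translate a simplified Ant-like pattern into a regular expression.
    static Pattern toRegex(String antPattern) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < antPattern.length()) {
            if (antPattern.startsWith("**/", i)) {
                regex.append("(?:.*/)?"); // any number of sub-directories, including none
                i += 3;
            } else if (antPattern.startsWith("**", i)) {
                regex.append(".*");       // trailing "**": anything below this point
                i += 2;
            } else if (antPattern.charAt(i) == '*') {
                regex.append("[^/]*");    // part of a single file or directory name
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(antPattern.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // A path is accepted if at least one include pattern and no exclude pattern matches.
    static boolean accepts(String[] patterns, String relativePath) {
        boolean included = false;
        boolean excluded = false;
        for (String pattern : patterns) {
            if (pattern.startsWith(EXCLUDE_PREFIX)) {
                String body = pattern.substring(EXCLUDE_PREFIX.length());
                excluded |= toRegex(body).matcher(relativePath).matches();
            } else {
                // Assumption: a pattern without prefix counts as an include pattern.
                String body = pattern.startsWith(INCLUDE_PREFIX)
                        ? pattern.substring(INCLUDE_PREFIX.length()) : pattern;
                included |= toRegex(body).matcher(relativePath).matches();
            }
        }
        return included && !excluded;
    }
}
```

With the patterns [+]**/*.xmi and [-]**/draft/*, this sketch accepts a/b/doc.xmi and rejects a/draft/doc.xmi.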

Outputs

Outputs	DocumentMetaData

DocumentMetaData

XmiWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter

UIMA XMI format writer.

Parameters

compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
typeSystemFile (String) [optional]: Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file.
If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.
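
How useDocumentId, escapeDocumentId and stripExtension might interact when a target file name is derived can be sketched as follows. This is only an illustration under assumptions: the class TargetNameSketch and its method are hypothetical, and the writer's real naming logic may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TargetNameSketch {
    // Derive a file name below the target location (illustrative only).
    static String targetName(String relativePath, String documentId, boolean useDocumentId,
            boolean escapeDocumentId, boolean stripExtension, String extension) {
        String name;
        if (useDocumentId || relativePath == null) {
            // URL-encoding avoids characters such as ':' or '\' in file names.
            name = escapeDocumentId ? urlEncode(documentId) : documentId;
        } else {
            name = relativePath;
        }
        if (stripExtension) {
            int dot = name.lastIndexOf('.');
            // Only strip a dot inside the last path segment; ignore leading dots.
            if (dot > name.lastIndexOf('/') + 1) {
                name = name.substring(0, dot);
            }
        }
        return name + extension;
    }

    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

In this sketch, a document read from sub/doc.txt is written to sub/doc.txt.xmi by default and to sub/doc.xmi when stripExtension is enabled.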

Inputs

Inputs	DocumentMetaData

DocumentMetaData

Web1T n-grams

Web1T

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.web1t-asl

The Web1T n-gram corpus is a huge collection of n-grams collected from the internet. The jweb1t library provides efficient access to this corpus. This module provides support for the file format used by the Web1T n-gram corpus and allows jweb1t indexes to be created conveniently.

Web1TWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.web1t.Web1TWriter

Web1T n-gram index format writer.

Parameters

contextType (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence: The type being used for segments
createIndexes (Boolean) = true [optional]: Create the indexes that jWeb1T needs to operate. (default: true)
inputTypes (String[]): Types to generate n-grams from. Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams
lowercase (Boolean) = false [optional]: Create a lower case index.
maxNgramLength (Integer) = 3 [optional]: Maximum n-gram length. Default: 3
minFreq (Integer) = 1 [optional]: Specifies the minimum frequency an n-gram must have to be written to the final index. The specified value is inclusive; the default is 1. Thus, all n-grams with a frequency of at least 1 are written.
minNgramLength (Integer) = 1 [optional]: Minimum n-gram length. Default: 1
splitFileTreshold (Float) = 1.0 [optional]: The input file(s) is/are split into smaller files for quick access. A separate file is created if the first two letters of a word (or the first letter if the word is only one character long) account for at least x% of all starting letters in the input file(s). The default threshold is 1.0%. Words whose starting characters do not meet the threshold are written, together with all other words below the threshold, to a separate file for miscellaneous words. A high threshold leads to only a few but large files and most likely a very large misc. file. A low threshold results in many small files. Use zero or a negative value to write everything to one file.
targetEncoding (String) = UTF-8 [optional]: Character encoding of the output data.
targetLocation (String): Location to which the output is written.
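
The effect of splitFileTreshold can be made concrete with a small sketch of the decision it describes. The class SplitThresholdSketch and its methods are hypothetical and only mirror the description above; they are not taken from the Web1TWriter or jweb1t code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SplitThresholdSketch {
    // Grouping key: the first two characters, or the whole word if it is shorter.
    static String prefixOf(String word) {
        return word.length() < 2 ? word : word.substring(0, 2);
    }

    // Returns the prefixes that get their own file; all other words share a "misc" file.
    static Set<String> prefixesWithOwnFile(List<String> words, double thresholdPercent) {
        Set<String> own = new TreeSet<>();
        if (thresholdPercent <= 0 || words.isEmpty()) {
            return own; // zero or negative threshold: everything goes to one file
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(prefixOf(word), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double share = 100.0 * entry.getValue() / words.size();
            if (share >= thresholdPercent) {
                own.add(entry.getKey());
            }
        }
        return own;
    }
}
```

For the words the, they, then, a and zebra, a threshold of 50 gives only the prefix th its own file, while a threshold of 20 also separates a and ze.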

Inputs

Inputs	Sentence

Sentence

Wikipedia via Bliki Engine

BlikiWikipedia

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.bliki-asl

Access the online Wikipedia and extract its contents using the Bliki engine.

See also:

Java Wikipedia API (Bliki engine)

BlikiWikipediaReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.bliki.BlikiWikipediaReader

Bliki-based Wikipedia reader.

Parameters

language (String): The language of the wiki installation.
outputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
pageTitles (String[]): Which page titles should be retrieved.
sourceLocation (String): URL of the wiki API, e.g. for the English Wikipedia it should be: http://en.wikipedia.org/w/api.php

Outputs

Outputs	DocumentMetaData

DocumentMetaData

Wikipedia via JWPL

WikipediaArticle

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleReader

Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
OnlyFirstParagraph (Boolean) = false: If set to true, only the first paragraph instead of the whole article is used.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
PageIdFromArray (String[]) [optional]: Defines an array of page ids of the pages that should be retrieved. (Optional)
PageIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)
PageTitleFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)
PageTitlesFromArray (String[]) [optional]: Defines an array of page titles of the pages that should be retrieved. (Optional)
Password (String): The password of the database account.
User (String): The username of the database account.

WikipediaArticleInfo

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleInfoReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleInfoReader

Reads general information about all articles without retrieving the whole Page objects.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
Password (String): The password of the database account.
User (String): The username of the database account.

Outputs

Outputs	DocumentMetaData ArticleInfo

DocumentMetaData ArticleInfo

WikipediaDiscussion

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaDiscussionReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaDiscussionReader

Reads all discussion pages.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
PageIdFromArray (String[]) [optional]: Defines an array of page ids of the pages that should be retrieved. (Optional)
PageIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)
PageTitleFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)
PageTitlesFromArray (String[]) [optional]: Defines an array of page titles of the pages that should be retrieved. (Optional)
Password (String): The password of the database account.
User (String): The username of the database account.

Outputs

Outputs	DBConfig

DBConfig

WikipediaLink

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaLinkReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaLinkReader

Read links from Wikipedia.

Parameters

AllowedLinkTypes (String[]): Which types of links are allowed?
CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
PageIdFromArray (String[]) [optional]: Defines an array of page ids of the pages that should be retrieved. (Optional)
PageIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)
PageTitleFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)
PageTitlesFromArray (String[]) [optional]: Defines an array of page titles of the pages that should be retrieved. (Optional)
Password (String): The password of the database account.
User (String): The username of the database account.

Outputs

Outputs	DBConfig WikipediaLink

DBConfig WikipediaLink

WikipediaPage

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaPageReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaPageReader

Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are not considered.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
OnlyFirstParagraph (Boolean) = false: If set to true, only the first paragraph instead of the whole article is used.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
PageIdFromArray (String[]) [optional]: Defines an array of page ids of the pages that should be retrieved. (Optional)
PageIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)
PageTitleFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)
PageTitlesFromArray (String[]) [optional]: Defines an array of page titles of the pages that should be retrieved. (Optional)
Password (String): The password of the database account.
User (String): The username of the database account.

Outputs

Outputs	DBConfig

DBConfig

WikipediaQuery

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaQueryReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaQueryReader

Reads all article pages that match a query created by the numerous parameters of this class.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
MaxCategories (Integer) = -1 [optional]: Maximum number of categories. Articles with a higher number of categories will not be returned by the query.
MaxInlinks (Integer) = -1 [optional]: Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query.
MaxOutlinks (Integer) = -1 [optional]: Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query.
MaxRedirects (Integer) = -1 [optional]: Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query.
MaxTokens (Integer) = -1 [optional]: Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query.
MinCategories (Integer) = -1 [optional]: Minimum number of categories. Articles with a lower number of categories will not be returned by the query.
MinInlinks (Integer) = -1 [optional]: Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query.
MinOutlinks (Integer) = -1 [optional]: Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query.
MinRedirects (Integer) = -1 [optional]: Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query.
MinTokens (Integer) = -1 [optional]: Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query.
OnlyFirstParagraph (Boolean) = false: If set to true, only the first paragraph instead of the whole article is used.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
PageIdFromArray (String[]) [optional]: Defines an array of page ids of the pages that should be retrieved. (Optional)
PageIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)
PageTitleFromFile (String) [optional]: Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)
PageTitlesFromArray (String[]) [optional]: Defines an array of page titles of the pages that should be retrieved. (Optional)
Password (String): The password of the database account.
TitlePattern (String) = `` [optional]: SQL-style title pattern. Only articles that match the pattern will be returned by the query.
User (String): The username of the database account.
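
The Min/Max bounds and the SQL-style TitlePattern can be read as simple predicates. The sketch below is illustrative only: the class QueryFilterSketch is hypothetical, and it assumes that a negative bound (the default -1) and an empty title pattern mean "no restriction". In SQL LIKE syntax, % matches any sequence of characters and _ matches exactly one character; case sensitivity depends on the database and is not modelled here.

```java
import java.util.regex.Pattern;

class QueryFilterSketch {
    // Assumption: a negative bound (the default -1) means "no restriction".
    static boolean inRange(int value, int min, int max) {
        return (min < 0 || value >= min) && (max < 0 || value <= max);
    }

    // Match a title against an SQL LIKE pattern by translating it to a regular expression.
    static boolean titleMatches(String title, String likePattern) {
        if (likePattern.isEmpty()) {
            return true; // assumption: the empty default places no restriction on titles
        }
        StringBuilder regex = new StringBuilder();
        for (char c : likePattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*"); // any sequence of characters
            } else if (c == '_') {
                regex.append('.');  // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(title).matches();
    }
}
```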

WikipediaRevision

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionReader

Reads Wikipedia page revisions.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
Password (String): The password of the database account.
RevisionIdFromArray (String[]) [optional]: Defines an array of revision ids of the revisions that should be retrieved. (Optional)
RevisionIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)
User (String): The username of the database account.

Outputs

Outputs	DocumentMetaData DBConfig WikipediaRevision

DocumentMetaData DBConfig WikipediaRevision

WikipediaRevisionPair

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionPairReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionPairReader

Reads pairs of adjacent revisions of all articles.

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
Host (String): The host server.
Language (String): The language of the Wikipedia that should be connected to.
MaxChange (Integer) = 10000: Restrict revision pairs to cases where the lengths of the revisions do not differ by more than this value (counted in characters).
MinChange (Integer) = 0: Restrict revision pairs to cases where the lengths of the revisions differ by more than this value (counted in characters).
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
Password (String): The password of the database account.
RevisionIdFromArray (String[]) [optional]: Defines an array of revision ids of the revisions that should be retrieved. (Optional)
RevisionIdsFromFile (String) [optional]: Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)
SkipFirstNPairs (Integer) [optional]: The number of revision pairs that should be skipped in the beginning.
User (String): The username of the database account.
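
Read literally, MinChange, MaxChange and SkipFirstNPairs select adjacent revision pairs by their difference in length. The following sketch illustrates that reading only. The class RevisionPairSketch is hypothetical, and whether the reader counts skipped pairs before or after filtering is an assumption (here: before).

```java
import java.util.ArrayList;
import java.util.List;

class RevisionPairSketch {
    // Returns index pairs (i, i + 1) of adjacent revisions whose length difference
    // is greater than minChange and at most maxChange.
    static List<int[]> selectPairs(List<String> revisions, int minChange, int maxChange,
            int skipFirstNPairs) {
        List<int[]> pairs = new ArrayList<>();
        int seen = 0;
        for (int i = 0; i + 1 < revisions.size(); i++) {
            if (seen++ < skipFirstNPairs) {
                continue; // assumption: pairs are skipped before the length filter applies
            }
            int diff = Math.abs(revisions.get(i + 1).length() - revisions.get(i).length());
            if (diff > minChange && diff <= maxChange) {
                pairs.add(new int[] { i, i + 1 });
            }
        }
        return pairs;
    }
}
```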

Outputs

Outputs	DocumentMetaData DBConfig

DocumentMetaData DBConfig

WikipediaTemplateFilteredArticle

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaTemplateFilteredArticleReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaTemplateFilteredArticleReader

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

It is possible to define only a whitelist OR only a blacklist. If both a whitelist and a blacklist are provided, the articles chosen are those that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").

This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.

NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase
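
The combination of whitelist, blacklist and exact or prefix matching described above can be sketched as a simple predicate over the templates of one page. The class TemplateFilterSketch is hypothetical and does not reflect how the reader queries the JWPL template tables. It assumes that a page satisfies the whitelist if it contains at least one whitelisted template.

```java
import java.util.List;
import java.util.Set;

class TemplateFilterSketch {
    // Exact matching compares whole names; otherwise a template matches by prefix.
    static boolean containsAny(Set<String> pageTemplates, List<String> filter, boolean exact) {
        for (String template : pageTemplates) {
            for (String f : filter) {
                if (exact ? template.equals(f) : template.startsWith(f)) {
                    return true;
                }
            }
        }
        return false;
    }

    // An empty list places no restriction (only a whitelist OR only a blacklist is allowed).
    static boolean accept(Set<String> pageTemplates, List<String> whitelist,
            List<String> blacklist, boolean exact) {
        boolean whitelistOk = whitelist.isEmpty() || containsAny(pageTemplates, whitelist, exact);
        boolean blacklistOk = blacklist.isEmpty() || !containsAny(pageTemplates, blacklist, exact);
        return whitelistOk && blacklistOk;
    }
}
```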

Parameters

CreateDBAnno (Boolean) = false: Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.
Database (String): The name of the database.
DoubleCheckAssociatedPages (Boolean) = false: If this option is set, discussion pages that are associated with a blacklisted article are rejected. Analogously, articles that are associated with a blacklisted discussion page are rejected.
This check is rather expensive and can take a long time. This option is not active if only a whitelist is used.
ExactTemplateMatching (Boolean) = true: Defines whether to match the templates exactly or to match all templates that start with the String given in the respective parameter list.
Host (String): The host server.
IncludeDiscussions (Boolean) = true: Whether the reader should also include talk pages.
Language (String): The language of the Wikipedia that should be connected to.
LimitNUmberOfArticlesToRead (Integer) [optional]: Defines the maximum number of articles that the reader should deliver. This avoids unnecessary filtering if only a small number of articles is needed.
OnlyFirstParagraph (Boolean) = false: If set to true, only the first paragraph instead of the whole article is used.
OutputPlainText (Boolean) = true: Whether the reader outputs plain text or wiki markup.
PageBuffer (Integer) = 1000: The page buffer size (#pages) of the page iterator.
Password (String): The password of the database account.
TemplateBlacklist (String[]) [optional]: Defines templates that the articles MUST NOT contain.
If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)
TemplateWhitelist (String[]) [optional]: Defines templates that the articles MUST contain.
If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)
User (String): The username of the database account.

Outputs

Outputs	DocumentMetaData DBConfig

DocumentMetaData DBConfig

XML

InlineXml

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.xml-asl

InlineXmlWriter

Writer class: de.tudarmstadt.ukp.dkpro.core.io.xml.InlineXmlWriter

Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.

Note this component inherits the restrictions from CasToInlineXml:

Features whose values are FeatureStructures are not represented.
Feature values which are strings longer than 64 characters are truncated.
Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ]
The Subject of analysis is presumed to be a text string.
Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters aren't valid in XML documents.
It doesn't work for annotations which are overlapping, because these cannot be represented as properly nested XML.
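
Whether a set of annotations can be written as properly nested inline XML can be checked before writing. The sketch below is a hypothetical helper, not part of InlineXmlWriter or CasToInlineXml. It takes annotation offsets as [begin, end) pairs and reports whether any two spans cross, i.e. overlap without one containing the other.

```java
class NestingCheckSketch {
    // Two spans cross if they overlap but neither contains the other.
    static boolean crosses(int begin1, int end1, int begin2, int end2) {
        boolean overlap = begin1 < end2 && begin2 < end1;
        boolean nested = (begin1 <= begin2 && end2 <= end1)
                || (begin2 <= begin1 && end1 <= end2);
        return overlap && !nested;
    }

    // Each span is given as { begin, end } with an exclusive end offset.
    static boolean properlyNested(int[][] spans) {
        for (int i = 0; i < spans.length; i++) {
            for (int j = i + 1; j < spans.length; j++) {
                if (crosses(spans[i][0], spans[i][1], spans[j][0], spans[j][1])) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A sentence spanning 0 to 10 with tokens at 2 to 5 and 6 to 10 nests properly, whereas spans 0 to 5 and 3 to 8 cross and cannot be represented as nested elements.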

Parameters

Xslt (String) [optional]: XSLT stylesheet to apply.
compression (String) = NONE [optional]: Choose a compression method. (default: CompressionMethod#NONE)
escapeDocumentId (Boolean) = true: URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)
overwrite (Boolean) = false: Allow overwriting target files (ignored when writing to ZIP archives).
singularTarget (Boolean) = false: Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.
stripExtension (Boolean) = false: Remove the original extension.
targetLocation (String) [optional]: Target location. If this parameter is not set, data is written to stdout.
useDocumentId (Boolean) = false: Use the document ID as file name even if a relative path information is present.

Inputs

Inputs	DocumentMetaData

DocumentMetaData

Xml

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.xml.XmlReader

Reader for XML files.

Parameters

DocIdTag (String) [optional]: Tag which contains the docId.
ExcludeTag (String[]) = []: Optional. Tags that should not be processed. No text is extracted from them and no annotations are produced for them.
IncludeTag (String[]) = []: Optional. Tags that should be processed. If empty, all tags except the ExcludeTags are processed.
collectionId (String) [optional]: The collection ID to set in the DocumentMetaData.
language (String) [optional]: Set this as the language of the produced documents.
sourceLocation (String): Location from which the input is read.

Outputs

Outputs	DocumentMetaData Field

DocumentMetaData Field

XmlText

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlTextReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.xml.XmlTextReader

Parameters

includeHidden (Boolean) = false: Include hidden files and directories.
language (String) [optional]: Language of the documents in the input directory. If specified, this information will be added to the CAS.
patterns (String[]) [optional]: A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.

Outputs

Outputs	DocumentMetaData

DocumentMetaData

XmlXPath

Artifact ID: de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlXPathReader

Reader class: de.tudarmstadt.ukp.dkpro.core.io.xml.XmlXPathReader

A component reader for XML files implemented with XPath.

This is currently optimized for the TREC format, i.e. the style in which TREC topics are presented. Provide an XPath expression that selects the parent nodes. The child nodes of each parent node are then stored together in a separate CAS.

If your expression evaluates to leaf nodes, empty CASes will be created.

Parameters

caseSensitive (Boolean) = true [optional]: States whether the matching is case-sensitive. (default: true)
docIdTag (String) [optional]: Tag which contains the docId. If it is given, it will be ensured that within the same document there is only one id tag and that it is not empty.
excludeTags (String[]) = []: Tags which should be ignored. If empty then all tags will be processed.
If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.
includeTags (String[]) = []: Tags which should be worked on. If empty then all tags will be processed.
If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.
language (String) [optional]: Language of the documents. If given, it will be set in each CAS.
patterns (String[]): A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.
rootXPath (String): Specifies the XPath expression to all nodes to be processed. Different segments will be separated via PARAM_ID_TAG, and each segment will be stored in a separate CAS.
sourceLocation (String) [optional]: Location from which the input is read.
useDefaultExcludes (Boolean) = true: Use the default excludes.
workingDir (String[]) [optional]: Specifies substitutions for tag names in the CAS.
Give the substitutions as consecutive before/after pairs. For example, to substitute "foo" with "bar", and "hey" with "ho", you can provide { "foo", "bar", "hey", "ho" }.
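
The include/exclude rule and the before/after substitution array can be illustrated with two small helpers. This is a sketch under assumptions, not the reader's code: the class TagConfigSketch and its methods are hypothetical, and tag comparison is shown as case-sensitive only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TagConfigSketch {
    // A tag is processed if it is in includeTags (or includeTags is empty)
    // and it is not in excludeTags, i.e. the set includeTags - excludeTags.
    static boolean isProcessed(String tag, Set<String> includeTags, Set<String> excludeTags) {
        boolean included = includeTags.isEmpty() || includeTags.contains(tag);
        return included && !excludeTags.contains(tag);
    }

    // Turns { "foo", "bar", "hey", "ho" } into the mapping foo -> bar, hey -> ho.
    static Map<String, String> toSubstitutions(String[] pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Substitutions must be given as before/after pairs");
        }
        Map<String, String> substitutions = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            substitutions.put(pairs[i], pairs[i + 1]);
        }
        return substitutions;
    }
}
```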

Outputs

DocumentMetaData Field
