This document provides detailed information about the input and output formats supported by DKPro Core.
ACL Anthology
AclAnthology
AclAnthologyReader
Reads the ACL anthology corpus and outputs CASes with plain text documents.
The reader tries to strip out hyphenation and replace problematic characters to produce a cleaned text. Otherwise, it is a plain text reader.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/plain |
---|---|
Outputs |
AnCora
Ancora
AncoraReader
Read AnCora XML format.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
dropSentencesMissingPosTags |
Whether to ignore sentences in which any POS tags are missing. Normally, it is assumed that if any POS tags are present, then every token has a POS tag. Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readLemma |
Write lemma annotations to the CAS. Type: Boolean — Default value: |
readPOS |
Write part-of-speech annotations to the CAS. Type: Boolean — Default value: |
readSentence |
Write sentence annotations to the CAS. Type: Boolean — Default value: |
readToken |
Write token annotations to the CAS. Type: Boolean — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
splitMultiWordTokens |
Whether to split words containing underscores into multiple tokens. Type: Boolean — Default value: |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.ancora+xml
application/xml |
---|---|
Outputs |
brat file format
Brat
BratReader
Reader for the brat format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
relationTypeMappings |
Mapping of brat relation annotations to UIMA types, e.g. :
Optional — Type: String[] |
relationTypes |
Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g.
Additionally, a subcategorization feature may be specified. Type: String[] — Default value: |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
textAnnotationTypeMappings |
Mapping of brat text annotations (entities or events) to UIMA types, e.g. :
Optional — Type: String[] |
textAnnotationTypes |
Using this parameter is only necessary to specify a subcategorization feature for text and event annotation types. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Type: String[] — Default value: |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.brat |
---|---|
Outputs |
none specified |
BratWriter
Writer for the brat annotation format.
Known issues:
- Brat is unable to read relation attributes created by this writer.
- PARAM_TYPE_MAPPINGS not implemented yet
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
enableTypeMappings |
Enable type mappings. Type: Boolean — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
excludeTypes |
Types that will not be written to the exported file. Type: String[] — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
palette |
Colors to be used for the visual configuration that is generated for brat. Optional — Type: String[] — Default value: |
relationTypes |
Types that are relations. It is mandatory to provide the type name followed by two feature
names that represent Arg1 and Arg2 separated by colons, e.g.
Type: String[] — Default value: |
shortAttributeNames |
Whether to render attributes by their short name or by their qualified name. Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
spanTypes |
Types that are text annotations (aka entities or spans). Type: String[] — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
textFilenameExtension |
Specify the suffix of text output files. Default value Type: String — Default value: |
typeMappings |
FIXME Optional — Type: String[] — Default value: |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeNullAttributes |
Enable writing of features with null values. Type: Boolean — Default value: |
writeRelationAttributes |
The brat web application can currently not handle attributes on relations, thus they are disabled by default. Here they can be enabled again. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.brat |
---|---|
Inputs |
none specified |
British National Corpus
Bnc
BncReader
Reader for the British National Corpus (XML version).
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.bnc+xml |
---|---|
Outputs |
CoNLL
Conll2000
The CoNLL 2000 format represents POS and Chunk tags. Fields in a line are separated by spaces. Sentences are separated by a blank new line.
Column | Type | Description |
---|---|---|
FORM | Token | token |
POSTAG | POS | part-of-speech tag |
CHUNK | Chunk | chunk (IOB2 encoded, as in the example below) |
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
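The layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not the implementation used by Conll2000Reader: the function names `read_conll2000` and `chunk_spans` are made up for illustration, and the sample is a shortened version of the example above.

```python
def read_conll2000(text):
    """Split CoNLL 2000 text into sentences of (form, pos, chunk) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, pos, chunk = line.split()
        current.append((form, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


def chunk_spans(sentence):
    """Group the tokens of one sentence into (chunk_type, [forms]) spans."""
    spans = []
    open_type = None
    for form, _pos, tag in sentence:
        if tag == "O":
            open_type = None
            continue
        prefix, chunk_type = tag.split("-", 1)
        # A span starts at every B- tag, or when the chunk type changes.
        if prefix == "B" or chunk_type != open_type:
            spans.append((chunk_type, [form]))
        else:
            spans[-1][1].append(form)
        open_type = chunk_type
    return spans


# Shortened, illustrative sample based on the example above.
sample = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
. . O
"""

sentences = read_conll2000(sample)
spans = chunk_spans(sentences[0])
```

Because a new span is opened both on a B- tag and on a change of chunk type, the same grouping logic also tolerates files in which chunk starts are marked with I- tags.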
Corpus | Language |
---|---|
English |
|
English |
Conll2000Reader
Reads the CoNLL 2000 chunking format.
ChunkMappingLocation |
Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ChunkTagSet |
Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readChunk |
Read chunk information. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2000 |
---|---|
Outputs |
Conll2000Writer
Writes the CoNLL 2000 chunking format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeChunk |
Write chunking information. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2000 |
---|---|
Inputs |
Conll2002
The CoNLL 2002 format encodes named entity spans. Fields are separated by a single space. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
FORM | Token | Word form or punctuation symbol. |
NER | NamedEntity | named entity (IOB2 encoded) |
Wolff B-PER
, O
currently O
a O
journalist O
in O
Argentina B-LOC
, O
played O
with O
Del B-PER
Bosque I-PER
in O
the O
final O
years O
of O
the O
seventies O
in O
Real B-ORG
Madrid I-ORG
. O
For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
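To illustrate how the IOB2 tags in the NER column map to entity spans, here is a small Python sketch. It is only an illustration under the assumptions stated in the comments, not the logic of Conll2002Reader; the function name `decode_iob2` is hypothetical and the sample is a shortened version of the example above.

```python
def decode_iob2(lines):
    """Return (label, text) for each named entity in one sentence.

    Assumes each line holds exactly two fields separated by a single space.
    """
    tokens = []
    entities = []  # each entry: [label, first_token_index, last_token_index]
    for index, line in enumerate(lines):
        form, tag = line.split(" ")
        tokens.append(form)
        if tag.startswith("B-"):
            # In IOB2, every entity starts with a B- tag.
            entities.append([tag[2:], index, index])
        elif (tag.startswith("I-") and entities
              and entities[-1][2] == index - 1
              and entities[-1][0] == tag[2:]):
            # An I- tag extends the entity that ended on the previous token.
            entities[-1][2] = index
        # "O" tags (and stray I- tags) do not belong to any entity.
    return [(label, " ".join(tokens[first:last + 1]))
            for label, first, last in entities]


# Shortened, illustrative sample based on the example above.
sample = """Wolff B-PER
, O
played O
with O
Del B-PER
Bosque I-PER
in O
Real B-ORG
Madrid I-ORG
. O"""

entities = decode_iob2(sample.splitlines())
```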
Corpus | Language |
---|---|
Arabic |
|
Spanish |
|
Dutch |
Conll2002Reader
By default, reads the CoNLL 2002 named entity format.
The reader is also compatible with the CoNLL-based GermEval 2014 named entity format, in which the columns are separated by a tab, the first column holds the token number, and an extra column holds embedded named entities (see below). Additional parameters control the column separator, whether there is an additional first column for token numbers, and whether embedded named entities should be read. (Note: Currently, the reader only reads the outer named entities, not the embedded ones.)
The following snippet shows an example of the TSV format:
# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1 Aufgrund O O
2 seiner O O
3 Initiative O O
4 fand O O
5 2001/2002 O O
6 in O O
7 Stuttgart B-LOC O
8 , O O
9 Braunschweig B-LOC O
10 und O O
11 Bonn B-LOC O
12 eine O O
13 große O O
14 und O O
15 publizistisch O O
16 vielbeachtete O O
17 Troia-Ausstellung B-LOCpart O
18 statt O O
19 , O O
20 „ O O
21 Troia B-OTH B-LOC
22 - I-OTH O
23 Traum I-OTH O
24 und I-OTH O
25 Wirklichkeit I-OTH O
26 “ O O
27 . O O
- WORD_NUMBER - token number
- FORM - token
- NER1 - outer named entity (BIO encoded)
- NER2 - embedded named entity (BIO encoded)
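The column layout listed above can be parsed as in the following Python sketch. It is a rough illustration rather than the reader's actual code: the function `parse_germeval` and its keyword arguments are hypothetical names that only loosely mirror the reader parameters, the rule that a header line starts with `#` is an assumption taken from the example, and the sample rows are renumbered for brevity.

```python
def parse_germeval(text, separator="\t", has_header=True,
                   has_token_number=True, has_embedded=True):
    """Parse GermEval 2014 style TSV into a list of sentence dictionaries."""
    sentences = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line marks a sentence boundary
            continue
        if current is None:
            current = {"header": None, "tokens": []}
            sentences.append(current)
            # Assumption: a header line starts with '#'.
            if has_header and line.startswith("#"):
                current["header"] = line
                continue
        fields = line.split(separator)
        if has_token_number:
            fields = fields[1:]  # drop the WORD_NUMBER column
        form, outer = fields[0], fields[1]
        embedded = fields[2] if has_embedded else None
        current["tokens"].append((form, outer, embedded))
    return sentences


# Illustrative sample; token numbers are renumbered for this sketch.
sample = (
    "# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]\n"
    "1\tTroia\tB-OTH\tB-LOC\n"
    "2\t-\tI-OTH\tO\n"
    "3\tTraum\tI-OTH\tO\n"
)

parsed = parse_germeval(sample)
```

Keeping the outer and the embedded tag side by side in each token tuple makes it easy to ignore the embedded column, which matches the note above that only the outer named entities are currently read.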
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
columnSeparator |
Column separator parameter. Acceptable input values come from ColumnSeparators.
Optional — Type: String — Default value: |
hasEmbeddedNamedEntity |
Has embedded named entity extra column. Optional — Type: Boolean — Default value: |
hasHeader |
Indicates that there is a header line before the sentence. Optional — Type: Boolean — Default value: |
hasTokenNumber |
Token number flag. When true, the first column contains the token number inside the sentence (as in the GermEval 2014 format). Optional — Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readNamedEntity |
Read named entity information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2002
text/x.org.dkpro.germeval-2014 |
---|---|
Outputs |
Conll2002Writer
Writes the CoNLL 2002 named entity format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeNamedEntity |
Write named entity information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2002 |
---|---|
Inputs |
Conll2003
The CoNLL 2003 format encodes named entity spans and chunk spans. Fields are separated by a single space. Sentences are separated by a blank new line. Named entities and chunks are encoded in the IOB1 format, i.e. a B prefix is only used when a span immediately follows another span of the same category.
Column | Type/Feature | Description |
---|---|---|
FORM | Token | Word form or punctuation symbol. |
POSTAG | POS | part-of-speech tag |
CHUNK | Chunk | chunk (IOB1 encoded) |
NER | Named entity | named entity (IOB1 encoded) |
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
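The difference between IOB1 and the IOB2 encoding used by CoNLL 2002 is easiest to see in a conversion. The following Python sketch rewrites a sequence of IOB1 tags as IOB2 tags. The function name `iob1_to_iob2` is hypothetical and the code is an illustration, not part of DKPro Core; the tag sequences are taken from the NER and CHUNK columns of the example above.

```python
def iob1_to_iob2(tags):
    """Convert a sequence of IOB1 tags into IOB2 tags."""
    converted = []
    previous_category = None
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            previous_category = None
            continue
        prefix, category = tag.split("-", 1)
        # A span starts at an explicit B- tag, after an O tag,
        # or when the category changes.
        starts_span = prefix == "B" or category != previous_category
        converted.append(("B-" if starts_span else "I-") + category)
        previous_category = category
    return converted


# NER column and CHUNK column of the example above.
ner_iob1 = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
chunk_iob1 = ["I-NP", "I-NP", "I-NP", "I-VP", "I-PP", "I-NP", "O"]

ner_iob2 = iob1_to_iob2(ner_iob1)
chunk_iob2 = iob1_to_iob2(chunk_iob1)
```

In IOB1 input, an explicit B- tag only appears where two spans of the same category are adjacent, which is exactly the case the first condition in `starts_span` covers.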
Corpus | Language |
---|---|
Arabic |
|
Spanish |
|
Dutch |
Conll2003Reader
Reads the CoNLL 2003 format.
ChunkMappingLocation |
Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ChunkTagSet |
Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readChunk |
Read chunk information. Type: Boolean — Default value: |
readNamedEntity |
Read named entity information. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2003 |
---|---|
Outputs |
Conll2003Writer
Writes the CoNLL 2003 format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeChunk |
Write chunking information. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeNamedEntity |
Write named entity information. Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2003 |
---|---|
Inputs |
Conll2006
The CoNLL 2006 (aka CoNLL-X) format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
ID | ignored | Token counter, starting at 1 for each new sentence. |
FORM | Token | Word form or punctuation symbol. |
LEMMA | Lemma | Lemma of the word form. |
CPOSTAG | POS coarseValue | |
POSTAG | POS PosValue | Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available. |
FEATS | MorphologicalFeatures | Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar character. |
HEAD | Dependency | Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. |
DEPREL | Dependency | Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'. |
PHEAD | ignored | Projective head of current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available). |
PDEPREL | ignored | Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'. |
Heutzutage heutzutage ADV _ _ ADV _ _
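As a rough illustration of the ten-column layout described in the table, the following Python sketch turns one tab-separated line into a dictionary and derives dependency edges from the HEAD and DEPREL columns. The function names and the two sample rows are hypothetical and only serve this sketch; they are not taken from a corpus or from the Conll2006Reader implementation.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_conllx_line(line):
    """Map one tab-separated CoNLL-X line onto the ten column names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    row["ID"] = int(row["ID"])
    # An underscore marks a value that is not available.
    row["HEAD"] = int(row["HEAD"]) if row["HEAD"] != "_" else None
    row["FEATS"] = [] if row["FEATS"] == "_" else row["FEATS"].split("|")
    return row


def dependency_edges(rows):
    """Return (head_form, relation, dependent_form) for every attached token."""
    forms = {row["ID"]: row["FORM"] for row in rows}
    edges = []
    for row in rows:
        if row["HEAD"] is None:
            continue
        # A head of 0 points to the artificial root of the sentence.
        head_form = forms.get(row["HEAD"], "ROOT")
        edges.append((head_form, row["DEPREL"], row["FORM"]))
    return edges


# Hypothetical two-token sentence, invented for this sketch.
sample = [
    "1\tShe\tshe\tPRON\tPRP\tcase=nom|num=sg\t2\tSBJ\t_\t_",
    "2\tsleeps\tsleep\tVERB\tVBZ\t_\t0\tROOT\t_\t_",
]

rows = [parse_conllx_line(line) for line in sample]
edges = dependency_edges(rows)
```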
Corpus | Language |
---|---|
Danish |
|
FinnTreeBank (in recent versions with additional pseudo-XML metadata) |
Finnish |
Portuguese |
|
French |
|
Croatian |
|
Polish |
|
Slovene |
|
Swedish |
|
Swedish |
|
Persian (Farsi) |
|
Norwegian |
|
Spanish |
|
Italian |
Conll2006Reader
Reads a file in the CoNLL-2006 format (aka CoNLL-X).
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with [+] if it is an include pattern and with [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readCPOS |
Read coarse-grained part-of-speech information. Type: Boolean — Default value: |
readDependency |
Read syntactic dependency information. Type: Boolean — Default value: |
readLemma |
Read lemma information. Type: Boolean — Default value: |
readMorph |
Read morphological features. Type: Boolean — Default value: |
readPOS |
Read fine-grained part-of-speech information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useCPosAsPos |
Enable to use CPOS (column 4) as the part-of-speech tag. Otherwise the POS (column 5) is used. Type: Boolean — Default value: |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2006 |
---|---|
Outputs |
Conll2006Writer
Writes a file in the CoNLL-2006 format (aka CoNLL-X).
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeCPOS |
Write coarse-grained part-of-speech information. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeDependency |
Write syntactic dependency information. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Type: Boolean — Default value: |
writeMorph |
Write morphological features. Type: Boolean — Default value: |
writePOS |
Write fine-grained part-of-speech information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2006 |
---|---|
Inputs |
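The file-naming parameters above (`escapeDocumentId`, `stripExtension`, `filenameExtension`) can be illustrated with a small Python sketch. This is not DKPro Core code: the function `target_file_name`, the sample document ID, and the `.conll` extension are hypothetical, and the exact escaping rules of the Java implementation may differ from `urllib.parse.quote`.

```python
import posixpath
from urllib.parse import quote


def target_file_name(document_id, extension, escape=True, strip_extension=False):
    """Derive an output file name from a document ID (illustrative only)."""
    name = document_id
    if strip_extension:
        # Drop the original extension, e.g. ".txt"
        name = posixpath.splitext(name)[0]
    if escape:
        # URL-encode every reserved character, including ':' and '\'
        name = quote(name, safe="")
    return name + extension


# Hypothetical document ID containing characters that are illegal in file names
print(target_file_name("news:2009\\doc1.txt", ".conll", strip_extension=True))
```

Escaping is applied after the extension is stripped so that the dot separating the extension is still recognisable.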
Conll2008
The CoNLL 2008 format targets syntactic and semantic dependencies. Columns are tab-separated. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
ID |
ignored |
Token counter, starting at 1 for each new sentence. |
FORM |
Token |
Word form or punctuation symbol. |
LEMMA |
Lemma |
Lemma of the word form. |
GPOS |
POS PosValue |
Gold fine-grained part-of-speech tag, where the tagset depends on the language. |
PPOS |
ignored |
Automatically predicted major POS by a language-specific tagger. |
SPLIT_FORM |
ignored |
Tokens split at hyphens and slashes. |
SPLIT_LEMMA |
ignored |
Predicted lemma of SPLIT_FORM. |
PPOSS |
ignored |
Predicted POS tags of the split forms. |
HEAD |
Dependency |
Head of the current token, which is either a value of ID or zero (0). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. |
DEPREL |
Dependency |
Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply ROOT. |
PRED |
SemPred |
(sense) identifier of a semantic 'predicate' coming from a current token. |
APREDs |
SemArg |
Columns with argument labels for each semantic predicate (in the ID order). |
1 Some some DT _ Some some DT 10 SBJ _ _ _ _ A1 _ _ _
2 of of IN _ of of IN 1 NMOD _ _ _ _ _ _ _ _
3 the the DT _ the the DT 5 NMOD _ _ _ _ _ _ _ _
4 strongest strongest JJS _ strongest strong JJS 5 NMOD _ _ _ _ _ _ _ _
5 critics critics NNS _ critics critic NNS 2 PMOD critic.01 A0 _ _ _ _ _ _
6 of of IN _ of of IN 5 NMOD _ A1 _ _ _ _ _ _
7 our our PRP$ _ our our PRP$ 9 NMOD _ _ A1 A0 _ _ _ _
8 welfare welfare NN _ welfare welfare NN 9 NMOD welfare.01 _ A2 _ _ _ _ _
9 system system NN _ system system NN 6 PMOD system.01 _ _ _ _ _ _ _
10 are are VBP _ are be VBP 0 ROOT be.01 _ _ _ _ _ _ _
11 the the DT _ the the DT 12 NMOD _ _ _ _ _ _ _ _
12 people people NNS _ people people NNS 10 PRD person.02 _ _ _ A2 A0 A0 A1
13 who who WP _ who who WP 14 SBJ _ _ _ _ _ _ _ _
14 have have VBP _ have have VBP 12 NMOD have.04 _ _ _ _ SU _ _
15 become become VBN _ become become VBN 14 VC become.01 _ _ _ _ A1 A1 _
16 dependent dependent JJ _ dependent dependent JJ 15 PRD _ _ _ _ _ _ _ _
17 on on IN _ on on IN 16 AMOD _ _ _ _ _ _ _ _
18 it it PRP _ it it PRP 17 PMOD _ _ _ _ _ _ _ _
19 . . . _ . . . 10 P _ _ _ _ _ _ _ _
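The column layout described above can be read with a few lines of code. The following Python sketch is an illustration only and is independent of the DKPro Core reader; the names `CONLL2008_COLUMNS` and `parse_conll2008` are hypothetical, and the two sample rows are taken from the example above with the spaces converted back to tabs.

```python
# Fixed columns of the CoNLL-2008 format, in file order.
CONLL2008_COLUMNS = ["ID", "FORM", "LEMMA", "GPOS", "PPOS", "SPLIT_FORM",
                     "SPLIT_LEMMA", "PPOSS", "HEAD", "DEPREL", "PRED"]


def parse_conll2008(text):
    """Split CoNLL-2008 text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        token = dict(zip(CONLL2008_COLUMNS, columns))
        # Remaining columns hold one argument label per predicate.
        token["APREDS"] = columns[len(CONLL2008_COLUMNS):]
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


SAMPLE_2008 = "\n".join("\t".join(row.split()) for row in [
    "5 critics critics NNS _ critics critic NNS 2 PMOD critic.01 A0 _ _ _ _ _ _",
    "6 of of IN _ of of IN 5 NMOD _ A1 _ _ _ _ _ _",
])

for token in parse_conll2008(SAMPLE_2008)[0]:
    print(token["FORM"], token["HEAD"], token["DEPREL"], token["PRED"])
```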
Corpus | Language |
---|---|
English |
Conll2008Reader
Reads a file in the CoNLL-2008 format.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readDependency |
Read syntactic dependency information. Type: Boolean — Default value: |
readLemma |
Read lemma information. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
readSemPred |
Read semantic predicate information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2008 |
---|---|
Outputs |
Conll2008Writer
Writes a file in the CoNLL-2008 format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeDependency |
Write syntactic dependency information. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Type: Boolean — Default value: |
writeMorph |
Write morphological features. Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Type: Boolean — Default value: |
writeSemanticPredicate |
Write semantic predicate information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2008 |
---|---|
Inputs |
Conll2009
The CoNLL 2009 format targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
ID |
ignored |
Token counter, starting at 1 for each new sentence. |
FORM |
Token |
Word form or punctuation symbol. |
LEMMA |
Lemma |
Lemma of the word form. |
PLEMMA |
ignored |
Automatically predicted lemma of FORM. |
POS |
POS PosValue |
Fine-grained part-of-speech tag, where the tagset depends on the language. |
PPOS |
ignored |
Automatically predicted major POS by a language-specific tagger. |
FEATS |
MorphologicalFeatures |
Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available. |
PFEAT |
ignored |
Automatically predicted morphological features (if applicable). |
HEAD |
Dependency |
Head of the current token, which is either a value of ID or zero (0). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. |
PHEAD |
ignored |
Automatically predicted syntactic head. |
DEPREL |
Dependency |
Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply ROOT. |
PDEPREL |
ignored |
Automatically predicted dependency relation to PHEAD. |
FILLPRED |
ignored |
Contains Y for lines where PRED is/should be filled. |
PRED |
SemPred |
(sense) identifier of a semantic 'predicate' coming from a current token. |
APREDs |
SemArg |
Columns with argument labels for each semantic predicate (in the ID order). |
1 The the the DT DT _ _ 4 4 NMOD NMOD _ _ _ _
2 most most most RBS RBS _ _ 3 3 AMOD AMOD _ _ _ _
3 troublesome troublesome troublesome JJ JJ _ _ 4 4 NMOD NMOD _ _ _ _
4 report report report NN NN _ _ 5 5 SBJ SBJ _ _ _ _
5 may may may MD MD _ _ 0 0 ROOT ROOT _ _ _ _
6 be be be VB VB _ _ 5 5 VC VC _ _ _ _
7 the the the DT DT _ _ 11 11 NMOD NMOD _ _ _ _
8 August august august NNP NNP _ _ 11 11 NMOD NMOD _ _ _ AM-TMP
9 merchandise merchandise merchandise NN NN _ _ 10 10 NMOD NMOD _ _ A1 _
10 trade trade trade NN NN _ _ 11 11 NMOD NMOD Y trade.01 _ A1
11 deficit deficit deficit NN NN _ _ 6 6 PRD PRD Y deficit.01 _ A2
12 due due due JJ JJ _ _ 13 11 AMOD APPO _ _ _ _
13 out out out IN IN _ _ 11 12 APPO AMOD _ _ _ _
14 tomorrow tomorrow tomorrow NN NN _ _ 13 12 TMP TMP _ _ _ _
15 . . . . . _ _ 5 5 P P _ _ _ _
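The link between the PRED column and the APREDs columns can be made concrete with a short Python sketch. It is an illustration only, not the DKPro Core implementation: the function name `semantic_roles` is hypothetical, and the rows are copied from the example above with spaces converted back to tabs. The k-th APRED column is paired with the k-th predicate in ID order.

```python
PRED_INDEX = 13          # PRED is the 14th column
FIRST_APRED_INDEX = 14   # APREDs start right after PRED


def semantic_roles(rows):
    """Return (predicate, role, argument form) triples for one sentence."""
    tokens = [row.split("\t") for row in rows]
    # Predicates in ID order; '_' marks tokens that are not predicates.
    predicates = [t[PRED_INDEX] for t in tokens if t[PRED_INDEX] != "_"]
    triples = []
    for t in tokens:
        for k, label in enumerate(t[FIRST_APRED_INDEX:]):
            if label != "_":
                triples.append((predicates[k], label, t[1]))
    return triples


SAMPLE_2009 = ["\t".join(row.split()) for row in [
    "8 August august august NNP NNP _ _ 11 11 NMOD NMOD _ _ _ AM-TMP",
    "9 merchandise merchandise merchandise NN NN _ _ 10 10 NMOD NMOD _ _ A1 _",
    "10 trade trade trade NN NN _ _ 11 11 NMOD NMOD Y trade.01 _ A1",
    "11 deficit deficit deficit NN NN _ _ 6 6 PRD PRD Y deficit.01 _ A2",
]]

for triple in semantic_roles(SAMPLE_2009):
    print(triple)
```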
Corpus | Language |
---|---|
Catalan, German, Japanese, Spanish |
Conll2009Reader
Reads a file in the CoNLL-2009 format.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readDependency |
Read syntactic dependency information. Type: Boolean — Default value: |
readLemma |
Read lemma information. Type: Boolean — Default value: |
readMorph |
Read morphological features. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
readSemPred |
Read semantic predicate information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2009 |
---|---|
Outputs |
Conll2009Writer
Writes a file in the CoNLL-2009 format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeDependency |
Write syntactic dependency information. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Type: Boolean — Default value: |
writeMorph |
Write morphological features. Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Type: Boolean — Default value: |
writeSemPred |
Write semantic predicate information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2009 |
---|---|
Inputs |
Conll2012
The CoNLL 2012 format targets semantic role labeling and coreference. Columns are tab-separated. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
Document ID |
ignored |
This is a variation on the document filename. |
Part number |
ignored |
Some files are divided into multiple parts numbered as 000, 001, 002, … etc. |
Word number |
ignored |
|
Word itself |
document text |
This is the token as segmented/tokenized in the Treebank. Initially the |
Part-of-Speech |
POS |
|
Parse bit |
Constituent |
This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. |
Predicate lemma |
Lemma |
The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-". |
Predicate Frameset ID |
SemPred |
This is the PropBank frameset ID of the predicate in Column 7. |
Word sense |
ignored |
This is the word sense of the word in Column 3. |
Speaker/Author |
ignored |
This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data. |
Named Entities |
NamedEntity |
This column identifies the spans representing various named entities. |
Predicate Arguments |
SemPred |
There is one column each of predicate argument structure information for the predicate mentioned in Column 7. |
Coreference |
CoreferenceChain |
Coreference chain information encoded in a parenthesis structure. |
en-orig.conll 0 0 John NNP (TOP(S(NP*) john - - - (PERSON) (A0) (1)
en-orig.conll 0 1 went VBD (VP* go go.02 - - * (V*) -
en-orig.conll 0 2 to TO (PP* to - - - * * -
en-orig.conll 0 3 the DT (NP* the - - - * * (2
en-orig.conll 0 4 market NN *))) market - - - * (A1) 2)
en-orig.conll 0 5 . . *)) . - - - * * -
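The parenthesis structure in the coreference column can be decoded into token spans with a small stack-based routine. The Python sketch below is illustrative and not taken from DKPro Core; the function name `decode_coreference` is hypothetical. It also assumes that several mentions in one cell are separated by a vertical bar, which is not shown in the example above.

```python
from collections import defaultdict


def decode_coreference(column):
    """Turn per-token coreference cells into (chain id, start, end) spans.

    Token indices are zero-based and the end index is inclusive.
    """
    open_starts = defaultdict(list)  # chain id -> stack of open start indices
    mentions = []
    for index, cell in enumerate(column):
        if cell == "-":
            continue  # token is not part of any mention boundary
        for part in cell.split("|"):  # assumption: '|' separates mentions
            chain_id = part.strip("()")
            if part.startswith("("):
                open_starts[chain_id].append(index)
            if part.endswith(")"):
                start = open_starts[chain_id].pop()
                mentions.append((chain_id, start, index))
    return mentions


# Last column of the example sentence "John went to the market ."
COREF_COLUMN = ["(1)", "-", "-", "(2", "2)", "-"]
print(decode_coreference(COREF_COLUMN))
```

For the example sentence this yields one single-token mention for chain 1 ("John") and one two-token mention for chain 2 ("the market").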
Conll2012Reader
Reads a file in the CoNLL-2012 format.
ConstituentMappingLocation |
Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ConstituentTagSet |
Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readConstituent |
Read syntactic constituent information. Type: Boolean — Default value: |
readCoreference |
Read co-reference information. Type: Boolean — Default value: |
readLemma |
Read lemma information. Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates. Type: Boolean — Default value: |
readNamedEntity |
Read named entity information. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
readSemPred |
Read semantic predicate information. Type: Boolean — Default value: |
readWordSense |
Read word sense information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
useHeaderMetadata |
Use the document ID declared in the file header instead of using the filename. Type: Boolean — Default value: |
writeTracesToText |
Whether to render traces into the document text. Optional — Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2012 |
---|---|
Outputs |
Conll2012Writer
Writer for the CoNLL-2012 format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Type: Boolean — Default value: |
writeSemanticPredicate |
Write semantic predicate information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-2012 |
---|---|
Inputs |
ConllU
The CoNLL-U format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank new line.
Column | Type/Feature | Description |
---|---|---|
ID |
ignored |
Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words. |
FORM |
Token |
Word form or punctuation symbol. |
LEMMA |
Lemma |
Lemma or stem of word form. |
CPOSTAG |
POS coarseValue |
Part-of-speech tag from the universal POS tag set. |
POSTAG |
POS PosValue |
Language-specific part-of-speech tag; underscore if not available. |
FEATS |
MorphologicalFeatures |
List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available. |
HEAD |
Dependency |
Head of the current token, which is either a value of ID or zero (0). |
DEPREL |
Dependency |
Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one. |
DEPS |
Dependency |
List of secondary dependencies (head-deprel pairs). |
MISC |
unused |
Any other annotation. |
1 They they PRON PRN Case=Nom|Number=Plur 2 nsubj 4:nsubj _
2 buy buy VERB VB Number=Plur|Person=3|Tense=Pres 0 root _ _
3 and and CONJ CC _ 2 cc _ _
4 sell sell VERB VB Number=Plur|Person=3|Tense=Pres 2 conj 0:root _
5 books book NOUN NNS Number=Plur 2 dobj 4:dobj SpaceAfter=No
6 . . PUNCT . _ 2 punct _ _
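The FEATS and DEPS columns carry structured values that usually need a second parsing step. The following Python sketch shows one way to unpack them; it is an illustration and not the DKPro Core reader. The names `CONLLU_COLUMNS`, `parse_feats`, `parse_deps`, and `parse_token` are hypothetical, and the use of a vertical bar between multiple DEPS pairs is an assumption that the example above does not demonstrate.

```python
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_feats(value):
    """'Case=Nom|Number=Plur' -> {'Case': 'Nom', 'Number': 'Plur'}."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_deps(value):
    """'4:nsubj' -> [(4, 'nsubj')]; an underscore means no secondary deps."""
    if value == "_":
        return []
    deps = []
    for pair in value.split("|"):  # assumption: '|' separates pairs
        head, relation = pair.split(":", 1)
        deps.append((int(head), relation))
    return deps


def parse_token(line):
    """Parse one tab-separated token line into a dict keyed by column name."""
    token = dict(zip(CONLLU_COLUMNS, line.split("\t")))
    token["FEATS"] = parse_feats(token["FEATS"])
    token["DEPS"] = parse_deps(token["DEPS"])
    return token


# First row of the example above, with spaces converted back to tabs.
ROW = "\t".join("1 They they PRON PRN Case=Nom|Number=Plur 2 nsubj 4:nsubj _".split())
print(parse_token(ROW))
```

The ID column is kept as a string because it may be a range for tokens with multiple words.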
Corpus | Language |
---|---|
Ancient Greek (to 1453), Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Modern Greek (1453-), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Norwegian, Church Slavic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil, Catalan, Chinese, Galician, Kazakh, Latvian, Russian, Turkish |
ConllUReader
Reads a file in the CoNLL-U format.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readCPOS |
Read coarse-grained part-of-speech information. Type: Boolean — Default value: |
readDependency |
Read syntactic dependency information. Type: Boolean — Default value: |
readLemma |
Read lemma information. Type: Boolean — Default value: |
readMorph |
Read morphological features. Type: Boolean — Default value: |
readPOS |
Read fine-grained part-of-speech information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useCPosAsPos |
Treat coarse-grained part-of-speech as fine-grained part-of-speech information. Type: Boolean — Default value: |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-u |
---|---|
Outputs |
ConllUWriter
Writes a file in the CoNLL-U format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
writeCPOS |
Write coarse-grained part-of-speech information. Type: Boolean — Default value: |
writeCoveredText |
Write text covered by the token instead of the token form. Type: Boolean — Default value: |
writeDependency |
Write syntactic dependency information. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Type: Boolean — Default value: |
writeMorph |
Write morphological features. Type: Boolean — Default value: |
writePOS |
Write fine-grained part-of-speech information. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.conll-u |
---|---|
Inputs |
Ditop
DiTop
DiTopWriter
This annotator (consumer) writes output files as required by DiTop. It requires JCas input annotated by de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer using the same model.
appendConfig |
If set to true, the new corpus will be appended to an existing config file. If false, the existing file is overwritten. Type: Boolean — Default value: |
collectionValues |
If set, only documents with one of the listed collection IDs are written, all others are ignored. If this is empty (null), all documents are written. Optional — Type: String[] |
collectionValuesExactMatch |
If true (default), only write documents whose collection ID matches one of the collection values exactly. If false, write documents whose collection ID contains any of the collection values, ignoring case. Type: Boolean — Default value: |
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
corpusName |
The corpus name is used to name the corresponding sub-directory and will be set in the configuration file. Type: String |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
maxTopicWords |
The maximum number of topic words to extract. Type: Integer — Default value: |
modelLocation |
A Mallet file storing a serialized ParallelTopicModel. Type: String |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Directory in which to store output files. Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.ditop |
---|---|
Inputs |
Frequency
Frequency
FrequencyWriter
Count uni-grams and bi-grams in a collection.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
coveringType |
Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences. Optional — Type: String |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
featurePath |
The feature path. Type: String — Default value: |
filterRegex |
Regular expression of tokens to be filtered. Type: String — Default value: `` |
lowercase |
If true, all tokens are lowercased. Type: Boolean — Default value: |
minCount |
Tokens occurring fewer times than this value are omitted. Type: Integer — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
regexReplacement |
Value with which tokens matching the regular expression are replaced. Type: String — Default value: `` |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
sortByAlphabet |
If true, sort output alphabetically. Type: Boolean — Default value: |
sortByCount |
If true, sort output by count (descending order). Type: Boolean — Default value: |
stopwordsFile |
Path of a file containing stopwords, one word per line. Type: String — Default value: `` |
stopwordsReplacement |
Stopwords are replaced by this value. Type: String — Default value: `` |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
none specified |
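The interplay of lowercase, minCount and sortByCount described above can be illustrated with a minimal sketch. It is not the writer's implementation: the class name `TokenCountSketch` and its method are hypothetical, and breaking ties alphabetically is an assumption made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of DKPro Core.
final class TokenCountSketch {
    /**
     * Counts tokens, drops those seen fewer than minCount times and
     * sorts the rest by count in descending order (ties: alphabetical).
     */
    static List<Map.Entry<String, Integer>> count(List<String> tokens,
            boolean lowercase, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            String key = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            counts.merge(key, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(counts.entrySet());
        // minCount: omit tokens occurring fewer times than this value
        result.removeIf(entry -> entry.getValue() < minCount);
        // sortByCount: highest count first
        result.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }
}
```

With lowercasing enabled, `The` and `the` are counted as one entry, so a token can pass the minCount threshold that it would miss in case-sensitive mode.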
TfIdf
TfIdfWriter
This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.
featurePath |
This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path. Type: String |
lowercase |
If set to true, the whole text is handled in lower case. Type: Boolean — Default value: |
targetLocation |
Specifies the path and filename where the model file is written. Type: String |
Media types |
none specified |
---|---|
Inputs |
none specified |
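As a rough illustration of what a document frequency count is, the following sketch counts in how many documents each term occurs at least once. It works on plain token lists and does not reproduce the DfModel class or its serialization; `DocumentFrequencySketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of DKPro Core.
final class DocumentFrequencySketch {
    /** Returns, for each term, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<List<String>> documents,
            boolean lowercase) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> document : documents) {
            // A term counts once per document, however often it occurs in it.
            Set<String> seen = new HashSet<>();
            for (String term : document) {
                seen.add(lowercase ? term.toLowerCase(Locale.ROOT) : term);
            }
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }
}
```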
HTML
Html
HtmlReader
Reads the contents of a given URL and strips the HTML. Returns the textual contents. Also recognizes headings and paragraphs.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/xhtml+xml
text/html |
---|---|
Outputs |
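The patterns parameter distinguishes include and exclude patterns by their `[+]` and `[-]` prefixes. The sketch below only shows how such a list could be split by prefix. The class name and the example patterns are hypothetical. How unprefixed patterns are treated is not covered by the text above, so the sketch leaves them out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class PatternPrefixSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    /** Returns the patterns carrying the given prefix, with the prefix removed. */
    static List<String> withPrefix(String[] patterns, String prefix) {
        List<String> selected = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.startsWith(prefix)) {
                selected.add(pattern.substring(prefix.length()));
            }
        }
        return selected;
    }
}
```

For example, splitting `{"[+]pages/*.html", "[-]pages/draft.html"}` yields one include pattern and one exclude pattern.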
IMS Corpus Workbench
ImsCwb
The IMS Open Corpus Workbench is a linguistic search engine. It uses a tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees). If a local installation of the corpus workbench is available, it can be used by this module to immediately generate the corpus workbench index format. Search is not supported by this module.
- WaCky - The Web-As-Corpus Kool Yinitiative - corpora crawled from the world wide web in several different languages (DeWaC, UkWaC, ItWaC, etc.)
ImsCwbReader
Reads a tab-separated format including pseudo-XML tags.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Specify which tag set should be used to locate the mapping file. Optional — Type: String |
generateNewIds |
If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. Type: Boolean — Default value: |
idIsUrl |
If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS. Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readLemma |
Read lemmas. Type: Boolean — Default value: |
readPOS |
Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used. Type: Boolean — Default value: |
readSentence |
Read sentences. Type: Boolean — Default value: |
readToken |
Read tokens and generate Token annotations. Type: Boolean — Default value: |
replaceNonXml |
Replace non-XML characters with spaces. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.imscwb |
---|---|
Outputs |
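To make the "tab-separated format including pseudo-XML tags" more concrete, here is a minimal sketch that collects the token column of each sentence. It is not the reader's implementation and rests on several assumptions: sentences are delimited by `<s>` and `</s>` lines, the token is the first column, and every other line starting with `<` is a structural tag that can be skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class CwbLineSketch {
    /** Groups the first tab-separated column of each token line by sentence. */
    static List<List<String>> sentences(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.equals("<s>")) {
                current = new ArrayList<>();           // sentence starts
            } else if (line.equals("</s>")) {
                if (current != null) {
                    sentences.add(current);            // sentence ends
                }
                current = null;
            } else if (line.startsWith("<")) {
                continue;                              // other pseudo-XML tag
            } else if (current != null && !line.isEmpty()) {
                current.add(line.split("\t", -1)[0]);  // token column (assumed first)
            }
        }
        return sentences;
    }
}
```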
ImsCwbWriter
This Consumer outputs the content of all CASes into the IMS workbench format. This writer produces a text file which needs to be converted to the binary IMS CWB index files using the command line tools that come with the CWB. It is possible to set the parameter #PARAM_CQP_HOME to directly create output in the native binary CQP format via the original CWB command line tools.
additionalFeatures |
Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames. Optional — Type: String[] |
corpusName |
The name of the generated corpus. Type: String — Default value: |
cqpCompress |
Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. For large data sets, test which setting performs better. (default: false) Type: Boolean — Default value: |
cqpHome |
Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to directly encode into the CQP binary format. Optional — Type: String |
cqpwebCompatibility |
Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore. Type: Boolean — Default value: |
sentenceTag |
The pseudo-XML tag used to mark sentence boundaries. Type: String — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Location to which the output is written. Type: String |
writeCPOS |
Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag. Type: Boolean — Default value: |
writeDocId |
Write the document ID for each token. It is usually a better idea to generate a #PARAM_WRITE_DOCUMENT_TAG document tag or a #PARAM_WRITE_TEXT_TAG text tag which also contain the document ID that can be queried in CQP. Type: Boolean — Default value: |
writeDocumentTag |
Write a pseudo-XML tag with the name document to mark the start and end of a document. Type: Boolean — Default value: |
writeLemma |
Write lemmata. Type: Boolean — Default value: |
writeOffsets |
Write the start and end position of each token. Type: Boolean — Default value: |
writePOS |
Write part-of-speech tags. Type: Boolean — Default value: |
writeTextTag |
Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.imscwb |
---|---|
Inputs |
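The cqpwebCompatibility option exists because CQPweb accepts only letters, numbers and underscores in an ID. One way to satisfy that constraint is sketched below. Replacing every other character with an underscore is an assumption for illustration and may differ from what the writer actually does. `CqpwebIdSketch` is a hypothetical name.

```java
// Hypothetical helper, not part of DKPro Core.
final class CqpwebIdSketch {
    /** Replaces every character that is not a letter, digit or underscore. */
    static String sanitize(String documentId) {
        return documentId.replaceAll("[^A-Za-z0-9_]", "_");
    }
}
```

Note that a replacement of this kind can map two different document IDs to the same sanitized ID, for example `doc-01` and `doc.01`.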
JDBC
Jdbc
JdbcReader
Collection reader for JDBC databases. The obtained data will be written into the CAS document text as well as into fields of the DocumentMetaData annotation.
The field names are available as constants and begin with CAS_. Please specify the mapping of the columns to the field names in the query. For example,
SELECT text AS cas_text, title AS cas_metadata_title FROM test_table
will create a CAS for each record, write the content of the "text" column into the CAS document text, and write the content of the "title" column into the document title field of the DocumentMetaData annotation.
connection |
Specifies the URL to the database. If used with uimaFIT and the value is not given, the default value is used. Do not use this parameter to add additional parameters, but use #PARAM_CONNECTION_PARAMS instead. Type: String — Default value: |
connectionParams |
Add additional parameters for the connection URL here in a single string: [&propertyName1=propertyValue1[&propertyName2=propertyValue2]...]. Type: String — Default value: `` |
database |
Specifies name of the database to be accessed. Type: String |
driver |
Specify the class name of the JDBC driver. If used with uimaFIT and the value is not given, the default value is used. Type: String — Default value: |
language |
Specifies the language. Optional — Type: String |
password |
Specifies the password for database access. Type: String |
query |
Specifies the query. Type: String |
user |
Specifies the user name for database access. Type: String |
Media types |
none specified |
---|---|
Outputs |
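The column aliases in the example query (`cas_text`, `cas_metadata_title`) decide where each value ends up. The following sketch mimics that routing on an in-memory row instead of a real JDBC result set. It is an illustration only: the class is hypothetical, and treating every alias with the `cas_metadata_` prefix as a metadata field generalizes from the single example above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of DKPro Core.
final class JdbcRowSketch {
    static final String TEXT_COLUMN = "cas_text";
    static final String METADATA_PREFIX = "cas_metadata_";

    /** Value of the column aliased as cas_text, i.e. the document text. */
    static String documentText(Map<String, String> row) {
        return row.get(TEXT_COLUMN);
    }

    /** Metadata fields, keyed by the alias with the cas_metadata_ prefix removed. */
    static Map<String, String> metadata(Map<String, String> row) {
        Map<String, String> fields = new TreeMap<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            String alias = column.getKey().toLowerCase(Locale.ROOT);
            if (alias.startsWith(METADATA_PREFIX)) {
                fields.put(alias.substring(METADATA_PREFIX.length()), column.getValue());
            }
        }
        return fields;
    }
}
```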
Leipzig Corpora Collection
Lcc
LccReader
Reader for sentence-based Leipzig Corpora Collection files.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sentencesPerCAS |
How many input sentences should be merged into one CAS. Type: Integer — Default value: |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
writeSentence |
Whether sentences should be written by the reader or not. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.lcc |
---|---|
Outputs |
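The sentencesPerCAS parameter controls how many input sentences are merged into one CAS. The grouping itself can be sketched as follows. The class name is hypothetical, and how the reader joins the sentences into a document text is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class SentenceBatchSketch {
    /** Splits the sentences into consecutive groups of at most sentencesPerCas. */
    static List<List<String>> batch(List<String> sentences, int sentencesPerCas) {
        if (sentencesPerCas < 1) {
            throw new IllegalArgumentException("sentencesPerCas must be at least 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesPerCas) {
            int end = Math.min(start + sentencesPerCas, sentences.size());
            batches.add(new ArrayList<>(sentences.subList(start, end)));
        }
        return batches;
    }
}
```

The last group may be smaller than the others when the number of sentences is not a multiple of the group size.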
LIF
Lif
LifReader
Reader for the LIF format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.lif+json |
---|---|
Outputs |
LifWriter
Writer for the LIF format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.lif+json |
---|---|
Inputs |
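The escapeDocumentId option URL-encodes the document ID so that characters such as `\` and `:` cannot end up in a file name. The effect can be previewed with the JDK's `java.net.URLEncoder`. Whether the writer uses this exact class is an assumption, and `DocumentIdEscapeSketch` is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of DKPro Core.
final class DocumentIdEscapeSketch {
    /** URL-encodes a document ID so it can serve as a file name. */
    static String escape(String documentId) {
        try {
            return URLEncoder.encode(documentId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

With this encoder, `:` becomes `%3A`, `\` becomes `%5C`, and a space becomes `+`.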
LXF
Lxf
LxfReader
Reader for the CLARINO LAP LXF format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.lxf+json |
---|---|
Outputs |
LxfWriter
Writer for the CLARINO LAP LXF format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
delta |
Write only the changes to the annotations. This works only in conjunction with the LxfReader. Type: Boolean — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.lxf+json |
---|---|
Inputs |
Mallet
MalletLdaTopicProportions
MalletLdaTopicProportionsWriter
Write topic proportions to a file in the shape
[
This writer depends on the TopicDistribution annotation which needs to be created by MalletLdaTopicModelInferencer before.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
If #PARAM_SINGULAR_TARGET is set to false (default), this extension will be appended to the output files. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: |
writeDocid |
If set to true (default), each output line is preceded by the document id. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
none specified |
MalletLdaTopicsProportionsSorted
MalletLdaTopicsProportionsSortedWriter
Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletLdaTopicModelInferencer.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
nTopics |
Number of topics to generate. Type: Integer — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
none specified |
NEGRA
NegraExport
NegraExportReader
This CollectionReader reads a file which is formatted in the NEGRA export format. The texts and additional information such as constituent structure are reproduced in CASes, one CAS per text (article).
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
collectionId |
The collection ID to be written to the document meta data. Optional — Type: String |
documentUnit |
Determines when a new CAS is started. For example, if set to DocumentUnit#ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. Type: String — Default value: |
generateNewIds |
If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. Type: Boolean — Default value: |
language |
The language. Optional — Type: String |
readLemma |
Write lemma information. Type: Boolean — Default value: |
readPOS |
Write part-of-speech information. Type: Boolean — Default value: |
readPennTree |
Write Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Type: String |
Media types |
application/x.org.dkpro.negra3
application/x.org.dkpro.negra4 |
---|---|
Outputs |
New York Times Corpus
NYTCollection
NYTCollectionReader
Reader for New York Times articles from NITF files.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
offset |
The number of documents to skip at the beginning. Optional — Type: Integer |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.nitf+xml |
---|---|
Outputs |
NIF
Nif
The NLP Interchange Format (NIF) provides a way of representing NLP information using semantic web technology, specifically RDF and OWL.
NifReader
Reader for the NLP Interchange Format (NIF). The file format (e.g. TURTLE, etc.) is automatically chosen depending on the name of the file(s) being read. Compressed files are supported.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.nif+turtle |
---|---|
Outputs |
NifWriter
Writer for the NLP Interchange Format (NIF).
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.nif+turtle |
---|---|
Inputs |
PDF
Pdf
PdfReader
Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.
endPage |
The last page to be extracted from the PDF. Optional — Type: Integer — Default value: |
headingType |
The type used to annotate headings. Optional — Type: String — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
paragraphType |
The type used to annotate paragraphs. Optional — Type: String — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
startPage |
The first page to be extracted from the PDF. Optional — Type: Integer — Default value: |
substitutionTableLocation |
The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters. Optional — Type: String — Default value: |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/pdf |
---|---|
Outputs |
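The substitution table post-processes the extracted text, for instance by converting ligatures to separate characters. The sketch below applies such a table held in memory. The file format of the actual substitution table is not described here, so the map and the class name are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DKPro Core.
final class SubstitutionSketch {
    /** A small example table mapping ligature characters to plain letters. */
    static Map<String, String> ligatureTable() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("\uFB03", "ffi"); // LATIN SMALL LIGATURE FFI
        table.put("\uFB01", "fi");  // LATIN SMALL LIGATURE FI
        table.put("\uFB02", "fl");  // LATIN SMALL LIGATURE FL
        return table;
    }

    /** Replaces every occurrence of each table key with its value. */
    static String substitute(String text, Map<String, String> table) {
        String result = text;
        for (Map.Entry<String, String> entry : table.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```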
Penn Treebank Format
PennTreebankChunked
PennTreebankChunkedReader
Penn Treebank chunked format reader.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readChunk |
Write chunk annotations to the CAS. Type: Boolean — Default value: |
readPOS |
Write part-of-speech annotations to the CAS. Type: Boolean — Default value: |
readSentence |
Write sentence annotations to the CAS. Type: Boolean — Default value: |
readToken |
Write token annotations to the CAS. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.ptb-chunked |
---|---|
Outputs |
PennTreebankCombined
- Floresta Sintá(c)tica (Bosque) - Portuguese
PennTreebankCombinedReader
Penn Treebank combined format reader.
ConstituentMappingLocation |
Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ConstituentTagSet |
Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readPOS |
Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: |
removeTraces |
Whether to remove traces from the parse tree. Optional — Type: Boolean — Default value: |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
writeTracesToText |
Whether to render traces into the document text. Optional — Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.ptb-combined |
---|---|
Outputs |
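The include/exclude prefixes described for the patterns parameter can be illustrated with a minimal sketch. It is not the reader's actual implementation: the class name PatternSplitSketch and the example patterns are hypothetical, and only the [+] and [-] prefixes are taken from the parameter description.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: partitions patterns into includes and excludes by their prefix.
final class PatternSplitSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    /** Returns the patterns that carry the given prefix, with the prefix removed. */
    static List<String> withPrefix(String[] patterns, String prefix) {
        List<String> result = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.startsWith(prefix)) {
                result.add(pattern.substring(prefix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical patterns: include all .mrg files, exclude a "broken" folder.
        String[] patterns = { "[+]**/*.mrg", "[-]**/broken/**" };
        System.out.println("includes: " + withPrefix(patterns, INCLUDE_PREFIX));
        System.out.println("excludes: " + withPrefix(patterns, EXCLUDE_PREFIX));
    }
}
```

How a pattern without any prefix is handled is deliberately left out of the sketch, since the description above does not state it.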
PennTreebankCombinedWriter
Penn Treebank combined format writer.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
emptyRootLabel |
Whether to force the root label to be empty. Type: Boolean — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
noRootLabel |
Whether to remove the root node. This is only possible if the root node has only a single child (i.e. a sentence node). Type: Boolean — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.ptb-combined |
---|---|
Inputs |
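The effect of escapeDocumentId, stripExtension and filenameExtension on the output file name can be approximated as follows. This is a sketch under assumptions, not the writer's code: the class TargetNameSketch, the order in which the options are applied, and the ".penn" suffix are all hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: derives an output file name from a document ID.
final class TargetNameSketch {
    /** URL-encodes the ID so that characters such as '\' or ':' become file-name safe. */
    static String escape(String documentId) {
        return URLEncoder.encode(documentId, StandardCharsets.UTF_8);
    }

    /** Removes the last extension, if any, e.g. "wsj_0001.mrg" becomes "wsj_0001". */
    static String stripExtension(String name) {
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** Assumed order: strip the original extension, escape, then append the suffix. */
    static String targetName(String documentId, boolean escapeDocumentId,
            boolean strip, String filenameExtension) {
        String base = strip ? stripExtension(documentId) : documentId;
        if (escapeDocumentId) {
            base = escape(base);
        }
        return base + filenameExtension;
    }

    public static void main(String[] args) {
        // ".penn" is a hypothetical suffix chosen for this example.
        System.out.println(targetName("corpus:wsj_0001.mrg", true, true, ".penn"));
    }
}
```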
PubAnnotation
PubAnnotation
PubAnnotationReader
Reader for the PubAnnotation format. Since the PubAnnotation format only associates spans/relations with simple values and since annotations are not typed, it is necessary to define target types and features via #PARAM_SPAN_TYPE and #PARAM_SPAN_LABEL_FEATURE. In PubAnnotation, every annotation has an ID. If the target type has a suitable feature to retain the ID, it can be configured via #PARAM_SPAN_ID_FEATURE. The sourcedb and sourceid from the PubAnnotation document are imported as DocumentMetaData#setCollectionId(String) collectionId and DocumentMetaData#setDocumentId(String) documentId respectively. If present, the target is also imported as DocumentMetaData#setDocumentUri(String) documentUri. The DocumentMetaData#setDocumentBaseUri(String) documentBaseUri is cleared in this case. Currently, only span annotations are supported, i.e. no relations or modifications. Discontinuous segments are also not supported.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
resolveNamespaces |
The feature on the span annotation type which receives the label. Type: Boolean — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
spanIdFeature |
The feature on the span annotation type which receives the ID. Optional — Type: String |
spanLabelFeature |
The feature on the span annotation type which receives the label. Optional — Type: String |
spanType |
The span annotation type to which the PubAnnotation spans are mapped. Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.pubannotation+json |
---|---|
Outputs |
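The mapping from an untyped PubAnnotation span to features of a configured target type can be pictured with the following sketch. It models a span by its ID, its begin/end offsets and its simple value (called obj here); the class PubAnnotationSpanSketch and the feature names "identifier" and "value" are hypothetical choices for illustration, not defaults of the reader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: shows which feature receives which piece of a span.
final class PubAnnotationSpanSketch {
    /** Builds the feature values for one span; a feature name of null is skipped. */
    static Map<String, String> toFeatures(String id, String obj,
            String spanIdFeature, String spanLabelFeature) {
        Map<String, String> features = new LinkedHashMap<>();
        if (spanIdFeature != null) {
            features.put(spanIdFeature, id);      // retains the annotation ID
        }
        if (spanLabelFeature != null) {
            features.put(spanLabelFeature, obj);  // receives the simple value
        }
        return features;
    }

    /** Returns the text covered by a span with the given offsets. */
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "IRF-4 expression";
        System.out.println(coveredText(text, 0, 5));
        System.out.println(toFeatures("T1", "Protein", "identifier", "value"));
    }
}
```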
PubAnnotationWriter
Writer for the PubAnnotation format. Since the PubAnnotation format only associates spans/relations with simple values and since annotations are not typed, it is necessary to define target types and features via #PARAM_SPAN_TYPE and #PARAM_SPAN_LABEL_FEATURE. In PubAnnotation, every annotation has an ID. If the annotation type has an ID feature, it can be configured via #PARAM_SPAN_ID_FEATURE. If this parameter is not set, the IDs are generated automatically. The sourcedb and sourceid from the PubAnnotation document are exported from DocumentMetaData#setCollectionId(String) collectionId and DocumentMetaData#setDocumentId(String) documentId respectively. The target is exported from DocumentMetaData#setDocumentUri(String) documentUri. Currently supports only span annotations, i.e. no relations or modifications. Discontinuous segments are also not supported.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
spanIdFeature |
The feature on the span annotation type which receives the ID. Optional — Type: String |
spanLabelFeature |
The feature on the span annotation type which receives the label. Optional — Type: String |
spanType |
The span annotation type to which the PubAnnotation spans are mapped. Type: String |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.pubannotation+json |
---|---|
Inputs |
Reuters-21578
Reuters21578Sgml
Reuters21578SgmlReader
Read a Reuters-21578 corpus in SGML format.
Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.reuters21578+sgml |
---|---|
Outputs |
Reuters21578Txt
Reuters21578TxtReader
Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.
The #PARAM_SOURCE_LOCATION parameter should typically point to the file name pattern reut2-*.txt, preceded by the corpus root directory.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/x.org.dkpro.reuters21578 |
---|---|
Outputs |
RTF
RTF
RTFReader
Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/rtf
text/rtf |
---|---|
Outputs |
Solr
Solr
SolrWriter
A simple implementation of SolrWriter_ImplBase
numThreads |
The number of background threads used to empty the queue. Type: Integer — Default value: |
optimizeIndex |
If set to true, the index is optimized once all documents are uploaded. Default is false. Type: Boolean — Default value: |
queueSize |
The buffer size before the documents are sent to the server (default: 10000). Type: Integer — Default value: |
solrIdField |
The name of the id field in the Solr schema (default: "id"). Type: String — Default value: |
targetLocation |
Solr server URL string in the form Type: String |
textField |
The name of the text field in the Solr schema (default: "text"). Type: String — Default value: |
update |
Define whether existing documents with the same ID are updated (true) or overwritten (false). Type: Boolean — Default value: |
waitFlush |
When committing to the index, i.e. when all documents are processed, block until index changes are flushed to disk? Type: Boolean — Default value: |
waitSearcher |
When committing to the index, i.e. when all documents are processed, block until a new searcher is opened and registered as the main query searcher, making the changes visible? Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
none specified |
TCF
Tcf
The TCF (Text Corpus Format) was created in the context of the CLARIN project. It is mainly used to exchange data between the different web-services that are part of the WebLicht platform.
TcfReader
Reader for the WebLicht TCF format. It reads all available annotation layers from the TCF file and converts them to CAS annotations. TCF data does not provide begin/end offsets for all of its annotations, which CAS annotations require. Hence, offsets are calculated per token and stored in a map (token_id, token (CAS object)), from which the offset can later be looked up via the token.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/tcf+xml |
---|---|
Outputs |
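The per-token offset calculation mentioned in the TcfReader description can be sketched as follows. The approach shown (scanning the text from left to right for each token string) is an assumption made for illustration; the reader itself may compute offsets differently. The class TcfOffsetSketch and the token IDs are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assigns begin/end offsets to tokens that carry an ID but no offsets.
final class TcfOffsetSketch {
    /** Maps each token ID to {begin, end}; each entry of tokens is {token ID, token string}. */
    static Map<String, int[]> offsets(String text, String[][] tokens) {
        Map<String, int[]> byId = new LinkedHashMap<>();
        int cursor = 0; // search position; only moves forward
        for (String[] token : tokens) {
            int begin = text.indexOf(token[1], cursor);
            if (begin < 0) {
                throw new IllegalArgumentException("Token not found in text: " + token[1]);
            }
            int end = begin + token[1].length();
            byId.put(token[0], new int[] { begin, end });
            cursor = end;
        }
        return byId;
    }

    public static void main(String[] args) {
        String[][] tokens = { { "t1", "Hi" }, { "t2", "there" }, { "t3", "." } };
        offsets("Hi there.", tokens).forEach(
                (id, span) -> System.out.println(id + ": " + span[0] + "-" + span[1]));
    }
}
```

Other layers that refer to token IDs can then obtain their offsets by looking up the referenced tokens in the map.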
TcfWriter
Writer for the WebLicht TCF format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
merge |
Merge with source TCF file if one is available. Type: Boolean — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
preserveIfEmpty |
If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF. Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
text/tcf+xml |
---|---|
Inputs |
TEI
Tei
TeiReader
Reader for TEI XML.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
omitIgnorableWhitespace |
Do not write ignorable whitespace from the XML file to the CAS. Type: Boolean — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readConstituent |
Write constituent annotations to the CAS. Type: Boolean — Default value: |
readLemma |
Write lemma annotations to the CAS. Type: Boolean — Default value: |
readNamedEntity |
Write named entity annotations to the CAS. Type: Boolean — Default value: |
readPOS |
Write part-of-speech annotations to the CAS. Type: Boolean — Default value: |
readParagraph |
Write paragraph annotations to the CAS. Type: Boolean — Default value: |
readSentence |
Write sentence annotations to the CAS. Type: Boolean — Default value: |
readToken |
Write token annotations to the CAS. Type: Boolean — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
useFilenameId |
When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case. Type: Boolean — Default value: |
useXmlId |
Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique. Type: Boolean — Default value: |
utterancesAsSentences |
Interpret utterances "u" as sentences "s". (EXPERIMENTAL) Type: Boolean — Default value: |
Media types |
application/tei+xml |
---|---|
Outputs |
TeiWriter
UIMA CAS consumer writing the CAS document text in TEI format.
cTextPattern |
A token matching this pattern is rendered as a TEI "c" element instead of a "w" element. Type: String — Default value: |
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
indent |
Indent the XML. Type: Boolean — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
writeConstituent |
Write constituent annotations to the CAS. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens). Type: Boolean — Default value: |
writeNamedEntity |
Write named entity annotations to the CAS. Overlapping named entities are not supported. Type: Boolean — Default value: |
Media types |
application/tei+xml |
---|---|
Inputs |
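The choice between the TEI "c" and "w" elements controlled by cTextPattern can be sketched like this. The pattern used below (tokens consisting only of punctuation) is a hypothetical stand-in, since the default value is not shown above, and the class TeiTokenSketch is not part of the writer.

```java
import java.util.regex.Pattern;

// Sketch only: renders a token as a TEI "c" or "w" element depending on a pattern.
final class TeiTokenSketch {
    // Hypothetical pattern: one or more punctuation characters.
    static final Pattern C_TEXT_PATTERN = Pattern.compile("\\p{Punct}+");

    /** Tokens matching the pattern become "c" elements, all others "w" elements. */
    static String render(String token) {
        String element = C_TEXT_PATTERN.matcher(token).matches() ? "c" : "w";
        return "<" + element + ">" + escape(token) + "</" + element + ">";
    }

    /** Minimal XML escaping for element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        for (String token : new String[] { "house", ",", "&" }) {
            System.out.println(render(token));
        }
    }
}
```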
Text
String
StringReader
Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().
collectionId |
The collection ID to set in the DocumentMetaData. Type: String — Default value: |
documentBaseUri |
The document base URI to set in the DocumentMetaData. Optional — Type: String |
documentId |
The document ID to set in the DocumentMetaData. Type: String — Default value: |
documentText |
The document text. Type: String |
documentUri |
The document URI to set in the DocumentMetaData. Type: String — Default value: |
language |
Set this as the language of the produced documents. Type: String |
Media types |
text/plain |
---|---|
Outputs |
Text
TextReader
UIMA collection reader for plain text files.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
text/plain |
---|---|
Outputs |
TextWriter
UIMA CAS consumer writing the CAS document text as plain text file.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
text/plain |
---|---|
Inputs |
TokenizedText
TokenizedTextWriter
Write texts into a large file containing one sentence per line and tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written as specified by #PARAM_FEATURE_PATH.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
coveringType |
In the output file, each unit of the covering type is written into a separate line. The default (set in #DEFAULT_COVERING_TYPE) is sentences, so that each sentence is written to a line. If no line breaks within a document are desired, set this value to null. Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
extension |
Set the output file extension. Type: String — Default value: |
featurePath |
The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value for lemmas. Type: String — Default value: |
numberRegex |
Regular expression to match numbers. These are written to the output as NUM. Type: String — Default value: `` |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stopwordsFile |
All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored. Type: String — Default value: `` |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Encoding for the target file. Default is UTF-8. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
text/plain |
---|---|
Inputs |
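The output layout produced by the TokenizedTextWriter (one covering unit per line, tokens separated by whitespace, numbers written as NUM, stop words written as STOP) can be approximated with the sketch below. The class TokenizedTextSketch, the number pattern, and the order in which the two replacements are checked are assumptions; the stop word set is expected to be lower-cased to mimic the case-insensitive matching.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Sketch only: renders tokenized sentences in a one-sentence-per-line layout.
final class TokenizedTextSketch {
    static String render(List<List<String>> sentences, Pattern numberRegex,
            Set<String> lowerCasedStopwords) {
        StringBuilder out = new StringBuilder();
        for (List<String> sentence : sentences) {
            StringJoiner line = new StringJoiner(" ");
            for (String token : sentence) {
                if (lowerCasedStopwords.contains(token.toLowerCase(Locale.ROOT))) {
                    line.add("STOP");   // stop words are replaced, ignoring case
                } else if (numberRegex.matcher(token).matches()) {
                    line.add("NUM");    // numbers are replaced
                } else {
                    line.add(token);
                }
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("The", "cat", "ate", "3", "mice", "."),
                List.of("It", "slept", "."));
        // "[0-9]+" is a hypothetical number pattern for this example.
        System.out.print(render(sentences, Pattern.compile("[0-9]+"), Set.of("the", "it")));
    }
}
```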
TGrep2
TGrep
TGrep and TGrep2 are tools to search over syntactic parse trees represented as bracketed structures. This module supports TGrep2 in particular and makes it convenient to generate TGrep2 indexes which can then be searched. Search is not supported by this module.
TGrepWriter
TGrep2 corpus file writer. Requires PennTrees to be annotated before.
compression |
Method to compress the tgrep file (only used if PARAM_WRITE_T2C is true). Only NONE, GZIP and BZIP2 are supported. Type: String — Default value: |
dropMalformedTrees |
If true, silently drops malformed Penn Trees instead of throwing an exception. Type: Boolean — Default value: |
targetLocation |
Path to which the output is written. Type: String |
writeComments |
Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Type: Boolean — Default value: |
writeT2c |
Set this parameter to true if you want to encode directly into the tgrep2 binary format. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.tgrep2 |
---|---|
Inputs |
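The writeComments and dropMalformedTrees parameters can be illustrated with a short sketch. The comment layout (documentId,beginOffset,endOffset) follows the description above; everything else is an assumption, in particular the class TGrepSketch and the idea that "malformed" means unbalanced brackets, which the writer may define differently.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds tree comments and filters trees with unbalanced brackets.
final class TGrepSketch {
    /** Formats a comment of the form documentId,beginOffset,endOffset. */
    static String comment(String documentId, int beginOffset, int endOffset) {
        return documentId + "," + beginOffset + "," + endOffset;
    }

    /** Assumed well-formedness check: every bracket is closed and none closes too early. */
    static boolean isBalanced(String tree) {
        int depth = 0;
        for (char c : tree.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false;
            }
        }
        return depth == 0;
    }

    /** Keeps well-formed trees; malformed ones are dropped or cause an exception. */
    static List<String> keep(List<String> trees, boolean dropMalformedTrees) {
        List<String> kept = new ArrayList<>();
        for (String tree : trees) {
            if (isBalanced(tree)) {
                kept.add(tree);
            } else if (!dropMalformedTrees) {
                throw new IllegalArgumentException("Malformed tree: " + tree);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(comment("doc1", 0, 12));
        System.out.println(keep(List.of("(S (NP (DT the) (NN cat)) (VP (VBD slept)))", "(S (NP"), true));
    }
}
```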
TIGER-XML
TigerXml
The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format has later been extended to also support semantic frame annotations.
- Floresta Sintá(c)tica (Bosque) - Portuguese
- Semeval-2 Task 10 - (extended format)
- Składnica frazowa - Polish
- Swedish Treebank - Swedish
- Talbanken05 - Swedish
- TIGER - German
TigerXmlReader
UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This can be useful if a custom model is specified which does not have such metadata, or it can be used in readers. Optional — Type: String |
ignoreIllegalSentences |
If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences. Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readPennTree |
Write Penn Treebank bracketed structure information. Mind that this may not work with all tagsets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not using the mapped tagset! Type: Boolean — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.semeval-2010+xml
application/x.org.dkpro.tiger+xml |
---|---|
Outputs |
TigerXmlWriter
UIMA CAS consumer writing the CAS document text in the TIGER-XML format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.tiger+xml |
---|---|
Inputs |
Tika
Tika
TikaReader
Reader for many file formats based on Apache Tika.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
parseEmbeddedDocuments |
Parse embedded documents in addition to the main document. Optional — Type: Boolean — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Outputs |
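The patterns parameter described above mixes include and exclude patterns in one array, distinguished by the [+] and [-] prefixes. The following sketch shows one way to separate the two kinds. It is not the reader's actual code: the class name is hypothetical, and treating a pattern without a prefix as an include is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits a mixed pattern array into includes and excludes.
final class PatternSplitter {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    final List<String> includes = new ArrayList<>();
    final List<String> excludes = new ArrayList<>();

    PatternSplitter(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.startsWith(EXCLUDE_PREFIX)) {
                excludes.add(pattern.substring(EXCLUDE_PREFIX.length()));
            } else if (pattern.startsWith(INCLUDE_PREFIX)) {
                includes.add(pattern.substring(INCLUDE_PREFIX.length()));
            } else {
                // Assumption: a pattern without a prefix counts as an include.
                includes.add(pattern);
            }
        }
    }

    public static void main(String[] args) {
        PatternSplitter split = new PatternSplitter("[+]**/*.pdf", "[-]**/draft/**");
        System.out.println("includes=" + split.includes + " excludes=" + split.excludes);
    }
}
```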
TUEBADZ
TuebaDZ
The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
Each sentence starts with a header line and is followed by a blank line.
Column | Type/Feature | Description |
---|---|---|
FORM | Token | Word form or punctuation symbol. |
POSTAG | POS PosValue | Fine-grained part-of-speech tag, where the tagset depends on the language. |
CHUNK | Chunk | Chunk (BIO-encoded). For named entities, it can also include the entity type, e.g., B-NX=ORG. |
%% sent no. 1
Veruntreute VVFIN B-VXFIN
die ART B-NX=ORG
AWO NN I-NX=ORG
Spendengeld NN B-NX
? $. O
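The column layout shown in the example can be parsed with a few lines of plain Java. The sketch below is not the TuebaDZReader implementation; the class and field names are hypothetical, and it assumes whitespace-separated columns and that no token form starts with "%%".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a parser for the three-column chunk format.
final class TuebaDzChunkSketch {
    // One token line: FORM, POSTAG and the decomposed CHUNK column.
    static final class Token {
        final String form, pos, bio, chunk, entity;

        Token(String form, String pos, String bio, String chunk, String entity) {
            this.form = form; this.pos = pos; this.bio = bio;
            this.chunk = chunk; this.entity = entity;
        }
    }

    static Token parseToken(String line) {
        String[] cols = line.trim().split("\\s+");
        String tag = cols[2];
        String entity = null;
        int eq = tag.indexOf('=');
        if (eq >= 0) { // e.g. B-NX=ORG carries a named entity type
            entity = tag.substring(eq + 1);
            tag = tag.substring(0, eq);
        }
        int dash = tag.indexOf('-');
        String bio = dash >= 0 ? tag.substring(0, dash) : tag;     // B, I or O
        String chunk = dash >= 0 ? tag.substring(dash + 1) : null; // e.g. NX
        return new Token(cols[0], cols[1], bio, chunk, entity);
    }

    static List<List<Token>> parse(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\R")) {
            // Header lines and blank lines both mark a sentence boundary.
            boolean boundary = line.trim().isEmpty() || line.startsWith("%%");
            if (!boundary) {
                current.add(parseToken(line));
            } else if (!current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "%% sent no. 1\ndie ART B-NX=ORG\nAWO NN I-NX=ORG\n? $. O\n\n";
        for (Token t : parse(sample).get(0)) {
            System.out.println(t.form + " " + t.pos + " " + t.bio + " " + t.chunk + " " + t.entity);
        }
    }
}
```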
- TüBa-D/Z - German
TuebaDZReader
Reads the Tüba-D/Z chunking format.
ChunkMappingLocation |
Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ChunkTagSet |
Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified that does not have such metadata, or it can be used in readers. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified that does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
readChunk |
Read chunk information. Type: Boolean — Default value: |
readNamedEntity |
Read named entity information. Type: Boolean — Default value: |
readPOS |
Read part-of-speech information. Type: Boolean — Default value: |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.tuebadz-chunk |
---|---|
Outputs |
TüPP-D/Z
Tuepp
TüPP-D/Z is a collection of articles from the German newspaper taz (die tageszeitung), annotated and encoded in an XML format.
- TüPP-D/Z - German
TueppReader
- Only the part-of-speech with the best rank (rank 1) is read. If there is a tie between multiple tags, the first one from the XML file is read.
- Only the first lemma (baseform) from the XML file is read.
- Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
- Article boundaries are not read.
- Paragraph boundaries are not read.
- Lemma information is read, but morphological information is not read.
- Chunk, field, and clause information is not read.
- Meta data headers are not read.
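The first rule in the list above (take the rank 1 tag and, on a tie, the first one in file order) can be sketched as follows. The Reading class and the method name are hypothetical and do not come from the TueppReader source. The sketch assumes that a lower rank number means a better rank.

```java
import java.util.List;

// Hypothetical sketch of "best rank wins, first one wins on a tie".
final class BestRankSketch {
    static final class Reading {
        final String tag;
        final int rank;

        Reading(String tag, int rank) {
            this.tag = tag;
            this.rank = rank;
        }
    }

    // Readings are expected in the order in which they appear in the XML file.
    static String bestTag(List<Reading> readings) {
        Reading best = null;
        for (Reading reading : readings) {
            // The strict '<' keeps the earlier reading when two ranks are equal.
            if (best == null || reading.rank < best.rank) {
                best = reading;
            }
        }
        return best == null ? null : best.tag;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
                new Reading("NN", 2), new Reading("ADJA", 1), new Reading("VVFIN", 1));
        System.out.println(bestTag(readings)); // prints ADJA
    }
}
```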
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
POSTagSet |
Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model metadata. This can be useful if a custom model is specified that does not have such metadata, or it can be used in readers. Optional — Type: String |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.tuepp+xml |
---|---|
Outputs |
UIMA Binary CAS
BinaryCas
The CAS is the native data model used by UIMA. There are various ways of saving CAS data, using XMI, XCAS, or binary formats. This module supports the binary formats.
BinaryCasReader
UIMA Binary CAS formats reader.
addDocumentMetadata |
Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
mergeTypeSystem |
Determines whether the type system from a currently read file should be merged with the current type system. Type: Boolean — Default value: |
overrideDocumentMetadata |
Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
typeSystemLocation |
The location from which to obtain the type system when the CAS is stored in form 0. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.uima+binary |
---|---|
Outputs |
none specified |
BinaryCasWriter
Write CAS in one of the UIMA binary formats.
All the supported formats except 6+ can also be loaded and saved via the UIMA CasIOUtils.
Format | Description | Type system on load | CAS Addresses preserved |
---|---|---|---|
SERIALIZED or S | CAS structures are dumped to disc as they are using Java serialization (CASSerializer). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS. | must be the same | yes |
SERIALIZED_TSI or S+ | CAS structures are dumped to disc as they are using Java serialization as in form 0, but now using the CASCompleteSerializer, which includes CAS metadata like type system and index repositories. | is reinitialized | yes |
BINARY or 0 | CAS structures are dumped to disc as they are using Java serialization (CASSerializer). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS. | must be the same | yes |
BINARY_TSI or 0 | The same as BINARY, except that the type system and index configuration are also stored in the file. However, lenient loading or reinitializing the CAS with this information is presently not supported. | must be the same | yes |
COMPRESSED or 4 | UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0. | must be the same | yes |
COMPRESSED_FILTERED or 6 | UIMA binary serialization as format 4, but saving only reachable feature structures. | must be the same | no |
6+ | This is a legacy format specific to DKPro Core. Since UIMA 2.9.0, COMPRESSED_FILTERED_TSI is supported and should be used instead of this format. UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system. | lenient loading | no |
COMPRESSED_FILTERED_TS | Same as COMPRESSED_FILTERED, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system. | lenient loading | no |
COMPRESSED_FILTERED_TSI | Default. UIMA binary serialization as format 6, but also contains the type system definition and index definitions. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system. | lenient loading | no |
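When choosing a format programmatically, the last two columns of the table can be captured in a small lookup structure. The enum below is a hypothetical transcription of the table for illustration only. It is not the UIMA SerialFormat class, and it omits the legacy 6+ format because that name is not a valid Java identifier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transcription of the format table; not a UIMA or DKPro Core API.
final class BinaryFormatTable {
    enum TypeSystemOnLoad { MUST_BE_SAME, REINITIALIZED, LENIENT }

    enum Format {
        SERIALIZED(TypeSystemOnLoad.MUST_BE_SAME, true),
        SERIALIZED_TSI(TypeSystemOnLoad.REINITIALIZED, true),
        BINARY(TypeSystemOnLoad.MUST_BE_SAME, true),
        BINARY_TSI(TypeSystemOnLoad.MUST_BE_SAME, true),
        COMPRESSED(TypeSystemOnLoad.MUST_BE_SAME, true),
        COMPRESSED_FILTERED(TypeSystemOnLoad.MUST_BE_SAME, false),
        COMPRESSED_FILTERED_TS(TypeSystemOnLoad.LENIENT, false),
        COMPRESSED_FILTERED_TSI(TypeSystemOnLoad.LENIENT, false);

        final TypeSystemOnLoad onLoad;
        final boolean addressesPreserved;

        Format(TypeSystemOnLoad onLoad, boolean addressesPreserved) {
            this.onLoad = onLoad;
            this.addressesPreserved = addressesPreserved;
        }
    }

    // Formats that can be loaded into a CAS with a different type system.
    static List<Format> lenientFormats() {
        List<Format> result = new ArrayList<>();
        for (Format format : Format.values()) {
            if (format.onLoad == TypeSystemOnLoad.LENIENT) {
                result.add(format);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lenientFormats());
    }
}
```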
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
The file extension. If this is set to AUTO, then the extension will be chosen based
on the default extension specified by the UIMA SerialFormat class. However, this
only works when using the new long format names (e.g. Type: String — Default value: |
format |
Binary format to produce. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
typeSystemLocation |
Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.uima+binary |
---|---|
Inputs |
SerializedCas
SerializedCasReader
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
typeSystemLocation |
The file from which to obtain the type system if it is not embedded in the serialized CAS. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Outputs |
none specified |
SerializedCasWriter
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
typeSystemLocation |
Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
UIMA JSON
Json
JsonWriter
UIMA JSON format writer.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
jsonContextFormat |
The level of detail to use for the context (i.e. type system) information. Type: String — Default value: |
omitDefaultValues |
Whether to omit fields that have their default values from the JSON output. Type: Boolean — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
prettyPrint |
Whether to pretty-print the JSON output. Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
typeSystemFile |
Location to write the type system to. If this is not set, a file called typesystem.xml will
be written to the XMI output path. If this is set, it is expected to be a file relative
to the current work directory or an absolute file.
Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.uima+json |
---|---|
Inputs |
UIMA XMI
Xmi
One of the official formats supported by UIMA is the XMI format. It is an XML-based format and therefore cannot represent a few specific characters that are invalid in XML, but it is able to capture all the information contained in the CAS. The XMI format is the de-facto standard for exchanging data in the UIMA world. Most UIMA-related tools support it.
The XMI format does not include type system information. It is therefore recommended to always configure the XmiWriter component to also write out the type system to a file.
If you wish to view annotated documents using the UIMA CAS Editor in Eclipse, you can, for example, set up your XmiWriter in the following way to write out XMIs and a type system file:
AnalysisEngineDescription xmiWriter =
    AnalysisEngineFactory.createEngineDescription(
        XmiWriter.class,
        XmiWriter.PARAM_TARGET_LOCATION, ".",
        XmiWriter.PARAM_TYPE_SYSTEM_FILE, "typesystem.xml");
XmiReader
Reader for UIMA XMI files.
addDocumentMetadata |
Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: |
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
lenient |
In lenient mode, unknown types are ignored and do not cause an exception to be thrown. Type: Boolean — Default value: |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
mergeTypeSystem |
Determines whether the type system from a currently read file should be merged with the current type system. Type: Boolean — Default value: |
overrideDocumentMetadata |
Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
typeSystemFile |
If a type system is specified, then the type system already in the CAS is replaced by this one. Except if XmiReader#PARAM_MERGE_TYPE_SYSTEM is enabled, in which case it will be merged with the type system already present in the CAS. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/vnd.xmi+xml
application/x.org.dkpro.uima+xmi |
---|---|
Outputs |
XmiWriter
UIMA XMI format writer.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Specify the suffix of output files. Default value Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
prettyPrint |
Format and indent the XML. Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
typeSystemFile |
Location to write the type system to. If this is not set, a file called typesystem.xml will
be written to the XMI output path. If this is set, it is expected to be a file relative
to the current work directory or an absolute file.
Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
application/vnd.xmi+xml
application/x.org.dkpro.uima+xmi |
---|---|
Inputs |
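The interaction of stripExtension and filenameExtension can be pictured with a small helper that derives an output file name from a relative input path. This is a simplified sketch under assumptions, not the writer's actual logic: the class and method names are hypothetical, and compression suffixes, escapeDocumentId, useDocumentId and singularTarget are ignored.

```java
// Hypothetical sketch of output file naming; not the DKPro Core implementation.
final class TargetNameSketch {
    static String targetName(String relativePath, boolean stripExtension,
            String filenameExtension) {
        String name = relativePath;
        if (stripExtension) {
            int slash = name.lastIndexOf('/');
            int dot = name.lastIndexOf('.');
            // Only cut at a dot that belongs to the file name, not to a directory.
            if (dot > slash) {
                name = name.substring(0, dot);
            }
        }
        return name + filenameExtension;
    }

    public static void main(String[] args) {
        System.out.println(targetName("news/doc1.txt", false, ".xmi")); // news/doc1.txt.xmi
        System.out.println(targetName("news/doc1.txt", true, ".xmi"));  // news/doc1.xmi
    }
}
```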
Web1T n-grams
Web1T
The Web1T n-gram corpus is a huge collection of n-grams collected from the internet. The jweb1t library provides efficient access to this corpus. This module supports the file format used by the Web1T n-gram corpus and makes it convenient to create jweb1t indexes.
Web1TWriter
Web1T n-gram index format writer.
contextType |
The type being used for segments. Type: String — Default value: |
createIndexes |
Create the indexes that jWeb1T needs to operate. (default: true) Optional — Type: Boolean — Default value: |
inputTypes |
Types to generate n-grams from. Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams Type: String[] |
lowercase |
Create a lower case index. Optional — Type: Boolean — Default value: |
maxNgramLength |
Maximum n-gram length. Optional — Type: Integer — Default value: |
minFreq |
Specifies the minimum frequency an n-gram must have to be written to the final index. The value is inclusive and the default is 1, so by default all n-grams with a frequency of at least 1 are written. Optional — Type: Integer — Default value: |
minNgramLength |
Minimum n-gram length. Optional — Type: Integer — Default value: |
splitFileTreshold |
The input files are split into smaller files for quick access. A separate file is created if the first two letters of a word (or the first letter, if the word is one character long) account for at least x% of all starting letters in the input files. The default threshold is 1.0%. Words whose starting characters do not reach the threshold are written, together with other such words, into a separate file for miscellaneous words. A high threshold leads to only a few large files and most likely a very large miscellaneous file. A low threshold results in many small files. Use zero or a negative value to write everything to one file. Optional — Type: Float — Default value: |
targetEncoding |
Character encoding of the output data. Optional — Type: String — Default value: |
targetLocation |
Location to which the output is written. Type: String |
Media types |
text/x.org.dkpro.ngram |
---|---|
Inputs |
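The roles of minNgramLength, maxNgramLength and minFreq can be illustrated by counting n-grams over a token list. The sketch below is plain Java and does not use the Web1TWriter or jweb1t. The class and method names are hypothetical, and joining tokens with a single space is an assumption made for readability.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of n-gram counting with length and frequency limits.
final class NgramCountSketch {
    static Map<String, Integer> count(List<String> tokens, int minN, int maxN, int minFreq) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                String ngram = String.join(" ", tokens.subList(start, start + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        // minFreq is inclusive: n-grams seen exactly minFreq times are kept.
        counts.values().removeIf(frequency -> frequency < minFreq);
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, a b=2, b=2}
        System.out.println(count(List.of("a", "b", "a", "b"), 1, 2, 2));
    }
}
```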
WebAnno TSV
WebannoTsv3X
WebannoTsv3XReader
Reads the WebAnno TSV v3.x format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceEncoding |
Character encoding of the input data. Type: String — Default value: |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Outputs |
none specified |
WebannoTsv3XWriter
Writes the WebAnno TSV v3.x format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
The character encoding used for the output files. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
Media types |
none specified |
---|---|
Inputs |
none specified |
Wikipedia via Bliki Engine
BlikiWikipedia
Access the online Wikipedia and extract its contents using the Bliki engine.
BlikiWikipediaReader
Bliki-based Wikipedia reader.
language |
The language of the wiki installation. Type: String |
outputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
pageTitles |
Which page titles should be retrieved. Type: String[] |
sourceLocation |
Wiki API URL, e.g. for the English Wikipedia it should be: http://en.wikipedia.org/w/api.php Type: String |
Media types |
none specified |
---|---|
Outputs |
Wikipedia via JWPL
WikipediaArticle
WikipediaArticleReader
Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
OnlyFirstParagraph |
If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
PageIdFromArray |
Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[] |
PageIdsFromFile |
Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String |
PageTitleFromFile |
Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String |
PageTitlesFromArray |
Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[] |
Password |
The password of the database account. Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
none specified |
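The file expected by PageIdsFromFile is described above as a line-separated list of page ids. A minimal sketch of reading such a file follows. It does not reproduce the reader's own code: the class name is hypothetical, and trimming whitespace and skipping blank lines are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turns a line-separated id file into a list of ids.
final class PageIdListSketch {
    static List<String> parse(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String id = line.trim();
            // Assumption: blank lines carry no id and are skipped.
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    static List<String> read(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("page-ids", ".txt");
        Files.write(file, List.of("12", " 345 ", "", "6"), StandardCharsets.UTF_8);
        System.out.println(read(file)); // prints [12, 345, 6]
        Files.delete(file);
    }
}
```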
WikipediaArticleInfo
WikipediaArticleInfoReader
Reads general information about all articles without retrieving the whole Page objects.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
Password |
The password of the database account. Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
WikipediaDiscussion
WikipediaDiscussionReader
Reads all discussion pages.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
PageIdFromArray |
Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[] |
PageIdsFromFile |
Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String |
PageTitleFromFile |
Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String |
PageTitlesFromArray |
Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[] |
Password |
The password of the database account. Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
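The file-based parameters above expect a line-separated list of page ids or titles. As a minimal sketch of what such a list looks like once parsed, the following helper collects one entry per non-blank line. The class and method names are hypothetical and not part of the reader, and skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class PageIdListSketch {
    // Hypothetical helper: keeps one trimmed entry per non-blank line.
    static List<String> parseIds(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            String id = line.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }
}
```

The lines could come from `java.nio.file.Files.readAllLines(path)`. The resulting list has the same shape as the value passed to the array-based parameter.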
WikipediaLink
WikipediaLinkReader
Read links from Wikipedia.
AllowedLinkTypes |
Which types of links are allowed? Type: String[] |
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
PageIdFromArray |
Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[] |
PageIdsFromFile |
Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String |
PageTitleFromFile |
Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String |
PageTitlesFromArray |
Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[] |
Password |
The password of the database account. Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
WikipediaPage
WikipediaPageReader
Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are skipped, however.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
OnlyFirstParagraph |
If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
PageIdFromArray |
Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[] |
PageIdsFromFile |
Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String |
PageTitleFromFile |
Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String |
PageTitlesFromArray |
Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[] |
Password |
The password of the database account. Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
WikipediaQuery
WikipediaQueryReader
Reads all article pages that match a query built from the parameters of this class.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
MaxCategories |
Maximum number of categories. Articles with a higher number of categories will not be returned by the query. Optional — Type: Integer — Default value: |
MaxInlinks |
Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: |
MaxOutlinks |
Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: |
MaxRedirects |
Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query. Optional — Type: Integer — Default value: |
MaxTokens |
Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query. Optional — Type: Integer — Default value: |
MinCategories |
Minimum number of categories. Articles with a lower number of categories will not be returned by the query. Optional — Type: Integer — Default value: |
MinInlinks |
Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: |
MinOutlinks |
Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: |
MinRedirects |
Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query. Optional — Type: Integer — Default value: |
MinTokens |
Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query. Optional — Type: Integer — Default value: |
OnlyFirstParagraph |
If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
PageIdFromArray |
Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[] |
PageIdsFromFile |
Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String |
PageTitleFromFile |
Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String |
PageTitlesFromArray |
Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[] |
Password |
The password of the database account. Type: String |
TitlePattern |
SQL-style title pattern. Only articles that match the pattern will be returned by the query. Optional — Type: String — Default value: `` |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
none specified |
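The TitlePattern parameter above is described as an SQL-style pattern. As a rough illustration, the sketch below translates the standard SQL LIKE wildcards (`%` for any sequence of characters, `_` for exactly one character) into a regular expression. It is an assumption that the reader uses exactly these wildcards. The class name is hypothetical, and the matching here is case-sensitive, which a database may handle differently.

```java
import java.util.regex.Pattern;

class TitlePatternSketch {
    // Converts a SQL LIKE pattern into an equivalent regular expression.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");   // any sequence of characters
            } else if (c == '_') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True if the whole title matches the LIKE pattern.
    static boolean matches(String like, String title) {
        return likeToRegex(like).matcher(title).matches();
    }
}
```

Under these assumptions, a pattern such as `Comput%` would select titles that start with "Comput".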
WikipediaRevision
WikipediaRevisionReader
Reads Wikipedia page revisions.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
Password |
The password of the database account. Type: String |
RevisionIdFromArray |
Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[] |
RevisionIdsFromFile |
Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
WikipediaRevisionPair
WikipediaRevisionPairReader
Reads pairs of adjacent revisions of all articles.
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
Host |
The host server. Type: String |
Language |
The language of the Wikipedia that should be connected to. Type: String |
MaxChange |
Restrict revision pairs to cases where the lengths of the two revisions do not differ by more than this value (counted in characters). Type: Integer — Default value: |
MinChange |
Restrict revision pairs to cases where the lengths of the two revisions differ by more than this value (counted in characters). Type: Integer — Default value: |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
Password |
The password of the database account. Type: String |
RevisionIdFromArray |
Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[] |
RevisionIdsFromFile |
Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String |
SkipFirstNPairs |
The number of revision pairs that should be skipped in the beginning. Optional — Type: Integer |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
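The MinChange and MaxChange parameters above restrict revision pairs by how much the revision length changes. The following sketch shows one way to read those two descriptions together. The class and method names are hypothetical, and using the absolute difference of the character counts is an assumption.

```java
class RevisionPairFilterSketch {
    // Keeps a pair when the length difference is greater than minChange
    // and not greater than maxChange (both counted in characters).
    static boolean accept(String before, String after, int minChange, int maxChange) {
        int change = Math.abs(after.length() - before.length());
        return change > minChange && change <= maxChange;
    }
}
```

With `minChange = 0` and `maxChange = 10`, for example, pairs of identical length are dropped and pairs that differ by one to ten characters are kept.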
WikipediaTemplateFilteredArticle
WikipediaTemplateFilteredArticleReader
Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.
It is possible to define only a whitelist or only a blacklist. If both are provided, the reader selects the articles that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").
This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.
NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase
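The whitelist and blacklist selection described above can be sketched on toy data as follows. This is not the reader's implementation, which works on the JWPL template tables. The class name and the in-memory map are hypothetical. The sketch also assumes that a page passes the whitelist if it contains at least one whitelisted template, and that an empty set means the list was not defined.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TemplateFilterSketch {
    // pageTemplates maps a page title to the template names used on that page.
    static Set<String> select(Map<String, Set<String>> pageTemplates,
                              Set<String> whitelist, Set<String> blacklist) {
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, Set<String>> page : pageTemplates.entrySet()) {
            Set<String> templates = page.getValue();
            // An empty whitelist is treated as "no whitelist defined".
            boolean passesWhitelist = whitelist.isEmpty()
                    || !Collections.disjoint(templates, whitelist);
            // An empty blacklist never rejects a page.
            boolean passesBlacklist = Collections.disjoint(templates, blacklist);
            if (passesWhitelist && passesBlacklist) {
                selected.add(page.getKey());
            }
        }
        return selected;
    }
}
```

When both lists are given, only pages that pass both checks remain, which corresponds to the intersection described above.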
CreateDBAnno |
Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: |
Database |
The name of the database. Type: String |
DoubleCheckAssociatedPages |
If this option is set, discussion pages associated with a blacklisted article are rejected. Analogously, articles associated with a blacklisted discussion page are rejected. This check is rather expensive and can take a long time. This option is not active if only a whitelist is used. (default: false) Type: Boolean — Default value: |
ExactTemplateMatching |
Defines whether to match the templates exactly or to match all templates that start with the string given in the respective parameter list. (default: true) Type: Boolean — Default value: |
Host |
The host server. Type: String |
IncludeDiscussions |
Whether the reader should also include talk pages. Type: Boolean — Default value: |
Language |
The language of the Wikipedia that should be connected to. Type: String |
LimitNUmberOfArticlesToRead |
Optional parameter that sets the maximum number of articles the reader delivers. This avoids unnecessary filtering if only a small number of articles is needed. Optional — Type: Integer |
OnlyFirstParagraph |
If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: |
OutputPlainText |
Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: |
PageBuffer |
The page buffer size (#pages) of the page iterator. Type: Integer — Default value: |
Password |
The password of the database account. Type: String |
TemplateBlacklist |
Defines templates that the articles MUST NOT contain. If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[] |
TemplateWhitelist |
Defines templates that the articles MUST contain. If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[] |
User |
The username of the database account. Type: String |
Media types |
none specified |
---|---|
Outputs |
XCES-XML
XcesBasicXml
XcesBasicXmlReader
Reader for the basic XCES XML format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.xces-basic+xml |
---|---|
Outputs |
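The patterns parameter above distinguishes include and exclude patterns by the prefixes `[+]` and `[-]`. The sketch below shows how such a list could be split into the two groups. The class is hypothetical, and treating a pattern without a prefix as an include pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class PatternSplitSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    final List<String> includes = new ArrayList<>();
    final List<String> excludes = new ArrayList<>();

    // Sorts each pattern into includes or excludes and strips its prefix.
    PatternSplitSketch(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.startsWith(EXCLUDE_PREFIX)) {
                excludes.add(pattern.substring(EXCLUDE_PREFIX.length()));
            } else if (pattern.startsWith(INCLUDE_PREFIX)) {
                includes.add(pattern.substring(INCLUDE_PREFIX.length()));
            } else {
                // Assumption: a pattern without a prefix counts as an include.
                includes.add(pattern);
            }
        }
    }
}
```

For example, `new PatternSplitSketch("[+]**/*.xml", "[-]**/draft/**")` yields one include pattern and one exclude pattern.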
XcesBasicXmlWriter
Writer for the basic XCES XML format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.xces-basic+xml |
---|---|
Inputs |
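The escapeDocumentId parameter above URL-encodes the document ID so that it can be used as a file name. A minimal sketch of that idea, using the standard `java.net.URLEncoder`, is shown below. The class and method names are hypothetical, and the writer may encode the ID differently in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DocumentIdEscapeSketch {
    // URL-encodes the document ID and appends the file name extension.
    static String toFileName(String documentId, String extension) {
        try {
            return URLEncoder.encode(documentId, "UTF-8") + extension;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }
}
```

Characters such as `\`, `:` and `/` are replaced by percent escapes, so the result contains no path separators.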
XcesXml
XcesXmlReader
Reader for the XCES XML format.
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.xces+xml |
---|---|
Outputs |
XcesXmlWriter
Writer for the XCES XML format.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filenameExtension |
Use this filename extension. Type: String — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetEncoding |
Character encoding of the output data. Type: String — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/x.org.dkpro.xces+xml |
---|---|
Inputs |
XML
InlineXml
InlineXmlWriter
Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.
Note that this component inherits the restrictions from CasToInlineXml:
- Features whose values are FeatureStructures are not represented.
- Feature values which are strings longer than 64 characters are truncated.
- Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ].
- The Subject of analysis is presumed to be a text string.
- Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters aren't valid in XML documents.
- It does not work for overlapping annotations, because these cannot be represented as properly nested XML.
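The last restriction concerns annotations that overlap without one containing the other, since such spans cannot be written as nested elements. The sketch below checks two spans for this condition. The class is hypothetical, and the spans are assumed to be begin/end character offsets with an exclusive end.

```java
class SpanNestingSketch {
    // Two spans cross when they overlap without one containing the other.
    static boolean crosses(int begin1, int end1, int begin2, int end2) {
        boolean overlap = begin1 < end2 && begin2 < end1;
        boolean firstContainsSecond = begin1 <= begin2 && end2 <= end1;
        boolean secondContainsFirst = begin2 <= begin1 && end1 <= end2;
        return overlap && !firstContainsSecond && !secondContainsFirst;
    }
}
```

Spans such as (0, 5) and (3, 8) cross and have no well-formed inline XML form, whereas (0, 10) and (3, 8) nest cleanly.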
Xslt |
XSLT stylesheet to apply. Optional — Type: String |
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: |
Media types |
application/xml
text/xml |
---|---|
Inputs |
Xml
XmlReader
Reader for XML files.
DocIdTag |
Tag which contains the docId. Optional — Type: String |
ExcludeTag |
Optional. Tags that should not be processed. No text is extracted from them and no annotations are produced for them. Type: String[] — Default value: |
IncludeTag |
Optional. Tags that should be processed (if empty, all tags except those listed in ExcludeTag are processed). Type: String[] — Default value: |
collectionId |
The collection ID to set in the DocumentMetaData. Optional — Type: String |
language |
Set this as the language of the produced documents. Optional — Type: String |
sourceLocation |
Location from which the input is read. Type: String |
Media types |
application/xml
text/xml |
---|---|
Outputs |
XmlText
XmlTextReader
includeHidden |
Include hidden files and directories. Type: Boolean — Default value: |
language |
Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String |
logFreq |
The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Optional — Type: String[] |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
Media types |
application/xml
text/xml |
---|---|
Outputs |
XmlXPath
XmlXPathReader
A component reader for XML files implemented with XPath.
This is currently optimized for the TREC format, that is, the style in which topics are presented. Provide the XPath expression of the parent node as a parameter; the child nodes of each parent node are then stored in that node's own CAS.
If your expression evaluates to leaf nodes, empty CASes will be created.
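The behaviour described above can be illustrated with the standard `javax.xml.xpath` API. The sketch below evaluates a root expression and builds one text segment per matched node from its child elements. It is not the reader's code: the class name is hypothetical, and joining the child texts with line breaks is an assumption about what ends up in each CAS.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSegmentSketch {
    // Returns one text segment per node matched by rootXPath,
    // built from the text of that node's child elements.
    static List<String> segments(String xml, String rootXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList parents = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(rootXPath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < parents.getLength(); i++) {
                StringBuilder text = new StringBuilder();
                NodeList children = parents.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        if (text.length() > 0) {
                            text.append('\n');
                        }
                        text.append(child.getTextContent().trim());
                    }
                }
                // A matched leaf node has no child elements, so its segment stays empty.
                result.add(text.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + rootXPath, e);
        }
    }
}
```

With a TREC-like input, an expression that selects each topic element yields one segment per topic, while an expression that selects leaf elements yields only empty segments.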
caseSensitive |
States whether matching is case-sensitive. (default: true) Optional — Type: Boolean — Default value: |
docIdTag |
Tag which contains the docId. If it is given, the reader ensures that each document contains only one ID tag and that it is not empty. Optional — Type: String |
excludeTags |
Tags which should be ignored. If empty then all tags will be processed. If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: |
includeTags |
Tags which should be worked on. If empty then all tags will be processed. If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: |
language |
Language of the documents. If given, it will be set in each CAS. Optional — Type: String |
patterns |
A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+]
if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern.
The wildcard Type: String[] |
rootXPath |
Specifies the XPath expression that selects all nodes to be processed. Different segments will be separated via PARAM_ID_TAG, and each segment will be stored in a separate CAS. Type: String |
sourceLocation |
Location from which the input is read. Optional — Type: String |
useDefaultExcludes |
Use the default excludes. Type: Boolean — Default value: |
workingDir |
Specifies substitutions for tag names in the CAS. Give each substitution in before/after order. For example, to substitute "foo" with "bar" and "hey" with "ho", provide { "foo", "bar", "hey", "ho" }. Optional — Type: String[] |
Media types |
application/xml
text/xml |
---|---|
Outputs |
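Two of the parameter descriptions above imply small computations: the set of tags that is actually processed (includeTags minus excludeTags), and the flat before/after array used for tag substitution. The sketch below illustrates both. The class and method names are hypothetical, and rejecting an array of odd length is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class TagConfigSketch {
    // Computes includeTags minus excludeTags, keeping the include order.
    static Set<String> effectiveTags(String[] includeTags, String[] excludeTags) {
        Set<String> tags = new LinkedHashSet<>(Arrays.asList(includeTags));
        tags.removeAll(Arrays.asList(excludeTags));
        return tags;
    }

    // Reads a flat { before, after, before, after, ... } array into a map.
    static Map<String, String> substitutions(String[] pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Substitutions must come in before/after pairs");
        }
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            map.put(pairs[i], pairs[i + 1]);
        }
        return map;
    }
}
```

For the example given in the parameter description, `substitutions(new String[] { "foo", "bar", "hey", "ho" })` maps "foo" to "bar" and "hey" to "ho".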