DKPro Core™ Format Reference

This document provides detailed information about the DKPro Core input and output formats.

Overview

Table 1. Formats (73)
Format	Reader	Writer
AclAnthology	AclAnthologyReader	none
Ancora	AncoraReader	none
AnnotatedGigaword	AnnotatedGigawordReader	none
BinaryCas	BinaryCasReader	BinaryCasWriter
BlikiWikipedia	BlikiWikipediaReader	none
Bnc	BncReader	none
Brat	BratReader	BratWriter
Combination	CombinationReader	none
Concrete	ConcreteReader	ConcreteWriter
Conll2000	Conll2000Reader	Conll2000Writer
Conll2002	Conll2002Reader	Conll2002Writer
Conll2003	Conll2003Reader	Conll2003Writer
Conll2006	Conll2006Reader	Conll2006Writer
Conll2008	Conll2008Reader	Conll2008Writer
Conll2009	Conll2009Reader	Conll2009Writer
Conll2012	Conll2012Reader	Conll2012Writer
ConllCoreNlp	ConllCoreNlpReader	ConllCoreNlpWriter
ConllU	ConllUReader	ConllUWriter
DiTop	none	DiTopWriter
Frequency	none	FrequencyWriter
Html	HtmlReader	none
HtmlDocument	HtmlDocumentReader	none
ImsCwb	ImsCwbReader	ImsCwbWriter
InlineXml	none	InlineXmlWriter
Jdbc	JdbcReader	none
Json	none	JsonWriter
Lcc	LccReader	none
Lif	LifReader	LifWriter
Lxf	LxfReader	LxfWriter
MalletLdaTopicProportions	none	MalletLdaTopicProportionsWriter
MalletLdaTopicsProportionsSorted	none	MalletLdaTopicsProportionsSortedWriter
NegraExport	NegraExportReader	none
Nif	NifReader	NifWriter
Nitf	NitfReader	none
Pdf	PdfReader	none
PennTreebankChunked	PennTreebankChunkedReader	none
PennTreebankCombined	PennTreebankCombinedReader	PennTreebankCombinedWriter
Perseus	PerseusReader	none
PubAnnotation	PubAnnotationReader	PubAnnotationWriter
RTF	RTFReader	none
Reuters21578Sgml	Reuters21578SgmlReader	none
Reuters21578Txt	Reuters21578TxtReader	none
SerializedCas	SerializedCasReader	SerializedCasWriter
Solr	none	SolrWriter
String	StringReader	none
TGrep	none	TGrepWriter
Tcf	TcfReader	TcfWriter
Tei	TeiReader	TeiWriter
Text	TextReader	TextWriter
TfIdf	none	TfIdfWriter
TigerXml	TigerXmlReader	TigerXmlWriter
Tika	TikaReader	none
TokenizedText	none	TokenizedTextWriter
TuebaDZ	TuebaDZReader	none
Tuepp	TueppReader	none
Web1T	none	Web1TWriter
WebannoTsv3X	WebannoTsv3XReader	WebannoTsv3XWriter
WikipediaArticle	WikipediaArticleReader	none
WikipediaArticleInfo	WikipediaArticleInfoReader	none
WikipediaDiscussion	WikipediaDiscussionReader	none
WikipediaLink	WikipediaLinkReader	none
WikipediaPage	WikipediaPageReader	none
WikipediaQuery	WikipediaQueryReader	none
WikipediaRevision	WikipediaRevisionReader	none
WikipediaRevisionPair	WikipediaRevisionPairReader	none
WikipediaTemplateFilteredArticle	WikipediaTemplateFilteredArticleReader	none
XcesBasicXml	XcesBasicXmlReader	XcesBasicXmlWriter
XcesXml	XcesXmlReader	XcesXmlWriter
Xmi	XmiReader	XmiWriter
Xml	XmlReader	none
XmlDocument	XmlDocumentReader	XmlDocumentWriter
XmlText	XmlTextReader	none
XmlXPath	XmlXPathReader	none

I/O components

ACL Anthology

AclAnthology

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-aclanthology-asl

Known corpora in this format

ACL Anthology Reference Corpus (ACL ARC)

AclAnthologyReader

Implementation

org.dkpro.core.io.aclanthology.AclAnthologyReader

Description

Reads the ACL anthology corpus and outputs CASes with plain text documents.

The reader tries to strip out hyphenation and replace problematic characters to produce a cleaned text. Otherwise, it is a plain text reader.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 2. Capabilities
Media types	text/plain
Outputs	DocumentMetaData

AnCora

Ancora

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-ancora-asl

AncoraReader

Implementation

org.dkpro.core.io.ancora.AncoraReader

Description

Read AnCora XML format.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
dropSentencesMissingPosTags	Whether to ignore sentences in which any POS tags are missing. Normally, it is assumed that if any POS tags are present, then every token has a POS tag. Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readLemma	Write lemma annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceLocation	Location from which the input is read. Optional — Type: String
splitMultiWordTokens	Whether to split words containing underscores into multiple tokens. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 3. Capabilities
Media types	application/x.org.dkpro.ancora+xml application/xml
Outputs	POS DocumentMetaData Lemma Sentence Token

brat file format

Brat

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-brat-asl

This format is the native format of the brat rapid annotation tool. Its official documentation is available on the brat project website.

In general, the format consists of two files for each document:

an .ann file containing the annotations. These are the files you need to point the PARAM_SOURCE_LOCATION parameter of the BratReader to.
a plain text file (.txt) containing the document text in UTF-8. These files need to be next to the corresponding .ann files and have the same name, just with the .txt extension instead of .ann extension.
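
The naming convention for the two files can be sketched as follows. This is an illustrative Python snippet, not DKPro Core code; the helper name `text_file_for` is made up for this example.

```python
from pathlib import Path


def text_file_for(ann_path):
    """Return the path of the .txt file expected next to a brat .ann file."""
    path = Path(ann_path)
    if path.suffix != ".ann":
        raise ValueError(f"not a brat annotation file: {ann_path}")
    # Same directory, same base name, only the extension differs.
    return path.with_suffix(".txt")
```

For an annotation file `corpus/doc1.ann`, the helper yields `corpus/doc1.txt`.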

The brat format supports different types of annotations which start with different letters in the .ann file:

Table 4. brat annotation types
Type	Letter	Comment
Text annotations	`T`
Event annotations	`E`
Relation annotations	`R`
Note annotations	`#`
Normalization annotations	`N`	currently not supported by DKPro Core

Attributes

Additionally, attributes (A) can be attached to annotations. Note that DKPro Core supports attributes on relations, but the brat tool itself can only deal with attributes on text annotations and events. The BratReader will try to store the values of attributes in correspondingly named features on the target UIMA types.
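
As a rough illustration of how the leading letter identifies the kind of annotation, the following Python sketch classifies a single line of an .ann file. It is not part of DKPro Core; the names `BRAT_KINDS` and `classify_ann_line` are made up for this example, and the sketch only covers the letters mentioned in this section.

```python
# Illustrative only: first character of a brat annotation ID -> annotation kind.
BRAT_KINDS = {
    "T": "text",
    "E": "event",
    "R": "relation",
    "A": "attribute",
    "#": "note",
    "N": "normalization",
}


def classify_ann_line(line):
    """Return the annotation kind of one .ann line, or None if blank/unknown."""
    line = line.strip()
    if not line:
        return None
    return BRAT_KINDS.get(line[0])
```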

Reading the brat format

The DKPro Core BratReader tries its best to map a given brat file into the UIMA type system of the CAS it is given. Thus, the BratReader is not bound strictly to the pre-defined DKPro Core types, but supports any custom types as well. Since the type names in UIMA are typically long (e.g. de.tudarmstadt.ukp.dkpro.core.api.ner.type.Location) and the names used in brat tend to be short (e.g. LOC), an explicit mapping is usually required. This mapping can be provided as JSON which needs to be passed to the PARAM_MAPPINGS parameter of the BratReader. Note that the parameter takes actual JSON, not the path to a JSON file.

The mapping file consists of five sections: text type mapping, relation type mapping, span mapping, relation mapping and comment mapping.

Mappings JSON file: high-level structure

{
  'textTypeMapppings': [ ... ],
  'relationTypeMapppings': [ ... ],
  'spans': [ ... ],
  'relations': [ ... ],
  'comments': [ ... ]
}

Type mappings

The type mappings (span and relation) indicate how to find the UIMA type for a given brat annotation. Each type mapping contains two mandatory fields:

from: this field is a regular expression which matches the annotation name used by brat. Note that dashes (-) in the brat name must be replaced by dots (.) or escaped dots (\.) to match here! It is also possible to match multiple brat annotations at once using regular expressions such as (PER|LOC) or .*-LOC.
to: this is the UIMA type to map to.

The order of the mappings matters: brat annotations are matched against them in the order in which they are defined in the mappings file. This makes it possible, for example, to put a catch-all mapping with 'from': '.*' at the end, which matches all brat annotations not matched by a previous mapping.
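
The first-match behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the BratReader implementation: the function name `resolve_type` and the `custom.*` type names are hypothetical, and the sketch assumes that the `from` expression has to match the whole brat annotation name.

```python
import re


def resolve_type(brat_name, mappings):
    """Return the 'to' type of the first mapping whose 'from' regex matches."""
    for mapping in mappings:
        # Assumption: the expression must match the complete annotation name.
        if re.fullmatch(mapping["from"], brat_name):
            return mapping["to"]
    return None


MAPPINGS = [
    {"from": "(PER|LOC)", "to": "custom.Entity"},
    {"from": ".*", "to": "custom.Other"},  # catch-all, deliberately last
]
```

With this list, `LOC` resolves to `custom.Entity` and any other name falls through to `custom.Other`. If the catch-all entry were listed first, it would shadow every later mapping.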

Mapping text annotations

For the purpose of mapping, brat event (E) and text (T) annotations are both considered text type annotations.

Example: Mapping brat text-type annotations to UIMA types

{
  'textTypeMapppings': [
    {
      'from': 'LOC',
      'to': 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.Location'
    },
    {
      'from': 'PER',
      'to': 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.Person'
    },
    ...
  ],
  ...
}

In addition to the textTypeMapppings section, there is a spans section. This can be used to further configure any annotations of a given UIMA type that are created by the reader. In addition to the defaultFeatureValues option (see further below), there is the option to store the original brat annotation name in a feature indicated by subCatFeature. The example below stores the name of the brat annotation in the value feature of the NamedEntity type.

Example: Span mappings

{
  'spans': [
    {
      'type': 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity',
      'subCatFeature': 'value',
      'defaultFeatureValues': {
        'identity': 'none'
      }
    },
    ...
  ],
  ...
}

Mapping events

Event annotations (E) from brat are basically treated like text annotations (T). However, events can have multiple arguments in brat and these arguments point to other annotations. The BratReader will try to store these argument values in the target UIMA type in corresponding feature values.

For example if the brat file contains an event annotation as shown below, the target UIMA type for the brat pred annotation should have a feature subject and a feature object which would be able to accept the type of annotation to which the brat entity annotation is mapped.

T1 pred 5 10	likes
T2 entity 0 4	John
T3 entity 11 16	pizza
E1 pred:T1 subject:T2 object:T3
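
To make the structure of the event line concrete, here is a small Python sketch that splits it into its trigger and its named arguments. It is illustrative only and not how the BratReader is implemented; the function name `parse_event` is made up. The sketch splits on any whitespace, so it accepts both the tab-separated form used in real .ann files and the space-separated rendering shown above.

```python
def parse_event(line):
    """Split a brat event line into ID, event type, trigger and arguments."""
    event_id, trigger, *args = line.split()
    event_type, trigger_id = trigger.split(":", 1)
    # Each remaining field has the form role:target, e.g. subject:T2.
    arguments = dict(arg.split(":", 1) for arg in args)
    return {
        "id": event_id,
        "type": event_type,
        "trigger": trigger_id,
        "arguments": arguments,
    }
```

For the line `E1 pred:T1 subject:T2 object:T3`, the `arguments` entry contains the roles `subject` and `object`, which correspond to the features the target UIMA type is expected to provide.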

Mapping relations

Relation annotations can be mapped in the same way.

Example: Mapping brat relation annotations to UIMA types

{
  'relationTypeMapppings': [
    {
      'from': 'nsubj|obj|iobj',
      'to': 'de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency'
    },
    ...
  ],
  ...
}

In addition to the relationTypeMapppings section, there is a relations section. Here, the features used to represent the relation end points can be configured. The example matches all brat relation annotations which have been mapped to the Dependency UIMA type. The first argument from the brat relation is mapped to the source feature while the second argument is mapped to the target feature. The option flags1 or flags2 can be set to A to indicate that the offsets of the first or the second argument, respectively, are used as the offsets of the created UIMA annotation. Also, the subCatFeature and defaultFeatureValues options already mentioned for the span mappings are supported.

Example: Relation mappings

{
  'relations': [
    {
      'type': 'de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency',
      'arg1': 'source',
      'arg2': 'target',
      'flags2': 'A',
      'subCatFeature': 'DependencyType',
      'defaultFeatureValues': {
        'flavour': 'basic'
      }
    },
    ...
  ],
  ...
}
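
The effect of the individual options in a relations entry can be illustrated with the following Python sketch. It is a simplified model and not the BratReader implementation: the function name `build_relation` and the dictionary layout of the arguments are made up, and leaving the offsets unset when neither flag is `A` is an assumption of this sketch, since the text above does not describe that case.

```python
def build_relation(mapping, label, arg1, arg2):
    """Model how one 'relations' entry shapes the created annotation.

    arg1 and arg2 are dicts with the keys 'id', 'begin' and 'end'.
    """
    # Start from the defaults, then fill in the end points and the label.
    features = dict(mapping.get("defaultFeatureValues", {}))
    features[mapping["arg1"]] = arg1["id"]
    features[mapping["arg2"]] = arg2["id"]
    if "subCatFeature" in mapping:
        features[mapping["subCatFeature"]] = label

    # The argument flagged with 'A' provides the offsets of the annotation.
    anchor = None
    if mapping.get("flags1") == "A":
        anchor = arg1
    elif mapping.get("flags2") == "A":
        anchor = arg2
    offsets = (anchor["begin"], anchor["end"]) if anchor else None  # assumption

    return {"type": mapping["type"], "offsets": offsets, "features": features}
```

Applied to the example mapping above, a brat relation labelled `nsubj` takes its offsets from the second argument, stores the label in `DependencyType`, and carries the default `flavour` value `basic`.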

Mapping brat comments to UIMA

The comment field of annotations is the only free text field in brat (all others have a controlled vocabulary). Sometimes the field is indeed used for comments. But sometimes, the field is also used to store actual tags. In order to map comments to UIMA, a comments section needs to be added to the mapping file. A comment mapping then consists of these items:

type: the name of a UIMA type to which the brat annotation was matched.
feature: the feature of the UIMA type where the comment value should be stored
match (optional): a regular expression indicating when to use this mapping rule.
replace (optional): can be used to modify the value stored in the UIMA feature. If the match field includes capturing groups in its regular expression, these can be accessed here e.g. using $1. This can be used to normalize values.

Mind that the same type can appear multiple times if the comment field should be mapped to different features depending on the comment value. The example below maps the comment value to the value feature if the comment is PER, LOC, ORG or MISC. However, if the comment is a URL, then it is mapped into the identifier feature.

Example: Mapping brat comments to UIMA features

{
  'comments': [
    {
      'type': 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity',
      'feature': 'value',
      'match': '^(PER|LOC|ORG|MISC)$'
    },
    {
      'type': 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity',
      'feature': 'identifier',
      'match': '^http://.*$'
    },
    ...
  ],
  ...
}
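
The routing of comment values to features can be sketched as follows. This Python snippet is illustrative only and not the BratReader implementation: the names `COMMENT_RULES` and `route_comment` are made up, and stopping at the first matching rule is an assumption of this sketch. Note also that Python writes group references in the replacement as `\1`, whereas the mapping file uses `$1`.

```python
import re

# The two rules from the example above, reduced to the fields used here.
COMMENT_RULES = [
    {"feature": "value", "match": r"^(PER|LOC|ORG|MISC)$"},
    {"feature": "identifier", "match": r"^http://.*$"},
]


def route_comment(comment, rules):
    """Return (feature, value) for the first applicable rule, else None."""
    for rule in rules:
        pattern = rule.get("match")
        # A rule without 'match' applies to every comment.
        if pattern is None or re.search(pattern, comment):
            value = comment
            if pattern is not None and "replace" in rule:
                value = re.sub(pattern, rule["replace"], comment)
            return rule["feature"], value
    return None
```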

Default feature values (text-type and relation annotations)

It may be desirable to set certain UIMA features as part of the conversion. E.g. when reading dependency relation annotations, it may be useful to set the flavour feature of the DKPro Core Dependency type to basic. This can be done by adding a defaultFeatureValues section to the mapping.

Example: Default feature values

{
  'relationTypeMapppings': [
    {
      'from': 'nsubj|obj|iobj',
      'to': 'de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency',
      'defaultFeatureValues': {
        'flavour': 'basic'
      }
    },
    ...
  ],
  ...
}

Another use-case of default feature values is if the brat annotation label is actually a concatenation of multiple tags which should be split up into multiple features at the UIMA level:

Example: Multiple default feature values

{
  'textTypeMapppings': [
    {
      'from': 'top-left',
      'to': 'custom.Direction',
      'defaultFeatureValues': {
        'horizontal': 'left',
        'vertical': 'top'
      }
    },
    {
      'from': 'bottom-right',
      'to': 'custom.Direction',
      'defaultFeatureValues': {
        'horizontal': 'right',
        'vertical': 'bottom'
      }
    },
    ...
  ],
  ...
}

Segmentation

Note that the brat annotation format does not have a built-in concept of token or sentence boundaries. So unless these are explicitly annotated in the brat file and mapped to the DKPro Core Token and Sentence types, there will not be any such annotations available. If you apply a segmenter component (e.g. the DKPro Core BreakIteratorSegmenter) to the output of the reader, you will get token and sentence boundaries, but they might not coincide with the annotation boundaries read from the brat file.

BratReader

Implementation

org.dkpro.core.io.brat.BratReader

Description

Reader for the brat format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mapping	Configuration Optional — Type: String
noteMappings	Mapping of brat notes to particular features. Optional — Type: String[] — Default value: `[]`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
relationTypeMappings	Mapping of brat relation annotations to UIMA types, e.g. : `SUBJ -> de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency` Optional — Type: String[]
relationTypes	Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. `de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}` Additionally, a subcategorization feature may be specified. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}]`
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
textAnnotationTypeMappings	Mapping of brat text annotations (entities or events) to UIMA types, e.g. : `Country -> de.tudarmstadt.ukp.dkpro.core.api.ner.type.Location` Optional — Type: String[]
textAnnotationTypes	Using this parameter is only necessary to specify a subcategorization feature for text and event annotation types. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Optional — Type: String[] — Default value: `[]`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 5. Capabilities
Media types	application/x.org.dkpro.brat
Outputs	none specified

BratWriter

Implementation

org.dkpro.core.io.brat.BratWriter

Description

Writer for the brat annotation format.

Known issues:

Brat is unable to read relation attributes created by this writer.
PARAM_TYPE_MAPPINGS not implemented yet

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
enableTypeMappings	Enable type mappings. Type: Boolean — Default value: `false`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
excludeTypes	Types that will not be written to the exported file. Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence]`
filenameExtension	Specify the suffix of output files. Default value `.ann`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.ann`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
palette	Colors to be used for the visual configuration that is generated for brat. Optional — Type: String[] — Default value: `[#8dd3c7, #ffffb3, #bebada, #fb8072, #80b1d3, #fdb462, #b3de69, #fccde5, #d9d9d9, #bc80bd, #ccebc5, #ffed6f]`
relationTypes	Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. `de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent`. Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent]`
shortAttributeNames	Whether to render attributes by their short name or by their qualified name. Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
spanTypes	Types that are text annotations (aka entities or spans). Type: String[] — Default value: `[]`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
textFilenameExtension	Specify the suffix of text output files. Default value `.txt`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.txt`
typeMappings	FIXME Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.ner.type.(\\w+) → $1]`
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeNullAttributes	Enable writing of features with null values. Type: Boolean — Default value: `false`
writeRelationAttributes	The brat web application can currently not handle attributes on relations, thus they are disabled by default. Here they can be enabled again. Type: Boolean — Default value: `false`

Table 6. Capabilities
Media types	application/x.org.dkpro.brat
Inputs	none specified

British National Corpus

Bnc

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-bnc-asl

Known corpora in this format

British National Corpus

BncReader

Implementation

org.dkpro.core.io.bnc.BncReader

Description

Reader for the British National Corpus (XML version).

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 7. Capabilities
Media types	application/x.org.dkpro.bnc+xml
Outputs	POS DocumentMetaData Lemma Sentence Token

Combination

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-combination-asl

CombinationReader

Implementation

org.dkpro.core.io.combination.CombinationReader

Description

Combines multiple readers into a single reader.

Parameters

readers	Locations of UIMA reader description files. Type: String[]

Table 8. Capabilities
Media types	none specified
Outputs	none specified

CoNLL

Conll2000

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2000 format represents POS and chunk tags. Fields in a line are separated by spaces. Sentences are separated by a blank line.

Table 9. Columns
Column	Type	Description
FORM	Token	token
POSTAG	POS	part-of-speech tag
CHUNK	Chunk	chunk (IOB2 encoded, as in the example below)

Example

He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
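
The following Python sketch shows one way to read data in this layout and to group the tokens into chunks. It is an illustration of the format, not the Conll2000Reader implementation; the function names `read_conll2000` and `chunks` are made up. The grouping starts a new chunk on every `B-` tag and also whenever the chunk type changes, so it tolerates both common IOB variants.

```python
def read_conll2000(text):
    """Parse CoNLL 2000 text into sentences of (form, pos, chunk_tag) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, pos, chunk_tag = line.split()
        current.append((form, pos, chunk_tag))
    if current:
        sentences.append(current)
    return sentences


def chunks(sentence):
    """Group the tokens of one sentence into (chunk_type, [forms]) pairs."""
    result = []
    prev_type = None  # chunk type of the previous token, None after "O"
    for form, _pos, tag in sentence:
        if tag == "O":
            prev_type = None
            continue
        prefix, chunk_type = tag.split("-", 1)
        if prefix == "B" or chunk_type != prev_type:
            result.append((chunk_type, [form]))
        else:
            result[-1][1].append(form)
        prev_type = chunk_type
    return result
```

Applied to the first six lines of the example, `chunks` yields an NP for "He", a VP for "reckons" and an NP for "the current account deficit".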

Table 10. Known corpora in this format
Corpus	Language
CoNLL 2000 Chunking Corpus	English
CoNLL 2000 Chunking Corpus (NLTK)	English

Conll2000Reader

Implementation

org.dkpro.core.io.conll.Conll2000Reader

Description

Reads the CoNLL 2000 chunking format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 11. Capabilities
Media types	text/x.org.dkpro.conll-2000
Outputs	DocumentMetaData Sentence Token Chunk

Conll2000Writer

Implementation

org.dkpro.core.io.conll.Conll2000Writer

Description

Writes the CoNLL 2000 chunking format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeChunk	Write chunking information. Type: Boolean — Default value: `true`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`

Table 12. Capabilities
Media types	text/x.org.dkpro.conll-2000
Inputs	DocumentMetaData Sentence Token Chunk

Conll2002

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2002 format encodes named entity spans. Fields are separated by a single space. Sentences are separated by a blank line.

Table 13. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
NER	NamedEntity	named entity (IOB2 encoded)

Example

Wolff B-PER
, O
currently O
a O
journalist O
in O
Argentina B-LOC
, O
played O
with O
Del B-PER
Bosque I-PER
in O
the O
final O
years O
of O
the O
seventies O
in O
Real B-ORG
Madrid I-ORG
. O
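
To illustrate the IOB2 encoding used in the NER column, the following Python sketch produces lines in this layout from a token list and a list of entity spans. It is not the Conll2002Writer implementation; the function name `to_iob2` and the span representation (start index, exclusive end index, label) are choices made for this example.

```python
def to_iob2(tokens, entities):
    """Render tokens as 'FORM TAG' lines with IOB2-encoded entity tags.

    entities is a list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # The first token of a span gets B-, all following tokens get I-.
        tags[start] = "B-" + label
        for index in range(start + 1, end):
            tags[index] = "I-" + label
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))
```

For the tokens `played with Del Bosque` and a single PER span covering the last two tokens, the result tags `Del` as `B-PER` and `Bosque` as `I-PER`, matching the example above.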

Table 14. Known corpora in this format
Corpus	Language
AQMAR Arabic Wikipedia Named Entity Corpus	Arabic
CoNLL 2002 dataset	Spanish
CoNLL 2002 dataset	Dutch

Conll2002Reader

Implementation

org.dkpro.core.io.conll.Conll2002Reader

Description

By default, reads the CoNLL 2002 named entity format.

The reader is also compatible with the CoNLL-based GermEval 2014 named entity format, in which the columns are separated by tabs, the token number is placed in the first column, and there is an extra column for embedded named entities (see below). For this purpose, additional parameters are provided that determine the column separator, whether there is an additional first column for token numbers, and whether embedded named entities should be read. (Note: Currently, the reader only reads the outer named entities, not the embedded ones.)

The following snippet shows an example of the TSV format:
# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1  Aufgrund          O           O
2  seiner            O           O
3  Initiative        O           O
4  fand              O           O
5  2001/2002         O           O
6  in                O           O
7  Stuttgart         B-LOC       O
8  ,                 O           O
9  Braunschweig      B-LOC       O
10 und               O           O
11 Bonn              B-LOC       O
12 eine              O           O
13 große             O           O
14 und               O           O
15 publizistisch     O           O
16 vielbeachtete     O           O
17 Troia-Ausstellung B-LOCpart   O
18 statt             O           O
19 ,                 O           O
20 „                 O           O
21 Troia             B-OTH       B-LOC
22 -                 I-OTH       O
23 Traum             I-OTH       O
24 und               I-OTH       O
25 Wirklichkeit      I-OTH       O
26 “                 O           O
27 .                 O           O

WORD_NUMBER - token number
FORM - token
NER1 - outer named entity (BIO encoded)
NER2 - embedded named entity (BIO encoded)

The sentence is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token. Name spans are encoded in the BIO-scheme. Outer spans are encoded in the third column, embedded spans in the fourth column.
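
As a rough illustration of this layout, the following standalone Python sketch splits GermEval 2014 style input into a source header and token rows. It is not the reader's implementation; the function name `read_germeval` is hypothetical, and the sketch assumes the columns are separated by real tab characters (the snippet above is shown with aligned spaces for readability).

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str, str, str]  # (token number, form, outer NER, embedded NER)


def read_germeval(text: str) -> List[Tuple[Optional[str], List[Row]]]:
    """Parse tab-separated GermEval 2014 text into (header, rows) per sentence."""
    sentences, header, rows = [], None, []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if rows:
                sentences.append((header, rows))
            header, rows = None, []
            continue
        fields = line.split("\t")
        if fields[0] == "#":  # header line: source and retrieval date
            header = fields[1] if len(fields) > 1 else ""
            continue
        number, form, outer, embedded = fields[:4]
        rows.append((int(number), form, outer, embedded))
    if rows:
        sentences.append((header, rows))
    return sentences


sample = (
    "#\thttp://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]\n"
    "21\tTroia\tB-OTH\tB-LOC\n"
    "22\t-\tI-OTH\tO\n"
    "23\tTraum\tI-OTH\tO\n"
)

header, rows = read_germeval(sample)[0]
print(header)
print(rows[0])
```

The outer and embedded tags are kept as separate fields so that a caller can decide whether to use the fourth column at all.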

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
columnSeparator	Column separator. Acceptable values come from ColumnSeparators. For example, to use a tab as the column separator, set this parameter to Conll2002Reader.ColumnSeparators.TAB.getName(). Optional — Type: String — Default value: `space`
hasEmbeddedNamedEntity	Has embedded named entity extra column. Optional — Type: Boolean — Default value: `false`
hasHeader	Indicates that there is a header line before the sentence. Optional — Type: Boolean — Default value: `false`
hasTokenNumber	Token number flag. When true, the first column contains the token number inside the sentence (as in the GermEval 2014 format). Optional — Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readNamedEntity	Read named entity information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
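
The `[+]` and `[-]` prefixes used by the patterns parameter can be illustrated with a small standalone Python sketch. It only shows how prefixed patterns separate into includes and excludes; it does not reproduce the reader's file matching. The function name `split_patterns` and the sample patterns are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_patterns(patterns: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate prefixed patterns into (includes, excludes), prefix removed."""
    includes, excludes = [], []
    for pattern in patterns:
        if pattern.startswith("[+]"):
            includes.append(pattern[3:])
        elif pattern.startswith("[-]"):
            excludes.append(pattern[3:])
        else:
            # This sketch does not guess how unprefixed patterns are treated.
            raise ValueError(f"pattern without [+] or [-] prefix: {pattern!r}")
    return includes, excludes


# Hypothetical patterns: all .conll files in any sub-directory, except drafts.
includes, excludes = split_patterns(["[+]**/*.conll", "[-]**/draft-*.conll"])
print(includes)
print(excludes)
```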

Table 15. Capabilities
Media types	text/x.org.dkpro.conll-2002 text/x.org.dkpro.germeval-2014
Outputs	DocumentMetaData NamedEntity Sentence Token

Conll2002Writer

Implementation

org.dkpro.core.io.conll.Conll2002Writer

Description

Writes the CoNLL 2002 named entity format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeNamedEntity	Write named entity information. Type: Boolean — Default value: `true`

Table 16. Capabilities
Media types	text/x.org.dkpro.conll-2002
Inputs	DocumentMetaData NamedEntity Sentence Token

Conll2003

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2003 format encodes named entity spans and chunk spans. Fields are separated by a single space. Sentences are separated by a blank new line. Named entities and chunks are encoded in the IOB1 format, i.e. a B prefix is only used for the first token of a span that immediately follows another span of the same category. In all other cases, a span starts with an I prefix.

Table 17. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
POSTAG	POS PosValue	part-of-speech tag
CHUNK	Chunk	chunk (IOB1 encoded)
NER	NamedEntity	named entity (IOB1 encoded)

Example

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
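
The difference between IOB1 and the IOB2 encoding used by CoNLL 2002 can be shown with a short conversion sketch in Python. It is a standalone illustration, not DKPro Core code, and the function name `iob1_to_iob2` is made up for this example. The tag sequences are the CHUNK and NER columns of the example above.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Rewrite IOB1 tags so that every span starts with a B- tag."""
    converted, previous = [], "O"
    for tag in tags:
        # In IOB1 a span may start with I-; it does so whenever the previous
        # token is outside any span or belongs to a different category.
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        converted.append(tag)
        previous = tag
    return converted


chunk = ["I-NP", "I-NP", "I-NP", "I-VP", "I-PP", "I-NP", "O"]
ner = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]

print(iob1_to_iob2(chunk))
# ['B-NP', 'I-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'O']
print(iob1_to_iob2(ner))
# ['B-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
```

A B- tag that is already present in the IOB1 input marks a span directly following another span of the same category and is left unchanged.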

Table 18. Known corpora in this format
Corpus	Language
AQMAR Arabic Wikipedia Named Entity Corpus	Arabic
CoNLL 2002 dataset	Spanish
CoNLL 2002 dataset	Dutch

Conll2003Reader

Implementation

org.dkpro.core.io.conll.Conll2003Reader

Description

Reads the CoNLL 2003 format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Type: Boolean — Default value: `true`
readNamedEntity	Read named entity information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 19. Capabilities
Media types	text/x.org.dkpro.conll-2003
Outputs	DocumentMetaData NamedEntity Sentence Token Chunk

Conll2003Writer

Implementation

org.dkpro.core.io.conll.Conll2003Writer

Description

Writes the CoNLL 2003 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeChunk	Write chunking information. Type: Boolean — Default value: `true`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeNamedEntity	Write named entity information. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`

Table 20. Capabilities
Media types	text/x.org.dkpro.conll-2003
Inputs	DocumentMetaData NamedEntity Sentence Token Chunk

Conll2006

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2006 (aka CoNLL-X) format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank new line.

Table 21. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
CPOSTAG	POS coarseValue	Coarse-grained part-of-speech tag, where the tagset depends on the language.
POSTAG	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
FEATS	MorphologicalFeatures	Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (`\|`), or an underscore if not available.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PHEAD	ignored	Projective head of the current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with a PHEAD of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structure resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).
PDEPREL	ignored	Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
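
To make the column table concrete, here is a minimal standalone Python sketch that splits one ten-column CoNLL-X line into named fields and expands the FEATS column. It is not the reader's implementation; the names `ConllXToken` and `parse_conllx_line` are hypothetical, and the sample line is invented for illustration (it is not taken from any treebank).

```python
from typing import List, NamedTuple


class ConllXToken(NamedTuple):
    token_id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]
    head: int
    deprel: str


def parse_conllx_line(line: str) -> ConllXToken:
    """Parse one tab-separated CoNLL-X token line (PHEAD/PDEPREL are dropped)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # FEATS is a set of features separated by '|', or '_' if not available.
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return ConllXToken(
        token_id=int(cols[0]),
        form=cols[1],
        lemma=cols[2],
        cpostag=cols[3],
        postag=cols[4],
        feats=feats,
        head=int(cols[6]),  # 0 means the token has no head in the sentence
        deprel=cols[7],
    )


# Invented sample line, for illustration only.
token = parse_conllx_line("2\tcats\tcat\tN\tNNS\tnum=pl|case=nom\t3\tSBJ\t_\t_")
print(token.form, token.feats, token.head, token.deprel)
```

The last two columns are discarded here, mirroring the table above, where PHEAD and PDEPREL are marked as ignored.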

Example

Heutzutage	heutzutage	ADV	_	_	ADV	_	_

Table 22. Known corpora in this format
Corpus	Language
Copenhagen Dependency Treebanks	Danish
FinnTreeBank (in recent versions with additional pseudo-XML metadata)	Finnish
Floresta Sintá(c)tica (Bosque-CoNLL)	Portuguese
Sequoia corpus	French
SETimes.HR corpus and dependency treebank of Croatian	Croatian
Składnica zależnościowa	Polish
Slovene Dependency Treebank	Slovene
Swedish Treebank	Swedish
Talbanken05	Swedish
Uppsala Persian Dependency Treebank	Persian (Farsi)
Norwegian Dependency Treebank (NDT)	Norwegian
IULA Resources. Corpus & Tools. IULA Spanish LSP Treebank	Spanish
Turin University Treebank	Italian

Conll2006Reader

Implementation

org.dkpro.core.io.conll.Conll2006Reader

Description

Reads files in the CoNLL-2006 format (aka CoNLL-X).

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readCPOS	Read coarse-grained part-of-speech information. Type: Boolean — Default value: `true`
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readMorph	Read morphological features. Type: Boolean — Default value: `true`
readPOS	Read fine-grained part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useCPosAsPos	Enable to use CPOS (column 4) as the part-of-speech tag. Otherwise the POS (column 5) is used. Type: Boolean — Default value: `false`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 23. Capabilities
Media types	text/x.org.dkpro.conll-2006
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2006Writer

Implementation

org.dkpro.core.io.conll.Conll2006Writer

Description

Writes a file in the CoNLL-2006 format (aka CoNLL-X).

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCPOS	Write coarse-grained part-of-speech information. Type: Boolean — Default value: `true`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeDependency	Write syntactic dependency information. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writeMorph	Write morphological features. Type: Boolean — Default value: `true`
writePOS	Write fine-grained part-of-speech information. Type: Boolean — Default value: `true`

Table 24. Capabilities
Media types	text/x.org.dkpro.conll-2006
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Conll2008

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2008 format targets syntactic and semantic dependencies. Columns are tab-separated. Sentences are separated by a blank new line.

Table 25. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
GPOS	POS PosValue	Gold fine-grained part-of-speech tag, where the tagset depends on the language.
PPOS	ignored	Automatically predicted major POS by a language-specific tagger.
SPLIT_FORM	ignored	Tokens split at hyphens and slashes.
SPLIT_LEMMA	ignored	Predicted lemma of SPLIT_FORM.
PPOSS	ignored	Predicted POS tags of the split forms.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (`0`). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply `ROOT`.
PRED	SemPred	Sense identifier of the semantic predicate evoked by the current token.
APREDs	SemArg	Columns with argument labels, one column per semantic predicate (in ID order).

Example

1	Some	some	DT	_	Some	some	DT	10	SBJ	_	_	_	_	A1	_	_	_
2	of	of	IN	_	of	of	IN	1	NMOD	_	_	_	_	_	_	_	_
3	the	the	DT	_	the	the	DT	5	NMOD	_	_	_	_	_	_	_	_
4	strongest	strongest	JJS	_	strongest	strong	JJS	5	NMOD	_	_	_	_	_	_	_	_
5	critics	critics	NNS	_	critics	critic	NNS	2	PMOD	critic.01	A0	_	_	_	_	_	_
6	of	of	IN	_	of	of	IN	5	NMOD	_	A1	_	_	_	_	_	_
7	our	our	PRP$	_	our	our	PRP$	9	NMOD	_	_	A1	A0	_	_	_	_
8	welfare	welfare	NN	_	welfare	welfare	NN	9	NMOD	welfare.01	_	A2	_	_	_	_	_
9	system	system	NN	_	system	system	NN	6	PMOD	system.01	_	_	_	_	_	_	_
10	are	are	VBP	_	are	be	VBP	0	ROOT	be.01	_	_	_	_	_	_	_
11	the	the	DT	_	the	the	DT	12	NMOD	_	_	_	_	_	_	_	_
12	people	people	NNS	_	people	people	NNS	10	PRD	person.02	_	_	_	A2	A0	A0	A1
13	who	who	WP	_	who	who	WP	14	SBJ	_	_	_	_	_	_	_	_
14	have	have	VBP	_	have	have	VBP	12	NMOD	have.04	_	_	_	_	SU	_	_
15	become	become	VBN	_	become	become	VBN	14	VC	become.01	_	_	_	_	A1	A1	_
16	dependent	dependent	JJ	_	dependent	dependent	JJ	15	PRD	_	_	_	_	_	_	_	_
17	on	on	IN	_	on	on	IN	16	AMOD	_	_	_	_	_	_	_	_
18	it	it	PRP	_	it	it	PRP	17	PMOD	_	_	_	_	_	_	_	_
19	.	.	.	_	.	.	.	10	P	_	_	_	_	_	_	_	_
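
Because the number of APRED columns varies with the number of predicates in a sentence, a row has eleven fixed columns followed by a variable tail. The following standalone Python sketch splits a row accordingly. It is not DKPro Core code; the names `FIXED_COLUMNS` and `split_conll2008_row` are hypothetical, and the sample is row 12 of the example above.

```python
from typing import Dict, List, Union

# The eleven fixed columns, in the order given in the column table.
FIXED_COLUMNS = [
    "ID", "FORM", "LEMMA", "GPOS", "PPOS", "SPLIT_FORM",
    "SPLIT_LEMMA", "PPOSS", "HEAD", "DEPREL", "PRED",
]


def split_conll2008_row(line: str) -> Dict[str, Union[str, List[str]]]:
    """Map the fixed columns by name and collect the trailing APRED columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(FIXED_COLUMNS):
        raise ValueError(f"expected at least {len(FIXED_COLUMNS)} columns")
    row: Dict[str, Union[str, List[str]]] = dict(zip(FIXED_COLUMNS, cols))
    row["APREDS"] = cols[len(FIXED_COLUMNS):]  # one entry per predicate
    return row


line = (
    "12\tpeople\tpeople\tNNS\t_\tpeople\tpeople\tNNS\t10\tPRD\t"
    "person.02\t_\t_\t_\tA2\tA0\tA0\tA1"
)
row = split_conll2008_row(line)
print(row["PRED"], row["APREDS"])
# person.02 ['_', '_', '_', 'A2', 'A0', 'A0', 'A1']
```

The seven APRED entries correspond to the seven rows of the example sentence whose PRED column is not an underscore.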

Table 26. Known corpora in this format
Corpus	Language
MASC-CONLL	English

Conll2008Reader

Implementation

org.dkpro.core.io.conll.Conll2008Reader

Description

Reads a file in the CoNLL-2008 format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
readSemPred	Read semantic predicate information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 27. Capabilities
Media types	text/x.org.dkpro.conll-2008
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2008Writer

Implementation

org.dkpro.core.io.conll.Conll2008Writer

Description

Writes a file in the CoNLL-2008 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as the file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeDependency	Write syntactic dependency information. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writeMorph	Write morphological features. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`
writeSemanticPredicate	Write semantic predicate information. Type: Boolean — Default value: `true`

Table 28. Capabilities
Media types	text/x.org.dkpro.conll-2008
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2009

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2009 format targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank new line.

Table 29. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
PLEMMA	ignored	Automatically predicted lemma of FORM.
POS	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language.
PPOS	ignored	Automatically predicted major POS by a language-specific tagger.
FEATS	MorphologicalFeatures	Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (`\|`), or an underscore if not available.
PFEAT	ignored	Automatically predicted morphological features (if applicable).
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (`0`). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
PHEAD	ignored	Automatically predicted syntactic head.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply `ROOT`.
PDEPREL	ignored	Automatically predicted dependency relation to PHEAD.
FILLPRED	ignored	Contains `Y` for argument-bearing tokens.
PRED	SemPred	Sense identifier of the semantic predicate evoked by the current token.
APREDs	SemArg	Columns with argument labels, one column per semantic predicate (in ID order).

Example

1	The	the	the	DT	DT	_	_	4	4	NMOD	NMOD	_	_	_	_
2	most	most	most	RBS	RBS	_	_	3	3	AMOD	AMOD	_	_	_	_
3	troublesome	troublesome	troublesome	JJ	JJ	_	_	4	4	NMOD	NMOD	_	_	_	_
4	report	report	report	NN	NN	_	_	5	5	SBJ	SBJ	_	_	_	_
5	may	may	may	MD	MD	_	_	0	0	ROOT	ROOT	_	_	_	_
6	be	be	be	VB	VB	_	_	5	5	VC	VC	_	_	_	_
7	the	the	the	DT	DT	_	_	11	11	NMOD	NMOD	_	_	_	_
8	August	august	august	NNP	NNP	_	_	11	11	NMOD	NMOD	_	_	_	AM-TMP
9	merchandise	merchandise	merchandise	NN	NN	_	_	10	10	NMOD	NMOD	_	_	A1	_
10	trade	trade	trade	NN	NN	_	_	11	11	NMOD	NMOD	Y	trade.01	_	A1
11	deficit	deficit	deficit	NN	NN	_	_	6	6	PRD	PRD	Y	deficit.01	_	A2
12	due	due	due	JJ	JJ	_	_	13	11	AMOD	APPO	_	_	_	_
13	out	out	out	IN	IN	_	_	11	12	APPO	AMOD	_	_	_	_
14	tomorrow	tomorrow	tomorrow	NN	NN	_	_	13	12	TMP	TMP	_	_	_	_
15	.	.	.	.	.	_	_	5	5	P	P	_	_	_	_
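
The way the APRED columns line up with the predicates can be sketched in standalone Python. This is an illustration only, not the reader's implementation; the function name `semantic_roles` is hypothetical. The sample consists of rows 8 to 11 of the example above, which contain both predicates of the sentence.

```python
from typing import List, Tuple

PRED_COL = 13  # zero-based index of PRED; the APRED columns follow it


def semantic_roles(rows: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Return (predicate, argument label, argument form) triples."""
    predicates = [r[PRED_COL] for r in rows if r[PRED_COL] != "_"]
    roles = []
    # The k-th APRED column belongs to the k-th predicate in ID order.
    for k, predicate in enumerate(predicates):
        for r in rows:
            label = r[PRED_COL + 1 + k]
            if label != "_":
                roles.append((predicate, label, r[1]))
    return roles


sample = (
    "8\tAugust\taugust\taugust\tNNP\tNNP\t_\t_\t11\t11\tNMOD\tNMOD\t_\t_\t_\tAM-TMP\n"
    "9\tmerchandise\tmerchandise\tmerchandise\tNN\tNN\t_\t_\t10\t10\tNMOD\tNMOD\t_\t_\tA1\t_\n"
    "10\ttrade\ttrade\ttrade\tNN\tNN\t_\t_\t11\t11\tNMOD\tNMOD\tY\ttrade.01\t_\tA1\n"
    "11\tdeficit\tdeficit\tdeficit\tNN\tNN\t_\t_\t6\t6\tPRD\tPRD\tY\tdeficit.01\t_\tA2\n"
)
rows = [line.split("\t") for line in sample.splitlines()]

for triple in semantic_roles(rows):
    print(triple)
# ('trade.01', 'A1', 'merchandise')
# ('deficit.01', 'AM-TMP', 'August')
# ('deficit.01', 'A1', 'trade')
# ('deficit.01', 'A2', 'deficit')
```

Only complete sentences, or excerpts that keep every predicate row, should be passed in; dropping a predicate row would shift the column alignment.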

Table 30. Known corpora in this format
Corpus	Language
CoNLL 2009 Shared Task	Catalan, German, Japanese, Spanish

Conll2009Reader

Implementation

org.dkpro.core.io.conll.Conll2009Reader

Description

Reads a file in the CoNLL-2009 format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readMorph	Read morphological features. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
readSemPred	Read semantic predicate information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 31. Capabilities
Media types	text/x.org.dkpro.conll-2009
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2009Writer

Implementation

org.dkpro.core.io.conll.Conll2009Writer

Description

Writes a file in the CoNLL-2009 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeDependency	Write syntactic dependency information. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writeMorph	Write morphological features. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`
writeSemPred	Write semantic predicate information. Type: Boolean — Default value: `true`

Table 32. Capabilities
Media types	text/x.org.dkpro.conll-2009
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token SemArg SemPred Dependency

Conll2012

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL 2012 format targets semantic role labeling and coreference. Columns are whitespace-separated (tabs or spaces). Sentences are separated by a blank new line.

Note that this format cannot deal with the following situations:

* An annotation has no label (e.g. a SemPred annotation has no category). In such a case, null is written into the corresponding column. However, the reader will actually read this value as the label.
* If a SemPred annotation is at the same position as a SemArg annotation linked to it, then only the (V*) representing the SemPred annotation will be written.
* SemPred annotations spanning more than one token are not supported.
* If there are multiple SemPred annotations on the same token, then only one of them is written. This is because the category of the SemPred annotation goes to the Predicate Frameset ID column, and that column can hold only one value.

Table 33. Columns
Column	Type/Feature	Description
Document ID	ignored	This is a variation on the document filename.
Part number	ignored	Some files are divided into multiple parts numbered as 000, 001, 002, … etc.
Word number	ignored
Word itself	document text	This is the token as segmented/tokenized in the Treebank. Initially the `*_skel` files contain the placeholder `[WORD]`, which gets replaced by the actual token from the Treebank that is part of the OntoNotes release.
Part-of-Speech	POS
Parse bit	Constituent	This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a `*`. The full parse can be created by substituting the asterisk with the `([pos] [word])` string (or leaf) and concatenating the items in the rows of that column.
Predicate lemma	Lemma	The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a `-`.
Predicate Frameset ID	SemPred	This is the PropBank frameset ID of the predicate in Column 7.
Word sense	ignored	This is the word sense of the word in Column 3.
Speaker/Author	ignored	This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
Named Entities	NamedEntity	This column identifies the spans representing various named entities.
Predicate Arguments	SemPred	There is one column each of predicate argument structure information for the predicate mentioned in Column 7.
Coreference	CoreferenceChain	Coreference chain information encoded in a parenthesis structure.
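
The reconstruction rule given for the Parse bit column can be sketched in a few lines. The Python below is an illustration only, not the reader's implementation: it substitutes each asterisk with a `([pos] [word])` leaf and concatenates the results, using the Word, Part-of-Speech, and Parse bit values from the example sentence of this format. The function name `rebuild_parse` is hypothetical.

```python
def rebuild_parse(rows):
    """Rebuild a bracketed parse from (word, pos, parse_bit) rows of one sentence."""
    pieces = []
    for word, pos, parse_bit in rows:
        leaf = "(" + pos + " " + word + ")"
        # Each parse bit holds exactly one "*" placeholder for the leaf.
        pieces.append(parse_bit.replace("*", leaf, 1))
    return "".join(pieces)


rows = [
    ("John", "NNP", "(TOP(S(NP*)"),
    ("went", "VBD", "(VP*"),
    ("to", "TO", "(PP*"),
    ("the", "DT", "(NP*"),
    ("market", "NN", "*)))"),
    (".", ".", "*))"),
]
tree = rebuild_parse(rows)
```

A quick sanity check on such output is that the number of opening and closing parentheses match once all rows of a sentence have been joined.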

Example

en-orig.conll	0	0	John	NNP	(TOP(S(NP*)	john	-	-	-	(PERSON)	(A0)	(1)
en-orig.conll	0	1	went	VBD	(VP*	go	go.02	-	-	*	(V*)	-
en-orig.conll	0	2	to	TO	(PP*	to	-	-	-	*	*	-
en-orig.conll	0	3	the	DT	(NP*	the	-	-	-	*	*	(2
en-orig.conll	0	4	market	NN	*)))	market	-	-	-	*	(A1)	2)
en-orig.conll	0	5	.	.	*))	.	-	-	-	*	*	-
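
The parenthesis structure in the Coreference column of the example above can be decoded into token spans roughly as follows. This Python sketch is not the DKPro Core reader logic; the function name `decode_coreference` is hypothetical, and the handling of several mentions in one cell separated by `|` is an assumption about the format that the example itself does not exercise.

```python
from collections import defaultdict


def decode_coreference(column):
    """Turn per-token coreference cells into (chain_id, first_token, last_token)."""
    open_spans = defaultdict(list)  # chain id -> stack of start indices
    spans = []
    for index, cell in enumerate(column):
        if cell == "-":
            continue
        # Assumption: multiple mentions on one token are separated by "|".
        for part in cell.split("|"):
            chain_id = int(part.strip("()"))
            if part.startswith("("):
                open_spans[chain_id].append(index)
            if part.endswith(")"):
                start = open_spans[chain_id].pop()
                spans.append((chain_id, start, index))
    return spans


# Coreference column of the example sentence, one cell per token.
spans = decode_coreference(["(1)", "-", "-", "(2", "2)", "-"])
```

Token indices here are zero-based, matching the Word number column of the example.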

Conll2012Reader

Implementation

org.dkpro.core.io.conll.Conll2012Reader

Description

Reads a file in the CoNLL-2012 format.

Parameters

ConstituentMappingLocation	Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ConstituentTagSet	Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readConstituent	Read syntactic constituent information. Type: Boolean — Default value: `true`
readCoreference	Read co-reference information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates. Type: Boolean — Default value: `false`
readNamedEntity	Read named entity information. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
readSemPred	Read semantic predicate information. Type: Boolean — Default value: `true`
readWordSense	Read word sense information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
useHeaderMetadata	Use the document ID declared in the file header instead of using the filename. Type: Boolean — Default value: `true`
writeTracesToText	Whether to render traces into the document text. Optional — Type: Boolean — Default value: `false`

Table 34. Capabilities
Media types	text/x.org.dkpro.conll-2012
Outputs	CoreferenceChain POS DocumentMetaData NamedEntity Lemma Sentence Token SemArg SemPred WordSense

Conll2012Writer

Implementation

org.dkpro.core.io.conll.Conll2012Writer

Description

Writer for the CoNLL-2012 format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`
writeSemanticPredicate	Write semantic predicate information. Type: Boolean — Default value: `true`

Table 35. Capabilities
Media types	text/x.org.dkpro.conll-2012
Inputs	CoreferenceChain POS DocumentMetaData NamedEntity Lemma Sentence Token SemArg SemPred WordSense

ConllCoreNlp

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoreNLP CoNLL format is used by the Stanford CoreNLP package. Columns are tab-separated. Sentences are separated by a blank new line.

Table 36. Columns
Column	Type/Feature	Description
ID	ignored	Token counter, starting at 1 for each new sentence.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma of the word form.
POSTAG	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
NER	NamedEntity	Named Entity tag, or underscore if not available. If a named entity covers multiple tokens, all of the tokens simply carry the same label (no sequence encoding).
HEAD	Dependency	Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
DEPREL	Dependency	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Example

1	Selectum	Selectum	NNP	O	_	_
2	,	,	,	O	_	_
3	Société	Société	NNP	O	_	_
4	d'Investissement	d'Investissement	NNP	O	_	_
5	à	à	NNP	O	_	_
6	Capital	Capital	NNP	O	_	_
7	Variable	Variable	NNP	O	_	_
8	.	.	.	O	_	_
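
Because the NER column carries no sequence encoding, a consumer has to infer entity spans from runs of identical labels. The Python sketch below shows one way to do that; it is an illustration, not the reader's implementation. The function name `entity_spans` and the sample labels are hypothetical, and treating both `O` (as in the example above) and `_` as "no entity" is an assumption.

```python
from itertools import groupby


def entity_spans(labels, outside=("O", "_")):
    """Group runs of equal labels into (label, first_token, last_token) spans."""
    spans = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label not in outside:
            spans.append((label, position, position + length - 1))
        position += length
    return spans


# Hypothetical NER column of a five-token sentence.
spans = entity_spans(["PERSON", "PERSON", "O", "LOCATION", "O"])
```

One consequence of this encoding is that two directly adjacent entities with the same label cannot be told apart; they come out as a single span.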

ConllCoreNlpReader

Implementation

org.dkpro.core.io.conll.ConllCoreNlpReader

Description

Reads files in the default CoreNLP CoNLL format.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readNamedEntity	Read named entity information. Type: Boolean — Default value: `true`
readPOS	Read fine-grained part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 37. Capabilities
Media types	text/x.org.dkpro.conll-corenpl
Outputs	POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency

ConllCoreNlpWriter

Implementation

org.dkpro.core.io.conll.ConllCoreNlpWriter

Description

Writes files in the default CoreNLP CoNLL format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conll`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeDependency	Write syntactic dependency information. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writeNamedEntity	Write named entity information. Type: Boolean — Default value: `true`
writePOS	Write fine-grained part-of-speech information. Type: Boolean — Default value: `true`

Table 38. Capabilities
Media types	text/x.org.dkpro.conll-corenpl
Inputs	POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency

ConllU

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-conll-asl

The CoNLL-U format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank new line.

Table 39. Columns
Column	Type/Feature	Description
ID	ignored	Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.
FORM	Token	Word form or punctuation symbol.
LEMMA	Lemma	Lemma or stem of word form.
CPOSTAG	POS coarseValue	Part-of-speech tag from the universal POS tag set.
POSTAG	POS PosValue	Language-specific part-of-speech tag; underscore if not available.
FEATS	MorphologicalFeatures	List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD	Dependency	Head of the current token, which is either a value of ID or zero (0).
DEPREL	Dependency	Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
DEPS	Dependency	List of secondary dependencies (head-deprel pairs).
MISC	unused	Any other annotation.

Example

1	They	they	PRON	PRN	Case=Nom|Number=Plur	2	nsubj	4:nsubj	_
2	buy	buy	VERB	VB	Number=Plur|Person=3|Tense=Pres	0	root	_	_
3	and	and	CONJ	CC	_	2	cc	_	_
4	sell	sell	VERB	VB	Number=Plur|Person=3|Tense=Pres	2	conj	0:root	_
5	books	book	NOUN	NNS	Number=Plur	2	dobj	4:dobj	SpaceAfter=No
6	.	.	PUNCT	.	_	2	punct	_	_
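
To make the column layout concrete, the following Python sketch splits one line of the example above into named fields and unpacks the FEATS column. It is only an illustration of the format, not the ConllUReader implementation; the names `COLUMNS`, `parse_feats`, and `parse_token` are hypothetical.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_feats(value):
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_token(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(fields)))
    token = dict(zip(COLUMNS, fields))
    token["FEATS"] = parse_feats(token["FEATS"])
    # HEAD is an ID value or 0; '_' is treated as "not available".
    token["HEAD"] = None if token["HEAD"] == "_" else int(token["HEAD"])
    return token


# First token line of the example.
token = parse_token("1\tThey\tthey\tPRON\tPRN\tCase=Nom|Number=Plur\t2\tnsubj\t4:nsubj\t_")
```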

Table 40. Known corpora in this format
Corpus	Language
Universal Dependency Treebank	Ancient Greek (to 1453), Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Modern Greek (1453-), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Norwegian, Church Slavic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil, Catalan, Chinese, Galician, Kazakh, Latvian, Russian, Turkish

ConllUReader

Implementation

org.dkpro.core.io.conll.ConllUReader

Description

Reads a file in the CoNLL-U format.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readCPOS	Read coarse-grained part-of-speech information. Type: Boolean — Default value: `true`
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readMorph	Read morphological features. Type: Boolean — Default value: `true`
readPOS	Read fine-grained part-of-speech information. Type: Boolean — Default value: `true`
readParagraph	Read paragraph information. If no paragraph information is provided in the file, or if set to false, then output one sentence per line, separated by an empty line. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
trimFields	Trim field values. Type: Boolean — Default value: `true`
useCPosAsPos	Treat coarse-grained part-of-speech as fine-grained part-of-speech information. Type: Boolean — Default value: `false`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 41. Capabilities
Media types	text/x.org.dkpro.conll-u
Outputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

ConllUWriter

Implementation

org.dkpro.core.io.conll.ConllUWriter

Description

Writes a file in the CoNLL-U format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.conllu`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCPOS	Write coarse-grained part-of-speech information. Type: Boolean — Default value: `true`
writeCoveredText	Write text covered by the token instead of the token form. Type: Boolean — Default value: `true`
writeDependency	Write syntactic dependency information. Type: Boolean — Default value: `true`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writeMorph	Write morphological features. Type: Boolean — Default value: `true`
writePOS	Write fine-grained part-of-speech information. Type: Boolean — Default value: `true`
writeTextComment	Include the full sentence text as a comment in front of each sentence. Type: Boolean — Default value: `true`

Table 42. Capabilities
Media types	text/x.org.dkpro.conll-u
Inputs	MorphologicalFeatures POS DocumentMetaData Lemma Sentence Token Dependency

Ditop

DiTop

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-ditop-asl

DiTopWriter

Implementation

org.dkpro.core.io.ditop.DiTopWriter

Description

This annotator (consumer) writes output files as required by DiTop. It requires JCas input annotated by org.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer using the same model.

Parameters

appendConfig	If set to true, the new corpus will be appended to an existing config file. If false, the existing file is overwritten. Type: Boolean — Default value: `true`
collectionValues	If set, only documents with one of the listed collection IDs are written, all others are ignored. If this is empty (null), all documents are written. Optional — Type: String[]
collectionValuesExactMatch	If true (default), only write documents with collection IDs matching one of the collection values exactly. If false, write documents with collection IDs containing any of the collection value strings, ignoring case. Type: Boolean — Default value: `true`
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
corpusName	The corpus name is used to name the corresponding sub-directory and will be set in the configuration file. Type: String
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
maxTopicWords	The maximum number of topic words to extract. Type: Integer — Default value: `15`
modelLocation	A Mallet file storing a serialized ParallelTopicModel. Type: String
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Directory in which to store output files. Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 43. Capabilities
Media types	application/x.org.dkpro.ditop
Inputs	DocumentMetaData TopicDistribution

DKPro Core

Concrete

Group ID	org.dkpro.core
Artifact ID	dkpro-core

ConcreteReader

Implementation

org.dkpro.core.io.concrete.ConcreteReader

Description

null

Parameters

format	Type: String — Default value: `compact`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Default: 1 (log every document). Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 44. Capabilities
Media types	application/x.org.dkpro.lxf+json
Outputs	DocumentMetaData Sentence Token

ConcreteWriter

Implementation

org.dkpro.core.io.concrete.ConcreteWriter

Description

null

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filenameExtension	Specify the suffix of output files. Default value `.concrete`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.concrete`
format	Type: String — Default value: `compact`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 45. Capabilities
Media types	application/x.org.dkpro.lxf+json
Inputs	DocumentMetaData Sentence Token

Frequency

Group ID	org.dkpro.core
Artifact ID	dkpro-core-frequency-asl

FrequencyWriter

Implementation

org.dkpro.core.frequency.phrasedetection.FrequencyWriter

Description

Count uni-grams and bi-grams in a collection.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
coveringType	Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences. Optional — Type: String
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
featurePath	The feature path. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
filterRegex	Regular expression of tokens to be filtered. Type: String — Default value: ``
lowercase	If true, all tokens are lowercased. Type: Boolean — Default value: `false`
minCount	Tokens occurring fewer times than this value are omitted. Type: Integer — Default value: `5`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
regexReplacement	Value with which tokens matching the regular expression are replaced. Type: String — Default value: ``
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
sortByAlphabet	If true, sort output alphabetically. Type: Boolean — Default value: `false`
sortByCount	If true, sort output by count (descending order). Type: Boolean — Default value: `false`
stopwordsFile	Path of a file containing stopwords, one word per line. Type: String — Default value: ``
stopwordsReplacement	Stopwords are replaced by this value. Type: String — Default value: ``
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 46. Capabilities
Media types	none specified
Inputs	none specified
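
The counting described for FrequencyWriter can be illustrated with a minimal sketch. This is not the component's implementation: the function name `count_ngrams` and the input shape (a list of token lists, one per covering unit such as a sentence) are hypothetical, and applying the `minCount` threshold to bigrams as well as unigrams is an assumption.

```python
from collections import Counter


def count_ngrams(sentences, lowercase=False, min_count=5):
    """Count unigrams and bigrams; bigrams never span two covering units."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        if lowercase:
            tokens = [t.lower() for t in tokens]
        unigrams.update(tokens)
        # Pair each token with its right neighbour inside the same unit.
        bigrams.update(zip(tokens, tokens[1:]))

    def keep(counts):
        # Drop entries that occur fewer than min_count times.
        return {k: n for k, n in counts.items() if n >= min_count}

    return keep(unigrams), keep(bigrams)


sentences = [["The", "cat", "sat"], ["the", "cat", "ran"]]
unigrams, bigrams = count_ngrams(sentences, lowercase=True, min_count=2)
```

With `lowercase=True` the two spellings of "the" are merged before counting, which is the effect the `lowercase` parameter describes.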

TfIdf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-frequency-asl

TfIdfWriter

Implementation

org.dkpro.core.frequency.tfidf.TfIdfWriter

Description

This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.

Parameters

featurePath	This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path. Type: String
lowercase	If set to true, the whole text is handled in lower case. Type: Boolean — Default value: `false`
targetLocation	Specifies the path and filename where the model file is written. Type: String

Table 47. Capabilities
Media types	none specified
Inputs	none specified
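
As a rough illustration of what "document frequency" means here, the following sketch counts in how many documents each term occurs. The function name and the in-memory `Counter` are hypothetical; the serialized DfModel object written by TfIdfWriter is not reproduced.

```python
from collections import Counter


def document_frequencies(documents, lowercase=False):
    """Return, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in documents:
        if lowercase:
            terms = [t.lower() for t in terms]
        # A set ensures a term is counted at most once per document.
        df.update(set(terms))
    return df


docs = [["Corpus", "corpus", "index"], ["index", "query"]]
df = document_frequencies(docs, lowercase=True)
```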

Gigaword

AnnotatedGigaword

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-gigaword-asl

AnnotatedGigawordReader

Implementation

org.dkpro.core.io.gigaword.AnnotatedGigawordReader

Description

UIMA collection reader for plain text files.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 48. Capabilities
Media types	text/plain
Outputs	DocumentMetaData
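
The `patterns` parameter, which is shared by the file-based readers in this reference, distinguishes include and exclude patterns by their prefix. The sketch below only shows that split. The function name is hypothetical, and treating a pattern without a prefix as an include is an assumption of this sketch, not something stated above.

```python
INCLUDE_PREFIX = "[+]"
EXCLUDE_PREFIX = "[-]"


def split_patterns(patterns):
    """Separate prefixed Ant-like patterns into include and exclude lists."""
    includes, excludes = [], []
    for pattern in patterns:
        if pattern.startswith(INCLUDE_PREFIX):
            includes.append(pattern[len(INCLUDE_PREFIX):])
        elif pattern.startswith(EXCLUDE_PREFIX):
            excludes.append(pattern[len(EXCLUDE_PREFIX):])
        else:
            # Assumption: an unprefixed pattern is treated as an include.
            includes.append(pattern)
    return includes, excludes


includes, excludes = split_patterns(["[+]**/*.txt", "[-]**/draft-*"])
```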

HTML

Html

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-html-asl

HtmlReader

Implementation

org.dkpro.core.io.html.HtmlReader

Description

Reads the contents of a given URL and strips the HTML. Returns the textual contents. Also recognizes headings and paragraphs.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 49. Capabilities
Media types	application/xhtml+xml text/html
Outputs	DocumentMetaData Heading Paragraph

HtmlDocument

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-html-asl

HtmlDocumentReader

Implementation

org.dkpro.core.io.html.HtmlDocumentReader

Description

Reads the contents of a given URL and strips the HTML. Returns the textual contents. Also recognizes headings and paragraphs.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
normalizeWhitespace	Normalize whitespace. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 50. Capabilities
Media types	application/xhtml+xml text/html
Outputs	DocumentMetaData Heading Paragraph XmlAttribute XmlDocument XmlElement XmlNode XmlTextNode

IMS Corpus Workbench

ImsCwb

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-imscwb-asl

The "verticalized XML" format used by the IMS Open Corpus Workbench, a linguistic search engine. It uses a tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees). In principle, it is a generic format - i.e. there can be arbitrary columns, pseudo-XML elements and attributes. However, support is limited to a specific set of columns that must appear exactly in a specific order: token text, part-of-speech tag, lemma. Also only specific pseudo-XML elements and attributes are supported: text (including an id attribute), s.

If a local installation of the corpus workbench is available, it can be used by this module to immediately generate the corpus workbench index format. Search is not supported by this module.

Example

<text id="http://www.epguides.de/nikita.htm">
<s>
Nikita	NE	Nikita
(	$(	(
La	FM	La
Femme	NN	Femme
Nikita	NE	Nikita
)	$(	)
Dieser	PDS	dies
Episodenführer	NN	Episodenführer
wurde	VAFIN	werden
von	APPR	von
September	NN	September
1998	CARD	1998
bis	APPR	bis
Mai	NN	Mai
1999	CARD	1999
von	APPR	von
Konstantin	NE	Konstantin
C.W.	NE	C.W.
Volkmann	NE	Volkmann
geschrieben	VVPP	schreiben
und	KON	und
im	APPRART	im
Mai	NN	Mai
2000	CARD	2000
von	APPR	von
Stefan	NE	Stefan
Börzel	NN	Börzel
übernommen	VVPP	übernehmen
.	$.	.
</s>
</text>
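
To make the layout of the example concrete, here is a minimal parsing sketch. It assumes exactly the supported subset described above (three tab-separated columns in the order token, part-of-speech tag, lemma, plus the `text` and `s` pseudo-XML elements). The function name `parse_vrt` is hypothetical and this is not the ImsCwbReader implementation.

```python
import re

# Opening <text> element with an optional id attribute.
TEXT_OPEN = re.compile(r'<text(?:\s+id="([^"]*)")?\s*>$')


def parse_vrt(lines):
    """Return a list of (text_id, sentences); a sentence is a list of
    (token, pos, lemma) tuples."""
    documents = []
    text_id, sentences, current = None, [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        opening = TEXT_OPEN.match(line)
        if opening:
            text_id, sentences = opening.group(1), []
        elif line == "</text>":
            documents.append((text_id, sentences))
        elif line == "<s>":
            current = []
        elif line == "</s>":
            sentences.append(current)
        elif line:
            # Token line: the three supported columns in fixed order.
            token, pos, lemma = line.split("\t")
            current.append((token, pos, lemma))
    return documents


sample = (
    '<text id="doc1">\n'
    "<s>\n"
    "Nikita\tNE\tNikita\n"
    "geschrieben\tVVPP\tschreiben\n"
    "</s>\n"
    "</text>\n"
)
documents = parse_vrt(sample.splitlines())
```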

See also

IMS Open Corpus Workbench

Known corpora in this format

WaCky - The Web-As-Corpus Kool Yinitiative - corpora crawled from the world wide web in several different languages (DeWaC, UkWaC, ItWaC, etc.)

ImsCwbReader

Implementation

org.dkpro.core.io.imscwb.ImsCwbReader

Description

Reads a tab-separated format including pseudo-XML tags.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Specify which tag set should be used to locate the mapping file. Optional — Type: String
generateNewIds	If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. Type: Boolean — Default value: `false`
idIsUrl	If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS. Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readLemma	Read lemmas. Type: Boolean — Default value: `true`
readPOS	Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used. Type: Boolean — Default value: `true`
readSentence	Read sentences. Type: Boolean — Default value: `true`
readToken	Read tokens and generate Token annotations. Type: Boolean — Default value: `true`
replaceNonXml	Replace non-XML characters with spaces. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 51. Capabilities
Media types	text/x.org.dkpro.imscwb
Outputs	POS DocumentMetaData Lemma Sentence Token

ImsCwbWriter

Implementation

org.dkpro.core.io.imscwb.ImsCwbWriter

Description

Writes in the IMS Open Corpus Workbench verticalized XML format.

This writer produces a text file which needs to be converted to the binary IMS CWB index files using the command line tools that come with the CWB.

It is possible to set the parameter #PARAM_CQP_HOME to directly create output in the native binary CQP format via the original CWB command line tools.

When not configured to write directly to a CQP process, then the writer will produce one file per CAS. In order to write all data to the same file, use JCasFileWriter_ImplBase#PARAM_SINGULAR_TARGET.

Parameters

additionalFeatures	Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames. Optional — Type: String[]
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
corpusName	The name of the generated corpus. Type: String — Default value: `corpus`
cqpCompress	Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. For large data sets, it is best to test which setting works better. (default: false) Type: Boolean — Default value: `false`
cqpHome	Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to encode directly into the CQP binary format. Optional — Type: String
cqpwebCompatibility	Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore. Type: Boolean — Default value: `false`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.vrt`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.vrt`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
sentenceTag	The pseudo-XML tag used to mark sentence boundaries. Type: String — Default value: `s`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeCPOS	Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag. Type: Boolean — Default value: `false`
writeDocId	Write the document ID for each token. It is usually a better idea to generate a #PARAM_WRITE_DOCUMENT_TAG document tag or a #PARAM_WRITE_TEXT_TAG text tag which also contain the document ID that can be queried in CQP. Type: Boolean — Default value: `false`
writeDocumentTag	Write a pseudo-XML tag with the name document to mark the start and end of a document. Type: Boolean — Default value: `false`
writeLemma	Write lemmata. Type: Boolean — Default value: `true`
writeOffsets	Write the start and end position of each token. Type: Boolean — Default value: `false`
writePOS	Write part-of-speech tags. Type: Boolean — Default value: `true`
writeTextTag	Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb. Type: Boolean — Default value: `true`

Table 52. Capabilities
Media types	text/x.org.dkpro.imscwb
Inputs	POS DocumentMetaData Lemma Sentence Token
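
The text output controlled by `sentenceTag`, `writeTextTag` and `cqpwebCompatibility` can be pictured with the following sketch. It is an illustration under assumptions, not the writer's code: the function name is hypothetical, replacing every character outside letters, digits and underscore with an underscore is only one possible way to satisfy the CQPweb requirement, and no XML escaping is performed.

```python
import re


def to_vrt(doc_id, sentences, sentence_tag="s", write_text_tag=True,
           cqpweb_compatibility=False):
    """Render sentences of (token, pos, lemma) tuples as verticalized text."""
    if cqpweb_compatibility:
        # Assumption: map every disallowed character to an underscore.
        doc_id = re.sub(r"[^A-Za-z0-9_]", "_", doc_id)
    lines = []
    if write_text_tag:
        lines.append(f'<text id="{doc_id}">')
    for sentence in sentences:
        lines.append(f"<{sentence_tag}>")
        lines.extend("\t".join(columns) for columns in sentence)
        lines.append(f"</{sentence_tag}>")
    if write_text_tag:
        lines.append("</text>")
    return "\n".join(lines) + "\n"


vrt = to_vrt("my-doc.txt", [[("Mai", "NN", "Mai")]], cqpweb_compatibility=True)
```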

JDBC

Jdbc

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jdbc-asl

JdbcReader

Implementation

org.dkpro.core.io.jdbc.JdbcReader

Description

Collection reader for a JDBC database. The obtained data will be written into CAS DocumentText as well as fields of the DocumentMetaData annotation.

The field names are available as constants and begin with CAS_. Please specify the mapping of the columns and the field names in the query. For example,

SELECT text AS cas_text, title AS cas_metadata_title FROM test_table

will create a CAS for each record, write the content of "text" column into CAS document text and that of "title" column into the document title field of the DocumentMetaData annotation.
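
The effect of the column aliases can be tried out without a MySQL server. The sketch below swaps in Python's built-in SQLite driver purely for illustration (JdbcReader itself uses JDBC), creates a hypothetical `test_table` with two rows, and runs the query from above. Each resulting record carries the aliased names that the reader would map to the document text and the document title.

```python
import sqlite3

# In-memory stand-in for the real database; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE test_table (text TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?)",
    [("First body.", "First"), ("Second body.", "Second")],
)

query = "SELECT text AS cas_text, title AS cas_metadata_title FROM test_table"

# One dictionary per record; in the reader, each record becomes one CAS.
records = [dict(row) for row in conn.execute(query)]
conn.close()
```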

Parameters

connection	Specifies the URL to the database. If used with uimaFIT and the value is not given, `jdbc:mysql://127.0.0.1/` will be taken. Do not use this parameter to add additional parameters, but use #PARAM_CONNECTION_PARAMS instead. Type: String — Default value: `jdbc:mysql://127.0.0.1/`
connectionParams	Add additional parameters for the connection URL here in a single string: [&propertyName1=propertyValue1[&propertyName2=propertyValue2]...]. Type: String — Default value: ``
database	Specifies name of the database to be accessed. Type: String
driver	Specify the class name of the JDBC driver. If used with uimaFIT and the value is not given, `com.mysql.cj.jdbc.Driver` will be taken. Type: String — Default value: `com.mysql.cj.jdbc.Driver`
language	Specifies the language. Optional — Type: String
password	Specifies the password for database access. Type: String
query	Specifies the query. Type: String
user	Specifies the user name for database access. Type: String

Table 53. Capabilities
Media types	none specified
Outputs	DocumentMetaData

Leipzig Corpora Collection

Lcc

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-lcc-asl

LccReader

Implementation

org.dkpro.core.io.lcc.LccReader

Description

Reader for sentence-based Leipzig Corpora Collection files.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sentencesPerCAS	How many input sentences should be merged into one CAS. Type: Integer — Default value: `100`
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
writeSentence	Whether sentences should be written by the reader or not. Type: Boolean — Default value: `false`

Table 54. Capabilities
Media types	text/x.org.dkpro.lcc
Outputs	DocumentMetaData
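
The `sentencesPerCAS` parameter groups consecutive input sentences into one document. A minimal sketch of that grouping follows. The function name is hypothetical, and joining the sentences of a batch with newlines is an assumption about how the document text might be assembled.

```python
from itertools import islice


def batch_sentences(sentences, sentences_per_cas=100):
    """Yield one document text per group of consecutive sentences."""
    iterator = iter(sentences)
    while True:
        batch = list(islice(iterator, sentences_per_cas))
        if not batch:
            return
        # Assumption: sentences in one batch are separated by newlines.
        yield "\n".join(batch)


batches = list(batch_sentences(["a", "b", "c", "d", "e"], sentences_per_cas=2))
```

The last batch may hold fewer sentences than the configured size when the input does not divide evenly.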

LIF

Lif

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-lif-asl

The LAPPS Interchange Format (LIF) is a JSON-based format which is used by the Language Applications Grid. Although the format is in principle generic, the support for it is based on the LAPPS Web Service Exchange Vocabulary.

Example

{
  "id": "v2",
  "metadata": {
     "contains": {
       "Token": {
         "producer": "org.anc.lapps.stanford.SATokenizer:1.4.0",
         "type": "tokenization:stanford" },
       "Token#pos": {
         "producer": "org.anc.lapps.stanford.SATagger:1.4.0",
         "posTagSet": "penn",
         "type": "postagging:stanford" }}},
  "annotations": [
     { "@type": "Token", "id": "tok0", "start": 0, "end": 4, "features": { "pos": "NNP" } },
     { "@type": "Token", "id": "tok1", "start": 5, "end": 10, "features": { "pos": "VBZ" } },
     { "@type": "Token", "id": "tok2", "start": 10, "end": 11, "features": { "pos": "." } } ]
}
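
Since LIF is plain JSON, the example can be inspected with any JSON library. The sketch below loads an abbreviated copy of the example above and lists the token annotations with their offsets and part-of-speech features. It only illustrates the structure shown here and is unrelated to how LifReader is implemented.

```python
import json

lif = json.loads("""
{
  "id": "v2",
  "metadata": {
    "contains": {
      "Token#pos": { "posTagSet": "penn", "type": "postagging:stanford" }
    }
  },
  "annotations": [
    { "@type": "Token", "id": "tok0", "start": 0, "end": 4, "features": { "pos": "NNP" } },
    { "@type": "Token", "id": "tok1", "start": 5, "end": 10, "features": { "pos": "VBZ" } },
    { "@type": "Token", "id": "tok2", "start": 10, "end": 11, "features": { "pos": "." } }
  ]
}
""")

# The tag set used for the "pos" feature is declared in the metadata.
tag_set = lif["metadata"]["contains"]["Token#pos"]["posTagSet"]

# Collect (id, start, end, pos) for every Token annotation.
tokens = [
    (a["id"], a["start"], a["end"], a["features"].get("pos"))
    for a in lif["annotations"]
    if a["@type"] == "Token"
]
```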

LifReader

Implementation

org.dkpro.core.io.lif.LifReader

Description

Reader for the LIF format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 55. Capabilities
Media types	application/x.org.dkpro.lif+json
Outputs	DocumentMetaData NamedEntity Paragraph Sentence Token Constituent Dependency

LifWriter

Implementation

org.dkpro.core.io.lif.LifWriter

Description

Writer for the LIF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.lif`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.lif`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
wrapAsDataObject	Wrap as data object. Type: Boolean — Default value: `false`
writeTimestamp	Write timestamp to view. Type: Boolean — Default value: `true`

Table 56. Capabilities
Media types	application/x.org.dkpro.lif+json
Inputs	DocumentMetaData NamedEntity Paragraph Sentence Token Constituent Dependency

LXF

Lxf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-lxf-asl

LxfReader

Implementation

org.dkpro.core.io.lxf.LxfReader

Description

Reader for the CLARINO LAP LXF format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 57. Capabilities
Media types	application/x.org.dkpro.lxf+json
Outputs	POS DocumentMetaData Lemma Sentence Token Dependency

LxfWriter

Implementation

org.dkpro.core.io.lxf.LxfWriter

Description

Writer for the CLARINO LAP LXF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
delta	Write only the changes to the annotations. This works only in conjunction with the LxfReader. Type: Boolean — Default value: `false`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.lxf`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 58. Capabilities
Media types	application/x.org.dkpro.lxf+json
Inputs	POS DocumentMetaData Lemma Sentence Token Dependency

Mallet

MalletLdaTopicProportions

Group ID	org.dkpro.core
Artifact ID	dkpro-core-mallet-asl

MalletLdaTopicProportionsWriter

Implementation

org.dkpro.core.mallet.lda.io.MalletLdaTopicProportionsWriter

Description

Write topic proportions to a file in the shape [<docId>\t]<topic_1>\t<topic_2>\t...

This writer depends on the TopicDistribution annotation which needs to be created by MalletLdaTopicModelInferencer before.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	If #PARAM_SINGULAR_TARGET is set to false (default), this extension will be appended to the output files. Type: String — Default value: `.topics`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`
writeDocid	If set to true (default), each output line is preceded by the document id. Type: Boolean — Default value: `true`

Table 59. Capabilities
Media types	none specified
Inputs	none specified
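
One output line of this writer can be pictured as a tab-separated list of topic proportions, preceded by the document ID when `writeDocid` is enabled. The sketch below is an illustration only: the function name is hypothetical and the four-decimal number formatting is an assumption.

```python
def format_topic_line(doc_id, proportions, write_docid=True):
    """Build one tab-separated line of topic proportions."""
    # Assumption: proportions are printed with four decimal places.
    fields = [f"{p:.4f}" for p in proportions]
    if write_docid:
        fields.insert(0, doc_id)
    return "\t".join(fields)


line = format_topic_line("doc1", [0.5, 0.25, 0.25])
```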

MalletLdaTopicsProportionsSorted

Group ID	org.dkpro.core
Artifact ID	dkpro-core-mallet-asl

MalletLdaTopicsProportionsSortedWriter

Implementation

org.dkpro.core.mallet.lda.io.MalletLdaTopicsProportionsSortedWriter

Description

Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletLdaTopicModelInferencer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
nTopics	Number of topics to generate. Type: Integer — Default value: `3`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 60. Capabilities
Media types	none specified
Inputs	none specified

NEGRA

NegraExport

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-negra-asl

NegraExportReader

Implementation

org.dkpro.core.io.negra.NegraExportReader

Description

This CollectionReader reads a file which is formatted in the NEGRA export format. The texts and additional information such as constituent structure are reproduced in CASes, one CAS per text (article).

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
collectionId	The collection ID to be written to the document meta data. Optional — Type: String
documentUnit	Determines when a new CAS is started. E.g., if set to DocumentUnit#ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. Type: String — Default value: `ORIGIN_NAME`
generateNewIds	If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. Type: Boolean — Default value: `false`
language	The language. Optional — Type: String
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
readLemma	Write lemma information. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech information. Type: Boolean — Default value: `true`
readPennTree	Write Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set. Type: Boolean — Default value: `false`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Type: String

Table 61. Capabilities
Media types	application/x.org.dkpro.negra3 application/x.org.dkpro.negra4
Outputs	POS DocumentMetaData Lemma Sentence Token Constituent
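
The effect of `documentUnit` set to `ORIGIN_NAME` can be illustrated with a small, library-independent sketch. It does not use the NegraExportReader itself; the class `OriginGrouping` and the pair layout (origin name, sentence text) are assumptions made for illustration only. A new document is started whenever the origin name differs from that of the previous sentence, so a recurring origin name opens a new document rather than being merged with an earlier one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DKPro Core: groups sentences into
// documents whenever the origin name changes between consecutive sentences.
final class OriginGrouping {
    // Each input element is {originName, sentenceText}.
    static List<List<String>> group(List<String[]> sentences) {
        List<List<String>> documents = new ArrayList<>();
        String lastOrigin = null;
        for (String[] sentence : sentences) {
            String origin = sentence[0];
            // Start a new document on the first sentence or on an origin change.
            if (documents.isEmpty() || !origin.equals(lastOrigin)) {
                documents.add(new ArrayList<>());
            }
            documents.get(documents.size() - 1).add(sentence[1]);
            lastOrigin = origin;
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String[]> input = Arrays.asList(
                new String[] { "a", "s1" },
                new String[] { "a", "s2" },
                new String[] { "b", "s3" },
                new String[] { "a", "s4" });
        System.out.println(group(input));
    }
}
```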

New York Times Corpus

Nitf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-nitf-asl

NitfReader

Implementation

org.dkpro.core.io.nitf.NitfReader

Description

Reader for the News Industry Text Format (NITF). It was developed primarily to work with the New York Times Annotated Corpus.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
offset	The number of documents to skip at the beginning. Optional — Type: Integer
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 62. Capabilities
Media types	application/x.org.dkpro.nitf+xml
Outputs	DocumentMetaData ArticleMetaData
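
The `patterns` parameter, which this reader shares with most file-based readers in this reference, combines include and exclude patterns. The following sketch approximates the matching with plain regular expressions. It is not the matcher used by DKPro Core: the class `PatternFilter` is hypothetical, it only handles `**`, `*` and `?`, and treating unprefixed patterns as includes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical approximation of Ant-like include/exclude matching.
final class PatternFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    PatternFilter(String... patterns) {
        for (String p : patterns) {
            if (p.startsWith("[+]")) {
                includes.add(toRegex(p.substring(3)));
            } else if (p.startsWith("[-]")) {
                excludes.add(toRegex(p.substring(3)));
            } else {
                includes.add(toRegex(p)); // assumption: no prefix means include
            }
        }
    }

    // Translates an Ant-like pattern into a regular expression over '/' paths.
    static Pattern toRegex(String antPattern) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < antPattern.length()) {
            if (antPattern.startsWith("**/", i)) {
                sb.append("(?:.*/)?"); // any number of sub-directories
                i += 3;
            } else if (antPattern.startsWith("**", i)) {
                sb.append(".*");
                i += 2;
            } else {
                char c = antPattern.charAt(i);
                if (c == '*') {
                    sb.append("[^/]*"); // part of a single name
                } else if (c == '?') {
                    sb.append("[^/]");
                } else {
                    sb.append(Pattern.quote(String.valueOf(c)));
                }
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    // A path is accepted if some include matches and no exclude matches.
    boolean accepts(String relativePath) {
        boolean included = includes.stream().anyMatch(p -> p.matcher(relativePath).matches());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(relativePath).matches());
        return included && !excluded;
    }

    public static void main(String[] args) {
        PatternFilter filter = new PatternFilter("[+]**/*.xml", "[-]**/draft-*.xml");
        System.out.println(filter.accepts("2007/01/a.xml"));
        System.out.println(filter.accepts("2007/draft-a.xml"));
    }
}
```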

NIF

Nif

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-nif-asl

The NLP Interchange Format (NIF) provides a way of representing NLP information using semantic web technology, specifically RDF and OWL. A few additions to the format were defined in the apparently unofficial NIF 2.1 specification.

Example

@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nif:   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://example.org/document0#char=0,86>
        a               nif:RFC5147String , nif:String , nif:Context ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "86"^^xsd:nonNegativeInteger ;
        nif:isString    "Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands."^^xsd:string ;
        nif:topic       <http://example.org/document0#annotation0> .

<http://example.org/document0#char=0,5>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "Japan"^^xsd:string ;
        nif:beginIndex        "0"^^xsd:nonNegativeInteger ;
        nif:endIndex          "5"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://example.org/document0#char=0,86> ;
        itsrdf:taClassRef     <http://example.org/Country> , <http://example.org/StratovolcanicArchipelago> ;
        itsrdf:taIdentRef     <http://example.org/Japan> .

<http://example.org/document0#char=42,68>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "stratovolcanic archipelago"^^xsd:string ;
        nif:beginIndex        "42"^^xsd:nonNegativeInteger ;
        nif:endIndex          "68"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://example.org/document0#char=0,86> ;
        itsrdf:taClassRef     <http://example.org/Archipelago> , rdfs:Class ;
        itsrdf:taIdentRef     <http://example.org/StratovolcanicArchipelago> .

<http://example.org/document0#annotation0>
        a                  nif:Annotation ;
        itsrdf:taIdentRef  <http://example.org/Geography> .
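
In the example above, each annotated string is identified by a `#char=begin,end` fragment, and its `nif:anchorOf` value is the corresponding slice of the context string. The sketch below reproduces this relationship in plain Java. The class `NifOffsets` is hypothetical, and the code assumes that the offsets count Unicode code points; for the example sentence this coincides with Java `char` indices.

```java
// Hypothetical helper illustrating NIF character offsets; not part of DKPro Core.
final class NifOffsets {
    // Returns the text between two code point offsets of the context string.
    static String anchorOf(String context, int begin, int end) {
        int from = context.offsetByCodePoints(0, begin);
        int to = context.offsetByCodePoints(from, end - begin);
        return context.substring(from, to);
    }

    // Builds the fragment-based URI used to identify a string in the example.
    static String charUri(String documentUri, int begin, int end) {
        return documentUri + "#char=" + begin + "," + end;
    }

    public static void main(String[] args) {
        // \u65e5\u672c are the two characters given after "Japanese:" in the example.
        String text = "Japan (Japanese: \u65e5\u672c Nippon or Nihon) is a "
                + "stratovolcanic archipelago of 6,852 islands.";
        System.out.println(charUri("http://example.org/document0", 42, 68));
        System.out.println(anchorOf(text, 42, 68));
    }
}
```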

Known corpora in this format

NifReader

Implementation

org.dkpro.core.io.nif.NifReader

Description

Reader for the NLP Interchange Format (NIF). The file format (e.g. Turtle) is chosen automatically based on the name of the file(s) being read. Compressed files are supported.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 63. Capabilities
Media types	application/x.org.dkpro.nif+turtle
Outputs	POS DocumentMetaData NamedEntity Heading Lemma Paragraph Sentence Stem Token

NifWriter

Implementation

org.dkpro.core.io.nif.NifWriter

Description

Writer for the NLP Interchange Format (NIF).

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.ttl`. The file format is chosen depending on the file suffix. Type: String — Default value: `.ttl`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 64. Capabilities
Media types	application/x.org.dkpro.nif+turtle
Inputs	POS DocumentMetaData NamedEntity Heading Lemma Paragraph Sentence Stem Token

PDF

Pdf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-pdf-asl

PdfReader

Implementation

org.dkpro.core.io.pdf.PdfReader

Description

Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.

Parameters

endPage	The last page to be extracted from the PDF. Optional — Type: Integer — Default value: `-1`
headingType	The type used to annotate headings. Optional — Type: String — Default value: `<built-in>`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
paragraphType	The type used to annotate paragraphs. Optional — Type: String — Default value: `<built-in>`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
startPage	The first page to be extracted from the PDF. Optional — Type: Integer — Default value: `-1`
substitutionTableLocation	The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters. Optional — Type: String — Default value: `<built-in>`
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 65. Capabilities
Media types	application/pdf
Outputs	DocumentMetaData Heading Paragraph
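
The post-processing configured via `substitutionTableLocation` converts, for example, ligatures to separate characters. The sketch below shows the idea with an in-memory table. The class `LigatureSubstitution` and its two-entry table are hypothetical; the file format of the actual substitution table and the contents of the built-in table are not described here and are not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of ligature substitution; not the PdfReader implementation.
final class LigatureSubstitution {
    // A tiny example table mapping ligature characters to separate letters.
    static Map<String, String> exampleTable() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("\uFB01", "fi"); // LATIN SMALL LIGATURE FI
        table.put("\uFB02", "fl"); // LATIN SMALL LIGATURE FL
        return table;
    }

    // Applies every substitution of the table to the extracted text.
    static String substitute(String text, Map<String, String> table) {
        String result = text;
        for (Map.Entry<String, String> entry : table.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(substitute("\uFB01nal \uFB02ow", exampleTable()));
    }
}
```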

Penn Treebank Format

PennTreebankChunked

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-penntree-asl

PennTreebankChunkedReader

Implementation

org.dkpro.core.io.penntree.PennTreebankChunkedReader

Description

Penn Treebank chunked format reader.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Write chunk annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 66. Capabilities
Media types	text/x.org.dkpro.ptb-chunked
Outputs	POS DocumentMetaData Sentence Token Chunk

PennTreebankCombined

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-penntree-asl

Known corpora in this format

Floresta Sintá(c)tica (Bosque) - Portuguese

PennTreebankCombinedReader

Implementation

org.dkpro.core.io.penntree.PennTreebankCombinedReader

Description

Penn Treebank combined format reader.

Parameters

ConstituentMappingLocation	Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ConstituentTagSet	Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readPOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `true`
removeTraces	Whether to remove traces from the parse tree. Optional — Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
writeTracesToText	Whether to render traces into the document text. Optional — Type: Boolean — Default value: `false`

Table 67. Capabilities
Media types	text/x.org.dkpro.ptb-combined
Outputs	POS DocumentMetaData Sentence Token Constituent
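
To make the `removeTraces` option more concrete, the sketch below extracts the (tag, word) leaves from a bracketed Penn Treebank tree and optionally skips traces, which the Penn Treebank marks with the tag `-NONE-`. It is an illustration only and not the parser used by PennTreebankCombinedReader; the class `PtbLeaves` and the sample tree are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the (tag, word) leaves of a bracketed tree.
final class PtbLeaves {
    // A leaf has the form "(TAG word)" and contains no nested brackets.
    private static final Pattern LEAF =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    static List<String[]> leaves(String tree, boolean removeTraces) {
        List<String[]> result = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            String tag = m.group(1);
            String word = m.group(2);
            // Traces are empty elements tagged -NONE- in the Penn Treebank.
            if (removeTraces && "-NONE-".equals(tag)) {
                continue;
            }
            result.add(new String[] { tag, word });
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "(S (NP-SBJ (-NONE- *)) (VP (VB Leave) (NP (PRP it))) (. .))";
        for (String[] leaf : leaves(tree, true)) {
            System.out.println(leaf[1] + "/" + leaf[0]);
        }
    }
}
```

Because the sketch relies on brackets to delimit leaves, it also shows why tag sets containing "(" or ")" are problematic for bracketed output.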

PennTreebankCombinedWriter

Implementation

org.dkpro.core.io.penntree.PennTreebankCombinedWriter

Description

Penn Treebank combined format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
emptyRootLabel	Whether to force the root label to be empty. Type: Boolean — Default value: `false`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.mrg`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.mrg`
noRootLabel	Whether to remove the root node. This is only possible if the root node has only a single child (i.e. a sentence node). Type: Boolean — Default value: `false`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 68. Capabilities
Media types	text/x.org.dkpro.ptb-combined
Inputs	POS DocumentMetaData Sentence Token Constituent

Perseus Treebank

Perseus

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-perseus-asl

An XML format used by the Perseus Ancient Greek and Latin Dependency Treebank.

Example (excerpt from tlg0013.tlg002.perseus-grc1.tb.xml)

<treebank version="2.1" xml:lang="grc" cts="urn:cts:greekLit:tlg0013.tlg002.perseus-grc1.tb">
  <body>
    <sentence id="2" document_id="urn:cts:greekLit:tlg0013.tlg002.perseus-grc1" subdoc="1-495">
      <word id="1" form="σέβας" lemma="σέβας" postag="n-s---nn-" relation="PNOM" sg="nmn dpd" gloss="object.of.wonder" head="13"/>
      <word id="2" form="τό" lemma="ὁ" postag="p-s---nn-" relation="SBJ" sg="sbs nmn dpd" gloss="this" head="13"/>
      <word id="3" form="γε" lemma="γε" postag="d--------" relation="AuxY" sg="prt" gloss="indeed" head="13"/>
      <word id="4" form="πᾶσιν" lemma="πᾶς" postag="a-p---md-" relation="ATR" sg="prp" gloss="all" head="9"/>
      <word id="5" form="ἰδέσθαι" lemma="εἶδον" postag="v--anm---" relation="ATR" sg="dpd vrb as_nmn not_ind" gloss="see" head="1"/>
      <word id="6" form="ἀθανάτοις" lemma="ἀθάνατος" postag="a-p---md-" relation="ATR" sg="prp" gloss="immortal" head="8"/>
      <word id="7" form="τε" lemma="τε" postag="c--------" relation="AuxY" sg="" gloss="and" head="9"/>
      <word id="8" form="θεοῖς" lemma="θεός" postag="n-p---md-" relation="ADV_CO" sg="dtv dpd prp int adv" gloss="god" head="9"/>
      <word id="9" form="ἠδὲ" lemma="ἠδέ" postag="c--------" relation="COORD" sg="" gloss="and" head="13"/>
      <word id="10" form="θνητοῖς" lemma="θνητός" postag="a-p---md-" relation="ATR" sg="prp" gloss="mortal" head="11"/>
      <word id="11" form="ἀνθρώποις" lemma="ἄνθρωπος" postag="n-p---md-" relation="ADV_CO" sg="dtv dpd prp int adv" gloss="man" head="9"/>
      <word id="12" form="·" lemma="·" postag="u--------" relation="AuxK" sg="" head="0"/>
      <word id="13" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v3spia---" form="ἐστι" sg="ind stt" gloss="be" head="0"/>
    </sentence>
  </body>
</treebank>

PerseusReader

Implementation

org.dkpro.core.io.perseus.PerseusReader

Description

Reader for the Perseus Treebank XML format.
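
The `word` elements in the example above carry the token (`form`), lemma, part-of-speech tag (`postag`) and dependency information (`relation`, `head`) as attributes. The following sketch reads these attributes with the JDK's DOM parser. It is not the PerseusReader implementation; the class `PerseusWords` and the shortened sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists form, postag, relation and head of each word.
final class PerseusWords {
    static List<String> dependencies(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList words = doc.getElementsByTagName("word");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < words.getLength(); i++) {
                Element word = (Element) words.item(i);
                // The head attribute refers to the id of the governing word.
                result.add(word.getAttribute("form") + "/" + word.getAttribute("postag")
                        + " --" + word.getAttribute("relation") + "--> "
                        + word.getAttribute("head"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse treebank XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened sample based on word 3 of the example sentence.
        String xml = "<treebank><body><sentence id=\"2\">"
                + "<word id=\"3\" form=\"\u03b3\u03b5\" lemma=\"\u03b3\u03b5\" "
                + "postag=\"d--------\" relation=\"AuxY\" head=\"13\"/>"
                + "</sentence></body></treebank>";
        dependencies(xml).forEach(System.out::println);
    }
}
```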

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readDependency	Read syntactic dependency information. Type: Boolean — Default value: `true`
readLemma	Read lemma information. Type: Boolean — Default value: `true`
readPOS	Read fine-grained part-of-speech information. Type: Boolean — Default value: `true`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 69. Capabilities
Media types	application/x.org.dkpro.perseus+xml
Outputs	POS DocumentMetaData Lemma Sentence Token Dependency

PubAnnotation

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-pubannotation-asl

PubAnnotationReader

Implementation

org.dkpro.core.io.pubannotation.PubAnnotationReader

Description

Reader for the PubAnnotation format. Since the PubAnnotation format only associates spans/relations with simple values and since annotations are not typed, it is necessary to define target types and features via #PARAM_SPAN_TYPE and #PARAM_SPAN_LABEL_FEATURE. In PubAnnotation, every annotation has an ID. If the target type has a suitable feature to retain the ID, it can be configured via #PARAM_SPAN_ID_FEATURE. The sourcedb and sourceid from the PubAnnotation document are imported as the collectionId and documentId of the DocumentMetaData, respectively. If present, the target is also imported as the documentUri. The documentBaseUri is cleared in this case.

Currently, only span annotations are supported, i.e. no relations or modifications. Discontinuous segments are also not supported.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
resolveNamespaces	Whether to resolve namespaces. Type: Boolean — Default value: `false`
sourceLocation	Location from which the input is read. Optional — Type: String
spanIdFeature	The feature on the span annotation type which receives the ID. Optional — Type: String
spanLabelFeature	The feature on the span annotation type which receives the label. Optional — Type: String
spanType	The span annotation type to which the PubAnnotation spans are mapped. Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 70. Capabilities
Media types	application/x.org.dkpro.pubannotation+json
Outputs	DocumentMetaData

PubAnnotationWriter

Implementation

org.dkpro.core.io.pubannotation.PubAnnotationWriter

Description

Writer for the PubAnnotation format. Since the PubAnnotation format only associates spans/relations with simple values and since annotations are not typed, it is necessary to define target types and features via #PARAM_SPAN_TYPE and #PARAM_SPAN_LABEL_FEATURE. In PubAnnotation, every annotation has an ID. If the annotation type has an ID feature, it can be configured via #PARAM_SPAN_ID_FEATURE. If this parameter is not set, the IDs are generated automatically. The sourcedb and sourceid of the PubAnnotation document are exported from the collectionId and documentId of the DocumentMetaData, respectively. The target is exported from the documentUri.

Currently, only span annotations are supported, i.e. no relations or modifications. Discontinuous segments are also not supported.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.json`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.json`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
spanIdFeature	The feature on the span annotation type which receives the ID. Optional — Type: String
spanLabelFeature	The feature on the span annotation type which receives the label. Optional — Type: String
spanType	The span annotation type to which the PubAnnotation spans are mapped. Type: String
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if relative path information is present. Type: Boolean — Default value: `false`

Table 71. Capabilities
Media types	application/x.org.dkpro.pubannotation+json
Inputs	DocumentMetaData
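
As a rough illustration of what a PubAnnotation document with span annotations looks like, the sketch below serializes a text, its sourcedb/sourceid and a list of labelled spans to JSON by hand. It is not the PubAnnotationWriter implementation: the class `PubAnnotationSketch` is hypothetical, the `T1`, `T2`, ... scheme for generated IDs is an assumption, and the minimal escaping covers only quotes and backslashes.

```java
// Hypothetical sketch of a PubAnnotation-style JSON document with span annotations.
final class PubAnnotationSketch {
    // Minimal JSON string escaping (quotes and backslashes only).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // spans[i] = {begin, end}; labels[i] is the label of span i.
    static String toJson(String sourceDb, String sourceId, String text,
            int[][] spans, String[] labels) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"sourcedb\":\"").append(escape(sourceDb)).append("\",");
        sb.append("\"sourceid\":\"").append(escape(sourceId)).append("\",");
        sb.append("\"text\":\"").append(escape(text)).append("\",");
        sb.append("\"denotations\":[");
        for (int i = 0; i < spans.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Assumption: IDs are generated as T1, T2, ... when none is available.
            sb.append("{\"id\":\"T").append(i + 1).append("\",");
            sb.append("\"span\":{\"begin\":").append(spans[i][0])
                    .append(",\"end\":").append(spans[i][1]).append("},");
            sb.append("\"obj\":\"").append(escape(labels[i])).append("\"}");
        }
        sb.append("]}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("PubMed", "123", "IL-2 binds.",
                new int[][] { { 0, 4 } }, new String[] { "Protein" }));
    }
}
```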

Reuters-21578

Reuters21578Sgml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-reuters-asl

Reuters21578SgmlReader

Implementation

org.dkpro.core.io.reuters.Reuters21578SgmlReader

Description

Read a Reuters-21578 corpus in SGML format.

Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 72. Capabilities
Media types	application/x.org.dkpro.reuters21578+sgml
Outputs	DocumentMetaData

Reuters21578Txt

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-reuters-asl

Reuters21578TxtReader

Implementation

org.dkpro.core.io.reuters.Reuters21578TxtReader

Description

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

The #PARAM_SOURCE_LOCATION parameter should typically point to the file name pattern reut2-*.txt, preceded by the corpus root directory.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 73. Capabilities
Media types	text/x.org.dkpro.reuters21578
Outputs	DocumentMetaData

RTF

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-rtf-asl

RTFReader

Implementation

org.dkpro.core.io.rtf.RTFReader

Description

Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information is added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` if it is an include pattern and with `[-]` if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 74. Capabilities
Media types	application/rtf text/rtf
Outputs	DocumentMetaData

Solr

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-solr-asl

SolrWriter

Implementation

org.dkpro.core.io.solr.SolrWriter

Description

A simple implementation of SolrWriter_ImplBase

Parameters

numThreads	The number of background threads used to empty the queue. Type: Integer — Default value: `1`
optimizeIndex	If set to true, the index is optimized once all documents are uploaded. Default is false. Type: Boolean — Default value: `false`
queueSize	The buffer size before the documents are sent to the server (default: 10000). Type: Integer — Default value: `10000`
solrIdField	The name of the id field in the Solr schema (default: "id"). Type: String — Default value: `id`
targetLocation	Solr server URL string in the form `<protocol>://<host>:<port>/<path>`, e.g. http://localhost:8983/solr/collection1 Type: String
textField	The name of the text field in the Solr schema (default: "text"). Type: String — Default value: `text`
update	Define whether existing documents with the same ID are updated (true) or overwritten (false). Type: Boolean — Default value: `true`
waitFlush	When committing to the index, i.e. when all documents are processed, block until index changes are flushed to disk? Type: Boolean — Default value: `true`
waitSearcher	When committing to the index, i.e. when all documents are processed, block until a new searcher is opened and registered as the main query searcher, making the changes visible? Type: Boolean — Default value: `true`
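
The example URL given for `targetLocation` breaks down into protocol, host, port and path. The following is a minimal sketch using only `java.net.URI` from the JDK; the class and method names are hypothetical and not part of SolrWriter.

```java
import java.net.URI;

class SolrUrlSketch {
    // Splits a server URL into the parts that make up a targetLocation value.
    static String describe(String url) {
        URI uri = URI.create(url);
        return "protocol=" + uri.getScheme() + " host=" + uri.getHost()
                + " port=" + uri.getPort() + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("http://localhost:8983/solr/collection1"));
    }
}
```

For the example URL, the path component is the part that names the Solr collection.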

Table 75. Capabilities
Media types	none specified
Inputs	none specified

TCF

Tcf

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tcf-asl

The TCF (Text Corpus Format) was created in the context of the CLARIN project. It is mainly used to exchange data between the different web-services that are part of the WebLicht platform.

TcfReader

Implementation

org.dkpro.core.io.tcf.TcfReader

Description

Reader for the WebLicht TCF format. It reads all available annotation layers from the TCF file and converts them to CAS annotations. TCF does not provide begin/end offsets for all of its annotations, but CAS annotations require them. Hence, offsets are calculated per token and stored in a map from token ID to the token's CAS object, from which the offsets are later looked up.
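
The offset bookkeeping described above can be sketched in plain Java. This is an illustration, not the TcfReader code: the class and method names are hypothetical, offsets are kept as `{begin, end}` arrays instead of CAS token objects, and the document text is assumed to be rebuilt by joining tokens with single spaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TcfOffsetSketch {
    // Maps each token ID to its {begin, end} character offsets while the
    // document text is rebuilt (assumption: tokens joined by single spaces).
    static Map<String, int[]> buildOffsets(String[] ids, String[] forms, StringBuilder text) {
        Map<String, int[]> offsets = new LinkedHashMap<>();
        for (int i = 0; i < ids.length; i++) {
            if (text.length() > 0) {
                text.append(' ');
            }
            int begin = text.length();
            text.append(forms[i]);
            offsets.put(ids[i], new int[] { begin, text.length() });
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] ids = { "t1", "t2", "t3" };
        String[] forms = { "Hello", "TCF", "!" };
        StringBuilder text = new StringBuilder();
        Map<String, int[]> offsets = buildOffsets(ids, forms, text);
        // A layer that only refers to token "t2" can now be anchored in the text.
        int[] span = offsets.get("t2");
        System.out.println(text.substring(span[0], span[1]));
    }
}
```

Annotations in other layers that reference one or more token IDs can then take their begin offset from the first referenced token and their end offset from the last.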

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 76. Capabilities
Media types	text/tcf+xml
Outputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency SofaChangeAnnotation

TcfWriter

Implementation

org.dkpro.core.io.tcf.TcfWriter

Description

Writer for the WebLicht TCF format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.tcf`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.tcf`
merge	Merge with source TCF file if one is available. Type: Boolean — Default value: `true`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
preserveIfEmpty	If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF. Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
tcfVersion	TCF version. Type: String — Default value: `0.4`
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 77. Capabilities
Media types	text/tcf+xml
Inputs	CoreferenceChain CoreferenceLink POS DocumentMetaData NamedEntity Lemma Sentence Token Dependency SofaChangeAnnotation

TEI

Tei

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tei-asl

The TEI P5 XML format is a widely used standard format. It is very complex and is often extended for specific corpora. The reader and writer components offered by DKPro Core support various common element types, but far from all of them.

Known corpora in this format

TeiReader

Implementation

org.dkpro.core.io.tei.TeiReader

Description

Reader for the TEI XML.

Supported TEI XML elements and attributes
Element	Description	DKPro Core type	Attribute mappings
`TEI`	document boundary	`getNext(...)` returns one TEI document at a time
`title`	document title	DocumentMetaData
`s`	s-unit	Sentence
`u`	utterance	Sentence
`p`	paragraph	Paragraph
`rs`	referencing string	NamedEntity	`type` -> value
`phr`	phrase	Constituent	`type` -> constituentType, `function` -> syntacticFunction
`w`	word	Token	(`pos`, `type`) -> POS.PosValue (`pos` preferred over `type`)
`mw`	multi-word	Token	same as for `w`
`c`	character	Token	same as for `w`

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
elementsToTrim	Trim the given elements (remove leading and trailing whitespace). DKPro Core usually expects annotations to start and end at a non-whitespace character. Type: String[] — Default value: `[s, u, p, rs, w, c, mw]`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
omitIgnorableWhitespace	Do not write ignorable whitespace from the XML file to the CAS. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readConstituent	Write constituent annotations to the CAS. Type: Boolean — Default value: `true`
readLemma	Write lemma annotations to the CAS. Type: Boolean — Default value: `true`
readNamedEntity	Write named entity annotations to the CAS. Type: Boolean — Default value: `true`
readPOS	Write part-of-speech annotations to the CAS. Type: Boolean — Default value: `true`
readParagraph	Write paragraph annotations to the CAS. Type: Boolean — Default value: `true`
readSentence	Write sentence annotations to the CAS. Type: Boolean — Default value: `true`
readToken	Write token annotations to the CAS. Type: Boolean — Default value: `true`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
useFilenameId	When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case. Type: Boolean — Default value: `false`
useXmlId	Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique. Type: Boolean — Default value: `false`
utterancesAsSentences	Interpret utterances "u" as sentences "s". (EXPERIMENTAL) Type: Boolean — Default value: `false`

Table 78. Capabilities
Media types	application/tei+xml
Outputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent

TeiWriter

Implementation

org.dkpro.core.io.tei.TeiWriter

Description

UIMA CAS consumer writing the CAS document text in TEI format.

Parameters

cTextPattern	A token matching this pattern is rendered as a TEI "c" element instead of a "w" element. Type: String — Default value: [,.:;()]|(``)|('')|(--)
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.xml`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xml`
indent	Indent the XML. Type: Boolean — Default value: `false`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
writeConstituent	Write constituent annotations to the output. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens). Type: Boolean — Default value: `false`
writeNamedEntity	Write named entity annotations to the output. Overlapping named entities are not supported. Type: Boolean — Default value: `true`
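
The effect of `cTextPattern` can be tried out with `java.util.regex` alone. In this minimal sketch the regex literal is the listed default with its alternatives separated by a plain `|`. The class and method names are hypothetical, and the sketch assumes that the whole token has to match the pattern, which may differ from what TeiWriter does internally.

```java
import java.util.regex.Pattern;

class CTextPatternSketch {
    // Default value of cTextPattern as listed in the parameter table.
    static final Pattern C_TEXT = Pattern.compile("[,.:;()]|(``)|('')|(--)");

    // Returns the TEI element name for a token, assuming a full match
    // of the token against the pattern is required.
    static String elementFor(String token) {
        return C_TEXT.matcher(token).matches() ? "c" : "w";
    }

    public static void main(String[] args) {
        for (String token : new String[] { "Hello", ",", "--", "world", "." }) {
            String element = elementFor(token);
            System.out.println("<" + element + ">" + token + "</" + element + ">");
        }
    }
}
```

Under that assumption, punctuation tokens such as `,` or `--` end up in "c" elements, while ordinary words and a single hyphen stay in "w" elements.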

Table 79. Capabilities
Media types	application/tei+xml
Inputs	POS DocumentMetaData NamedEntity Lemma Paragraph Sentence Token Constituent

Text

String

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-text-asl

StringReader

Implementation

org.dkpro.core.io.text.StringReader

Description

Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().

Parameters

collectionId	The collection ID to set in the DocumentMetaData. Type: String — Default value: `COLLECTION_ID`
documentBaseUri	The document base URI to set in the DocumentMetaData. Optional — Type: String
documentId	The document ID to set in the DocumentMetaData. Type: String — Default value: `DOCUMENT_ID`
documentText	The document text. Type: String
documentUri	The document URI to set in the DocumentMetaData. Type: String — Default value: `STRING`
language	Set this as the language of the produced documents. Type: String

Table 80. Capabilities
Media types	text/plain
Outputs	DocumentMetaData

Text

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-text-asl

TextReader

Implementation

org.dkpro.core.io.text.TextReader

Description

UIMA collection reader for plain text files.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Name of configuration parameter that contains the character encoding used by the input files. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
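
How `[+]` and `[-]` patterns select files can be approximated in a few lines of plain Java. This is a simplified sketch, not the matcher used by DKPro Core: it assumes the usual Ant conventions (`**/` spans any number of directories, `*` stays within one name), that every pattern carries a prefix, and that an exclude pattern always overrides an include pattern. All names are hypothetical.

```java
import java.util.regex.Pattern;

class PatternSketch {
    // Translates a simplified Ant-style pattern into a regular expression.
    static Pattern toRegex(String antPattern) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < antPattern.length()) {
            if (antPattern.startsWith("**/", i)) {
                regex.append("(?:.*/)?"); // zero or more directories
                i += 3;
            } else if (antPattern.charAt(i) == '*') {
                regex.append("[^/]*"); // part of a single name
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(antPattern.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // A path is selected if some [+] pattern matches it and no [-] pattern does
    // (assumption: excludes win over includes).
    static boolean selected(String path, String... patterns) {
        boolean included = false;
        for (String p : patterns) {
            boolean matches = toRegex(p.substring(3)).matcher(path).matches();
            if (p.startsWith("[-]") && matches) {
                return false;
            }
            if (p.startsWith("[+]") && matches) {
                included = true;
            }
        }
        return included;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]**/*.txt", "[-]**/draft_*" };
        System.out.println(selected("a/b/c.txt", patterns));
        System.out.println(selected("a/draft_c.txt", patterns));
    }
}
```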

Table 81. Capabilities
Media types	text/plain
Outputs	DocumentMetaData

TextWriter

Implementation

org.dkpro.core.io.text.TextWriter

Description

UIMA CAS consumer writing the CAS document text as plain text file.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.txt`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.txt`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 82. Capabilities
Media types	text/plain
Inputs	DocumentMetaData

TokenizedText

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-text-asl

TokenizedTextWriter

Implementation

org.dkpro.core.io.text.TokenizedTextWriter

Description

Write texts into a large file containing one sentence per line and tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written as specified by #PARAM_FEATURE_PATH.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
coveringType	In the output file, each unit of the covering type is written to a separate line. The default (set in #DEFAULT_COVERING_TYPE) is sentences, so that each sentence is written to its own line. If no line breaks within a document are desired, set this value to null. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
extension	Set the output file extension. Type: String — Default value: `.txt`
featurePath	The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value for lemmas. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
numberRegex	Regular expression to match numbers. These are written to the output as NUM. Type: String — Default value: ``
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stopwordsFile	All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored. Type: String — Default value: ``
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Encoding for the target file. Default is UTF-8. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
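
The output layout described for TokenizedTextWriter (one sentence per line, tokens separated by whitespace, numbers written as NUM, stopwords as STOP) can be illustrated with a small stand-alone sketch. It does not use the writer itself; the names are hypothetical, and the single-space separator and the order of the stopword and number checks are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class TokenizedTextSketch {
    // Renders one sentence as a line: tokens joined by single spaces,
    // stopwords (compared case-insensitively) replaced by STOP and
    // tokens matching numberRegex replaced by NUM.
    static String line(List<String> tokens, String numberRegex, Set<String> stopwords) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String token : tokens) {
            if (stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                joiner.add("STOP");
            } else if (!numberRegex.isEmpty() && token.matches(numberRegex)) {
                joiner.add("NUM");
            } else {
                joiner.add(token);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "in"));
        List<String> sentence = Arrays.asList("The", "corpus", "has", "42", "texts", ".");
        System.out.println(line(sentence, "\\d+", stopwords));
    }
}
```

With an empty `numberRegex` and no stopwords, the sketch simply joins the tokens, which corresponds to the listed defaults.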

Table 83. Capabilities
Media types	text/plain
Inputs	DocumentMetaData

TGrep2

TGrep

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tgrep-gpl

TGrep and TGrep2 are tools for searching syntactic parse trees represented as bracketed structures. This module specifically supports TGrep2 and makes it convenient to generate TGrep2 indexes, which can then be searched. Search itself is not supported by this module.

See also

TGrep2

TGrepWriter

Implementation

org.dkpro.core.io.tgrep.TGrepWriter

Description

TGrep2 corpus file writer. Requires PennTree annotations to be present in the CAS.

Parameters

compression	Method to compress the tgrep file (only used if PARAM_WRITE_T2C is true). Only NONE, GZIP and BZIP2 are supported. Type: String — Default value: `NONE`
dropMalformedTrees	If true, silently drops malformed Penn Trees instead of throwing an exception. Type: Boolean — Default value: `false`
targetLocation	Path to which the output is written. Type: String
writeComments	Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Type: Boolean — Default value: `true`
writeT2c	Set this parameter to true if you want to encode directly into the tgrep2 binary format. Type: Boolean — Default value: `true`

Table 84. Capabilities
Media types	application/x.org.dkpro.tgrep2
Inputs	PennTree

TIGER-XML

TigerXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tiger-asl

The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format was later extended to also support semantic frame annotations.

Known corpora in this format

Floresta Sintá(c)tica (Bosque) - Portuguese
Semeval-2 Task 10 - (extended format)
Składnica frazowa - Polish
Swedish Treebank - Swedish
Talbanken05 - Swedish
TIGER - German

TigerXmlReader

Implementation

org.dkpro.core.io.tiger.TigerXmlReader

Description

UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
ignoreIllegalSentences	If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences. Type: Boolean — Default value: `false`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readPennTree	Write Penn Treebank bracketed structure information. Note that this may not work with all tagsets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tagset. Type: Boolean — Default value: `false`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 85. Capabilities
Media types	application/x.org.dkpro.semeval-2010+xml application/x.org.dkpro.tiger+xml
Outputs	POS DocumentMetaData Lemma Sentence Token SemArg SemPred Constituent

TigerXmlWriter

Implementation

org.dkpro.core.io.tiger.TigerXmlWriter

Description

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.xml`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 86. Capabilities
Media types	application/x.org.dkpro.tiger+xml
Inputs	POS DocumentMetaData Lemma Sentence Token Constituent

Tika

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tika-asl

TikaReader

Implementation

org.dkpro.core.io.tika.TikaReader

Description

Reader for many file formats based on Apache Tika.

Parameters

bufferSize	Internal buffer size. If the buffer size is exceeded, the reader will throw an exception (-1 means unlimited size). Optional — Type: Integer — Default value: `-1`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
parseEmbeddedDocuments	Parse embedded documents in addition to the main document. Optional — Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 87. Capabilities
Media types	none specified
Outputs	DocumentMetaData

TUEBADZ

TuebaDZ

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tuebadz-asl

The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).

Sentences have a header line and are followed by a blank new line.

Table 88. Columns
Column	Type/Feature	Description
FORM	Token	Word form or punctuation symbol.
POSTAG	POS PosValue	Fine-grained part-of-speech tag, where the tagset depends on the language.
CHUNK	Chunk	Chunk tag (BIO encoded). For named entities, it can also include the entity type, e.g. B-NX=ORG.

Example

%% sent no. 1
Veruntreute  VVFIN   B-VXFIN
die          ART     B-NX=ORG
AWO          NN      I-NX=ORG
Spendengeld  NN      B-NX
?            $.      O
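
To make the column layout concrete, the sample above can be split into its three columns with plain Java. This is a minimal sketch and not the TuebaDZReader implementation: the names are hypothetical, and it assumes that columns are separated by runs of whitespace and that header lines start with `%%`.

```java
import java.util.ArrayList;
import java.util.List;

class TuebaDzChunkSketch {
    // Splits the data into rows of {FORM, POSTAG, CHUNK}; header lines
    // starting with "%%" and blank lines are skipped.
    static List<String[]> parse(String data) {
        List<String[]> rows = new ArrayList<>();
        for (String line : data.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%%")) {
                continue;
            }
            rows.add(trimmed.split("\\s+"));
        }
        return rows;
    }

    // Returns the named entity type encoded after "=" in a chunk tag, or null.
    static String entityType(String chunkTag) {
        int eq = chunkTag.indexOf('=');
        return eq < 0 ? null : chunkTag.substring(eq + 1);
    }

    public static void main(String[] args) {
        String data = "%% sent no. 1\n"
                + "Veruntreute  VVFIN   B-VXFIN\n"
                + "die          ART     B-NX=ORG\n"
                + "AWO          NN      I-NX=ORG\n"
                + "Spendengeld  NN      B-NX\n"
                + "?   $.  O\n";
        for (String[] row : parse(data)) {
            System.out.println(row[0] + " / " + row[1] + " / " + row[2]
                    + " / entity type: " + entityType(row[2]));
        }
    }
}
```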

Known corpora in this format

TüBa-D/Z - German

TuebaDZReader

Implementation

org.dkpro.core.io.tuebadz.TuebaDZReader

Description

Reads the TüBa-D/Z chunking format.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ChunkTagSet	Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
readChunk	Read chunk information. Type: Boolean — Default value: `true`
readNamedEntity	Read named entity information. Type: Boolean — Default value: `false`
readPOS	Read part-of-speech information. Type: Boolean — Default value: `true`
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 89. Capabilities
Media types	application/x.org.dkpro.tuebadz-chunk
Outputs	DocumentMetaData Sentence Token Chunk

TüPP-D/Z

Tuepp

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-tuepp-asl

TüPP-D/Z is a collection of articles from the German newspaper taz (die tageszeitung), annotated and encoded in an XML format.

Known corpora in this format

TüPP-D/Z - German

TueppReader

Implementation

org.dkpro.core.io.tuepp.TueppReader

Description

UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.

Only the part-of-speech with the best rank (rank 1) is read. If there is a tie between multiple tags, the first one from the XML file is read.
Only the first lemma (baseform) from the XML file is read.
Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
Article boundaries are not read.
Paragraph boundaries are not read.
Lemma information is read, but morphological information is not read.
Chunk, field, and clause information is not read.
Meta data headers are not read.
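
The tie-breaking rule for part-of-speech tags can be expressed as a short selection loop. This is a hedged sketch and not the TueppReader code: the names are hypothetical, and the candidates are assumed to be available as parallel arrays in the order in which they appear in the XML file.

```java
class BestRankSketch {
    // Returns the tag with the lowest rank. On a tie, the earliest candidate
    // wins because only a strictly better rank replaces the current best.
    static String bestTag(String[] tags, int[] ranks) {
        String best = null;
        int bestRank = Integer.MAX_VALUE;
        for (int i = 0; i < tags.length; i++) {
            if (ranks[i] < bestRank) {
                bestRank = ranks[i];
                best = tags[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two candidates share rank 1; the first one in document order is kept.
        System.out.println(bestTag(new String[] { "NN", "NE", "ADJA" }, new int[] { 1, 1, 2 }));
    }
}
```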

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
POSTagSet	Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers. Optional — Type: String
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mappingEnabled	Enable/disable type mapping. Type: Boolean — Default value: `true`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 90. Capabilities
Media types	application/x.org.dkpro.tuepp+xml
Outputs	POS DocumentMetaData Lemma Sentence Token

UIMA Binary CAS

BinaryCas

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-bincas-asl

The CAS is the native data model used by UIMA. There are various ways of saving CAS data, using XMI, XCAS, or binary formats. This module supports the binary formats.

See also

Compressed Binary CASes

BinaryCasReader

Implementation

org.dkpro.core.io.bincas.BinaryCasReader

Description

UIMA Binary CAS formats reader.

Parameters

addDocumentMetadata	Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: `true`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mergeTypeSystem	Determines whether the type system from a currently read file should be merged with the current type system. Type: Boolean — Default value: `false`
overrideDocumentMetadata	Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
typeSystemLocation	The location from which to obtain the type system when the CAS is stored in form 0. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
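The `logFreq` semantics can be illustrated with a small sketch. It assumes that "frequency" means "log every n-th document read", which the description suggests but does not state outright; `shouldLog` is a hypothetical helper, not part of the reader.

```java
final class LogFrequencySketch {
    /** Decides whether the document with the given 1-based running number is logged. */
    static boolean shouldLog(int documentsRead, int logFreq) {
        if (logFreq <= 0) {
            return false; // zero or negative values deactivate logging
        }
        return documentsRead % logFreq == 0; // assumption: every logFreq-th document
    }

    public static void main(String[] args) {
        System.out.println(shouldLog(3, 1));  // prints true (default logs every document)
        System.out.println(shouldLog(3, 10)); // prints false
        System.out.println(shouldLog(3, 0));  // prints false
    }
}
```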

Table 91. Capabilities
Media types	application/x.org.dkpro.uima+binary
Outputs	none specified

BinaryCasWriter

Implementation

org.dkpro.core.io.bincas.BinaryCasWriter

Description

Write CAS in one of the UIMA binary formats.

All the supported formats except 6+ can also be loaded and saved via the UIMA CasIOUtils.

Supported formats
Format	Description	Type system on load	CAS Addresses preserved
`SERIALIZED` or `S`	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS.	must be the same	yes
`SERIALIZED_TSI` or `S+`	CAS structures are dumped to disc as they are using Java serialization as in form 0, but now using the CASCompleteSerializer which includes CAS metadata like type system and index repositories.	is reinitialized	yes
`BINARY` or 0	CAS structures are dumped to disc as they are using Java serialization (CASSerializer). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS.	must be the same	yes
`BINARY_TSI` or 0	The same as `BINARY`, except that the type system and index configuration are also stored in the file. However, lenient loading or reinitializing the CAS with this information is presently not supported.	must be the same	yes
`COMPRESSED` or `4`	UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0.	must be the same	yes
`COMPRESSED_FILTERED` or `6`	UIMA binary serialization as format 4, but saving only reachable feature structures.	must be the same	no
6+	This is a legacy format specific to DKPro Core. Since UIMA 2.9.0, `COMPRESSED_FILTERED_TSI` is supported and should be used instead of this format. UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
`COMPRESSED_FILTERED_TS`	Same as `COMPRESSED_FILTERED`, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
`COMPRESSED_FILTERED_TSI`	Default. UIMA binary serialization as format 6, but also contains the type system definition and index definitions. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.	lenient loading	no
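When choosing a format programmatically, the last two columns of the table above can be captured in a small lookup. The sketch below only restates the table; the enum and method names are hypothetical and unrelated to the UIMA SerialFormat class, and `LEGACY_6_PLUS` stands for the format written as 6+.

```java
import java.util.ArrayList;
import java.util.List;

final class BinaryFormatTableSketch {
    /** How the type system is handled when a file is loaded back. */
    enum OnLoad { MUST_BE_SAME, REINITIALIZED, LENIENT }

    /** One constant per table row. */
    enum Format {
        SERIALIZED(OnLoad.MUST_BE_SAME, true),
        SERIALIZED_TSI(OnLoad.REINITIALIZED, true),
        BINARY(OnLoad.MUST_BE_SAME, true),
        BINARY_TSI(OnLoad.MUST_BE_SAME, true),
        COMPRESSED(OnLoad.MUST_BE_SAME, true),
        COMPRESSED_FILTERED(OnLoad.MUST_BE_SAME, false),
        LEGACY_6_PLUS(OnLoad.LENIENT, false),
        COMPRESSED_FILTERED_TS(OnLoad.LENIENT, false),
        COMPRESSED_FILTERED_TSI(OnLoad.LENIENT, false);

        final OnLoad onLoad;
        final boolean addressesPreserved;

        Format(OnLoad onLoad, boolean addressesPreserved) {
            this.onLoad = onLoad;
            this.addressesPreserved = addressesPreserved;
        }
    }

    /** Formats that can be read into a CAS initialized with a different type system. */
    static List<Format> lenientFormats() {
        List<Format> result = new ArrayList<>();
        for (Format f : Format.values()) {
            if (f.onLoad == OnLoad.LENIENT) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lenientFormats());
        // prints [LEGACY_6_PLUS, COMPRESSED_FILTERED_TS, COMPRESSED_FILTERED_TSI]
    }
}
```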

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	The file extension. If this is set to AUTO, then the extension will be chosen based on the default extension specified by the UIMA SerialFormat class. However, this only works when using the new long format names (e.g. `COMPRESSED_FILTERED_TSI`). When using the old short names (e.g. `6`), the default extension .bin is used. Type: String — Default value: `AUTO`
format	Binary format to produce. Type: String — Default value: `COMPRESSED_FILTERED_TSI`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemLocation	Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz"). If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader can currently not read such files. Use this only if you really know what you are doing. This parameter has no effect if formats S+ or 6+ are used as the type system information is embedded in each individual file. Otherwise, it is recommended that this parameter be set unless some other mechanism is used to initialize the CAS with the same type system and index repository during reading that was used during writing. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
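The extension-based compression of the type system file described for `typeSystemLocation` can be sketched with the standard library alone. This is a minimal illustration covering only the ".gz" case mentioned above; the class and method names are hypothetical and the writer's real implementation may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class TypeSystemCompressionSketch {
    /** Wraps the raw stream in gzip only when the target name ends in ".gz". */
    static OutputStream wrap(String fileName, OutputStream raw) throws IOException {
        return fileName.endsWith(".gz") ? new GZIPOutputStream(raw) : raw;
    }

    /** Returns the bytes that would end up in a file with the given name. */
    static byte[] encode(String fileName, byte[] payload) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = wrap(fileName, buffer)) {
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        System.out.println(encode("typesystem.ser", payload).length); // prints 3
        byte[] gz = encode("typesystem.ser.gz", payload);
        // A gzip stream starts with the magic bytes 0x1f 0x8b.
        System.out.println((gz[0] & 0xff) == 0x1f && (gz[1] & 0xff) == 0x8b); // prints true
    }
}
```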

Table 92. Capabilities
Media types	application/x.org.dkpro.uima+binary
Inputs	DocumentMetaData

SerializedCas

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-bincas-asl

SerializedCasReader

Implementation

org.dkpro.core.io.bincas.SerializedCasReader

Description

Reader for CASes that were written using Java serialization.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
typeSystemLocation	The file from which to obtain the type system if it is not embedded in the serialized CAS. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 93. Capabilities
Media types	none specified
Outputs	none specified

SerializedCasWriter

Implementation

org.dkpro.core.io.bincas.SerializedCasWriter

Description

Writer that stores CASes using Java serialization.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.ser`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemLocation	Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser. The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz"). If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader can currently not read such files. Use this only if you really know what you are doing. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
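What `escapeFilename` does to a problematic name can be approximated with the JDK's `URLEncoder`. Whether the writer uses exactly this class is an assumption; the `escape` helper is hypothetical and only shows the effect of URL-encoding on characters such as `\` and `:`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class FilenameEscapeSketch {
    /** URL-encodes a single file name (not a whole path) using UTF-8. */
    static String escape(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("doc:1\\a.txt")); // prints doc%3A1%5Ca.txt
    }
}
```

Note that `URLEncoder` also encodes `/`, so in this sketch the helper is meant to be applied to the file name only and not to a path with directories.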

Table 94. Capabilities
Media types	none specified
Inputs	DocumentMetaData

UIMA JSON

Json

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-json-asl

JsonWriter

Implementation

org.dkpro.core.io.json.JsonWriter

Description

UIMA JSON format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
jsonContextFormat	The level of detail to use for the context (i.e. type system) information. Type: String — Default value: `omitExpandedTypeNames`
omitDefaultValues	Whether to omit fields that have their default values from the JSON output. Type: Boolean — Default value: `true`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
prettyPrint	Whether to pretty-print the JSON output. Type: Boolean — Default value: `true`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemFile	Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file. If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 95. Capabilities
Media types	application/x.org.dkpro.uima+json
Inputs	DocumentMetaData

UIMA XMI

Xmi

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xmi-asl

One of the official formats supported by UIMA is the XMI format. It is an XML-based format and therefore cannot represent a few very specific characters that are invalid in XML, but it is otherwise able to capture all the information contained in the CAS. The XMI format is the de facto standard for exchanging data in the UIMA world, and most UIMA-related tools support it.

The XMI format does not include type system information. It is therefore recommended to always configure the XmiWriter component to also write out the type system to a file.

If you wish to view annotated documents using the UIMA CAS Editor in Eclipse, you can e.g. set up your XmiWriter in the following way to write out XMIs and a type system file:

AnalysisEngineDescription xmiWriter =
  AnalysisEngineFactory.createEngineDescription(
      XmiWriter.class,
      XmiWriter.PARAM_TARGET_LOCATION, ".",
      XmiWriter.PARAM_TYPE_SYSTEM_FILE, "typesystem.xml");

XmiReader

Implementation

org.dkpro.core.io.xmi.XmiReader

Description

Reader for UIMA XMI files.

Parameters

addDocumentMetadata	Add DKPro Core metadata if it is not already present in the document. Type: Boolean — Default value: `true`
includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
lenient	In lenient mode, unknown types are ignored and do not cause an exception to be thrown. Type: Boolean — Default value: `false`
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
mergeTypeSystem	Determines whether the type system from a currently read file should be merged with the current type system. Type: Boolean — Default value: `false`
overrideDocumentMetadata	Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file. Type: Boolean — Default value: `false`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
typeSystemFile	If a type system is specified, then the type system already in the CAS is replaced by this one. Except if XmiReader#PARAM_MERGE_TYPE_SYSTEM is enabled, in which case it will be merged with the type system already present in the CAS. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 96. Capabilities
Media types	application/vnd.xmi+xml application/x.org.dkpro.uima+xmi
Outputs	DocumentMetaData

XmiWriter

Implementation

org.dkpro.core.io.xmi.XmiWriter

Description

UIMA XMI format writer.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. Default value `.xmi`. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xmi`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
prettyPrint	Format and indent the XML. Type: Boolean — Default value: `true`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
typeSystemFile	Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file. If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
version	Defines the XML version used for serializing the data. The default is XML "1.0". However, XML 1.0 does not support certain Unicode characters. To support a wider range of characters, you can switch this parameter to "1.1". Type: String — Default value: `1.0`
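To decide whether the `version` parameter needs to be switched to "1.1", a document text can be checked against the character ranges of the two XML versions. The sketch below encodes the Char productions of the XML 1.0 and XML 1.1 specifications; the class and method names are hypothetical, and the check is not part of the XmiWriter.

```java
final class XmlCharSketch {
    /** Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Char production of XML 1.1 (additionally allows most control characters). */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the text contains a character that only XML 1.1 can represent. */
    static boolean needsXml11(String text) {
        return text.codePoints().anyMatch(cp -> !isXml10Char(cp) && isXml11Char(cp));
    }

    public static void main(String[] args) {
        System.out.println(needsXml11("plain text"));   // prints false
        System.out.println(needsXml11("form\u000Cfeed")); // prints true
    }
}
```

Characters such as U+0000 are valid in neither version, so switching to "1.1" does not help for them.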

Table 97. Capabilities
Media types	application/vnd.xmi+xml application/x.org.dkpro.uima+xmi
Inputs	DocumentMetaData

Web1T n-grams

Web1T

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-web1t-asl

The Web1T n-gram corpus is a huge collection of n-grams collected from the internet. The jweb1t library provides efficient access to this corpus. This module provides support for the file format used by the Web1T n-gram corpus and allows jweb1t indexes to be created conveniently.

See also

Web1TWriter

Implementation

org.dkpro.core.io.web1t.Web1TWriter

Description

Web1T n-gram index format writer.

Parameters

contextType	The type being used for segments. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence`
createIndexes	Create the indexes that jWeb1T needs to operate. (default: true) Optional — Type: Boolean — Default value: `true`
inputTypes	Types to generate n-grams from. Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams Type: String[]
lowercase	Create a lower case index. Optional — Type: Boolean — Default value: `false`
maxNgramLength	Maximum n-gram length. Optional — Type: Integer — Default value: `3`
minFreq	Specifies the minimum frequency a NGram must have to be written to the final index. The specified value is interpreted as inclusive value, the default is 1. Thus, all NGrams with a frequency of at least 1 or higher will be written. Optional — Type: Integer — Default value: `1`
minNgramLength	Minimum n-gram length. Optional — Type: Integer — Default value: `1`
splitFileTreshold	The input file(s) is/are split into smaller files for quick access. A separate file is created if the first two starting letters (or the starting letter if the word has a length of 1 character) account for at least x% of all starting letters in the input file(s). The default threshold for splitting a file is 1.0%. Every word whose starting characters do not meet the threshold is written, together with the other words that did not meet the threshold, into a separate file for miscellaneous words. A high threshold will lead to only a few, but large files and most likely a very large misc. file. A low threshold results in many small files. Use a zero or a negative value to write everything to one file. Optional — Type: Float — Default value: `1.0`
targetEncoding	Character encoding of the output data. Optional — Type: String — Default value: `UTF-8`
targetLocation	Location to which the output is written. Type: String
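The interplay of `minNgramLength`, `maxNgramLength`, `lowercase` and `minFreq` can be sketched in plain Java. This is a simplified illustration, not the writer's code: it assumes that n-grams are built within one segment (by default a sentence) and never across segments, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class NgramCountSketch {
    /** Counts all n-grams of length minN..maxN inside each segment. */
    static Map<String, Integer> count(List<List<String>> segments, int minN, int maxN,
            boolean lowercase) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> tokens : segments) {
            for (int n = minN; n <= maxN; n++) {
                for (int start = 0; start + n <= tokens.size(); start++) {
                    String ngram = String.join(" ", tokens.subList(start, start + n));
                    if (lowercase) {
                        ngram = ngram.toLowerCase(Locale.ROOT);
                    }
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps n-grams whose frequency is at least minFreq (inclusive threshold). */
    static Map<String, Integer> filter(Map<String, Integer> counts, int minFreq) {
        Map<String, Integer> kept = new TreeMap<>(counts);
        kept.values().removeIf(freq -> freq < minFreq);
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"));
        Map<String, Integer> counts = count(segments, 1, 2, false);
        System.out.println(filter(counts, 2)); // prints {the=2}
    }
}
```

With the default `minFreq` of 1, `filter` keeps every counted n-gram, which matches the inclusive interpretation described above.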

Table 98. Capabilities
Media types	text/x.org.dkpro.ngram
Inputs	Sentence

WebAnno TSV

WebannoTsv3X

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-webanno-asl

WebannoTsv3XReader

Implementation

org.dkpro.core.io.webanno.tsv.WebannoTsv3XReader

Description

Reads the WebAnno TSV v3.x format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceEncoding	Character encoding of the input data. Type: String — Default value: `UTF-8`
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 99. Capabilities
Media types	text/x.org.dkpro.webanno-tsv3
Outputs	none specified

WebannoTsv3XWriter

Implementation

org.dkpro.core.io.webanno.tsv.WebannoTsv3XWriter

Description

Writes the WebAnno TSV v3.x format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.tsv`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	The character encoding used for the output files. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 100. Capabilities
Media types	text/x.org.dkpro.webanno-tsv3
Inputs	DocumentMetaData Sentence Token

Wikipedia via Bliki Engine

BlikiWikipedia

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-bliki-asl

Access the online Wikipedia and extract its contents using the Bliki engine.

See also

Java Wikipedia API (Bliki engine)

BlikiWikipediaReader

Implementation

org.dkpro.core.io.bliki.BlikiWikipediaReader

Description

Bliki-based Wikipedia reader.

Parameters

language	The language of the wiki installation. Type: String
outputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
pageTitles	Which page titles should be retrieved. Type: String[]
sourceLocation	Wiki API URL, e.g. for the English Wikipedia it should be: http://en.wikipedia.org/w/api.php Type: String

Table 101. Capabilities
Media types	none specified
Outputs	DocumentMetaData

Wikipedia via JWPL

WikipediaArticle

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaArticleReader

Implementation

org.dkpro.core.io.jwpl.WikipediaArticleReader

Description

Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered, however.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String
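The file expected by `PageIdsFromFile` and `PageTitleFromFile` is a line-separated list. A minimal sketch of how such content could be turned into the array form accepted by `PageIdFromArray` is shown below; trimming whitespace and skipping blank lines are assumptions of this example, and the `parse` helper is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PageIdListSketch {
    /** Splits line-separated content into entries, trimming and skipping blank lines. */
    static List<String> parse(String fileContent) {
        List<String> entries = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("12\n345\r\n\n 6 \n")); // prints [12, 345, 6]
    }
}
```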

Table 102. Capabilities
Media types	none specified
Outputs	none specified

WikipediaArticleInfo

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaArticleInfoReader

Implementation

org.dkpro.core.io.jwpl.WikipediaArticleInfoReader

Description

Reads all general article information without retrieving the whole Page objects.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 103. Capabilities
Media types	none specified
Outputs	DocumentMetaData ArticleInfo

WikipediaDiscussion

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaDiscussionReader

Implementation

org.dkpro.core.io.jwpl.WikipediaDiscussionReader

Description

Reads all discussion pages.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 104. Capabilities
Media types	none specified
Outputs	DBConfig

WikipediaLink

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaLinkReader

Implementation

org.dkpro.core.io.jwpl.WikipediaLinkReader

Description

Read links from Wikipedia.

Parameters

AllowedLinkTypes	Which types of links are allowed? Type: String[]
CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String

Table 105. Capabilities
Media types	none specified
Outputs	DBConfig WikipediaLink

WikipediaPage

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaPageReader

Implementation

org.dkpro.core.io.jwpl.WikipediaPageReader

Description

Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are not included, however.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
User	The username of the database account. Type: String
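
The PageIdsFromFile and PageTitleFromFile parameters expect a file with one entry per line. The following is a minimal sketch of parsing such a file's content. The class and method names are hypothetical, and skipping blank lines and trimming whitespace are assumptions of this sketch, not confirmed reader behavior.

```java
import java.util.ArrayList;
import java.util.List;

final class PageIdFileSketch {
    /**
     * Splits the content of a line-separated id file into individual ids.
     * Trimming and skipping blank lines are assumptions of this sketch.
     */
    static List<String> parsePageIds(String fileContent) {
        List<String> ids = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String id = line.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // In practice the content could come from Files.readString(path).
        System.out.println(parsePageIds("12\n1024\n\n77\n")); // prints [12, 1024, 77]
    }
}
```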

Table 106. Capabilities
Media types	none specified
Outputs	DBConfig

WikipediaQuery

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaQueryReader

Implementation

org.dkpro.core.io.jwpl.WikipediaQueryReader

Description

Reads all article pages that match a query built from the parameters of this class.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
MaxCategories	Maximum number of categories. Articles with a higher number of categories will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxInlinks	Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxOutlinks	Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxRedirects	Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MaxTokens	Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinCategories	Minimum number of categories. Articles with a lower number of categories will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinInlinks	Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinOutlinks	Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinRedirects	Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query. Optional — Type: Integer — Default value: `-1`
MinTokens	Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query. Optional — Type: Integer — Default value: `-1`
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
PageIdFromArray	Defines an array of page ids of the pages that should be retrieved. Optional — Type: String[]
PageIdsFromFile	Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. Optional — Type: String
PageTitleFromFile	Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. Optional — Type: String
PageTitlesFromArray	Defines an array of page titles of the pages that should be retrieved. Optional — Type: String[]
Password	The password of the database account. Type: String
TitlePattern	SQL-style title pattern. Only articles that match the pattern will be returned by the query. Optional — Type: String — Default value: ``
User	The username of the database account. Type: String
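
The Min*/Max* parameters above describe inclusive bounds: articles with a higher value than the maximum or a lower value than the minimum are not returned. The sketch below illustrates that filtering logic in isolation. It assumes that the default value `-1` means "no bound", which the parameter table does not state explicitly, and all names are hypothetical.

```java
final class QueryBoundsSketch {
    /** Sentinel for "no bound". Treating -1 this way is an assumption of this sketch. */
    static final int UNSET = -1;

    /** Returns true if value lies within [min, max], ignoring bounds that are unset. */
    static boolean withinBounds(int value, int min, int max) {
        boolean aboveMin = (min == UNSET) || value >= min;
        boolean belowMax = (max == UNSET) || value <= max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // Hypothetical article statistics: 12 incoming links, 450 tokens.
        boolean inlinksOk = withinBounds(12, 5, UNSET);   // MinInlinks=5, MaxInlinks unset
        boolean tokensOk = withinBounds(450, UNSET, 300); // MinTokens unset, MaxTokens=300
        System.out.println(inlinksOk && tokensOk);        // prints false (too many tokens)
    }
}
```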

Table 107. Capabilities
Media types	none specified
Outputs	none specified

WikipediaRevision

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaRevisionReader

Implementation

org.dkpro.core.io.jwpl.WikipediaRevisionReader

Description

Reads Wikipedia page revisions.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
RevisionIdFromArray	Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[]
RevisionIdsFromFile	Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String
User	The username of the database account. Type: String

Table 108. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig WikipediaRevision

WikipediaRevisionPair

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaRevisionPairReader

Implementation

org.dkpro.core.io.jwpl.WikipediaRevisionPairReader

Description

Reads pairs of adjacent revisions of all articles.

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
Host	The host server. Type: String
Language	The language of the Wikipedia that should be connected to. Type: String
MaxChange	Restrict revision pairs to cases where the lengths of the revisions do not differ by more than this value (counted in characters). Type: Integer — Default value: `10000`
MinChange	Restrict revision pairs to cases where the lengths of the revisions differ by more than this value (counted in characters). Type: Integer — Default value: `0`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
RevisionIdFromArray	Defines an array of revision ids of the revisions that should be retrieved. Optional — Type: String[]
RevisionIdsFromFile	Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. Optional — Type: String
SkipFirstNPairs	The number of revision pairs that should be skipped in the beginning. Optional — Type: Integer
User	The username of the database account. Type: String
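
MinChange and MaxChange restrict revision pairs by the difference in length between the two revisions. The following sketch shows one reading of that rule. The strictness of each bound is taken from the parameter wording ("more than" MinChange, "not more than" MaxChange) and has not been checked against the reader's implementation. The class and method names are hypothetical.

```java
final class RevisionPairFilterSketch {
    /**
     * Accepts a pair if the character-length difference is greater than
     * minChange and at most maxChange.
     */
    static boolean accept(String before, String after, int minChange, int maxChange) {
        int diff = Math.abs(after.length() - before.length());
        return diff > minChange && diff <= maxChange;
    }

    public static void main(String[] args) {
        // Defaults from the parameter table: MinChange=0, MaxChange=10000.
        System.out.println(accept("abc", "abcdef", 0, 10000)); // prints true (difference 3)
        System.out.println(accept("abc", "xyz", 0, 10000));    // prints false (difference 0)
    }
}
```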

Table 109. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig

WikipediaTemplateFilteredArticle

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-jwpl-asl

WikipediaTemplateFilteredArticleReader

Implementation

org.dkpro.core.io.jwpl.WikipediaTemplateFilteredArticleReader

Description

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

It is possible to define only a whitelist or only a blacklist. If both a whitelist and a blacklist are provided, the reader selects the articles that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").

This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.

NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase

Parameters

CreateDBAnno	Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data. Type: Boolean — Default value: `false`
Database	The name of the database. Type: String
DoubleCheckAssociatedPages	If this option is set, discussion pages are rejected that are associated with a blacklisted article. Analogously, articles are rejected that are associated with a blacklisted discussion page. This check is rather expensive and could take a long time. This option is not active if only a whitelist is used. Type: Boolean — Default value: `false`
ExactTemplateMatching	Defines whether to match the templates exactly or whether to match all templates that start with the String given in the respective parameter list. Type: Boolean — Default value: `true`
Host	The host server. Type: String
IncludeDiscussions	Whether the reader should also include talk pages. Type: Boolean — Default value: `true`
Language	The language of the Wikipedia that should be connected to. Type: String
LimitNUmberOfArticlesToRead	Optional parameter that defines the maximum number of articles that should be delivered by the reader. This avoids unnecessary filtering if only a small number of articles is needed. Optional — Type: Integer
OnlyFirstParagraph	If set to true, only the first paragraph instead of the whole article is used. Type: Boolean — Default value: `false`
OutputPlainText	Whether the reader outputs plain text or wiki markup. Type: Boolean — Default value: `true`
PageBuffer	The page buffer size (#pages) of the page iterator. Type: Integer — Default value: `1000`
Password	The password of the database account. Type: String
TemplateBlacklist	Defines templates that the articles MUST NOT contain. If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[]
TemplateWhitelist	Defines templates that the articles MUST contain. If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist) Optional — Type: String[]
User	The username of the database account. Type: String
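
The interplay of TemplateWhitelist, TemplateBlacklist and ExactTemplateMatching can be sketched as a predicate over the set of templates found on a page. This is an illustration only and not the reader's implementation, which queries the template tables in the JWPL database. All names are hypothetical, and treating "contains" as "contains at least one listed template" is an assumption.

```java
import java.util.Set;

final class TemplateFilterSketch {
    /** True if any template on the page matches any entry of the list. */
    static boolean containsAny(Set<String> pageTemplates, Set<String> list, boolean exact) {
        for (String pageTemplate : pageTemplates) {
            for (String entry : list) {
                // Exact matching compares whole names; otherwise a prefix match suffices.
                if (exact ? pageTemplate.equals(entry) : pageTemplate.startsWith(entry)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** An empty list means that the corresponding filter is not used. */
    static boolean accept(Set<String> pageTemplates, Set<String> whitelist,
            Set<String> blacklist, boolean exact) {
        boolean whiteOk = whitelist.isEmpty() || containsAny(pageTemplates, whitelist, exact);
        boolean blackOk = blacklist.isEmpty() || !containsAny(pageTemplates, blacklist, exact);
        return whiteOk && blackOk;
    }

    public static void main(String[] args) {
        Set<String> white = Set.of("Infobox person");
        Set<String> black = Set.of("Stub");
        System.out.println(accept(Set.of("Infobox person", "Cleanup"), white, black, true)); // prints true
        System.out.println(accept(Set.of("Infobox person", "Stub"), white, black, true));    // prints false
    }
}
```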

Table 110. Capabilities
Media types	none specified
Outputs	DocumentMetaData DBConfig

XCES-XML

XcesBasicXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xces-asl

XcesBasicXmlReader

Implementation

org.dkpro.core.io.xces.XcesBasicXmlReader

Description

Reader for the basic XCES XML format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (INCLUDE_PREFIX) if it is an include pattern and with `[-]` (EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
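
The `patterns` parameter combines include patterns (prefix `[+]`) and exclude patterns (prefix `[-]`). The sketch below approximates how such a pattern list selects paths, using the common Ant convention that `**/` spans any number of directories and `*` matches part of a single name. It is a simplified stand-in written for illustration, not the matcher that the readers use. The class and method names are hypothetical, and rejecting patterns without a prefix is a choice of this sketch.

```java
import java.util.regex.Pattern;

final class AntPatternSketch {
    /** Translates a simplified Ant-style pattern into a regular expression. */
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            if (glob.startsWith("**/", i)) {
                sb.append("(?:.*/)?"); // any number of directories, including none
                i += 3;
            } else if (glob.startsWith("**", i)) {
                sb.append(".*");       // anything, across directory boundaries
                i += 2;
            } else if (glob.charAt(i) == '*') {
                sb.append("[^/]*");    // part of a single file or directory name
                i++;
            } else {
                sb.append(Pattern.quote(String.valueOf(glob.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** A path is selected if some include pattern matches and no exclude pattern does. */
    static boolean selected(String path, String... patterns) {
        boolean included = false;
        boolean excluded = false;
        for (String p : patterns) {
            if (p.startsWith("[+]")) {
                included |= toRegex(p.substring(3)).matcher(path).matches();
            } else if (p.startsWith("[-]")) {
                excluded |= toRegex(p.substring(3)).matcher(path).matches();
            } else {
                throw new IllegalArgumentException("Pattern without [+] or [-] prefix: " + p);
            }
        }
        return included && !excluded;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]**/*.xml", "[-]**/draft/*.xml" };
        System.out.println(selected("corpus/a.xml", patterns));       // prints true
        System.out.println(selected("corpus/draft/b.xml", patterns)); // prints false
    }
}
```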

Table 111. Capabilities
Media types	application/x.org.dkpro.xces-basic+xml
Outputs	Paragraph

XcesBasicXmlWriter

Implementation

org.dkpro.core.io.xces.XcesBasicXmlWriter

Description

Writer for the basic XCES XML format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
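
The effect of `escapeFilename` can be illustrated with the standard `java.net.URLEncoder`. This is a sketch of URL-encoding a file name in general. Whether the writer uses exactly this encoder is an assumption, and the class name is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EscapeFilenameSketch {
    /** URL-encodes a file name so that characters such as ':' and '\' become %XX escapes. */
    static String escape(String fileName) {
        return URLEncoder.encode(fileName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(escape("C:\\data\\doc.txt")); // prints C%3A%5Cdata%5Cdoc.txt
    }
}
```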

Table 112. Capabilities
Media types	application/x.org.dkpro.xces-basic+xml
Inputs	Paragraph

XcesXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xces-asl

XcesXmlReader

Implementation

org.dkpro.core.io.xces.XcesXmlReader

Description

Reader for the XCES XML format.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (INCLUDE_PREFIX) if it is an include pattern and with `[-]` (EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 113. Capabilities
Media types	application/x.org.dkpro.xces+xml
Outputs	Lemma Paragraph Sentence Token

XcesXmlWriter

Implementation

org.dkpro.core.io.xces.XcesXmlWriter

Description

Writer for the XCES XML format.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Use this filename extension. Type: String — Default value: `.xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetEncoding	Character encoding of the output data. Type: String — Default value: `UTF-8`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 114. Capabilities
Media types	application/x.org.dkpro.xces+xml
Inputs	POS Lemma Paragraph Sentence Token

XML

InlineXml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xml-asl

InlineXmlWriter

Implementation

org.dkpro.core.io.xml.InlineXmlWriter

Description

Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.

Note that this component inherits the restrictions of CasToInlineXml:

Features whose values are FeatureStructures are not represented.
Feature values which are strings longer than 64 characters are truncated.
Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ]
The Subject of analysis is presumed to be a text string.
Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters aren't valid in xml documents.
It doesn't work for overlapping annotations, because these cannot be represented as properly nested XML.
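
The last restriction can be checked before writing: a set of annotation spans can only be rendered as inline XML if no two spans cross each other. The following sketch tests this for spans given as `{begin, end}` offset pairs. It is an independent illustration with hypothetical names and is not part of CasToInlineXml or the writer.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class NestingCheckSketch {
    /** Returns true if the spans can be written as properly nested elements. */
    static boolean properlyNested(int[][] spans) {
        int[][] sorted = spans.clone();
        // Outer spans first: ascending begin, then descending end.
        Arrays.sort(sorted, (a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));
        Deque<Integer> openEnds = new ArrayDeque<>();
        for (int[] span : sorted) {
            // Close all spans that end at or before the start of this one.
            while (!openEnds.isEmpty() && openEnds.peek() <= span[0]) {
                openEnds.pop();
            }
            // A span reaching beyond its enclosing span crosses it.
            if (!openEnds.isEmpty() && span[1] > openEnds.peek()) {
                return false;
            }
            openEnds.push(span[1]);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(properlyNested(new int[][] { { 0, 10 }, { 2, 5 }, { 5, 8 } })); // prints true
        System.out.println(properlyNested(new int[][] { { 0, 5 }, { 3, 8 } }));            // prints false
    }
}
```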

Parameters

Xslt	XSLT stylesheet to apply. Optional — Type: String
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`

Table 115. Capabilities
Media types	application/xml text/xml
Inputs	DocumentMetaData

Xml

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xml-asl

XmlReader

Implementation

org.dkpro.core.io.xml.XmlReader

Description

Very basic reader to load texts from an XML file.

The XML file is expected to contain one or more elements under its root node, and each of these is treated as a separate document. Each of these child elements may contain further children containing text which may or may not be included in the CAS document text, depending on #PARAM_INCLUDE_TAG and #PARAM_EXCLUDE_TAG.

If you are looking for a more generic XML reader which imports the structure of an XML file into a CAS, please look at XmlDocumentReader.

Parameters

DocIdTag	Tag which contains the docId. Optional — Type: String
ExcludeTag	Optional. Tags that should not be processed. No text is extracted from them and no annotations are produced for them. Type: String[] — Default value: `[]`
IncludeTag	Optional. Tags that should be processed (if empty, all tags except those listed in ExcludeTag are processed). Type: String[] — Default value: `[]`
collectionId	The collection ID to set in the DocumentMetaData. Optional — Type: String
language	Set this as the language of the produced documents. Optional — Type: String
sourceLocation	Location from which the input is read. Type: String
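
The combination of IncludeTag and ExcludeTag can be read as a simple predicate on tag names. The sketch below shows that reading. Giving ExcludeTag precedence when a tag appears in both lists is an assumption, and the class and method names are hypothetical.

```java
import java.util.Set;

final class TagFilterSketch {
    /**
     * A tag is processed if it is not excluded and either no include list is
     * given or the include list contains it.
     */
    static boolean isProcessed(String tag, Set<String> includeTags, Set<String> excludeTags) {
        if (excludeTags.contains(tag)) {
            return false;
        }
        return includeTags.isEmpty() || includeTags.contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(isProcessed("title", Set.of(), Set.of("meta")));  // prints true
        System.out.println(isProcessed("body", Set.of("title"), Set.of()));  // prints false
    }
}
```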

Table 116. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData Field

XmlDocument

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xml-asl

XmlDocumentReader

Implementation

org.dkpro.core.io.xml.XmlDocumentReader

Description

Simple XML reader which loads all text from the XML file into the CAS document text and generates XML annotations for all XML elements, attributes and text nodes.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (INCLUDE_PREFIX) if it is an include pattern and with `[-]` (EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 117. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData XmlAttribute XmlDocument XmlElement XmlNode XmlTextNode

XmlDocumentWriter

Implementation

org.dkpro.core.io.xml.XmlDocumentWriter

Description

Simple XML writer which takes the XML annotations for elements, attributes and text nodes and renders them into an XML file.

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
escapeFilename	URL-encode the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `false`
filenameExtension	Specify the suffix of output files. If the suffix is not needed, provide an empty string as value. Type: String — Default value: `.xml`
indent	Indent output. Type: Boolean — Default value: `false`
omitXmlDeclaration	Whether to omit the XML preamble. Type: Boolean — Default value: `true`
outputMethod	Output method. Type: String — Default value: `xml`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
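
The parameters `indent`, `omitXmlDeclaration` and `outputMethod` have the same names as the standard JAXP output properties. The sketch below shows what those properties do when serializing XML with a plain identity `Transformer`. That the writer passes its parameters to these properties is an assumption, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XmlOutputOptionsSketch {
    /** Re-serializes the given XML with the chosen output properties. */
    static String render(String xml, boolean indent, boolean omitXmlDeclaration,
            String outputMethod) {
        try {
            // A transformer without a stylesheet copies its input unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
                    omitXmlDeclaration ? "yes" : "no");
            t.setOutputProperty(OutputKeys.METHOD, outputMethod);
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults from the parameter table: indent=false, omitXmlDeclaration=true, outputMethod=xml.
        System.out.println(render("<a><b>text</b></a>", false, true, "xml"));
    }
}
```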

Table 118. Capabilities
Media types	application/xml text/xml
Inputs	DocumentMetaData XmlAttribute XmlDocument XmlElement XmlNode XmlTextNode

XmlText

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xml-asl

XmlTextReader

Implementation

org.dkpro.core.io.xml.XmlTextReader

Description

No description is available for this reader.

Parameters

includeHidden	Include hidden files and directories. Type: Boolean — Default value: `false`
language	The language of the documents in the input directory. If specified, this information will be added to the CAS. Optional — Type: String
logFreq	The frequency with which read documents are logged. Set to 0 or negative values to deactivate logging. Type: Integer — Default value: `1`
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (INCLUDE_PREFIX) if it is an include pattern and with `[-]` (EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Optional — Type: String[]
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`

Table 119. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData

XmlXPath

Group ID	org.dkpro.core
Artifact ID	dkpro-core-io-xml-asl

XmlXPathReader

Implementation

org.dkpro.core.io.xml.XmlXPathReader

Description

A component reader for XML files implemented with XPath.

This is currently optimized for the TREC format, i.e. the style in which topics are presented. Provide the XPath expression of the parent node as a parameter; the child nodes of each parent node are then stored separately, each parent in its own CAS.

If your expression evaluates to leaf nodes, empty CASes will be created.
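
The role of the root XPath expression can be illustrated with the standard `javax.xml.xpath` API. In this sketch, each node selected by the expression yields one segment consisting of its child elements, and an expression that selects leaf nodes yields empty segments. This mirrors the description above but is not the reader's code. The class name, method name and sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RootXPathSketch {
    /** Returns one list of "tag=text" entries per node selected by rootXPath. */
    static List<List<String>> segments(String xml, String rootXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList parents = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(rootXPath, doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < parents.getLength(); i++) {
                List<String> fields = new ArrayList<>();
                NodeList children = parents.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Only element children count; text children of leaf nodes are skipped.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.add(child.getNodeName() + "=" + child.getTextContent().trim());
                    }
                }
                result.add(fields);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<topics><top><num>1</num><title>A</title></top>"
                + "<top><num>2</num><title>B</title></top></topics>";
        System.out.println(segments(xml, "/topics/top"));     // prints [[num=1, title=A], [num=2, title=B]]
        System.out.println(segments(xml, "/topics/top/num")); // prints [[], []]
    }
}
```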

Parameters

caseSensitive	Whether matching is case-sensitive. Optional — Type: Boolean — Default value: `true`
docIdTag	Tag which contains the docId. If it is given, it will be ensured that within the same document there is only one id tag and that it is not empty. Optional — Type: String
excludeTags	Tags which should be ignored. If empty then all tags will be processed. If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: `[]`
includeTags	Tags which should be worked on. If empty then all tags will be processed. If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed. Type: String[] — Default value: `[]`
language	Language of the documents. If given, it will be set in each CAS. Optional — Type: String
patterns	A set of Ant-like include/exclude patterns. A pattern starts with `[+]` (INCLUDE_PREFIX) if it is an include pattern and with `[-]` (EXCLUDE_PREFIX) if it is an exclude pattern. The wildcard `/**/` can be used to address any number of sub-directories. The wildcard `*` can be used to address a part of a name. Type: String[]
rootXPath	Specifies the XPath expression that selects all nodes to be processed. Different segments will be separated via PARAM_ID_TAG, and each segment will be stored in a separate CAS. Type: String
sourceLocation	Location from which the input is read. Optional — Type: String
useDefaultExcludes	Use the default excludes. Type: Boolean — Default value: `true`
workingDir	Specifies substitutions for tag names in the CAS. Give the substitutions in before-after order. For example, to substitute "foo" with "bar" and "hey" with "ho", provide { "foo", "bar", "hey", "ho" }. Optional — Type: String[]
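
The flat before-after array used for tag substitutions can be turned into a lookup table as sketched below. The class and method names are hypothetical, and rejecting arrays of odd length is a choice of this sketch rather than documented reader behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TagSubstitutionSketch {
    /** Converts { before1, after1, before2, after2, ... } into a map. */
    static Map<String, String> toMap(String... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Substitutions must be given in pairs");
        }
        Map<String, String> substitutions = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            substitutions.put(pairs[i], pairs[i + 1]);
        }
        return substitutions;
    }

    /** Returns the substituted tag name, or the original name if no substitution applies. */
    static String substitute(String tag, Map<String, String> substitutions) {
        return substitutions.getOrDefault(tag, tag);
    }

    public static void main(String[] args) {
        Map<String, String> substitutions = toMap("foo", "bar", "hey", "ho");
        System.out.println(substitute("foo", substitutions));   // prints bar
        System.out.println(substitute("other", substitutions)); // prints other
    }
}
```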

Table 120. Capabilities
Media types	application/xml text/xml
Outputs	DocumentMetaData Field