This document provides detailed information about the DKPro Core input and output formats.

Overview

Table 1. Formats (62)
Format                            Reader                                  Writer
AclAnthology                      AclAnthologyReader                      none
Ancora                            AncoraReader                            none
BinaryCas                         BinaryCasReader                         BinaryCasWriter
BlikiWikipedia                    BlikiWikipediaReader                    none
Bnc                               BncReader                               none
Brat                              BratReader                              BratWriter
Combination                       CombinationReader                       none
Conll2000                         Conll2000Reader                         Conll2000Writer
Conll2002                         Conll2002Reader                         Conll2002Writer
Conll2003                         Conll2003Reader                         Conll2003Writer
Conll2006                         Conll2006Reader                         Conll2006Writer
Conll2008                         Conll2008Reader                         Conll2008Writer
Conll2009                         Conll2009Reader                         Conll2009Writer
Conll2012                         Conll2012Reader                         Conll2012Writer
ConllU                            ConllUReader                            ConllUWriter
DiTop                             none                                    DiTopWriter
Html                              HtmlReader                              none
ImsCwb                            ImsCwbReader                            ImsCwbWriter
InlineXml                         none                                    InlineXmlWriter
Jdbc                              JdbcReader                              none
Json                              none                                    JsonWriter
Lif                               LifReader                               LifWriter
Lxf                               LxfReader                               LxfWriter
MalletLdaTopicProportions         none                                    MalletLdaTopicProportionsWriter
MalletLdaTopicsProportionsSorted  none                                    MalletLdaTopicsProportionsSortedWriter
NYTCollection                     NYTCollectionReader                     none
NegraExport                       NegraExportReader                       none
Nif                               NifReader                               NifWriter
Pdf                               PdfReader                               none
PennTreebankChunked               PennTreebankChunkedReader               none
PennTreebankCombined              PennTreebankCombinedReader              PennTreebankCombinedWriter
RTF                               RTFReader                               none
Reuters21578Sgml                  Reuters21578SgmlReader                  none
Reuters21578Txt                   Reuters21578TxtReader                   none
SerializedCas                     SerializedCasReader                     SerializedCasWriter
Solr                              none                                    SolrWriter
String                            StringReader                            none
TGrep                             none                                    TGrepWriter
Tcf                               TcfReader                               TcfWriter
Tei                               TeiReader                               TeiWriter
Text                              TextReader                              TextWriter
TigerXml                          TigerXmlReader                          TigerXmlWriter
Tika                              TikaReader                              none
TokenizedText                     none                                    TokenizedTextWriter
TuebaDZ                           TuebaDZReader                           none
Tuepp                             TueppReader                             none
Web1T                             none                                    Web1TWriter
WikipediaArticle                  WikipediaArticleReader                  none
WikipediaArticleInfo              WikipediaArticleInfoReader              none
WikipediaDiscussion               WikipediaDiscussionReader               none
WikipediaLink                     WikipediaLinkReader                     none
WikipediaPage                     WikipediaPageReader                     none
WikipediaQuery                    WikipediaQueryReader                    none
WikipediaRevision                 WikipediaRevisionReader                 none
WikipediaRevisionPair             WikipediaRevisionPairReader             none
WikipediaTemplateFilteredArticle  WikipediaTemplateFilteredArticleReader  none
XcesBasicXml                      XcesBasicXmlReader                      XcesBasicXmlWriter
XcesXml                           XcesXmlReader                           XcesXmlWriter
Xmi                               XmiReader                               XmiWriter
Xml                               XmlReader                               none
XmlText                           XmlTextReader                           none
XmlXPath                          XmlXPathReader                          none

I/O components

ACL Anthology

AclAnthology

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.aclanthology-asl

Known corpora in this format

AclAnthologyReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.aclanthology.AclAnthologyReader

Description

Reads the ACL anthology corpus and outputs CASes with plain text documents.

The reader tries to strip out hyphenation and replace problematic characters to produce a cleaned text. Otherwise, it is a plain text reader.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files. If not specified, the default system encoding will be used.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true
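
As an illustration of the patterns parameter described above, the following minimal Java sketch sorts a list of patterns into includes and excludes by their [+] and [-] prefixes. The class name PatternSplitter is hypothetical and not part of DKPro Core, and treating an unprefixed pattern as an include is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: sorts reader patterns into includes and excludes. */
final class PatternSplitter {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    final List<String> includes = new ArrayList<>();
    final List<String> excludes = new ArrayList<>();

    PatternSplitter(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.startsWith(EXCLUDE_PREFIX)) {
                excludes.add(pattern.substring(EXCLUDE_PREFIX.length()));
            } else if (pattern.startsWith(INCLUDE_PREFIX)) {
                includes.add(pattern.substring(INCLUDE_PREFIX.length()));
            } else {
                // Assumption: a pattern without a prefix counts as an include.
                includes.add(pattern);
            }
        }
    }

    public static void main(String[] args) {
        PatternSplitter split = new PatternSplitter("[+]**/*.txt", "[-]**/draft/**");
        System.out.println("includes: " + split.includes);
        System.out.println("excludes: " + split.excludes);
    }
}
```

The sketch only classifies the patterns; matching them against file paths is left to the reader implementation.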

Table 2. Capabilities

Media types

text/plain

Outputs

AnCora

Ancora

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.ancora-asl

AncoraReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.ancora.AncoraReader

Description

Read AnCora XML format.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

dropSentencesMissingPosTags

Type: Boolean  — Default value: false

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readLemma

Write lemma annotations to the CAS.

Type: Boolean  — Default value: true

readPOS

Write part-of-speech annotations to the CAS.

Type: Boolean  — Default value: true

readSentence

Write sentence annotations to the CAS.

Type: Boolean  — Default value: true

readToken

Write token annotations to the CAS.

Type: Boolean  — Default value: true

sourceLocation

Location from which the input is read.

Optional — Type: String

splitMultiWordTokens

Type: Boolean  — Default value: true

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 3. Capabilities

Media types

application/x.org.dkpro.ancora+xml application/xml

Outputs

brat file format

Brat

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.brat-asl

BratReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.brat.BratReader

Description

Reader for the brat format.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

relationTypes

Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}. Additionally, a subcategorization feature may be specified.

Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent{A}]

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

textAnnotationTypes

Types that are text annotations. It is mandatory to provide the type name which can optionally be followed by a subcategorization feature. Using this parameter is only necessary to specify a subcategorization feature. Otherwise, text annotation types are automatically detected.

Type: String[]  — Default value: []

typeMappings

Optional — Type: String[]  — Default value: []

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 4. Capabilities

Media types

none specified

Outputs

none specified

BratWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.brat.BratWriter

Description

Writer for the brat annotation format.

Known issues:

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

enableTypeMappings

Enable type mappings.

Type: Boolean  — Default value: false

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

excludeTypes

Types that will not be written to the exported file.

Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence]

filenameExtension

Specify the suffix of output files. Default value .ann. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .ann

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

palette

Colors to be used for the visual configuration that is generated for brat.

Optional — Type: String[]  — Default value: [#8dd3c7, #ffffb3, #bebada, #fb8072, #80b1d3, #fdb462, #b3de69, #fccde5, #d9d9d9, #bc80bd, #ccebc5, #ffed6f]

relationTypes

Types that are relations. It is mandatory to provide the type name followed by two feature names that represent Arg1 and Arg2 separated by colons, e.g. de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent.

Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent]

shortAttributeNames

Whether to render attributes by their short name or by their qualified name.

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

spanTypes

Types that are text annotations (aka entities or spans).

Type: String[]  — Default value: []

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

textFilenameExtension

Specify the suffix of text output files. Default value .txt. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .txt

typeMappings

FIXME

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.(\\w+) → $1, de.tudarmstadt.ukp.dkpro.core.api.ner.type.(\\w+) → $1]

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

writeNullAttributes

Enable writing of features with null values.

Type: Boolean  — Default value: false

writeRelationAttributes

The brat web application cannot currently handle attributes on relations, so writing them is disabled by default. Set this parameter to true to enable it again.

Type: Boolean  — Default value: false
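
The relationTypes entries of the writer follow the Type:Arg1:Arg2 layout described above. The following minimal Java sketch shows how such an entry could be split into its three parts. The class RelationSpec is hypothetical and not part of DKPro Core; it only illustrates the layout of the parameter value.

```java
/** Hypothetical holder for one relationTypes entry of the form Type:Arg1:Arg2. */
final class RelationSpec {
    final String typeName;
    final String arg1Feature;
    final String arg2Feature;

    private RelationSpec(String typeName, String arg1Feature, String arg2Feature) {
        this.typeName = typeName;
        this.arg1Feature = arg1Feature;
        this.arg2Feature = arg2Feature;
    }

    /** Splits an entry on colons and checks that exactly three parts are present. */
    static RelationSpec parse(String entry) {
        String[] parts = entry.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected Type:Arg1:Arg2 but got: " + entry);
        }
        return new RelationSpec(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        RelationSpec spec = parse(
                "de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency:Governor:Dependent");
        System.out.println(spec.typeName + " links " + spec.arg1Feature + " to " + spec.arg2Feature);
    }
}
```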

Table 5. Capabilities

Media types

none specified

Inputs

none specified

British National Corpus

Bnc

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.bnc-asl

Known corpora in this format

BncReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bnc.BncReader

Description

Reader for the British National Corpus (XML version).

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 6. Capabilities

Media types

application/xml

Outputs

Combination

Combination

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.combination-asl

CombinationReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.combination.CombinationReader

Description

Combines multiple readers into a single reader.

Parameters
readers

Type: String[]

Table 7. Capabilities

Media types

none specified

Outputs

none specified

CoNLL

Conll2000

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2000 format represents POS and Chunk tags. Fields in a line are separated by spaces. Sentences are separated by a blank new line.

Table 8. Columns
Column  Type   Description
FORM    Token  token
POSTAG  POS    part-of-speech tag
CHUNK   Chunk  chunk (IOB2 encoded: every chunk starts with a B- tag, as in the example below)

Example
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O
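
To make the layout above concrete, the following minimal Java sketch reads such text into sentences of three-column rows. The class Conll2000Sketch and its members are hypothetical names for illustration; this is not the implementation used by Conll2000Reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the three-column layout shown above. */
final class Conll2000Sketch {
    /** One token line with its FORM, POSTAG and CHUNK columns. */
    static final class Row {
        final String form;
        final String posTag;
        final String chunkTag;

        Row(String form, String posTag, String chunkTag) {
            this.form = form;
            this.posTag = posTag;
            this.chunkTag = chunkTag;
        }
    }

    /** Splits the text into sentences at blank lines and into fields at whitespace. */
    static List<List<Row>> parse(String text) {
        List<List<Row>> sentences = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current sentence, if there is one.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            current.add(new Row(fields[0], fields[1], fields[2]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<List<Row>> sentences = parse("He PRP B-NP\nreckons VBZ B-VP\n\n. . O\n");
        System.out.println(sentences.size() + " sentences");
    }
}
```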
Table 9. Known corpora in this format
Corpus                             Language
CoNLL 2000 Chunking Corpus         English
CoNLL 2000 Chunking Corpus (NLTK)  English

Conll2000Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Reader

Description

Reads the CoNLL 2000 chunking format.

Parameters
ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ChunkTagSet

Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readChunk

Write chunk information. Default: true

Type: Boolean  — Default value: true

readPOS

Write part-of-speech information. Default: true

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 10. Capabilities

Media types

text/x.org.dkpro.conll-2000

Outputs

Conll2000Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2000Writer

Description

Writes the CoNLL 2000 chunking format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

writeChunk

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true
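
The effect of escapeDocumentId can be approximated with the standard java.net.URLEncoder, as in the following minimal sketch. That the writer produces exactly this encoding is an assumption; the class DocumentIdEscaper is hypothetical and not part of DKPro Core.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that URL-encodes a document ID for use as a file name. */
final class DocumentIdEscaper {
    static String escape(String documentId) {
        try {
            return URLEncoder.encode(documentId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Backslash and colon are replaced by percent escapes.
        System.out.println(escape("corpus\\part:1"));
    }
}
```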

Table 11. Capabilities

Media types

text/x.org.dkpro.conll-2000

Inputs

Conll2002

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2002 format encodes named entity spans. Fields are separated by a single space. Sentences are separated by a blank new line.

Table 12. Columns
Column  Type/Feature  Description
FORM    Token         Word form or punctuation symbol.
NER     NamedEntity   named entity (IOB2 encoded)

Example
Wolff      B-PER
,          O
currently  O
a          O
journalist O
in         O
Argentina  B-LOC
,          O
played     O
with       O
Del        B-PER
Bosque     I-PER
in         O
the        O
final      O
years      O
of         O
the        O
seventies  O
in         O
Real       B-ORG
Madrid     I-ORG
.          O
For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
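
The IOB2 tags in the NER column can be turned into entity spans as in the following minimal Java sketch. The class Iob2Decoder is a hypothetical illustration and not the decoder used by Conll2002Reader; it assumes that every tag is either O or has a two-character prefix such as B- or I-.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder that collects IOB2-tagged tokens into labelled spans. */
final class Iob2Decoder {
    /** Returns one "LABEL:covered text" entry per span, in document order. */
    static List<String> decode(String[] forms, String[] tags) {
        List<String> spans = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String label = null; // label of the open span, or null if none is open
        for (int i = 0; i < forms.length; i++) {
            String tag = tags[i];
            boolean continues = label != null && tag.equals("I-" + label);
            if (continues) {
                text.append(' ').append(forms[i]);
                continue;
            }
            // Any other tag closes the open span.
            if (label != null) {
                spans.add(label + ":" + text);
            }
            text.setLength(0);
            // Assumption: an I- tag that does not continue a span opens a new one.
            label = tag.equals("O") ? null : tag.substring(2);
            if (label != null) {
                text.append(forms[i]);
            }
        }
        if (label != null) {
            spans.add(label + ":" + text);
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] forms = {"Wolff", ",", "Del", "Bosque", "in", "Real", "Madrid"};
        String[] tags = {"B-PER", "O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG"};
        System.out.println(decode(forms, tags));
    }
}
```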
Table 13. Known corpora in this format
Corpus                                      Language
AQMAR Arabic Wikipedia Named Entity Corpus  Arabic
CoNLL 2002 dataset                          Spanish
CoNLL 2002 dataset                          Dutch

Conll2002Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Reader

Description

By default, reads the CoNLL 2002 named entity format.

The reader is also compatible with the CoNLL-based GermEval 2014 named entity format, in which the columns are separated by a tab, the first column holds the token number, and an extra column holds embedded named entities (see below). For this purpose, additional parameters control the column separator, whether there is a leading token-number column, and whether there is an extra column for embedded named entities. (Note: Currently, the reader only reads the outer named entities, not the embedded ones.)

The following snippet shows an example of the TSV format:
# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1  Aufgrund          O           O
2  seiner            O           O
3  Initiative        O           O
4  fand              O           O
5  2001/2002         O           O
6  in                O           O
7  Stuttgart         B-LOC       O
8  ,                 O           O
9  Braunschweig      B-LOC       O
10 und               O           O
11 Bonn              B-LOC       O
12 eine              O           O
13 große             O           O
14 und               O           O
15 publizistisch     O           O
16 vielbeachtete     O           O
17 Troia-Ausstellung B-LOCpart   O
18 statt             O           O
19 ,                 O           O
20 „                 O           O
21 Troia             B-OTH       B-LOC
22 -                 I-OTH       O
23 Traum             I-OTH       O
24 und               I-OTH       O
25 Wirklichkeit      I-OTH       O
26 “                 O           O
27 .                 O           O
  1. WORD_NUMBER - token number
  2. FORM - token
  3. NER1 - outer named entity (BIO encoded)
  4. NER2 - embedded named entity (BIO encoded)
The sentence is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token. Name spans are encoded in the BIO-scheme. Outer spans are encoded in the third column, embedded spans in the fourth column.
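
The four-column layout described above can be read line by line as in the following minimal Java sketch. The class GermEvalLine is a hypothetical illustration, not part of Conll2002Reader, and it assumes that real files separate the columns with a single tab.

```java
/** Hypothetical holder for one token line of the GermEval 2014 TSV layout. */
final class GermEvalLine {
    final int tokenNumber;
    final String form;
    final String outerTag;
    final String embeddedTag;

    private GermEvalLine(int tokenNumber, String form, String outerTag, String embeddedTag) {
        this.tokenNumber = tokenNumber;
        this.form = form;
        this.outerTag = outerTag;
        this.embeddedTag = embeddedTag;
    }

    /** A line whose first column is # carries the source and retrieval date. */
    static boolean isHeader(String line) {
        return line.startsWith("#");
    }

    /** Splits a token line into WORD_NUMBER, FORM, NER1 and NER2. */
    static GermEvalLine parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 tab-separated columns: " + line);
        }
        return new GermEvalLine(Integer.parseInt(fields[0]), fields[1], fields[2], fields[3]);
    }

    public static void main(String[] args) {
        String[] lines = {
            "#\thttp://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]",
            "21\tTroia\tB-OTH\tB-LOC"
        };
        for (String line : lines) {
            if (isHeader(line)) {
                continue; // skip the source line
            }
            GermEvalLine token = parse(line);
            System.out.println(token.tokenNumber + " " + token.form + " " + token.outerTag);
        }
    }
}
```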
Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

columnSeparator

Column separator. Acceptable values come from ColumnSeparators.
Example: to use a tab as the column separator, set this parameter to Conll2002Reader.ColumnSeparators.TAB.getName().

Optional — Type: String  — Default value: space

hasEmbeddedNamedEntity

Has embedded named entity extra column. Default: false

Optional — Type: Boolean  — Default value: false

hasHeader

Indicates that there is a header line before the sentence.

Optional — Type: Boolean  — Default value: false

hasTokenNumber

Token number flag. When true, the first column contains the token number inside the sentence (as in the GermEval 2014 format).

Optional — Type: Boolean  — Default value: false

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readNamedEntity

Read named entity information. Default: true

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 14. Capabilities

Media types

text/x.org.dkpro.conll-2002 text/x.org.dkpro.germeval-2014

Outputs

Conll2002Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Writer

Description

Writes the CoNLL 2002 named entity format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

writeNamedEntity

Type: Boolean  — Default value: true

Table 15. Capabilities

Media types

text/x.org.dkpro.conll-2002

Inputs

Conll2003

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2003 format encodes named entity spans and chunk spans. Fields are separated by a single space. Sentences are separated by a blank new line. Named entities and chunks are encoded in the IOB1 format, i.e. a B prefix is only used for the first token of a span that directly follows another span of the same category. In all other cases, a span starts with an I prefix.

Table 16. Columns
Column  Type/Feature  Description
FORM    Token         Word form or punctuation symbol.
POSTAG  POS           part-of-speech tag (second column in the example below)
CHUNK   Chunk         chunk (IOB1 encoded)
NER     Named entity  named entity (IOB1 encoded)

Example
U.N.         NNP  I-NP  I-ORG
official     NN   I-NP  O
Ekeus        NNP  I-NP  I-PER
heads        VBZ  I-VP  O
for          IN   I-PP  O
Baghdad      NNP  I-NP  I-LOC
.            .    O     O
For readability, the columns in the example above are aligned. In actual files, there is only a single space separating the fields in each line.
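
In the example above, every span starts with an I- tag, which is the usual IOB1 convention: a B- tag is only needed to separate two directly adjacent spans of the same category. The following minimal Java sketch decodes a tag column under that convention. The class Iob1Decoder is a hypothetical illustration and not the decoder used by Conll2003Reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder that collects IOB1-tagged tokens into labelled spans. */
final class Iob1Decoder {
    /** Returns one "LABEL:covered text" entry per span, in document order. */
    static List<String> decode(String[] forms, String[] tags) {
        List<String> spans = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String label = null; // label of the open span, or null if none is open
        for (int i = 0; i < forms.length; i++) {
            String tag = tags[i];
            String category = tag.equals("O") ? null : tag.substring(2);
            // Only an I- tag with the same category extends the open span.
            boolean continues = category != null && category.equals(label)
                    && tag.startsWith("I-");
            if (continues) {
                text.append(' ').append(forms[i]);
                continue;
            }
            // O, a B- tag, or a change of category closes the open span.
            if (label != null) {
                spans.add(label + ":" + text);
            }
            text.setLength(0);
            label = category;
            if (label != null) {
                text.append(forms[i]);
            }
        }
        if (label != null) {
            spans.add(label + ":" + text);
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] forms = {"U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."};
        String[] nerTags = {"I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"};
        System.out.println(decode(forms, nerTags));
    }
}
```

Applied to the CHUNK column of the same example, the sketch merges the first three tokens into a single NP span, because their I-NP tags follow each other without a B- tag in between.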
Table 17. Known corpora in this format
Corpus                                      Language
AQMAR Arabic Wikipedia Named Entity Corpus  Arabic
CoNLL 2002 dataset                          Spanish
CoNLL 2002 dataset                          Dutch

Conll2003Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2003Reader

Description

Reads the CoNLL 2003 format.

Parameters
ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ChunkTagSet

Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readChunk

Write chunk information. Default: true

Type: Boolean  — Default value: true

readNamedEntity

Read named entity information. Default: true

Type: Boolean  — Default value: true

readPOS

Write part-of-speech information. Default: true

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 18. Capabilities

Media types

text/x.org.dkpro.conll-2003

Outputs

Conll2003Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2003Writer

Description

Writes the CoNLL 2003 format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeChunk

Type: Boolean  — Default value: true

writeNamedEntity

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true
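
The effect of escapeDocumentId, stripExtension and filenameExtension on the output file name can be pictured with a small Python sketch. The function below is hypothetical and only mirrors the parameter descriptions above. The order of the steps and the use of form-style URL encoding are assumptions, not the writer's actual code.

```python
import posixpath
from urllib.parse import quote_plus


def target_file_name(document_id, extension=".conll",
                     strip_extension=False, escape_document_id=True):
    """Derive an output file name from a document ID (illustrative only)."""
    name = document_id
    if strip_extension:
        # Drop the original extension, e.g. ".txt".
        name = posixpath.splitext(name)[0]
    if escape_document_id:
        # URL-encode characters such as ':' or '/' that are unsafe in file names.
        name = quote_plus(name)
    return name + extension
```

Under these assumptions, the document ID news:2003/doc1.txt becomes news%3A2003%2Fdoc1.txt.conll by default, and news%3A2003%2Fdoc1.conll when strip_extension is enabled.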

Table 19. Capabilities

Media types

text/x.org.dkpro.conll-2003

Inputs

Conll2006

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2006 (aka CoNLL-X) format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank line.

Table 20. Columns
Column Type/Feature  Description

ID

ignored

Token counter, starting at 1 for each new sentence.

FORM

Token

Word form or punctuation symbol.

LEMMA

Lemma

Lemma of the word form.

CPOSTAG

POS coarseValue

Coarse-grained part-of-speech tag, where the tagset depends on the language.

POSTAG

POS PosValue

Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.

FEATS

MorphologicalFeatures

Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.

HEAD

Dependency

Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.

DEPREL

Dependency

Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

PHEAD

ignored

Projective head of the current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but are always available).

PDEPREL

ignored

Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Example
Heutzutage heutzutage ADV _ _ ADV _ _
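
The ten-column layout described in the table above can be read with a few lines of Python. This is a minimal sketch, not the Conll2006Reader: the function names and the two-sentence sample are hypothetical, and every value other than FEATS is kept as a plain string.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_feats(cell):
    """Return the FEATS cell as a set; an underscore means no features."""
    return set() if cell == "_" else set(cell.split("|"))


def read_conll2006(text):
    """Split text into sentences (blank-line separated) of column dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token = dict(zip(COLUMNS, line.split("\t")))
        token["FEATS"] = parse_feats(token["FEATS"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample data, invented for this sketch.
SAMPLE = "\n".join([
    "\t".join(["1", "Dogs", "dog", "N", "NNS", "num=pl", "2", "SBJ", "_", "_"]),
    "\t".join(["2", "bark", "bark", "V", "VBP", "num=pl|tense=pres", "0", "ROOT", "_", "_"]),
    "",
    "\t".join(["1", "Yes", "yes", "I", "UH", "_", "0", "ROOT", "_", "_"]),
])

sentences = read_conll2006(SAMPLE)
```

Here sentences holds two sentences. The FEATS value of "bark" is a set with two entries, and the FEATS value of "Yes" is the empty set.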
Table 21. Known corpora in this format
Corpus Language

CoNLL-X Shared Task free data

Danish, Dutch, Portuguese, and Swedish

Copenhagen Dependency Treebanks

Danish

FinnTreeBank (in recent versions with additional pseudo-XML metadata)

Finnish

Floresta Sintá(c)tica (Bosque-CoNLL)

Portuguese

Sequoia corpus

French

SETimes.HR corpus and dependency treebank of Croatian

Croatian

Składnica zależnościowa

Polish

Slovene Dependency Treebank

Slovene

Swedish Treebank

Swedish

Talbanken05

Swedish

Uppsala Persian Dependency Treebank

Persian (Farsi)

Norwegian Dependency Treebank (NDT)

Norwegian

IULA Resources. Corpus & Tools. IULA Spanish LSP Treebank

Spanish

Turin University Treebank

Italian

Conll2006Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Reader

Description

Reads a file in the CoNLL-2006 format (aka CoNLL-X).

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This is useful if a custom model lacks such metadata, and it can also be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. An include pattern starts with #INCLUDE_PREFIX [+], and an exclude pattern starts with #EXCLUDE_PREFIX [-]. The wildcard /**/ matches any number of sub-directories. The wildcard * matches part of a name.

Optional — Type: String[]

readCPOS

Type: Boolean  — Default value: true

readDependency

Type: Boolean  — Default value: true

readLemma

Type: Boolean  — Default value: true

readMorph

Type: Boolean  — Default value: true

readPOS

Type: Boolean  — Default value: true

sourceEncoding

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useCPosAsPos

Enable to use CPOSTAG (column 4) as the part-of-speech tag. Otherwise POSTAG (column 5) is used.

Type: Boolean  — Default value: false

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true
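
The choice made by useCPosAsPos can be sketched as a column selection. Counting columns from 1 as in the CoNLL 2006 column table, CPOSTAG is the fourth column and POSTAG the fifth. The function name and sample row below are hypothetical, and the snippet only illustrates the selection, not how the reader maps tags to UIMA types.

```python
# Zero-based list indices of the fourth (CPOSTAG) and fifth (POSTAG) columns.
CPOSTAG, POSTAG = 3, 4


def pos_tag(columns, use_cpos_as_pos=False):
    """Pick the part-of-speech tag from one tab-split CoNLL 2006 row."""
    return columns[CPOSTAG] if use_cpos_as_pos else columns[POSTAG]


# Hypothetical row, invented for this sketch.
ROW = "\t".join(["1", "Dogs", "dog", "N", "NNS", "_", "2", "SBJ", "_", "_"]).split("\t")
```

For this row, pos_tag(ROW) yields the fine-grained tag and pos_tag(ROW, use_cpos_as_pos=True) yields the coarse-grained one.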

Table 22. Capabilities

Media types

text/x.org.dkpro.conll-2006

Outputs

Conll2006Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2006Writer

Description

Writes a file in the CoNLL-2006 format (aka CoNLL-X).

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeCPOS

Type: Boolean  — Default value: true

writeDependency

Type: Boolean  — Default value: true

writeLemma

Type: Boolean  — Default value: true

writeMorph

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true

Table 23. Capabilities

Media types

text/x.org.dkpro.conll-2006

Inputs

Conll2008

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2008 format targets syntactic and semantic dependencies. Columns are tab-separated. Sentences are separated by a blank line.

Table 24. Columns
Column Type/Feature  Description

ID

ignored

Token counter, starting at 1 for each new sentence.

FORM

Token

Word form or punctuation symbol.

LEMMA

Lemma

Lemma of the word form.

GPOS

POS PosValue

Gold fine-grained part-of-speech tag, where the tagset depends on the language.

PPOS

ignored

Automatically predicted major POS by a language-specific tagger.

SPLIT_FORM

ignored

Tokens split at hyphens and slashes.

SPLIT_LEMMA

ignored

Predicted lemma of SPLIT_FORM.

PPOSS

ignored

Predicted POS tags of the split forms.

HEAD

Dependency

Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.

DEPREL

Dependency

Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply ROOT.

PRED

SemPred

(Sense) identifier of the semantic predicate evoked by the current token.

APREDs

SemArg

Columns with argument labels for each semantic predicate (in the ID order).

Example
1   Some    some    DT  _   Some    some    DT  10  SBJ _   _   _   _   A1  _   _   _
2   of  of  IN  _   of  of  IN  1   NMOD    _   _   _   _   _   _   _   _
3   the the DT  _   the the DT  5   NMOD    _   _   _   _   _   _   _   _
4   strongest   strongest   JJS _   strongest   strong  JJS 5   NMOD    _   _   _   _   _   _   _   _
5   critics critics NNS _   critics critic  NNS 2   PMOD    critic.01   A0  _   _   _   _   _   _
6   of  of  IN  _   of  of  IN  5   NMOD    _   A1  _   _   _   _   _   _
7   our our PRP$    _   our our PRP$    9   NMOD    _   _   A1  A0  _   _   _   _
8   welfare welfare NN  _   welfare welfare NN  9   NMOD    welfare.01  _   A2  _   _   _   _   _
9   system  system  NN  _   system  system  NN  6   PMOD    system.01   _   _   _   _   _   _   _
10  are are VBP _   are be  VBP 0   ROOT    be.01   _   _   _   _   _   _   _
11  the the DT  _   the the DT  12  NMOD    _   _   _   _   _   _   _   _
12  people  people  NNS _   people  people  NNS 10  PRD person.02   _   _   _   A2  A0  A0  A1
13  who who WP  _   who who WP  14  SBJ _   _   _   _   _   _   _   _
14  have    have    VBP _   have    have    VBP 12  NMOD    have.04 _   _   _   _   SU  _   _
15  become  become  VBN _   become  become  VBN 14  VC  become.01   _   _   _   _   A1  A1  _
16  dependent   dependent   JJ  _   dependent   dependent   JJ  15  PRD _   _   _   _   _   _   _   _
17  on  on  IN  _   on  on  IN  16  AMOD    _   _   _   _   _   _   _   _
18  it  it  PRP _   it  it  PRP 17  PMOD    _   _   _   _   _   _   _   _
19  .   .   .   _   .   .   .   10  P   _   _   _   _   _   _   _   _
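
The HEAD and DEPREL columns are enough to recover the syntactic dependency edges of a sentence. The following Python sketch assumes the column order of the CoNLL 2008 column table (HEAD is the ninth column, DEPREL the tenth). The function name and the three-token sample are hypothetical.

```python
# Zero-based positions of the columns used here.
ID, FORM, HEAD, DEPREL = 0, 1, 8, 9


def dependency_edges(rows):
    """Return (dependent form, relation, governor form) for each token.

    The governor is None for tokens whose HEAD is zero.
    """
    table = [row.split("\t") for row in rows]
    forms = {cols[ID]: cols[FORM] for cols in table}
    edges = []
    for cols in table:
        governor = None if cols[HEAD] == "0" else forms[cols[HEAD]]
        edges.append((cols[FORM], cols[DEPREL], governor))
    return edges


# Hypothetical sentence without semantic predicates, invented for this sketch.
SAMPLE = [
    "\t".join(["1", "Dogs", "dog", "NNS", "_", "Dogs", "dog", "NNS", "2", "SBJ", "_"]),
    "\t".join(["2", "bark", "bark", "VBP", "_", "bark", "bark", "VBP", "0", "ROOT", "_"]),
    "\t".join(["3", ".", ".", ".", "_", ".", ".", ".", "2", "P", "_"]),
]

edges = dependency_edges(SAMPLE)
```

The token with HEAD 0 ends up with no governor, and the other two tokens attach to "bark".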
Table 25. Known corpora in this format
Corpus Language

MASC-CONLL

English

Conll2008Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2008Reader

Description

Reads a file in the CoNLL-2008 format.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This is useful if a custom model lacks such metadata, and it can also be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. An include pattern starts with #INCLUDE_PREFIX [+], and an exclude pattern starts with #EXCLUDE_PREFIX [-]. The wildcard /**/ matches any number of sub-directories. The wildcard * matches part of a name.

Optional — Type: String[]

readDependency

Type: Boolean  — Default value: true

readLemma

Type: Boolean  — Default value: true

readPOS

Type: Boolean  — Default value: true

readSemanticPredicate

Type: Boolean  — Default value: true

sourceEncoding

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 26. Capabilities

Media types

text/x.org.dkpro.conll-2008

Outputs

Conll2008Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2008Writer

Description

Writes a file in the CoNLL-2008 format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeDependency

Type: Boolean  — Default value: true

writeLemma

Type: Boolean  — Default value: true

writeMorph

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true

writeSemanticPredicate

Type: Boolean  — Default value: true

Table 27. Capabilities

Media types

text/x.org.dkpro.conll-2008

Inputs

Conll2009

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2009 format targets semantic role labeling. Columns are tab-separated. Sentences are separated by a blank line.

Table 28. Columns
Column Type/Feature  Description

ID

ignored

Token counter, starting at 1 for each new sentence.

FORM

Token

Word form or punctuation symbol.

LEMMA

Lemma

Lemma of the word form.

PLEMMA

ignored

Automatically predicted lemma of FORM.

POS

POS PosValue

Fine-grained part-of-speech tag, where the tagset depends on the language.

PPOS

ignored

Automatically predicted major POS by a language-specific tagger.

FEATS

MorphologicalFeatures

Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.

PFEAT

ignored

Automatically predicted morphological features (if applicable).

HEAD

Dependency

Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with an ID of zero.

PHEAD

ignored

Automatically predicted syntactic head.

DEPREL

Dependency

Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply ROOT.

PDEPREL

ignored

Automatically predicted dependency relation to PHEAD.

FILLPRED

ignored

Contains Y for argument-bearing tokens.

PRED

SemPred

(Sense) identifier of the semantic predicate evoked by the current token.

APREDs

SemArg

Columns with argument labels for each semantic predicate (in the ID order).

Example
1   The the the DT  DT  _   _   4   4   NMOD    NMOD    _   _   _   _
2   most    most    most    RBS RBS _   _   3   3   AMOD    AMOD    _   _   _   _
3   troublesome troublesome troublesome JJ  JJ  _   _   4   4   NMOD    NMOD    _   _   _   _
4   report  report  report  NN  NN  _   _   5   5   SBJ SBJ _   _   _   _
5   may may may MD  MD  _   _   0   0   ROOT    ROOT    _   _   _   _
6   be  be  be  VB  VB  _   _   5   5   VC  VC  _   _   _   _
7   the the the DT  DT  _   _   11  11  NMOD    NMOD    _   _   _   _
8   August  august  august  NNP NNP _   _   11  11  NMOD    NMOD    _   _   _   AM-TMP
9   merchandise merchandise merchandise NN  NN  _   _   10  10  NMOD    NMOD    _   _   A1  _
10  trade   trade   trade   NN  NN  _   _   11  11  NMOD    NMOD    Y   trade.01    _   A1
11  deficit deficit deficit NN  NN  _   _   6   6   PRD PRD Y   deficit.01  _   A2
12  due due due JJ  JJ  _   _   13  11  AMOD    APPO    _   _   _   _
13  out out out IN  IN  _   _   11  12  APPO    AMOD    _   _   _   _
14  tomorrow    tomorrow    tomorrow    NN  NN  _   _   13  12  TMP TMP _   _   _   _
15  .   .   .   .   .   _   _   5   5   P   P   _   _   _   _
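
Because the APRED columns follow the predicates in ID order, the n-th APRED column belongs to the n-th token with a filled PRED column. The Python sketch below pairs them up for rows 8 to 11 of the example above. The function name is hypothetical, and the rows are split on arbitrary whitespace for readability, whereas real files are tab-separated.

```python
# Zero-based positions: PRED is the 14th column, APREDs start at the 15th.
FORM, PRED, FIRST_APRED = 1, 13, 14

# Rows 8 to 11 of the example sentence.
ROWS = [
    "8 August august august NNP NNP _ _ 11 11 NMOD NMOD _ _ _ AM-TMP",
    "9 merchandise merchandise merchandise NN NN _ _ 10 10 NMOD NMOD _ _ A1 _",
    "10 trade trade trade NN NN _ _ 11 11 NMOD NMOD Y trade.01 _ A1",
    "11 deficit deficit deficit NN NN _ _ 6 6 PRD PRD Y deficit.01 _ A2",
]


def semantic_frames(rows):
    """Return a list of (predicate sense, [(argument form, label), ...])."""
    table = [row.split() for row in rows]
    # Predicates in ID order; their position selects the APRED column.
    predicates = [cols[PRED] for cols in table if cols[PRED] != "_"]
    frames = []
    for offset, sense in enumerate(predicates):
        column = FIRST_APRED + offset
        arguments = [(cols[FORM], cols[column])
                     for cols in table if cols[column] != "_"]
        frames.append((sense, arguments))
    return frames


frames = semantic_frames(ROWS)
```

The first APRED column belongs to trade.01 and the second to deficit.01, so "merchandise" is the A1 of trade.01, while "August", "trade" and "deficit" fill roles of deficit.01.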
Table 29. Known corpora in this format
Corpus Language

CoNLL 2009 Shared Task

Catalan, German, Japanese, Spanish

Conll2009Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Reader

Description

Reads a file in the CoNLL-2009 format.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This is useful if a custom model lacks such metadata, and it can also be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. An include pattern starts with #INCLUDE_PREFIX [+], and an exclude pattern starts with #EXCLUDE_PREFIX [-]. The wildcard /**/ matches any number of sub-directories. The wildcard * matches part of a name.

Optional — Type: String[]

readDependency

Type: Boolean  — Default value: true

readLemma

Type: Boolean  — Default value: true

readMorph

Type: Boolean  — Default value: true

readPOS

Type: Boolean  — Default value: true

readSemanticPredicate

Type: Boolean  — Default value: true

sourceEncoding

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 30. Capabilities

Media types

text/x.org.dkpro.conll-2009

Outputs

Conll2009Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2009Writer

Description

Writes a file in the CoNLL-2009 format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeDependency

Type: Boolean  — Default value: true

writeLemma

Type: Boolean  — Default value: true

writeMorph

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true

writeSemanticPredicate

Type: Boolean  — Default value: true

Table 31. Capabilities

Media types

text/x.org.dkpro.conll-2009

Inputs

Conll2012

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL 2012 format targets semantic role labeling and coreference. Columns are tab-separated. Sentences are separated by a blank line.

Table 32. Columns
Column Type/Feature  Description

Document ID

ignored

This is a variation on the document filename.

Part number

ignored

Some files are divided into multiple parts numbered as 000, 001, 002, etc.

Word number

ignored

Word itself

document text

This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which is replaced by the actual token from the Treebank that is part of the OntoNotes release.

Part-of-Speech

POS

Parse bit

Constituent

This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the ([pos] [word]) string (or leaf) and concatenating the items in the rows of that column.

Predicate lemma

Lemma

The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-".

Predicate Frameset ID

SemPred

This is the PropBank frameset ID of the predicate in Column 7.

Word sense

ignored

This is the word sense of the word in Column 3.

Speaker/Author

ignored

This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.

Named Entities

NamedEntity

These columns identify the spans representing various named entities.

Predicate Arguments

SemPred

There is one column of predicate argument structure information for each predicate mentioned in Column 7.

Coreference

CoreferenceChain

Coreference chain information encoded in a parenthesis structure.

Example
en-orig.conll   0   0       John   NNP   (TOP(S(NP*)      john   -   -          -   (PERSON)       (A0) (1)
en-orig.conll   0   1       went   VBD         (VP*         go go.02   -          -         *        (V*) -
en-orig.conll   0   2         to    TO         (PP*         to   -   -          -         *          *  -
en-orig.conll   0   3        the    DT         (NP*        the   -   -          -         *          *  (2
en-orig.conll   0   4     market    NN          *)))    market   -   -          -         *        (A1) 2)
en-orig.conll   0   5          .     .           *))         .   -   -          -         *          *  -
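
Two of the columns need some decoding: the parse bits must be re-assembled into a bracketed tree, and the coreference column must be turned into mention spans. The Python sketch below does both for the example rows above. The function names are hypothetical, and the handling of several mentions in one cell (separated by a vertical bar) follows the usual convention of the shared task data rather than anything stated in this document.

```python
import re

# (word, part-of-speech, parse bit, coreference) from the example rows.
TOKENS = [
    ("John", "NNP", "(TOP(S(NP*)", "(1)"),
    ("went", "VBD", "(VP*", "-"),
    ("to", "TO", "(PP*", "-"),
    ("the", "DT", "(NP*", "(2"),
    ("market", "NN", "*)))", "2)"),
    (".", ".", "*))", "-"),
]


def full_parse(tokens):
    """Substitute each asterisk with its ([pos] [word]) leaf and concatenate."""
    return "".join(bit.replace("*", f"({pos} {word})")
                   for word, pos, bit, _ in tokens)


def coreference_mentions(cells):
    """Return (chain id, first token index, last token index) per mention."""
    open_starts = {}  # chain id -> stack of start indices of open mentions
    mentions = []
    for index, cell in enumerate(cells):
        if cell == "-":
            continue
        for part in cell.split("|"):
            chain = re.sub(r"[()]", "", part)
            if part.startswith("("):
                open_starts.setdefault(chain, []).append(index)
            if part.endswith(")"):
                start = open_starts[chain].pop()
                mentions.append((chain, start, index))
    return mentions


tree = full_parse(TOKENS)
mentions = coreference_mentions([token[3] for token in TOKENS])
```

For the example, chain 1 covers the single token "John" and chain 2 covers "the market" (token indices 3 to 4, counting from zero).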

Conll2012Reader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Reader

Description

Reads a file in the CoNLL-2012 format.

Parameters
ConstituentMappingLocation

Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ConstituentTagSet

Use this constituent tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This is useful if a custom model lacks such metadata, and it can also be used in readers.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of the tag set defined in the model metadata. This is useful if a custom model lacks such metadata, and it can also be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. An include pattern starts with #INCLUDE_PREFIX [+], and an exclude pattern starts with #EXCLUDE_PREFIX [-]. The wildcard /**/ matches any number of sub-directories. The wildcard * matches part of a name.

Optional — Type: String[]

readConstituent

Type: Boolean  — Default value: true

readCoreference

Type: Boolean  — Default value: true

readLemma

Disabled by default because CoNLL 2012 format does not include lemmata for all words, only for predicates.

Type: Boolean  — Default value: false

readNamedEntity

Type: Boolean  — Default value: true

readPOS

Type: Boolean  — Default value: true

readSemanticPredicate

Type: Boolean  — Default value: true

readWordSense

Type: Boolean  — Default value: true

sourceEncoding

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

useHeaderMetadata

Use the document ID declared in the file header instead of using the filename.

Type: Boolean  — Default value: true

writeTracesToText

Optional — Type: Boolean  — Default value: false

Table 33. Capabilities

Media types

text/x.org.dkpro.conll-2012

Outputs

Conll2012Writer

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2012Writer

Description

Writer for the CoNLL-2012 format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeLemma

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true

writeSemanticPredicate

Type: Boolean  — Default value: true

Table 34. Capabilities

Media types

text/x.org.dkpro.conll-2012

Inputs

ConllU

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.conll-asl

The CoNLL-U format targets dependency parsing. Columns are tab-separated. Sentences are separated by a blank line.

Table 35. Columns
Column Type/Feature  Description

ID

ignored

Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.

FORM

Token

Word form or punctuation symbol.

LEMMA

Lemma

Lemma or stem of word form.

CPOSTAG

POS coarseValue

Part-of-speech tag from the universal POS tag set.

POSTAG

POS PosValue

Language-specific part-of-speech tag; underscore if not available.

FEATS

MorphologicalFeatures

List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.

HEAD

Dependency

Head of the current token, which is either a value of ID or zero (0).

DEPREL

Dependency

Universal Stanford dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.

DEPS

Dependency

List of secondary dependencies (head-deprel pairs).

MISC

unused

Any other annotation.

Example
1   They    they    PRON    PRN Case=Nom|Number=Plur    2   nsubj   4:nsubj _
2   buy buy VERB    VB  Number=Plur|Person=3|Tense=Pres 0   root    _   _
3   and and CONJ    CC  _   2   cc  _   _
4   sell    sell    VERB    VB  Number=Plur|Person=3|Tense=Pres 2   conj    0:root  _
5   books   book    NOUN    NNS Number=Plur 2   dobj    4:dobj  SpaceAfter=No
6   .   .   PUNCT   .   _   2   punct   _   _
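The column layout above can be read with plain string splitting. The following is a minimal sketch in Java and not the ConllUReader implementation: the class and method names are hypothetical, and skipping lines that start with # (comment lines in CoNLL-U files) is an assumption about the input.

```java
import java.util.ArrayList;
import java.util.List;

class ConllUColumnsSketch {
    // Column names in the order listed in the Columns table.
    static final String[] COLUMNS = { "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC" };

    // Splits one token line on tabs; the limit -1 keeps trailing empty fields.
    static String[] parseTokenLine(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                    "Expected " + COLUMNS.length + " columns but found " + fields.length);
        }
        return fields;
    }

    // Groups token lines into sentences; a blank line closes the current sentence.
    static List<List<String[]>> parseSentences(String text) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : text.split("\r?\n", -1)) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else if (!line.startsWith("#")) { // assumption: '#' starts a comment line
                current.add(parseTokenLine(line));
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1\tThey\tthey\tPRON\tPRN\tCase=Nom|Number=Plur\t2\tnsubj\t4:nsubj\t_\n"
                + "2\tbuy\tbuy\tVERB\tVB\tNumber=Plur|Person=3|Tense=Pres\t0\troot\t_\t_\n"
                + "\n";
        for (String[] token : parseSentences(text).get(0)) {
            // Index 1 is FORM, 6 is HEAD, 7 is DEPREL.
            System.out.println(token[1] + " -> head " + token[6] + " (" + token[7] + ")");
        }
    }
}
```

An underscore in a field means that the column has no value for that token, so callers would typically check for "_" before interpreting FEATS, DEPS or MISC.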
Table 36. Known corpora in this format
Corpus Language

Universal Dependency Treebank

Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish

ConllUReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.ConllUReader

Description

Reads a file in the CoNLL-U format.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readCPOS

Type: Boolean  — Default value: true

readDependency

Type: Boolean  — Default value: true

readLemma

Type: Boolean  — Default value: true

readMorph

Type: Boolean  — Default value: true

readPOS

Type: Boolean  — Default value: true

sourceEncoding

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useCPosAsPos

Type: Boolean  — Default value: false

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true
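To illustrate how the [+] and [-] prefixes of the patterns parameter could be interpreted, here is a small self-contained sketch in Java. It is not the DKPro Core implementation: the class, the method names and the simplified wildcard rules (** spans directories, * stays within one path segment, patterns without a prefix count as includes, any matching exclude wins) are assumptions made for this example.

```java
import java.util.regex.Pattern;

class AntPatternSketch {
    static final String INCLUDE_PREFIX = "[+]";
    static final String EXCLUDE_PREFIX = "[-]";

    // Translates a simplified Ant-like pattern into a regular expression.
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            if (glob.startsWith("**/", i)) {
                sb.append("(?:.*/)?"); // zero or more directories
                i += 3;
            } else if (glob.startsWith("**", i)) {
                sb.append(".*"); // anything, including separators
                i += 2;
            } else if (glob.charAt(i) == '*') {
                sb.append("[^/]*"); // part of a single name
                i++;
            } else {
                sb.append(Pattern.quote(String.valueOf(glob.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(sb.toString());
    }

    // A path is selected if some include matches it and no exclude matches it.
    static boolean isSelected(String relativePath, String... patterns) {
        boolean included = false;
        for (String p : patterns) {
            if (p.startsWith(EXCLUDE_PREFIX)) {
                String glob = p.substring(EXCLUDE_PREFIX.length());
                if (toRegex(glob).matcher(relativePath).matches()) {
                    return false;
                }
            } else {
                String glob = p.startsWith(INCLUDE_PREFIX)
                        ? p.substring(INCLUDE_PREFIX.length()) : p;
                if (toRegex(glob).matcher(relativePath).matches()) {
                    included = true;
                }
            }
        }
        return included;
    }

    public static void main(String[] args) {
        String[] patterns = { "[+]**/*.conll", "[-]tmp/**" };
        System.out.println(isSelected("train/part1/a.conll", patterns));
        System.out.println(isSelected("tmp/a.conll", patterns));
    }
}
```

Paths are expected to be relative to the source location and to use forward slashes in this sketch.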

Table 37. Capabilities

Media types

text/x.org.dkpro.conll-u

Outputs

ConllUWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.conll.ConllUWriter

Description

Writes a file in the CoNLL-U format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .conll

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeCPOS

Type: Boolean  — Default value: true

writeDependency

Type: Boolean  — Default value: true

writeLemma

Type: Boolean  — Default value: true

writeMorph

Type: Boolean  — Default value: true

writePOS

Type: Boolean  — Default value: true
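The effect of escapeDocumentId and filenameExtension on the output file name can be pictured with the following Java sketch. It assumes that escaping behaves like java.net.URLEncoder with UTF-8, which the parameter description suggests but does not state; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TargetFileNameSketch {
    // Optionally URL-encodes the document ID and appends the file name extension.
    static String fileNameFor(String documentId, boolean escapeDocumentId,
            String filenameExtension) {
        String name = documentId;
        if (escapeDocumentId) {
            try {
                // Assumption: escaping is equivalent to URLEncoder with UTF-8.
                name = URLEncoder.encode(documentId, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always available", e);
            }
        }
        return name + filenameExtension;
    }

    public static void main(String[] args) {
        // Backslash and colon are percent-encoded, the space becomes '+'.
        System.out.println(fileNameFor("C:\\data\\doc 1", true, ".conll"));
    }
}
```

With escaping enabled, characters such as \ and : no longer appear in the file name, which is the purpose given in the escapeDocumentId description.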

Table 38. Capabilities

Media types

text/x.org.dkpro.conll-u

Inputs

Ditop

DiTop

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.ditop-asl

DiTopWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.ditop.DiTopWriter

Description

This annotator (consumer) writes output files as required by DiTop. It requires JCas input annotated by de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer using the same model.

Parameters
appendConfig

If set to true, the new corpus will be appended to an existing config file. If false, the existing file is overwritten. Default: true.

Type: Boolean  — Default value: true

collectionValues

If set, only documents with one of the listed collection IDs are written, all others are ignored. If this is empty (null), all documents are written.

Optional — Type: String[]

collectionValuesExactMatch

If true (default), only write documents whose collection ID matches one of the collection values exactly. If false, write documents whose collection ID contains any of the collection values, ignoring case.

Type: Boolean  — Default value: true

compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

corpusName

The corpus name is used to name the corresponding sub-directory and will be set in the configuration file.

Type: String

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

maxTopicWords

The maximum number of topic words to extract. Default: 15

Type: Integer  — Default value: 15

modelLocation

A Mallet file storing a serialized ParallelTopicModel.

Type: String

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Directory in which to store output files.

Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 39. Capabilities

Media types

application/x.org.dkpro.ditop

Inputs

HTML

Html

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.html-asl

HtmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.html.HtmlReader

Description

Reads the contents of a given URL and strips the HTML. Returns the textual contents. Also recognizes headings and paragraphs.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 40. Capabilities

Media types

application/xhtml+xml text/html

Outputs

IMS Corpus Workbench

ImsCwb

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl

The IMS Open Corpus Workbench is a linguistic search engine. It uses a tab-separated format with limited markup (e.g. for sentences, documents, but not recursive structures like parse-trees). If a local installation of the corpus workbench is available, it can be used by this module to immediately generate the corpus workbench index format. Search is not supported by this module.

Known corpora in this format

ImsCwbReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbReader

Description

Reads a tab-separated format including pseudo-XML tags.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Specify which tag set should be used to locate the mapping file.

Optional — Type: String

generateNewIds

If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)

Type: Boolean  — Default value: false

idIsUrl

If true, the unit text ID encoded in the corpus file is stored as the URI in the document meta data. This setting is not affected by #PARAM_GENERATE_NEW_IDS (Default: false)

Type: Boolean  — Default value: false

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readLemma

Read lemmas. Default: true

Type: Boolean  — Default value: true

readPOS

Read part-of-speech tags and generate POS annotations or subclasses if a #PARAM_POS_TAG_SET tag set or #PARAM_POS_MAPPING_LOCATION mapping file is used. Default: true

Type: Boolean  — Default value: true

readSentence

Read sentences. Default: true

Type: Boolean  — Default value: true

readToken

Read tokens and generate Token annotations. Default: true

Type: Boolean  — Default value: true

replaceNonXml

Replace non-XML characters with spaces. (Default: true)

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 41. Capabilities

Media types

text/x.org.dkpro.imscwb

Outputs

ImsCwbWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter

Description

This Consumer outputs the content of all CASes into the IMS workbench format. This writer produces a text file which needs to be converted to the binary IMS CWB index files using the command line tools that come with the CWB. It is possible to set the parameter #PARAM_CQP_HOME to directly create output in the native binary CQP format via the original CWB command line tools.

Parameters
additionalFeatures

Write additional token-level annotation features. These have to be given as an array of fully qualified feature paths (fully.qualified.classname/featureName). The names for these annotations in CQP are their lowercase shortnames.

Optional — Type: String[]

corpusName

The name of the generated corpus.

Type: String  — Default value: corpus

cqpCompress

Set this parameter to compress the token streams and the indexes using cwb-huffcode and cwb-compress-rdx. With modern hardware, this may actually slow down queries, so it is turned off by default. For large data sets, it is best to test which setting works better. (default: false)

Type: Boolean  — Default value: false

cqpHome

Set this parameter to the directory containing the cwb-encode and cwb-makeall commands if you want the writer to directly encode into the CQP binary format.

Optional — Type: String

cqpwebCompatibility

Make document IDs compatible with CQPweb. CQPweb demands an id consisting of only letters, numbers and underscore.

Type: Boolean  — Default value: false

sentenceTag

Type: String  — Default value: s

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Location to which the output is written.

Type: String

writeCPOS

Write coarse-grained part-of-speech tags. These are the simple names of the UIMA types used to represent the part-of-speech tag.

Type: Boolean  — Default value: false

writeDocId

Write the document ID for each token. It is usually a better idea to generate a #PARAM_WRITE_DOCUMENT_TAG document tag or a #PARAM_WRITE_TEXT_TAG text tag which also contain the document ID that can be queried in CQP.

Type: Boolean  — Default value: false

writeDocumentTag

Write a pseudo-XML tag with the name document to mark the start and end of a document.

Type: Boolean  — Default value: false

writeLemma

Write lemmata.

Type: Boolean  — Default value: true

writeOffsets

Write the start and end position of each token.

Type: Boolean  — Default value: false

writePOS

Write part-of-speech tags.

Type: Boolean  — Default value: true

writeTextTag

Write a pseudo-XML tag with the name text to mark the start and end of a document. This is used by CQPweb.

Type: Boolean  — Default value: true
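The cqpwebCompatibility description states that CQPweb accepts only letters, numbers and underscores in an ID. One way such a normalization could look is sketched below in Java. How ImsCwbWriter actually rewrites IDs is not documented here, so the replacement character and the restriction to ASCII letters are assumptions, and the names are hypothetical.

```java
class CqpwebIdSketch {
    // Replaces every character that is not an ASCII letter, digit or underscore.
    // Assumption: '_' is used as the replacement character.
    static String toCqpwebId(String documentId) {
        return documentId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toCqpwebId("my-doc.v2.txt"));
    }
}
```

Note that a normalization of this kind can map different document IDs to the same result, so IDs that differ only in punctuation may need extra care.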

Table 42. Capabilities

Media types

text/x.org.dkpro.imscwb

Inputs

JDBC

Jdbc

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jdbc-asl

JdbcReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jdbc.JdbcReader

Description

Collection reader for JDBC databases. The obtained data is written to the CAS document text as well as to fields of the DocumentMetaData annotation.

The field names are available as constants and begin with CAS_. Please specify the mapping of the columns and the field names in the query. For example,

SELECT text AS cas_text, title AS cas_metadata_title FROM test_table

will create a CAS for each record, write the content of "text" column into CAS document text and that of "title" column into the document title field of the DocumentMetaData annotation.
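The aliasing convention from the query above can be captured in a small helper. The Java sketch below only assembles the query string; the constants are local stand-ins for the cas_text and cas_metadata_title names used in the example, and the class and method names are hypothetical.

```java
class JdbcQuerySketch {
    // Local stand-ins for the alias names used in the example query.
    static final String CAS_TEXT = "cas_text";
    static final String CAS_METADATA_TITLE = "cas_metadata_title";

    // Builds a query that maps a text column and a title column to the CAS fields.
    // Identifiers are concatenated as-is, so only trusted names should be passed.
    static String buildQuery(String table, String textColumn, String titleColumn) {
        return "SELECT " + textColumn + " AS " + CAS_TEXT + ", "
                + titleColumn + " AS " + CAS_METADATA_TITLE
                + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("test_table", "text", "title"));
    }
}
```

The resulting string would be passed as the value of the query parameter.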

Parameters
connection

Specifies the URL to the database.

If used with uimaFIT and the value is not given, jdbc:mysql://127.0.0.1/ will be taken.

Do not use this parameter to add additional parameters, but use #PARAM_CONNECTION_PARAMS instead.

Type: String  — Default value: jdbc:mysql://127.0.0.1/

connectionParams

Add additional parameters for the connection URL here in a single string: [&propertyName1=propertyValue1[&propertyName2=propertyValue2]...].

Type: String  — Default value: ``

database

Specifies name of the database to be accessed.

Type: String

driver

Specify the class name of the JDBC driver.

If used with uimaFIT and the value is not given, com.mysql.cj.jdbc.Driver will be taken.

Type: String  — Default value: com.mysql.cj.jdbc.Driver

language

Specifies the language.

Optional — Type: String

password

Specifies the password for database access.

Type: String

query

Specifies the query.

Type: String

user

Specifies the user name for database access.

Type: String

Table 43. Capabilities

Media types

none specified

Outputs

LIF

Lif

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.lif-asl

LifReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.lif.LifReader

Description

Reader for the LIF format.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 44. Capabilities

Media types

application/x.org.dkpro.lif+json

Outputs

LifWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.lif.LifWriter

Description

Writer for the LIF format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .json. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .json

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 45. Capabilities

Media types

application/x.org.dkpro.lif+json

Inputs

LXF

Lxf

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-lxf-asl

LxfReader

Implementation

org.dkpro.core.io.lxf.LxfReader

Description
No description available.
Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 46. Capabilities

Media types

application/x.org.dkpro.lxf+json

Outputs

LxfWriter

Implementation

org.dkpro.core.io.lxf.LxfWriter

Description
No description available.
Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

delta

Type: Boolean  — Default value: false

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .lxf. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .lxf

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 47. Capabilities

Media types

application/x.org.dkpro.lxf+json

Inputs

Mallet

MalletLdaTopicProportions

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletLdaTopicProportionsWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.io.MalletLdaTopicProportionsWriter

Description

Write topic proportions to a file in the shape [<docId>\t]<proportion_1>\t<proportion_2>\t..., where the leading document ID is optional.

This writer depends on the TopicDistribution annotation which needs to be created by MalletLdaTopicModelInferencer before.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

If #PARAM_SINGULAR_TARGET is set to false (default), this extension will be appended to the output files. Default: .topics.

Type: String  — Default value: .topics

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

writeDocid

If set to true (default), each output line is preceded by the document id.

Type: Boolean  — Default value: true
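The line shape produced by this writer, an optional document ID followed by tab-separated topic proportions, can be sketched in Java as follows. This is an illustration under assumptions rather than the writer's code: the number formatting (Java's default double-to-string conversion) and all names are hypothetical.

```java
class TopicProportionsLineSketch {
    // Joins the proportions with tabs; the document ID is prepended if requested.
    static String formatLine(String documentId, double[] proportions, boolean writeDocid) {
        StringBuilder sb = new StringBuilder();
        if (writeDocid) {
            sb.append(documentId);
        }
        for (double p : proportions) {
            if (sb.length() > 0) {
                sb.append('\t');
            }
            sb.append(p); // assumption: default double formatting
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("doc1", new double[] { 0.25, 0.75 }, true));
    }
}
```

Setting writeDocid to false in this sketch yields a line that contains only the proportions.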

Table 48. Capabilities

Media types

none specified

Inputs

none specified

MalletLdaTopicsProportionsSorted

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

MalletLdaTopicsProportionsSortedWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.io.MalletLdaTopicsProportionsSortedWriter

Description

Write the topic proportions according to an LDA topic model to an output file. The proportions need to be inferred in a previous step using MalletLdaTopicModelInferencer.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

nTopics

Type: Integer  — Default value: 3

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 49. Capabilities

Media types

none specified

Inputs

none specified

NEGRA

NegraExport

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.negra-asl

NegraExportReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader

Description

This CollectionReader reads a file in the NEGRA export format. The texts and additional information such as the constituent structure are reproduced in CASes, one CAS per text (article).

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

collectionId

The collection ID to be written to the document meta data. (Default: none)

Optional — Type: String

documentUnit

Determines when a new CAS is started. For example, if set to DocumentUnit#ORIGIN_NAME, a new CAS is generated whenever the origin name of the current sentence differs from the origin name of the last sentence. (Default: ORIGIN_NAME)

Type: String  — Default value: ORIGIN_NAME

generateNewIds

If true, the unit IDs are used only to detect if a new document (CAS) needs to be created, but for the purpose of setting the document ID, a new ID is generated. (Default: false)

Type: Boolean  — Default value: false

language

The language.

Optional — Type: String

readLemma

Read lemma information. Default: true

Type: Boolean  — Default value: true

readPOS

Read part-of-speech information. Default: true

Type: Boolean  — Default value: true

readPennTree

Read Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set. Default: false

Type: Boolean  — Default value: false

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Type: String

Table 50. Capabilities

Media types

application/x.org.dkpro.negra3 application/x.org.dkpro.negra4

Outputs

New York Times Corpus

NYTCollection

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-nyt-asl

NYTCollectionReader

Implementation

org.dkpro.core.io.nyt.NYTCollectionReader

Description
No description available.
Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

offset

A number of documents which will be skipped at the beginning.

Optional — Type: Integer

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]
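
The prefix convention can be sketched in plain Java. This is a minimal illustration, not the reader's implementation: the class PatternSketch is hypothetical, it uses the JDK glob matcher as a stand-in for Ant-style matching (the two are similar but not identical), and it assumes that a pattern without a prefix counts as an include pattern.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PatternSketch {
    private final List<PathMatcher> includes = new ArrayList<>();
    private final List<PathMatcher> excludes = new ArrayList<>();

    PatternSketch(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.startsWith("[+]")) {
                includes.add(glob(pattern.substring(3)));
            } else if (pattern.startsWith("[-]")) {
                excludes.add(glob(pattern.substring(3)));
            } else {
                includes.add(glob(pattern)); // assumption: no prefix means include
            }
        }
    }

    private static PathMatcher glob(String pattern) {
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    /** A relative path is accepted if some include pattern matches and no exclude pattern matches. */
    boolean accepts(String relativePath) {
        Path path = Paths.get(relativePath);
        boolean included = includes.stream().anyMatch(m -> m.matches(path));
        boolean excluded = excludes.stream().anyMatch(m -> m.matches(path));
        return included && !excluded;
    }

    public static void main(String[] args) {
        PatternSketch sketch = new PatternSketch("[+]**/*.xml", "[-]**/draft/**");
        System.out.println(sketch.accepts("1987/01/doc.xml"));
        System.out.println(sketch.accepts("1987/draft/doc.xml"));
    }
}
```

In this sketch, exclude patterns take precedence over include patterns, so a file below a draft directory is rejected even though it ends in .xml.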

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 51. Capabilities

Media types

none specified

Outputs

none specified

NIF

Nif

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-nif-asl

The NLP Interchange Format (NIF) provides a way of representing NLP information using semantic web technology, specifically RDF and OWL.

NifReader

Implementation

org.dkpro.core.io.nif.NifReader

Description

Reader for the NLP Interchange Format (NIF). The file format (e.g. TURTLE, etc.) is automatically chosen depending on the name of the file(s) being read. Compressed files are supported.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 52. Capabilities

Media types

application/x.org.dkpro.nif+turtle

Outputs

NifWriter

Implementation

org.dkpro.core.io.nif.NifWriter

Description

Writer for the NLP Interchange Format (NIF).

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true
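
What URL-encoding does to a document ID can be shown with the JDK's own encoder. Whether the writer uses exactly this encoder is an assumption; the class DocumentIdEscaper is hypothetical and only demonstrates the effect on characters such as \ and :.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class DocumentIdEscaper {
    /** URL-encodes a document ID so that it can be used as a file name. */
    static String escape(String documentId) {
        try {
            return URLEncoder.encode(documentId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("corpus\\part:1")); // prints corpus%5Cpart%3A1
    }
}
```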

filenameExtension

Specify the suffix of output files. Default value .ttl. The file format will be chosen depending on the file suffix.

Type: String  — Default value: .ttl

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 53. Capabilities

Media types

application/x.org.dkpro.nif+turtle

Inputs

PDF

Pdf

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.pdf-asl

PdfReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.pdf.PdfReader

Description

Collection reader for PDF files. Uses simple heuristics to detect headings and paragraphs.

Parameters
endPage

The last page to be extracted from the PDF.

Optional — Type: Integer  — Default value: -1

headingType

The type used to annotate headings.

Optional — Type: String  — Default value: <built-in>

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

paragraphType

The type used to annotate paragraphs.

Optional — Type: String  — Default value: <built-in>

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

startPage

The first page to be extracted from the PDF.

Optional — Type: Integer  — Default value: -1

substitutionTableLocation

The location of the substitution table used to post-process the text extracted from the PDF, e.g. to convert ligatures to separate characters.

Optional — Type: String  — Default value: <built-in>
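
The ligature example can be sketched as a simple string substitution. The class LigatureSubstitution and its in-memory table are hypothetical; the actual format and content of the built-in substitution table are not described here, so the three entries below are only illustrative Unicode ligatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LigatureSubstitution {
    // Illustrative table: Unicode ligature code points and their separated forms.
    static final Map<String, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("\uFB00", "ff");
        TABLE.put("\uFB01", "fi");
        TABLE.put("\uFB02", "fl");
    }

    /** Replaces every ligature listed in the table by its separated characters. */
    static String apply(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : TABLE.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(apply("ef\uFB01cient work\uFB02ow")); // prints efficient workflow
    }
}
```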

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 54. Capabilities

Media types

application/pdf

Outputs

Penn Treebank Format

PennTreebankChunked

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

PennTreebankChunkedReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankChunkedReader

Description

Penn Treebank chunked format reader.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readChunk

Write chunk annotations to the CAS.

Type: Boolean  — Default value: true

readPOS

Write part-of-speech annotations to the CAS.

Type: Boolean  — Default value: true

readSentence

Write sentence annotations to the CAS.

Type: Boolean  — Default value: true

readToken

Write token annotations to the CAS.

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 55. Capabilities

Media types

text/x.org.dkpro.ptb-chunked

Outputs

PennTreebankCombined

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.penntree-asl

Known corpora in this format

PennTreebankCombinedReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedReader

Description

Penn Treebank combined format reader.

Parameters
ConstituentMappingLocation

Load the constituent tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ConstituentTagSet

Use this constituent tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Optional — Type: Boolean  — Default value: true
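
The effect of String#intern() can be demonstrated in isolation. The class InternSketch is hypothetical and merely simulates reading the same tag twice; it is not taken from the reader.

```java
final class InternSketch {
    /** Simulates a tag read from a file: every call creates a new String object. */
    static String readTag(String raw) {
        return new String(raw);
    }

    /** Same as readTag, but returns the canonical instance from the string pool. */
    static String readTagInterned(String raw) {
        return readTag(raw).intern();
    }

    public static void main(String[] args) {
        // Two separate objects with equal content.
        System.out.println(readTag("NN") == readTag("NN"));
        // One shared object, so thousands of "NN" tags occupy the heap only once.
        System.out.println(readTagInterned("NN") == readTagInterned("NN"));
    }
}
```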

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readPOS

Whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: true

Type: Boolean  — Default value: true

removeTraces

Optional — Type: Boolean  — Default value: true

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

writeTracesToText

Optional — Type: Boolean  — Default value: false

Table 56. Capabilities

Media types

text/x.org.dkpro.ptb-combined

Outputs

PennTreebankCombinedWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.penntree.PennTreebankCombinedWriter

Description

Penn Treebank combined format writer.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

emptyRootLabel

Type: Boolean  — Default value: false

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .mrg. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .mrg

noRootLabel

Type: Boolean  — Default value: false

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 57. Capabilities

Media types

text/x.org.dkpro.ptb-combined

Inputs

Reuters-21578

Reuters21578Sgml

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578SgmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578SgmlReader

Description

Read a Reuters-21578 corpus in SGML format.

Set the directory that contains the SGML files with #PARAM_SOURCE_LOCATION.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 58. Capabilities

Media types

application/x.org.dkpro.reuters21578+sgml

Outputs

Reuters21578Txt

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.reuters-asl

Reuters21578TxtReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.reuters.Reuters21578TxtReader

Description

Read a Reuters-21578 corpus that has been transformed into text format using ExtractReuters in the lucene-benchmarks project.

The #PARAM_SOURCE_LOCATION parameter should typically point to the file name pattern reut2-*.txt, preceded by the corpus root directory.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 59. Capabilities

Media types

text/x.org.dkpro.reuters21578

Outputs

RTF

RTF

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.rtf-asl

RTFReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.rtf.RTFReader

Description

Read RTF (Rich Text Format) files. Uses RTFEditorKit for parsing RTF.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 60. Capabilities

Media types

application/rtf text/rtf

Outputs

Solr

Solr

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.solr-asl

SolrWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.solr.SolrWriter

Description

A simple implementation of SolrWriter_ImplBase

Parameters
numThreads

The number of background threads used to empty the queue. Default: 1.

Type: Integer  — Default value: 1

optimizeIndex

If set to true, the index is optimized once all documents are uploaded. Default is false.

Type: Boolean  — Default value: false

queueSize

The buffer size before the documents are sent to the server (default: 10000).

Type: Integer  — Default value: 10000
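
The idea of buffering documents before sending them can be sketched without a Solr dependency. This is a deliberately simplified, single-threaded illustration and not the writer's implementation: the class BufferedUploader is hypothetical, documents are plain strings, and "sending" just records the batch.

```java
import java.util.ArrayList;
import java.util.List;

final class BufferedUploader {
    private final int queueSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> sentBatches = new ArrayList<>();

    BufferedUploader(int queueSize) {
        this.queueSize = queueSize;
    }

    /** Buffers a document and sends the buffer once it holds queueSize documents. */
    void add(String document) {
        buffer.add(document);
        if (buffer.size() >= queueSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call once at the end to send a partial batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sentBatches.add(new ArrayList<>(buffer)); // stand-in for a request to the server
        buffer.clear();
    }

    List<List<String>> sentBatches() {
        return sentBatches;
    }
}
```

A larger buffer means fewer requests to the server at the cost of holding more documents in memory.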

solrIdField

The name of the id field in the Solr schema (default: "id").

Type: String  — Default value: id

targetLocation

Solr server URL string in the form <protocol>://<host>:<port>/<path>, e.g. http://localhost:8983/solr/collection1

Type: String
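
The parts of the example URL can be inspected with java.net.URI. The class SolrUrlParts is hypothetical and only decomposes the string; it does not contact a server.

```java
import java.net.URI;

final class SolrUrlParts {
    /** Returns scheme, host, port and path of a URL, separated by " | ". */
    static String describe(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort() + " | " + uri.getPath();
    }

    public static void main(String[] args) {
        // prints http | localhost | 8983 | /solr/collection1
        System.out.println(describe("http://localhost:8983/solr/collection1"));
    }
}
```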

textField

The name of the text field in the Solr schema (default: "text").

Type: String  — Default value: text

update

Define whether existing documents with the same ID are updated (true) or overwritten (false). Default: true (update).

Type: Boolean  — Default value: true

waitFlush

Whether to block until index changes are flushed to disk when committing to the index, i.e. when all documents are processed. Default: true.

Type: Boolean  — Default value: true

waitSearcher

Whether to block until a new searcher is opened and registered as the main query searcher, making the changes visible, when committing to the index, i.e. when all documents are processed. Default: true.

Type: Boolean  — Default value: true

Table 61. Capabilities

Media types

none specified

Inputs

none specified

TCF

Tcf

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tcf-asl

The TCF (Text Corpus Format) was created in the context of the CLARIN project. It is mainly used to exchange data between the different web-services that are part of the WebLicht platform.

TcfReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfReader

Description

Reader for the WebLicht TCF format. It reads all available annotation layers from the TCF file and converts them to CAS annotations. TCF data does not carry begin/end offsets for all of its annotations, but CAS annotations require them. Hence, offsets are calculated per token and stored in a map (token_id, token (CAS object)), from which the offset of a token can be looked up later.
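
The offset bookkeeping described above can be sketched in plain Java. This is not the reader's code: the class TokenOffsetSketch is hypothetical, tokens are plain strings instead of CAS objects, and offsets are found by searching the text from left to right, which is an assumption about how such offsets could be obtained.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TokenOffsetSketch {
    /** Maps each token ID to its {begin, end} character offsets in the text. */
    static Map<String, int[]> computeOffsets(String text, List<String> ids, List<String> tokens) {
        Map<String, int[]> offsets = new LinkedHashMap<>();
        int cursor = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int begin = text.indexOf(token, cursor); // search only after the previous token
            if (begin < 0) {
                throw new IllegalArgumentException("Token not found in text: " + token);
            }
            int end = begin + token.length();
            offsets.put(ids.get(i), new int[] { begin, end });
            cursor = end;
        }
        return offsets;
    }

    public static void main(String[] args) {
        Map<String, int[]> offsets = computeOffsets("Hello world !",
                Arrays.asList("t1", "t2", "t3"), Arrays.asList("Hello", "world", "!"));
        // Other layers that reference "t2" can now look up its offsets.
        System.out.println(Arrays.toString(offsets.get("t2"))); // prints [6, 11]
    }
}
```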

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 62. Capabilities

Media types

text/tcf+xml

Outputs

TcfWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tcf.TcfWriter

Description

Writer for the WebLicht TCF format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .tcf. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .tcf

merge

Merge with source TCF file if one is available.
Default: true

Type: Boolean  — Default value: true

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

preserveIfEmpty

If there are no annotations for a particular layer in the CAS, preserve any potentially existing annotations in the original TCF.
Default: false

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 63. Capabilities

Media types

text/tcf+xml

Inputs

TEI

Tei

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tei-asl

TeiReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tei.TeiReader

Description

Reader for TEI XML.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

omitIgnorableWhitespace

Do not write ignorable whitespace from the XML file to the CAS.

Type: Boolean  — Default value: false

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readConstituent

Write constituent annotations to the CAS.

Type: Boolean  — Default value: true

readLemma

Write lemma annotations to the CAS.

Type: Boolean  — Default value: true

readNamedEntity

Write named entity annotations to the CAS.

Type: Boolean  — Default value: true

readPOS

Write part-of-speech annotations to the CAS.

Type: Boolean  — Default value: true

readParagraph

Write paragraph annotations to the CAS.

Type: Boolean  — Default value: true

readSentence

Write sentence annotations to the CAS.

Type: Boolean  — Default value: true

readToken

Write token annotations to the CAS.

Type: Boolean  — Default value: true

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

useFilenameId

When not using the XML ID, use only the filename instead of the whole URL as ID. Mind that the filenames should be unique in this case.

Type: Boolean  — Default value: false

useXmlId

Use the xml:id attribute on the TEI elements as document ID. Mind that many TEI files may not have this attribute on all TEI elements and you may end up with no document ID at all. Also mind that the IDs should be unique.

Type: Boolean  — Default value: false

utterancesAsSentences

Interpret utterances "u" as sentences "s". (EXPERIMENTAL)

Type: Boolean  — Default value: false

Table 64. Capabilities

Media types

application/tei+xml

Outputs

TeiWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tei.TeiWriter

Description

UIMA CAS consumer writing the CAS document text in TEI format.

Parameters
cTextPattern

A token matching this pattern is rendered as a TEI "c" element instead of a "w" element.

Type: String  — Default value: [,.:;()]|(``)|('')|(--)
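
The default pattern can be tried out with java.util.regex. The class TeiElementChooser is hypothetical, and matching the whole token (rather than a part of it) is an assumption of this sketch.

```java
import java.util.regex.Pattern;

final class TeiElementChooser {
    // The default value of cTextPattern as listed above.
    static final Pattern C_TEXT = Pattern.compile("[,.:;()]|(``)|('')|(--)");

    /** Returns "c" if the whole token matches the pattern, otherwise "w". */
    static String elementFor(String token) {
        return C_TEXT.matcher(token).matches() ? "c" : "w";
    }

    public static void main(String[] args) {
        for (String token : new String[] { "house", ",", "``", "--", "!" }) {
            System.out.println(token + " -> " + elementFor(token));
        }
    }
}
```

Note that the default pattern does not cover every punctuation mark: a token such as "!" falls through to "w" in this sketch.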

compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .xml. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .xml

indent

Indent the XML.

Type: Boolean  — Default value: false

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

writeConstituent

Write constituent annotations to the CAS. Disabled by default because it requires type priorities to be set up (Constituents must have a higher priority than Tokens).

Type: Boolean  — Default value: false

writeNamedEntity

Write named entity annotations to the CAS. Overlapping named entities are not supported.

Type: Boolean  — Default value: true

Table 65. Capabilities

Media types

application/tei+xml

Inputs

Text

String

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.text-asl

StringReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.StringReader

Description

Simple reader that generates a CAS from a String. This can be useful in situations where a reader is preferred over manually crafting a CAS using JCasFactory#createJCas().

Parameters
collectionId

The collection ID to set in the DocumentMetaData.

Type: String  — Default value: COLLECTION_ID

documentBaseUri

The document base URI to set in the DocumentMetaData.

Optional — Type: String

documentId

The document ID to set in the DocumentMetaData.

Type: String  — Default value: DOCUMENT_ID

documentText

The document text.

Type: String

documentUri

The document URI to set in the DocumentMetaData.

Type: String  — Default value: STRING

language

Set this as the language of the produced documents.

Type: String

Table 66. Capabilities

Media types

text/plain

Outputs

Text

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.text-asl

TextReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TextReader

Description

UIMA collection reader for plain text files.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Name of optional configuration parameter that contains the language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceEncoding

Name of configuration parameter that contains the character encoding used by the input files.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 67. Capabilities

Media types

text/plain

Outputs

TextWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TextWriter

Description

UIMA CAS consumer writing the CAS document text as plain text file.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true
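As an illustration of what URL-encoding does to a document ID, the sketch below encodes an ID with the JDK's URLEncoder. Whether the writer uses exactly this call internally is an assumption; the class name DocumentIdEscapeSketch and the sample ID are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DocumentIdEscapeSketch {
    // URL-encodes a document ID so that it can serve as a file name.
    // Uses the Charset overload of URLEncoder.encode, available since Java 10.
    static String escape(String documentId) {
        return URLEncoder.encode(documentId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Backslashes and colons are replaced by percent escapes.
        System.out.println(escape("C:\\corpus\\doc1") + ".txt");
    }
}
```

Characters such as letters, digits, "." and "-" pass through unchanged, so ordinary document IDs keep their readable form.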

filenameExtension

Specify the suffix of output files. Default value .txt. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .txt

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 68. Capabilities

Media types

text/plain

Inputs

TokenizedText

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.text-asl

TokenizedTextWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.text.TokenizedTextWriter

Description

This class writes a set of pre-processed documents into a large text file containing one sentence per line, with tokens separated by whitespace. Optionally, annotations other than tokens (e.g. lemmas) are written as specified by #PARAM_FEATURE_PATH.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

coveringType

In the output file, each unit of the covering type is written into a separate line. The default (set in #DEFAULT_COVERING_TYPE) is sentences, so that each sentence is written to its own line.

If no line breaks within a document are desired, set this value to null.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

extension

Set the output file extension. Default: .txt.

Type: String  — Default value: .txt

featurePath

The feature path, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value for lemmas. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token (i.e. token texts).

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

numberRegex

Type: String  — Default value: ``

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stopwordsFile

All the tokens listed in this file (one token per line) are replaced by STOP. Empty lines and lines starting with # are ignored. Casing is ignored.

Type: String  — Default value: ``
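The stopword rules described above can be sketched with plain JDK collections. This is an illustrative approximation and not the writer's implementation: the class and method names are hypothetical, and trimming each line before checking for # is an assumption.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class StopwordSketch {
    // Builds a lower-cased stopword set, skipping empty lines and lines starting with '#'.
    static Set<String> loadStopwords(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim(); // assumption: surrounding whitespace is ignored
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            stopwords.add(trimmed.toLowerCase(Locale.ROOT));
        }
        return stopwords;
    }

    // Joins the tokens of one sentence with single spaces, replacing stopwords by STOP.
    static String render(List<String> tokens, Set<String> stopwords) {
        StringJoiner line = new StringJoiner(" ");
        for (String token : tokens) {
            boolean isStopword = stopwords.contains(token.toLowerCase(Locale.ROOT));
            line.add(isStopword ? "STOP" : token);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Set<String> stopwords = loadStopwords(List.of("# articles", "", "the", "a"));
        // Casing is ignored, so "The" matches the stopword "the".
        System.out.println(render(List.of("The", "cat", "sat", "."), stopwords));
    }
}
```

Each call to render corresponds to one output line, matching the default covering type of one sentence per line.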

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Encoding for the target file. Default is UTF-8.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 69. Capabilities

Media types

text/plain

Inputs

TGrep2

TGrep

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl

TGrep and TGrep2 are tools for searching syntactic parse trees represented as bracketed structures. This module supports TGrep2 in particular and makes it convenient to generate TGrep2 indexes, which can then be searched. Search itself is not supported by this module.

See also

TGrepWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tgrep.TGrepWriter

Description

TGrep2 corpus file writer. Requires PennTree annotations to be present in the CAS.

Parameters
compression

Method to compress the tgrep file (only used if PARAM_WRITE_T2C is true). Only NONE, GZIP and BZIP2 are supported. Default: CompressionMethod#NONE

Type: String  — Default value: NONE

dropMalformedTrees

If true, silently drops malformed Penn Trees instead of throwing an exception. Default: false

Type: Boolean  — Default value: false

targetLocation

Path to which the output is written.

Type: String

writeComments

Set this parameter to true if you want to add a comment to each PennTree which is written to the output files. The comment is of the form documentId,beginOffset,endOffset. Default: true

Type: Boolean  — Default value: true

writeT2c

Set this parameter to true if you want to encode directly into the tgrep2 binary format. Default: true

Type: Boolean  — Default value: true

Table 70. Capabilities

Media types

application/x.org.dkpro.tgrep2

Inputs

TIGER-XML

TigerXml

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tiger-asl

The TIGER XML format was created for encoding syntactic constituency structures in the German TIGER corpus. It has since been used for many other corpora as well. TIGERSearch is a linguistic search engine specifically targeting this format. The format was later extended to also support semantic frame annotations.

Known corpora in this format

TigerXmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlReader

Description

UIMA collection reader for TIGER-XML files. Also supports the augmented format used in the Semeval 2010 task which includes semantic role data.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

ignoreIllegalSentences

If a sentence has an illegal structure (e.g. TIGER 2.0 has non-terminal nodes that do not have child nodes), then just ignore these sentences. Default: false

Type: Boolean  — Default value: false

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readPennTree

Generate Penn Treebank bracketed structure information. Note that this may not work with all tag sets, in particular not with those that contain "(" or ")" in their tags. The tree is generated using the original tag set in the corpus, not the mapped tag set. Default: false

Type: Boolean  — Default value: false

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 71. Capabilities

Media types

application/x.org.dkpro.semeval-2010+xml application/x.org.dkpro.tiger+xml

Outputs

TigerXmlWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tiger.TigerXmlWriter

Description

UIMA CAS consumer writing the CAS document text in the TIGER-XML format.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .xml. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .xml

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 72. Capabilities

Media types

application/x.org.dkpro.tiger+xml

Inputs

Tika

Tika

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-tika-asl

TikaReader

Implementation

org.dkpro.core.io.tika.TikaReader

Description

Reader for many file formats based on Apache Tika.

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 73. Capabilities

Media types

none specified

Outputs

none specified

TUEBADZ

TuebaDZ

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tuebadz-asl

The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).

Sentences have a header line and are followed by a blank line.

Table 74. Columns
Column Type/Feature  Description

FORM

Token

Word form or punctuation symbol.

POSTAG

POS PosValue

Fine-grained part-of-speech tag, where the tagset depends on the language.

CHUNK

Chunk

Chunk tag (BIO encoded). For named entities, the tag can also include the entity type, e.g. B-NX=ORG.

Example
%% sent no. 1
Veruntreute  VVFIN   B-VXFIN
die          ART     B-NX=ORG
AWO          NN      I-NX=ORG
Spendengeld  NN      B-NX
?   $.  O
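
The column layout shown in the example can be read with a few lines of plain Java. The sketch below is not the TuebaDZReader implementation: the class TuebaDzChunkSketch and its Row type are hypothetical, and it assumes that columns are separated by whitespace and that header lines start with %%.

```java
import java.util.ArrayList;
import java.util.List;

class TuebaDzChunkSketch {
    // One token line: word form, POS tag, BIO chunk tag and optional named entity type.
    static final class Row {
        final String form;
        final String pos;
        final String chunk;
        final String neType; // null if the chunk tag carries no "=TYPE" suffix

        Row(String form, String pos, String chunk, String neType) {
            this.form = form;
            this.pos = pos;
            this.chunk = chunk;
            this.neType = neType;
        }
    }

    static List<Row> parseSentence(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%%")) {
                continue; // skip the sentence header and the blank separator line
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            String chunk = cols[2];
            String neType = null;
            int eq = chunk.indexOf('=');
            if (eq >= 0) {
                // Split a tag such as B-NX=ORG into chunk tag and entity type.
                neType = chunk.substring(eq + 1);
                chunk = chunk.substring(0, eq);
            }
            rows.add(new Row(cols[0], cols[1], chunk, neType));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of(
                "%% sent no. 1",
                "Veruntreute  VVFIN   B-VXFIN",
                "die          ART     B-NX=ORG",
                "AWO          NN      I-NX=ORG",
                "Spendengeld  NN      B-NX",
                "?   $.  O",
                "");
        for (Row row : parseSentence(sentence)) {
            System.out.println(row.form + " / " + row.pos + " / " + row.chunk + " / " + row.neType);
        }
    }
}
```

Keeping the entity type separate from the chunk tag mirrors the reader's separate readChunk and readNamedEntity switches.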
Known corpora in this format

TuebaDZReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tuebadz.TuebaDZReader

Description

Reads the TüBa-D/Z chunking format.

Parameters
ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ChunkTagSet

Use this chunk tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

readChunk

Read chunk information. Default: true

Type: Boolean  — Default value: true

readNamedEntity

Read named entity information. Default: false

Type: Boolean  — Default value: false

readPOS

Read part-of-speech information. Default: true

Type: Boolean  — Default value: true

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 75. Capabilities

Media types

application/x.org.dkpro.tuebadz-chunk

Outputs

TüPP-D/Z

Tuepp

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.tuepp-asl

TüPP-D/Z is a collection of articles from the German newspaper taz (die tageszeitung), annotated and encoded in an XML format.

Known corpora in this format

TueppReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.tuepp.TueppReader

Description
UIMA collection reader for Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) XML files.
  • Only the part-of-speech with the best rank (rank 1) is read. If there is a tie between multiple tags, the first one from the XML file is read.
  • Only the first lemma (baseform) from the XML file is read.
  • Tokens are read, but not the specific kind of token (e.g. TEL, AREA, etc.).
  • Article boundaries are not read.
  • Paragraph boundaries are not read.
  • Lemma information is read, but morphological information is not read.
  • Chunk, field, and clause information is not read.
  • Meta data headers are not read.
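
The tie-breaking rule for part-of-speech tags in the first item above can be sketched as follows. This is only an illustration of the selection rule and not the TueppReader code: the Candidate type is hypothetical, and the sketch assumes that a lower rank number means a better rank, with rank 1 being the best.

```java
import java.util.List;

class BestRankSketch {
    // A part-of-speech candidate with its rank, in the order found in the XML file.
    static final class Candidate {
        final String tag;
        final int rank;

        Candidate(String tag, int rank) {
            this.tag = tag;
            this.rank = rank;
        }
    }

    // Returns the candidate with the lowest rank number; on a tie the earlier one wins.
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate candidate : candidates) {
            // The strict comparison keeps the first candidate when ranks are equal.
            if (best == null || candidate.rank < best.rank) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("NN", 2),
                new Candidate("ADJA", 1),
                new Candidate("VVFIN", 1));
        System.out.println(best(candidates).tag);
    }
}
```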
Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

POSTagSet

Use this part-of-speech tag set to resolve the tag set mapping instead of using the tag set defined as part of the model meta data. This can be useful if a custom model is specified which does not have such meta data, or it can be used in readers.

Optional — Type: String

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceEncoding

Character encoding of the input data.

Type: String  — Default value: UTF-8

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 76. Capabilities

Media types

application/x.org.dkpro.tuepp+xml

Outputs

UIMA Binary CAS

BinaryCas

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

The CAS is the native data model used by UIMA. There are various ways of saving CAS data, using XMI, XCAS, or binary formats. This module supports the binary formats.

BinaryCasReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasReader

Description

UIMA Binary CAS formats reader.

Parameters
addDocumentMetadata

Add DKPro Core metadata if it is not already present in the document.

Type: Boolean  — Default value: true

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

mergeTypeSystem

Determines whether the type system from a currently read file should be merged with the current type system.

Type: Boolean  — Default value: false

overrideDocumentMetadata

Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the file.

Type: Boolean  — Default value: false

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

typeSystemLocation

The location from which to obtain the type system when the CAS is stored in form 0.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 77. Capabilities

Media types

application/x.org.dkpro.uima+binary

Outputs

none specified

BinaryCasWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasWriter

Description

Write CAS in one of the UIMA binary formats.

All the supported formats except 6+ can also be loaded and saved via the UIMA CasIOUtils.

Supported formats
Format Description Type system on load CAS Addresses preserved

SERIALIZED or S

CAS structures are dumped to disk as they are using Java serialization (CASSerializer). Because these structures are pre-allocated in memory at larger sizes than what is actually required, files in this format may be larger than necessary. However, the CAS addresses of feature structures are preserved in this format. When the data is loaded back into a CAS, it must have been initialized with the same type system as the original CAS.

must be the same

yes

SERIALIZED_TSI or S+

CAS structures are dumped to disk as they are using Java serialization as in form 0, but now using the CASCompleteSerializer which includes CAS metadata like type system and index repositories.

is reinitialized

yes

BINARY or 0

CAS structures are dumped to disk as they are using Java serialization (CASSerializer). This is basically the same as format S but includes a UIMA header and can be read using org.apache.uima.cas.impl.Serialization#deserializeCAS.

must be the same

yes

BINARY_TSI or 0

The same as BINARY, except that the type system and index configuration are also stored in the file. However, lenient loading or reinitializing the CAS with this information is presently not supported.

must be the same

yes

COMPRESSED or 4

UIMA binary serialization saving all feature structures (reachable or not). This format internally uses gzip compression and a binary representation of the CAS, making it much more efficient than format 0.

must be the same

yes

COMPRESSED_FILTERED or 6

UIMA binary serialization as format 4, but saving only reachable feature structures.

must be the same

no

6+

This is a legacy format specific to DKPro Core. Since UIMA 2.9.0, COMPRESSED_FILTERED_TSI is supported and should be used instead of this format. UIMA binary serialization as format 6, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.

lenient loading

no

COMPRESSED_FILTERED_TS

Same as COMPRESSED_FILTERED, but also contains the type system definition. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.

lenient loading

no

COMPRESSED_FILTERED_TSI

Default. UIMA binary serialization as format 6, but also contains the type system definition and index definitions. This allows the BinaryCasReader to load data leniently into a CAS that has been initialized with a different type system.

lenient loading

no

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

The file extension. If this is set to AUTO, then the extension will be chosen based on the default extension specified by the UIMA SerialFormat class. However, this only works when using the new long format names (e.g. COMPRESSED_FILTERED_TSI). When using the old short names (e.g. 6), the default extension .bin is used.

Type: String  — Default value: AUTO

format

Type: String  — Default value: COMPRESSED_FILTERED_TSI

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

typeSystemLocation

Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser.
The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz").
If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader currently cannot read such files. Use this only if you really know what you are doing.
This parameter has no effect if formats S+ or 6+ are used, as the type system information is embedded in each individual file. Otherwise, it is recommended that this parameter be set unless some other mechanism is used to initialize the CAS with the same type system and index repository during reading that was used during writing.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 78. Capabilities

Media types

application/x.org.dkpro.uima+binary

Inputs

SerializedCas

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.bincas-asl

SerializedCasReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasReader

Description
none specified
Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

typeSystemLocation

The file from which to obtain the type system if it is not embedded in the serialized CAS.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 79. Capabilities

Media types

none specified

Outputs

none specified

SerializedCasWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter

Description
none specified
Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Type: String  — Default value: .ser

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

typeSystemLocation

Location to write the type system to. The type system is saved using Java serialization; it is not saved as an XML type system description. We recommend using the name typesystem.ser.
The #PARAM_COMPRESSION parameter has no effect on the type system. Instead, whether the type system file is compressed is determined from the file name extension (e.g. ".gz").
If this parameter is set, the type system and index repository are no longer serialized into the same file as the rest of the CAS. The SerializedCasReader currently cannot read such files. Use this only if you really know what you are doing.

Optional — Type: String

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 80. Capabilities

Media types

none specified

Inputs

UIMA JSON

Json

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.json-asl

JsonWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.json.JsonWriter

Description

UIMA JSON format writer.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

jsonContextFormat

Type: String  — Default value: omitExpandedTypeNames

omitDefaultValues

Type: Boolean  — Default value: true

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

prettyPrint

Type: Boolean  — Default value: true

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

typeSystemFile

Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file.
If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 81. Capabilities

Media types

application/x.org.dkpro.uima+json

Inputs

UIMA XMI

Xmi

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.xmi-asl

One of the official formats supported by UIMA is the XMI format. It is an XML-based format and therefore cannot represent a few very specific characters that are invalid in XML, but it is able to capture all the information contained in the CAS. The XMI format is the de-facto standard for exchanging data in the UIMA world, and most UIMA-related tools support it.

The XMI format does not include type system information. It is therefore recommended to always configure the XmiWriter component to also write out the type system to a file.

If you wish to view annotated documents using the UIMA CAS Editor in Eclipse, you can, for example, set up your XmiWriter in the following way to write out XMIs and a type system file:

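import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;

// Write XMI files and the type system file to the current directory.
AnalysisEngineDescription xmiWriter =
  AnalysisEngineFactory.createEngineDescription(
      XmiWriter.class,
      XmiWriter.PARAM_TARGET_LOCATION, ".",
      XmiWriter.PARAM_TYPE_SYSTEM_FILE, "typesystem.xml");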

XmiReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiReader

Description

Reader for UIMA XMI files.

Parameters
addDocumentMetadata

Add DKPro Core metadata if it is not already present in the document.

Type: Boolean  — Default value: true

includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

The language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

lenient

In lenient mode, unknown types are ignored and do not cause an exception to be thrown.

Type: Boolean  — Default value: false

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1
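
The logging rule can be pictured with the following minimal sketch. It assumes that a document is logged whenever the running document count is a multiple of logFreq; the class and method names are hypothetical and the reader's internal logic may be organised differently.

```java
// Hypothetical helper, not a DKPro Core class.
final class ReadProgressLogging {
    /** Returns true if the document with the given 1-based count should be logged. */
    static boolean shouldLog(int documentCount, int logFreq) {
        // Zero or negative frequencies switch logging off entirely.
        if (logFreq <= 0) {
            return false;
        }
        // Assumption: log every logFreq-th document.
        return documentCount % logFreq == 0;
    }
}
```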

overrideDocumentMetadata

Generate new DKPro Core document metadata (i.e. title, ID, URI) for the document instead of retaining what is already present in the XMI file.

Type: Boolean  — Default value: false

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]
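
To make the prefix convention concrete, here is a small sketch that sorts patterns into includes and excludes and tests relative paths against them. It is not the reader's implementation: it approximates Ant semantics with the JDK's glob matcher, which behaves differently in some cases (for example, `**/*.xmi` does not match a file directly in the root), and it assumes that a pattern without a prefix counts as an include. The class name is hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a DKPro Core class.
final class PatternFilterSketch {
    private final List<PathMatcher> includes = new ArrayList<>();
    private final List<PathMatcher> excludes = new ArrayList<>();

    PatternFilterSketch(String... patterns) {
        for (String p : patterns) {
            if (p.startsWith("[-]")) {
                excludes.add(glob(p.substring(3)));
            } else {
                // Assumption: a pattern without prefix is treated as an include.
                includes.add(glob(p.startsWith("[+]") ? p.substring(3) : p));
            }
        }
    }

    private static PathMatcher glob(String pattern) {
        // JDK glob syntax, used here only as an approximation of Ant patterns.
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    /** A path is accepted if some include matches it and no exclude does. */
    boolean accepts(String relativePath) {
        Path path = Paths.get(relativePath);
        boolean included = includes.stream().anyMatch(m -> m.matches(path));
        boolean excluded = excludes.stream().anyMatch(m -> m.matches(path));
        return included && !excluded;
    }
}
```

For example, `new PatternFilterSketch("[+]**/*.xmi", "[-]**/tmp/**")` accepts `data/sub/a.xmi` but rejects `data/tmp/b.xmi`.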

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 82. Capabilities

Media types

application/vnd.xmi+xml application/x.org.dkpro.uima+xmi

Outputs

XmiWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter

Description

UIMA XMI format writer.

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameExtension

Specify the suffix of output files. Default value .xmi. If the suffix is not needed, provide an empty string as value.

Type: String  — Default value: .xmi

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

prettyPrint

Type: Boolean  — Default value: true

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false
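
The effect of stripExtension on the output name can be sketched as follows. This assumes that the writer appends filenameExtension to a name derived from the source document, so that the original extension is kept unless it is stripped; the helper is hypothetical and works on plain file names without directories.

```java
// Hypothetical helper, not a DKPro Core class.
final class TargetFileNameSketch {
    /** Derives an output file name from a plain source file name. */
    static String targetName(String sourceName, boolean stripExtension,
            String filenameExtension) {
        String base = sourceName;
        int dot = sourceName.lastIndexOf('.');
        // Only strip when there is a non-empty base name before the last dot.
        if (stripExtension && dot > 0) {
            base = sourceName.substring(0, dot);
        }
        return base + filenameExtension;
    }
}
```

Under these assumptions, `doc.txt` becomes `doc.txt.xmi` by default and `doc.xmi` when the original extension is stripped.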

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

typeSystemFile

Location to write the type system to. If this is not set, a file called typesystem.xml will be written to the XMI output path. If this is set, it is expected to be a file relative to the current work directory or an absolute file.
If this parameter is set, the #PARAM_COMPRESSION parameter has no effect on the type system. Instead, if the file name ends in ".gz", the file will be compressed, otherwise not.

Optional — Type: String
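
The rule that the type system file is compressed based on its name rather than on the compression parameter can be illustrated with a JDK-only sketch. The helper below is hypothetical and writes to memory instead of a file so that it stays self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not a DKPro Core class.
final class TypeSystemCompressionSketch {
    /** Encodes the content, gzip-compressing it only if the file name ends in ".gz". */
    static byte[] encode(String fileName, String content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = fileName.endsWith(".gz")
                ? new GZIPOutputStream(buffer) : buffer) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the gzip trailer has been written.
        return buffer.toByteArray();
    }
}
```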

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 83. Capabilities

Media types

application/vnd.xmi+xml application/x.org.dkpro.uima+xmi

Inputs

Web1T n-grams

Web1T

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.web1t-asl

The Web1T n-gram corpus is a huge collection of n-grams collected from the internet. The jweb1t library provides efficient access to this corpus. This module supports the file format used by the Web1T n-gram corpus and makes it convenient to create jweb1t indexes.

Web1TWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.web1t.Web1TWriter

Description

Web1T n-gram index format writer.

Parameters
contextType

The type being used for segments.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence

createIndexes

Create the indexes that jWeb1T needs to operate. (default: true)

Optional — Type: Boolean  — Default value: true

inputTypes

Types to generate n-grams from. Example: Token.class.getName() + "/pos/PosValue" for part-of-speech n-grams

Type: String[]

lowercase

Create a lower case index.

Optional — Type: Boolean  — Default value: false

maxNgramLength

Maximum n-gram length. Default: 3

Optional — Type: Integer  — Default value: 3

minFreq

Specifies the minimum frequency an n-gram must have to be written to the final index. The value is inclusive and defaults to 1, so all n-grams with a frequency of at least 1 are written.

Optional — Type: Integer  — Default value: 1
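
A minimal sketch of the inclusive frequency cut-off, using a hypothetical helper over an in-memory count table (the writer itself operates on files):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a DKPro Core class.
final class MinFreqFilterSketch {
    /** Keeps only n-grams whose frequency is at least minFreq (inclusive). */
    static Map<String, Integer> filter(Map<String, Integer> counts, int minFreq) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minFreq) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```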

minNgramLength

Minimum n-gram length. Default: 1

Optional — Type: Integer  — Default value: 1
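
How the minimum and maximum n-gram lengths bound the generated n-grams can be sketched as below. The helper is hypothetical; it takes the tokens of one context unit (for example a sentence) and joins them with spaces, and it assumes a minimum length of at least 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a DKPro Core class.
final class NgramSketch {
    /** Returns all n-grams with minN <= n <= maxN, shortest first. Requires minN >= 1. */
    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of n tokens over the context unit.
            for (int start = 0; start + n <= tokens.size(); start++) {
                result.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return result;
    }
}
```

For the tokens `a b c` with lengths 1 to 2, this yields `a`, `b`, `c`, `a b`, and `b c`.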

splitFileTreshold

The input files are split into smaller files for quick access. A separate file is created for a prefix if the first two letters (or the single letter, for words of length 1) account for at least x% of all starting letters in the input files. The default threshold is 1.0%. Words whose prefix does not reach the threshold are written together into a separate file for miscellaneous words. A high threshold leads to only a few, but large, files and most likely a very large miscellaneous file. A low threshold results in many small files. Use zero or a negative value to write everything to one file.

Optional — Type: Float  — Default value: 1.0
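
The threshold rule can be illustrated with a small sketch that decides which prefixes would receive their own file. The helper is hypothetical and simplified: it works on an in-memory word list and ignores details such as case handling.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a DKPro Core class.
final class SplitThresholdSketch {
    /** First two characters, or the whole word if it is shorter than that. */
    static String prefix(String word) {
        return word.length() < 2 ? word : word.substring(0, 2);
    }

    /** Prefixes whose share of all words reaches the threshold (in percent). */
    static Set<String> prefixesWithOwnFile(List<String> words, double thresholdPercent) {
        Set<String> own = new TreeSet<>();
        if (thresholdPercent <= 0 || words.isEmpty()) {
            return own; // zero or negative: everything goes to a single file
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(prefix(w), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double share = 100.0 * e.getValue() / words.size();
            if (share >= thresholdPercent) {
                own.add(e.getKey());
            }
        }
        return own;
    }
}
```

All words whose prefix is not in the returned set would end up in the miscellaneous file.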

targetEncoding

Character encoding of the output data.

Optional — Type: String  — Default value: UTF-8

targetLocation

Location to which the output is written.

Type: String

Table 84. Capabilities

Media types

text/x.org.dkpro.ngram

Inputs

Wikipedia via Bliki Engine

BlikiWikipedia

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.bliki-asl

Access the online Wikipedia and extract its contents using the Bliki engine.

BlikiWikipediaReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.bliki.BlikiWikipediaReader

Description

Bliki-based Wikipedia reader.

Parameters
language

The language of the wiki installation.

Type: String

outputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

pageTitles

Which page titles should be retrieved.

Type: String[]

sourceLocation

URL of the wiki API. For example, for the English Wikipedia it should be: http://en.wikipedia.org/w/api.php

Type: String

Table 85. Capabilities

Media types

none specified

Outputs

Wikipedia via JWPL

WikipediaArticle

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleReader

Description

Reads all article pages. A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects, disambiguation pages, and discussion pages are not considered, however.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Type: Boolean  — Default value: false

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String[]

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String[]

Password

The password of the database account.

Type: String

User

The username of the database account.

Type: String

Table 86. Capabilities

Media types

none specified

Outputs

none specified

WikipediaArticleInfo

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaArticleInfoReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaArticleInfoReader

Description

Reads all general article information without retrieving the whole Page objects.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

Password

The password of the database account.

Type: String

User

The username of the database account.

Type: String

Table 87. Capabilities

Media types

none specified

Outputs

WikipediaDiscussion

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaDiscussionReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaDiscussionReader

Description

Reads all discussion pages.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String[]

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String[]

Password

The password of the database account.

Type: String

User

The username of the database account.

Type: String

Table 88. Capabilities

Media types

none specified

Outputs

WikipediaLink

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaLinkReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaLinkReader

Description

Read links from Wikipedia.

Parameters
AllowedLinkTypes

Which types of links are allowed?

Type: String[]

CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String[]

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String[]

Password

The password of the database account.

Type: String

User

The username of the database account.

Type: String

Table 89. Capabilities

Media types

none specified

Outputs

WikipediaPage

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaPageReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaPageReader

Description

Reads all Wikipedia pages in the database (articles, discussions, etc.). A parameter controls whether the full article or only the first paragraph is set as the document text. Redirects and disambiguation pages are not considered, however.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Type: Boolean  — Default value: false

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String[]

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String[]

Password

The password of the database account.

Type: String

User

The username of the database account.

Type: String

Table 90. Capabilities

Media types

none specified

Outputs

WikipediaQuery

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaQueryReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaQueryReader

Description

Reads all article pages that match a query created by the numerous parameters of this class.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

MaxCategories

Maximum number of categories. Articles with a higher number of categories will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MaxInlinks

Maximum number of incoming links. Articles with a higher number of incoming links will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MaxOutlinks

Maximum number of outgoing links. Articles with a higher number of outgoing links will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MaxRedirects

Maximum number of redirects. Articles with a higher number of redirects will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MaxTokens

Maximum number of tokens. Articles with a higher number of tokens will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MinCategories

Minimum number of categories. Articles with a lower number of categories will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MinInlinks

Minimum number of incoming links. Articles with a lower number of incoming links will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MinOutlinks

Minimum number of outgoing links. Articles with a lower number of outgoing links will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MinRedirects

Minimum number of redirects. Articles with a lower number of redirects will not be returned by the query.

Optional — Type: Integer  — Default value: -1

MinTokens

Minimum number of tokens. Articles with a lower number of tokens will not be returned by the query.

Optional — Type: Integer  — Default value: -1
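
The minimum and maximum parameters above all follow the same pattern, which can be sketched as a simple range check. The sketch assumes that the default value of -1 means that the corresponding bound is not applied; this is an interpretation of the defaults, not something the parameter descriptions state explicitly. The helper name is hypothetical.

```java
// Hypothetical helper, not a DKPro Core or JWPL class.
final class RangeFilterSketch {
    /** Returns true if value lies within [min, max]; a negative bound is ignored. */
    static boolean inRange(int value, int min, int max) {
        // Assumption: negative values such as the default -1 disable the bound.
        boolean aboveMin = min < 0 || value >= min;
        boolean belowMax = max < 0 || value <= max;
        return aboveMin && belowMax;
    }
}
```

An article would be returned only if every configured property (categories, inlinks, outlinks, redirects, tokens) passes such a check.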

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Type: Boolean  — Default value: false

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

PageIdFromArray

Defines an array of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String[]

PageIdsFromFile

Defines the path to a file containing a line-separated list of page ids of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitleFromFile

Defines the path to a file containing a line-separated list of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String

PageTitlesFromArray

Defines an array of page titles of the pages that should be retrieved. (Optional)

Optional — Type: String[]

Password

The password of the database account.

Type: String

TitlePattern

SQL-style title pattern. Only articles that match the pattern will be returned by the query.

Optional — Type: String  — Default value: ``
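
For readers unfamiliar with SQL-style patterns, the following sketch translates the two standard LIKE wildcards (`%` for any sequence of characters, `_` for exactly one character) into a Java regular expression. It is only an illustration with a hypothetical helper: in the reader the pattern is evaluated by the database, so details such as case sensitivity and escaping depend on the database configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a DKPro Core or JWPL class.
final class TitlePatternSketch {
    /** Converts a SQL LIKE pattern into an equivalent regular expression. */
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");   // any sequence of characters
            } else if (c == '_') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String like, String title) {
        return likeToRegex(like).matcher(title).matches();
    }
}
```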

User

The username of the database account.

Type: String

Table 91. Capabilities

Media types

none specified

Outputs

none specified

WikipediaRevision

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionReader

Description

Reads Wikipedia page revisions.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

Password

The password of the database account.

Type: String

RevisionIdFromArray

Defines an array of revision ids of the revisions that should be retrieved. (Optional)

Optional — Type: String[]

RevisionIdsFromFile

Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)

Optional — Type: String

User

The username of the database account.

Type: String

Table 92. Capabilities

Media types

none specified

Outputs

WikipediaRevisionPair

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaRevisionPairReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaRevisionPairReader

Description

Reads pairs of adjacent revisions of all articles.

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

Host

The host server.

Type: String

Language

The language of the Wikipedia that should be connected to.

Type: String

MaxChange

Restrict revision pairs to cases where the lengths of the two revisions do not differ by more than this value (counted in characters).

Type: Integer  — Default value: 10000

MinChange

Restrict revision pairs to cases where the lengths of the two revisions differ by more than this value (counted in characters).

Type: Integer  — Default value: 0
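
Read literally, the two change parameters define a window on the absolute length difference between the revisions of a pair. The sketch below follows that literal reading (strictly more than MinChange, at most MaxChange); the reader's actual comparison may treat the boundaries differently, and the helper name is hypothetical.

```java
// Hypothetical helper, not a DKPro Core or JWPL class.
final class RevisionPairFilterSketch {
    /** Accepts a pair if minChange < |length difference| <= maxChange. */
    static boolean accept(String before, String after, int minChange, int maxChange) {
        int change = Math.abs(after.length() - before.length());
        return change > minChange && change <= maxChange;
    }
}
```

With the defaults (0 and 10000), this reading would skip pairs whose revisions have exactly the same length.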

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

Password

The password of the database account.

Type: String

RevisionIdFromArray

Defines an array of revision ids of the revisions that should be retrieved. (Optional)

Optional — Type: String[]

RevisionIdsFromFile

Defines the path to a file containing a line-separated list of revision ids of the revisions that should be retrieved. (Optional)

Optional — Type: String

SkipFirstNPairs

The number of revision pairs that should be skipped in the beginning.

Optional — Type: Integer

User

The username of the database account.

Type: String

Table 93. Capabilities

Media types

none specified

Outputs

WikipediaTemplateFilteredArticle

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.jwpl-asl

WikipediaTemplateFilteredArticleReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.jwpl.WikipediaTemplateFilteredArticleReader

Description

Reads all pages that contain or do not contain the templates specified in the template whitelist and template blacklist.

It is possible to define just a whitelist OR a blacklist. If both a whitelist and a blacklist are provided, the articles chosen are those that DO contain the templates from the whitelist and at the same time DO NOT contain the templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").

This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.

NOTE: This reader directly extends the WikipediaReaderBase and not the WikipediaStandardReaderBase

Parameters
CreateDBAnno

Sets whether the database configuration should be stored in the CAS, so that annotators down the pipeline can access additional data.

Type: Boolean  — Default value: false

Database

The name of the database.

Type: String

DoubleCheckAssociatedPages

If this option is set, discussion pages are rejected that are associated with a blacklisted article. Analogously, articles are rejected that are associated with a blacklisted discussion page.

This check is rather expensive and could take a long time. This option is not active if only a whitelist is used.

Default Value: false

Type: Boolean  — Default value: false

ExactTemplateMatching

Defines whether to match the templates exactly or whether to match all templates that start with the String given in the respective parameter list.

Default Value: true

Type: Boolean  — Default value: true

Host

The host server.

Type: String

IncludeDiscussions

Whether the reader should also include talk pages.

Type: Boolean  — Default value: true

Language

The language of the Wikipedia that should be connected to.

Type: String

LimitNUmberOfArticlesToRead

Optional parameter that defines the maximum number of articles that should be delivered by the reader.

This avoids unnecessary filtering if only a small number of articles is needed.

Optional — Type: Integer

OnlyFirstParagraph

If set to true, only the first paragraph instead of the whole article is used.

Type: Boolean  — Default value: false

OutputPlainText

Whether the reader outputs plain text or wiki markup.

Type: Boolean  — Default value: true

PageBuffer

The page buffer size (#pages) of the page iterator.

Type: Integer  — Default value: 1000

Password

The password of the database account.

Type: String

TemplateBlacklist

Defines templates that the articles MUST NOT contain.

If you also define a whitelist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)

Optional — Type: String[]

TemplateWhitelist

Defines templates that the articles MUST contain.

If you also define a blacklist, the intersection of both sets is used. (= pages that DO contain templates from the whitelist, but DO NOT contain templates from the blacklist)

Optional — Type: String[]
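
The combination of whitelist, blacklist, and ExactTemplateMatching can be sketched with plain set logic. The helper below is hypothetical and works on an in-memory set of template names per page, whereas the reader queries the JWPL template tables. It assumes that a page satisfies a list as soon as one of its templates matches an entry.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a DKPro Core or JWPL class.
final class TemplateFilterSketch {
    /** True if any template of the page matches any wanted name (exactly or by prefix). */
    static boolean matchesAny(Set<String> pageTemplates, List<String> wanted, boolean exact) {
        for (String template : pageTemplates) {
            for (String w : wanted) {
                if (exact ? template.equals(w) : template.startsWith(w)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** An empty list means that the corresponding restriction is not applied. */
    static boolean accept(Set<String> pageTemplates, List<String> whitelist,
            List<String> blacklist, boolean exact) {
        boolean whiteOk = whitelist.isEmpty() || matchesAny(pageTemplates, whitelist, exact);
        boolean blackOk = blacklist.isEmpty() || !matchesAny(pageTemplates, blacklist, exact);
        return whiteOk && blackOk;
    }
}
```

With exact matching disabled, a whitelist entry such as `Infobox` would also select pages that use a template named `Infobox person`.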

User

The username of the database account.

Type: String

Table 94. Capabilities

Media types

none specified

Outputs

XCES-XML

XcesBasicXml

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-xces-asl

XcesBasicXmlReader

Implementation

org.dkpro.core.io.xces.XcesBasicXmlReader

Description

none specified

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

The language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 95. Capabilities

Media types

none specified

Outputs

XcesBasicXmlWriter

Implementation

org.dkpro.core.io.xces.XcesBasicXmlWriter

Description

none specified

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

filenameSuffix

Type: String  — Default value: .xml

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 96. Capabilities

Media types

none specified

Inputs

XcesXml

Group ID

org.dkpro.core

Artifact ID

dkpro-core-io-xces-asl

XcesXmlReader

Implementation

org.dkpro.core.io.xces.XcesXmlReader

Description

none specified

Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

The language of the documents in the input directory. If specified, this information will be added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 97. Capabilities

Media types

none specified

Outputs

XcesXmlWriter

Implementation

org.dkpro.core.io.xces.XcesXmlWriter

Description

none specified

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true
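To show what URL-encoding does to a document ID that contains characters which are illegal in file names, here is a minimal sketch using only the JDK. The helper name is hypothetical, and whether the writer uses exactly this encoder is an assumption; the sketch only illustrates the effect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of DKPro Core: derives a file name from a
// document ID by URL-encoding it and appending a suffix.
final class DocumentIdEscapingSketch {
    static String fileNameFor(String documentId, String suffix) {
        try {
            // ':' becomes %3A and '\' becomes %5C, so the result is a legal file name.
            return URLEncoder.encode(documentId, "UTF-8") + suffix;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, the document ID corpus:doc\1 and the suffix .xml yield the file name corpus%3Adoc%5C1.xml.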

filenameSuffix

Type: String  — Default value: .xml

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat the target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful to force multiple input files to be written to a single target file. The latter does not work for all formats (e.g. binary, XMI), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The compression parameter is respected, but no extension is added automatically. The stripExtension parameter has no effect because the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetEncoding

Character encoding of the output data.

Type: String  — Default value: UTF-8

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 98. Capabilities

Media types

none specified

Inputs

XML

InlineXml

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.xml-asl

InlineXmlWriter

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.InlineXmlWriter

Description

Writes an approximation of the content of a textual CAS as an inline XML file. Optionally applies an XSLT stylesheet.

Note this component inherits the restrictions from CasToInlineXml:

  • Features whose values are FeatureStructures are not represented.
  • Feature values which are strings longer than 64 characters are truncated.
  • Feature values which are arrays of primitives are represented by strings that look like [ xxx, xxx ]
  • The Subject of analysis is presumed to be a text string.
  • Some characters in the document's Subject-of-analysis are replaced by blanks, because the characters aren't valid in XML documents.
  • It doesn't work for overlapping annotations, because these cannot be represented as properly nested XML.
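The last restriction concerns annotations that overlap without one containing the other. A minimal sketch of such a check follows; the class and method names are hypothetical, and spans are assumed to be half-open character offsets (begin inclusive, end exclusive).

```java
// Hypothetical helper, not part of DKPro Core or UIMA: detects two spans
// that cannot both be written as properly nested XML elements.
final class NestingCheckSketch {
    // True if the spans overlap and neither span contains the other.
    static boolean crosses(int beginA, int endA, int beginB, int endB) {
        boolean overlap = beginA < endB && beginB < endA;
        boolean aContainsB = beginA <= beginB && endB <= endA;
        boolean bContainsA = beginB <= beginA && endA <= endB;
        return overlap && !aContainsB && !bContainsA;
    }
}
```

Under these assumptions, the spans 0-10 and 5-15 cross and would be problematic for inline XML, while 0-10 and 2-5 nest cleanly and 0-5 and 5-10 merely touch.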
Parameters
Xslt

XSLT stylesheet to apply.

Optional — Type: String
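As an illustration of what applying an XSLT stylesheet to inline XML can look like, the following sketch uses the JDK's javax.xml.transform API. It is not the writer's implementation: the class name, the sample stylesheet, and the element name Token are assumptions made for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of DKPro Core: applies a stylesheet to an XML string.
final class InlineXmlXsltSketch {
    // Sample stylesheet (an assumption for this example): prints the text
    // of every Token element on its own line.
    static final String TOKENS_PER_LINE_XSLT =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:for-each select=\"//Token\">"
            + "<xsl:value-of select=\".\"/><xsl:text>&#10;</xsl:text>"
            + "</xsl:for-each>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static String transform(String xml, String xslt) {
        try {
            // Compile the stylesheet, then run it over the XML input.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Stylesheet could not be applied", e);
        }
    }
}
```

Calling transform with an input such as <Document><Token>Hello</Token> <Token>world</Token></Document> and the sample stylesheet produces one token per line.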

compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

singularTarget

Treat the target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful to force multiple input files to be written to a single target file. The latter does not work for all formats (e.g. binary, XMI), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The compression parameter is respected, but no extension is added automatically. The stripExtension parameter has no effect because the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as the file name even if relative path information is present.

Type: Boolean  — Default value: false

Table 99. Capabilities

Media types

application/xml text/xml

Inputs

Xml

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlReader

Description

Reader for XML files.

Parameters
DocIdTag

Tag which contains the docId.

Optional — Type: String

ExcludeTag

Optional. Tags which should not be worked on. No text is extracted from them and no annotations are produced for them.

Type: String[]  — Default value: []

IncludeTag

Optional. Tags which should be worked on. If empty, all tags except those listed in ExcludeTag are worked on.

Type: String[]  — Default value: []

collectionId

The collection ID to set in the DocumentMetaData.

Optional — Type: String

language

Set this as the language of the produced documents.

Optional — Type: String

sourceLocation

Location from which the input is read.

Type: String

Table 100. Capabilities

Media types

application/xml text/xml

Outputs

XmlText

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlTextReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlTextReader

Description
No description available.
Parameters
includeHidden

Include hidden files and directories.

Type: Boolean  — Default value: false

language

Language of the documents in the input directory. If specified, this information is added to the CAS.

Optional — Type: String

logFreq

The frequency with which read documents are logged. Default: 1 (log every document).

Set to 0 or negative values to deactivate logging.

Type: Integer  — Default value: 1

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Optional — Type: String[]

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

Table 101. Capabilities

Media types

application/xml text/xml

Outputs

XmlXPath

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.io.xml-asl

XmlXPathReader

Implementation

de.tudarmstadt.ukp.dkpro.core.io.xml.XmlXPathReader

Description

A component reader for XML files implemented with XPath.

This is currently optimized for the TREC format, that is, the style in which TREC topics are presented. Provide an XPath expression that selects the parent nodes; the child nodes of each parent node are stored in a separate CAS.

If your expression evaluates to leaf nodes, empty CASes will be created.
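The relationship between the XPath expression, the parent nodes, and their child nodes can be sketched with the JDK's javax.xml.xpath API. This is not the reader's implementation: the class name, the sample element names (topics, top, num, title), and the choice to return child texts as lists instead of creating CASes are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of DKPro Core: for every node matched by the
// expression, collects the text of its child elements. Each inner list stands
// in for the content of one CAS.
final class RootXPathSketch {
    static List<List<String>> childTextsPerMatch(String xml, String rootXPath) {
        try {
            NodeList parents = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(rootXPath, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < parents.getLength(); i++) {
                List<String> texts = new ArrayList<>();
                NodeList children = parents.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Only element children count; a leaf node has none.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        texts.add(child.getTextContent().trim());
                    }
                }
                result.add(texts);
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + rootXPath, e);
        }
    }
}
```

With a TREC-like input such as <topics><top><num>1</num><title>cats</title></top></topics>, the expression /topics/top yields one entry containing the texts of num and title, whereas /topics/top/num selects a leaf node and yields an empty entry, mirroring the empty CASes mentioned above.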

Parameters
caseSensitive

States whether matching is case-sensitive. (default: true)

Optional — Type: Boolean  — Default value: true

docIdTag

Tag which contains the docId. If it is given, it is ensured that each document contains exactly one ID tag and that this tag is not empty.

Optional — Type: String

excludeTags

Tags which should be ignored. If empty then all tags will be processed.

If this and PARAM_INCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.

Type: String[]  — Default value: []

includeTags

Tags which should be worked on. If empty then all tags will be processed.

If this and PARAM_EXCLUDE_TAGS are both provided, tags in set PARAM_INCLUDE_TAGS - PARAM_EXCLUDE_TAGS will be processed.

Type: String[]  — Default value: []
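The interplay of includeTags and excludeTags described above amounts to a set difference, with an empty include set meaning "all tags". A minimal sketch follows; the class and method names are hypothetical, and case handling is left out on purpose.

```java
import java.util.Set;

// Hypothetical helper, not part of DKPro Core: decides whether a tag is
// processed given the include and exclude sets.
final class TagFilterSketch {
    static boolean isProcessed(String tag, Set<String> includeTags, Set<String> excludeTags) {
        // An empty include set means that every tag is a candidate.
        boolean included = includeTags.isEmpty() || includeTags.contains(tag);
        // Exclusion always wins, which yields includeTags minus excludeTags.
        return included && !excludeTags.contains(tag);
    }
}
```

For instance, with includeTags { "title", "desc" } and excludeTags { "desc" }, only title is processed; with both sets empty, every tag is processed.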

language

Language of the documents. If given, it will be set in each CAS.

Optional — Type: String

patterns

A set of Ant-like include/exclude patterns. A pattern starts with #INCLUDE_PREFIX [+] if it is an include pattern and with #EXCLUDE_PREFIX [-] if it is an exclude pattern. The wildcard /**/ can be used to address any number of sub-directories. The wildcard * can be used to address a part of a name.

Type: String[]

rootXPath

Specifies the XPath expression that selects all nodes to be processed. Different segments are separated via PARAM_ID_TAG, and each segment is stored in a separate CAS.

Type: String

sourceLocation

Location from which the input is read.

Optional — Type: String

useDefaultExcludes

Use the default excludes.

Type: Boolean  — Default value: true

workingDir

Specifies substitutions for tag names in the CAS.

Give the substitutions in before-after order. For example, to substitute "foo" with "bar" and "hey" with "ho", provide { "foo", "bar", "hey", "ho" }.

Optional — Type: String[]
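The before-after ordering of the flat array can be made explicit by converting it into a map. The sketch below is only an illustration: the class and method names are hypothetical, and rejecting an array of odd length is an assumption about sensible behavior, not documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DKPro Core: turns a flat array of
// before-after pairs into a lookup map.
final class TagSubstitutionSketch {
    static Map<String, String> toMap(String... pairs) {
        // Assumption: an unpaired trailing entry is treated as a configuration error.
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Substitutions must come in before-after pairs");
        }
        Map<String, String> substitutions = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            substitutions.put(pairs[i], pairs[i + 1]);
        }
        return substitutions;
    }
}
```

With the example from the description, toMap("foo", "bar", "hey", "ho") maps foo to bar and hey to ho.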

Table 102. Capabilities

Media types

application/xml text/xml

Outputs