The document provides detailed information about the DKPro Core UIMA components.

Overview

Analysis components

Table 1. Analysis Components (94)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ArktweetTokenizer

ArkTweet tokenizer.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

BerkeleyParser

Berkeley Parser annotator.

BreakIteratorSegmenter

BreakIterator segmenter.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

CapitalizationNormalizer

Takes a text and replaces incorrect capitalization.

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

ClearNlpParser

Clear parser annotator.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

ClearNlpSegmenter

Tokenizer using Clear NLP.

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

CompoundAnnotator

Annotates compound parts and linking morphemes.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

HyphenationRemover

Simple dictionary-based hyphenation remover.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

JTokSegmenter

JTok segmenter.

JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LanguageIdentifier

Detection based on character n-grams.

LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file.

MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MaltParser

Dependency parsing using MaltParser.

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

MateParser

DKPro Annotator for the MateToolsParser.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

MstParser

Dependency parsing using MSTParser.

NGramAnnotator

N-gram annotator.

NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

OpenNlpChunker

Chunk annotator using OpenNLP.

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

OpenNlpParser

OpenNLP parser.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.

RfTagger

RFTagger morphological analyzer.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

SfstAnnotator

SFST morphological analyzer.

SharpSNormalizer

Takes a text and replaces sharp s.

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordCoreferenceResolver

No description

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

StanfordLemmatizer

Stanford Lemmatizer component.

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

StanfordParser

Stanford Parser component.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

StanfordSegmenter

No description

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TfidfConsumer

This consumer builds a DfModel.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

TokenMerger

Merges any Tokens that are covered by a given annotation type.

TokenTrimmer

Remove prefixes and suffixes from tokens.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

TreeTaggerChunker

Chunk annotator using TreeTagger.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

Checker

Table 2. Analysis Components in group Checker (2)
Component Description

JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

JazzyChecker

Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

Parameters
ScoreThreshold (Integer) = 1

Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within an edit distance of 1 from the original word.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String)

Location from which the model is read. The model file is a simple word-list with one word per line.

Inputs and outputs

Inputs

Outputs
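
To illustrate how these parameters fit together, here is a minimal uimaFIT sketch; the word-list path is a hypothetical placeholder, and the parameters are set via the names documented above:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.anomaly.type.SpellingAnomaly;
import de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class JazzyCheckerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("This sentnce contains a speling error.");

        runPipeline(jcas,
                // the checker operates on tokens, so segment first
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(JazzyChecker.class,
                        "modelLocation", "dict/en.txt", // placeholder: word list, one word per line
                        "ScoreThreshold", 2));

        // misspelled words are marked as SpellingAnomaly annotations
        for (SpellingAnomaly anomaly : JCasUtil.select(jcas, SpellingAnomaly.class)) {
            System.out.println(anomaly.getCoveredText());
        }
    }
}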

LanguageToolChecker

Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

Inputs and outputs

Inputs

none specified

Outputs

Chunker

Table 3. Analysis Components in group Chunker (2)
Component Description

OpenNlpChunker

Chunk annotator using OpenNLP.

TreeTaggerChunker

Chunk annotator using TreeTagger.

OpenNlpChunker

Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker

Chunk annotator using OpenNLP.

Parameters
ChunkMappingLocation (String) [optional]

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default 20100908.1
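
As an illustration of the typical setup, here is a minimal uimaFIT sketch; the chunker expects tokens, sentences, and POS tags, so it runs after a segmenter and a POS tagger (other segmenters and taggers would work equally well):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.chunk.Chunk;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class ChunkerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The quick brown fox jumps over the lazy dog.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),  // tokens + sentences
                createEngineDescription(OpenNlpPosTagger.class),  // POS tags
                createEngineDescription(OpenNlpChunker.class));   // chunks

        for (Chunk chunk : JCasUtil.select(jcas, Chunk.class)) {
            System.out.println(chunk.getChunkValue() + ": " + chunk.getCoveredText());
        }
    }
}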

TreeTaggerChunker

Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker

Chunk annotator using TreeTagger.

Parameters
ChunkMappingLocation (String) [optional]

Location of the mapping file for chunk tags to UIMA types.

executablePath (String) [optional]

Use this TreeTagger executable instead of trying to locate the executable automatically.

flushSequence (String) [optional]

A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n....

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

performanceMode (Boolean) = false

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de le 20110429.1
en iso8859-le 20090824.1
en le 20140520.1
fr le 20141218.2

Coreference resolver

Table 4. Analysis Components in group Coreference resolver (1)
Component Description

StanfordCoreferenceResolver

No description

StanfordCoreferenceResolver

Role: Coreference resolver
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordCoreferenceResolver

No description

Parameters
maxDist (Integer) = -1

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

postprocessing (Boolean) = false

DCoRef parameter: Do post processing

score (Boolean) = false

DCoRef parameter: Scoring the output of the system

sieves (String) = MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

singleton (Boolean) = true

DCoRef parameter: setting singleton predictor

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default ${core.version}.1

Language Identifier

Table 5. Analysis Components in group Language Identifier (3)
Component Description

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LanguageIdentifier

Detection based on character n-grams.

LangDetectLanguageIdentifier

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.langdetect-asl
Class: de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

Parameters
modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

Models
Language Variant Version
any socialmedia 20141013.1
any wikipedia 20141013.1
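
A minimal usage sketch, assuming the detected language is stored as the CAS document language:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier;

public class LanguageIdExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("Der schnelle braune Fuchs springt über den faulen Hund.");

        runPipeline(jcas,
                createEngineDescription(LangDetectLanguageIdentifier.class,
                        "modelVariant", "wikipedia")); // one of the variants listed above

        // the detected language is written back to the CAS
        System.out.println(jcas.getDocumentLanguage()); // expected: "de"
    }
}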

LanguageDetectorWeb1T

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ldweb1t-asl
Class: de.tudarmstadt.ukp.dkpro.core.ldweb1t.LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

Parameters
maxNGramSize (Integer) = 3

The maximum n-gram size that should be considered. Default is 3.

minNGramSize (Integer) = 1

The minimum n-gram size that should be considered. Default is 1.

LanguageIdentifier

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textcat-asl
Class: de.tudarmstadt.ukp.dkpro.core.textcat.LanguageIdentifier

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References:

  • Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

Lemmatizer

Table 6. Analysis Components in group Lemmatizer (6)
Component Description

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

StanfordLemmatizer

Stanford Lemmatizer component.

ClearNlpLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpLemmatizer

Lemmatizer using Clear NLP.

Parameters
language (String) = en [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default 20130715.0

GateLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.GateLemmatizer

Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

LanguageToolLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.

Parameters
sanitize (Boolean) = true
sanitizeChars (String[]) = [(, ), [, ]]
Inputs and outputs

Inputs

Outputs
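
A minimal uimaFIT sketch, assuming tokens and sentences are the only required inputs; the resulting lemmas are read back from the Lemma annotations:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma;
import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer;
import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter;

public class LemmatizerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The children were running home.");

        runPipeline(jcas,
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(LanguageToolLemmatizer.class));

        for (Lemma lemma : JCasUtil.select(jcas, Lemma.class)) {
            System.out.println(lemma.getCoveredText() + " -> " + lemma.getValue());
        }
    }
}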

MateLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

uppercase (Boolean) = false

Try reconstructing proper casing for lemmata. This is useful for German, but for English, for example, it creates odd results.

variant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de tiger 20121024.1
en conll2009 20130117.1
es conll2009 20130117.1
fr ftb 20130918.0

MorphaLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.morpha-asl
Class: de.tudarmstadt.ukp.dkpro.core.morpha.MorphaLemmatizer

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References:

  • Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.
Parameters
readPOS (Boolean) = false

Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.

Inputs and outputs

Inputs

Outputs

StanfordLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordLemmatizer

Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for ENGLISH.

Parameters
ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Morphological analyzer

Table 7. Analysis Components in group Morphological analyzer (2)
Component Description

RfTagger

RFTagger morphological analyzer.

SfstAnnotator

SFST morphological analyzer.

RfTagger

Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.rftagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.rftagger.RfTagger

RFTagger morphological analyzer.

Parameters
MorphMappingLocation (String) [optional]
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelEncoding (String) [optional]

The character encoding used by the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
cz cac 20150728.1
de tiger 20150928.1
hu szeged 20150728.1
ru ric 20150728.1
sk snk 20150728.1
sl jos 20150728.1

SfstAnnotator

Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.sfst-gpl
Class: de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator

SFST morphological analyzer.

Parameters
MorphMappingLocation (String) [optional]
language (String) [optional]

Use this language instead of the document language to resolve the model.

mode (String) = FIRST
modelEncoding (String) = UTF-8

Specifies the model encoding.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

writeLemma (Boolean) = true

Write lemma information. Default: true

writePOS (Boolean) = true

Write part-of-speech information. Default: true

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de morphisto-ca 20110202.1
de smor-ca 20140801.1
de zmorge-newlemma-ca 20140521.1
de zmorge-orig-ca 20140521.1
it pippi-ca 20090223.1
tr trmorph-ca 20130219.1

Named Entity Recognizer

Table 8. Analysis Components in group Named Entity Recognizer (2)
Component Description

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

OpenNlpNamedEntityRecognizer

Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

Parameters
NamedEntityMappingLocation (String) [optional]

Location of the mapping file for named entity tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) = person

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en date 20100907.0
en location 20100907.0
en money 20100907.0
en organization 20100907.0
en percentage 20100907.0
en person 20130624.1
en time 20100907.0
es location 20100908.0
es misc 20100908.0
es organization 20100908.0
es person 20100908.0
nl location 20100908.0
nl misc 20100908.0
nl organization 20100908.0
nl person 20100908.0
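
A minimal sketch showing how modelVariant selects the entity type to detect (here the default, person):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class NerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("John Smith visited Berlin in May.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // modelVariant selects the entity type model; "person" is the default
                createEngineDescription(OpenNlpNamedEntityRecognizer.class,
                        "modelVariant", "person"));

        for (NamedEntity ne : JCasUtil.select(jcas, NamedEntity.class)) {
            System.out.println(ne.getValue() + ": " + ne.getCoveredText());
        }
    }
}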

StanfordNamedEntityRecognizer

Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

Parameters
NamedEntityMappingLocation (String) [optional]

Location of the mapping file for named entity tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de dewac_175m_600.crf 20150130.1
de hgc_175m_600.crf 20150130.1
en all.3class.caseless.distsim.crf 20160110.0
en all.3class.distsim.crf 20150420.1
en all.3class.nodistsim.crf 20160110.1
en conll.4class.caseless.distsim.crf 20160110.0
en conll.4class.distsim.crf 20150420.1
en conll.4class.nodistsim.crf 20160110.1
en muc.7class.caseless.distsim.crf 20150129.0
en muc.7class.distsim.crf 20150129.1
en muc.7class.nodistsim.crf 20160110.1
en nowiki.3class.caseless.distsim.crf 20160110.0
en nowiki.3class.nodistsim.crf 20160110.0
es ancora.distsim.s512.crf 20140826.1

Parser

Table 9. Analysis Components in group Parser (7)
Component Description

BerkeleyParser

Berkeley Parser annotator.

ClearNlpParser

Clear parser annotator.

MaltParser

Dependency parsing using MaltParser.

MateParser

DKPro Annotator for the MateToolsParser.

MstParser

Dependency parsing using MSTParser.

OpenNlpParser

OpenNLP parser.

StanfordParser

Stanford Parser component.

BerkeleyParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.berkeleyparser-gpl
Class: de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser

Berkeley Parser annotator. Requires Sentences to be annotated before.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

accurate (Boolean) = false

Set thresholds for accuracy.

Default: false (set thresholds for efficiency)

binarize (Boolean) = false

Output binarized trees.

Default: false

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

keepFunctionLabels (Boolean) = false

Retain predicted function labels. Model must have been trained with function labels.

Default: false

language (String) [optional]

Use this language instead of the language set in the CAS to locate the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

readPOS (Boolean) = true

Sets whether to use existing POS tags from another annotator for the parsing process.

Default: true

scores (Boolean) = false

Output inside scores (only for binarized Viterbi trees).

Default: false

substates (Boolean) = false

Output sub-categories (only for binarized Viterbi trees).

Default: false

variational (Boolean) = false

Use variational rule score approximation instead of max-rule.

Default: false

viterbi (Boolean) = false

Compute Viterbi derivation instead of max-rule tree.

Default: false (max-rule)

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar sm5 20090917.1
bg sm5 20090917.1
de sm5 20090917.1
en sm6 20100819.1
fr sm5 20090917.1
zh sm5 20090917.1

ClearNlpParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpParser

Clear parser annotator.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

Inputs and outputs

Inputs

Outputs

MaltParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.maltparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser

Dependency parsing using MaltParser.

Required annotations:

  • Token
  • Sentence
  • POS
Generated annotations:
  • Dependency (annotated over sentence-span)
Parameters
ignoreMissingFeatures (Boolean) = false

Process anyway, even if the model relies on features that are not supported by this component. Default: false

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
bn linear 20120905.1
en linear 20120312.1
en poly 20120312.1
es linear 20130220.0
fa linear 20130522.1
fr linear 20120312.1
pl linear 20120904.1
sv linear 20120925.2
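
A minimal sketch of a dependency-parsing pipeline; the upstream components are interchangeable as long as they produce the Token, Sentence, and POS annotations listed above:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency;
import de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DependencyParsingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The dog chased the cat.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),  // Token, Sentence
                createEngineDescription(OpenNlpPosTagger.class),  // POS
                createEngineDescription(MaltParser.class));       // Dependency

        for (Dependency dep : JCasUtil.select(jcas, Dependency.class)) {
            System.out.println(dep.getDependencyType() + "("
                    + dep.getGovernor().getCoveredText() + ", "
                    + dep.getDependent().getCoveredText() + ")");
        }
    }
}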

MateParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateParser

DKPro Annotator for the MateToolsParser.

Please cite the following paper if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameters
DependencyMappingLocation (String) [optional]

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de tiger 20121024.1
en conll2009 20130117.2
es conll2009 20130117.1
fr ftb 20130918.0
zh conll2009 20130117.1

MstParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mstparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found on the MSTParser website.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CoNLL 2006 format are not generated (cf. mstparser.DependencyInstance).

Parameters
DependencyMappingLocation (String) [optional]

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

order (Integer) [optional]

Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en eisner 20100416.2
en sample 20121019.2
hr mte5.defnpout 20130527.1
hr mte5.pos 20130527.1

OpenNlpParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpParser

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotation if explicitly requested via #PARAM_WRITE_POS.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Default: false

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en chunking 20120616.1

StanfordParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser

Stanford Parser component.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

annotationTypeToParse (String) [optional]

This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing.

If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations.

Default: null

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

maxItems (Integer) = 200000

Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser.

Default: 200000

maxSentenceLength (Integer) = 130

Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions.

Default: 130

mode (String) = TREE [optional]

Sets the kind of dependencies being created.

Default: DependenciesMode#TREE

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

readPOS (Boolean) = true

Sets whether to use existing POS tags from another annotator for the parsing process.

Default: true

writeConstituent (Boolean) = true

Sets whether to create constituent tags. This is required for POS tagging and lemmatization.

Default: true

writeDependency (Boolean) = true

Sets whether to create dependency annotations.

Default: true

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar factored 20150129.1
ar sr 20141031.1
de factored 20150129.1
de pcfg 20150129.1
de sr 20141031.1
en factored 20150129.1
en pcfg 20150129.1
en pcfg.caseless 20160110.1
en rnn 20140104.1
en sr 20141031.1
en sr-beam 20141031.1
en wsj-factored 20150129.1
en wsj-pcfg 20150129.1
en wsj-rnn 20140104.1
es pcfg 20150108.1
es sr 20141023.1
es sr-beam 20141023.1
fr factored 20150129.1
fr sr 20160114.1
fr sr-beam 20141023.1
zh factored 20150129.1
zh pcfg 20150129.1
zh sr 20141023.1
zh xinhua-factored 20150129.1
zh xinhua-pcfg 20150129.1
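
A minimal sketch requesting PennTree annotations; readPOS is disabled here so the parser assigns POS tags itself:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.PennTree;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter;

public class ConstituencyParsingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The dog chased the cat.");

        runPipeline(jcas,
                createEngineDescription(StanfordSegmenter.class),
                createEngineDescription(StanfordParser.class,
                        "readPOS", false,        // let the parser tag internally
                        "writePennTree", true,   // one PennTree annotation per sentence
                        "writePOS", true));      // also create POS tags from the parse

        for (PennTree tree : JCasUtil.select(jcas, PennTree.class)) {
            System.out.println(tree.getPennTree());
        }
    }
}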

Part-of-speech tagger

Table 10. Analysis Components in group Part-of-speech tagger (10)
Component Description

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

ArktweetPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters In Proceedings of NAACL 2013.

Parameters
POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

Models
Language Variant Version
en default 20120919.1
en irc 20121211.1
en ritter 20130723.1

ClearNlpPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

dictLocation (String) [optional]

Load the dictionary from this location instead of locating the dictionary automatically.

dictVariant (String) [optional]

Override the default variant used to locate the dictionary.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the pos-tagging model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the pos-tagging model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Inputs and outputs

Inputs

Outputs

HepplePosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.HepplePosTagger

GATE Hepple part-of-speech tagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

lexiconLocation (String) [optional]

Load the lexicon from this location instead of locating it automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

rulesetLocation (String) [optional]

Load the ruleset from this location instead of locating it automatically.

Inputs and outputs

Inputs

Outputs

HunPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.hunpos-asl
Class: de.tudarmstadt.ukp.dkpro.core.hunpos.HunPosTagger

Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
cs pdt 20121123.2
da ddt 20121123.2
de tiger 20121123.2
en wsj 20070724.2
fa upc 20140414.0
hr mte5.defnpout 20130509.2
hu szeged_kr 20070724.2
pt bosque 20121123.2
pt mm 20130119.2
pt tbchp 20110419.2
ru rdt 20121123.2
sl jos 20121123.2
sv paroletags 20100215.2
sv suctags 20100927.2

MateMorphTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

MatePosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.1

es

conll2009

20130117.1

fr

ftb

20130918.0

zh

conll2009

20130117.1

MeCabTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mecab-asl
Class: de.tudarmstadt.ukp.dkpro.core.mecab.MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

Models
Language Variant Version
jp bin-linux-x86_32 .
jp bin-linux-x86_64 .
jp bin-osx-x86_64 .
jp ipadic .

OpenNlpPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
da maxent 20120616.1
da perceptron 20120616.1
de maxent 20120616.1
de perceptron 20120616.1
en maxent 20120616.1
en perceptron 20120616.1
en perceptron-ixa 20131115.1
es maxent 20120410.1
es maxent-ixa 20140425.1
es maxent-universal 20120410.1
es perceptron 20120410.1
es perceptron-ixa 20131115.1
es perceptron-universal 20120410.1
it perceptron 20130618.0
nl maxent 20120616.1
nl perceptron 20120616.1
pt maxent 20120616.1
pt mm-maxent 20130121.1
pt mm-perceptron 20130121.1
pt perceptron 20120616.1
sv maxent 20120616.1
sv perceptron 20120616.1
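
A minimal sketch selecting the maxent variant and reading the resulting tags from the tokens:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PosTaggingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The quick brown fox jumps over the lazy dog.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // "maxent" or "perceptron" can be chosen via modelVariant
                createEngineDescription(OpenNlpPosTagger.class, "modelVariant", "maxent"));

        for (Token token : JCasUtil.select(jcas, Token.class)) {
            System.out.println(token.getCoveredText() + "/" + token.getPos().getPosValue());
        }
    }
}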

StanfordPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTagger

Stanford Part-of-Speech tagger component.

Parameters
POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

maxSentenceLength (Integer) [optional]

Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar accurate 20131112.1
de dewac 20140827.1
de fast 20140827.1
de fast-caseless 20140827.0
de hgc 20140827.1
en bidirectional-distsim 20140616.1
en caseless-left3words-distsim 20140827.0
en fast.41 20130730.1
en left3words-distsim 20140616.1
en twitter 20130730.1
en twitter-fast 20130914.0
en wsj-0-18-bidirectional-distsim 20160110.1
en wsj-0-18-bidirectional-nodistsim 20131112.1
en wsj-0-18-caseless-left3words-distsim 20140827.0
en wsj-0-18-left3words-distsim 20140616.1
en wsj-0-18-left3words-nodistsim 20131112.1
es default 20151014.1
es distsim 20150108.1
fr default 20140616.1
zh distsim 20140616.1
zh nodistsim 20140616.1

TreeTaggerPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

executablePath (String) [optional]

Use this TreeTagger executable instead of trying to locate the executable automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelEncoding (String) [optional]

The character encoding used by the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

performanceMode (Boolean) = false

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

writeLemma (Boolean) = true

Write lemma information. Default: true

writePOS (Boolean) = true

Write part-of-speech information. Default: true

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
bg le 20160430.1
de le 20121207.1
en le 20151119.1
es le 20150724.1
et le 20110124.1
fi le 20140704.1
fr le 20100111.1
gl le 20130516.1
it le 20141020.1
la le 20110819.1
mn le 20120925.1
nl le 20130107.1
pl le 20150506.1
pt le 20101115.2
ru le 20140505.1
sk le 20130725.1
sw le 20130729.1
zh le 20101115.1
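
A minimal sketch, assuming a TreeTagger executable and model are available on the system; POS tags and lemmas are produced in a single pass:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger;

public class TreeTaggerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("She was reading the documentation.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // writes POS tags and lemmas in one pass (both on by default)
                createEngineDescription(TreeTaggerPosTagger.class,
                        "writePOS", true,
                        "writeLemma", true));

        for (Token t : JCasUtil.select(jcas, Token.class)) {
            System.out.println(t.getCoveredText() + "\t" + t.getPos().getPosValue()
                    + "\t" + t.getLemma().getValue());
        }
    }
}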

Phonetic Transcriptor

Table 11. Analysis Components in group Phonetic Transcriptor (4)
Component Description

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

ColognePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Inputs and outputs

Inputs

Outputs

DoubleMetaphonePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

MetaphonePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

SoundexPhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

Segmenter

Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a segmenter component to avoid the need to reorder the components in a pipeline when replacing one segmenter with another.
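
Because every segmenter produces both Sentence and Token annotations, swapping one for another is a one-line change, as in this minimal sketch:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;

public class SegmenterSwapExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("One sentence. Another sentence.");

        runPipeline(jcas,
                // any segmenter from the table below can be used here, e.g.
                // BreakIteratorSegmenter, OpenNlpSegmenter, or LanguageToolSegmenter;
                // downstream components are unaffected by the choice
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(OpenNlpPosTagger.class));
    }
}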

Table 12. Analysis Components in group Segmenter (17)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

ArktweetTokenizer

ArkTweet tokenizer.

BreakIteratorSegmenter

BreakIterator segmenter.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

ClearNlpSegmenter

Tokenizer using Clear NLP.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

JTokSegmenter

JTok segmenter.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

StanfordSegmenter

No description

TokenMerger

Merges any Tokens that are covered by a given annotation type.

TokenTrimmer

Remove prefixes and suffixes from tokens.

WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

AnnotationByLengthFilter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameters
FilterTypes (String[]) = []

A set of annotation types that should be filtered.

MaxLengthFilter (Integer) = 1000

Any annotation of a type listed in filterTypes that is longer than this value will be removed.

MinLengthFilter (Integer) = 0

Any annotation of a type listed in filterTypes that is shorter than this value will be removed.

ArktweetTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetTokenizer

ArkTweet tokenizer.

BreakIteratorSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter

BreakIterator segmenter.

Parameters
language (String) [optional]

The language.

splitAtApostrophe (Boolean) = false

By default, the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, an additional token split is generated when an apostrophe (') is encountered.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

CamelCaseTokenSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

Parameters
deleteCover (Boolean) = true

Whether to remove the original token. Default: true

Inputs and outputs

Inputs

Outputs

ClearNlpSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter

Tokenizer using Clear NLP.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

GermanSeparatedParticleAnnotator

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. There are many usages of German particle verbs where the stem and the particle are separated, e.g. "Wir fangen gleich an." The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an"; the proper verb lemma "anfangen" is thus not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of particle verbs (e.g. fangen) by the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.

Inputs and outputs

Inputs

Outputs

JTokSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jtok-asl
Class: de.tudarmstadt.ukp.dkpro.core.jtok.JTokSegmenter

JTok segmenter.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeParagraph (Boolean) = true

Create Paragraph annotations.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

LanguageToolSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

LineBasedSentenceSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence. This segmenter is not capable of creating tokens! All token-related parameters have no effect.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

OpenNlpSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelVariant (String) [optional]

Override the default variant used to locate the model.

segmentationModelLocation (String) [optional]

Load the segmentation model from this location instead of locating the model automatically.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenizationModelLocation (String) [optional]

Load the tokenization model from this location instead of locating the model automatically.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

Models
Language   Variant   Version

da         maxent    20120616.1
da         maxent    20120616.1
de         maxent    20120616.1
de         maxent    20120616.1
en         maxent    20120616.1
en         maxent    20120616.1
it         maxent    20130618.0
it         maxent    20130618.0
nb         maxent    20120131.1
nb         maxent    20120131.1
nl         maxent    20120616.1
nl         maxent    20120616.1
pt         maxent    20120616.1
pt         maxent    20120616.1
sv         maxent    20120616.1
sv         maxent    20120616.1

ParagraphSplitter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.ParagraphSplitter

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

Parameters
splitPattern (String) = ((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)

A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks)

Inputs and outputs

Inputs

none specified

Outputs

PatternBasedTokenSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars. The prefix states whether the split characters should be added as separate Tokens. If the #INCLUDE_PREFIX precedes the split pattern, the matched characters are added as Tokens. Consequently, patterns following the #EXCLUDE_PREFIX will not be added as Tokens. A configuration sketch follows the parameter list below.

Parameters
deleteCover (Boolean) = true

Wether to remove the original token. Default: true

patterns (String[])

A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed.
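
A minimal configuration sketch, assuming PARAM_PATTERNS corresponds to the patterns parameter above; INCLUDE_PREFIX and EXCLUDE_PREFIX are the constants referenced in the description:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter;

    public class PatternSplitterConfig {
        public static AnalysisEngineDescription create() throws Exception {
            // Split tokens at hyphens and slashes: hyphens become tokens of
            // their own (INCLUDE_PREFIX), slashes are dropped (EXCLUDE_PREFIX).
            return createEngineDescription(PatternBasedTokenSegmenter.class,
                PatternBasedTokenSegmenter.PARAM_PATTERNS, new String[] {
                    PatternBasedTokenSegmenter.INCLUDE_PREFIX + "[-]",
                    PatternBasedTokenSegmenter.EXCLUDE_PREFIX + "[/]" });
        }
    }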

Inputs and outputs

Inputs

Outputs

RegexTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behaviour is to split sentences by a line break and tokens by whitespace.

Parameters
language (String) [optional]

The language.

sentenceBoundaryRegex (String) = \n

Defines the sentence boundary. Default: \n (assume one sentence per line).

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenBoundaryRegex (String) = [\s\n]+

Defines the pattern that is used as the token end boundary. Default: [\s\n]+ (matching whitespace and linebreaks).

When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n]. A configuration sketch follows the parameter list below.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.
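
As referenced above, a minimal configuration sketch for the "tokenized-text" case; the PARAM_… constants are assumed to correspond to the parameters listed above:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.tokit.RegexTokenizer;

    public class RegexTokenizerConfig {
        public static AnalysisEngineDescription create() throws Exception {
            // One sentence per line; tokens separated by hyphens. The newline
            // is part of the token boundary group so that the final token on
            // each line is terminated correctly.
            return createEngineDescription(RegexTokenizer.class,
                RegexTokenizer.PARAM_SENTENCE_BOUNDARY_REGEX, "\n",
                RegexTokenizer.PARAM_TOKEN_BOUNDARY_REGEX, "[-\n]");
        }
    }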

StanfordSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter

Parameters
allowEmptySentences (Boolean) = false

Whether to generate empty sentences.

boundaryFollowers (String[]) = [), ], }, \", ', '', \u2019, \u201D, -RRB-, -RSB-, -RCB-, ), ], }] [optional]

A set of strings, matched with .equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".

boundaryToDiscard (String[]) = [, NL] [optional]

The set of regex for sentence boundary tokens that should be discarded.

boundaryTokenRegex (String) = \\.|[!?]+ [optional]

The set of boundary tokens. If null, use default.

isOneSentence (Boolean) = false

Whether to treat all input as one sentence.

language (String) [optional]

The language.

languageFallback (String) [optional]
newlineIsSentenceBreak (String) = TWO_CONSECUTIVE [optional]

Strategy for treating newlines as paragraph breaks.

regionElementRegex (String) [optional]

A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenRegexesToDiscard (String[]) = [] [optional]

The set of regex for sentence boundary tokens that should be discarded.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

xmlBreakElementsToDiscard (String[]) [optional]

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

TokenMerger

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger

Merges any Tokens that are covered by a given annotation type. E.g. this component can be used to create a single token from all tokens that constitute a multi-token named entity.
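
A minimal sketch of the named-entity use case, assuming the usual PARAM_… constants for the parameters listed below and the DKPro Core NamedEntity type:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity;
    import de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger;

    public class NamedEntityTokenMerger {
        public static AnalysisEngineDescription create() throws Exception {
            // Merge the tokens of every location named entity into a single
            // token and tag the merged token as a proper noun.
            return createEngineDescription(TokenMerger.class,
                TokenMerger.PARAM_ANNOTATION_TYPE, NamedEntity.class.getName(),
                TokenMerger.PARAM_CONSTRAINT, ".[value = 'LOCATION']",
                TokenMerger.PARAM_POS_VALUE, "NNP");
        }
    }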

Parameters
POSMappingLocation (String) [optional]

Override the tagset mapping.

annotationType (String)

Annotation type for which tokens should be merged.

constraint (String) [optional]

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

lemmaMode (String) = JOIN

Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma, or to LEAVE the lemma of the first token as-is.

posType (String) [optional]

Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.

posValue (String) [optional]

Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Inputs and outputs

Inputs

Outputs

TokenTrimmer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer

Remove prefixes and suffixes from tokens.

Parameters
prefixes (String[])

List of prefixes to remove.

suffixes (String[])

List of suffixes to remove.

Inputs and outputs

Inputs

Outputs

WhitespaceTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Semantic role labeler

Table 13. Analysis Components in group Semantic role labeler (2)
Component Description

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

ClearNlpSemanticRoleLabeler

Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

Parameters
expandArguments (Boolean) = false

Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.

Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost; the arguments will appear to span a continuous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

predModelLocation (String) [optional]

Location from which the predicate identifier model is read.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

roleModelLocation (String) [optional]

Location from which the roleset classification model is read.

srlModelLocation (String) [optional]

Location from which the semantic role labeling model is read.

Inputs and outputs

Inputs

Outputs

MateSemanticRoleLabeler

Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

Please cite the following paper, if you use the semantic role labeler Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43--48, Boulder, June 4--5 2009.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language   Variant     Version

de         tiger       20130105.0
en         conll2009   20130117.0
es         conll2009   20130320.0
zh         conll2009   20130117.0

Stemmer

Table 14. Analysis Components in group Stemmer (1)
Component Description

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

SnowballStemmer

Role: Stemmer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.snowball-asl
Class: de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.
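
A sketch of such an ordering, assuming the usual PARAM_… constants; the stop word list location is illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
    import de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer;
    import de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover;

    public class StemmingPipeline {
        public static void main(String[] args) throws Exception {
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(OpenNlpSegmenter.class),
                // Remove stop words first so only content words are stemmed.
                createEngineDescription(StopWordRemover.class,
                    StopWordRemover.PARAM_MODEL_LOCATION,
                    "[en]classpath:/stopwords/english.txt"),
                createEngineDescription(SnowballStemmer.class));
        }
    }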

Parameters
filterConditionOperator (String) [optional]

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

filterConditionValue (String) [optional]

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

filterFeaturePath (String) [optional]

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

language (String) [optional]

Use this language instead of the document language to resolve the model.

lowerCase (Boolean) = false [optional]

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples
Input         lowerCase=false (default)   lowerCase=true
EDUCATIONAL   EDUCATIONAL                 educ
Educational   Educat                      educ
educational   educ                        educ

paths (String[]) [optional]

Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Inputs and outputs

Inputs

none specified

Outputs

Topic Model

Topic modeling is a statistical approach to discover abstract topics in a collection of documents. A topic is characterized by a probability distribution of the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability by which a document belongs to each of the topics in the model.
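
A two-pass sketch of this workflow with the two Mallet components described below; the PARAM_… constants are assumed to correspond to the targetLocation, nTopics, and modelLocation parameters, and file locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelEstimator;
    import de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelInferencer;
    import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

    public class TopicModelExample {
        public static void main(String[] args) throws Exception {
            // Pass 1: estimate a topic model over the training collection.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "train/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(MalletTopicModelEstimator.class,
                    MalletTopicModelEstimator.PARAM_TARGET_LOCATION, "model.mallet",
                    MalletTopicModelEstimator.PARAM_N_TOPICS, 10));

            // Pass 2: infer topic distributions for unseen documents.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "unseen/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(MalletTopicModelInferencer.class,
                    MalletTopicModelInferencer.PARAM_MODEL_LOCATION, "model.mallet"));
        }
    }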

Table 15. Analysis Components in group Topic Model (2)
Component Description

MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file.

MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MalletTopicModelEstimator

Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file. It stores all incoming CASes as Mallet Instances before estimating the model, using a ParallelTopicModel.

Parameters
alphaSum (Float) = 1.0

The sum of alphas over all topics. Default: 1.0.

Another recommended value is 50 / T (number of topics).

beta (Float) = 0.01

Beta for a single dimension of the Dirichlet prior. Default: 0.01.

burninPeriod (Integer) = 100

The number of iterations before hyperparameter optimization begins. Default: 100

displayInterval (Integer) = 50

The interval in which to display the estimated topics. Default: 50.

displayNTopicWords (Integer) = 7

The number of top words to display during estimation. Default: 7.

minTokenLength (Integer) = 3

Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.

modelEntityType (String) [optional]

If specified, the text contained in the given segmentation type annotations is fed as separate units to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.

By default, the full document text is used as a document.

nIterations (Integer) = 1000

The number of iterations during model estimation. Default: 1000.

nThreads (Integer) = 1

The number of threads to use during model estimation. Default: 1.

nTopics (Integer) = 10

The number of topics to estimate for the topic model.

optimizeInterval (Integer) = 50

Interval for optimizing Dirichlet hyperparameters. Default: 50

randomSeed (Integer) = -1

Set the random seed. If set to -1 (default), a random generator is used.

saveInterval (Integer) = 0

Define how often to save a serialized model during estimation. Default: 0 (only save when estimation is done).

targetLocation (String)

The target model file location.

typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

The annotation type to use for the topic model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.

useLemma (Boolean) = false

If set, uses lemmas instead of original text as features.

useSymmetricAlph (Boolean) = false

Whether to use a symmetric alpha value during model estimation. Default: false.

Inputs and outputs

Inputs

Outputs

none specified

MalletTopicModelInferencer

Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameters
burnIn (Integer) = 1

The number of iterations before hyperparameter optimization begins. Default: 1

maxTopicAssignments (Integer) = 0

Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is used.

minTokenLength (Integer) = 3

Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.

minTopicProb (Float) = 0.2

Minimum topic proportion for the document-topic assignment.

modelLocation (String)
nIterations (Integer) = 10

The number of iterations during inference. Default: 10.

thinning (Integer) = 5
typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

The annotation type to use as tokens. Default: Token

useLemma (Boolean) = false

If set, uses lemmas instead of original text as features.

Inputs and outputs

Inputs

Outputs

Transformer

Table 16. Analysis Components in group Transformer (13)
Component Description

CapitalizationNormalizer

Takes a text and corrects wrong capitalization.

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

HyphenationRemover

Simple dictionary-based hyphenation remover.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class does not work on tokens, as some expressions might span several tokens.

SharpSNormalizer

Takes a text and normalizes the German sharp s (ß).

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts, based on a frequency model.

CapitalizationNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer

Takes a text and corrects wrong capitalization.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

CjfNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameters
direction (String) = TO_SIMPLIFIED
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

DictionaryBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.
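
An illustrative mappings file (columns separated by the separator character, TAB by default; the entries are made up):

    # normalize common chat abbreviations
    u	you
    r	are
    thx	thanks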

Parameters
commentMarker (String) = #

Lines starting with this character (or String) are ignored. Default: '#'

modelEncoding (String) = UTF-8
modelLocation (String)
separator (String) = \t

Separator for mappings file. Default: "\t" (TAB).

typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

ExpressiveLengtheningNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

FileBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameters
ignoreCase (Boolean) = false
modelLocation (String)
replacement (String)
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

HyphenationRemover

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.HyphenationRemover

Simple dictionary-based hyphenation remover.

Parameters
modelEncoding (String) = UTF-8
modelLocation (String)
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

RegexBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

The parameter #PARAM_REGEX defines the regular expression to be searched for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.
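
A minimal configuration sketch (the regex and replacement are illustrative):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer;

    public class DigitMasker {
        public static AnalysisEngineDescription create() throws Exception {
            // Replace every token that consists only of digits by "0".
            return createEngineDescription(RegexBasedTokenTransformer.class,
                RegexBasedTokenTransformer.PARAM_REGEX, "[0-9]+",
                RegexBasedTokenTransformer.PARAM_REPLACEMENT, "0");
        }
    }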

Parameters
regex (String)

Define the regular expression to be replaced

replacement (String)

Define the string to replace matching tokens with

typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

ReplacementFileNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class does not work on tokens, as some expressions might span several tokens.

Parameters
modelLocation (String)

Location of a file which contains all replacing characters

srcExpressionSurroundings (String) = IRRELEVANT
targetExpressionSurroundings (String) = NOTHING
Inputs and outputs

Inputs

Outputs

SharpSNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.SharpSNormalizer

Takes a text and normalizes the German sharp s (ß).

Parameters
MinFrequencyThreshold (Integer) = 100
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

SpellingNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

StanfordPtbTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

TokenCaseTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameters
tokenCase (String)
The case to convert tokens to:
  • UPPERCASE: uppercase everything.
  • LOWERCASE: lowercase everything.
  • NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

UmlautNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts, based on a frequency model.

Parameters
MinFrequencyThreshold (Integer) = 100
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

Other

Table 17. Analysis Components in group Other (20)
Component Description

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

CompoundAnnotator

Annotates compound parts and linking morphemes.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

NGramAnnotator

N-gram annotator.

NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TfidfConsumer

This consumer builds a DfModel.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

AnnotationByTextFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameters
ignoreCase (Boolean) = true

If true, annotation texts are filtered case-insensitively. Default: true, i.e. words that occur in the list with different casing are not filtered out.

modelEncoding (String) = UTF-8
modelLocation (String)
typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.

ApplyChangesAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Inputs and outputs

Inputs

Outputs

Backmapper

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

Parameters
Chain (String[]) = [source, target] [optional]

Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].

CompoundAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.decompounding-asl
Class: de.tudarmstadt.ukp.dkpro.core.decompounding.uima.annotator.CompoundAnnotator

Annotates compound parts and linking morphemes.

Inputs and outputs

Inputs

Outputs

CorrectionsContextualizer

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.

DictionaryAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line; tokens are separated by space:

this is a phrase
another phrase

Parameters
annotationType (String) [optional]

The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String)

The file must contain one phrase per line - phrases will be split at " "

value (String) [optional]

The value to set the feature configured in #PARAM_VALUE_FEATURE to.

valueFeature (String) = value [optional]

Set this feature on the created annotations.

Inputs and outputs

Inputs

Outputs

none specified

JCasHolder

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

NGramAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ngrams-asl
Class: de.tudarmstadt.ukp.dkpro.core.ngrams.NGramAnnotator

N-gram annotator.

Parameters
N (Integer) = 3

The length of the n-grams to generate (the "n" in n-gram).

Inputs and outputs

Inputs

Outputs

NorvigSpellingCorrector

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.norvig-asl
Class: de.tudarmstadt.ukp.dkpro.core.norvig.NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

Inputs and outputs

Inputs

Outputs

PosFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameters
Verbs (Boolean) = false

Keep/remove verbs (true: keep, false: remove)

adj (Boolean) = false

Keep/remove adjectives (true: keep, false: remove)

adv (Boolean) = false

Keep/remove adverbs (true: keep, false: remove)

art (Boolean) = false

Keep/remove articles (true: keep, false: remove)

card (Boolean) = false

Keep/remove cardinal numbers (true: keep, false: remove)

conj (Boolean) = false

Keep/remove conjunctions (true: keep, false: remove)

n (Boolean) = false

Keep/remove nouns (true: keep, false: remove)

o (Boolean) = false

Keep/remove "others" (true: keep, false: remove)

pp (Boolean) = false

Keep/remove prepositions (true: keep, false: remove)

pr (Boolean) = false

Keep/remove pronouns (true: keep, false: remove)

punc (Boolean) = false

Keep/remove punctuation (true: keep, false: remove)

typeToRemove (String)

The fully qualified name of the type that should be filtered.
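
A configuration sketch that keeps only content words; since the exact PARAM_… constants are not listed here, the parameter names from the list above are passed as plain strings (uimaFIT accepts both):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
    import de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter;

    public class ContentWordFilter {
        public static AnalysisEngineDescription create() throws Exception {
            // Keep nouns, verbs, and adjectives; remove all other tokens.
            return createEngineDescription(PosFilter.class,
                "typeToRemove", Token.class.getName(),
                "n", true,
                "Verbs", true,
                "adj", true);
        }
    }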

Inputs and outputs

Inputs

Outputs

none specified

PosMapper

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

Parameters
dkproMappingLocation (String) [optional]

A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes.
If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.

mappingFile (String)

A properties file containing POS tagset mappings.

Inputs and outputs

Inputs

Outputs

ReadabilityAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.readability-asl
Class: de.tudarmstadt.ukp.dkpro.core.readability.ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.RegexTokenFilter

Remove every token that does or does not match a given regular expression.

Parameters
mustMatch (Boolean) = true

If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.

regex (String)

Every token that does or does not match this regular expression will be removed.

SemanticFieldAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

Parameters
annotationType (String)

Annotation types which should be annotated with semantic fields

constraint (String) [optional]

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity.

Inputs and outputs

Inputs

Outputs

StanfordDependencyConverter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

mode (String) = TREE [optional]

Sets the kind of dependencies being created.

Default: DependenciesMode#TREE

originalDependencies (Boolean) = true

Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.

Inputs and outputs

Inputs

Outputs

StopWordRemover

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stopwordremover-asl
Class: de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameters
Paths (String[]) [optional]

Feature paths for annotations that should be matched/removed. The default is

StopWord.class.getName()
Token.class.getName()
Lemma.class.getName()+"/value"

StopWordType (String) [optional]

Anything annotated with this type will be removed even if it does not match any word in the lists.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String[])

A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"
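
A configuration sketch combining a general list with language-specific lists; PARAM_MODEL_LOCATION is assumed to correspond to the modelLocation parameter above, and the list locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover;

    public class StopWordConfig {
        public static AnalysisEngineDescription create() throws Exception {
            return createEngineDescription(StopWordRemover.class,
                StopWordRemover.PARAM_MODEL_LOCATION, new String[] {
                    "[*]classpath:/stopwords/symbols.txt",   // every document
                    "[de]classpath:/stopwords/de.txt",       // German only
                    "[en]classpath:/stopwords/en.txt" });    // English only
        }
    }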

Inputs and outputs

Inputs

Outputs

none specified

Stopwatch

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.performance-asl
Class: de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.
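
A sketch of this sandwich pattern, assuming PARAM_TIMER_NAME corresponds to the timerName parameter below; the measured component in the middle is arbitrary:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch;
    import de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer;

    public class TimedStemmer {
        public static AnalysisEngineDescription[] create() throws Exception {
            // The identical timer name links the two Stopwatch instances.
            return new AnalysisEngineDescription[] {
                createEngineDescription(Stopwatch.class,
                    Stopwatch.PARAM_TIMER_NAME, "stemmerTimer"),
                createEngineDescription(SnowballStemmer.class),
                createEngineDescription(Stopwatch.class,
                    Stopwatch.PARAM_TIMER_NAME, "stemmerTimer") };
        }
    }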

Parameters
timerName (String)

Name of the timer pair. Upstream and downstream timer need to use the same name.

timerOutputFile (String) [optional]

Optional location of a file to which the collected timing statistics are written.

Inputs and outputs

Inputs

Outputs

TfidfAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfidfConsumer.

Parameters
featurePath (String)

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

lowercase (Boolean) = false [optional]

If set to true, the whole text is handled in lower case.

tfdfPath (String) [optional]

Provide the path to the Df-Model. When a SharedDfModel is bound to this annotator, this is ignored.

weightingModeIdf (String) = NORMAL [optional]

The model for inverse document frequency weighting.
Invoke toString() on an enum of WeightingModeIdf for setup.

Default value is "NORMAL" yielding an unweighted idf.

weightingModeTf (String) = NORMAL [optional]

The model for term frequency weighting.
Invoke toString() on an enum of WeightingModeTf for setup.

Default value is "NORMAL" yielding an unweighted tf.

Inputs and outputs

Inputs

none specified

Outputs

TfidfConsumer

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer

This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.

Parameters
featurePath (String)

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

lowercase (Boolean) = false

If set to true, the whole text is handled in lower case.

targetLocation (String)

Specifies the path and filename where the model file is written.
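
A two-pass sketch of the df/tf.idf workflow: the consumer first serializes a DfModel over the collection, then the TfidfAnnotator (described above) reads it back. The PARAM_… constants are assumed to correspond to the parameters listed for both components; file locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator;
    import de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer;
    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

    public class TfidfExample {
        private static final String FEATURE_PATH =
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token";

        public static void main(String[] args) throws Exception {
            // Pass 1: collect document frequencies and serialize the DfModel.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(TfidfConsumer.class,
                    TfidfConsumer.PARAM_FEATURE_PATH, FEATURE_PATH,
                    TfidfConsumer.PARAM_TARGET_LOCATION, "df.model"));

            // Pass 2: annotate each document with tf.idf weights.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(TfidfAnnotator.class,
                    TfidfAnnotator.PARAM_FEATURE_PATH, FEATURE_PATH,
                    TfidfAnnotator.PARAM_TFDF_PATH, "df.model"));
        }
    }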

TrailingCharacterRemover

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

Parameters
minTokenLength (Integer) = 1

All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed.

Shorter tokens that do not have trailing chars removed are always retained, regardless of their length.

pattern (String) = [\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+

A regex to be trimmed from the end of tokens.

Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (remove punctuations, special characters and capital letters).

Appendix

Table 18. Producers and consumers by type
Type Producer Consumer

GrammarAnomaly
SpellingAnomaly
SuggestedAction
CoreferenceChain
CoreferenceLink
Tfidf
Morpheme
MorphologicalFeatures
POS
DocumentMetaData
NamedEntity
PhoneticTranscription
Compound
CompoundPart
Lemma
LinkingMorpheme
NGram
NamedEntity
Paragraph
Sentence
Split
Stem
StopWord
Token
SemanticArgument
SemanticPredicate
PennTree
Chunk
Constituent
Dependency
SofaChangeAnnotation
TopicDistribution
JapaneseToken
TimerAnnotation