DKPro Core™ Component Reference

This document provides detailed information about the DKPro Core UIMA components.

Overview

Components
- Components by types they produce and consume

Analytics components

Table 1. Analysis Components (132)
Component	Description
Annotation-By-Length Filter	Removes annotations that do not conform to minimum or maximum length constraints.
Annotation-By-Text Filter	Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
CAS Transformation - Apply	Applies changes annotated using a SofaChangeAnnotation.
ArkTweet POS-Tagger	Wrapper for Twitter Tokenizer and POS Tagger.
ArkTweet POS-Tagger Trainer	Trainer for ark-tweet POS tagger.
ArkTweet Tokenizer	ArkTweet tokenizer.
CAS Transformation - Map back	After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.
Berkeley Parser	Berkeley Parser annotator.
Java BreakIterator Segmenter	BreakIterator segmenter.
CamelCase Token Segmenter	Split up existing tokens again if they are camel-case text.
Capitalization Normalizer	Takes a text and replaces wrong capitalization.
CIS Stemmer	UIMA wrapper for the CISTEM algorithm.
Chinese Traditional/Simplified Converter	Converts traditional Chinese to simplified Chinese or vice-versa.
ClearNLP Lemmatizer	Lemmatizer using Clear NLP.
ClearNLP Parser	CLEAR parser annotator.
ClearNLP POS-Tagger	Part-of-Speech annotator using Clear NLP.
ClearNLP Segmenter	Tokenizer using Clear NLP.
ClearNLP Semantic Role Labeler	ClearNLP semantic role labeller.
Commons Codec Cologne Phonetic Transcriptor	Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.
Compound Annotator	Annotates compound parts and linking morphemes.
CoreNLP Coreference Resolver	Deterministic coreference annotator from CoreNLP.
CoreNLP Dependency Parser	Dependency parser from CoreNLP.
CoreNLP Lemmatizer	Lemmatizer from CoreNLP.
CoreNLP Named Entity Recognizer	Named entity recognizer from CoreNLP.
CoreNLP Parser	Parser from CoreNLP.
CoreNLP POS-Tagger	Part-of-speech tagger from CoreNLP.
CoreNLP Segmenter	Tokenizer and sentence splitter using Stanford CoreNLP.
Corrections Contextualizer	This component assumes that some spell checker has already been applied upstream.
Dictionary Annotator	Takes a plain text file with phrases as input and annotates the phrases in the CAS file.
Dictionary-based Token Transformer	Reads a tab-separated file containing mappings from one token to another.
Commons Codec Double-Metaphone Phonetic Transcriptor	Double-Metaphone phonetic transcription based on Apache Commons Codec.
Expressive Lengthening Normalizer	Takes a text and shortens extra long words.
File-based Token Transformer	Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
FlexTag POS-Tagger	Flexible part-of-speech tagger.
GATE Lemmatizer	Wrapper for the GATE rule based lemmatizer.
German Separated Particle Annotator	Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.
Gosen Segmenter	Segmenter for Japanese text based on GoSen.
GATE Hepple POS-Tagger	GATE Hepple part-of-speech tagger.
HunPos POS-Tagger	Part-of-Speech annotator using HunPos.
Hyphenation Remover	Simple dictionary-based hyphenation remover.
ICU Segmenter	ICU segmenter.
IXA Lemmatizer	Lemmatizer using the OpenNLP-based Ixa implementation.
IXA POS-Tagger	Part-of-Speech annotator using OpenNLP with IXA extensions.
de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder	Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.
JTok Segmenter	JTok segmenter.
Jazzy Spellchecker	This annotator uses Jazzy for the decision whether a word is spelled correctly or not.
Lancaster Stemmer	This Paice/Husk Lancaster stemmer implementation only works with the English language so far.
LangDetect	Langdetect language identifier based on character n-grams.
Web1T Language Detector	Language detector based on n-gram frequency counts, e.g. as provided by Web1T.
TextCat Language Identifier (Character N-Gram-based)	Detection based on character n-grams.
LanguageTool Grammar Checker	Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
LanguageTool Lemmatizer	Naive lexicon-based lemmatizer.
LanguageTool Segmenter	Segmenter using LanguageTool to do the heavy lifting.
Line-based Sentence Segmenter	Annotates each line in the source text as a sentence.
LingPipe Named Entity Recognizer	LingPipe named entity recognizer.
LingPipe Named Entity Recognizer Trainer	LingPipe named entity recognizer trainer.
LingPipe POS-Tagger	LingPipe part-of-speech tagger.
LingPipe Segmenter	LingPipe segmenter.
Mallet Embeddings Annotator	Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.
Mallet Embeddings Trainer	Compute word embeddings from the given collection using skip-grams.
Mallet LDA Topic Model Inferencer	Infers the topic distribution over documents using a Mallet ParallelTopicModel.
Mallet LDA Topic Model Trainer	Estimate an LDA topic model using Mallet and write it to a file.
MaltParser Dependency Parser	Dependency parsing using MaltParser.
Mate Tools Lemmatizer	DKPro Core Annotator for the MateToolsLemmatizer.
Mate Tools Morphological Analyzer	DKPro Core Annotator for the MateToolsMorphTagger.
Mate Tools Dependency Parser	DKPro Annotator for the MateToolsParser.
Mate Tools POS-Tagger	DKPro Annotator for the MateToolsPosTagger.
Mate Tools Semantic Role Labeler	Annotator for the MateTools Semantic Role Labeler.
MeCab POS-Tagger	Annotator for the MeCab Japanese POS Tagger.
Commons Codec Metaphone Phonetic Transcriptor	Metaphone phonetic transcription based on Apache Commons Codec.
Morpha Lemmatizer	Lemmatize based on a finite-state machine.
MSTParser Dependency Parser	Dependency parsing using MSTParser.
N-Gram Annotator	N-gram annotator.
NLP4J Dependency Parser	Emory NLP4J dependency parser.
NLP4J Lemmatizer	Emory NLP4J lemmatizer.
NLP4J Named Entity Recognizer	Emory NLP4J name finder wrapper.
NLP4J POS-Tagger	Part-of-Speech annotator using Emory NLP4J.
NLP4J Segmenter	Segmenter using Emory NLP4J.
Simple Spelling Corrector	Identifies spelling errors using Norvig's algorithm.
OpenNLP Chunker	Chunk annotator using OpenNLP.
OpenNLP Chunker Trainer	Train a chunker model for OpenNLP.
OpenNLP Lemmatizer	Lemmatizer using OpenNLP.
OpenNLP Lemmatizer Trainer	Train a lemmatizer model for OpenNLP.
OpenNLP Named Entity Recognizer	OpenNLP name finder wrapper.
OpenNLP Named Entity Recognizer Trainer	Train a named entity recognizer model for OpenNLP.
OpenNLP Parser	OpenNLP parser.
OpenNLP POS-Tagger	Part-of-Speech annotator using OpenNLP.
OpenNLP POS-Tagger Trainer	Train a POS tagging model for OpenNLP.
OpenNLP Segmenter	Tokenizer and sentence splitter using OpenNLP.
OpenNLP Sentence Splitter Trainer	Train a sentence splitter model for OpenNLP.
OpenNLP Tokenizer Trainer	Train a tokenizer model for OpenNLP.
Paragraph Splitter	This class creates paragraph annotations for the given input document.
Pattern-based Token Segmenter	Split up existing tokens again at particular split-chars.
Phrase Annotator	Annotate phrases in a sentence.
POS Filter	Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
POS Mapper	Maps existing POS tags from one tagset to another using a user provided properties file.
Readability Annotator	Assign a set of popular readability scores to the text.
Regex-based Token Transformer	A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.
Regex Segmenter	This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
Regex Token Filter	Remove every token that does or does not match a given regular expression.
Replacement File Normalizer	Takes a text and replaces desired expressions.
RFTagger Morphological Analyzer	RFTagger morphological analyzer.
Semantic Field Annotator	This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.
SFST Morphological Analyzer	SFST morphological analyzer.
Sharp S (ß) Normalizer	Takes a text and replaces sharp s.
Snowball Stemmer	UIMA wrapper for the Snowball stemmer.
Commons Codec Soundex Phonetic Transcriptor	Soundex phonetic transcription based on Apache Commons Codec.
Spelling Normalizer	Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
CoreNLP Coreference Resolver (old API)	No description
CoreNLP Dependency Converter	Converts a constituency structure into a dependency structure.
CoreNLP Lemmatizer (old API)	Stanford Lemmatizer component.
CoreNLP Named Entity Recognizer (old API)	Stanford Named Entity Recognizer component.
CoreNLP Named Entity Recognizer Trainer	Train a NER model for Stanford CoreNLP Named Entity Recognizer.
CoreNLP Parser (old API)	Stanford Parser component.
CoreNLP POS-Tagger (old API)	Stanford Part-of-Speech tagger component.
CoreNLP POS-Tagger Trainer	Train a POS tagging model for the Stanford POS tagger.
Stanford Penn Treebank Normalizer	Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.
CoreNLP Segmenter (old API)	Stanford sentence splitter and tokenizer.
Stop Word Remover	Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.
Stopwatch	Can be used to measure how long the processing between two points in a pipeline takes.
TF/IDF Annotator	This component adds Tfidf annotations consisting of a term and a tfidf weight.
Token Case Transformer	Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.
Token Merger	Merges any Tokens that are covered by a given annotation type.
de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer	Remove prefixes and suffixes from tokens.
Trailing Character Remover	Removes trailing characters (or character sequences) from tokens, e.g. punctuation.
TreeTagger Chunker	Chunk annotator using TreeTagger.
TreeTagger POS-Tagger	Part-of-Speech and lemmatizer annotator using TreeTagger.
UDPipe Parsito Dependency Parser	Dependency parser using UDPipe.
UDPipe MorphoDiTa Morphological Analyzer	Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe.
UDPipe Segmenter	Tokenizer and sentence splitter using UDPipe.
Umlaut Normalizer	Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.
Whitespace Segmenter	A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

Checker

Table 2. Analysis Components in category Checker (2)
Component	Description
JazzyChecker	This annotator uses Jazzy for the decision whether a word is spelled correctly or not.
LanguageToolChecker	Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Jazzy Spellchecker

Short name	JazzyChecker
Category	Checker
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker

Description

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

Parameters

modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	Location from which the model is read. The model file is a simple word-list with one word per line. Type: String
scoreThreshold	Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within edit distance 1 of the original word. Type: Integer — Default value: `1`

Table 3. Capabilities
Inputs	Token
Outputs	SpellingAnomaly SuggestedAction
Languages	none specified
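
The effect of scoreThreshold can be illustrated with a minimal sketch that filters word-list entries by edit distance. This is not Jazzy's API: the class and method names are hypothetical, and the plain unit-cost Levenshtein distance used here is an assumption, since Jazzy may weight edit operations differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jazzy or DKPro Core.
final class EditDistanceFilter {

    // Unit-cost Levenshtein distance computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps only the word-list entries within scoreThreshold edits of the word.
    static List<String> suggestions(String word, List<String> wordList, int scoreThreshold) {
        List<String> result = new ArrayList<>();
        for (String candidate : wordList) {
            if (levenshtein(word, candidate) <= scoreThreshold) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Under this unit-cost assumption, a transposition such as "teh" for "the" counts as two edits, so it would only be suggested with a threshold of at least 2.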

LanguageTool Grammar Checker

Short name	LanguageToolChecker
Category	Checker
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolChecker

Description

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Parameters

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

Table 4. Capabilities
Inputs	none specified
Outputs	GrammarAnomaly
Languages	be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Chunker

Table 5. Analysis Components in category Chunker (3)
Component	Description
OpenNlpChunker	Chunk annotator using OpenNLP.
OpenNlpChunkerTrainer	Train a chunker model for OpenNLP.
TreeTaggerChunker	Chunk annotator using TreeTagger.

OpenNLP Chunker

Short name	OpenNlpChunker
Category	Chunker
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker

Description

Chunk annotator using OpenNLP.

Parameters

ChunkMappingLocation	Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 6. Capabilities
Inputs	POS Sentence Token
Outputs	Chunk
Languages	see available models

Table 7. Models
Language	Variant	Version
en	default	20100908.1
en	perceptron-ixa	20160205.1
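
As a rough illustration of the modelArtifactUri value, the sketch below assembles a URI from Maven coordinates, reading the format as the mvn: prefix followed by group ID, artifact ID, and version separated by colons. The helper class is hypothetical and not part of DKPro Core, and the coordinates in the usage note are placeholders, not real model artifacts.

```java
// Hypothetical helper for illustration; not part of DKPro Core.
final class ModelArtifactUri {

    // Builds "mvn:<groupId>:<artifactId>:<version>" and rejects parts that
    // are empty or contain the ':' separator.
    static String of(String groupId, String artifactId, String version) {
        for (String part : new String[] {groupId, artifactId, version}) {
            if (part == null || part.isEmpty() || part.contains(":")) {
                throw new IllegalArgumentException("Invalid coordinate: " + part);
            }
        }
        return "mvn:" + groupId + ":" + artifactId + ":" + version;
    }
}
```

For example, ModelArtifactUri.of("org.example", "example-model", "1.0") yields mvn:org.example:example-model:1.0, which would then be passed as the modelArtifactUri parameter together with a matching modelVariant.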

OpenNLP Chunker Trainer

Short name	OpenNlpChunkerTrainer
Category	Chunker
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunkerTrainer

Description

Train a chunker model for OpenNLP.

Parameters

algorithm	Training algorithm. Type: String — Default value: `MAXENT`
beamSize	Beam size. Type: Integer — Default value: `3`
cutoff	Frequency cut-off. Type: Integer — Default value: `5`
iterations	Number of training iterations. Type: Integer — Default value: `100`
language	Store this language to the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
targetLocation	Location to which the output is written. Type: String
trainerType	Trainer type. Type: String — Default value: `Event`

TreeTagger Chunker

Short name	TreeTaggerChunker
Category	Chunker
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker

Description

Chunk annotator using TreeTagger.

Parameters

ChunkMappingLocation	Location of the mapping file for chunk tags to UIMA types. Optional — Type: String
executablePath	Use this TreeTagger executable instead of trying to locate the executable automatically. Optional — Type: String
flushSequence	A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically just a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n.... Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
performanceMode	TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided. Type: Boolean — Default value: `false`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 8. Capabilities
Inputs	POS
Outputs	Chunk
Languages	see available models

Table 9. Models
Language	Variant	Version
de	le	20110429.1
en	iso8859-le	20090824.1
en	le	20140520.1
fr	le	20141218.2

Coreference resolver

Table 10. Analysis Components in category Coreference resolver (2)
Component	Description
CoreNlpCoreferenceResolver	Deterministic coreference annotator from CoreNLP.
StanfordCoreferenceResolver	No description

CoreNLP Coreference Resolver

Short name	CoreNlpCoreferenceResolver
Category	Coreference resolver
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpCoreferenceResolver

Description

Deterministic coreference annotator from CoreNLP.

Parameters

maxDist	DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance). Type: Integer — Default value: `-1`
postprocessing	DCoRef parameter: Do post-processing. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
score	DCoRef parameter: Scoring the output of the system. Type: Boolean — Default value: `false`
sieves	DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/. Type: String — Default value: `MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch`
singleton	DCoRef parameter: setting singleton predictor. Type: Boolean — Default value: `true`

Table 11. Capabilities
Inputs	POS NamedEntity Lemma Sentence Token Constituent
Outputs	CoreferenceChain CoreferenceLink
Languages	none specified
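
To make the ptb3Escaping parameter more concrete, the sketch below maps bracket tokens to their Penn Treebank escape symbols. It covers only the bracket subset of the PTB3 transforms and is a hypothetical helper, not the CoreNLP or DKPro Core implementation, which applies further transforms.

```java
// Hypothetical helper for illustration; not part of CoreNLP or DKPro Core.
final class PtbBracketEscaper {

    // Returns the PTB escape symbol for a bracket token and leaves every
    // other token text unchanged.
    static String escape(String tokenText) {
        switch (tokenText) {
            case "(": return "-LRB-";
            case ")": return "-RRB-";
            case "[": return "-LSB-";
            case "]": return "-RSB-";
            case "{": return "-LCB-";
            case "}": return "-RCB-";
            default: return tokenText;
        }
    }
}
```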

CoreNLP Coreference Resolver (old API)

Short name	StanfordCoreferenceResolver
Category	Coreference resolver
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordCoreferenceResolver

Description

No description

Parameters

maxDist	DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance). Type: Integer — Default value: `-1`
postprocessing	DCoRef parameter: Do post-processing. Type: Boolean — Default value: `false`
score	DCoRef parameter: Scoring the output of the system. Type: Boolean — Default value: `false`
sieves	DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/. Type: String — Default value: `MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch`
singleton	DCoRef parameter: setting singleton predictor. Type: Boolean — Default value: `true`

Table 12. Capabilities
Inputs	POS NamedEntity Lemma Sentence Token Constituent
Outputs	CoreferenceChain CoreferenceLink
Languages	see available models

Table 13. Models
Language	Variant	Version
en	default	${core.version}.1

Embeddings

Table 14. Analysis Components in category Embeddings (2)
Component	Description
MalletEmbeddingsAnnotator	Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.
MalletEmbeddingsTrainer	Compute word embeddings from the given collection using skip-grams.

Mallet Embeddings Annotator

Short name	MalletEmbeddingsAnnotator
Category	Embeddings
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mallet.wordembeddings.MalletEmbeddingsAnnotator

Description

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

Parameters

annotateUnknownTokens	Specify how to handle unknown tokens: If this parameter is not specified, unknown tokens are not annotated. If an empty float[] is passed, a random vector is generated that is used for each unknown token. If a float[] is passed, each unknown token is annotated with that vector. The float[] must have the same length as the vectors in the model file. Type: Boolean — Default value: `false`
lowercase	If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: `false`
modelHasHeader	If set to true (default: false), the first line is interpreted as header line containing the number of entries and the dimensionality. This should be set to true for models generated with Word2Vec. Type: Boolean — Default value: `false`
modelIsBinary	Whether the model is in binary format instead of text format. Type: Boolean — Default value: `false`
modelLocation	The file containing the word embeddings. Currently only supports text file format. Type: String
tokenFeaturePath	The annotation type to use for the model. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`

Table 15. Capabilities
Inputs	Token
Outputs	WordEmbedding
Languages	none specified

Mallet Embeddings Trainer

Short name	MalletEmbeddingsTrainer
Category	Embeddings
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mallet.wordembeddings.MalletEmbeddingsTrainer

Description

Compute word embeddings from the given collection using skip-grams.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters

compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
coveringAnnotationType	If specified, the text contained in annotations of the given segmentation type is fed as separate units ("documents") to the model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored. By default, the full text is used as a document. Type: String — Default value: ``
dimensions	The dimensionality of the output word embeddings (default: 50). Type: Integer — Default value: `50`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.). Type: Boolean — Default value: `true`
exampleWord	An example word that is output with its nearest neighbours once in a while (default: null, i.e. none). Optional — Type: String
filterRegex	Regular expression of tokens to be filtered. Type: String — Default value: ``
filterRegexReplacement	Value with which tokens matching the regular expression are replaced. Type: String — Default value: ``
lowercase	If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: `false`
minDocumentLength	Ignore documents with fewer tokens than this value (default: 10). Type: Integer — Default value: `10`
minTokenLength	Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Type: Integer — Default value: `3`
numNegativeSamples	The number of negative samples to be generated for each token (default: 5). Type: Integer — Default value: `5`
numThreads	The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int). Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating. Type: Integer — Default value: `0`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
paramStopwordsFile	The location of the stopwords file. Type: String — Default value: ``
paramStopwordsReplacement	If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP). Type: String — Default value: ``
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
tokenFeaturePath	The annotation type to use as input tokens for the model estimation. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
useCharacters	If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored. Type: Boolean — Default value: `false`
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
windowSize	The context size when generating embeddings (default: 5). Type: Integer — Default value: `5`
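
The role of windowSize in skip-gram training can be sketched as follows: every token is paired with each neighbour at most windowSize positions away. This is a textbook illustration with hypothetical names, not Mallet's implementation, which may for instance sample the window or draw negative samples differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Mallet or DKPro Core.
final class SkipGramPairs {

    // Returns {target, context} pairs for all tokens within windowSize
    // positions to the left and right of each target token.
    static List<String[]> pairs(List<String> tokens, int windowSize) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    result.add(new String[] {tokens.get(i), tokens.get(j)});
                }
            }
        }
        return result;
    }
}
```

A larger window produces more pairs per token and tends to capture broader topical context, while a smaller window emphasises immediate neighbours.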

Gazeteer

Table 16. Analysis Components in category Gazeteer (1)
Component	Description
DictionaryAnnotator	Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

Dictionary Annotator

Short name	DictionaryAnnotator
Category	Gazeteer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.DictionaryAnnotator

Description

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line, with tokens separated by spaces:

this is a phrase
another phrase

Parameters

annotationType	The annotation to create on matching phrases. If nothing is specified, this defaults to NGram. Optional — Type: String
modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	The file must contain one phrase per line; phrases are split at spaces (" "). Type: String
value	The value to set the feature configured in #PARAM_VALUE_FEATURE to. Optional — Type: String
valueFeature	Set this feature on the created annotations. Optional — Type: String — Default value: `value`

Table 17. Capabilities
Inputs	Sentence Token
Outputs	none specified
Languages	none specified
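
The phrase file format and the matching against sentence tokens can be sketched as below. This is a naive illustration with hypothetical names, not the DictionaryAnnotator implementation; in particular, the component's actual matching strategy (for example, how overlapping phrases are handled) may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of DKPro Core.
final class PhraseMatcher {

    // Splits each non-empty phrase line at single spaces into its tokens.
    static List<String[]> parsePhrases(List<String> lines) {
        List<String[]> phrases = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                phrases.add(trimmed.split(" "));
            }
        }
        return phrases;
    }

    // Returns {startTokenIndex, endTokenIndexExclusive} for every phrase
    // occurrence in the token sequence of one sentence.
    static List<int[]> findMatches(List<String> sentenceTokens, List<String[]> phrases) {
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < sentenceTokens.size(); start++) {
            for (String[] phrase : phrases) {
                if (start + phrase.length > sentenceTokens.size()) {
                    continue;
                }
                boolean matched = true;
                for (int k = 0; k < phrase.length; k++) {
                    if (!sentenceTokens.get(start + k).equals(phrase[k])) {
                        matched = false;
                        break;
                    }
                }
                if (matched) {
                    matches.add(new int[] {start, start + phrase.length});
                }
            }
        }
        return matches;
    }
}
```

Each returned index pair corresponds to a span on which an annotation of the configured annotationType would be created.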

Language Identifier

Table 18. Analysis Components in category Language Identifier (3)
Component	Description
LangDetectLanguageIdentifier	Langdetect language identifier based on character n-grams.
LanguageIdentifier	Detection based on character n-grams.
LanguageDetectorWeb1T	Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LangDetect

Short name	LangDetectLanguageIdentifier
Category	Language Identifier
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.langdetect-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier

Description

Langdetect language identifier based on character n-grams. Due to the way LangDetect is implemented, this component does not support being instantiated multiple times with different model locations. Only a single model location can be active at a time over all instances of this component.

Parameters

modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
seed	The random seed. Optional — Type: String

Table 19. Models
Language	Variant	Version
any	socialmedia	20141013.1
any	wikipedia	20141013.1

TextCat Language Identifier (Character N-Gram-based)

Short name	LanguageIdentifier
Category	Language Identifier
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textcat-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textcat.LanguageIdentifier

Description

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References

Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
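
The technique by Cavnar and Trenkle ranks the most frequent character n-grams of a text and compares that ranking with per-language profiles using an "out-of-place" distance. The sketch below is a simplified illustration with hypothetical names: it omits the token padding and normalization of the original method and is not the API of the Java Text Categorizing Library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TextCat library.
final class NGramProfile {

    // Ranks character n-grams of lengths 1..maxN by descending frequency
    // (ties broken alphabetically) and keeps the topK most frequent ones.
    static List<String> profile(String text, int maxN, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    // Out-of-place distance: sum of rank differences; n-grams missing from
    // the language profile receive the maximum penalty (the profile size).
    static int outOfPlace(List<String> docProfile, List<String> langProfile) {
        int distance = 0;
        for (int docRank = 0; docRank < docProfile.size(); docRank++) {
            int langRank = langProfile.indexOf(docProfile.get(docRank));
            distance += langRank < 0 ? langProfile.size() : Math.abs(docRank - langRank);
        }
        return distance;
    }
}
```

A document would be assigned the language whose profile yields the smallest distance.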

Web1T Language Detector

Short name	LanguageDetectorWeb1T
Category	Language Identifier
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.ldweb1t-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.ldweb1t.LanguageDetectorWeb1T

Description

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

Parameters

maxNGramSize	The maximum n-gram size that should be considered. Type: Integer — Default value: `3`
minNGramSize	The minimum n-gram size that should be considered. Type: Integer — Default value: `1`
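
The effect of minNGramSize and maxNGramSize can be pictured with the following sketch, which lists every n-gram whose size lies between the two bounds. It assumes that n-grams are built over tokens, as in the Web1T word n-gram data. The class NGramWindow is hypothetical and does not reproduce the component's scoring against frequency counts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerates token n-grams for all sizes between minN and maxN. */
final class NGramWindow {

    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("Require 1 <= minN <= maxN");
        }
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of n tokens over the sequence.
            for (int i = 0; i + n <= tokens.size(); i++) {
                result.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return result;
    }
}
```

With the default values 1 and 3, a three-token input produces three unigrams, two bigrams and one trigram.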

Lemmatizer

Table 20. Analysis Components in category Lemmatizer (11)
Component	Description
ClearNlpLemmatizer	Lemmatizer using Clear NLP.
CoreNlpLemmatizer	Lemmatizer from CoreNLP.
StanfordLemmatizer	Stanford Lemmatizer component.
GateLemmatizer	Wrapper for the GATE rule-based lemmatizer.
IxaLemmatizer	Lemmatizer using the OpenNLP-based Ixa implementation.
LanguageToolLemmatizer	Naive lexicon-based lemmatizer.
MateLemmatizer	DKPro Core Annotator for the MateToolsLemmatizer.
MorphaLemmatizer	Lemmatize based on a finite-state machine.
Nlp4JLemmatizer	Emory NLP4J lemmatizer.
OpenNlpLemmatizer	Lemmatizer using OpenNLP.
OpenNlpLemmatizerTrainer	Train a lemmatizer model for OpenNLP.

ClearNLP Lemmatizer

Short name	ClearNlpLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpLemmatizer

Description

Lemmatizer using Clear NLP.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String — Default value: `en`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 21. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	see available models

Table 22. Models
Language	Variant	Version
en	default	20131111.0

CoreNLP Lemmatizer

Short name	CoreNlpLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpLemmatizer

Description

Lemmatizer from CoreNLP.

Parameters

ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]

Table 23. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	none specified
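
The following sketch illustrates the kind of token transform that ptb3Escaping, quoteBegin and quoteEnd refer to. It covers only the well-known Penn Treebank bracket replacements and assumes that extra quote tokens are mapped to the PTB quote forms `` and ''. The class Ptb3EscapeSketch is hypothetical, and the actual CoreNLP escaping covers more cases than shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a small subset of PTB3-style token escaping. */
final class Ptb3EscapeSketch {
    private static final Map<String, String> BRACKETS = new HashMap<>();
    static {
        BRACKETS.put("(", "-LRB-");
        BRACKETS.put(")", "-RRB-");
        BRACKETS.put("[", "-LSB-");
        BRACKETS.put("]", "-RSB-");
        BRACKETS.put("{", "-LCB-");
        BRACKETS.put("}", "-RCB-");
    }

    /** Escapes a single token text; unknown tokens are returned unchanged. */
    static String escape(String token, Set<String> quoteBegin, Set<String> quoteEnd) {
        String bracket = BRACKETS.get(token);
        if (bracket != null) {
            return bracket;
        }
        if (quoteBegin.contains(token)) {
            return "``"; // assumed PTB form of an opening quote
        }
        if (quoteEnd.contains(token)) {
            return "''"; // assumed PTB form of a closing quote
        }
        return token;
    }
}
```

In this sketch, passing the guillemets « and » as quoteBegin and quoteEnd would cause them to be escaped like regular opening and closing quotes.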

CoreNLP Lemmatizer (old API)

Short name	StanfordLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordLemmatizer

Description

Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for English.

Parameters

ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]

Table 24. Capabilities
Inputs	POS Token
Outputs	Lemma
Languages	en

GATE Lemmatizer

Short name	GateLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.gate-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.gate.GateLemmatizer

Description

Wrapper for the GATE rule-based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 25. Capabilities
Inputs	Token
Outputs	Lemma
Languages	see available models

Table 26. Models
Language	Variant	Version
en	default	20160531.0

IXA Lemmatizer

Short name	IxaLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.ixa-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.ixa.IxaLemmatizer

Description

Lemmatizer using the OpenNLP-based Ixa implementation.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 27. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	see available models

Table 28. Models
Language	Variant	Version
de	perceptron-conll09	20160213.1
en	perceptron-conll09	20160211.1
en	perceptron-ud	20160214.1
en	xlemma-perceptron-ud	20160214.1
es	perceptron-ancora-2.0	20160211.1
eu	perceptron-ud	20160212.1
fr	perceptron-sequoia	20160215.1
gl	perceptron-autodict05-ctag	20160212.1
it	perceptron-ud	20160213.1
nl	perceptron-alpino	20160215.1

LanguageTool Lemmatizer

Short name	LanguageToolLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer

Description

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.

Parameters

sanitize	Remove characters specified in #PARAM_SANITIZE_CHARS from lemmas. Type: Boolean — Default value: `true`
sanitizeChars	Characters to remove from lemmas if #PARAM_SANITIZE is enabled. Type: String[] — Default value: `[(, ), [, ]]`

Table 29. Capabilities
Inputs	Sentence Token
Outputs	Lemma
Languages	be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh
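
The selection strategy described above can be sketched in plain Java as follows. This is an illustration only and not the component's code: the class NaiveLemmaPicker is hypothetical, the readings are assumed to be given as a plain list of lemma strings, and ties between equally frequent lemmas are resolved in favour of the first one seen, which is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: picks the most frequent lemma among readings and optionally sanitizes it. */
final class NaiveLemmaPicker {

    static String pick(String tokenText, List<String> readingLemmas,
            boolean sanitize, String[] sanitizeChars) {
        // Count how often each lemma occurs among the readings, keeping first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String lemma : readingLemmas) {
            if (lemma != null) {
                counts.merge(lemma, 1, Integer::sum);
            }
        }
        // Fall back to the original text if there are no readings.
        String best = tokenText;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        if (sanitize) {
            for (String c : sanitizeChars) {
                best = best.replace(c, "");
            }
        }
        return best;
    }
}
```

With the default sanitizeChars, a token "(foo)" without any readings would receive the lemma "foo". With sanitize disabled, it would keep "(foo)".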

Mate Tools Lemmatizer

Short name	MateLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer

Description

DKPro Core Annotator for the MateToolsLemmatizer.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
uppercase	Try reconstructing proper casing for lemmata. This is useful for German, but e.g. for English creates odd results. Type: Boolean — Default value: `false`
variant	Override the default variant used to locate the model. Optional — Type: String

Table 30. Capabilities
Inputs	Sentence Token
Outputs	Lemma
Languages	see available models

Table 31. Models
Language	Variant	Version
de	tiger	20121024.1
en	conll2009	20130117.1
es	conll2009	20130117.1
fr	ftb	20130918.0

Morpha Lemmatizer

Short name	MorphaLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.morpha-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.morpha.MorphaLemmatizer

Description

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References

Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.

Parameters

readPOS	Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual POS tag value we get from the token. This may produce worse results than not passing on POS tags at all, so this is disabled by default. Type: Boolean — Default value: `false`

Table 32. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	en

NLP4J Lemmatizer

Short name	Nlp4JLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.nlp4j-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JLemmatizer

Description

Emory NLP4J lemmatizer. This is a lower-casing lemmatizer.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String

Table 33. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	none specified

OpenNLP Lemmatizer

Short name	OpenNlpLemmatizer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpLemmatizer

Description

Lemmatizer using OpenNLP.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 34. Capabilities
Inputs	POS Sentence Token
Outputs	Lemma
Languages	none specified

OpenNLP Lemmatizer Trainer

Short name	OpenNlpLemmatizerTrainer
Category	Lemmatizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpLemmatizerTrainer

Description

Train a lemmatizer model for OpenNLP.

Parameters

algorithm	Training algorithm. Type: String — Default value: `MAXENT`
beamSize	Beam size. Type: Integer — Default value: `3`
cutoff	Frequency cut-off. Type: Integer — Default value: `5`
iterations	Number of training iterations. Type: Integer — Default value: `100`
language	Store this language to the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
targetLocation	Location to which the output is written. Type: String
trainerType	Trainer type. Type: String — Default value: `Event`

Morphological analyzer

Table 35. Analysis Components in category Morphological analyzer (3)
Component	Description
MateMorphTagger	DKPro Core Annotator for the MateToolsMorphTagger.
RfTagger	RFTagger morphological analyzer.
SfstAnnotator	SFST morphological analyzer.

Mate Tools Morphological Analyzer

Short name	MateMorphTagger
Category	Morphological analyzer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger

Description

DKPro Core Annotator for the MateToolsMorphTagger.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 36. Capabilities
Inputs	Lemma Sentence Token
Outputs	Morpheme MorphologicalFeatures
Languages	see available models

Table 37. Models
Language	Variant	Version
de	tiger	20121024.1
es	conll2009	20130117.1
fr	ftb	20130918.0

RFTagger Morphological Analyzer

Short name	RfTagger
Category	Morphological analyzer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.rftagger-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.rftagger.RfTagger

Description

RFTagger morphological analyzer.

Parameters

MorphMappingLocation	Load the morphological features mapping from this location instead of locating the mapping automatically. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: `false`

Table 38. Capabilities
Inputs	Sentence Token
Outputs	MorphologicalFeatures POS
Languages	see available models

Table 39. Models
Language	Variant	Version
cz	cac	20150728.1
de	tiger	20150928.1
hu	szeged	20150728.1
ru	ric	20150728.1
sk	snk	20150728.1
sl	jos	20150728.1

SFST Morphological Analyzer

Short name	SfstAnnotator
Category	Morphological analyzer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.sfst-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator

Description

SFST morphological analyzer.

Parameters

MorphMappingLocation	Load the morphological features mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
mode	Read the FIRST analysis or read ALL analyses. Type: String — Default value: `FIRST`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	Specifies the model encoding. Type: String — Default value: `UTF-8`
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: `false`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`

Table 40. Capabilities
Inputs	Sentence Token
Outputs	MorphologicalFeatures POS
Languages	see available models

Table 41. Models
Language	Variant	Version
de	morphisto-ca	20110202.1
de	smor-ca	20140801.1
de	zmorge-newlemma-ca	20140521.1
de	zmorge-orig-ca	20140521.1
it	pippi-ca	20090223.1
tr	trmorph-ca	20130219.1

Named Entity Recognizer

Table 42. Analysis Components in category Named Entity Recognizer (9)
Component	Description
StanfordNamedEntityRecognizer	Stanford Named Entity Recognizer component.
CoreNlpNamedEntityRecognizer	Named entity recognizer from CoreNLP.
StanfordNamedEntityRecognizerTrainer	Train a NER model for Stanford CoreNLP Named Entity Recognizer.
LingPipeNamedEntityRecognizer	LingPipe named entity recognizer.
LingPipeNamedEntityRecognizerTrainer	LingPipe named entity recognizer trainer.
Nlp4JNamedEntityRecognizer	Emory NLP4J name finder wrapper.
OpenNlpNamedEntityRecognizer	OpenNLP name finder wrapper.
OpenNlpNamedEntityRecognizerTrainer	Train a named entity recognizer model for OpenNLP.
SemanticFieldAnnotator	This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

CoreNLP Named Entity Recognizer (old API)

Short name	StanfordNamedEntityRecognizer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

Description

Stanford Named Entity Recognizer component.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]

Table 43. Capabilities
Inputs	Sentence Token
Outputs	NamedEntity
Languages	see available models

Table 44. Models
Language	Variant	Version
de	germeval2014.hgc_175m_600.crf	20180227.1
de	nemgp	20141024.1
en	all.3class.caseless.distsim.crf	20161213.0
en	all.3class.distsim.crf	20161213.1
en	all.3class.nodistsim.crf	20160110.1
en	conll.4class.caseless.distsim.crf	20160110.0
en	conll.4class.distsim.crf	20150420.1
en	conll.4class.nodistsim.crf	20160110.1
en	freme-wikiner	20150925.1
en	muc.7class.caseless.distsim.crf	20150129.0
en	muc.7class.distsim.crf	20150129.1
en	muc.7class.nodistsim.crf	20160110.1
en	nowiki.3class.caseless.distsim.crf	20161213.0
en	nowiki.3class.nodistsim.crf	20160110.0
es	ancora.distsim.s512.crf	20161211.1
es	freme-wikiner	20150925.1
fr	freme-wikiner	20150925.1
it	freme-wikiner	20150925.1
nl	freme-wikiner	20150925.1
ru	freme-wikiner	20160726.1

CoreNLP Named Entity Recognizer

Short name	CoreNlpNamedEntityRecognizer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpNamedEntityRecognizer

Description

Named entity recognizer from CoreNLP.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
applyNumericClassifiers	Type: Boolean — Default value: `true`
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxSentenceLength	Maximum sentence length. Longer sentences are skipped. Type: Integer — Default value: `2147483647`
maxTime	Maximum time to spend on a single sentence. Type: Integer — Default value: `-1`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
numThreads	Number of parallel threads to use. Type: Integer — Default value: `0`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]

Table 45. Capabilities
Inputs	Sentence Token
Outputs	NamedEntity
Languages	none specified

CoreNLP Named Entity Recognizer Trainer

Short name	StanfordNamedEntityRecognizerTrainer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizerTrainer

Description

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

Parameters

acceptedTagsRegex	Regex used to filter named entities by type, i.e. by the value of de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue(). Optional — Type: String
entitySubClassification	Label set to use for training. Options: IOB1, IOB2, IOE1, IOE2, SBIEO, IO, BIO, BILOU, noprefix. Optional — Type: String — Default value: `noprefix`
propertiesFile	Training file containing the parameters. The `trainFile` or `trainFileList` and `serializeTo` parameters in this file are ignored/overridden. Optional — Type: String
retainClassification	Flag to keep the label set specified by PARAM_LABEL_SET. If set to false, representation is mapped to IOB1 on output. Optional — Type: Boolean — Default value: `true`
targetLocation	Location of the target model file. Type: String

Table 46. Capabilities
Inputs	NamedEntity Sentence Token
Outputs	none specified
Languages	none specified
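
Label sets such as BILOU mark the position of each token within an entity. The following sketch shows one way to derive BILOU tags from per-token entity types. It is not the trainer's code: the class BilouEncoder is hypothetical, and it makes the simplifying assumption that adjacent tokens with the same type belong to the same entity.

```java
/** Hypothetical sketch: converts per-token entity types (null = outside) to BILOU tags. */
final class BilouEncoder {

    static String[] encode(String[] types) {
        String[] tags = new String[types.length];
        for (int i = 0; i < types.length; i++) {
            String type = types[i];
            if (type == null) {
                tags[i] = "O"; // token is outside of any entity
                continue;
            }
            boolean starts = i == 0 || !type.equals(types[i - 1]);
            boolean ends = i == types.length - 1 || !type.equals(types[i + 1]);
            String prefix;
            if (starts && ends) {
                prefix = "U"; // unit-length entity
            } else if (starts) {
                prefix = "B"; // beginning of a multi-token entity
            } else if (ends) {
                prefix = "L"; // last token of a multi-token entity
            } else {
                prefix = "I"; // inside a multi-token entity
            }
            tags[i] = prefix + "-" + type;
        }
        return tags;
    }
}
```

For the types PER, PER, PER, none, LOC this produces B-PER, I-PER, L-PER, O, U-LOC.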

LingPipe Named Entity Recognizer

Short name	LingPipeNamedEntityRecognizer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeNamedEntityRecognizer

Description

LingPipe named entity recognizer.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 47. Capabilities
Inputs	Token
Outputs	NamedEntity
Languages	see available models

Table 48. Models
Language	Variant	Version
en	bio-genetag	20110623.1
en	bio-genia	20110623.1
en	news-muc6	20110623.1

LingPipe Named Entity Recognizer Trainer

Short name	LingPipeNamedEntityRecognizerTrainer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeNamedEntityRecognizerTrainer

Description

LingPipe named entity recognizer trainer.

Parameters

acceptedTagsRegex	Regex used to filter named entities by type, i.e. by the value of de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue(). Optional — Type: String
targetLocation	Location to which the output is written. Type: String
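
The effect of acceptedTagsRegex can be pictured with the following sketch, which keeps only those named entity values that match a regular expression. The class AcceptedTagsFilter is hypothetical. It assumes that the expression has to match the whole value and that all values are kept when no expression is set. The trainer itself may behave differently in both respects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: filters named entity values by a regular expression. */
final class AcceptedTagsFilter {

    static List<String> filter(List<String> values, String acceptedTagsRegex) {
        if (acceptedTagsRegex == null) {
            return new ArrayList<>(values); // assumption: no regex means no filtering
        }
        Pattern pattern = Pattern.compile(acceptedTagsRegex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            // matches() requires the whole value to match the expression.
            if (value != null && pattern.matcher(value).matches()) {
                kept.add(value);
            }
        }
        return kept;
    }
}
```

For example, the expression PER|LOC would keep entities with the values PER and LOC and drop entities with the value ORG.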

NLP4J Named Entity Recognizer

Short name	Nlp4JNamedEntityRecognizer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.nlp4j-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JNamedEntityRecognizer

Description

Emory NLP4J name finder wrapper.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
ignoreMissingFeatures	Process anyway, even if the model relies on features that are not supported by this component. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 49. Capabilities
Inputs	POS Lemma Sentence Token
Outputs	NamedEntity
Languages	see available models

Table 50. Models
Language	Variant	Version
en	default	20160802.0

OpenNLP Named Entity Recognizer

Short name	OpenNlpNamedEntityRecognizer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer

Description

OpenNLP name finder wrapper.

Parameters

NamedEntityMappingLocation	Location of the mapping file for named entity tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Type: String — Default value: `person`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 51. Capabilities
Inputs	Token
Outputs	NamedEntity
Languages	see available models

Table 52. Models
Language	Variant	Version
de	nemgp	20141024.1
en	date	20100907.0
en	location	20100907.0
en	money	20100907.0
en	organization	20100907.0
en	percentage	20100907.0
en	person	20130624.1
en	time	20100907.0
es	location	20100908.0
es	misc	20100908.0
es	organization	20100908.0
es	person	20100908.0
nl	location	20100908.0
nl	misc	20100908.0
nl	organization	20100908.0
nl	person	20100908.0

OpenNLP Named Entity Recognizer Trainer

Short name	OpenNlpNamedEntityRecognizerTrainer
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizerTrainer

Description

Train a named entity recognizer model for OpenNLP.

Parameters

acceptedTagsRegex	Regex used to filter named entities by type, i.e. by the value of de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue(). Optional — Type: String
algorithm	Type: String — Default value: `PERCEPTRON`
beamSize	Type: Integer — Default value: `3`
cutoff	Frequency cut-off. Type: Integer — Default value: `0`
featureGen	File containing the feature generation specification. Optional — Type: String
iterations	Number of training iterations. Type: Integer — Default value: `300`
language	Store this language in the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
sequenceEncoding	Type of sequence encoding to use. Type: String — Default value: `BILOU`
targetLocation	Location to which the output is written. Type: String
trainerType	Training algorithm. Type: String — Default value: `Event`
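The sequenceEncoding parameter defaults to BILOU. As a generic illustration of that tagging scheme (not the trainer's internal code), the sketch below marks each token as Begin, Inside, Last, Outside, or Unit. The class BilouExample, the span representation (token indices with an inclusive end), and the exact tag spelling are assumptions for this example.

```java
import java.util.Arrays;

final class BilouExample {
    // spans[i] = {firstToken, lastToken} (inclusive); types[i] is the entity type of span i.
    static String[] encode(int tokenCount, int[][] spans, String[] types) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O"); // tokens outside any entity
        for (int i = 0; i < spans.length; i++) {
            int begin = spans[i][0];
            int end = spans[i][1];
            if (begin == end) {
                tags[begin] = "U-" + types[i]; // single-token entity
            } else {
                tags[begin] = "B-" + types[i];
                for (int t = begin + 1; t < end; t++) {
                    tags[t] = "I-" + types[i];
                }
                tags[end] = "L-" + types[i];
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens: Angela Merkel visited Paris .
        String[] tags = encode(5, new int[][] {{0, 1}, {3, 3}}, new String[] {"person", "location"});
        System.out.println(Arrays.toString(tags));
        // prints [B-person, L-person, O, U-location, O]
    }
}
```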

Table 53. Capabilities
Inputs	NamedEntity Sentence Token
Outputs	none specified
Languages	none specified

Semantic Field Annotator

Short name	SemanticFieldAnnotator
Category	Named Entity Recognizer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator

Description

This Analysis Engine annotates single English words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

Parameters

annotationType	Annotation types which should be annotated with semantic fields. Type: String
constraint	A constraint on the annotations that should be considered, in the form of a JXPath statement. Example: set annotationType to a NamedEntity type and set constraint to ".[value = 'LOCATION']" to annotate only tokens that are part of a location named entity with semantic fields. Optional — Type: String
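The description mentions that the semantic field information may come from a simple key-value map. A minimal sketch of such a lookup is shown below. The class SemanticFieldLookupExample, the lower-casing of keys, and the field names in the sample map are assumptions for illustration and do not reflect the component's actual resource interface.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SemanticFieldLookupExample {
    // Returns word -> semantic field for every word found in the resource map.
    static Map<String, String> lookup(List<String> words, Map<String, String> resource) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String word : words) {
            // Assumption: keys in the resource are stored in lower case.
            String field = resource.get(word.toLowerCase(Locale.ROOT));
            if (field != null) {
                fields.put(word, field);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical resource content.
        Map<String, String> resource = Map.of("river", "object", "bank", "possession");
        System.out.println(lookup(List.of("The", "river", "Bank"), resource));
        // prints {river=object, Bank=possession}
    }
}
```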

Table 54. Capabilities
Inputs	POS Lemma Token
Outputs	NamedEntity
Languages	none specified

Parser

Table 55. Analysis Components in category Parser (12)
Component	Description
BerkeleyParser	Berkeley Parser annotator.
ClearNlpParser	CLEAR parser annotator.
StanfordDependencyConverter	Converts a constituency structure into a dependency structure.
CoreNlpDependencyParser	Dependency parser from CoreNLP.
CoreNlpParser	Parser from CoreNLP.
StanfordParser	Stanford Parser component.
MstParser	Dependency parsing using MSTParser.
MaltParser	Dependency parsing using MaltParser.
MateParser	DKPro Annotator for the MateToolsParser.
Nlp4JDependencyParser	Emory NLP4J dependency parser.
OpenNlpParser	OpenNLP parser.
UDPipeParser	Dependency parser using UDPipe.

Berkeley Parser

Short name	BerkeleyParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.berkeleyparser-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser

Description

Berkeley Parser annotator. Requires Sentence annotations to be present.

Parameters

ConstituentMappingLocation	Location of the mapping file for constituent tags to UIMA types. Optional — Type: String
POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
accurate	Set thresholds for accuracy instead of efficiency. Type: Boolean — Default value: `false`
binarize	Output binarized trees. Type: Boolean — Default value: `false`
keepFunctionLabels	Retain predicted function labels. Model must have been trained with function labels. Type: Boolean — Default value: `false`
language	Use this language instead of the language set in the CAS to locate the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
readPOS	Whether to use existing POS tags from another annotator during parsing. Type: Boolean — Default value: `true`
scores	Output inside scores (only for binarized Viterbi trees). Type: Boolean — Default value: `false`
substates	Output sub-categories (only for binarized Viterbi trees). Type: Boolean — Default value: `false`
variational	Use variational rule score approximation instead of max-rule. Type: Boolean — Default value: `false`
viterbi	Compute Viterbi derivation instead of max-rule tree. Type: Boolean — Default value: `false`
writePOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `false`
writePennTree	If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Type: Boolean — Default value: `false`
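To illustrate what a parse tree in Penn Treebank style format looks like (the content stored when writePennTree is enabled), the sketch below serializes a small tree into bracketed notation. The Node class and the helper names leaf, phrase, and toPenn are hypothetical and are not part of the component's API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PennTreeExample {
    static final class Node {
        final String label;
        final String word; // non-null only for leaves
        final List<Node> children;

        private Node(String label, String word, List<Node> children) {
            this.label = label;
            this.word = word;
            this.children = children;
        }
    }

    static Node leaf(String tag, String word) {
        return new Node(tag, word, Collections.emptyList());
    }

    static Node phrase(String label, Node... children) {
        return new Node(label, null, Arrays.asList(children));
    }

    // Writes "(LABEL word)" for leaves and "(LABEL child child ...)" for phrases.
    static String toPenn(Node node) {
        StringBuilder sb = new StringBuilder("(").append(node.label);
        if (node.word != null) {
            sb.append(' ').append(node.word);
        }
        for (Node child : node.children) {
            sb.append(' ').append(toPenn(child));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Node tree = phrase("S",
                phrase("NP", leaf("DT", "The"), leaf("NN", "dog")),
                phrase("VP", leaf("VBZ", "barks")));
        System.out.println(toPenn(tree));
        // prints (S (NP (DT The) (NN dog)) (VP (VBZ barks)))
    }
}
```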

Table 56. Capabilities
Inputs	Sentence Token
Outputs	PennTree Constituent
Languages	see available models

Table 57. Models
Language	Variant	Version
ar	sm5	20090917.1
bg	sm5	20090917.1
de	sm5	20090917.1
en	sm6	20100819.1
fr	sm5	20090917.1
zh	sm5	20090917.1

ClearNLP Parser

Short name	ClearNlpParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpParser

Description

CLEAR parser annotator.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: `false`

Table 58. Capabilities
Inputs	POS Lemma Sentence Token
Outputs	Dependency
Languages	see available models

Table 59. Models
Language	Variant	Version
en	mayo	20131111.0
en	ontonotes	20131128.0

CoreNLP Dependency Converter

Short name	StanfordDependencyConverter
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordDependencyConverter

Description

Converts a constituency structure into a dependency structure.

Parameters

language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
mode	Sets the kind of dependencies being created. Optional — Type: String — Default value: `TREE`
originalDependencies	Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies. Type: Boolean — Default value: `true`

Table 60. Capabilities
Inputs	Token Constituent
Outputs	Dependency
Languages	none specified

CoreNLP Dependency Parser

Short name	CoreNlpDependencyParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpDependencyParser

Description

Dependency parser from CoreNLP.

Parameters

DependencyMappingLocation	Location of the mapping file for dependency tags to UIMA types. Optional — Type: String
extraDependencies	Types of extra edges to add to the dependency tree. Type: String — Default value: `NONE`
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxSentenceLength	Maximum sentence length. Longer sentences are skipped. Type: Integer — Default value: `2147483647`
maxTime	Maximum time to spend on a single sentence. Type: Integer — Default value: `-1`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
numThreads	Number of parallel threads to use. Type: Integer — Default value: `0`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
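The ptb3Escaping parameter refers to traditional PTB3 token transforms such as -LRB- and -RRB-. The sketch below shows only the bracket subset of these transforms, applied to a token list before it would be handed to a parser. The class Ptb3EscapeExample is hypothetical, and the full set of transforms performed by CoreNLP is broader than what is shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class Ptb3EscapeExample {
    // Bracket tokens and their Penn Treebank escape symbols.
    private static final Map<String, String> BRACKETS = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    // Replaces bracket tokens; all other tokens are passed through unchanged.
    static List<String> escape(List<String> tokens) {
        List<String> escaped = new ArrayList<>();
        for (String token : tokens) {
            escaped.add(BRACKETS.getOrDefault(token, token));
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println(escape(List.of("(", "see", "below", ")")));
        // prints [-LRB-, see, below, -RRB-]
    }
}
```

The quoteBegin and quoteEnd parameters serve a similar purpose for quote characters that the built-in handling does not cover.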

Table 61. Capabilities
Inputs	POS Sentence Token
Outputs	Dependency
Languages	see available models

Table 62. Models
Language	Variant	Version
de	ud	20161213.1
en	ptb-conll	20160119.1
en	sd	20150418.1
en	ud	20161213.1
en	wsj-sd	20150418.1
en	wsj-ud	20161213.1
fr	ud	20180227.1
zh	ctb-conll	20160119.1
zh	ptb-conll	20161223.1
zh	ud	20161223.1

CoreNLP Parser

Short name	CoreNlpParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpParser

Description

Parser from CoreNLP.

Parameters

ConstituentMappingLocation	Location of the mapping file for constituent tags to UIMA types. Optional — Type: String
DependencyMappingLocation	Location of the mapping file for dependency tags to UIMA types. Optional — Type: String
POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
extraDependencies	Types of extra edges to add to the dependency tree. Type: String — Default value: `NONE`
keepPunctuation	Whether to keep punctuation dependencies in the dependency parse output of the parser. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxSentenceLength	Maximum sentence length. Longer sentences are skipped. Type: Integer — Default value: `2147483647`
maxTime	Maximum time to spend on a single sentence. Type: Integer — Default value: `-1`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
numThreads	Number of parallel threads to use. Type: Integer — Default value: `0`
originalDependencies	Generate original Stanford Dependencies grammatical relations instead of Universal Dependencies. Type: Boolean — Default value: `true`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
readPOS	Whether to use existing POS tags. Type: Boolean — Default value: `true`
writeConstituent	Whether to create constituent tags. This is required for POS-tagging and lemmatization. Type: Boolean — Default value: `true`
writeDependency	Whether to create dependency annotations. Type: Boolean — Default value: `true`
writePOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `false`
writePennTree	If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Type: Boolean — Default value: `false`

Table 63. Capabilities
Inputs	POS Sentence Token
Outputs	Constituent Dependency
Languages	none specified

CoreNLP Parser (old API)

Short name	StanfordParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser

Description

Stanford Parser component.

Parameters

ConstituentMappingLocation	Location of the mapping file for constituent tags to UIMA types. Optional — Type: String
POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
annotationTypeToParse	This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing. If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations. Optional — Type: String
keepPunctuation	Whether to keep the punctuation as part of the parse tree. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxItems	Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser. Type: Integer — Default value: `200000`
maxSentenceLength	Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions. Type: Integer — Default value: `130`
mode	Sets the kind of dependencies being created. Optional — Type: String — Default value: `TREE`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
readPOS	Whether to use existing POS tags from another annotator during parsing. Type: Boolean — Default value: `true`
writeConstituent	Whether to create constituent tags. This is required for POS-tagging and lemmatization. Type: Boolean — Default value: `true`
writeDependency	Whether to create dependency annotations. Type: Boolean — Default value: `true`
writePOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `false`
writePennTree	If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Type: Boolean — Default value: `false`

Table 64. Capabilities
Inputs	POS Sentence Token
Outputs	Constituent Dependency
Languages	see available models

Table 65. Models
Language	Variant	Version
ar	factored	20150129.1
ar	sr	20180227.1
de	factored	20150129.1
de	pcfg	20150129.1
de	sr	20141031.1
en	factored	20150129.1
en	pcfg	20150129.1
en	pcfg.caseless	20160110.1
en	rnn	20140104.1
en	sr	20141031.1
en	sr-beam	20141031.1
en	wsj-factored	20150129.1
en	wsj-pcfg	20150129.1
en	wsj-rnn	20140104.1
es	pcfg	20161211.1
es	sr	20161211.1
es	sr-beam	20161211.1
fr	factored	20150129.1
fr	sr	20160114.1
fr	sr-beam	20141023.1
zh	factored	20150129.1
zh	pcfg	20150129.1
zh	sr	20141023.1
zh	xinhua-factored	20150129.1
zh	xinhua-pcfg	20150129.1

MSTParser Dependency Parser

Short name	MstParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mstparser-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser

Description

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found in the MSTParser project documentation.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CoNLL 2006 format are not generated (cf. mstparser.DependencyInstance).
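To make the column restriction concrete, the sketch below writes one token in the ten-column CoNLL 2006 layout with only FORM and the POS column filled and every other column left as an underscore. This is an illustration of the data that is available to the parser, not the way the component actually hands data to MSTParser; the class ConllRowExample is hypothetical.

```java
final class ConllRowExample {
    // Column order: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
    static String row(int id, String form, String pos) {
        return String.join("\t",
                String.valueOf(id), form, "_", "_", pos, "_", "_", "_", "_", "_");
    }

    public static void main(String[] args) {
        String[] forms = {"Dogs", "bark"};
        String[] tags = {"NNS", "VBP"};
        for (int i = 0; i < forms.length; i++) {
            System.out.println(row(i + 1, forms[i], tags[i]));
        }
    }
}
```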

Parameters

DependencyMappingLocation	Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
order	Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here. Optional — Type: Integer
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 66. Capabilities
Inputs	POS Sentence Token
Outputs	Dependency
Languages	see available models

Table 67. Models
Language	Variant	Version
en	eisner	20100416.2
en	sample	20121019.2
hr	mte5.defnpout	20130527.1
hr	mte5.pos	20130527.1

MaltParser Dependency Parser

Short name	MaltParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.maltparser-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser

Description

Dependency parsing using MaltParser.

Required annotations:

Token
Sentence
POS

Generated annotations:

Dependency (annotated over sentence-span)

Parameters

ignoreMissingFeatures	Process anyway, even if the model relies on features that are not supported by this component. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 68. Capabilities
Inputs	POS Lemma Sentence Token
Outputs	Dependency
Languages	see available models

Table 69. Models
Language	Variant	Version
bn	linear	20120905.1
en	linear	20120312.1
en	poly	20120312.1
es	linear	20130220.0
fa	linear	20130522.1
fr	linear	20120312.1
pl	linear	20120904.1
sv	linear	20120925.2

Mate Tools Dependency Parser

Short name	MateParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.matetools.MateParser

Description

DKPro Annotator for the MateToolsParser.

Please cite the following paper, if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameters

DependencyMappingLocation	Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 70. Capabilities
Inputs	POS Sentence Token
Outputs	Dependency
Languages	see available models

Table 71. Models
Language	Variant	Version
de	tiger	20121024.1
en	conll2009	20130117.2
es	conll2009	20130117.1
fa	parsper	20141124.0
fr	ftb	20130918.0
zh	conll2009	20130117.1

NLP4J Dependency Parser

Short name	Nlp4JDependencyParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.nlp4j-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JDependencyParser

Description

Emory NLP4J dependency parser.

Parameters

DependencyMappingLocation	Location of the mapping file for dependency tags to UIMA types. Optional — Type: String
ignoreMissingFeatures	Process anyway, even if the model relies on features that are not supported by this component. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 72. Capabilities
Inputs	POS Sentence Token
Outputs	Dependency
Languages	none specified

OpenNLP Parser

Short name	OpenNlpParser
Category	Parser
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpParser

Description

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotations if explicitly requested via the writePOS parameter.

Parameters

ConstituentMappingLocation	Location of the mapping file for constituent tags to UIMA types. Optional — Type: String
POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
writePOS	Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Type: Boolean — Default value: `false`
writePennTree	If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format. Type: Boolean — Default value: `false`

Table 73. Capabilities
Inputs	Sentence Token
Outputs	PennTree Constituent
Languages	see available models

Table 74. Models
Language	Variant	Version
en	chunking	20120616.1
en	chunking-ixa	20140426.1
es	chunking-ixa	20140426.1

UDPipe Parsito Dependency Parser

Short name	UDPipeParser
Category	Parser
Group ID	org.dkpro.core
Artifact ID	dkpro-core-udpipe-asl
Implementation	org.dkpro.core.udpipe.UDPipeParser

Description

Dependency parser using UDPipe. UDPipe uses Parsito, a greedy transition-based parser utilizing an artificial neural network.
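As background on transition-based parsing in general, the sketch below applies a given sequence of arc-standard transitions to build a head array. It is a simplified illustration under stated assumptions: a greedy parser would choose each transition with a classifier (a neural network in Parsito), whereas here the sequence is supplied by hand, and the transition system actually used by a given UDPipe model may differ. The class ArcStandardExample is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class ArcStandardExample {
    enum Transition { SHIFT, LEFT_ARC, RIGHT_ARC }

    // Tokens are numbered 1..tokenCount; index 0 is the artificial root.
    // Returns heads[i] = head of token i (heads[0] stays -1).
    static int[] parse(int tokenCount, Transition... transitions) {
        int[] heads = new int[tokenCount + 1];
        Arrays.fill(heads, -1);
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        int next = 1; // next token waiting in the buffer
        for (Transition t : transitions) {
            switch (t) {
                case SHIFT:
                    stack.push(next++);
                    break;
                case LEFT_ARC: {
                    // The second item on the stack becomes a dependent of the top item.
                    int top = stack.pop();
                    int second = stack.pop();
                    heads[second] = top;
                    stack.push(top);
                    break;
                }
                case RIGHT_ARC: {
                    // The top item becomes a dependent of the item below it.
                    int top = stack.pop();
                    heads[top] = stack.peek();
                    break;
                }
            }
        }
        return heads;
    }

    public static void main(String[] args) {
        // Tokens: 1=Dogs 2=bark 3=loudly
        int[] heads = parse(3,
                Transition.SHIFT, Transition.SHIFT, Transition.LEFT_ARC,
                Transition.SHIFT, Transition.RIGHT_ARC, Transition.RIGHT_ARC);
        System.out.println(Arrays.toString(heads));
        // prints [-1, 2, 0, 2]
    }
}
```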

Parameters

DependencyMappingLocation	Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 75. Capabilities
Inputs	MorphologicalFeatures POS Lemma Sentence Token
Outputs	Dependency
Languages	see available models

Table 76. Models
Language	Variant	Version
en	ud	20160523.1
no	ud	20160523.1

Part-of-speech tagger

Table 77. Analysis Components in category Part-of-speech tagger (18)
Component	Description
ArktweetPosTagger	Wrapper for Twitter Tokenizer and POS Tagger.
ArktweetPosTaggerTrainer	Trainer for ark-tweet POS tagger.
ClearNlpPosTagger	Part-of-Speech annotator using Clear NLP.
CoreNlpPosTagger	Part-of-speech tagger from CoreNLP.
StanfordPosTagger	Stanford Part-of-Speech tagger component.
StanfordPosTaggerTrainer	Train a POS tagging model for the Stanford POS tagger.
FlexTagPosTagger	Flexible part-of-speech tagger.
HepplePosTagger	GATE Hepple part-of-speech tagger.
HunPosTagger	Part-of-Speech annotator using HunPos.
IxaPosTagger	Part-of-Speech annotator using OpenNLP with IXA extensions.
LingPipePosTagger	LingPipe part-of-speech tagger.
MatePosTagger	DKPro Annotator for the MateToolsPosTagger.
MeCabTagger	Annotator for the MeCab Japanese POS Tagger.
Nlp4JPosTagger	Part-of-Speech annotator using Emory NLP4J.
OpenNlpPosTagger	Part-of-Speech annotator using OpenNLP.
OpenNlpPosTaggerTrainer	Train a POS tagging model for OpenNLP.
TreeTaggerPosTagger	Part-of-Speech and lemmatizer annotator using TreeTagger.
UDPipePosTagger	Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe.

ArkTweet POS-Tagger

Short name	ArktweetPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTagger

Description

Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId:${artifactId}:${version}}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
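
The modelArtifactUri format mvn:${groupId}:${artifactId}:${version} can be assembled from Maven coordinates. The following is a minimal sketch using only the Java standard library. The class `ModelArtifactUri` is hypothetical and not part of DKPro Core, and the coordinates in the usage note are placeholders rather than a real model artifact.

```java
// Hypothetical helper, not part of DKPro Core: assembles a model artifact URI
// of the form mvn:${groupId}:${artifactId}:${version}.
final class ModelArtifactUri {
    private ModelArtifactUri() {
    }

    static String of(String groupId, String artifactId, String version) {
        for (String part : new String[] { groupId, artifactId, version }) {
            // A colon inside a coordinate would make the URI ambiguous.
            if (part == null || part.isEmpty() || part.contains(":")) {
                throw new IllegalArgumentException("Invalid coordinate: " + part);
            }
        }
        return "mvn:" + groupId + ":" + artifactId + ":" + version;
    }
}
```

For example, `ModelArtifactUri.of("org.example", "example-model", "1.0")` returns `mvn:org.example:example-model:1.0`. Such a string would be passed as the modelArtifactUri parameter together with a matching modelVariant.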

Table 78. Capabilities
Inputs	Token
Outputs	POS
Languages	see available models

Table 79. Models
Language	Variant	Version
en	default	20120919.1
en	irc	20121211.1
en	ritter	20130723.1

ArkTweet POS-Tagger Trainer

Short name	ArktweetPosTaggerTrainer
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTaggerTrainer

Description

Trainer for ark-tweet POS tagger.

Parameters

targetLocation	Location to which the model is written. Type: String
wordClusterFile	Classpath resource pointing to the word cluster file calculated with the Brown clustering algorithm. Type: String

Table 80. Capabilities
Inputs	POS Sentence Token
Outputs	none specified
Languages	none specified

ClearNLP POS-Tagger

Short name	ClearNlpPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpPosTagger

Description

Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated before.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
dictLocation	Load the dictionary from this location instead of locating the dictionary automatically. Optional — Type: String
dictVariant	Override the default variant used to locate the dictionary. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the pos-tagging model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the pos-tagging model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 81. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 82. Models
Language	Variant	Version
en	mayo	20131111.0
en	ontonotes	20131128.0

CoreNLP POS-Tagger

Short name	CoreNlpPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpPosTagger

Description

Part-of-speech tagger from CoreNLP.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxSentenceLength	Maximum sentence length. Longer sentences are skipped. Type: Integer — Default value: `2147483647`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
numThreads	Number of parallel threads to use. Type: Integer — Default value: `0`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
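
The ptb3Escaping parameter refers to the Penn Treebank convention of replacing brackets with symbolic tokens such as -LRB- and -RRB-. The sketch below illustrates only the bracket part of that convention for a single token. The class `Ptb3BracketEscaper` is hypothetical; the actual transforms are applied inside CoreNLP and cover more cases than brackets.

```java
// Hypothetical illustration of PTB3-style bracket escaping. This is not the
// CoreNLP implementation, which handles further token transforms as well.
final class Ptb3BracketEscaper {
    private Ptb3BracketEscaper() {
    }

    // Returns the symbolic form of a bracket token, or the token unchanged.
    static String escape(String token) {
        switch (token) {
        case "(":
            return "-LRB-";
        case ")":
            return "-RRB-";
        case "[":
            return "-LSB-";
        case "]":
            return "-RSB-";
        case "{":
            return "-LCB-";
        case "}":
            return "-RCB-";
        default:
            return token;
        }
    }
}
```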

Table 83. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	none specified

CoreNLP POS-Tagger (old API)

Short name	StanfordPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTagger

Description

Stanford Part-of-Speech tagger component.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
maxSentenceLength	Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged. Optional — Type: Integer
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
ptb3Escaping	Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: `true`
quoteBegin	List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
quoteEnd	List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[]
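
The maxSentenceLength semantics documented for this component (a limit applies only when the value is larger than zero) can be sketched as a simple predicate. The class and method names below are hypothetical and serve only to illustrate the rule; they are not taken from the component's source.

```java
// Hypothetical helper mirroring the documented maxSentenceLength rule of the
// old-API tagger: a value of zero or less disables the limit.
final class SentenceLengthLimit {
    private SentenceLengthLimit() {
    }

    static boolean shouldTag(int tokenCount, int maxSentenceLength) {
        // Zero (the default) means every sentence is tagged.
        if (maxSentenceLength <= 0) {
            return true;
        }
        // Only sentences with more tokens than the limit are ignored.
        return tokenCount <= maxSentenceLength;
    }
}
```

Note that the CoreNLP POS-Tagger (new API) documents a different default for the same parameter name (`2147483647`), so the zero-means-unlimited rule sketched here is specific to the old-API component.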

Table 84. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 85. Models
Language	Variant	Version
ar	default	20180103.1
de	fast	20140827.1
de	fast-caseless	20140827.0
de	hgc	20140827.1
de	ud	20161213.1
en	bidirectional-distsim	20140616.1
en	caseless-left3words-distsim	20140827.0
en	fast.41	20130730.1
en	left3words-distsim	20140616.1
en	twitter	20130730.1
en	twitter-fast	20130914.0
en	wsj-0-18-bidirectional-distsim	20160110.1
en	wsj-0-18-bidirectional-nodistsim	20131112.1
en	wsj-0-18-caseless-left3words-distsim	20140827.0
en	wsj-0-18-left3words-distsim	20140616.1
en	wsj-0-18-left3words-nodistsim	20131112.1
es	default	20161211.1
es	distsim	20161211.1
fr	default	20140616.1
zh	distsim	20140616.1

CoreNLP POS-Tagger Trainer

Short name	StanfordPosTaggerTrainer
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTaggerTrainer

Description

Train a POS tagging model for the Stanford POS tagger.

Parameters

clusterFile	Distsim cluster files. Optional — Type: String
targetLocation	Location to which the output is written. Type: String
trainFile	Training file containing the parameters. The `trainFile`, `model` and `encoding` parameters in this file are ignored/overwritten. In the `arch` parameter, the string `${distsimCluster}` is replaced with the path to the cluster files if the clusterFile parameter is specified. Optional — Type: String
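
The placeholder substitution described for the `arch` parameter amounts to a literal string replacement. The following sketch shows that step in isolation. The class `DistsimPlaceholder` is hypothetical, and the `arch` value in the usage note is an illustrative string rather than a recommended configuration.

```java
// Hypothetical sketch of the documented ${distsimCluster} substitution; the
// trainer's actual implementation may differ in detail.
final class DistsimPlaceholder {
    static final String PLACEHOLDER = "${distsimCluster}";

    private DistsimPlaceholder() {
    }

    // Replaces the placeholder with the cluster file path. If no cluster
    // file is configured, the arch value is returned unchanged.
    static String resolveArch(String arch, String clusterFile) {
        if (clusterFile == null) {
            return arch;
        }
        // String.replace performs a literal (non-regex) replacement.
        return arch.replace(PLACEHOLDER, clusterFile);
    }
}
```

For instance, `resolveArch("left3words,distsim(${distsimCluster},-1,1)", "/tmp/clusters.txt")` returns `left3words,distsim(/tmp/clusters.txt,-1,1)`.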

Table 86. Capabilities
Inputs	POS Sentence Token
Outputs	none specified
Languages	none specified

FlexTag POS-Tagger

Short name	FlexTagPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.flextag-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.flextag.FlexTagPosTagger

Description

Flexible part-of-speech tagger.

Parameters

POSMappingLocation	Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String

Table 87. Models
Language	Variant	Version
de	tiger	20170512.1
en	wsj0-18	20170512.1

GATE Hepple POS-Tagger

Short name	HepplePosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.gate-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.gate.HepplePosTagger

Description

GATE Hepple part-of-speech tagger.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
lexiconLocation	Load the lexicon from this location instead of locating it automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
rulesetLocation	Load the ruleset from this location instead of locating it automatically. Optional — Type: String

Table 88. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 89. Models
Language	Variant	Version
en	annie	20160531.0

HunPos POS-Tagger

Short name	HunPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.hunpos-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.hunpos.HunPosTagger

Description

Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.

References

HALÁCSY, Péter; KORNAI, András; ORAVECZ, Csaba. HunPos: an open source trigram tagger. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007. pp. 209-212.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 90. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 91. Models
Language	Variant	Version
cs	pdt	20121123.2
da	ddt	20121123.2
de	tiger	20121123.2
en	wsj	20070724.2
fa	upc	20140414.0
hr	mte5.defnpout	20130509.2
hu	szeged_kr	20070724.2
pt	bosque	20121123.2
pt	mm	20130119.2
pt	tbchp	20110419.2
ru	rdt	20121123.2
sl	jos	20121123.2
sv	paroletags	20100215.2
sv	suctags	20100927.2

IXA POS-Tagger

Short name	IxaPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.ixa-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.ixa.IxaPosTagger

Description

Part-of-Speech annotator using OpenNLP with IXA extensions.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 92. Models
Language	Variant	Version
de	perceptron-autodict01-conll09	20160213.1
en	maxent-100-c5-baseline-autodict01-conll09	20160211.1
en	perceptron-autodict01-conll09	20160211.1
en	perceptron-autodict01-ud	20160214.1
en	xpos-perceptron-autodict01-ud	20160214.1
es	perceptron-autodict01-ancora-2.0	20160212.1
eu	perceptron-ud	20160212.1
fr	perceptron-autodict01-sequoia	20160215.1
gl	perceptron-autdict05-ctag	20160212.1
it	perceptron-autodict01-ud	20160213.1
nl	maxent-100-c5-autodict01-alpino	20160214.1
nl	perceptron-autodict01-alpino	20160214.1

LingPipe POS-Tagger

Short name	LingPipePosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipePosTagger

Description

LingPipe part-of-speech tagger.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
uppercaseTags	LingPipe models tend to be trained on lower-case tags, but our POS mappings use upper-case tags. Type: Boolean — Default value: `true`

Table 93. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 94. Models
Language	Variant	Version
en	bio-genia	20110623.1
en	bio-medpost	20110623.1
en	general-brown	20110623.1

Mate Tools POS-Tagger

Short name	MatePosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.matetools.MatePosTagger

Description

DKPro Annotator for the MateToolsPosTagger.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 95. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 96. Models
Language	Variant	Version
de	tiger	20121024.1
en	conll2009	20130117.1
es	conll2009	20130117.1
fr	ftb	20130918.0
zh	conll2009	20130117.1

MeCab POS-Tagger

Short name	MeCabTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mecab-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mecab.MeCabTagger

Description

Annotator for the MeCab Japanese POS Tagger.

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 97. Capabilities
Inputs	none specified
Outputs	POS Lemma Sentence JapaneseToken
Languages	ja

Table 98. Models
Language	Variant	Version
jp	bin-linux-x86_32	20140917.0
jp	bin-linux-x86_64	20140917.0
jp	bin-osx-x86_64	20140917.0
jp	ipadic	20070801.0

NLP4J POS-Tagger

Short name	Nlp4JPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.nlp4j-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JPosTagger

Description

Part-of-Speech annotator using Emory NLP4J. Requires Sentences to be annotated before.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
ignoreMissingFeatures	Process anyway, even if the model relies on features that are not supported by this component. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 99. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 100. Models
Language	Variant	Version
en	default	20160802.0

OpenNLP POS-Tagger

Short name	OpenNlpPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger

Description

Part-of-Speech annotator using OpenNLP.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`

Table 101. Capabilities
Inputs	Sentence Token
Outputs	POS
Languages	see available models

Table 102. Models
Language	Variant	Version
da	maxent	20120616.1
da	perceptron	20120616.1
de	maxent	20120616.1
de	perceptron	20120616.1
en	maxent	20120616.1
en	perceptron	20120616.1
en	perceptron-ixa	20131115.1
es	maxent	20120410.1
es	maxent-ixa	20140425.1
es	maxent-universal	20120410.1
es	perceptron	20120410.1
es	perceptron-ixa	20131115.1
es	perceptron-universal	20120410.1
it	perceptron	20130618.0
nl	maxent	20120616.1
nl	perceptron	20120616.1
pt	maxent	20120616.1
pt	mm-maxent	20130121.1
pt	mm-perceptron	20130121.1
pt	perceptron	20120616.1
sv	maxent	20120616.1
sv	perceptron	20120616.1

OpenNLP POS-Tagger Trainer

Short name	OpenNlpPosTaggerTrainer
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTaggerTrainer

Description

Train a POS tagging model for OpenNLP.

Parameters

algorithm	Training algorithm. Type: String — Default value: `MAXENT`
beamSize	Type: Integer — Default value: `3`
cutoff	Frequency cut-off. Type: Integer — Default value: `5`
iterations	Number of training iterations. Type: Integer — Default value: `100`
language	Store this language to the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
targetLocation	Location to which the output is written. Type: String
trainerType	Trainer type. Type: String — Default value: `Event`

Table 103. Capabilities
Inputs	POS Sentence Token
Outputs	none specified
Languages	none specified

TreeTagger POS-Tagger

Short name	TreeTaggerPosTagger
Category	Part-of-speech tagger
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger

Description

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
executablePath	Use this TreeTagger executable instead of trying to locate the executable automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelEncoding	The character encoding used by the model. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
performanceMode	TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided. Type: Boolean — Default value: `false`
printTagSet	Log the tag set(s) when a model is loaded. Type: Boolean — Default value: `false`
writeLemma	Write lemma information. Type: Boolean — Default value: `true`
writePOS	Write part-of-speech information. Type: Boolean — Default value: `true`
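
The performanceMode description mentions a sanity check for tokens that contain line breaks. The sketch below shows what such a check could look like when run by the caller before enabling performanceMode. The class `TokenSanityCheck` is hypothetical and is not the TT4J implementation.

```java
import java.util.List;

// Hypothetical illustration of the kind of sanity check that performanceMode
// disables; this is not the TT4J implementation.
final class TokenSanityCheck {
    private TokenSanityCheck() {
    }

    // Returns the index of the first token containing a line break, or -1
    // if all tokens are free of line breaks.
    static int firstTokenWithLineBreak(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.indexOf('\n') >= 0 || token.indexOf('\r') >= 0) {
                return i;
            }
        }
        return -1;
    }
}
```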

Table 104. Capabilities
Inputs	Token
Outputs	POS Lemma
Languages	see available models

Table 105. Models
Language	Variant	Version
bg	le	20160430.1
de	le	20170316.1
en	le	20170220.1
es	le	20161222.1
et	le	20110124.1
fi	le	20140704.1
fr	le	20100111.1
gl	le	20130516.1
gmh	le	20161107.1
it	le	20141020.1
la	le	20110819.1
mn	le	20120925.1
nl	le	20130107.1
pl	le	20150506.1
pt	le	20101115.2
ru	le	20140505.1
sk	le	20130725.1
sw	le	20130729.1
zh	le	20101115.1

UDPipe MorphoDiTa Morphological Analyzer

Short name	UDPipePosTagger
Category	Part-of-speech tagger
Group ID	org.dkpro.core
Artifact ID	dkpro-core-udpipe-asl
Implementation	org.dkpro.core.udpipe.UDPipePosTagger

Description

Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe. UDPipe uses MorphoDiTa for this task, a Morphological Dictionary and Tagger.

Parameters

POSMappingLocation	Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 106. Capabilities
Inputs	Sentence Token
Outputs	MorphologicalFeatures POS Lemma
Languages	see available models

Table 107. Models
Language	Variant	Version
en	ud	20160523.1
no	ud	20160523.1

Phonetic Transcriptor

Table 108. Analysis Components in category Phonetic Transcriptor (4)
Component	Description
ColognePhoneticTranscriptor	Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.
DoubleMetaphonePhoneticTranscriptor	Double-Metaphone phonetic transcription based on Apache Commons Codec.
MetaphonePhoneticTranscriptor	Metaphone phonetic transcription based on Apache Commons Codec.
SoundexPhoneticTranscriptor	Soundex phonetic transcription based on Apache Commons Codec.

Commons Codec Cologne Phonetic Transcriptor

Short name	ColognePhoneticTranscriptor
Category	Phonetic Transcriptor
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.commonscodec.ColognePhoneticTranscriptor

Description

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Table 109. Capabilities
Inputs	Token
Outputs	PhoneticTranscription
Languages	de

Commons Codec Double-Metaphone Phonetic Transcriptor

Short name	DoubleMetaphonePhoneticTranscriptor
Category	Phonetic Transcriptor
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor

Description

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 110. Capabilities
Inputs	Token
Outputs	PhoneticTranscription
Languages	none specified

Commons Codec Metaphone Phonetic Transcriptor

Short name	MetaphonePhoneticTranscriptor
Category	Phonetic Transcriptor
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor

Description

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 111. Capabilities
Inputs	Token
Outputs	PhoneticTranscription
Languages	none specified

Commons Codec Soundex Phonetic Transcriptor

Short name	SoundexPhoneticTranscriptor
Category	Phonetic Transcriptor
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.commonscodec.SoundexPhoneticTranscriptor

Description

Soundex phonetic transcription based on Apache Commons Codec. Works for English.
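
To illustrate what a Soundex transcription produces, here is a simplified American Soundex written with the Java standard library only. It is a sketch for illustration: the class `SimpleSoundex` is hypothetical, it is not the Apache Commons Codec implementation that this component wraps, and edge cases may be handled differently there.

```java
// Simplified American Soundex for illustration only; not the Apache Commons
// Codec implementation.
final class SimpleSoundex {
    // One code per letter A..Z: digits for consonant groups, '0' for vowels
    // and Y (which separate equal codes), '-' for H and W (which do not).
    private static final String CODES = "0123" // A B C D
            + "012-" // E F G H
            + "0224" // I J K L
            + "5501" // M N O P
            + "2623" // Q R S T
            + "01-2" // U V W X
            + "02"; // Y Z

    private SimpleSoundex() {
    }

    static String encode(String word) {
        // Keep ASCII letters only, in upper case.
        StringBuilder letters = new StringBuilder();
        for (char c : word.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                letters.append(upper);
            }
        }
        if (letters.length() == 0) {
            return "";
        }
        // The first letter is kept as-is; its code still suppresses a
        // directly following letter of the same group.
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char code = CODES.charAt(letters.charAt(i) - 'A');
            if (code == '-') {
                continue; // H and W are skipped without resetting 'last'
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code;
        }
        // Pad with zeros to the fixed length of four characters.
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }
}
```

With this sketch, "Robert" and "Rupert" both encode to `R163`, which is the intended effect of a phonetic code: similar-sounding names share a key.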

Table 112. Capabilities
Inputs	Token
Outputs	PhoneticTranscription
Languages	en

Segmenter

Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a segmenter component to avoid the need to reorder the components in a pipeline when replacing one segmenter with another.

Table 113. Analysis Components in category Segmenter (25)
Component	Description
AnnotationByLengthFilter	Removes annotations that do not conform to minimum or maximum length constraints.
ArktweetTokenizer	ArkTweet tokenizer.
CamelCaseTokenSegmenter	Split up existing tokens again if they are camel-case text.
ClearNlpSegmenter	Tokenizer using Clear NLP.
CoreNlpSegmenter	Tokenizer and sentence splitter using Stanford CoreNLP.
StanfordSegmenter	Stanford sentence splitter and tokenizer.
GermanSeparatedParticleAnnotator	Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.
GosenSegmenter	Segmenter for Japanese text based on GoSen.
IcuSegmenter	ICU segmenter.
JTokSegmenter	JTok segmenter.
BreakIteratorSegmenter	BreakIterator segmenter.
LanguageToolSegmenter	Segmenter using LanguageTool to do the heavy lifting.
LineBasedSentenceSegmenter	Annotates each line in the source text as a sentence.
LingPipeSegmenter	LingPipe segmenter.
Nlp4JSegmenter	Segmenter using Emory NLP4J.
OpenNlpSegmenter	Tokenizer and sentence splitter using OpenNLP.
OpenNlpSentenceTrainer	Train a sentence splitter model for OpenNLP.
OpenNlpTokenTrainer	Train a tokenizer model for OpenNLP.
ParagraphSplitter	This class creates paragraph annotations for the given input document.
PatternBasedTokenSegmenter	Split up existing tokens again at particular split-chars.
RegexSegmenter	This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
TokenMerger	Merges any Tokens that are covered by a given annotation type.
UDPipeSegmenter	Tokenizer and sentence splitter using UDPipe.
WhitespaceSegmenter	A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.
TokenTrimmer	Remove prefixes and suffixes from tokens.

Annotation-By-Length Filter

Short name	AnnotationByLengthFilter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter

Description

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameters

FilterTypes	A set of annotation types that should be filtered. Type: String[] — Default value: `[]`
MaxLengthFilter	Any annotation in filterTypes longer than this value will be removed. Type: Integer — Default value: `1000`
MinLengthFilter	Any annotation in filterTypes shorter than this value will be removed. Type: Integer — Default value: `0`
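
The length constraints can be illustrated with a minimal plain-Java sketch. It does not use the DKPro Core API: the class name `LengthFilterSketch` and the method `filterByLength` are hypothetical, annotations are represented only by their covered text, and the bounds are assumed to be inclusive.

```java
import java.util.ArrayList;
import java.util.List;

final class LengthFilterSketch {
    /**
     * Keeps only the covered texts whose length lies within
     * [minLength, maxLength]. Inclusive bounds are an assumption.
     */
    static List<String> filterByLength(List<String> coveredTexts, int minLength, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String text : coveredTexts) {
            int length = text.length();
            if (length >= minLength && length <= maxLength) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "a" is too short and "segmenter" is too long for the range 2..5.
        System.out.println(filterByLength(List.of("a", "of", "segmenter"), 2, 5));
    }
}
```

With the documented defaults (`0` and `1000`), practically every annotation passes the filter, so at least one of the two values normally has to be changed.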

ArkTweet Tokenizer

Short name	ArktweetTokenizer
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetTokenizer

Description

ArkTweet tokenizer.

CamelCase Token Segmenter

Short name	CamelCaseTokenSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.CamelCaseTokenSegmenter

Description

Split up existing tokens again if they are camel-case text.

Parameters

deleteCover	Whether to remove the original token. Type: Boolean — Default value: `true`
markupType	Optional annotation type to markup the original covered token area when specified. This type must be a subtype of Annotation. Optional — Type: String

Table 114. Capabilities
Inputs	Token
Outputs	Token
Languages	none specified
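
As an illustration of camel-case splitting, the following plain-Java sketch splits a token wherever a lower-case letter is directly followed by an upper-case letter. The class `CamelCaseSketch` and the boundary rule are assumptions made for this example; the actual component may use a different rule.

```java
import java.util.Arrays;
import java.util.List;

final class CamelCaseSketch {
    // Zero-width split position between a lower-case and an upper-case letter.
    private static final String BOUNDARY = "(?<=\\p{Ll})(?=\\p{Lu})";

    /** Splits a single token into its camel-case parts. */
    static List<String> split(String token) {
        return Arrays.asList(token.split(BOUNDARY));
    }

    public static void main(String[] args) {
        System.out.println(split("camelCaseToken"));
        System.out.println(split("token"));
    }
}
```

Under this rule a run of upper-case letters such as an acronym is not split internally, because no lower-case letter precedes the inner capitals.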

ClearNLP Segmenter

Short name	ClearNlpSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter

Description

Tokenizer using Clear NLP.

Parameters

language	The language. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 115. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	en

Table 116. Models
Language	Variant	Version
en	default	20131111.0
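
The strictZoning and zoneTypes parameters recur in most segmenters in this category. The following plain-Java sketch shows one way to read their description. It is not the DKPro Core implementation: the class `ZoningSketch` is hypothetical, zones are given as `[begin, end)` offset pairs, an empty zone list stands in for "no zone type specified", and adding the document start and end to the boundary list is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class ZoningSketch {
    /** Returns the text regions that would be handed to the segmenter. */
    static List<String> regions(String text, List<int[]> zones, boolean strictZoning) {
        List<String> regions = new ArrayList<>();
        if (strictZoning) {
            if (zones.isEmpty()) {
                regions.add(text); // the whole document is treated as one zone
            } else {
                for (int[] zone : zones) {
                    regions.add(text.substring(zone[0], zone[1])); // only inside zones
                }
            }
            return regions;
        }
        // Non-strict: collect all zone boundaries and segment between them.
        SortedSet<Integer> boundaries = new TreeSet<>();
        boundaries.add(0);             // assumption: document start is a boundary
        boundaries.add(text.length()); // assumption: document end is a boundary
        for (int[] zone : zones) {
            boundaries.add(zone[0]);
            boundaries.add(zone[1]);
        }
        Integer begin = null;
        for (int end : boundaries) {
            if (begin != null) {
                regions.add(text.substring(begin, end));
            }
            begin = end;
        }
        return regions;
    }

    public static void main(String[] args) {
        String text = "Title. Body text. Footer";
        List<int[]> zones = List.of(new int[] {7, 17});
        System.out.println(regions(text, zones, true));
        System.out.println(regions(text, zones, false));
    }
}
```

In this reading, strict zoning ignores text outside the zone, while non-strict zoning still segments the whole document but never lets a sentence or token cross a zone boundary.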

CoreNLP Segmenter

Short name	CoreNlpSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.corenlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpSegmenter

Description

Tokenizer and sentence splitter using Stanford CoreNLP.

Parameters

boundaryMultiTokenRegex	A TokensRegex multi-token pattern for finding boundaries. Optional — Type: String
boundaryToDiscard	The set of regular expressions for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: `[, NL]`
boundaryTokenRegex	The set of boundary tokens. Optional — Type: String — Default value: `[.\u3002]\|[!?\uFF01\uFF1F]+`
htmlElementsToDiscard	These are elements like "p" or "sent", which will be wrapped into regular expressions for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. Optional — Type: String[]
language	The language. Optional — Type: String
modelLocation	Location from which the model is read. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
newlineIsSentenceBreak	Strategy for treating newlines as sentence breaks. Optional — Type: String — Default value: `two`
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
tokenRegexesToDiscard	The set of regular expressions for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: `[]`
tokenizationOption	Additional options that should be passed to the tokenizers. Optional — Type: String
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 117. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	none specified

CoreNLP Segmenter (old API)

Short name	StanfordSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter

Description

Stanford sentence splitter and tokenizer.

Parameters

allowEmptySentences	Whether to generate empty sentences. Type: Boolean — Default value: `false`
boundaryFollowersRegex	This is a Set of String that are matched with .equals() which are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")". Optional — Type: String — Default value: `[\\p{Pe}\\p{Pf}\"'>\uFF02\uFF07\uFF1E]\|''\|-R[CRS]B-`
boundaryToDiscard	The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: `[, NL]`
boundaryTokenRegex	The set of boundary tokens. If null, use default. Optional — Type: String — Default value: `[.\u3002]\|[!?\uFF01\uFF1F]+`
isOneSentence	Whether to treat all input as one sentence. Type: Boolean — Default value: `false`
language	The language. Optional — Type: String
languageFallback	If this component is not configured for a specific language and if the language stored in the document metadata is not supported, use the given language as a fallback. Optional — Type: String
newlineIsSentenceBreak	Strategy for treating newlines as sentence breaks. Optional — Type: String — Default value: `TWO_CONSECUTIVE`
regionElementRegex	A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
tokenRegexesToDiscard	The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: `[]`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
xmlBreakElementsToDiscard	These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. Optional — Type: String[]
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 118. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	en, es, fr

German Separated Particle Annotator

Short name	GermanSeparatedParticleAnnotator
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.GermanSeparatedParticleAnnotator

Description

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. In many usages of German particle verbs the stem and the particle are separated, e.g. Wir fangen gleich an. The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an", so the proper verb lemma "anfangen" is not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of a particle verb (e.g. fangen) with the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.

Table 119. Capabilities
Inputs	POS Lemma Sentence Token
Outputs	Lemma
Languages	de
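
The lemma replacement described above can be sketched in plain Java. This is an illustration only, not the component's code: the class `ParticleVerbSketch` is hypothetical, a sentence is given as parallel lists of lemmas and STTS tags, and pairing each separated particle (`PTKVZ`) with the closest preceding finite or imperative full verb (`VVFIN`, `VVIMP`) is an assumed heuristic.

```java
import java.util.ArrayList;
import java.util.List;

final class ParticleVerbSketch {
    /**
     * Returns a copy of the lemmas in which the lemma of a verb stem is
     * replaced by particle + stem when a separated particle follows it.
     * The particle's own lemma is left unchanged.
     */
    static List<String> mergeParticleLemmas(List<String> lemmas, List<String> posTags) {
        List<String> result = new ArrayList<>(lemmas);
        int lastVerb = -1; // index of the most recent candidate verb stem
        for (int i = 0; i < lemmas.size(); i++) {
            String pos = posTags.get(i);
            if (pos.equals("VVFIN") || pos.equals("VVIMP")) {
                lastVerb = i;
            } else if (pos.equals("PTKVZ") && lastVerb >= 0) {
                result.set(lastVerb, lemmas.get(i) + lemmas.get(lastVerb));
                lastVerb = -1; // each verb takes at most one particle
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Wir fangen gleich an."
        List<String> lemmas = List.of("wir", "fangen", "gleich", "an");
        List<String> pos = List.of("PPER", "VVFIN", "ADV", "PTKVZ");
        System.out.println(mergeParticleLemmas(lemmas, pos));
    }
}
```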

Gosen Segmenter

Short name	GosenSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.gosen-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.gosen.GosenSegmenter

Description

Segmenter for Japanese text based on GoSen.

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 120. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	ja

ICU Segmenter

Short name	IcuSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.icu-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.icu.IcuSegmenter

Description

ICU segmenter.

Parameters

language	The language. Optional — Type: String
splitAtApostrophe	Per default, the segmenter does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered. Type: Boolean — Default value: `false`
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 121. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	af, ak, am, ar, as, az, be, bg, bm, bn, bo, br, bs, ca, ce, cs, cy, da, de, dz, ee, el, en, eo, es, et, eu, fa, ff, fi, fo, fr, fy, ga, gd, gl, gu, gv, ha, hi, hr, hu, hy, ig, ii, is, it, ja, ka, ki, kk, kl, km, kn, ko, ks, kw, ky, lb, lg, ln, lo, lt, lu, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, nd, ne, nl, nn, om, or, os, pa, pl, ps, pt, qu, rm, rn, ro, ru, rw, se, sg, si, sk, sl, sn, so, sq, sr, sv, sw, ta, te, tg, th, ti, to, tr, tt, ug, uk, ur, uz, vi, wo, yo, zh, zu

JTok Segmenter

Short name	JTokSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.jtok-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.jtok.JTokSegmenter

Description

JTok segmenter.

Parameters

language	The language. Optional — Type: String
ptbEscaping	Use PTB-escaping when setting the token form. Type: Boolean — Default value: `false`
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeParagraph	Create Paragraph annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 122. Capabilities
Inputs	none specified
Outputs	Paragraph Sentence Token
Languages	de, en, it

Java BreakIterator Segmenter

Short name	BreakIteratorSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter

Description

BreakIterator segmenter.

Parameters

language	The language. Optional — Type: String
splitAtApostrophe	Per default the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered. Type: Boolean — Default value: `false`
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 123. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	ar, be, bg, ca, cs, da, de, el, en, es, et, fi, fr, ga, hi, hr, hu, is, it, ja, ko, lt, lv, mk, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh
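
For orientation, the following sketch uses the JDK class `java.text.BreakIterator` directly to obtain sentences and tokens. It only illustrates the underlying JDK API; the class `BreakIteratorSketch` and its helper methods are hypothetical and do not reflect how the component creates annotations.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {
    /** Collects the non-blank segments between consecutive break positions. */
    private static List<String> segments(BreakIterator iterator, String text) {
        List<String> result = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String segment = text.substring(start, end).trim();
            if (!segment.isEmpty()) {
                result.add(segment);
            }
        }
        return result;
    }

    static List<String> sentences(String text, Locale locale) {
        return segments(BreakIterator.getSentenceInstance(locale), text);
    }

    static List<String> tokens(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello world. How are you?", Locale.ENGLISH));
        System.out.println(tokens("John's car.", Locale.ENGLISH));
    }
}
```

The word iterator keeps a contraction such as John's together, which is the default behavior that the splitAtApostrophe parameter is meant to change.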

LanguageTool Segmenter

Short name	LanguageToolSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter

Description

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 124. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Line-based Sentence Segmenter

Short name	LineBasedSentenceSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.LineBasedSentenceSegmenter

Description

Annotates each line in the source text as a sentence. This segmenter does not create tokens, so the token-related parameters (writeToken, writeForm) have no effect.

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 125. Capabilities
Inputs	none specified
Outputs	Sentence
Languages	none specified

LingPipe Segmenter

Short name	LingPipeSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeSegmenter

Description

LingPipe segmenter.

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 126. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	none specified

NLP4J Segmenter

Short name	Nlp4JSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.nlp4j-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JSegmenter

Description

Segmenter using Emory NLP4J.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 127. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	none specified

OpenNLP Segmenter

Short name	OpenNlpSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter

Description

Tokenizer and sentence splitter using OpenNLP.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
segmentationModelLocation	Load the segmentation model from this location instead of locating the model automatically. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
tokenizationModelLocation	Load the tokenization model from this location instead of locating the model automatically. Optional — Type: String
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 128. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	see available models

Table 129. Models
Language	Variant	Version
da	maxent	20120616.1
da	maxent	20120616.1
de	maxent	20120616.1
de	maxent	20120616.1
en	maxent	20120616.1
en	maxent	20120616.1
it	maxent	20130618.0
it	maxent	20130618.0
nb	maxent	20120131.1
nb	maxent	20120131.1
nl	maxent	20120616.1
nl	maxent	20120616.1
pt	maxent	20120616.1
pt	maxent	20120616.1
sv	maxent	20120616.1
sv	maxent	20120616.1

OpenNLP Sentence Splitter Trainer

Short name	OpenNlpSentenceTrainer
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSentenceTrainer

Description

Train a sentence splitter model for OpenNLP.

Parameters

abbreviationDictionaryEncoding	Encoding of the abbreviation dictionary. Type: String — Default value: `UTF-8`
abbreviationDictionaryLocation	Location of the abbreviation dictionary. Optional — Type: String
algorithm	Training algorithm. Type: String — Default value: `MAXENT`
cutoff	Frequency cut-off. Type: Integer — Default value: `5`
eosCharacters	End-of-sentence characters. Optional — Type: String[]
iterations	Number of training iterations. Type: Integer — Default value: `100`
language	Store this language to the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
targetLocation	Location to which the output is written. Type: String
trainerType	Trainer type. Type: String — Default value: `Event`

Table 130. Capabilities
Inputs	Sentence
Outputs	none specified
Languages	none specified

OpenNLP Tokenizer Trainer

Short name	OpenNlpTokenTrainer
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpTokenTrainer

Description

Train a tokenizer model for OpenNLP.

Parameters

abbreviationDictionaryEncoding	Encoding of the abbreviation dictionary. Type: String — Default value: `UTF-8`
abbreviationDictionaryLocation	Location of the abbreviation dictionary. Optional — Type: String
algorithm	Training algorithm. Type: String — Default value: `MAXENT`
alphaNumericPattern	Regular expression to detect alpha numerics. Optional — Type: String — Default value: `^[A-Za-z0-9]+$`
cutoff	Frequency cut-off. Type: Integer — Default value: `5`
iterations	Number of training iterations. Type: Integer — Default value: `100`
language	Store this language to the model instead of the document language. Type: String
numThreads	Number of parallel threads. Type: Integer — Default value: `1`
targetLocation	Location to which the output is written. Type: String
trainerType	Trainer type. Type: String — Default value: `Event`
useAlphaNumericOptimization	If true alpha numerics are skipped. Type: Boolean — Default value: `true`

Table 131. Capabilities
Inputs	Token
Outputs	none specified
Languages	none specified

Paragraph Splitter

Short name	ParagraphSplitter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.ParagraphSplitter

Description

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

Parameters

splitPattern	A regular expression used to detect paragraph splits. Type: String — Default value: `((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)`

Table 132. Capabilities
Inputs	none specified
Outputs	Paragraph
Languages	none specified
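
The effect of the default split pattern can be tried out with `java.util.regex` alone. The sketch below is not the component's code: the class `ParagraphSketch` is hypothetical and returns paragraph texts instead of creating Paragraph annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphSketch {
    // The documented default: two or more Windows or Unix line breaks.
    static final Pattern SPLIT = Pattern.compile("((\\r\\n\\r\\n)(\\r\\n)*)|((\\n\\n)(\\n)*)");

    /** Returns the text between consecutive matches of the split pattern. */
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SPLIT.matcher(text);
        int begin = 0;
        while (matcher.find()) {
            if (matcher.start() > begin) {
                result.add(text.substring(begin, matcher.start()));
            }
            begin = matcher.end();
        }
        if (begin < text.length()) {
            result.add(text.substring(begin)); // trailing paragraph
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "First line\nstill first.\n\nSecond.\r\n\r\n\r\nThird.";
        System.out.println(paragraphs(text).size());
    }
}
```

A single line break does not end a paragraph, so the first paragraph in the example spans two lines.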

Pattern-based Token Segmenter

Short name	PatternBasedTokenSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter

Description

Split up existing tokens again at particular split-chars. The prefix states whether the split chars should be added as separate Tokens. If #INCLUDE_PREFIX precedes the split pattern, the matched split chars are added as Tokens. Split chars matched by patterns that follow #EXCLUDE_PREFIX are not added as Tokens.

Parameters

deleteCover	Whether to remove the original token. Type: Boolean — Default value: `true`
patterns	A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed. Type: String[]

Table 133. Capabilities
Inputs	Token
Outputs	Token
Languages	none specified
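
The difference between including and excluding the split chars can be shown with a small plain-Java sketch. The class `PatternSplitSketch` is hypothetical, and a boolean flag is used in place of the #INCLUDE_PREFIX and #EXCLUDE_PREFIX markers, whose literal values are not given here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternSplitSketch {
    /**
     * Splits a token at every match of splitPattern. When includeMatch is
     * true, the matched split chars become parts of their own.
     */
    static List<String> split(String token, Pattern splitPattern, boolean includeMatch) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = splitPattern.matcher(token);
        int begin = 0;
        while (matcher.find()) {
            if (matcher.start() > begin) {
                parts.add(token.substring(begin, matcher.start()));
            }
            if (includeMatch && matcher.end() > matcher.start()) {
                parts.add(matcher.group());
            }
            begin = matcher.end();
        }
        if (begin < token.length()) {
            parts.add(token.substring(begin));
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern hyphen = Pattern.compile("-");
        System.out.println(split("state-of-the-art", hyphen, false));
        System.out.println(split("state-of-the-art", hyphen, true));
    }
}
```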

Regex Segmenter

Short name	RegexSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.RegexSegmenter

Description

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behavior is to split sentences by a line break and tokens by whitespace.

Parameters

language	The language. Optional — Type: String
sentenceBoundaryRegex	Defines the pattern that is used as sentence boundary. Type: String — Default value: `\n`
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
tokenBoundaryRegex	Defines the pattern that is used as token end boundary. When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n]. Type: String — Default value: `[\\s\n]+`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`
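
The remark on tokenBoundaryRegex can be checked with a small sketch based on java.util.regex. It does not call the RegexSegmenter; the class and method names are hypothetical, and keeping the trailing remainder as a token is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenBoundarySketch {
    /** Returns the spans that end at a boundary match or at the end of the text. */
    static List<String> tokens(String text, String boundaryRegex) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(boundaryRegex).matcher(text);
        int begin = 0;
        while (m.find()) {
            if (m.start() > begin) {
                result.add(text.substring(begin, m.start()));
            }
            begin = m.end();
        }
        if (begin < text.length()) {
            result.add(text.substring(begin)); // remainder after the last boundary
        }
        return result;
    }
}
```

For the input "tokenized-text\n", the pattern [-\n] yields the two tokens "tokenized" and "text", whereas the pattern - alone leaves the line break attached to the second token.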

Table 134. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	none specified

Token Merger

Short name	TokenMerger
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger

Description

Merges any Tokens that are covered by a given annotation type. For example, this component can be used to create a single token from all tokens that constitute a multi-token named entity.

Parameters

POSMappingLocation	Override the tagset mapping. Optional — Type: String
annotationType	Annotation type for which tokens should be merged. Type: String
constraint	A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity. Optional — Type: String
cposValue	Set a new coarse POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset). Optional — Type: String
language	Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String
lemmaMode	Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma or LEAVE the lemma of the first token as-is. Type: String — Default value: `JOIN`
posType	Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set. Optional — Type: String
posValue	Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset). Optional — Type: String
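
The three lemmaMode values can be pictured with the following sketch. It only models the lemma string of the merged token; the enum and method are hypothetical stand-ins, not the component's own types.

```java
import java.util.List;

final class LemmaModeSketch {
    enum LemmaMode { JOIN, REMOVE, LEAVE }

    /** Returns the lemma of the merged token, or null if it has none. */
    static String mergedLemma(List<String> lemmas, LemmaMode mode) {
        switch (mode) {
            case JOIN:
                return String.join(" ", lemmas); // space-separated lemmata
            case REMOVE:
                return null;                      // merged token carries no lemma
            case LEAVE:
                return lemmas.isEmpty() ? null : lemmas.get(0); // first lemma as-is
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

For the lemmata "New" and "York", JOIN gives "New York", LEAVE gives "New", and REMOVE gives no lemma.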

Table 135. Capabilities
Inputs	POS Lemma Token
Outputs	Lemma
Languages	none specified

UDPipe Segmenter

Short name	UDPipeSegmenter
Category	Segmenter
Group ID	org.dkpro.core
Artifact ID	dkpro-core-udpipe-asl
Implementation	org.dkpro.core.udpipe.UDPipeSegmenter

Description

Tokenizer and sentence splitter using UDPipe.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 136. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	see available models

Table 137. Models
Language	Variant	Version
en	ud	20160523.1
no	ud	20160523.1

Whitespace Segmenter

Short name	WhitespaceSegmenter
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.WhitespaceSegmenter

Description

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.
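
A minimal sketch of this behaviour, assuming one sentence per non-empty line and tokens separated by runs of whitespace. It works on plain strings rather than on a CAS, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WhitespaceSketch {
    /** Returns one inner list of tokens per non-empty line. */
    static List<List<String>> segment(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : text.split("\\R")) {      // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                             // blank lines produce no sentence
            }
            sentences.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return sentences;
    }
}
```

Punctuation stays attached to the neighbouring word, since only whitespace and line breaks act as boundaries.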

Parameters

language	The language. Optional — Type: String
strictZoning	Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: `false`
writeForm	Create TokenForm annotations. Type: Boolean — Default value: `true`
writeSentence	Create Sentence annotations. Type: Boolean — Default value: `true`
writeToken	Create Token annotations. Type: Boolean — Default value: `true`
zoneTypes	A list of type names used for zoning. Optional — Type: String[] — Default value: `[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]`

Table 138. Capabilities
Inputs	none specified
Outputs	Sentence Token
Languages	none specified

de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer

Short name	TokenTrimmer
Category	Segmenter
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.tokit-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer

Description

Remove prefixes and suffixes from tokens.

Parameters

prefixes	List of prefixes to remove. Type: String[]
suffixes	List of suffixes to remove. Type: String[]
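
The effect of the two lists can be sketched on plain strings as follows. Removing at most one prefix and one suffix per token is an assumption of this sketch; the class name is hypothetical.

```java
import java.util.List;

final class TokenTrimSketch {
    /** Removes the first matching prefix and the first matching suffix. */
    static String trim(String token, List<String> prefixes, List<String> suffixes) {
        String result = token;
        for (String p : prefixes) {
            if (!p.isEmpty() && result.startsWith(p)) {
                result = result.substring(p.length());
                break;
            }
        }
        for (String s : suffixes) {
            if (!s.isEmpty() && result.endsWith(s)) {
                result = result.substring(0, result.length() - s.length());
                break;
            }
        }
        return result;
    }
}
```

With the prefix "(" and the suffix ")", the token "(example)" becomes "example"; tokens without a listed prefix or suffix are returned unchanged.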

Table 139. Capabilities
Inputs	Token
Outputs	Token
Languages	none specified

Semantic role labeler

Table 140. Analysis Components in category Semantic role labeler (2)
Component	Description
ClearNlpSemanticRoleLabeler	ClearNLP semantic role labeller.
MateSemanticRoleLabeler	Annotator for the MateTools Semantic Role Labeler.

ClearNLP Semantic Role Labeler

Short name	ClearNlpSemanticRoleLabeler
Category	Semantic role labeler
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler

Description

ClearNLP semantic role labeller.

Parameters

expandArguments	Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word. Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost. The arguments will appear to be spanning a continuous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument. Type: Boolean — Default value: `false`
language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelVariant	Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String
predModelLocation	Location from which the predicate identifier model is read. Optional — Type: String
printTagSet	Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: `false`
roleModelLocation	Location from which the roleset classification model is read. Optional — Type: String
srlModelLocation	Location from which the semantic role labeling model is read. Optional — Type: String
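
The expansion described for expandArguments amounts to taking the minimum begin offset and the maximum end offset over a head word and all of its descendants. The sketch below shows this on a dependency tree encoded as plain arrays; the encoding and the names are assumptions made for illustration, and the input is assumed to be a proper tree without cycles.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ExpandArgumentSketch {
    /**
     * @param head  head[i] is the index of the governor of token i, or -1 for the root
     * @param begin begin offset of each token
     * @param end   end offset of each token
     * @param arg   index of the argument's head word
     * @return {begin, end} covering the head word and all of its descendants
     */
    static int[] expand(int[] head, int[] begin, int[] end, int arg) {
        int min = begin[arg];
        int max = end[arg];
        Deque<Integer> todo = new ArrayDeque<>();
        todo.push(arg);
        while (!todo.isEmpty()) {
            int current = todo.pop();
            min = Math.min(min, begin[current]);
            max = Math.max(max, end[current]);
            for (int i = 0; i < head.length; i++) {
                if (head[i] == current) {
                    todo.push(i); // visit the dependents of the current token
                }
            }
        }
        return new int[] { min, max };
    }
}
```

For "The old man sleeps" with "man" as the head word of an argument, the expanded span covers "The old man". Because only the outermost offsets are kept, any gap between descendants is lost, which is the first caveat mentioned above.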

Table 141. Capabilities
Inputs	POS Lemma Sentence Token Dependency
Outputs	SemArg SemPred
Languages	see available models

Table 142. Models
Language	Variant	Version
en	mayo	20131111.0
en	ontonotes	20131128.0

Mate Tools Semantic Role Labeler

Short name	MateSemanticRoleLabeler
Category	Semantic role labeler
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler

Description

Annotator for the MateTools Semantic Role Labeler.

If you use the semantic role labeler, please cite the following paper: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43-48, Boulder, June 4-5, 2009.

Parameters

language	Use this language instead of the document language to resolve the model. Optional — Type: String
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Load the model from this location instead of locating the model automatically. Optional — Type: String
modelVariant	Override the default variant used to locate the model. Optional — Type: String

Table 143. Capabilities
Inputs	POS Lemma Sentence Token Dependency
Outputs	SemArg SemPred
Languages	see available models

Table 144. Models
Language	Variant	Version
de	tiger	20130105.0
en	conll2009	20130117.0
es	conll2009	20130320.0
zh	conll2009	20130117.0

Stemmer

Table 145. Analysis Components in category Stemmer (3)
Component	Description
CisStemmer	UIMA wrapper for the CISTEM algorithm.
LancasterStemmer	This Paice/Husk Lancaster stemmer implementation only works with the English language so far.
SnowballStemmer	UIMA wrapper for the Snowball stemmer.

CIS Stemmer

Short name	CisStemmer
Category	Stemmer
Group ID	org.dkpro.core
Artifact ID	dkpro-core-cisstem-asl
Implementation	org.dkpro.core.cisstem.CisStemmer

Description

UIMA wrapper for the CISTEM algorithm.

CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that includes stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameters

filterConditionOperator	Specifies the operator for a filtering condition. It is only used if `PARAM_FILTER_FEATUREPATH` is set. Optional — Type: String
filterConditionValue	Specifies the value for a filtering condition. It is only used if `PARAM_FILTER_FEATUREPATH` is set. Optional — Type: String
filterFeaturePath	Specifies a feature path that is used in the filter. If this is set, you also have to specify `PARAM_FILTER_CONDITION_OPERATOR` and `PARAM_FILTER_CONDITION_VALUE`. Optional — Type: String
lowerCase	By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer. Optional — Type: Boolean — Default value: `false`
paths	Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path. Optional — Type: String[]

Table 146. Capabilities
Inputs	none specified
Outputs	Stem
Languages	de

Lancaster Stemmer

Short name	LancasterStemmer
Category	Stemmer
Group ID	org.dkpro.core
Artifact ID	dkpro-core-lancaster-asl
Implementation	org.dkpro.core.lancaster.LancasterStemmer

Description

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

Parameters

filterConditionOperator	Specifies the operator for a filtering condition. It is only used if `PARAM_FILTER_FEATUREPATH` is set. Optional — Type: String
filterConditionValue	Specifies the value for a filtering condition. It is only used if `PARAM_FILTER_FEATUREPATH` is set. Optional — Type: String
filterFeaturePath	Specifies a feature path that is used in the filter. If this is set, you also have to specify `PARAM_FILTER_CONDITION_OPERATOR` and `PARAM_FILTER_CONDITION_VALUE`. Optional — Type: String
language	Specifies the language supported by the stemming model. Default value is "en" (English). Type: String — Default value: `en`
modelArtifactUri	URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model. The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin. Optional — Type: String
modelLocation	Specifies a URL that should resolve to a location from which to load custom rules. If the location starts with classpath:, it is interpreted as a classpath location, e.g. "classpath:my/path/to/the/rules". Otherwise it is tried as a URL, then as a file, and finally as a UIMA resource. Optional — Type: String
paths	Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path. Optional — Type: String[]
stripPrefix	True if the stemmer should strip prefixes such as kilo, micro, milli, intra, ultra, mega, nano, pico, pseudo. Type: Boolean — Default value: `false`

Table 147. Capabilities
Inputs	Token
Outputs	Stem
Languages	en

Snowball Stemmer

Short name	SnowballStemmer
Category	Stemmer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.snowball-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer

Description

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that includes stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameters

filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples
	false (default)	true
EDUCATIONAL	EDUCATIONAL	educ
Educational	Educat	educ
educational	educ	educ

Optional — Type: Boolean — Default value: false

paths

Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]
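
The effect of lowerCase can be reproduced in miniature with a toy stemmer. The sketch below does not use Snowball: its stand-in stemmer only strips a lower-case "ing" suffix, which is enough to show why upper-case input passes through unchanged in case-sensitive mode. All names are hypothetical, and the use of Locale.ROOT for lower-casing is an assumption.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

final class LowerCaseStemSketch {
    /** Toy stand-in for a case-sensitive stemmer: strips a lower-case "ing" suffix only. */
    static final UnaryOperator<String> TOY_STEMMER =
            word -> word.endsWith("ing") ? word.substring(0, word.length() - 3) : word;

    /** Optionally lower-cases the token before handing it to the stemmer. */
    static String stem(String token, boolean lowerCase) {
        String input = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
        return TOY_STEMMER.apply(input);
    }
}
```

Here stem("WALKING", false) leaves the token untouched because the suffix rule never matches, while stem("WALKING", true) reduces it to "walk".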

Table 148. Capabilities
Inputs	none specified
Outputs	Stem
Languages	da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, sv, tr

Topic Model

Topic modeling is a statistical approach to discovering abstract topics in a collection of documents. A topic is characterized by a probability distribution over the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability with which a document belongs to each of the topics in the model.

Table 149. Analysis Components in category Topic Model (2)
Component	Description
MalletLdaTopicModelInferencer	Infers the topic distribution over documents using a Mallet ParallelTopicModel.
MalletLdaTopicModelTrainer	Estimate an LDA topic model using Mallet and write it to a file.

Mallet LDA Topic Model Inferencer

Short name	MalletLdaTopicModelInferencer
Category	Topic Model
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer

Description

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameters

burnIn	The number of iterations before hyper-parameter optimization begins. Type: Integer — Default value: `1`
lowercase	If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: `false`
maxTopicAssignments	Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is set. Type: Integer — Default value: `0`
minTokenLength	Ignore tokens (or lemmas, respectively) that are shorter than the given value. Type: Integer — Default value: `3`
minTopicProb	Minimum topic proportion for the document-topic assignment. Type: Float — Default value: `0.2`
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Type: String
nIterations	The number of iterations during inference. Type: Integer — Default value: `100`
thinning	The number of iterations between saved samples. Type: Integer — Default value: `5`
tokenFeaturePath	The annotation type to use for the model. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
typeName	The annotation type to use as tokens. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`

Table 150. Capabilities
Inputs	Token
Outputs	TopicDistribution
Languages	none specified
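
How minTopicProb and maxTopicAssignments might interact can be sketched as below. This is an illustration of the parameter descriptions, not the component's code: the class is hypothetical, and both the integer division for the fallback limit and the ordering by descending probability are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopicAssignmentSketch {
    /** Returns the indices of the assigned topics, most probable first. */
    static List<Integer> assign(double[] dist, double minTopicProb, int maxTopicAssignments) {
        // Fallback when the limit is not set (<= 0): number of topics divided by 10.
        int limit = maxTopicAssignments > 0 ? maxTopicAssignments : dist.length / 10;
        List<Integer> candidates = new ArrayList<>();
        for (int topic = 0; topic < dist.length; topic++) {
            if (dist[topic] >= minTopicProb) {
                candidates.add(topic);
            }
        }
        candidates.sort(Comparator.comparingDouble((Integer t) -> dist[t]).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }
}
```

For the distribution {0.05, 0.5, 0.3, 0.15} with minTopicProb 0.2 and a limit of 2, topics 1 and 2 are assigned. Under the integer-division assumption, a model with fewer than ten topics and no explicit limit would receive no assignments in this sketch.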

Mallet LDA Topic Model Trainer

Short name	MalletLdaTopicModelTrainer
Category	Topic Model
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.mallet-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelTrainer

Description

Estimates an LDA topic model using Mallet and writes it to a file. All incoming CASes are stored as Mallet Instances before the model is estimated using a ParallelTopicModel.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters

alphaSum	The sum of alphas over all topics. Another recommended value is 50 / T (number of topics). Type: Float — Default value: `1.0`
beta	Beta for a single dimension of the Dirichlet prior. Type: Float — Default value: `0.01`
burninPeriod	The number of iterations before hyper-parameter optimization begins. Type: Integer — Default value: `100`
compression	Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: `NONE`
coveringAnnotationType	If specified, the text contained in annotations of the given segmentation type is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored. By default, the full text is used as a document. Type: String — Default value: ``
displayInterval	The interval in which to display the estimated topics. Type: Integer — Default value: `50`
displayNTopicWords	The number of top words to display during estimation. Type: Integer — Default value: `7`
escapeDocumentId	URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: `true`
filterRegex	Regular expression of tokens to be filtered. Type: String — Default value: ``
filterRegexReplacement	Value with which tokens matching the regular expression are replaced. Type: String — Default value: ``
lowercase	If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: `false`
minTokenLength	Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Type: Integer — Default value: `3`
nIterations	The number of iterations during model estimation. Type: Integer — Default value: `1000`
nTopics	The number of topics to estimate. Type: Integer — Default value: `10`
numThreads	The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int). Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating. Type: Integer — Default value: `0`
optimizeInterval	Interval for optimizing Dirichlet hyper-parameters. Type: Integer — Default value: `50`
overwrite	Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: `false`
paramStopwordsFile	The location of the stopwords file. Type: String — Default value: ``
paramStopwordsReplacement	If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP). Type: String — Default value: ``
randomSeed	Set random seed. If set to -1 (default), uses random generator. Type: Integer — Default value: `-1`
saveInterval	Define how frequently an intermediate serialized model is saved to disk during estimation. Type: Integer — Default value: `0`
singularTarget	Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: `false`
stripExtension	Remove the original extension. Type: Boolean — Default value: `false`
targetLocation	Target location. If this parameter is not set, data is written to stdout. Optional — Type: String
tokenFeaturePath	The annotation type to use as input tokens for the model estimation. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
useCharacters	If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored. Type: Boolean — Default value: `false`
useDocumentId	Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: `false`
useSymmetricAlpha	Use a symmetric alpha value during model estimation? Type: Boolean — Default value: `false`
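
The combined effect of lowercase, minTokenLength, paramStopwordsFile and paramStopwordsReplacement on the token stream can be pictured with the sketch below. It operates on plain strings, leaves out filterRegex, and the order in which the steps are applied is an assumption; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenFilterSketch {
    /** Applies lower-casing, the length filter and stopword handling, in that order. */
    static List<String> prepare(List<String> tokens, boolean lowercase, int minTokenLength,
            Set<String> stopwords, String stopwordReplacement) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String t = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            if (t.length() < minTokenLength) {
                continue; // shorter than the minimum length
            }
            if (stopwords.contains(t)) {
                if (stopwordReplacement.isEmpty()) {
                    continue; // no replacement configured: drop the stopword
                }
                t = stopwordReplacement;
            }
            result.add(t);
        }
        return result;
    }
}
```

With lower-casing enabled, a minimum length of 3 and the stopword "the", the tokens "The cat is on the mat" reduce to "cat" and "mat"; with the replacement "STOP", each occurrence of "the" is kept as "STOP" instead.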

Transformer

Table 151. Analysis Components in category Transformer (15)
Component	Description
ApplyChangesAnnotator	Applies changes annotated using a SofaChangeAnnotation.
Backmapper	After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.
CapitalizationNormalizer	Takes a text and replaces wrong capitalization
CjfNormalizer	Converts traditional Chinese to simplified Chinese or vice-versa.
DictionaryBasedTokenTransformer	Reads a tab-separated file containing mappings from one token to another.
ExpressiveLengtheningNormalizer	Takes a text and shortens extra long words
FileBasedTokenTransformer	Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
HyphenationRemover	Simple dictionary-based hyphenation remover.
RegexBasedTokenTransformer	A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expression.
ReplacementFileNormalizer	Takes a text and replaces desired expressions.
SharpSNormalizer	Takes a text and replaces sharp s
SpellingNormalizer	Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
StanfordPtbTransformer	Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.
TokenCaseTransformer	Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.
UmlautNormalizer	Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

CAS Transformation - Apply

Short name	ApplyChangesAnnotator
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.castransformation.ApplyChangesAnnotator

Description

Applies changes annotated using a SofaChangeAnnotation.

Table 152. Capabilities
Inputs	DocumentMetaData SofaChangeAnnotation
Outputs	DocumentMetaData SofaChangeAnnotation
Languages	none specified

CAS Transformation - Map back

Short name	Backmapper
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.castransformation.Backmapper

Description

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

This annotator is able to resume the mapping after a CAS restore from any point after the cleaned view has been created, as long as no changes were made to SofaChangeAnnotations in the original view.

Parameters

Chain	Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A]. Optional — Type: String[] — Default value: `[source, target]`
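
Deriving the backmapping chain from the chain of views used when applying changes is a plain reversal. The helper below is a hypothetical illustration and not part of the component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class BackmapChainSketch {
    /** Reverses the order of the views that were used when applying changes. */
    static String[] backmapChain(String... appliedChain) {
        List<String> views = new ArrayList<>(Arrays.asList(appliedChain));
        Collections.reverse(views);
        return views.toArray(new String[0]);
    }
}
```

If changes were applied from view A to view B, backmapChain("A", "B") yields [B, A], which matches the example given for the Chain parameter.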

Capitalization Normalizer

Short name	CapitalizationNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer

Description

Takes a text and replaces wrong capitalization

Parameters

typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 153. Capabilities
Inputs	Token
Outputs	none specified
Languages	none specified

Chinese Traditional/Simplified Converter

Short name	CjfNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.languagetool.CjfNormalizer

Description

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameters

direction	Direction in which to perform the conversion (Direction#TO_TRADITIONAL or Direction#TO_SIMPLIFIED). Type: String — Default value: `TO_SIMPLIFIED`
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 154. Capabilities
Inputs	none specified
Outputs	none specified
Languages	zh

Dictionary-based Token Transformer

Short name	DictionaryBasedTokenTransformer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer

Description

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.

Parameters

commentMarker	Lines starting with this character (or String) are ignored. Type: String — Default value: `#`
modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Type: String
separator	Separator for mappings file. Type: String — Default value: `\t`
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`
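
The mapping file format and its effect on tokens can be sketched as follows. The sketch reads the mappings from a list of lines instead of a file and works on plain strings; silently skipping malformed lines is an assumption, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryMappingSketch {
    /** Parses "source<separator>target" lines, skipping comments and blank lines. */
    static Map<String, String> parse(List<String> lines, String separator, String commentMarker) {
        Map<String, String> mappings = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith(commentMarker)) {
                continue;
            }
            int pos = line.indexOf(separator);
            if (pos < 0) {
                continue; // malformed line: ignored in this sketch
            }
            mappings.put(line.substring(0, pos), line.substring(pos + separator.length()));
        }
        return mappings;
    }

    /** Replaces every token found in the first column by its second-column value. */
    static List<String> transform(List<String> tokens, Map<String, String> mappings) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(mappings.getOrDefault(token, token));
        }
        return result;
    }
}
```

With the lines "colour<TAB>color" and "centre<TAB>center", the tokens "the colour centre" become "the color center"; a line starting with "#" is ignored.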

Expressive Lengthening Normalizer

Short name	ExpressiveLengtheningNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer

Description

Takes a text and shortens extra long words

Parameters

typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 155. Capabilities
Inputs	Token
Outputs	none specified
Languages	none specified

File-based Token Transformer

Short name	FileBasedTokenTransformer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer

Description

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameters

ignoreCase	Match tokens against the dictionary without considering case. Type: Boolean — Default value: `false`
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Type: String
replacement	The value by which the matching tokens should be replaced. Type: String
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`
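
The interplay of the word list, replacement and ignoreCase can be sketched on plain strings. The word list is passed in directly rather than read from modelLocation, the class name is hypothetical, and lower-casing with Locale.ROOT is an assumption about how case is ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FileBasedReplacementSketch {
    /** Replaces every token that appears in the list by the replacement string. */
    static List<String> transform(List<String> tokens, List<String> listed,
            String replacement, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String entry : listed) {
            lookup.add(ignoreCase ? entry.toLowerCase(Locale.ROOT) : entry);
        }
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            result.add(lookup.contains(key) ? replacement : token);
        }
        return result;
    }
}
```

With the listed word "foo" and the replacement "X", the token "Foo" is replaced only when ignoreCase is true.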

Hyphenation Remover

Short name	HyphenationRemover
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.HyphenationRemover

Description

Simple dictionary-based hyphenation remover.

Parameters

modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Type: String
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Regex-based Token Transformer

Short name	RegexBasedTokenTransformer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer

Description

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expression.

The parameter #PARAM_REGEX defines the regular expression to search for; #PARAM_REPLACEMENT defines the string with which matches are replaced.

Parameters

regex	Defines the regular expression to be replaced. Type: String
replacement	Defines the string to replace matching tokens with. Type: String
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 156. Capabilities
Inputs	Token
Outputs	none specified
Languages	none specified
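
As a rough illustration of the regex and replacement parameters, the sketch below applies a regular expression to each token text. It is an assumption-laden stand-in, not the component's code: `RegexTokenReplacerSketch` and `transform` are hypothetical names, tokens are plain strings, and replacing every match inside a token (rather than only whole-token matches) is an assumption that the description above does not settle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class RegexTokenReplacerSketch {

    // Applies the regex to each token and replaces every match (assumed behaviour).
    static List<String> transform(List<String> tokens, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(pattern.matcher(token).replaceAll(replacement));
        }
        return result;
    }

    public static void main(String[] args) {
        // Masks every digit in every token.
        System.out.println(transform(List.of("call", "555-1234", "now"), "[0-9]", "#"));
    }
}
```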

Replacement File Normalizer

Short name	ReplacementFileNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.ReplacementFileNormalizer

Description

Takes a text and replaces desired expressions. This class should not work on tokens as some expressions might span several tokens.

Parameters

modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	Location of a file that contains all replacement characters. Type: String
srcExpressionSurroundings	Pattern describing valid left/right context of the source expression. Type: String — Default value: `IRRELEVANT`
targetExpressionSurroundings	Left/right context of the replacement. Type: String — Default value: `NOTHING`

Table 157. Capabilities
Inputs	Token
Outputs	SofaChangeAnnotation
Languages	none specified

Sharp S (ß) Normalizer

Short name	SharpSNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.SharpSNormalizer

Description

Takes a text and replaces sharp s (ß).

Parameters

minFrequencyThreshold	Minimum frequency count. Type: Integer — Default value: `100`
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 158. Capabilities
Inputs	none specified
Outputs	none specified
Languages	de

Spelling Normalizer

Short name	SpellingNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.SpellingNormalizer

Description

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameters

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[] — Default value: []

Table 159. Capabilities
Inputs	SpellingAnomaly
Outputs	none specified
Languages	none specified

Stanford Penn Treebank Normalizer

Short name	StanfordPtbTransformer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Implementation	de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPtbTransformer

Description

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameters

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[] — Default value: []

Token Case Transformer

Short name	TokenCaseTransformer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.TokenCaseTransformer

Description

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameters

tokenCase	The case to convert tokens to: UPPERCASE: uppercase everything. LOWERCASE: lowercase everything. NORMALCASE: retain first letter in word and after hyphens, lowercase everything else. Type: String
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`
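
The three casing modes can be sketched in plain Java as follows. This is a minimal illustration of the rules stated above, not the component's source: the class `TokenCaseSketch`, the enum `Mode`, and the method `convert` are hypothetical, and the sketch works on strings instead of Token annotations.

```java
import java.util.Locale;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class TokenCaseSketch {

    enum Mode { UPPERCASE, LOWERCASE, NORMALCASE }

    static String convert(String token, Mode mode) {
        switch (mode) {
        case UPPERCASE:
            return token.toUpperCase(Locale.ROOT);
        case LOWERCASE:
            return token.toLowerCase(Locale.ROOT);
        default: {
            // NORMALCASE: retain the first character and any character that
            // directly follows a hyphen; lowercase everything else.
            StringBuilder sb = new StringBuilder(token.length());
            boolean retain = true;
            for (int i = 0; i < token.length(); i++) {
                char c = token.charAt(i);
                sb.append(retain ? c : Character.toLowerCase(c));
                retain = (c == '-');
            }
            return sb.toString();
        }
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("JEAN-PIERRE", Mode.NORMALCASE));
    }
}
```

Note that NORMALCASE retains the first character as it is; a token such as "mcDONALD" therefore becomes "mcdonald", not "Mcdonald".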

Umlaut Normalizer

Short name	UmlautNormalizer
Category	Transformer
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.UmlautNormalizer

Description

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

Parameters

minFrequencyThreshold	Minimum frequency count. Type: Integer — Default value: `100`
typesToCopy	A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: `[]`

Table 160. Capabilities
Inputs	Token
Outputs	none specified
Languages	de

Other

Table 161. Analysis Components in category Other (15)
Component	Description
AnnotationByTextFilter	Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
CompoundAnnotator	Annotates compound parts and linking morphemes.
CorrectionsContextualizer	This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).
NGramAnnotator	N-gram annotator.
PosFilter	Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
PosMapper	Maps existing POS tags from one tagset to another using a user provided properties file.
PhraseAnnotator	Annotate phrases in a sentence.
ReadabilityAnnotator	Assign a set of popular readability scores to the text.
RegexTokenFilter	Remove every token that does or does not match a given regular expression.
NorvigSpellingCorrector	Identifies spelling errors using Norvig's algorithm.
StopWordRemover	Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.
Stopwatch	Can be used to measure how long the processing between two points in a pipeline takes.
TfIdfAnnotator	This component adds Tfidf annotations consisting of a term and a tfidf weight.
TrailingCharacterRemover	Removes trailing characters (or character sequences) from tokens, e.g. punctuation.
JCasHolder	Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Annotation-By-Text Filter

Short name	AnnotationByTextFilter
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter

Description

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameters

ignoreCase	If true, annotation texts are filtered case-independently (i.e. words that occur in the list with different casing are not filtered out). Type: Boolean — Default value: `true`
modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well. Type: String
typeName	Annotation type to filter. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
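
The effect of the word list and the ignoreCase parameter can be sketched as below. The names `WordListFilterSketch` and `retainListed` are hypothetical, the word list is passed in memory instead of being read from modelLocation, and annotations are represented by their covered text only; this is an illustration under those assumptions rather than the component's implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class WordListFilterSketch {

    // Retains only those texts that occur in the word list.
    static List<String> retainListed(List<String> texts, Set<String> wordList,
            boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String word : wordList) {
            lookup.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> retained = new ArrayList<>();
        for (String text : texts) {
            String key = ignoreCase ? text.toLowerCase(Locale.ROOT) : text;
            if (lookup.contains(key)) {
                retained.add(text);
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        System.out.println(retainListed(List.of("The", "cat", "sat"), Set.of("the", "sat"), true));
    }
}
```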

Compound Annotator

Short name	CompoundAnnotator
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.decompounding-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.decompounding.uima.annotator.CompoundAnnotator

Description

Annotates compound parts and linking morphemes.

Table 162. Capabilities
Inputs	Token
Outputs	Compound CompoundPart LinkingMorpheme Split
Languages	none specified

Corrections Contextualizer

Short name	CorrectionsContextualizer
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.jazzy.CorrectionsContextualizer

Description

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses n-gram frequencies from a frequency provider in order to rank the provided corrections.

N-Gram Annotator

Short name	NGramAnnotator
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.ngrams-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.ngrams.NGramAnnotator

Description

N-gram annotator.

Parameters

N	The length of the n-grams to generate (the "n" in n-gram). Type: Integer — Default value: `3`

Table 163. Capabilities
Inputs	Sentence Token
Outputs	NGram
Languages	none specified
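
For readers unfamiliar with n-grams, the following sketch builds the n-grams of length n over the tokens of a single sentence. The names `NGramSketch` and `ngrams` are hypothetical, and the sketch only emits n-grams of exactly length n; whether the component also produces shorter n-grams is not stated in this reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class NGramSketch {

    // Returns all contiguous token sequences of length n within one sentence.
    static List<String> ngrams(List<String> sentenceTokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        for (int start = 0; start + n <= sentenceTokens.size(); start++) {
            result.add(String.join(" ", sentenceTokens.subList(start, start + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the documented default n = 3.
        System.out.println(ngrams(List.of("the", "quick", "brown", "fox"), 3));
    }
}
```

Because the sketch works per sentence, no n-gram spans a sentence boundary, which matches the Sentence and Token inputs listed above.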

POS Filter

Short name	PosFilter
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter

Description

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameters

adj	Keep/remove adjectives (true: keep, false: remove) Type: Boolean — Default value: `false`
adp	Keep/remove adpositions (true: keep, false: remove) Type: Boolean — Default value: `false`
adv	Keep/remove adverbs (true: keep, false: remove) Type: Boolean — Default value: `false`
aux	Keep/remove auxiliary verbs (true: keep, false: remove) Type: Boolean — Default value: `false`
conj	Keep/remove conjunctions (true: keep, false: remove) Type: Boolean — Default value: `false`
det	Keep/remove articles (true: keep, false: remove) Type: Boolean — Default value: `false`
intj	Keep/remove interjections (true: keep, false: remove) Type: Boolean — Default value: `false`
noun	Keep/remove nouns (true: keep, false: remove) Type: Boolean — Default value: `false`
num	Keep/remove numerals (true: keep, false: remove) Type: Boolean — Default value: `false`
part	Keep/remove particles (true: keep, false: remove) Type: Boolean — Default value: `false`
pron	Keep/remove pronouns (true: keep, false: remove) Type: Boolean — Default value: `false`
propn	Keep/remove proper nouns (true: keep, false: remove) Type: Boolean — Default value: `false`
punct	Keep/remove punctuation (true: keep, false: remove) Type: Boolean — Default value: `false`
sconj	Keep/remove subordinating conjunctions (true: keep, false: remove) Type: Boolean — Default value: `false`
sym	Keep/remove symbols (true: keep, false: remove) Type: Boolean — Default value: `false`
typeToRemove	The fully qualified name of the type that should be filtered. Type: String
verb	Keep/remove verbs (true: keep, false: remove) Type: Boolean — Default value: `false`
x	Keep/remove other (true: keep, false: remove) Type: Boolean — Default value: `false`

Table 164. Capabilities
Inputs	POS
Outputs	none specified
Languages	none specified

POS Mapper

Short name	PosMapper
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.posfilter.PosMapper

Description

Maps existing POS tags from one tagset to another using a user provided properties file.

Parameters

dkproMappingLocation	A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes. If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed. Optional — Type: String
mappingFile	A properties file containing POS tagset mappings. Type: String

Table 165. Capabilities
Inputs	POS Token
Outputs	POS Token
Languages	none specified
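
A tagset mapping held in a properties file can be sketched with java.util.Properties. Everything specific here is an assumption made for illustration: the class `PosMappingSketch`, the `sourceTag=targetTag` layout of the file, the example tags, and the choice to leave unmapped tags unchanged are not taken from the component.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class PosMappingSketch {

    // Parses properties-formatted text (assumed layout: sourceTag=targetTag).
    static Properties load(String content) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    // Maps a tag; tags without an entry are returned unchanged (assumption).
    static String map(Properties mapping, String tag) {
        return mapping.getProperty(tag, tag);
    }

    public static void main(String[] args) {
        Properties mapping = load("NN=NOUN\nVBZ=VERB\n");
        System.out.println(map(mapping, "NN"));
    }
}
```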

Phrase Annotator

Short name	PhraseAnnotator
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.frequency-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.frequency.phrasedetection.PhraseAnnotator

Description

Annotate phrases in a sentence. Depending on the provided n-grams and the threshold, these comprise either one or two annotations (tokens, lemmas, ...).

In order to identify longer phrases, run the FrequencyWriter and this annotator multiple times, each time taking the results of the previous run as input. From the second run on, set phrases in the feature path parameter #PARAM_FEATURE_PATH.

Parameters

PARAM_LOWERCASE	If true, lowercase everything. Type: Boolean — Default value: `false`
coveringType	Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences. Optional — Type: String
discount	The discount used to prevent too many phrases consisting of very infrequent words from being formed. A typical value is the minimum count set during model creation (FrequencyWriter#PARAM_MIN_COUNT), which is by default set to 5. Type: Integer — Default value: `5`
featurePath	The feature path to use for building bigrams. Type: String — Default value: `de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token`
filterRegex	Regular expression of tokens to be filtered. Type: String — Default value: ``
modelLocation	The file providing the uni-grams and bi-grams to use. Type: String
regexReplacement	Value with which tokens matching the regular expression are replaced. Type: String — Default value: ``
stopwordsFile	Path of a file containing stopwords, one word per line. Type: String — Default value: ``
stopwordsReplacement	Stopwords are replaced by this value. Type: String — Default value: ``
threshold	The threshold score for phrase construction. Default is 100. Lower values result in fewer phrases. The value strongly depends on the size of the corpus and the token unigrams. Type: Float — Default value: `100.0`

Readability Annotator

Short name	ReadabilityAnnotator
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.readability-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.readability.ReadabilityAnnotator

Description

Assign a set of popular readability scores to the text.

Table 166. Capabilities
Inputs	Sentence Token
Outputs	ReadabilityScore
Languages	none specified

Regex Token Filter

Short name	RegexTokenFilter
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.RegexTokenFilter

Description

Remove every token that does or does not match a given regular expression.

Parameters

mustMatch	If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed. Type: Boolean — Default value: `true`
regex	Every token that does or does not match this regular expression will be removed. Type: String

Table 167. Capabilities
Inputs	Token
Outputs	none specified
Languages	none specified
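
The interaction between mustMatch and regex can be sketched as follows. `RegexTokenFilterSketch` and `filter` are hypothetical names, tokens are plain strings, and the sketch assumes that "match" means the regular expression must match the entire token text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class RegexTokenFilterSketch {

    // mustMatch = true:  retain only tokens that match the regex.
    // mustMatch = false: remove all tokens that match the regex.
    static List<String> filter(List<String> tokens, String regex, boolean mustMatch) {
        Pattern pattern = Pattern.compile(regex);
        List<String> retained = new ArrayList<>();
        for (String token : tokens) {
            boolean matches = pattern.matcher(token).matches(); // whole-token match (assumption)
            if (matches == mustMatch) {
                retained.add(token);
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("abc", "123", "a1");
        System.out.println(filter(tokens, "[a-z]+", true));
        System.out.println(filter(tokens, "[a-z]+", false));
    }
}
```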

Simple Spelling Corrector

Short name	NorvigSpellingCorrector
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.norvig-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.norvig.NorvigSpellingCorrector

Description

Identifies spelling errors using Norvig's algorithm.

Parameters

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Optional — Type: String

Table 168. Capabilities
Inputs	Token
Outputs	SofaChangeAnnotation
Languages	none specified

Stop Word Remover

Short name	StopWordRemover
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.stopwordremover-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover

Description

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameters

Paths	Feature paths for annotations that should be matched/removed. The default is: StopWord.class.getName(), Token.class.getName(), Lemma.class.getName()+"/value". Optional — Type: String[]
StopWordType	Anything annotated with this type will be removed even if it does not match any word in the lists. Optional — Type: String
modelEncoding	The character encoding used by the model. Type: String — Default value: `UTF-8`
modelLocation	A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt" Type: String[]*

Table 169. Capabilities
Inputs	StopWord
Outputs	none specified
Languages	none specified
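
The language-prefix convention of modelLocation can be sketched as a small parser. The class `StopWordLocationSketch`, its methods, and the example paths are hypothetical; the sketch only shows how a "[lang]" prefix could be split off and checked against a document language, and does not load any stop word list.

```java
// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class StopWordLocationSketch {

    // Splits a location into {language, url}. The language is "" when there is
    // no prefix or the prefix is "[]", meaning the list applies to every document.
    static String[] parse(String location) {
        if (location.startsWith("[")) {
            int end = location.indexOf(']');
            if (end > 0) {
                return new String[] { location.substring(1, end), location.substring(end + 1) };
            }
        }
        return new String[] { "", location };
    }

    // Decides whether the list at this location applies to a document language.
    static boolean appliesTo(String location, String documentLanguage) {
        String language = parse(location)[0];
        return language.isEmpty() || language.equals(documentLanguage);
    }

    public static void main(String[] args) {
        String location = "[de]classpath:/stopwords/example.txt"; // made-up path
        System.out.println(appliesTo(location, "de"));
        System.out.println(appliesTo(location, "en"));
    }
}
```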

Stopwatch

Short name	Stopwatch
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.performance-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch

Description

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.

Parameters

timerName	Name of the timer pair. Upstream and downstream timer need to use the same name. Type: String
timerOutputFile	File to which the timer output is written. Optional — Type: String

Table 170. Capabilities
Inputs	TimerAnnotation
Outputs	TimerAnnotation
Languages	none specified

TF/IDF Annotator

Short name	TfIdfAnnotator
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.frequency-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfIdfAnnotator

Description

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfIdfWriter.

Parameters

featurePath	This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path. Type: String
lowercase	If set to true, the whole text is handled in lower case. Optional — Type: Boolean — Default value: `false`
tfdfPath	Provide the path to the Df-Model. When a shared SharedDfModel is bound to this annotator, this is ignored. Optional — Type: String
weightingModeIdf	The model for inverse document frequency weighting. Invoke toString() on an enum of WeightingModeIdf for setup. Default value is "NORMAL" yielding an unweighted idf. Optional — Type: String — Default value: `NORMAL`
weightingModeTf	The model for term frequency weighting. Invoke toString() on an enum of WeightingModeTf for setup. Default value is "NORMAL" yielding an unweighted tf. Optional — Type: String — Default value: `NORMAL`

Table 171. Capabilities
Inputs	none specified
Outputs	Tfidf
Languages	none specified

Trailing Character Remover

Short name	TrailingCharacterRemover
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover

Description

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

Parameters

minTokenLength	All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed. Shorter tokens that do not have trailing chars removed are always retained, regardless of their length. Type: Integer — Default value: `1`
pattern	A regex to be trimmed from the end of tokens. Type: String — Default value: `[\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+`

Table 172. Capabilities
Inputs	Token
Outputs	Token
Languages	none specified
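
The interplay of pattern and minTokenLength can be sketched as below. The names `TrailingCharSketch` and `trim` are hypothetical, tokens are plain strings, and the example uses a simple punctuation pattern rather than the component's default pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the DKPro Core implementation.
final class TrailingCharSketch {

    static List<String> trim(List<String> tokens, String pattern, int minTokenLength) {
        // Anchor the pattern so that it only matches at the end of a token.
        Pattern trailing = Pattern.compile("(?:" + pattern + ")$");
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String trimmed = trailing.matcher(token).replaceFirst("");
            boolean changed = trimmed.length() < token.length();
            // Only tokens that were actually trimmed can be dropped for being too short.
            if (changed && trimmed.length() < minTokenLength) {
                continue;
            }
            result.add(trimmed);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(trim(List.of("wow!!!", "...", "ok"), "[.,!?]+", 1));
    }
}
```

With minTokenLength set to 1, "..." is removed entirely because nothing remains after trimming, while "ok" is retained because no trailing characters were removed from it.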

de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Short name	JCasHolder
Category	Other
Group ID	de.tudarmstadt.ukp.dkpro.core
Artifact ID	de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Implementation	de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Description

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Appendix

Table 173. Producers and consumers by type
Type	Producer	Consumer
GrammarAnomaly	LanguageToolChecker
SpellingAnomaly	JazzyChecker	SpellingNormalizer
SuggestedAction	JazzyChecker
CoreferenceChain	CoreNlpCoreferenceResolver StanfordCoreferenceResolver
CoreferenceLink	CoreNlpCoreferenceResolver StanfordCoreferenceResolver
Tfidf	TfIdfAnnotator
Morpheme	MateMorphTagger
MorphologicalFeatures	MateMorphTagger RfTagger SfstAnnotator UDPipePosTagger	UDPipeParser
POS	ArktweetPosTagger ClearNlpPosTagger CoreNlpPosTagger HepplePosTagger HunPosTagger LingPipePosTagger MatePosTagger MeCabTagger Nlp4JPosTagger OpenNlpPosTagger PosMapper RfTagger SfstAnnotator StanfordPosTagger TreeTaggerPosTagger UDPipePosTagger	ArktweetPosTaggerTrainer ClearNlpLemmatizer ClearNlpParser ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpParser GermanSeparatedParticleAnnotator IxaLemmatizer MaltParser MateParser MateSemanticRoleLabeler MorphaLemmatizer MstParser Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer OpenNlpChunker OpenNlpLemmatizer OpenNlpPosTaggerTrainer PosFilter PosMapper SemanticFieldAnnotator StanfordCoreferenceResolver StanfordLemmatizer StanfordParser StanfordPosTaggerTrainer TokenMerger TreeTaggerChunker UDPipeParser
DocumentMetaData	ApplyChangesAnnotator	ApplyChangesAnnotator
NamedEntity	CoreNlpNamedEntityRecognizer LingPipeNamedEntityRecognizer Nlp4JNamedEntityRecognizer OpenNlpNamedEntityRecognizer SemanticFieldAnnotator StanfordNamedEntityRecognizer	CoreNlpCoreferenceResolver OpenNlpNamedEntityRecognizerTrainer StanfordCoreferenceResolver StanfordNamedEntityRecognizerTrainer
PhoneticTranscription	ColognePhoneticTranscriptor DoubleMetaphonePhoneticTranscriptor MetaphonePhoneticTranscriptor SoundexPhoneticTranscriptor
Compound	CompoundAnnotator
CompoundPart	CompoundAnnotator
Lemma	ClearNlpLemmatizer CoreNlpLemmatizer GateLemmatizer GermanSeparatedParticleAnnotator IxaLemmatizer LanguageToolLemmatizer MateLemmatizer MeCabTagger MorphaLemmatizer Nlp4JLemmatizer OpenNlpLemmatizer StanfordLemmatizer TokenMerger TreeTaggerPosTagger UDPipePosTagger	ClearNlpParser ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver GermanSeparatedParticleAnnotator MaltParser MateMorphTagger MateSemanticRoleLabeler Nlp4JNamedEntityRecognizer SemanticFieldAnnotator StanfordCoreferenceResolver TokenMerger UDPipeParser
LinkingMorpheme	CompoundAnnotator
NGram	NGramAnnotator
Paragraph	JTokSegmenter ParagraphSplitter
Sentence	BreakIteratorSegmenter ClearNlpSegmenter CoreNlpSegmenter GosenSegmenter IcuSegmenter JTokSegmenter LanguageToolSegmenter LineBasedSentenceSegmenter LingPipeSegmenter MeCabTagger Nlp4JSegmenter OpenNlpSegmenter RegexSegmenter StanfordSegmenter UDPipeSegmenter WhitespaceSegmenter	ArktweetPosTaggerTrainer BerkeleyParser ClearNlpLemmatizer ClearNlpParser ClearNlpPosTagger ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpNamedEntityRecognizer CoreNlpParser CoreNlpPosTagger DictionaryAnnotator GermanSeparatedParticleAnnotator HepplePosTagger HunPosTagger IxaLemmatizer LanguageToolLemmatizer LingPipePosTagger MaltParser MateLemmatizer MateMorphTagger MateParser MatePosTagger MateSemanticRoleLabeler MorphaLemmatizer MstParser NGramAnnotator Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer Nlp4JPosTagger OpenNlpChunker OpenNlpLemmatizer OpenNlpNamedEntityRecognizerTrainer OpenNlpParser OpenNlpPosTagger OpenNlpPosTaggerTrainer OpenNlpSentenceTrainer ReadabilityAnnotator RfTagger SfstAnnotator StanfordCoreferenceResolver StanfordNamedEntityRecognizer StanfordNamedEntityRecognizerTrainer StanfordParser StanfordPosTagger StanfordPosTaggerTrainer UDPipeParser UDPipePosTagger
Split	CompoundAnnotator
Stem	CisStemmer LancasterStemmer SnowballStemmer
StopWord		StopWordRemover
Token	BreakIteratorSegmenter CamelCaseTokenSegmenter ClearNlpSegmenter CoreNlpSegmenter GosenSegmenter IcuSegmenter JTokSegmenter LanguageToolSegmenter LingPipeSegmenter Nlp4JSegmenter OpenNlpSegmenter PatternBasedTokenSegmenter PosMapper RegexSegmenter StanfordSegmenter TokenTrimmer TrailingCharacterRemover UDPipeSegmenter WhitespaceSegmenter	ArktweetPosTagger ArktweetPosTaggerTrainer BerkeleyParser CamelCaseTokenSegmenter CapitalizationNormalizer ClearNlpLemmatizer ClearNlpParser ClearNlpPosTagger ClearNlpSemanticRoleLabeler ColognePhoneticTranscriptor CompoundAnnotator CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpNamedEntityRecognizer CoreNlpParser CoreNlpPosTagger DictionaryAnnotator DoubleMetaphonePhoneticTranscriptor ExpressiveLengtheningNormalizer GateLemmatizer GermanSeparatedParticleAnnotator HepplePosTagger HunPosTagger IxaLemmatizer JazzyChecker LancasterStemmer LanguageToolLemmatizer LingPipeNamedEntityRecognizer LingPipePosTagger MalletEmbeddingsAnnotator MalletLdaTopicModelInferencer MaltParser MateLemmatizer MateMorphTagger MateParser MatePosTagger MateSemanticRoleLabeler MetaphonePhoneticTranscriptor MorphaLemmatizer MstParser NGramAnnotator Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer Nlp4JPosTagger NorvigSpellingCorrector OpenNlpChunker OpenNlpLemmatizer OpenNlpNamedEntityRecognizer OpenNlpNamedEntityRecognizerTrainer OpenNlpParser OpenNlpPosTagger OpenNlpPosTaggerTrainer OpenNlpTokenTrainer PatternBasedTokenSegmenter PosMapper ReadabilityAnnotator RegexBasedTokenTransformer RegexTokenFilter ReplacementFileNormalizer RfTagger SemanticFieldAnnotator SfstAnnotator SoundexPhoneticTranscriptor StanfordCoreferenceResolver StanfordDependencyConverter StanfordLemmatizer StanfordNamedEntityRecognizer StanfordNamedEntityRecognizerTrainer StanfordParser StanfordPosTagger StanfordPosTaggerTrainer TokenMerger TokenTrimmer TrailingCharacterRemover TreeTaggerPosTagger UDPipeParser UDPipePosTagger UmlautNormalizer
SemArg	ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler
SemPred	ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler
PennTree	BerkeleyParser OpenNlpParser
Chunk	OpenNlpChunker TreeTaggerChunker
Constituent	BerkeleyParser CoreNlpParser OpenNlpParser StanfordParser	CoreNlpCoreferenceResolver StanfordCoreferenceResolver StanfordDependencyConverter
Dependency	ClearNlpParser CoreNlpDependencyParser CoreNlpParser MaltParser MateParser MstParser Nlp4JDependencyParser StanfordDependencyConverter StanfordParser UDPipeParser	ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler
SofaChangeAnnotation	ApplyChangesAnnotator NorvigSpellingCorrector ReplacementFileNormalizer	ApplyChangesAnnotator
TopicDistribution	MalletLdaTopicModelInferencer
WordEmbedding	MalletEmbeddingsAnnotator
JapaneseToken	MeCabTagger
ReadabilityScore	ReadabilityAnnotator
TimerAnnotation	Stopwatch	Stopwatch