The document provides detailed information about the DKPro Core UIMA components.

Overview

Analysis components

Table 1. Analysis Components (94)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ArktweetTokenizer

ArkTweet tokenizer.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

BerkeleyParser

Berkeley Parser annotator.

BreakIteratorSegmenter

BreakIterator segmenter.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

CapitalizationNormalizer

Takes a text and replaces incorrect capitalization.

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

ClearNlpParser

Clear parser annotator.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

ClearNlpSegmenter

Tokenizer using Clear NLP.

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

CompoundAnnotator

Annotates compound parts and linking morphemes.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

HyphenationRemover

Simple dictionary-based hyphenation remover.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

JTokSegmenter

JTok segmenter.

JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LanguageIdentifier

Detection based on character n-grams.

LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file.

MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MaltParser

Dependency parsing using MaltParser.

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

MateParser

DKPro Annotator for the MateToolsParser.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

MstParser

Dependency parsing using MSTParser.

NGramAnnotator

N-gram annotator.

NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

OpenNlpChunker

Chunk annotator using OpenNLP.

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

OpenNlpParser

OpenNLP parser.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.

RfTagger

RFTagger morphological analyzer.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

SfstAnnotator

SFST morphological analyzer.

SharpSNormalizer

Takes a text and replaces sharp s.

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordCoreferenceResolver

No description

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

StanfordLemmatizer

Stanford Lemmatizer component.

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

StanfordParser

Stanford Parser component.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

StanfordSegmenter

No description

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TfidfConsumer

This consumer builds a DfModel.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

TokenMerger

Merges any Tokens that are covered by a given annotation type.

TokenTrimmer

Remove prefixes and suffixes from tokens.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

TreeTaggerChunker

Chunk annotator using TreeTagger.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

Checker

Table 2. Analysis Components in group Checker (2)
Component Description

JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

JazzyChecker

Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker

This annotator uses Jazzy to decide whether a word is spelled correctly.

Parameters
ScoreThreshold (Integer) = 1

Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within an edit distance of 1 from the original word.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String)

Location from which the model is read. The model file is a simple word-list with one word per line.

Inputs and outputs

Inputs

Outputs
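
To illustrate how these parameters fit together, here is a minimal uimaFIT sketch; the word-list path is a hypothetical placeholder, and the parameters are set via the names documented above:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.anomaly.type.SpellingAnomaly;
import de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class JazzyCheckerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("This sentnce contains a speling error.");

        runPipeline(jcas,
                // the checker operates on tokens, so segment first
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(JazzyChecker.class,
                        "modelLocation", "dict/en.txt", // placeholder: word list, one word per line
                        "ScoreThreshold", 2));

        // misspelled words are marked as SpellingAnomaly annotations
        for (SpellingAnomaly anomaly : JCasUtil.select(jcas, SpellingAnomaly.class)) {
            System.out.println(anomaly.getCoveredText());
        }
    }
}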

LanguageToolChecker

Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

Inputs and outputs

Inputs

none specified

Outputs

Chunker

Table 3. Analysis Components in group Chunker (2)
Component Description

OpenNlpChunker

Chunk annotator using OpenNLP.

TreeTaggerChunker

Chunk annotator using TreeTagger.

OpenNlpChunker

Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker

Chunk annotator using OpenNLP.

Parameters
ChunkMappingLocation (String) [optional]

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default 20100908.1
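
As an illustration of the typical setup, here is a minimal uimaFIT sketch; the chunker expects tokens, sentences, and POS tags, so it runs after a segmenter and a POS tagger (other segmenters and taggers would work equally well):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.chunk.Chunk;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class ChunkerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The quick brown fox jumps over the lazy dog.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),  // tokens + sentences
                createEngineDescription(OpenNlpPosTagger.class),  // POS tags
                createEngineDescription(OpenNlpChunker.class));   // chunks

        for (Chunk chunk : JCasUtil.select(jcas, Chunk.class)) {
            System.out.println(chunk.getChunkValue() + ": " + chunk.getCoveredText());
        }
    }
}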

TreeTaggerChunker

Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker

Chunk annotator using TreeTagger.

Parameters
ChunkMappingLocation (String) [optional]

Location of the mapping file for chunk tags to UIMA types.

executablePath (String) [optional]

Use this TreeTagger executable instead of trying to locate the executable automatically.

flushSequence (String) [optional]

A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n....

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

performanceMode (Boolean) = false

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de le 20110429.1
en iso8859-le 20090824.1
en le 20140520.1
fr le 20141218.2

Coreference resolver

Table 4. Analysis Components in group Coreference resolver (1)
Component Description

StanfordCoreferenceResolver

No description

StanfordCoreferenceResolver

Role: Coreference resolver
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordCoreferenceResolver

No description

Parameters
maxDist (Integer) = -1

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

postprocessing (Boolean) = false

DCoRef parameter: Do post processing

score (Boolean) = false

DCoRef parameter: Scoring the output of the system

sieves (String) = MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

singleton (Boolean) = true

DCoRef parameter: setting singleton predictor

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default ${core.version}.1

Language Identifier

Table 5. Analysis Components in group Language Identifier (3)
Component Description

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LanguageIdentifier

Detection based on character n-grams.

LangDetectLanguageIdentifier

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.langdetect-asl
Class: de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

Parameters
modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

Models
Language Variant Version
any socialmedia 20141013.1
any wikipedia 20141013.1
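
A minimal usage sketch, assuming the detected language is stored as the CAS document language:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier;

public class LanguageIdExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("Der schnelle braune Fuchs springt über den faulen Hund.");

        runPipeline(jcas,
                createEngineDescription(LangDetectLanguageIdentifier.class,
                        "modelVariant", "wikipedia")); // one of the variants listed above

        // the detected language is written back to the CAS
        System.out.println(jcas.getDocumentLanguage()); // expected: "de"
    }
}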

LanguageDetectorWeb1T

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ldweb1t-asl
Class: de.tudarmstadt.ukp.dkpro.core.ldweb1t.LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

Parameters
maxNGramSize (Integer) = 3

The maximum n-gram size that should be considered. Default is 3.

minNGramSize (Integer) = 1

The minimum n-gram size that should be considered. Default is 1.

LanguageIdentifier

Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textcat-asl
Class: de.tudarmstadt.ukp.dkpro.core.textcat.LanguageIdentifier

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References:

  • Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

Lemmatizer

Table 6. Analysis Components in group Lemmatizer (6)
Component Description

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

StanfordLemmatizer

Stanford Lemmatizer component.

ClearNlpLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpLemmatizer

Lemmatizer using Clear NLP.

Parameters
language (String) = en [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en default 20130715.0

GateLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.GateLemmatizer

Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

LanguageToolLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.

Parameters
sanitize (Boolean) = true
sanitizeChars (String[]) = [(, ), [, ]]
Inputs and outputs

Inputs

Outputs
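
A minimal uimaFIT sketch, assuming tokens and sentences are the only required inputs; the resulting lemmas are read back from the Lemma annotations:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma;
import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer;
import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter;

public class LemmatizerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The children were running home.");

        runPipeline(jcas,
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(LanguageToolLemmatizer.class));

        for (Lemma lemma : JCasUtil.select(jcas, Lemma.class)) {
            System.out.println(lemma.getCoveredText() + " -> " + lemma.getValue());
        }
    }
}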

MateLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

uppercase (Boolean) = false

Try reconstructing proper casing for lemmata. This is useful for German, but for English, for example, it creates odd results.

variant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de tiger 20121024.1
en conll2009 20130117.1
es conll2009 20130117.1
fr ftb 20130918.0

MorphaLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.morpha-asl
Class: de.tudarmstadt.ukp.dkpro.core.morpha.MorphaLemmatizer

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References:

  • Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.
Parameters
readPOS (Boolean) = false

Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.

Inputs and outputs

Inputs

Outputs

StanfordLemmatizer

Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordLemmatizer

Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for ENGLISH.

Parameters
ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Morphological analyzer

Table 7. Analysis Components in group Morphological analyzer (2)
Component Description

RfTagger

RFTagger morphological analyzer.

SfstAnnotator

SFST morphological analyzer.

RfTagger

Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.rftagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.rftagger.RfTagger

RFTagger morphological analyzer.

Parameters
MorphMappingLocation (String) [optional]
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelEncoding (String) [optional]

The character encoding used by the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
cz cac 20150728.1
de tiger 20150928.1
hu szeged 20150728.1
ru ric 20150728.1
sk snk 20150728.1
sl jos 20150728.1

SfstAnnotator

Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.sfst-gpl
Class: de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator

SFST morphological analyzer.

Parameters
MorphMappingLocation (String) [optional]
language (String) [optional]

Use this language instead of the document language to resolve the model.

mode (String) = FIRST
modelEncoding (String) = UTF-8

Specifies the model encoding.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

writeLemma (Boolean) = true

Write lemma information. Default: true

writePOS (Boolean) = true

Write part-of-speech information. Default: true

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de morphisto-ca 20110202.1
de smor-ca 20140801.1
de zmorge-newlemma-ca 20140521.1
de zmorge-orig-ca 20140521.1
it pippi-ca 20090223.1
tr trmorph-ca 20130219.1

Named Entity Recognizer

Table 8. Analysis Components in group Named Entity Recognizer (2)
Component Description

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

OpenNlpNamedEntityRecognizer

Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

Parameters
NamedEntityMappingLocation (String) [optional]

Location of the mapping file for named entity tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) = person

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en date 20100907.0
en location 20100907.0
en money 20100907.0
en organization 20100907.0
en percentage 20100907.0
en person 20130624.1
en time 20100907.0
es location 20100908.0
es misc 20100908.0
es organization 20100908.0
es person 20100908.0
nl location 20100908.0
nl misc 20100908.0
nl organization 20100908.0
nl person 20100908.0
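
A minimal sketch showing how modelVariant selects the entity type to detect (here the default, person):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class NerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("John Smith visited Berlin in May.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // modelVariant selects the entity type model; "person" is the default
                createEngineDescription(OpenNlpNamedEntityRecognizer.class,
                        "modelVariant", "person"));

        for (NamedEntity ne : JCasUtil.select(jcas, NamedEntity.class)) {
            System.out.println(ne.getValue() + ": " + ne.getCoveredText());
        }
    }
}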

StanfordNamedEntityRecognizer

Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

Parameters
NamedEntityMappingLocation (String) [optional]

Location of the mapping file for named entity tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de dewac_175m_600.crf 20150130.1
de hgc_175m_600.crf 20150130.1
en all.3class.caseless.distsim.crf 20160110.0
en all.3class.distsim.crf 20150420.1
en all.3class.nodistsim.crf 20160110.1
en conll.4class.caseless.distsim.crf 20160110.0
en conll.4class.distsim.crf 20150420.1
en conll.4class.nodistsim.crf 20160110.1
en muc.7class.caseless.distsim.crf 20150129.0
en muc.7class.distsim.crf 20150129.1
en muc.7class.nodistsim.crf 20160110.1
en nowiki.3class.caseless.distsim.crf 20160110.0
en nowiki.3class.nodistsim.crf 20160110.0
es ancora.distsim.s512.crf 20140826.1

Parser

Table 9. Analysis Components in group Parser (7)
Component Description

BerkeleyParser

Berkeley Parser annotator.

ClearNlpParser

Clear parser annotator.

MaltParser

Dependency parsing using MaltParser.

MateParser

DKPro Annotator for the MateToolsParser.

MstParser

Dependency parsing using MSTParser.

OpenNlpParser

OpenNLP parser.

StanfordParser

Stanford Parser component.

BerkeleyParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.berkeleyparser-gpl
Class: de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser

Berkeley Parser annotator. Requires Sentences to be annotated before.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

accurate (Boolean) = false

Set thresholds for accuracy.

Default: false (set thresholds for efficiency)

binarize (Boolean) = false

Output binarized trees.

Default: false

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

keepFunctionLabels (Boolean) = false

Retain predicted function labels. Model must have been trained with function labels.

Default: false

language (String) [optional]

Use this language instead of the language set in the CAS to locate the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

readPOS (Boolean) = true

Sets whether to use existing POS tags from another annotator for the parsing process.

Default: true

scores (Boolean) = false

Output inside scores (only for binarized Viterbi trees).

Default: false

substates (Boolean) = false

Output sub-categories (only for binarized Viterbi trees).

Default: false

variational (Boolean) = false

Use variational rule score approximation instead of max-rule.

Default: false

viterbi (Boolean) = false

Compute Viterbi derivation instead of max-rule tree.

Default: false (max-rule)

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar sm5 20090917.1
bg sm5 20090917.1
de sm5 20090917.1
en sm6 20100819.1
fr sm5 20090917.1
zh sm5 20090917.1

ClearNlpParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpParser

Clear parser annotator.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

Inputs and outputs

Inputs

Outputs

MaltParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.maltparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser

Dependency parsing using MaltParser.

Required annotations:

  • Token
  • Sentence
  • POS
Generated annotations:
  • Dependency (annotated over sentence-span)
Parameters
ignoreMissingFeatures (Boolean) = false

Process anyway, even if the model relies on features that are not supported by this component. Default: false

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
bn linear 20120905.1
en linear 20120312.1
en poly 20120312.1
es linear 20130220.0
fa linear 20130522.1
fr linear 20120312.1
pl linear 20120904.1
sv linear 20120925.2
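
A minimal sketch of a dependency-parsing pipeline; the upstream components are interchangeable as long as they produce the Token, Sentence, and POS annotations listed above:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency;
import de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DependencyParsingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The dog chased the cat.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),  // Token, Sentence
                createEngineDescription(OpenNlpPosTagger.class),  // POS
                createEngineDescription(MaltParser.class));       // Dependency

        for (Dependency dep : JCasUtil.select(jcas, Dependency.class)) {
            System.out.println(dep.getDependencyType() + "("
                    + dep.getGovernor().getCoveredText() + ", "
                    + dep.getDependent().getCoveredText() + ")");
        }
    }
}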

MateParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateParser

DKPro Annotator for the MateToolsParser.

Please cite the following paper if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameters
DependencyMappingLocation (String) [optional]

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
de tiger 20121024.1
en conll2009 20130117.2
es conll2009 20130117.1
fr ftb 20130918.0
zh conll2009 20130117.1

MstParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mstparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found on the MSTParser website.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CoNLL 2006 format are not generated (cf. mstparser.DependencyInstance).

Parameters
DependencyMappingLocation (String) [optional]

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

order (Integer) [optional]

Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en eisner 20100416.2
en sample 20121019.2
hr mte5.defnpout 20130527.1
hr mte5.pos 20130527.1

OpenNlpParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpParser

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotation if explicitly requested via #PARAM_WRITE_POS.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Default: false

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
en chunking 20120616.1

StanfordParser

Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser

Stanford Parser component.

Parameters
ConstituentMappingLocation (String) [optional]

Location of the mapping file for constituent tags to UIMA types.

POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

annotationTypeToParse (String) [optional]

This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing.

If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations.

Default: null

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

maxItems (Integer) = 200000

Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser.

Default: 200000

maxSentenceLength (Integer) = 130

Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions.

Default: 130

mode (String) = TREE [optional]

Sets the kind of dependencies being created.

Default: DependenciesMode#TREE

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

readPOS (Boolean) = true

Sets whether to use existing POS tags from another annotator for the parsing process.

Default: true

writeConstituent (Boolean) = true

Sets whether to create constituent tags. This is required for POS tagging and lemmatization.

Default: true

writeDependency (Boolean) = true

Sets whether to create dependency annotations.

Default: true

writePOS (Boolean) = false

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

writePennTree (Boolean) = false

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar factored 20150129.1
ar sr 20141031.1
de factored 20150129.1
de pcfg 20150129.1
de sr 20141031.1
en factored 20150129.1
en pcfg 20150129.1
en pcfg.caseless 20160110.1
en rnn 20140104.1
en sr 20141031.1
en sr-beam 20141031.1
en wsj-factored 20150129.1
en wsj-pcfg 20150129.1
en wsj-rnn 20140104.1
es pcfg 20150108.1
es sr 20141023.1
es sr-beam 20141023.1
fr factored 20150129.1
fr sr 20160114.1
fr sr-beam 20141023.1
zh factored 20150129.1
zh pcfg 20150129.1
zh sr 20141023.1
zh xinhua-factored 20150129.1
zh xinhua-pcfg 20150129.1
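
A minimal sketch requesting PennTree annotations; readPOS is disabled here so the parser assigns POS tags itself:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.PennTree;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter;

public class ConstituencyParsingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The dog chased the cat.");

        runPipeline(jcas,
                createEngineDescription(StanfordSegmenter.class),
                createEngineDescription(StanfordParser.class,
                        "readPOS", false,        // let the parser tag internally
                        "writePennTree", true,   // one PennTree annotation per sentence
                        "writePOS", true));      // also create POS tags from the parse

        for (PennTree tree : JCasUtil.select(jcas, PennTree.class)) {
            System.out.println(tree.getPennTree());
        }
    }
}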

Part-of-speech tagger

Table 10. Analysis Components in group Part-of-speech tagger (10)
Component Description

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

ArktweetPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters In Proceedings of NAACL 2013.

Parameters
POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

Models
Language Variant Version
en default 20120919.1
en irc 20121211.1
en ritter 20130723.1

ClearNlpPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

dictLocation (String) [optional]

Load the dictionary from this location instead of locating the dictionary automatically.

dictVariant (String) [optional]

Override the default variant used to locate the dictionary.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the pos-tagging model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the pos-tagging model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded.

Inputs and outputs

Inputs

Outputs

HepplePosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.HepplePosTagger

GATE Hepple part-of-speech tagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

lexiconLocation (String) [optional]

Load the lexicon from this location instead of locating it automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

rulesetLocation (String) [optional]

Load the ruleset from this location instead of locating it automatically.

Inputs and outputs

Inputs

Outputs

HunPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.hunpos-asl
Class: de.tudarmstadt.ukp.dkpro.core.hunpos.HunPosTagger

Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
cs pdt 20121123.2
da ddt 20121123.2
de tiger 20121123.2
en wsj 20070724.2
fa upc 20140414.0
hr mte5.defnpout 20130509.2
hu szeged_kr 20070724.2
pt bosque 20121123.2
pt mm 20130119.2
pt tbchp 20110419.2
ru rdt 20121123.2
sl jos 20121123.2
sv paroletags 20100215.2
sv suctags 20100927.2

MateMorphTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

MatePosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.1

es

conll2009

20130117.1

fr

ftb

20130918.0

zh

conll2009

20130117.1

MeCabTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mecab-asl
Class: de.tudarmstadt.ukp.dkpro.core.mecab.MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

Models
Language Variant Version
jp bin-linux-x86_32 .
jp bin-linux-x86_64 .
jp bin-osx-x86_64 .
jp ipadic .

OpenNlpPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP. Requires Sentences to be annotated before.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
da maxent 20120616.1
da perceptron 20120616.1
de maxent 20120616.1
de perceptron 20120616.1
en maxent 20120616.1
en perceptron 20120616.1
en perceptron-ixa 20131115.1
es maxent 20120410.1
es maxent-ixa 20140425.1
es maxent-universal 20120410.1
es perceptron 20120410.1
es perceptron-ixa 20131115.1
es perceptron-universal 20120410.1
it perceptron 20130618.0
nl maxent 20120616.1
nl perceptron 20120616.1
pt maxent 20120616.1
pt mm-maxent 20130121.1
pt mm-perceptron 20130121.1
pt perceptron 20120616.1
sv maxent 20120616.1
sv perceptron 20120616.1
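
A minimal sketch selecting the maxent variant and reading the resulting tags from the tokens:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PosTaggingExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The quick brown fox jumps over the lazy dog.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // "maxent" or "perceptron" can be chosen via modelVariant
                createEngineDescription(OpenNlpPosTagger.class, "modelVariant", "maxent"));

        for (Token token : JCasUtil.select(jcas, Token.class)) {
            System.out.println(token.getCoveredText() + "/" + token.getPos().getPosValue());
        }
    }
}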

StanfordPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTagger

Stanford Part-of-Speech tagger component.

Parameters
POSMappingLocation (String) [optional]

Location of the mapping file for part-of-speech tags to UIMA types.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

maxSentenceLength (Integer) [optional]

Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged.

modelLocation (String) [optional]

Location from which the model is read.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

ptb3Escaping (Boolean) = true

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

quoteBegin (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

quoteEnd (String[]) [optional]

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
ar accurate 20131112.1
de dewac 20140827.1
de fast 20140827.1
de fast-caseless 20140827.0
de hgc 20140827.1
en bidirectional-distsim 20140616.1
en caseless-left3words-distsim 20140827.0
en fast.41 20130730.1
en left3words-distsim 20140616.1
en twitter 20130730.1
en twitter-fast 20130914.0
en wsj-0-18-bidirectional-distsim 20160110.1
en wsj-0-18-bidirectional-nodistsim 20131112.1
en wsj-0-18-caseless-left3words-distsim 20140827.0
en wsj-0-18-left3words-distsim 20140616.1
en wsj-0-18-left3words-nodistsim 20131112.1
es default 20151014.1
es distsim 20150108.1
fr default 20140616.1
zh distsim 20140616.1
zh nodistsim 20140616.1

TreeTaggerPosTagger

Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameters
POSMappingLocation (String) [optional]

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

executablePath (String) [optional]

Use this TreeTagger executable instead of trying to locate the executable automatically.

internTags (Boolean) = true [optional]

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelEncoding (String) [optional]

The character encoding used by the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

performanceMode (Boolean) = false

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

printTagSet (Boolean) = false

Log the tag set(s) when a model is loaded. Default: false

writeLemma (Boolean) = true

Write lemma information. Default: true

writePOS (Boolean) = true

Write part-of-speech information. Default: true

Inputs and outputs

Inputs

Outputs

Models
Language Variant Version
bg le 20160430.1
de le 20121207.1
en le 20151119.1
es le 20150724.1
et le 20110124.1
fi le 20140704.1
fr le 20100111.1
gl le 20130516.1
it le 20141020.1
la le 20110819.1
mn le 20120925.1
nl le 20130107.1
pl le 20150506.1
pt le 20101115.2
ru le 20140505.1
sk le 20130725.1
sw le 20130729.1
zh le 20101115.1
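
A minimal sketch, assuming a TreeTagger executable and model are available on the system; POS tags and lemmas are produced in a single pass:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger;

public class TreeTaggerExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("She was reading the documentation.");

        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                // writes POS tags and lemmas in one pass (both on by default)
                createEngineDescription(TreeTaggerPosTagger.class,
                        "writePOS", true,
                        "writeLemma", true));

        for (Token t : JCasUtil.select(jcas, Token.class)) {
            System.out.println(t.getCoveredText() + "\t" + t.getPos().getPosValue()
                    + "\t" + t.getLemma().getValue());
        }
    }
}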

Phonetic Transcriptor

Table 11. Analysis Components in group Phonetic Transcriptor (4)
Component Description

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

ColognePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Inputs and outputs

Inputs

Outputs

DoubleMetaphonePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

MetaphonePhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

SoundexPhoneticTranscriptor

Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec. Works for English.

Inputs and outputs

Inputs

Outputs

Segmenter

Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a segmenter component to avoid the need to reorder the components in a pipeline when replacing one segmenter with another.
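
Because every segmenter produces both Sentence and Token annotations, swapping one for another is a one-line change, as in this minimal sketch:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;

public class SegmenterSwapExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("One sentence. Another sentence.");

        runPipeline(jcas,
                // any segmenter from the table below can be used here, e.g.
                // BreakIteratorSegmenter, OpenNlpSegmenter, or LanguageToolSegmenter;
                // downstream components are unaffected by the choice
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(OpenNlpPosTagger.class));
    }
}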

Table 12. Analysis Components in group Segmenter (17)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

ArktweetTokenizer

ArkTweet tokenizer.

BreakIteratorSegmenter

BreakIterator segmenter.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

ClearNlpSegmenter

Tokenizer using Clear NLP.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

JTokSegmenter

JTok segmenter.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

StanfordSegmenter

No description

TokenMerger

Merges any Tokens that are covered by a given annotation type.

TokenTrimmer

Remove prefixes and suffixes from tokens.

WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

AnnotationByLengthFilter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameters
FilterTypes (String[]) = []

A set of annotation types that should be filtered.

MaxLengthFilter (Integer) = 1000

Any annotation of a type listed in filterTypes that is longer than this value will be removed.

MinLengthFilter (Integer) = 0

Any annotation of a type listed in filterTypes that is shorter than this value will be removed.

ArktweetTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetTokenizer

ArkTweet tokenizer.

BreakIteratorSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter

BreakIterator segmenter.

Parameters
language (String) [optional]

The language.

splitAtApostrophe (Boolean) = false

By default, the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, an additional token split is generated when an apostrophe (') is encountered.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

CamelCaseTokenSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

Parameters
deleteCover (Boolean) = true

Whether to remove the original token. Default: true

Inputs and outputs

Inputs

Outputs

ClearNlpSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter

Tokenizer using Clear NLP.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

GermanSeparatedParticleAnnotator

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. There are many usages of German particle verbs where the stem and the particle are separated, e.g. "Wir fangen gleich an." The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an"; the proper verb lemma "anfangen" is thus not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of particle verbs (e.g. fangen) by the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.

Inputs and outputs

Inputs

Outputs

JTokSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jtok-asl
Class: de.tudarmstadt.ukp.dkpro.core.jtok.JTokSegmenter

JTok segmenter.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeParagraph (Boolean) = true

Create Paragraph annotations.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

LanguageToolSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

LineBasedSentenceSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence. This segmenter is not capable of creating tokens! All token-related parameters have no effect.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

OpenNlpSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelVariant (String) [optional]

Override the default variant used to locate the model.

segmentationModelLocation (String) [optional]

Load the segmentation model from this location instead of locating the model automatically.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenizationModelLocation (String) [optional]

Load the tokenization model from this location instead of locating the model automatically.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Inputs and outputs

Inputs

none specified

Outputs

Models
Language   Variant   Version

da         maxent    20120616.1
da         maxent    20120616.1
de         maxent    20120616.1
de         maxent    20120616.1
en         maxent    20120616.1
en         maxent    20120616.1
it         maxent    20130618.0
it         maxent    20130618.0
nb         maxent    20120131.1
nb         maxent    20120131.1
nl         maxent    20120616.1
nl         maxent    20120616.1
pt         maxent    20120616.1
pt         maxent    20120616.1
sv         maxent    20120616.1
sv         maxent    20120616.1

ParagraphSplitter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.ParagraphSplitter

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

Parameters
splitPattern (String) = ((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)

A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks)

Inputs and outputs

Inputs

none specified

Outputs

PatternBasedTokenSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars. The prefix states whether the split characters should be added as separate Tokens. If the #INCLUDE_PREFIX precedes the split pattern, the matched characters are added as Tokens. Consequently, patterns following the #EXCLUDE_PREFIX will not be added as Tokens. A configuration sketch follows the parameter list below.

Parameters
deleteCover (Boolean) = true

Wether to remove the original token. Default: true

patterns (String[])

A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed.
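
A minimal configuration sketch, assuming PARAM_PATTERNS corresponds to the patterns parameter above; INCLUDE_PREFIX and EXCLUDE_PREFIX are the constants referenced in the description:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter;

    public class PatternSplitterConfig {
        public static AnalysisEngineDescription create() throws Exception {
            // Split tokens at hyphens and slashes: hyphens become tokens of
            // their own (INCLUDE_PREFIX), slashes are dropped (EXCLUDE_PREFIX).
            return createEngineDescription(PatternBasedTokenSegmenter.class,
                PatternBasedTokenSegmenter.PARAM_PATTERNS, new String[] {
                    PatternBasedTokenSegmenter.INCLUDE_PREFIX + "[-]",
                    PatternBasedTokenSegmenter.EXCLUDE_PREFIX + "[/]" });
        }
    }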

Inputs and outputs

Inputs

Outputs

RegexTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.RegexTokenizer

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behaviour is to split sentences by a line break and tokens by whitespace.

Parameters
language (String) [optional]

The language.

sentenceBoundaryRegex (String) = \n

Defines the sentence boundary. Default: \n (assume one sentence per line).

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenBoundaryRegex (String) = [\s\n]+

Defines the pattern that is used as the token end boundary. Default: [\s\n]+ (matching whitespace and linebreaks).

When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n]. A configuration sketch follows the parameter list below.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.
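
As referenced above, a minimal configuration sketch for the "tokenized-text" case; the PARAM_… constants are assumed to correspond to the parameters listed above:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.tokit.RegexTokenizer;

    public class RegexTokenizerConfig {
        public static AnalysisEngineDescription create() throws Exception {
            // One sentence per line; tokens separated by hyphens. The newline
            // is part of the token boundary group so that the final token on
            // each line is terminated correctly.
            return createEngineDescription(RegexTokenizer.class,
                RegexTokenizer.PARAM_SENTENCE_BOUNDARY_REGEX, "\n",
                RegexTokenizer.PARAM_TOKEN_BOUNDARY_REGEX, "[-\n]");
        }
    }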

StanfordSegmenter

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter

Parameters
allowEmptySentences (Boolean) = false

Whether to generate empty sentences.

boundaryFollowers (String[]) = [), ], }, \", ', '', \u2019, \u201D, -RRB-, -RSB-, -RCB-, ), ], }] [optional]

A set of strings, matched with .equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".

boundaryToDiscard (String[]) = [, NL] [optional]

The set of regex for sentence boundary tokens that should be discarded.

boundaryTokenRegex (String) = \\.|[!?]+ [optional]

The set of boundary tokens. If null, use default.

isOneSentence (Boolean) = false

Whether to treat all input as one sentence.

language (String) [optional]

The language.

languageFallback (String) [optional]
newlineIsSentenceBreak (String) = TWO_CONSECUTIVE [optional]

Strategy for treating newlines as paragraph breaks.

regionElementRegex (String) [optional]

A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

tokenRegexesToDiscard (String[]) = [] [optional]

The set of regex for sentence boundary tokens that should be discarded.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

xmlBreakElementsToDiscard (String[]) [optional]

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

TokenMerger

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger

Merges any Tokens that are covered by a given annotation type. E.g. this component can be used to create a single token from all tokens that constitute a multi-token named entity.
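
A minimal sketch of the named-entity use case, assuming the usual PARAM_… constants for the parameters listed below and the DKPro Core NamedEntity type:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity;
    import de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger;

    public class NamedEntityTokenMerger {
        public static AnalysisEngineDescription create() throws Exception {
            // Merge the tokens of every location named entity into a single
            // token and tag the merged token as a proper noun.
            return createEngineDescription(TokenMerger.class,
                TokenMerger.PARAM_ANNOTATION_TYPE, NamedEntity.class.getName(),
                TokenMerger.PARAM_CONSTRAINT, ".[value = 'LOCATION']",
                TokenMerger.PARAM_POS_VALUE, "NNP");
        }
    }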

Parameters
POSMappingLocation (String) [optional]

Override the tagset mapping.

annotationType (String)

Annotation type for which tokens should be merged.

constraint (String) [optional]

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.

language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

lemmaMode (String) = JOIN

Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma, or to LEAVE the lemma of the first token as-is.

posType (String) [optional]

Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.

posValue (String) [optional]

Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Inputs and outputs

Inputs

Outputs

TokenTrimmer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer

Remove prefixes and suffixes from tokens.

Parameters
prefixes (String[])

List of prefixes to remove.

suffixes (String[])

List of suffixes to remove.

Inputs and outputs

Inputs

Outputs

WhitespaceTokenizer

Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.WhitespaceTokenizer

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.

Parameters
language (String) [optional]

The language.

strictZoning (Boolean) = false

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

writeSentence (Boolean) = true

Create Sentence annotations.

writeToken (Boolean) = true

Create Token annotations.

zoneTypes (String[]) = [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div] [optional]

A list of type names used for zoning.

Semantic role labeler

Table 13. Analysis Components in group Semantic role labeler (2)
Component Description

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

ClearNlpSemanticRoleLabeler

Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

Parameters
expandArguments (Boolean) = false

Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.

Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost; the arguments will appear to span a continuous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.

language (String) [optional]

Use this language instead of the document language to resolve the model.

modelVariant (String) [optional]

Variant of the model. Used to address a specific model if there are multiple models for one language.

predModelLocation (String) [optional]

Location from which the predicate identifier model is read.

printTagSet (Boolean) = false

Write the tag set(s) to the log when a model is loaded.

roleModelLocation (String) [optional]

Location from which the roleset classification model is read.

srlModelLocation (String) [optional]

Location from which the semantic role labeling model is read.

Inputs and outputs

Inputs

Outputs

MateSemanticRoleLabeler

Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

Please cite the following paper, if you use the semantic role labeler Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43--48, Boulder, June 4--5 2009.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model.

modelLocation (String) [optional]

Load the model from this location instead of locating the model automatically.

modelVariant (String) [optional]

Override the default variant used to locate the model.

Inputs and outputs

Inputs

Outputs

Models
Language   Variant     Version

de         tiger       20130105.0
en         conll2009   20130117.0
es         conll2009   20130320.0
zh         conll2009   20130117.0

Stemmer

Table 14. Analysis Components in group Stemmer (1)
Component Description

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

SnowballStemmer

Role: Stemmer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.snowball-asl
Class: de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.
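
A sketch of such an ordering, assuming the usual PARAM_… constants; the stop word list location is illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
    import de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer;
    import de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover;

    public class StemmingPipeline {
        public static void main(String[] args) throws Exception {
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(OpenNlpSegmenter.class),
                // Remove stop words first so only content words are stemmed.
                createEngineDescription(StopWordRemover.class,
                    StopWordRemover.PARAM_MODEL_LOCATION,
                    "[en]classpath:/stopwords/english.txt"),
                createEngineDescription(SnowballStemmer.class));
        }
    }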

Parameters
filterConditionOperator (String) [optional]

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

filterConditionValue (String) [optional]

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

filterFeaturePath (String) [optional]

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

language (String) [optional]

Use this language instead of the document language to resolve the model.

lowerCase (Boolean) = false [optional]

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples
Input         lowerCase=false (default)   lowerCase=true
EDUCATIONAL   EDUCATIONAL                 educ
Educational   Educat                      educ
educational   educ                        educ

paths (String[]) [optional]

Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Inputs and outputs

Inputs

none specified

Outputs

Topic Model

Topic modeling is a statistical approach to discover abstract topics in a collection of documents. A topic is characterized by a probability distribution of the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability by which a document belongs to each of the topics in the model.
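
A two-pass sketch of this workflow with the two Mallet components described below; the PARAM_… constants are assumed to correspond to the targetLocation, nTopics, and modelLocation parameters, and file locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelEstimator;
    import de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelInferencer;
    import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

    public class TopicModelExample {
        public static void main(String[] args) throws Exception {
            // Pass 1: estimate a topic model over the training collection.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "train/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(MalletTopicModelEstimator.class,
                    MalletTopicModelEstimator.PARAM_TARGET_LOCATION, "model.mallet",
                    MalletTopicModelEstimator.PARAM_N_TOPICS, 10));

            // Pass 2: infer topic distributions for unseen documents.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "unseen/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(MalletTopicModelInferencer.class,
                    MalletTopicModelInferencer.PARAM_MODEL_LOCATION, "model.mallet"));
        }
    }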

Table 15. Analysis Components in group Topic Model (2)
Component Description

MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file.

MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MalletTopicModelEstimator

Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelEstimator

Estimate an LDA topic model using Mallet and write it to a file. It stores all incoming CASes as Mallet Instances before estimating the model, using a ParallelTopicModel.

Parameters
alphaSum (Float) = 1.0

The sum of alphas over all topics. Default: 1.0.

Another recommended value is 50 / T (number of topics).

beta (Float) = 0.01

Beta for a single dimension of the Dirichlet prior. Default: 0.01.

burninPeriod (Integer) = 100

The number of iterations before hyperparameter optimization begins. Default: 100

displayInterval (Integer) = 50

The interval in which to display the estimated topics. Default: 50.

displayNTopicWords (Integer) = 7

The number of top words to display during estimation. Default: 7.

minTokenLength (Integer) = 3

Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.

modelEntityType (String) [optional]

If specified, the text contained in the given segmentation type annotations is fed as separate units to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.

By default, the full document text is used as a document.

nIterations (Integer) = 1000

The number of iterations during model estimation. Default: 1000.

nThreads (Integer) = 1

The number of threads to use during model estimation. Default: 1.

nTopics (Integer) = 10

The number of topics to estimate for the topic model.

optimizeInterval (Integer) = 50

Interval for optimizing Dirichlet hyperparameters. Default: 50

randomSeed (Integer) = -1

Set the random seed. If set to -1 (default), a random generator is used.

saveInterval (Integer) = 0

Define how often to save a serialized model during estimation. Default: 0 (only save when estimation is done).

targetLocation (String)

The target model file location.

typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

The annotation type to use for the topic model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.

useLemma (Boolean) = false

If set, uses lemmas instead of original text as features.

useSymmetricAlph (Boolean) = false

Whether to use a symmetric alpha value during model estimation. Default: false.

Inputs and outputs

Inputs

Outputs

none specified

MalletTopicModelInferencer

Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameters
burnIn (Integer) = 1

The number of iterations before hyperparameter optimization begins. Default: 1

maxTopicAssignments (Integer) = 0

Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is used.

minTokenLength (Integer) = 3

Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.

minTopicProb (Float) = 0.2

Minimum topic proportion for the document-topic assignment.

modelLocation (String)
nIterations (Integer) = 10

The number of iterations during inference. Default: 10.

thinning (Integer) = 5
typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

The annotation type to use as tokens. Default: Token

useLemma (Boolean) = false

If set, uses lemmas instead of original text as features.

Inputs and outputs

Inputs

Outputs

Transformer

Table 16. Analysis Components in group Transformer (13)
Component Description

CapitalizationNormalizer

Takes a text and corrects wrong capitalization.

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

HyphenationRemover

Simple dictionary-based hyphenation remover.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class does not work on tokens, as some expressions might span several tokens.

SharpSNormalizer

Takes a text and normalizes the German sharp s (ß).

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts, based on a frequency model.

CapitalizationNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer

Takes a text and corrects wrong capitalization.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

CjfNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameters
direction (String) = TO_SIMPLIFIED
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

DictionaryBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.
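
An illustrative mappings file (columns separated by the separator character, TAB by default; the entries are made up):

    # normalize common chat abbreviations
    u	you
    r	are
    thx	thanks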

Parameters
commentMarker (String) = #

Lines starting with this character (or String) are ignored. Default: '#'

modelEncoding (String) = UTF-8
modelLocation (String)
separator (String) = \t

Separator for mappings file. Default: "\t" (TAB).

typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

ExpressiveLengtheningNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer

Takes a text and shortens extra-long words.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

FileBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameters
ignoreCase (Boolean) = false
modelLocation (String)
replacement (String)
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

HyphenationRemover

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.HyphenationRemover

Simple dictionary-based hyphenation remover.

Parameters
modelEncoding (String) = UTF-8
modelLocation (String)
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

RegexBasedTokenTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

The parameter #PARAM_REGEX defines the regular expression to be searched for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.
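
A minimal configuration sketch (the regex and replacement are illustrative):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer;

    public class DigitMasker {
        public static AnalysisEngineDescription create() throws Exception {
            // Replace every token that consists only of digits by "0".
            return createEngineDescription(RegexBasedTokenTransformer.class,
                RegexBasedTokenTransformer.PARAM_REGEX, "[0-9]+",
                RegexBasedTokenTransformer.PARAM_REPLACEMENT, "0");
        }
    }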

Parameters
regex (String)

Define the regular expression to be replaced

replacement (String)

Define the string to replace matching tokens with

typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

ReplacementFileNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.ReplacementFileNormalizer

Takes a text and replaces desired expressions. This class does not work on tokens, as some expressions might span several tokens.

Parameters
modelLocation (String)

Location of a file which contains all replacing characters

srcExpressionSurroundings (String) = IRRELEVANT
targetExpressionSurroundings (String) = NOTHING
Inputs and outputs

Inputs

Outputs

SharpSNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.SharpSNormalizer

Takes a text and normalizes the German sharp s (ß).

Parameters
MinFrequencyThreshold (Integer) = 100
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

SpellingNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

StanfordPtbTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameters
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

TokenCaseTransformer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameters
tokenCase (String)
The case to convert tokens to:
  • UPPERCASE: uppercase everything.
  • LOWERCASE: lowercase everything.
  • NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

UmlautNormalizer

Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts, based on a frequency model.

Parameters
MinFrequencyThreshold (Integer) = 100
typesToCopy (String[]) = []

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Inputs and outputs

Inputs

Outputs

none specified

Other

Table 17. Analysis Components in group Other (20)
Component Description

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

CompoundAnnotator

Annotates compound parts and linking morphemes.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

NGramAnnotator

N-gram annotator.

NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TfidfConsumer

This consumer builds a DfModel.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

AnnotationByTextFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameters
ignoreCase (Boolean) = true

If true, annotation texts are filtered case-insensitively. Default: true, i.e. words that occur in the list with different casing are not filtered out.

modelEncoding (String) = UTF-8
modelLocation (String)
typeName (String) = de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.

ApplyChangesAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Inputs and outputs

Inputs

Outputs

Backmapper

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

Parameters
Chain (String[]) = [source, target] [optional]

Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].

CompoundAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.decompounding-asl
Class: de.tudarmstadt.ukp.dkpro.core.decompounding.uima.annotator.CompoundAnnotator

Annotates compound parts and linking morphemes.

Inputs and outputs

Inputs

Outputs

CorrectionsContextualizer

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.

DictionaryAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line; tokens are separated by space:

this is a phrase
another phrase

Parameters
annotationType (String) [optional]

The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String)

The file must contain one phrase per line - phrases will be split at " "

value (String) [optional]

The value to set the feature configured in #PARAM_VALUE_FEATURE to.

valueFeature (String) = value [optional]

Set this feature on the created annotations.

Inputs and outputs

Inputs

Outputs

none specified

JCasHolder

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

NGramAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ngrams-asl
Class: de.tudarmstadt.ukp.dkpro.core.ngrams.NGramAnnotator

N-gram annotator.

Parameters
N (Integer) = 3

The length of the n-grams to generate (the "n" in n-gram).

Inputs and outputs

Inputs

Outputs

NorvigSpellingCorrector

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.norvig-asl
Class: de.tudarmstadt.ukp.dkpro.core.norvig.NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

Inputs and outputs

Inputs

Outputs

PosFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameters
Verbs (Boolean) = false

Keep/remove verbs (true: keep, false: remove)

adj (Boolean) = false

Keep/remove adjectives (true: keep, false: remove)

adv (Boolean) = false

Keep/remove adverbs (true: keep, false: remove)

art (Boolean) = false

Keep/remove articles (true: keep, false: remove)

card (Boolean) = false

Keep/remove cardinal numbers (true: keep, false: remove)

conj (Boolean) = false

Keep/remove conjunctions (true: keep, false: remove)

n (Boolean) = false

Keep/remove nouns (true: keep, false: remove)

o (Boolean) = false

Keep/remove "others" (true: keep, false: remove)

pp (Boolean) = false

Keep/remove prepositions (true: keep, false: remove)

pr (Boolean) = false

Keep/remove pronouns (true: keep, false: remove)

punc (Boolean) = false

Keep/remove punctuation (true: keep, false: remove)

typeToRemove (String)

The fully qualified name of the type that should be filtered.
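
A configuration sketch that keeps only content words; since the exact PARAM_… constants are not listed here, the parameter names from the list above are passed as plain strings (uimaFIT accepts both):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
    import de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter;

    public class ContentWordFilter {
        public static AnalysisEngineDescription create() throws Exception {
            // Keep nouns, verbs, and adjectives; remove all other tokens.
            return createEngineDescription(PosFilter.class,
                "typeToRemove", Token.class.getName(),
                "n", true,
                "Verbs", true,
                "adj", true);
        }
    }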

Inputs and outputs

Inputs

Outputs

none specified

PosMapper

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

Parameters
dkproMappingLocation (String) [optional]

A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes.
If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.

mappingFile (String)

A properties file containing POS tagset mappings.

Inputs and outputs

Inputs

Outputs

ReadabilityAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.readability-asl
Class: de.tudarmstadt.ukp.dkpro.core.readability.ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.RegexTokenFilter

Remove every token that does or does not match a given regular expression.

Parameters
mustMatch (Boolean) = true

If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.

regex (String)

Every token that does or does not match this regular expression will be removed.

SemanticFieldAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

Parameters
annotationType (String)

Annotation types which should be annotated with semantic fields

constraint (String) [optional]

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity.

Inputs and outputs

Inputs

Outputs

StanfordDependencyConverter

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

Parameters
language (String) [optional]

Use this language instead of the document language to resolve the model and tag set mapping.

mode (String) = TREE [optional]

Sets the kind of dependencies being created.

Default: DependenciesMode#TREE

originalDependencies (Boolean) = true

Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.

Inputs and outputs

Inputs

Outputs

StopWordRemover

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stopwordremover-asl
Class: de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameters
Paths (String[]) [optional]

Feature paths for annotations that should be matched/removed. The default is

StopWord.class.getName()
Token.class.getName()
Lemma.class.getName()+"/value"

StopWordType (String) [optional]

Anything annotated with this type will be removed even if it does not match any word in the lists.

modelEncoding (String) = UTF-8

The character encoding used by the model.

modelLocation (String[])

A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"
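
A configuration sketch combining a general list with language-specific lists; PARAM_MODEL_LOCATION is assumed to correspond to the modelLocation parameter above, and the list locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover;

    public class StopWordConfig {
        public static AnalysisEngineDescription create() throws Exception {
            return createEngineDescription(StopWordRemover.class,
                StopWordRemover.PARAM_MODEL_LOCATION, new String[] {
                    "[*]classpath:/stopwords/symbols.txt",   // every document
                    "[de]classpath:/stopwords/de.txt",       // German only
                    "[en]classpath:/stopwords/en.txt" });    // English only
        }
    }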

Inputs and outputs

Inputs

Outputs

none specified

Stopwatch

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.performance-asl
Class: de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.
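
A sketch of this sandwich pattern, assuming PARAM_TIMER_NAME corresponds to the timerName parameter below; the measured component in the middle is arbitrary:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch;
    import de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer;

    public class TimedStemmer {
        public static AnalysisEngineDescription[] create() throws Exception {
            // The identical timer name links the two Stopwatch instances.
            return new AnalysisEngineDescription[] {
                createEngineDescription(Stopwatch.class,
                    Stopwatch.PARAM_TIMER_NAME, "stemmerTimer"),
                createEngineDescription(SnowballStemmer.class),
                createEngineDescription(Stopwatch.class,
                    Stopwatch.PARAM_TIMER_NAME, "stemmerTimer") };
        }
    }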

Parameters
timerName (String)

Name of the timer pair. Upstream and downstream timer need to use the same name.

timerOutputFile (String) [optional]

Optional location of a file to which the collected timing statistics are written.

Inputs and outputs

Inputs

Outputs

TfidfAnnotator

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfidfConsumer.

Parameters
featurePath (String)

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

lowercase (Boolean) = false [optional]

If set to true, the whole text is handled in lower case.

tfdfPath (String) [optional]

Provide the path to the Df-Model. When a SharedDfModel is bound to this annotator, this is ignored.

weightingModeIdf (String) = NORMAL [optional]

The model for inverse document frequency weighting.
Invoke toString() on an enum of WeightingModeIdf for setup.

Default value is "NORMAL" yielding an unweighted idf.

weightingModeTf (String) = NORMAL [optional]

The model for term frequency weighting.
Invoke toString() on an enum of WeightingModeTf for setup.

Default value is "NORMAL" yielding an unweighted tf.

Inputs and outputs

Inputs

none specified

Outputs

TfidfConsumer

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer

This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.

Parameters
featurePath (String)

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

lowercase (Boolean) = false

If set to true, the whole text is handled in lower case.

targetLocation (String)

Specifies the path and filename where the model file is written.
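
A two-pass sketch of the df/tf.idf workflow: the consumer first serializes a DfModel over the collection, then the TfidfAnnotator (described above) reads it back. The PARAM_… constants are assumed to correspond to the parameters listed for both components; file locations are illustrative:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
    import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

    import de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator;
    import de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer;
    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

    public class TfidfExample {
        private static final String FEATURE_PATH =
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token";

        public static void main(String[] args) throws Exception {
            // Pass 1: collect document frequencies and serialize the DfModel.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(TfidfConsumer.class,
                    TfidfConsumer.PARAM_FEATURE_PATH, FEATURE_PATH,
                    TfidfConsumer.PARAM_TARGET_LOCATION, "df.model"));

            // Pass 2: annotate each document with tf.idf weights.
            runPipeline(
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                createEngineDescription(BreakIteratorSegmenter.class),
                createEngineDescription(TfidfAnnotator.class,
                    TfidfAnnotator.PARAM_FEATURE_PATH, FEATURE_PATH,
                    TfidfAnnotator.PARAM_TFDF_PATH, "df.model"));
        }
    }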

TrailingCharacterRemover

Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

Parameters
minTokenLength (Integer) = 1

All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed.

Shorter tokens that do not have trailing chars removed are always retained, regardless of their length.

pattern (String) = [\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+

A regex to be trimmed from the end of tokens.

Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (remove punctuations, special characters and capital letters).

Appendix

Table 18. Producers and consumers by type
Type Producer Consumer

GrammarAnomaly
SpellingAnomaly
SuggestedAction
CoreferenceChain
CoreferenceLink
Tfidf
Morpheme
MorphologicalFeatures
POS
DocumentMetaData
NamedEntity
PhoneticTranscription
Compound
CompoundPart
Lemma
LinkingMorpheme
NGram
NamedEntity
Paragraph
Sentence
Split
Stem
StopWord
Token
SemanticArgument
SemanticPredicate
PennTree
Chunk
Constituent
Dependency
SofaChangeAnnotation
TopicDistribution
JapaneseToken
TimerAnnotation