The document provides detailed information about the DKPro Core UIMA components.

Overview

Analytics components

Table 1. Analysis Components (138)
Component Description

Annotation-By-Length Filter

Removes annotations that do not conform to minimum or maximum length constraints.

Annotation-By-Text Filter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

CAS Transformation - Apply

Applies changes annotated using a SofaChangeAnnotation.

ArkTweet POS-Tagger

Wrapper for Twitter Tokenizer and POS Tagger.

ArkTweet POS-Tagger Trainer

Trainer for ark-tweet POS tagger.

ArkTweet Tokenizer

ArkTweet tokenizer.

CAS Transformation - Map back

After processing a file with the ApplyChangesAnnotator, this annotator can be used to map the annotations created in the cleaned view back to the original view.

Berkeley Parser

Berkeley Parser annotator.

Java BreakIterator Segmenter

BreakIterator segmenter.

CamelCase Token Segmenter

Split up existing tokens again if they are camel-case text.

Capitalization Normalizer

Takes a text and replaces wrong capitalization.

CIS Stemmer

UIMA wrapper for the CISTEM algorithm.

Chinese Traditional/Simplified Converter

Converts traditional Chinese to simplified Chinese or vice-versa.

ClearNLP Lemmatizer

Lemmatizer using Clear NLP.

ClearNLP Parser

CLEAR parser annotator.

ClearNLP POS-Tagger

Part-of-Speech annotator using Clear NLP.

ClearNLP Segmenter

Tokenizer using Clear NLP.

ClearNLP Semantic Role Labeler

ClearNLP semantic role labeller.

Commons Codec Cologne Phonetic Transcriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

Compound Annotator

Annotates compound parts and linking morphemes.

CoreNLP Coreference Resolver

Deterministic coreference annotator from CoreNLP.

CoreNLP Dependency Parser

Dependency parser from CoreNLP.

CoreNLP Lemmatizer

Lemmatizer from CoreNLP.

CoreNLP Named Entity Recognizer

Named entity recognizer from CoreNLP.

CoreNLP Parser

Parser from CoreNLP.

CoreNLP POS-Tagger

Part-of-speech tagger from CoreNLP.

CoreNLP Segmenter

Tokenizer and sentence splitter from Stanford CoreNLP.

Corrections Contextualizer

This component assumes that some spell checker has already been applied upstream (e.g.

Dictionary Annotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

Dictionary-based Token Transformer

Reads a tab-separated file containing mappings from one token to another.

Commons Codec Double-Metaphone Phonetic Transcriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

Expressive Lengthening Normalizer

Takes a text and shortens extra long words.

File-based Token Transformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

FlexTag POS-Tagger

Flexible part-of-speech tagger.

GATE Lemmatizer

Wrapper for the GATE rule based lemmatizer.

German Separated Particle Annotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

Gosen Segmenter

Segmenter for Japanese text based on GoSen.

GATE Hepple POS-Tagger

GATE Hepple part-of-speech tagger.

HunPos POS-Tagger

Part-of-Speech annotator using HunPos.

Hyphenation Remover

Simple dictionary-based hyphenation remover.

ICU Segmenter

ICU segmenter.

IXA Lemmatizer

Lemmatizer using the OpenNLP-based Ixa implementation.

IXA POS-Tagger

Part-of-Speech annotator using OpenNLP with IXA extensions.

org.dkpro.core.textnormalizer.util.JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

JTok Segmenter

JTok segmenter.

Jazzy Spellchecker

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

Jieba Segmenter

Segmenter for Chinese using Jieba.

Lancaster Stemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

LangDetect

Langdetect language identifier based on character n-grams.

Web1T Language Detector

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

TextCat Language Identifier (Character N-Gram-based)

Detection based on character n-grams.

LanguageTool Grammar Checker

Detects grammatical errors in text using LanguageTool, a rule-based grammar checker.

LanguageTool Lemmatizer

Naive lexicon-based lemmatizer.

LanguageTool Segmenter

Segmenter using LanguageTool to do the heavy lifting.

Line-based Sentence Segmenter

Annotates each line in the source text as a sentence.

LingPipe Named Entity Recognizer

LingPipe named entity recognizer.

LingPipe Named Entity Recognizer Trainer

LingPipe named entity recognizer trainer.

LingPipe POS-Tagger

LingPipe part-of-speech tagger.

LingPipe Segmenter

LingPipe segmenter.

Mallet Embeddings Annotator

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

Mallet Embeddings Trainer

Compute word embeddings from the given collection using skip-grams.

Mallet LDA Topic Model Inferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Mallet LDA Topic Model Trainer

Estimate an LDA topic model using Mallet and write it to a file.

MaltParser Dependency Parser

Dependency parsing using MaltParser.

Mate Tools Lemmatizer

DKPro Core Annotator for the MateToolsLemmatizer.

Mate Tools Morphological Analyzer

DKPro Core Annotator for the MateToolsMorphTagger.

Mate Tools Dependency Parser

DKPro Annotator for the MateToolsParser.

Mate Tools POS-Tagger

DKPro Annotator for the MateToolsPosTagger.

Mate Tools Semantic Role Labeler

Annotator for the MateTools Semantic Role Labeler.

Maui Keyword Annotator

The Maui tool assigns keywords to documents.

MeCab POS-Tagger

Annotator for the MeCab Japanese POS Tagger.

Commons Codec Metaphone Phonetic Transcriptor

Metaphone phonetic transcription based on Apache Commons Codec.

Morpha Lemmatizer

Lemmatize based on a finite-state machine.

MSTParser Dependency Parser

Dependency parsing using MSTParser.

MyStem Stemmer

This MyStem stemmer implementation only works with the Russian language.

N-Gram Annotator

N-gram annotator.

NLP4J Dependency Parser

Emory NLP4J dependency parser.

NLP4J Lemmatizer

Emory NLP4J lemmatizer.

NLP4J Named Entity Recognizer

Emory NLP4J name finder wrapper.

NLP4J POS-Tagger

Part-of-Speech annotator using Emory NLP4J.

NLP4J Segmenter

Segmenter using Emory NLP4J.

Simple Spelling Corrector

Identifies spelling errors using Norvig's algorithm.

OpenNLP Chunker

Chunk annotator using OpenNLP.

OpenNLP Chunker Trainer

Train a chunker model for OpenNLP.

OpenNLP Lemmatizer

Lemmatizer using OpenNLP.

OpenNLP Lemmatizer Trainer

Train a lemmatizer model for OpenNLP.

OpenNLP Named Entity Recognizer

OpenNLP name finder wrapper.

OpenNLP Named Entity Recognizer Trainer

Train a named entity recognizer model for OpenNLP.

OpenNLP Parser

OpenNLP parser.

OpenNLP POS-Tagger

Part-of-Speech annotator using OpenNLP.

OpenNLP POS-Tagger Trainer

Train a POS tagging model for OpenNLP.

OpenNLP Segmenter

Tokenizer and sentence splitter using OpenNLP.

OpenNLP Sentence Splitter Trainer

Train a sentence splitter model for OpenNLP.

OpenNLP Snowball Stemmer

UIMA wrapper for the Snowball stemmer included with OpenNLP.

OpenNLP Tokenizer Trainer

Train a tokenizer model for OpenNLP.

Paragraph Splitter

This class creates paragraph annotations for the given input document.

Pattern-based Token Segmenter

Split up existing tokens again at particular split-chars.

Phrase Annotator

Annotate phrases in a sentence.

POS Filter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

POS Mapper

Maps existing POS tags from one tagset to another using a user provided properties file.

Readability Annotator

Assign a set of popular readability scores to the text.

Regex-based Token Transformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

Regex Segmenter

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

Regex Token Filter

Remove every token that does or does not match a given regular expression.

Replacement File Normalizer

Takes a text and replaces desired expressions.

RFTagger Morphological Analyzer

RFTagger morphological analyzer.

Semantic Field Annotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

SFST Morphological Analyzer

SFST morphological analyzer.

Sharp S (ß) Normalizer

Takes a text and replaces sharp s.

Lancaster Stemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

Snowball Stemmer

UIMA wrapper for the Snowball stemmer.

Commons Codec Soundex Phonetic Transcriptor

Soundex phonetic transcription based on Apache Commons Codec.

Spelling Normalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

CoreNLP Coreference Resolver (old API)

No description

CoreNLP Dependency Converter

Converts a constituency structure into a dependency structure.

CoreNLP Lemmatizer (old API)

Stanford Lemmatizer component.

CoreNLP Named Entity Recognizer (old API)

Stanford Named Entity Recognizer component.

CoreNLP Named Entity Recognizer Trainer

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

CoreNLP Parser (old API)

Stanford Parser component.

CoreNLP POS-Tagger (old API)

Stanford Part-of-Speech tagger component.

CoreNLP POS-Tagger Trainer

Train a POS tagging model for the Stanford POS tagger.

Stanford Penn Treebank Normalizer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

CoreNLP Segmenter (old API)

Stanford sentence splitter and tokenizer.

CoreNLP Sentiment Analyzer

Experimental wrapper for edu.stanford.nlp.pipeline.SentimentAnnotator which assigns 5 scores to each sentence.

Stop Word Remover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TF/IDF Annotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

Token Case Transformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Token Merger

Merges any Tokens that are covered by a given annotation type.

org.dkpro.core.tokit.TokenTrimmer

Remove prefixes and suffixes from tokens.

Trailing Character Remover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

TreeTagger Chunker

Chunk annotator using TreeTagger.

TreeTagger POS-Tagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

UDPipe Parsito Dependency Parser

Dependency parser using UDPipe.

UDPipe MorphoDiTa Morphological Analyzer

Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe.

UDPipe Segmenter

Tokenizer and sentence splitter using UDPipe.

Umlaut Normalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

Whitespace Segmenter

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

Checker

Table 2. Analysis Components in category Checker (2)
Component Description

JazzyChecker

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

LanguageToolChecker

Detects grammatical errors in text using LanguageTool, a rule-based grammar checker.

Jazzy Spellchecker

Short name

JazzyChecker

Category

Checker

Group ID

org.dkpro.core

Artifact ID

dkpro-core-jazzy-asl

Implementation

org.dkpro.core.jazzy.JazzyChecker

Description

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

Parameters
modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location from which the model is read. The model file is a simple word-list with one word per line.

Type: String

scoreThreshold

Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within edit distance 1 of the original word.

Type: Integer  — Default value: 1
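The effect of such a threshold can be illustrated with a minimal sketch. It assumes plain Levenshtein distance, where each insertion, deletion, and substitution costs 1; Jazzy's own scoring may weight edits differently. The class and method names are hypothetical and are not part of Jazzy or DKPro Core.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the Jazzy or DKPro Core implementation. */
final class EditDistanceFilter {

    /** Plain Levenshtein distance: insert, delete, and substitute each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Keeps only candidates whose distance to the word does not exceed the threshold. */
    static List<String> suggestions(String word, List<String> candidates, int threshold) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (levenshtein(word, candidate) <= threshold) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With a threshold of 1, the misspelling "hous" keeps "house" (one insertion) but drops "horse" and "mouse", which both need two edits.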

Table 3. Capabilities

Inputs

Token

Outputs

SpellingAnomaly SuggestedAction

Languages

none specified

LanguageTool Grammar Checker

Short name

LanguageToolChecker

Category

Checker

Group ID

org.dkpro.core

Artifact ID

dkpro-core-languagetool-asl

Implementation

org.dkpro.core.languagetool.LanguageToolChecker

Description

Detects grammatical errors in text using LanguageTool, a rule-based grammar checker.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

Table 4. Capabilities

Inputs

none specified

Outputs

GrammarAnomaly

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Chunker

Table 5. Analysis Components in category Chunker (3)
Component Description

OpenNlpChunker

Chunk annotator using OpenNLP.

OpenNlpChunkerTrainer

Train a chunker model for OpenNLP.

TreeTaggerChunker

Chunk annotator using TreeTagger.

OpenNLP Chunker

Short name

OpenNlpChunker

Category

Chunker

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpChunker

Description

Chunk annotator using OpenNLP.

Parameters
ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String
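As a rough illustration of the URI shape described for modelArtifactUri (a mvn: prefix followed by group ID, artifact ID, and version, separated by colons), the following sketch splits such a string into its coordinates. It is a hypothetical helper, not the resolver DKPro Core uses, and the coordinates in the usage note are made up for the example.

```java
/** Illustrative only: a hypothetical parser for mvn:groupId:artifactId:version strings. */
final class ModelArtifactUri {
    final String groupId;
    final String artifactId;
    final String version;

    private ModelArtifactUri(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    static ModelArtifactUri parse(String uri) {
        String prefix = "mvn:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a mvn: URI: " + uri);
        }
        // A limit of -1 keeps trailing empty parts so that "mvn:a:b:" is rejected below.
        String[] parts = uri.substring(prefix.length()).split(":", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected groupId:artifactId:version in " + uri);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty coordinate in " + uri);
            }
        }
        return new ModelArtifactUri(parts[0], parts[1], parts[2]);
    }
}
```

For example, ModelArtifactUri.parse("mvn:org.example:example-model:1.0") yields the group ID org.example, the artifact ID example-model, and the version 1.0.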

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 6. Capabilities

Inputs

POS Sentence Token

Outputs

Chunk

Languages

see available models

Table 7. Models
Language Variant Version

en

default

20100908.1

en

perceptron-ixa

20160205.1

OpenNLP Chunker Trainer

Short name

OpenNlpChunkerTrainer

Category

Chunker

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpChunkerTrainer

Description

Train a chunker model for OpenNLP.

Parameters
algorithm

Training algorithm.

Type: String  — Default value: MAXENT

beamSize

Beam size.

Type: Integer  — Default value: 3

cutoff

Frequency cut-off.

Type: Integer  — Default value: 5

iterations

Number of training iterations.

Type: Integer  — Default value: 100

language

Store this language to the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

targetLocation

Location to which the output is written.

Type: String

trainerType

Trainer type.

Type: String  — Default value: Event

TreeTagger Chunker

Short name

TreeTaggerChunker

Category

Chunker

Group ID

org.dkpro.core

Artifact ID

dkpro-core-treetagger-asl

Implementation

org.dkpro.core.treetagger.TreeTaggerChunker

Description

Chunk annotator using TreeTagger.

Parameters
ChunkMappingLocation

Location of the mapping file for chunk tags to UIMA types.

Optional — Type: String

executablePath

Use this TreeTagger executable instead of trying to locate the executable automatically.

Optional — Type: String

flushSequence

A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically just a sequence of about 5-10 full stops (".") separated by new line characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n....

Optional — Type: String
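A flush sequence of the typical kind described for flushSequence could be built as in the sketch below. The helper name is hypothetical, and whether the sequence should end with a trailing line break is not stated here, so none is added.

```java
import java.util.Collections;

/** Illustrative only: builds a string of full stops separated by new line characters. */
final class FlushSequence {

    static String fullStops(int count) {
        // nCopies creates "count" references to the same "." string; join adds the separators.
        return String.join("\n", Collections.nCopies(count, "."));
    }
}
```

Calling FlushSequence.fullStops(5) produces five full stops on five lines, which could then be passed as the parameter value.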

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

performanceMode

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Type: Boolean  — Default value: false

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 8. Capabilities

Inputs

POS

Outputs

Chunk

Languages

see available models

Table 9. Models
Language Variant Version

de

le

20110429.1

en

iso8859-le

20090824.1

en

le

20140520.1

fr

le

20141218.2

Coreference resolver

Table 10. Analysis Components in category Coreference resolver (2)
Component Description

CoreNlpCoreferenceResolver

Deterministic coreference annotator from CoreNLP.

StanfordCoreferenceResolver

No description

CoreNLP Coreference Resolver

Short name

CoreNlpCoreferenceResolver

Category

Coreference resolver

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpCoreferenceResolver

Description

Deterministic coreference annotator from CoreNLP.

Parameters
maxDist

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

Type: Integer  — Default value: -1

postprocessing

DCoRef parameter: Do post-processing

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

score

DCoRef parameter: Scoring the output of the system

Type: Boolean  — Default value: false

sieves

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

Type: String  — Default value: MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

singleton

DCoRef parameter: setting singleton predictor

Type: Boolean  — Default value: true

Table 11. Capabilities

Inputs

POS NamedEntity Lemma Sentence Token Constituent

Outputs

CoreferenceChain CoreferenceLink

Languages

none specified

CoreNLP Coreference Resolver (old API)

Short name

StanfordCoreferenceResolver

Category

Coreference resolver

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordCoreferenceResolver

Description
No description
Parameters
maxDist

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

Type: Integer  — Default value: -1

postprocessing

DCoRef parameter: Do post-processing

Type: Boolean  — Default value: false

score

DCoRef parameter: Scoring the output of the system

Type: Boolean  — Default value: false

sieves

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

Type: String  — Default value: MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

singleton

DCoRef parameter: setting singleton predictor

Type: Boolean  — Default value: true

Table 12. Capabilities

Inputs

POS NamedEntity Lemma Sentence Token Constituent

Outputs

CoreferenceChain CoreferenceLink

Languages

see available models

Table 13. Models
Language Variant Version

en

default

${core.version}.1

Embeddings

Table 14. Analysis Components in category Embeddings (2)
Component Description

MalletEmbeddingsAnnotator

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

MalletEmbeddingsTrainer

Compute word embeddings from the given collection using skip-grams.

Mallet Embeddings Annotator

Short name

MalletEmbeddingsAnnotator

Category

Embeddings

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mallet-asl

Implementation

org.dkpro.core.mallet.wordembeddings.MalletEmbeddingsAnnotator

Description

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

Parameters
annotateUnknownTokens
Specify how to handle unknown tokens:
  1. If this parameter is not specified, unknown tokens are not annotated.
  2. If an empty float[] is passed, a random vector is generated that is used for each unknown token.
  3. If a float[] is passed, each unknown token is annotated with that vector. The float[] must have the same length as the vectors in the model file.

Type: Boolean  — Default value: false

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

modelHasHeader

If set to true (default: false), the first line is interpreted as header line containing the number of entries and the dimensionality. This should be set to true for models generated with Word2Vec.

Type: Boolean  — Default value: false
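The role of the header line can be sketched with a small reader for text-format embeddings. It assumes that each entry line holds a word followed by its vector components, all separated by whitespace, and that the optional header holds the entry count and then the dimensionality. The class is hypothetical and does not reproduce the component's own reader.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: reads whitespace-separated text embeddings with an optional header. */
final class TextEmbeddingsReader {

    static Map<String, float[]> read(List<String> lines, boolean hasHeader) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        int expectedDimensions = -1;
        int start = 0;
        if (hasHeader) {
            // Header: "<number of entries> <dimensionality>"
            String[] header = lines.get(0).trim().split("\\s+");
            expectedDimensions = Integer.parseInt(header[1]);
            start = 1;
        }
        for (String line : lines.subList(start, lines.size())) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.trim().split("\\s+");
            float[] vector = new float[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                vector[i - 1] = Float.parseFloat(fields[i]);
            }
            if (expectedDimensions >= 0 && vector.length != expectedDimensions) {
                throw new IllegalArgumentException("Unexpected vector length for " + fields[0]);
            }
            vectors.put(fields[0], vector);
        }
        return vectors;
    }
}
```

If the header flag is false for a file that does have a header, this sketch would misread the header as an entry whose word is the entry count, which is why the flag needs to match the file.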

modelIsBinary

Whether the model is in binary format instead of text format.

Type: Boolean  — Default value: false

modelLocation

The file containing the word embeddings.

Currently only supports text file format.

Type: String

tokenFeaturePath

The annotation type to use for the model. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Table 15. Capabilities

Inputs

Token

Outputs

WordEmbedding

Languages

none specified

Mallet Embeddings Trainer

Short name

MalletEmbeddingsTrainer

Category

Embeddings

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mallet-asl

Implementation

org.dkpro.core.mallet.wordembeddings.MalletEmbeddingsTrainer

Description

Compute word embeddings from the given collection using skip-grams.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

coveringAnnotationType

If specified, the text contained in annotations of the given segmentation type is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.

By default, the full text is used as a document.

Type: String  — Default value: ``

dimensions

The dimensionality of the output word embeddings (default: 50).

Type: Integer  — Default value: 50

escapeFilename

URL-encode the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: false
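What URL-encoding does to a file name can be seen with the JDK's java.net.URLEncoder. Whether the component uses this exact class is an assumption, and the wrapper class below is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative only: URL-encodes a file name using the JDK encoder. */
final class FilenameEscaper {

    static String escape(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this encoder, a backslash becomes %5C and a colon becomes %3A, while letters, digits, and the dot are left unchanged.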

exampleWord

An example word that is output with its nearest neighbours once in a while (default: null, i.e. none).

Optional — Type: String

filterRegex

Regular expression of tokens to be filtered.

Type: String  — Default value: ``

filterRegexReplacement

Value with which tokens matching the regular expression are replaced.

Type: String  — Default value: ``

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

minDocumentLength

Ignore documents with fewer tokens than this value (default: 10).

Type: Integer  — Default value: 10

minTokenLength

Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value.

Type: Integer  — Default value: 3

numNegativeSamples

The number of negative samples to be generated for each token (default: 5).

Type: Integer  — Default value: 5

numThreads

The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int).

Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating.

Type: Integer  — Default value: 0

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

paramStopwordsFile

The location of the stopwords file.

Type: String  — Default value: ``

paramStopwordsReplacement

If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP).

Type: String  — Default value: ``
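The difference between removing stopwords and replacing them can be sketched as follows. The sketch assumes case-sensitive matching and treats an empty replacement string as "not set"; both are assumptions, and the class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: removes stopwords, or replaces them when a replacement is given. */
final class StopwordHandling {

    static List<String> apply(List<String> tokens, Set<String> stopwords, String replacement) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                result.add(token);
            } else if (!replacement.isEmpty()) {
                // Keep a placeholder so that the positions of the other tokens are preserved.
                result.add(replacement);
            }
        }
        return result;
    }
}
```

Replacing rather than removing keeps the distance between the remaining tokens intact, which may matter for window-based context models.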

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

tokenFeaturePath

The annotation type to use as input tokens for the model estimation. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

useCharacters

If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored.

Type: Boolean  — Default value: false

useDocumentId

Use the document ID as file name even if relative path information is present.

Type: Boolean  — Default value: false

windowSize

The context size when generating embeddings (default: 5).

Type: Integer  — Default value: 5
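To make the notion of context size concrete, the sketch below enumerates the (target, context) pairs that a skip-gram model would see for a symmetric window. This is a generic illustration of skip-gram training pairs under that assumption, not Mallet's implementation, which may sample or weight the window differently. The class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: enumerates skip-gram (target, context) pairs for a symmetric window. */
final class SkipGramPairs {

    static List<String[]> pairs(List<String> tokens, int windowSize) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    result.add(new String[] {tokens.get(i), tokens.get(j)});
                }
            }
        }
        return result;
    }
}
```

For the four tokens a, b, c, d, a window of 1 yields 6 pairs and a window of 2 yields 10, so larger windows add more and more distant context for each token.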

Gazeteer

Table 16. Analysis Components in category Gazeteer (1)
Component Description

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

Dictionary Annotator

Short name

DictionaryAnnotator

Category

Gazeteer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-dictionaryannotator-asl

Implementation

org.dkpro.core.dictionaryannotator.DictionaryAnnotator

Description

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The phrase file contains one phrase per line, with tokens separated by a single space:

this is a phrase
another phrase
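How a phrase file in this format could be matched against the tokens of a sentence is sketched below. The sketch assumes exact, case-sensitive token matching and reports every match as a pair of token indices (begin inclusive, end exclusive). How the component itself resolves overlapping matches is not described here. The class is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: finds phrases from a phrase list in a tokenized sentence. */
final class PhraseMatcher {

    static List<int[]> match(List<String> sentenceTokens, List<String> phraseLines) {
        List<int[]> spans = new ArrayList<>();
        for (String line : phraseLines) {
            if (line.isEmpty()) {
                continue;
            }
            // One phrase per line, tokens separated by a single space.
            List<String> phrase = Arrays.asList(line.split(" "));
            for (int start = 0; start + phrase.size() <= sentenceTokens.size(); start++) {
                int end = start + phrase.size();
                if (sentenceTokens.subList(start, end).equals(phrase)) {
                    spans.add(new int[] {start, end});
                }
            }
        }
        return spans;
    }
}
```

In a real pipeline, the token indices would be translated into character offsets of the first and last matching Token in order to create the annotation.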

Parameters
annotationType

The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

The file must contain one phrase per line; phrases are split at " ".

Type: String

value

The value to set the feature configured in #PARAM_VALUE_FEATURE to.

Optional — Type: String

valueFeature

Set this feature on the created annotations.

Optional — Type: String  — Default value: value

Table 17. Capabilities

Inputs

Sentence Token

Outputs

none specified

Languages

none specified

Language Identifier

Table 18. Analysis Components in category Language Identifier (3)
Component Description

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageIdentifier

Detection based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LangDetect

Short name

LangDetectLanguageIdentifier

Category

Language Identifier

Group ID

org.dkpro.core

Artifact ID

dkpro-core-langdetect-asl

Implementation

org.dkpro.core.langdetect.LangDetectLanguageIdentifier

Description

Langdetect language identifier based on character n-grams. Due to the way LangDetect is implemented, this component does not support being instantiated multiple times with different model locations. Only a single model location can be active at a time over all instances of this component.

Parameters
modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

seed

The random seed.

Optional — Type: String

Table 19. Models
Language Variant Version

any

socialmedia

20141013.1

any

wikipedia

20141013.1

TextCat Language Identifier (Character N-Gram-based)

Short name

LanguageIdentifier

Category

Language Identifier

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textcat-asl

Implementation

org.dkpro.core.textcat.LanguageIdentifier

Description

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References

  • Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
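The rank-order idea behind the Cavnar and Trenkle technique can be sketched in a few lines: build a frequency-ranked list of character n-grams for the document and for each language, then pick the language whose ranking is least "out of place". The sketch simplifies the original method (no token padding, n-grams up to length 3, ties broken alphabetically) and is not the Java Text Categorizing Library's implementation. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: simplified rank-order character n-gram language identification. */
final class NGramProfileSketch {

    /** Returns the topK most frequent character n-grams (n = 1..maxN), most frequent first. */
    static List<String> profile(String text, int maxN, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    /** Sums the rank differences; n-grams missing from the language profile get a maximum penalty. */
    static int outOfPlace(List<String> document, List<String> language) {
        int distance = 0;
        for (int rank = 0; rank < document.size(); rank++) {
            int other = language.indexOf(document.get(rank));
            distance += other < 0 ? language.size() : Math.abs(rank - other);
        }
        return distance;
    }

    /** Returns the language whose training text has the closest profile. */
    static String identify(String text, Map<String, String> trainingTextByLanguage) {
        List<String> document = profile(text, 3, 300);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> entry : trainingTextByLanguage.entrySet()) {
            int distance = outOfPlace(document, profile(entry.getValue(), 3, 300));
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In practice, the language profiles would be computed once from large training corpora and stored, rather than rebuilt from raw text on every call as in this sketch.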

Web1T Language Detector

Short name

LanguageDetectorWeb1T

Category

Language Identifier

Group ID

org.dkpro.core

Artifact ID

dkpro-core-ldweb1t-asl

Implementation

org.dkpro.core.ldweb1t.LanguageDetectorWeb1T

Description

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

Parameters
maxNGramSize

The maximum n-gram size that should be considered. Default is 3.

Type: Integer  — Default value: 3

minNGramSize

The minimum n-gram size that should be considered. Default is 1.

Type: Integer  — Default value: 1
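
To make the effect of minNGramSize and maxNGramSize concrete, the sketch below enumerates all contiguous token n-grams whose size lies in that range. The class NGramWindow is hypothetical and is not the component's code; it uses word n-grams on the assumption that the counts come from a word n-gram resource such as Web1T.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerate token n-grams between a minimum and maximum size. */
final class NGramWindow {

    /** All contiguous n-grams with minN <= n <= maxN, joined by single spaces. */
    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("Require 1 <= minN <= maxN");
        }
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                result.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return result;
    }
}
```

With the defaults (1 and 3), a three-token input yields three unigrams, two bigrams, and one trigram; a detector would then look up a frequency count for each of them per language.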

Lemmatizer

Table 20. Analysis Components in category Lemmatizer (11)
Component Description

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

CoreNlpLemmatizer

Lemmatizer from CoreNLP.

StanfordLemmatizer

Stanford Lemmatizer component.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

IxaLemmatizer

Lemmatizer using the OpenNLP-based Ixa implementation.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

MateLemmatizer

DKPro Core Annotator for the MateToolsLemmatizer.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

Nlp4JLemmatizer

Emory NLP4J lemmatizer.

OpenNlpLemmatizer

Lemmatizer using OpenNLP.

OpenNlpLemmatizerTrainer

Train a lemmatizer model for OpenNLP.

ClearNLP Lemmatizer

Short name

ClearNlpLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-clearnlp-asl

Implementation

org.dkpro.core.clearnlp.ClearNlpLemmatizer

Description

Lemmatizer using Clear NLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String  — Default value: en

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 21. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

see available models

Table 22. Models
Language Variant Version

en

default

20131111.0

CoreNLP Lemmatizer

Short name

CoreNlpLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpLemmatizer

Description

Lemmatizer from CoreNLP.

Parameters
ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true
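
The -LRB- and -RRB- forms mentioned for ptb3Escaping are the Penn Treebank replacements for round brackets. The sketch below shows only the bracket part of that convention; the class Ptb3Brackets is hypothetical, and the full set of transforms that CoreNLP applies when this option is enabled is broader than what is shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: Penn Treebank style escaping of bracket tokens only. */
final class Ptb3Brackets {
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();

    static {
        ESCAPES.put("(", "-LRB-"); // left round bracket
        ESCAPES.put(")", "-RRB-"); // right round bracket
        ESCAPES.put("[", "-LSB-"); // left square bracket
        ESCAPES.put("]", "-RSB-"); // right square bracket
        ESCAPES.put("{", "-LCB-"); // left curly bracket
        ESCAPES.put("}", "-RCB-"); // right curly bracket
    }

    /** Returns the escaped form of a bracket token, or the token unchanged. */
    static String escape(String token) {
        return ESCAPES.getOrDefault(token, token);
    }
}
```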

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 23. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

none specified

CoreNLP Lemmatizer (old API)

Short name

StanfordLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordLemmatizer

Description

Stanford Lemmatizer component. The Stanford Morphology class computes the base form of English words by removing just inflections (not derivational morphology). That is, it only handles noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for English.

Parameters
ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 24. Capabilities

Inputs

POS Token

Outputs

Lemma

Languages

en

GATE Lemmatizer

Short name

GateLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-gate-asl

Implementation

org.dkpro.core.gate.GateLemmatizer

Description

Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 25. Capabilities

Inputs

Token

Outputs

Lemma

Languages

see available models

Table 26. Models
Language Variant Version

en

default

20160531.0

IXA Lemmatizer

Short name

IxaLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-ixa-asl

Implementation

org.dkpro.core.ixa.IxaLemmatizer

Description

Lemmatizer using the OpenNLP-based Ixa implementation.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 27. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

see available models

Table 28. Models
Language Variant Version

de

perceptron-conll09

20160213.1

en

perceptron-conll09

20160211.1

en

perceptron-ud

20160214.1

en

xlemma-perceptron-ud

20160214.1

es

perceptron-ancora-2.0

20160211.1

eu

perceptron-ud

20160212.1

fr

perceptron-sequoia

20160215.1

gl

perceptron-autodict05-ctag

20160212.1

it

perceptron-ud

20160213.1

nl

perceptron-alpino

20160215.1

LanguageTool Lemmatizer

Short name

LanguageToolLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-languagetool-asl

Implementation

org.dkpro.core.languagetool.LanguageToolLemmatizer

Description

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.

Parameters
sanitize

Remove the characters specified in the sanitizeChars parameter from lemmas.

Type: Boolean  — Default value: true

sanitizeChars

Characters to remove from lemmas if the sanitize parameter is enabled.

Type: String[]  — Default value: [(, ), [, ]]
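
The selection rule described for this lemmatizer (take the most frequent lemma among the readings, fall back to the original text, then optionally strip the sanitize characters) can be sketched as follows. The class NaiveLemmaPicker is hypothetical and does not call LanguageTool; the readings are passed in as plain strings. Applying sanitization to the fallback text as well, and the tie-breaking order, are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the "most frequent lemma, else original text" rule. */
final class NaiveLemmaPicker {

    static String pick(String tokenText, List<String> readingLemmas,
            boolean sanitize, String[] sanitizeChars) {
        // Fallback: with no readings, the original text is used as the lemma.
        String lemma = tokenText;
        Map<String, Integer> counts = new LinkedHashMap<>();
        int best = 0;
        for (String candidate : readingLemmas) {
            int count = counts.merge(candidate, 1, Integer::sum);
            // On ties, the lemma that reached the top count first is kept.
            if (count > best) {
                best = count;
                lemma = candidate;
            }
        }
        if (sanitize) {
            for (String ch : sanitizeChars) {
                lemma = lemma.replace(ch, "");
            }
        }
        return lemma;
    }

    public static void main(String[] args) {
        String[] chars = { "(", ")", "[", "]" }; // mirrors the documented default
        System.out.println(pick("walking",
                java.util.Arrays.asList("walk", "walking", "walk"), true, chars));
    }
}
```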

Table 29. Capabilities

Inputs

Sentence Token

Outputs

Lemma

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Mate Tools Lemmatizer

Short name

MateLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-matetools-gpl

Implementation

org.dkpro.core.matetools.MateLemmatizer

Description

DKPro Core Annotator for the MateToolsLemmatizer.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

uppercase

Try to reconstruct proper casing for lemmata. This is useful for German, but produces odd results for some other languages, e.g. English.

Type: Boolean  — Default value: false

variant

Override the default variant used to locate the model.

Optional — Type: String

Table 30. Capabilities

Inputs

Sentence Token

Outputs

Lemma

Languages

see available models

Table 31. Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.1

es

conll2009

20130117.1

fr

ftb

20130918.0

Morpha Lemmatizer

Short name

MorphaLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-morpha-asl

Implementation

org.dkpro.core.morpha.MorphaLemmatizer

Description

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References

  • Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English. Natural Language Engineering, 7(3), 207-223.

Parameters
readPOS

Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.

Type: Boolean  — Default value: false

Table 32. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

en

NLP4J Lemmatizer

Short name

Nlp4JLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-nlp4j-asl

Implementation

org.dkpro.core.nlp4j.Nlp4JLemmatizer

Description

Emory NLP4J lemmatizer. This is a lower-casing lemmatizer.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

Table 33. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

none specified

OpenNLP Lemmatizer

Short name

OpenNlpLemmatizer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpLemmatizer

Description

Lemmatizer using OpenNLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 34. Capabilities

Inputs

POS Sentence Token

Outputs

Lemma

Languages

none specified

OpenNLP Lemmatizer Trainer

Short name

OpenNlpLemmatizerTrainer

Category

Lemmatizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpLemmatizerTrainer

Description

Train a lemmatizer model for OpenNLP.

Parameters
algorithm

Training algorithm.

Type: String  — Default value: MAXENT

beamSize

Beam size.

Type: Integer  — Default value: 3

cutoff

Frequency cut-off.

Type: Integer  — Default value: 5

iterations

Number of training iterations.

Type: Integer  — Default value: 100

language

Store this language to the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

targetLocation

Location to which the output is written.

Type: String

trainerType

Trainer type.

Type: String  — Default value: Event

Morphological analyzer

Table 35. Analysis Components in category Morphological analyzer (3)
Component Description

MateMorphTagger

DKPro Core Annotator for the MateToolsMorphTagger.

RfTagger

RFTagger morphological analyzer.

SfstAnnotator

SFST morphological analyzer.

Mate Tools Morphological Analyzer

Short name

MateMorphTagger

Category

Morphological analyzer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-matetools-gpl

Implementation

org.dkpro.core.matetools.MateMorphTagger

Description

DKPro Core Annotator for the MateToolsMorphTagger.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 36. Capabilities

Inputs

Lemma Sentence Token

Outputs

Morpheme MorphologicalFeatures

Languages

see available models

Table 37. Models
Language Variant Version

de

tiger

20121024.1

es

conll2009

20130117.1

fr

ftb

20130918.0

RFTagger Morphological Analyzer

Short name

RfTagger

Category

Morphological analyzer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-rftagger-asl

Implementation

org.dkpro.core.rftagger.RfTagger

Description

RFTagger morphological analyzer.

Parameters
MorphMappingLocation

Load the morphological features mapping from this location instead of locating the mapping automatically.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

Table 38. Capabilities

Inputs

Sentence Token

Outputs

MorphologicalFeatures POS

Languages

see available models

Table 39. Models
Language Variant Version

cz

cac

20150728.1

de

tiger

20150928.1

hu

szeged

20150728.1

ru

ric

20150728.1

sk

snk

20150728.1

sl

jos

20150728.1

SFST Morphological Analyzer

Short name

SfstAnnotator

Category

Morphological analyzer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-sfst-gpl

Implementation

org.dkpro.core.sfst.SfstAnnotator

Description

SFST morphological analyzer.

Parameters
MorphMappingLocation

Load the morphological features mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lowercaseFirstWord

Whether to look up the first word of a sentence in lowercase. This is useful if the employed model does not handle lowercasing.

Optional — Type: Boolean  — Default value: false

mode

Whether to record only the first (FIRST) or all possible analyses (ALL).

Type: String  — Default value: FIRST

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

Specifies the model encoding.

Type: String  — Default value: UTF-8

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

writeLemma

Write lemma information.

Type: Boolean  — Default value: true

writePOS

Write part-of-speech information.

Type: Boolean  — Default value: true

Table 40. Capabilities

Inputs

Sentence Token

Outputs

MorphologicalFeatures POS

Languages

see available models

Table 41. Models
Language Variant Version

de

morphisto-ca

20110202.1

de

smor-ca

20140801.1

de

zmorge-newlemma-ca

20140521.1

de

zmorge-orig-ca

20140521.1

it

pippi-ca

20090223.1

tr

trmorph-ca

20130219.1

Named Entity Recognizer

Table 42. Analysis Components in category Named Entity Recognizer (9)
Component Description

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

CoreNlpNamedEntityRecognizer

Named entity recognizer from CoreNLP.

StanfordNamedEntityRecognizerTrainer

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

LingPipeNamedEntityRecognizer

LingPipe named entity recognizer.

LingPipeNamedEntityRecognizerTrainer

LingPipe named entity recognizer trainer.

Nlp4JNamedEntityRecognizer

Emory NLP4J name finder wrapper.

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

OpenNlpNamedEntityRecognizerTrainer

Train a named entity recognizer model for OpenNLP.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

CoreNLP Named Entity Recognizer (old API)

Short name

StanfordNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

Description

Stanford Named Entity Recognizer component.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 43. Capabilities

Inputs

Sentence Token

Outputs

NamedEntity

Languages

see available models

Table 44. Models
Language Variant Version

de

germeval2014.hgc_175m_600.crf

20180227.1

de

nemgp

20141024.1

en

all.3class.caseless.distsim.crf

20161213.0

en

all.3class.distsim.crf

20161213.1

en

all.3class.nodistsim.crf

20160110.1

en

conll.4class.caseless.distsim.crf

20160110.0

en

conll.4class.distsim.crf

20150420.1

en

conll.4class.nodistsim.crf

20160110.1

en

freme-wikiner

20150925.1

en

muc.7class.caseless.distsim.crf

20150129.0

en

muc.7class.distsim.crf

20150129.1

en

muc.7class.nodistsim.crf

20160110.1

en

nowiki.3class.caseless.distsim.crf

20161213.0

en

nowiki.3class.nodistsim.crf

20160110.0

es

ancora.distsim.s512.crf

20161211.1

es

freme-wikiner

20150925.1

fr

freme-wikiner

20150925.1

it

freme-wikiner

20150925.1

nl

freme-wikiner

20150925.1

ru

freme-wikiner

20160726.1

CoreNLP Named Entity Recognizer

Short name

CoreNlpNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpNamedEntityRecognizer

Description

Named entity recognizer from CoreNLP.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

applyNumericClassifiers

Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Maximum sentence length. Longer sentences are skipped.

Type: Integer  — Default value: 2147483647

maxTime

Maximum time to spend on a single sentence.

Type: Integer  — Default value: -1

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Number of parallel threads to use.

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 45. Capabilities

Inputs

Sentence Token

Outputs

NamedEntity

Languages

none specified

CoreNLP Named Entity Recognizer Trainer

Short name

StanfordNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizerTrainer

Description

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

Parameters
acceptedTagsRegex

Regex to filter named entities by type, i.e. by the value returned by de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue().

Optional — Type: String
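
As an illustration of filtering entity types with a regular expression, the sketch below keeps only those type values that match a given pattern. The class AcceptedTags is hypothetical and works on plain strings instead of NamedEntity annotations; whether the trainer matches the whole value or only part of it is not stated here, so the use of a full match is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: keep entity type values that fully match a regex. */
final class AcceptedTags {

    /** A null regex means "accept everything", mirroring an unset optional parameter. */
    static List<String> filter(List<String> values, String acceptedTagsRegex) {
        if (acceptedTagsRegex == null) {
            return new ArrayList<>(values);
        }
        Pattern pattern = Pattern.compile(acceptedTagsRegex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (value != null && pattern.matcher(value).matches()) {
                kept.add(value);
            }
        }
        return kept;
    }
}
```

For example, the pattern PER|LOC would restrict training to person and location entities under this reading.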

entitySubClassification

Label set to use for training.

Options: IOB1, IOB2, IOE1, IOE2, SBIEO, IO, BIO, BILOU, noprefix

Optional — Type: String  — Default value: noprefix
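
Several of the listed label sets differ only in how they mark the position of a token inside an entity. The sketch below encodes a single entity span under three of them (IO, BIO, BILOU) following their usual definitions. The class SpanEncoder is hypothetical and is not taken from the Stanford trainer; tokens outside any entity would receive the tag O, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: tags for one entity span under IO, BIO and BILOU. */
final class SpanEncoder {

    /** Returns one tag per token of an entity that covers `length` tokens. */
    static List<String> encode(String type, int length, String scheme) {
        if (length < 1) {
            throw new IllegalArgumentException("An entity covers at least one token");
        }
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            boolean first = i == 0;
            boolean last = i == length - 1;
            String prefix;
            switch (scheme) {
            case "IO": // inside only, no boundary information
                prefix = "I";
                break;
            case "BIO": // begin, then inside
                prefix = first ? "B" : "I";
                break;
            case "BILOU": // begin, inside, last, or unit for single-token entities
                if (first && last) {
                    prefix = "U";
                } else if (first) {
                    prefix = "B";
                } else if (last) {
                    prefix = "L";
                } else {
                    prefix = "I";
                }
                break;
            default:
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
            }
            tags.add(prefix + "-" + type);
        }
        return tags;
    }
}
```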

propertiesFile

Training file containing the parameters. The trainFile or trainFileList and serializeTo parameters in this file are ignored/overridden.

Optional — Type: String

retainClassification

Flag to keep the label set specified by PARAM_LABEL_SET. If set to false, representation is mapped to IOB1 on output.

Optional — Type: Boolean  — Default value: true

targetLocation

Location of the target model file.

Type: String

Table 46. Capabilities

Inputs

NamedEntity Sentence Token

Outputs

none specified

Languages

none specified

LingPipe Named Entity Recognizer

Short name

LingPipeNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-lingpipe-gpl

Implementation

org.dkpro.core.lingpipe.LingPipeNamedEntityRecognizer

Description

LingPipe named entity recognizer.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 47. Capabilities

Inputs

Token

Outputs

NamedEntity

Languages

see available models

Table 48. Models
Language Variant Version

en

bio-genetag

20110623.1

en

bio-genia

20110623.1

en

news-muc6

20110623.1

LingPipe Named Entity Recognizer Trainer

Short name

LingPipeNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-lingpipe-gpl

Implementation

org.dkpro.core.lingpipe.LingPipeNamedEntityRecognizerTrainer

Description

LingPipe named entity recognizer trainer.

Parameters
acceptedTagsRegex

Regex to filter named entities by type, i.e. by the value returned by de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue().

Optional — Type: String

targetLocation

Location to which the output is written.

Type: String

NLP4J Named Entity Recognizer

Short name

Nlp4JNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-nlp4j-asl

Implementation

org.dkpro.core.nlp4j.Nlp4JNamedEntityRecognizer

Description

Emory NLP4J name finder wrapper.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 49. Capabilities

Inputs

POS Lemma Sentence Token

Outputs

NamedEntity

Languages

see available models

Table 50. Models
Language Variant Version

en

default

20160802.0

OpenNLP Named Entity Recognizer

Short name

OpenNlpNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer

Description

OpenNLP name finder wrapper.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Type: String  — Default value: person

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 51. Capabilities

Inputs

Token

Outputs

NamedEntity

Languages

see available models

Table 52. Models
Language Variant Version

de

nemgp

20141024.1

en

date

20100907.0

en

location

20100907.0

en

money

20100907.0

en

organization

20100907.0

en

percentage

20100907.0

en

person

20130624.1

en

time

20100907.0

es

location

20100908.0

es

misc

20100908.0

es

organization

20100908.0

es

person

20100908.0

nl

location

20100908.0

nl

misc

20100908.0

nl

organization

20100908.0

nl

person

20100908.0

OpenNLP Named Entity Recognizer Trainer

Short name

OpenNlpNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpNamedEntityRecognizerTrainer

Description

Train a named entity recognizer model for OpenNLP.

Parameters
acceptedTagsRegex

Regex used to filter named entities by type, i.e. by the value returned by de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue().

Optional — Type: String
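The effect of such a filter can be sketched in plain Java. The class below is hypothetical and does not use the DKPro Core API; it assumes that the regular expression must match the whole entity value (Matcher.matches), which may differ from how the trainer applies the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of filtering named entity values by a regex. */
final class AcceptedTagsSketch {

    /** Keeps only those values that the regex matches in full. */
    static List<String> filter(List<String> values, String acceptedTagsRegex) {
        Pattern pattern = Pattern.compile(acceptedTagsRegex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value).matches()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example entity values; real values depend on the training corpus.
        List<String> values = Arrays.asList("PER", "LOC", "ORG", "MISC");
        System.out.println(filter(values, "PER|LOC"));
    }
}
```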

algorithm

Type: String  — Default value: PERCEPTRON

beamSize

Type: Integer  — Default value: 3

cutoff

Frequency cut-off.

Type: Integer  — Default value: 0

featureGen

File containing the feature generation specification.

Optional — Type: String

iterations

Number of training iterations.

Type: Integer  — Default value: 300

language

Store this language to the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

sequenceEncoding

Type of sequence encoding to use.

Type: String  — Default value: BILOU
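For readers unfamiliar with BILOU, the scheme tags each token as Begin, Inside, or Last of a multi-token entity, as Outside any entity, or as a single-token Unit. The following Java sketch shows the general scheme only. The class and the B-/I-/L-/U-/O label spelling are illustrative assumptions; the labels OpenNLP stores internally may be spelled differently. Spans are assumed to be non-empty and non-overlapping.

```java
import java.util.Arrays;

/** Hypothetical illustration of BILOU sequence encoding over token indices. */
final class BilouSketch {

    /** Entity covering tokens [begin, end) with a type label. */
    static final class Span {
        final int begin;
        final int end;
        final String type;

        Span(int begin, int end, String type) {
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    /** Produces one tag per token; tokens outside all spans get "O". */
    static String[] encode(int tokenCount, Span... spans) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O");
        for (Span s : spans) {
            if (s.end - s.begin == 1) {
                tags[s.begin] = "U-" + s.type; // single-token entity
                continue;
            }
            tags[s.begin] = "B-" + s.type;
            for (int i = s.begin + 1; i < s.end - 1; i++) {
                tags[i] = "I-" + s.type;
            }
            tags[s.end - 1] = "L-" + s.type;
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens: John lives in New York City .
        String[] tags = encode(7, new Span(0, 1, "PER"), new Span(3, 6, "LOC"));
        System.out.println(Arrays.toString(tags));
        // prints [U-PER, O, O, B-LOC, I-LOC, L-LOC, O]
    }
}
```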

targetLocation

Location to which the output is written.

Type: String

trainerType

Training algorithm.

Type: String  — Default value: Event

Table 53. Capabilities

Inputs

NamedEntity Sentence Token

Outputs

none specified

Languages

none specified

Semantic Field Annotator

Short name

SemanticFieldAnnotator

Category

Named Entity Recognizer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-dictionaryannotator-asl

Implementation

org.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator

Description

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

Parameters
annotationType

Annotation types which should be annotated with semantic fields.

Type: String

constraint

A constraint on the annotations that should be considered, in the form of a JXPath statement. Example: set annotationType to a NamedEntity type and set constraint to ".[value = 'LOCATION']" to annotate with semantic fields only those tokens that are part of a location named entity.

Optional — Type: String

Table 54. Capabilities

Inputs

POS Lemma Token

Outputs

NamedEntity

Languages

none specified

Parser

Table 55. Analysis Components in category Parser (12)
Component Description

BerkeleyParser

Berkeley Parser annotator.

ClearNlpParser

CLEAR parser annotator.

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

CoreNlpDependencyParser

Dependency parser from CoreNLP.

CoreNlpParser

Parser from CoreNLP.

StanfordParser

Stanford Parser component.

MstParser

Dependency parsing using MSTParser.

MaltParser

Dependency parsing using MaltParser.

MateParser

DKPro Annotator for the MateToolsParser.

Nlp4JDependencyParser

Emory NLP4J dependency parser.

OpenNlpParser

OpenNLP parser.

UDPipeParser

Dependency parser using UDPipe.

Berkeley Parser

Short name

BerkeleyParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-berkeleyparser-gpl

Implementation

org.dkpro.core.berkeleyparser.BerkeleyParser

Description

Berkeley Parser annotator. Requires Sentences to be annotated before.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

accurate

Set thresholds for accuracy instead of efficiency.

Type: Boolean  — Default value: false

binarize

Output binarized trees.

Type: Boolean  — Default value: false

keepFunctionLabels

Retain predicted function labels. Model must have been trained with function labels.

Type: Boolean  — Default value: false

language

Use this language instead of the language set in the CAS to locate the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

readPOS

Sets whether to use or not to use already existing POS tags from another annotator for the parsing process.

Type: Boolean  — Default value: true

scores

Output inside scores (only for binarized Viterbi trees).

Type: Boolean  — Default value: false

substates

Output sub-categories (only for binarized Viterbi trees).

Type: Boolean  — Default value: false

variational

Use variational rule score approximation instead of max-rule.

Type: Boolean  — Default value: false

viterbi

Compute Viterbi derivation instead of max-rule tree.

Type: Boolean  — Default value: false

writePOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Type: Boolean  — Default value: false
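To show what a parse tree in Penn Treebank style bracketing looks like, the following Java sketch serializes a small hand-built tree. The Node class and the example sentence are hypothetical and unrelated to the DKPro Core type system; the exact string stored in the PennTree annotation (for example, whether it carries a ROOT node) is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of Penn Treebank style bracketing. */
final class PennTreeSketch {

    /** A tree node: either a phrase with children or a POS-tagged word. */
    static final class Node {
        final String label;
        final String word; // null for phrase nodes
        final List<Node> children = new ArrayList<>();

        private Node(String label, String word) {
            this.label = label;
            this.word = word;
        }

        static Node leaf(String pos, String word) {
            return new Node(pos, word);
        }

        static Node phrase(String label, Node... kids) {
            Node node = new Node(label, null);
            node.children.addAll(Arrays.asList(kids));
            return node;
        }
    }

    /** Renders the tree as nested "(LABEL child child ...)" brackets. */
    static String toPenn(Node node) {
        if (node.word != null) {
            return "(" + node.label + " " + node.word + ")";
        }
        StringBuilder sb = new StringBuilder("(").append(node.label);
        for (Node child : node.children) {
            sb.append(' ').append(toPenn(child));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Node tree = Node.phrase("S",
                Node.phrase("NP", Node.leaf("DT", "The"), Node.leaf("NN", "dog")),
                Node.phrase("VP", Node.leaf("VBZ", "barks")));
        System.out.println(toPenn(tree));
        // prints (S (NP (DT The) (NN dog)) (VP (VBZ barks)))
    }
}
```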

Table 56. Capabilities

Inputs

Sentence Token

Outputs

PennTree Constituent

Languages

see available models

Table 57. Models
Language Variant Version

ar

sm5

20090917.1

bg

sm5

20090917.1

de

sm5

20090917.1

en

sm6

20100819.1

fr

sm5

20090917.1

zh

sm5

20090917.1

ClearNLP Parser

Short name

ClearNlpParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-clearnlp-asl

Implementation

org.dkpro.core.clearnlp.ClearNlpParser

Description

CLEAR parser annotator.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

Table 58. Capabilities

Inputs

POS Lemma Sentence Token

Outputs

Dependency

Languages

see available models

Table 59. Models
Language Variant Version

en

mayo

20131111.0

en

ontonotes

20131128.0

CoreNLP Dependency Converter

Short name

StanfordDependencyConverter

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordDependencyConverter

Description

Converts a constituency structure into a dependency structure.

Parameters
language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mode

Sets the kind of dependencies being created.

Optional — Type: String  — Default value: TREE

originalDependencies

Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.

Type: Boolean  — Default value: true

Table 60. Capabilities

Inputs

Token Constituent

Outputs

Dependency

Languages

none specified

CoreNLP Dependency Parser

Short name

CoreNlpDependencyParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpDependencyParser

Description

Dependency parser from CoreNLP.

Parameters
DependencyMappingLocation

Location of the mapping file for dependency tags to UIMA types.

Optional — Type: String

extraDependencies

Types of extra edges to add to the dependency tree.

Type: String  — Default value: NONE

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

maxSentenceLength

Maximum sentence length. Longer sentences are skipped.

Type: Integer  — Default value: 2147483647

maxTime

Maximum time to spend on a single sentence.

Type: Integer  — Default value: -1

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Number of parallel threads to use.

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true
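The bracket escapes named above (-LRB-, -RRB-) follow the Penn Treebank convention of replacing bracket characters with symbolic tokens. The Java sketch below covers only that bracket mapping. The class name is hypothetical, and any further transforms the wrapped library may apply under this option are not reproduced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of Penn Treebank bracket escaping for tokens. */
final class Ptb3EscapeSketch {
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();

    static {
        ESCAPES.put("(", "-LRB-"); // left round bracket
        ESCAPES.put(")", "-RRB-"); // right round bracket
        ESCAPES.put("[", "-LSB-"); // left square bracket
        ESCAPES.put("]", "-RSB-"); // right square bracket
        ESCAPES.put("{", "-LCB-"); // left curly bracket
        ESCAPES.put("}", "-RCB-"); // right curly bracket
    }

    /** Returns the escaped form of a bracket token, or the token unchanged. */
    static String escape(String token) {
        return ESCAPES.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        String[] tokens = {"See", "(", "Table", "1", ")", "."};
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(escape(token));
        }
        System.out.println(sb);
        // prints See -LRB- Table 1 -RRB- .
    }
}
```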

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 61. Capabilities

Inputs

POS Sentence Token

Outputs

Dependency

Languages

see available models

Table 62. Models
Language Variant Version

de

ud

20161213.1

en

ptb-conll

20160119.1

en

sd

20150418.1

en

ud

20161213.1

en

wsj-sd

20150418.1

en

wsj-ud

20161213.1

fr

ud

20180227.1

zh

ctb-conll

20160119.1

zh

ptb-conll

20161223.1

zh

ud

20161223.1

CoreNLP Parser

Short name

CoreNlpParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpParser

Description

Parser from CoreNLP.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

DependencyMappingLocation

Location of the mapping file for dependency tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

extraDependencies

Types of extra edges to add to the dependency tree.

Type: String  — Default value: NONE

keepPunctuation

Whether to keep punctuation dependencies in the dependency parse output of the parser.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

maxSentenceLength

Maximum sentence length. Longer sentences are skipped.

Type: Integer  — Default value: 2147483647

maxTime

Maximum time to spend on a single sentence.

Type: Integer  — Default value: -1

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Number of parallel threads to use.

Type: Integer  — Default value: 0

originalDependencies

Generate original Stanford Dependencies grammatical relations instead of Universal Dependencies.

Type: Boolean  — Default value: true

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

readPOS

Sets whether to use or not to use existing POS tags.

Type: Boolean  — Default value: true

writeConstituent

Sets whether to create or not to create constituent tags. This is required for POS-tagging and lemmatization.

Type: Boolean  — Default value: true

writeDependency

Sets whether to create or not to create dependency annotations.

Type: Boolean  — Default value: true

writePOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Type: Boolean  — Default value: false

Table 63. Capabilities

Inputs

POS Sentence Token

Outputs

Constituent Dependency

Languages

none specified

CoreNLP Parser (old API)

Short name

StanfordParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordParser

Description

Stanford Parser component.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

annotationTypeToParse

This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing.

If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations.

Optional — Type: String

keepPunctuation

Whether to keep the punctuation as part of the parse tree.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

maxItems

Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser.

Type: Integer  — Default value: 200000

maxSentenceLength

Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions.

Type: Integer  — Default value: 130
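The length guard described above amounts to a simple token-count check before parsing. The Java sketch below is a hypothetical illustration using plain lists of token strings rather than UIMA annotations; it assumes a sentence whose token count equals the limit is still parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of skipping sentences that exceed a token limit. */
final class SentenceLengthGuardSketch {

    /** Returns the sentences (as token lists) that do not exceed the limit. */
    static List<List<String>> parseable(List<List<String>> sentences, int maxSentenceLength) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> tokens : sentences) {
            if (tokens.size() <= maxSentenceLength) {
                kept.add(tokens);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("Short", "one", "."),
                Arrays.asList("This", "one", "is", "a", "bit", "longer", "."));
        // With a limit of 5 tokens only the first sentence would be parsed.
        System.out.println(parseable(sentences, 5));
    }
}
```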

mode

Sets the kind of dependencies being created.

Optional — Type: String  — Default value: TREE

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

readPOS

Sets whether to use or not to use already existing POS tags from another annotator for the parsing process.

Type: Boolean  — Default value: true

writeConstituent

Sets whether to create or not to create constituent tags. This is required for POS-tagging and lemmatization.

Type: Boolean  — Default value: true

writeDependency

Sets whether to create or not to create dependency annotations.

Type: Boolean  — Default value: true

writePOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Type: Boolean  — Default value: false

Table 64. Capabilities

Inputs

POS Sentence Token

Outputs

Constituent Dependency

Languages

see available models

Table 65. Models
Language Variant Version

ar

factored

20150129.1

ar

sr

20180227.1

de

factored

20150129.1

de

pcfg

20150129.1

de

sr

20141031.1

en

factored

20150129.1

en

pcfg

20150129.1

en

pcfg.caseless

20160110.1

en

rnn

20140104.1

en

sr

20141031.1

en

sr-beam

20141031.1

en

wsj-factored

20150129.1

en

wsj-pcfg

20150129.1

en

wsj-rnn

20140104.1

es

pcfg

20161211.1

es

sr

20161211.1

es

sr-beam

20161211.1

fr

factored

20150129.1

fr

sr

20160114.1

fr

sr-beam

20141023.1

zh

factored

20150129.1

zh

pcfg

20150129.1

zh

sr

20141023.1

zh

xinhua-factored

20150129.1

zh

xinhua-pcfg

20150129.1

MSTParser Dependency Parser

Short name

MstParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mstparser-asl

Implementation

org.dkpro.core.mstparser.MstParser

Description

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found here.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CoNLL 2006 format are not generated (cf. mstparser.DependencyInstance).

Parameters
DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

order

Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here.

Optional — Type: Integer

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 66. Capabilities

Inputs

POS Sentence Token

Outputs

Dependency

Languages

see available models

Table 67. Models
Language Variant Version

en

eisner

20100416.2

en

sample

20121019.2

hr

mte5.defnpout

20130527.1

hr

mte5.pos

20130527.1

MaltParser Dependency Parser

Short name

MaltParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-maltparser-asl

Implementation

org.dkpro.core.maltparser.MaltParser

Description

Dependency parsing using MaltParser.

Required annotations:

  • Token
  • Sentence
  • POS

Generated annotations:

  • Dependency (annotated over sentence-span)

Parameters
ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 68. Capabilities

Inputs

POS Lemma Sentence Token

Outputs

Dependency

Languages

see available models

Table 69. Models
Language Variant Version

bn

linear

20120905.1

en

linear

20120312.1

en

poly

20120312.1

es

linear

20130220.0

fa

linear

20130522.1

fr

linear

20120312.1

pl

linear

20120904.1

sv

linear

20120925.2

Mate Tools Dependency Parser

Short name

MateParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-matetools-gpl

Implementation

org.dkpro.core.matetools.MateParser

Description

DKPro Annotator for the MateToolsParser.

Please cite the following paper, if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameters
DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 70. Capabilities

Inputs

POS Sentence Token

Outputs

Dependency

Languages

see available models

Table 71. Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.2

es

conll2009

20130117.1

fa

parsper

20141124.0

fr

ftb

20130918.0

zh

conll2009

20130117.1

NLP4J Dependency Parser

Short name

Nlp4JDependencyParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-nlp4j-asl

Implementation

org.dkpro.core.nlp4j.Nlp4JDependencyParser

Description

Emory NLP4J dependency parser.

Parameters
DependencyMappingLocation

Location of the mapping file for dependency tags to UIMA types.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 72. Capabilities

Inputs

POS Sentence Token

Outputs

Dependency

Languages

none specified

OpenNLP Parser

Short name

OpenNlpParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpParser

Description

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotations if explicitly requested via the writePOS parameter.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

writePOS

Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Type: Boolean  — Default value: false

Table 73. Capabilities

Inputs

Sentence Token

Outputs

PennTree Constituent

Languages

see available models

Table 74. Models
Language Variant Version

en

chunking

20120616.1

en

chunking-ixa

20140426.1

es

chunking-ixa

20140426.1

UDPipe Parsito Dependency Parser

Short name

UDPipeParser

Category

Parser

Group ID

org.dkpro.core

Artifact ID

dkpro-core-udpipe-asl

Implementation

org.dkpro.core.udpipe.UDPipeParser

Description

Dependency parser using UDPipe. UDPipe uses Parsito, a greedy transition-based parser utilizing an artificial neural network.

Parameters
DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 75. Capabilities

Inputs

MorphologicalFeatures POS Lemma Sentence Token

Outputs

Dependency

Languages

see available models

Table 76. Models
Language Variant Version

en

ud

20160523.1

no

ud

20160523.1

Part-of-speech tagger

Table 77. Analysis Components in category Part-of-speech tagger (18)
Component Description

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ArktweetPosTaggerTrainer

Trainer for ark-tweet POS tagger.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

CoreNlpPosTagger

Part-of-speech tagger from CoreNLP.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

StanfordPosTaggerTrainer

Train a POS tagging model for the Stanford POS tagger.

FlexTagPosTagger

Flexible part-of-speech tagger.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

IxaPosTagger

Part-of-Speech annotator using OpenNLP with IXA extensions.

LingPipePosTagger

LingPipe part-of-speech tagger.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

Nlp4JPosTagger

Part-of-Speech annotator using Emory NLP4J.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

OpenNlpPosTaggerTrainer

Train a POS tagging model for OpenNLP.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

UDPipePosTagger

Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe.

ArkTweet POS-Tagger

Short name

ArktweetPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-arktools-gpl

Implementation

org.dkpro.core.arktools.ArktweetPosTagger

Description

Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true
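
The interplay of POSMappingLocation and mappingEnabled can be pictured with a small lookup. The following is only a sketch under assumptions: the class TagMappingSketch, the "*" fallback key, the target names such as POS_NOUN, and the generic type name POS are illustrative and are not taken from the DKPro Core sources.

```java
import java.util.Map;

/** Hypothetical illustration of tag-to-type mapping; not part of DKPro Core. */
final class TagMappingSketch {
    /** Assumed name of the generic type used when no specific mapping applies. */
    static final String GENERIC_TYPE = "POS";

    private final Map<String, String> mapping;
    private final boolean mappingEnabled;

    TagMappingSketch(Map<String, String> mapping, boolean mappingEnabled) {
        this.mapping = mapping;
        this.mappingEnabled = mappingEnabled;
    }

    /** Returns the type name for a tagger-specific tag. */
    String typeFor(String tag) {
        if (!mappingEnabled) {
            // Mapping disabled: every tag ends up on the generic type.
            return GENERIC_TYPE;
        }
        // Assumed convention: "*" holds the fallback for unknown tags.
        String fallback = mapping.getOrDefault("*", GENERIC_TYPE);
        return mapping.getOrDefault(tag, fallback);
    }
}
```

With a mapping of NN to POS_NOUN, typeFor("NN") yields POS_NOUN while an unmapped tag falls back to the "*" entry; with mapping disabled, every tag yields the generic type.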

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String
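
To make the shape of a modelArtifactUri value concrete, the sketch below splits such a URI into its three Maven coordinates. It is a minimal illustration only: the class ModelArtifactUri and the example coordinates are hypothetical, and the component's own resolving logic may well work differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits a mvn: URI into coordinates; not part of DKPro Core. */
final class ModelArtifactUri {
    // Three colon-separated, non-empty parts after the "mvn:" scheme.
    private static final Pattern MVN = Pattern.compile("mvn:([^:]+):([^:]+):([^:]+)");

    final String groupId;
    final String artifactId;
    final String version;

    private ModelArtifactUri(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    /** Parses a URI of the form mvn:groupId:artifactId:version. */
    static ModelArtifactUri parse(String uri) {
        Matcher m = MVN.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected mvn:groupId:artifactId:version but got: " + uri);
        }
        return new ModelArtifactUri(m.group(1), m.group(2), m.group(3));
    }
}
```

For example, ModelArtifactUri.parse("mvn:org.example:example-tagger-model-en-default:20120919.1") separates the made-up group ID, artifact ID and version, while a classpath: location is rejected.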

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

Table 78. Capabilities

Inputs

Token

Outputs

POS

Languages

see available models

Table 79. Models
Language Variant Version

en default 20120919.1
en irc 20121211.1
en ritter 20130723.1

ArkTweet POS-Tagger Trainer

Short name

ArktweetPosTaggerTrainer

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-arktools-gpl

Implementation

org.dkpro.core.arktools.ArktweetPosTaggerTrainer

Description

Trainer for ark-tweet POS tagger.

Parameters
targetLocation

Location to which the model is written.

Type: String

wordClusterFile

Classpath resource pointing to the word cluster file calculated with the Brown clustering algorithm.

Type: String

Table 80. Capabilities

Inputs

POS Sentence Token

Outputs

none specified

Languages

none specified

ClearNLP POS-Tagger

Short name

ClearNlpPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-clearnlp-asl

Implementation

org.dkpro.core.clearnlp.ClearNlpPosTagger

Description

Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated beforehand.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

dictLocation

Load the dictionary from this location instead of locating the dictionary automatically.

Optional — Type: String

dictVariant

Override the default variant used to locate the dictionary.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the pos-tagging model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the pos-tagging model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 81. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 82. Models
Language Variant Version

en mayo 20131111.0
en ontonotes 20131128.0

CoreNLP POS-Tagger

Short name

CoreNlpPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpPosTagger

Description

Part-of-speech tagger from CoreNLP.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

maxSentenceLength

Maximum sentence length. Longer sentences are skipped.

Type: Integer  — Default value: 2147483647

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Number of parallel threads to use.

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true
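
As a rough picture of what the PTB3 token transforms mentioned for ptb3Escaping look like, the sketch below rewrites bracket tokens to their conventional Penn Treebank escapes. It covers brackets only and is an assumption-laden illustration: the class Ptb3BracketEscaper is hypothetical, and the real transforms applied by CoreNLP cover more cases than shown here.

```java
import java.util.Map;

/** Hypothetical illustration of Penn Treebank bracket escaping; not the CoreNLP implementation. */
final class Ptb3BracketEscaper {
    // Conventional Penn Treebank escapes for round, square and curly brackets.
    private static final Map<String, String> ESCAPES = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    /** Returns the escaped token text, or the text unchanged if escaping is off or not applicable. */
    static String escape(String tokenText, boolean ptb3Escaping) {
        if (!ptb3Escaping) {
            return tokenText;
        }
        return ESCAPES.getOrDefault(tokenText, tokenText);
    }
}
```

Escaping matters because models trained on Penn Treebank data have typically seen -LRB- rather than a literal opening parenthesis.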

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]
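
The effect of quoteBegin and quoteEnd can be sketched as a lookup that rewrites extra quote characters to the Penn Treebank quote tokens. This is a simplified illustration: the class ExtraQuoteEscaper is hypothetical, and the assumption that the escaped forms are the two-backtick and two-apostrophe tokens follows general Penn Treebank convention rather than anything stated in this reference.

```java
import java.util.Set;

/** Hypothetical illustration of treating extra tokens as quotes; not part of DKPro Core. */
final class ExtraQuoteEscaper {
    /** Rewrites configured opening/closing quote tokens to PTB-style quote tokens. */
    static String escape(String tokenText, Set<String> quoteBegin, Set<String> quoteEnd) {
        if (quoteBegin.contains(tokenText)) {
            return "``"; // PTB-style opening quote (assumed target form)
        }
        if (quoteEnd.contains(tokenText)) {
            return "''"; // PTB-style closing quote (assumed target form)
        }
        return tokenText;
    }
}
```

A typical use would be to list the German low opening quote (U+201E) in quoteBegin and the matching closing quote (U+201C) in quoteEnd.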

Table 83. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

none specified

CoreNLP POS-Tagger (old API)

Short name

StanfordPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordPosTagger

Description

Stanford Part-of-Speech tagger component.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

maxSentenceLength

Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged.

Optional — Type: Integer
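
The rule described for maxSentenceLength can be written down as a single predicate. The sketch below is an illustration only: the class MaxSentenceLengthSketch is hypothetical, and treating negative values like zero (no limit) is an assumption, since the description only covers zero and positive values.

```java
import java.util.List;

/** Hypothetical illustration of the maxSentenceLength rule; not part of DKPro Core. */
final class MaxSentenceLengthSketch {
    /**
     * Returns true if a sentence with the given tokens should be tagged.
     * A limit of zero (or, by assumption, a negative value) means "no limit".
     */
    static boolean shouldTag(List<String> tokens, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            return true;
        }
        return tokens.size() <= maxSentenceLength;
    }
}
```

With a limit of 3, a four-token sentence is skipped while a three-token sentence is still tagged.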

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 84. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 85. Models
Language Variant Version

ar default 20180103.1
de fast 20140827.1
de fast-caseless 20140827.0
de hgc 20140827.1
de ud 20161213.1
en bidirectional-distsim 20181002.1
en caseless-left3words-distsim 20181002.0
en fast.41 20130730.1
en left3words-distsim 20181002.1
en twitter 20130730.1
en twitter-fast 20130914.0
en wsj-0-18-bidirectional-distsim 20160110.1
en wsj-0-18-bidirectional-nodistsim 20131112.1
en wsj-0-18-caseless-left3words-distsim 20140827.0
en wsj-0-18-left3words-distsim 20140616.1
en wsj-0-18-left3words-nodistsim 20131112.1
es default 20161211.1
es distsim 20161211.1
fr default 20140616.1
zh distsim 20140616.1

CoreNLP POS-Tagger Trainer

Short name

StanfordPosTaggerTrainer

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordPosTaggerTrainer

Description

Train a POS tagging model for the Stanford POS tagger.

Parameters
clusterFile

Distsim cluster files.

Optional — Type: String

targetLocation

Location to which the output is written.

Type: String

trainFile

Training file containing the parameters. The trainFile, model and encoding parameters in this file are ignored/overwritten. In the arch parameter, the string ${distsimCluster} is replaced with the path to the cluster files if the clusterFile parameter is specified.

Optional — Type: String
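
The placeholder substitution described for trainFile can be sketched with plain java.util.Properties. This is an illustration under assumptions: the class DistsimPlaceholder is hypothetical, the arch value used in the usage note is made up for the example, and the trainer's actual handling of the training file may differ.

```java
import java.util.Properties;

/** Hypothetical illustration of the ${distsimCluster} substitution; not part of DKPro Core. */
final class DistsimPlaceholder {
    private static final String PLACEHOLDER = "${distsimCluster}";

    /**
     * Returns a copy of the training properties in which the placeholder inside
     * the "arch" entry is replaced by the cluster file path, if one is given.
     */
    static Properties resolve(Properties trainProps, String clusterFile) {
        Properties resolved = new Properties();
        resolved.putAll(trainProps);
        String arch = resolved.getProperty("arch");
        if (arch != null && clusterFile != null) {
            // Literal replacement; no regular expression is involved.
            resolved.setProperty("arch", arch.replace(PLACEHOLDER, clusterFile));
        }
        return resolved;
    }
}
```

Given an arch entry such as left3words,distsim(${distsimCluster},-1,1) and a cluster file path, the resolved properties contain the path in place of the placeholder; without a cluster file the entry is left untouched.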

Table 86. Capabilities

Inputs

POS Sentence Token

Outputs

none specified

Languages

none specified

FlexTag POS-Tagger

Short name

FlexTagPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-flextag-asl

Implementation

org.dkpro.core.flextag.FlexTagPosTagger

Description

Flexible part-of-speech tagger.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

Table 87. Models
Language Variant Version

de tiger 20170512.1
en wsj0-18 20170512.1

GATE Hepple POS-Tagger

Short name

HepplePosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-gate-asl

Implementation

org.dkpro.core.gate.HepplePosTagger

Description

GATE Hepple part-of-speech tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lexiconLocation

Load the lexicon from this location instead of locating it automatically.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

rulesetLocation

Load the ruleset from this location instead of locating it automatically.

Optional — Type: String

Table 88. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 89. Models
Language Variant Version

en annie 20160531.0

HunPos POS-Tagger

Short name

HunPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-hunpos-asl

Implementation

org.dkpro.core.hunpos.HunPosTagger

Description

Part-of-Speech annotator using HunPos. Requires Sentences to be annotated beforehand.

References

  • HALÁCSY, Péter; KORNAI, András; ORAVECZ, Csaba. HunPos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007, pp. 209-212.
Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 90. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 91. Models
Language Variant Version

cs pdt 20121123.2
da ddt 20121123.2
de tiger 20121123.2
en wsj 20070724.2
fa upc 20140414.0
hr mte5.defnpout 20130509.2
hu szeged_kr 20070724.2
pt bosque 20121123.2
pt mm 20130119.2
pt tbchp 20110419.2
ru rdt 20121123.2
sl jos 20121123.2
sv paroletags 20100215.2
sv suctags 20100927.2

IXA POS-Tagger

Short name

IxaPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-ixa-asl

Implementation

org.dkpro.core.ixa.IxaPosTagger

Description

Part-of-Speech annotator using OpenNLP with IXA extensions.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 92. Models
Language Variant Version

de perceptron-autodict01-conll09 20160213.1
en maxent-100-c5-baseline-autodict01-conll09 20160211.1
en perceptron-autodict01-conll09 20160211.1
en perceptron-autodict01-ud 20160214.1
en xpos-perceptron-autodict01-ud 20160214.1
es perceptron-autodict01-ancora-2.0 20160212.1
eu perceptron-ud 20160212.1
fr perceptron-autodict01-sequoia 20160215.1
gl perceptron-autdict05-ctag 20160212.1
it perceptron-autodict01-ud 20160213.1
nl maxent-100-c5-autodict01-alpino 20160214.1
nl perceptron-autodict01-alpino 20160214.1

LingPipe POS-Tagger

Short name

LingPipePosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-lingpipe-gpl

Implementation

org.dkpro.core.lingpipe.LingPipePosTagger

Description

LingPipe part-of-speech tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

uppercaseTags

Convert tags to uppercase before mapping. LingPipe models tend to be trained on lower-case tags, but our POS mappings use uppercase.

Type: Boolean  — Default value: true

Table 93. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 94. Models
Language Variant Version

en bio-genia 20110623.1
en bio-medpost 20110623.1
en general-brown 20110623.1

Mate Tools POS-Tagger

Short name

MatePosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-matetools-gpl

Implementation

org.dkpro.core.matetools.MatePosTagger

Description

DKPro annotator for the Mate Tools POS tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 95. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 96. Models
Language Variant Version

de tiger 20121024.1
en conll2009 20130117.1
es conll2009 20130117.1
fr ftb 20130918.0
zh conll2009 20130117.1

MeCab POS-Tagger

Short name

MeCabTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mecab-asl

Implementation

org.dkpro.core.mecab.MeCabTagger

Description

Annotator for the MeCab Japanese POS Tagger.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false
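
The difference between strict and non-strict zoning can be sketched as a function from zone offsets to the spans that get segmented. This is a simplified illustration: the class ZoningSketch is hypothetical, zones are plain begin/end offset pairs, and including the document start and end among the boundaries in non-strict mode is an assumption not stated in the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of strict vs. non-strict zoning; not part of DKPro Core. */
final class ZoningSketch {
    /**
     * Returns the [begin, end) spans to segment.
     * zones holds int[]{begin, end} pairs that are assumed not to overlap in strict mode.
     */
    static List<int[]> segments(List<int[]> zones, int docLength, boolean strictZoning) {
        List<int[]> result = new ArrayList<>();
        if (strictZoning) {
            if (zones.isEmpty()) {
                // No zone type given: the whole document is one zone.
                result.add(new int[] {0, docLength});
            } else {
                for (int[] zone : zones) {
                    result.add(new int[] {zone[0], zone[1]});
                }
            }
            return result;
        }
        // Non-strict: collect all boundaries (sorted, de-duplicated) and
        // segment between each pair of neighbouring boundaries.
        TreeSet<Integer> boundaries = new TreeSet<>();
        boundaries.add(0);          // assumption: document start is a boundary
        boundaries.add(docLength);  // assumption: document end is a boundary
        for (int[] zone : zones) {
            boundaries.add(zone[0]);
            boundaries.add(zone[1]);
        }
        Integer previous = null;
        for (Integer boundary : boundaries) {
            if (previous != null) {
                result.add(new int[] {previous, boundary});
            }
            previous = boundary;
        }
        return result;
    }
}
```

For zones at offsets 5-10 and 20-30 in a 40-character document, strict zoning yields exactly those two spans, whereas non-strict zoning yields five spans that also cover the text between and around the zones.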

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 97. Capabilities

Inputs

none specified

Outputs

POS Lemma Sentence JapaneseToken

Languages

ja

Table 98. Models
Language Variant Version

jp bin-linux-x86_32 20140917.0
jp bin-linux-x86_64 20140917.0
jp bin-osx-x86_64 20140917.0
jp ipadic 20070801.0

NLP4J POS-Tagger

Short name

Nlp4JPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-nlp4j-asl

Implementation

org.dkpro.core.nlp4j.Nlp4JPosTagger

Description

Part-of-Speech annotator using Emory NLP4J. Requires Sentences to be annotated beforehand.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 99. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 100. Models
Language Variant Version

en default 20160802.0

OpenNLP POS-Tagger

Short name

OpenNlpPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpPosTagger

Description

Part-of-Speech annotator using OpenNLP.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String
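
How the language and modelVariant parameters select one of the models listed for this component can be pictured as a keyed lookup. The sketch below is illustrative only: the class ModelLookupSketch and the "language/variant/version" result format are hypothetical, the three entries are copied from the OpenNLP POS-Tagger models table, and the component's real resolving mechanism is not shown in this reference.

```java
import java.util.Map;

/** Hypothetical illustration of language/variant based model selection; not part of DKPro Core. */
final class ModelLookupSketch {
    // A few language/variant pairs with their versions, taken from the models table.
    private static final Map<String, String> VERSIONS = Map.of(
            "en/maxent", "20120616.1",
            "en/perceptron", "20120616.1",
            "es/maxent-universal", "20120410.1");

    /**
     * Picks a model. A non-null languageOverride plays the role of the language
     * parameter and wins over the document language.
     */
    static String resolve(String documentLanguage, String languageOverride, String variant) {
        String language = languageOverride != null ? languageOverride : documentLanguage;
        String key = language + "/" + variant;
        String version = VERSIONS.get(key);
        if (version == null) {
            throw new IllegalStateException("No model known for " + key);
        }
        return key + "/" + version;
    }
}
```

For a German document with the language parameter set to en and the variant maxent, the lookup selects the English maxent model rather than a German one.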

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 101. Capabilities

Inputs

Sentence Token

Outputs

POS

Languages

see available models

Table 102. Models
Language Variant Version

da maxent 20120616.1
da perceptron 20120616.1
de maxent 20120616.1
de perceptron 20120616.1
en maxent 20120616.1
en perceptron 20120616.1
es maxent 20120410.1
es maxent-universal 20120410.1
es perceptron 20120410.1
es perceptron-universal 20120410.1
it perceptron 20130618.0
nl maxent 20120616.1
nl perceptron 20120616.1
pt maxent 20120616.1
pt mm-maxent 20130121.1
pt mm-perceptron 20130121.1
pt perceptron 20120616.1
sv maxent 20120616.1
sv perceptron 20120616.1

OpenNLP POS-Tagger Trainer

Short name

OpenNlpPosTaggerTrainer

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpPosTaggerTrainer

Description

Train a POS tagging model for OpenNLP.

Parameters
algorithm

Training algorithm.

Type: String  — Default value: MAXENT

beamSize

Type: Integer  — Default value: 3

cutoff

Frequency cut-off.

Type: Integer  — Default value: 5

iterations

Number of training iterations.

Type: Integer  — Default value: 100

language

Store this language to the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

targetLocation

Location to which the output is written.

Type: String

trainerType

Trainer type.

Type: String  — Default value: Event

Table 103. Capabilities

Inputs

POS Sentence Token

Outputs

none specified

Languages

none specified

TreeTagger POS-Tagger

Short name

TreeTaggerPosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-treetagger-asl

Implementation

org.dkpro.core.treetagger.TreeTaggerPosTagger

Description

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

executablePath

Use this TreeTagger executable instead of trying to locate the executable automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

performanceMode

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Type: Boolean  — Default value: false
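
The kind of sanity check that performanceMode switches off can be sketched as follows. This is an illustration only: the class TokenSanityCheck is hypothetical and models just the line-break check named in the description, not the full set of checks TT4J may perform.

```java
import java.util.List;

/** Hypothetical illustration of the token sanity check; not the TT4J implementation. */
final class TokenSanityCheck {
    /**
     * Returns true if the tokens may be sent to the tagger.
     * In performance mode the check is skipped and everything is accepted.
     */
    static boolean passes(List<String> tokens, boolean performanceMode) {
        if (performanceMode) {
            return true; // checks disabled: illegal data may fail later instead
        }
        for (String token : tokens) {
            if (token.indexOf('\n') >= 0 || token.indexOf('\r') >= 0) {
                return false; // tokens must not contain line breaks
            }
        }
        return true;
    }
}
```

The trade-off is that the per-token scan costs time on every document, while skipping it moves the failure to a later and less informative point.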

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

writeLemma

Write lemma information.

Type: Boolean  — Default value: true

writePOS

Write part-of-speech information.

Type: Boolean  — Default value: true

Table 104. Capabilities

Inputs

Token

Outputs

POS Lemma

Languages

see available models

Table 105. Models
Language Variant Version

bg le 20160430.1
de le 20190409.1
en le 20190304.1
es le 20161222.1
et le 20110124.1
fi le 20140704.1
fr le 20190404.1
gl le 20190413.1
gmh le 20161107.1
it le 20141020.1
la le 20110819.1
mn le 20120925.1
nl le 20130107.1
pl le 20150506.1
pt le 20101115.2
ru le 20140505.1
sk le 20130725.1
sw le 20130729.1
zh le 20101115.1

UDPipe MorphoDiTa Morphological Analyzer

Short name

UDPipePosTagger

Category

Part-of-speech tagger

Group ID

org.dkpro.core

Artifact ID

dkpro-core-udpipe-asl

Implementation

org.dkpro.core.udpipe.UDPipePosTagger

Description

Part-of-Speech, lemmatizer, and morphological analyzer using UDPipe. UDPipe uses MorphoDiTa for this task, a Morphological Dictionary and Tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mappingEnabled

Enable/disable type mapping.

Type: Boolean  — Default value: true

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you must also specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 106. Capabilities

Inputs

Sentence Token

Outputs

MorphologicalFeatures POS Lemma

Languages

see available models

Table 107. Models
Language Variant Version

en ud 20160523.1
no ud 20160523.1

Phonetic Transcriptor

Table 108. Analysis Components in category Phonetic Transcriptor (4)
Component Description

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

Commons Codec Cologne Phonetic Transcriptor

Short name

ColognePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

org.dkpro.core

Artifact ID

dkpro-core-commonscodec-asl

Implementation

org.dkpro.core.commonscodec.ColognePhoneticTranscriptor

Description

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Table 109. Capabilities

Inputs

Token

Outputs

PhoneticTranscription

Languages

de

Commons Codec Double-Metaphone Phonetic Transcriptor

Short name

DoubleMetaphonePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

org.dkpro.core

Artifact ID

dkpro-core-commonscodec-asl

Implementation

org.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor

Description

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 110. Capabilities

Inputs

Token

Outputs

PhoneticTranscription

Languages

none specified

Commons Codec Metaphone Phonetic Transcriptor

Short name

MetaphonePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

org.dkpro.core

Artifact ID

dkpro-core-commonscodec-asl

Implementation

org.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor

Description

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 111. Capabilities

Inputs

Token

Outputs

PhoneticTranscription

Languages

none specified

Commons Codec Soundex Phonetic Transcriptor

Short name

SoundexPhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

org.dkpro.core

Artifact ID

dkpro-core-commonscodec-asl

Implementation

org.dkpro.core.commonscodec.SoundexPhoneticTranscriptor

Description

Soundex phonetic transcription based on Apache Commons Codec. Works for English.

Table 112. Capabilities

Inputs

Token

Outputs

PhoneticTranscription

Languages

en
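
To give an impression of what a phonetic transcription looks like, the sketch below implements the classic American Soundex rules using only the JDK. It is an illustration under that assumption and not the Apache Commons Codec implementation that the component wraps; the class and method names are made up for this example.

```java
import java.util.Locale;

final class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks letters that are not encoded.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (int i = 0; i < upper.length() && out.length() < 4; i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // this sketch ignores everything outside A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (c != 'H' && c != 'W') {
                // H and W are skipped without resetting the previous code,
                // so equal codes around them collapse into one digit.
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // prints R163
        System.out.println(soundex("Rupert")); // prints R163
    }
}
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is the property the PhoneticTranscription output is meant to capture.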

Segmenter

Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a single segmenter component to avoid the need to reorder the components in a pipeline when replacing one segmenter with another.

Table 113. Analysis Components in category Segmenter (26)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

ArktweetTokenizer

ArkTweet tokenizer.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

ClearNlpSegmenter

Tokenizer using Clear NLP.

CoreNlpSegmenter

Tokenizer and sentence splitter using Stanford CoreNLP.

StanfordSegmenter

Stanford sentence splitter and tokenizer.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

GosenSegmenter

Segmenter for Japanese text based on GoSen.

IcuSegmenter

ICU segmenter.

JTokSegmenter

JTok segmenter.

BreakIteratorSegmenter

BreakIterator segmenter.

JiebaSegmenter

Segmenter for Chinese using Jieba.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

LingPipeSegmenter

LingPipe segmenter.

Nlp4JSegmenter

Segmenter using Emory NLP4J.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

OpenNlpSentenceTrainer

Train a sentence splitter model for OpenNLP.

OpenNlpTokenTrainer

Train a tokenizer model for OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

RegexSegmenter

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

TokenMerger

Merges any Tokens that are covered by a given annotation type.

UDPipeSegmenter

Tokenizer and sentence splitter using UDPipe.

WhitespaceSegmenter

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

TokenTrimmer

Remove prefixes and suffixes from tokens.

Annotation-By-Length Filter

Short name

AnnotationByLengthFilter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.AnnotationByLengthFilter

Description

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameters
FilterTypes

A set of annotation types that should be filtered.

Type: String[]  — Default value: []

MaxLengthFilter

Any annotation in filterTypes longer than this value will be removed.

Type: Integer  — Default value: 1000

MinLengthFilter

Any annotation in filterTypes shorter than this value will be removed.

Type: Integer  — Default value: 0
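
The effect of the two length parameters can be sketched on plain strings. The example below is a simplified stand-in that works on covered texts rather than on UIMA annotations; the class and method names are hypothetical, and treating both bounds as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LengthFilterSketch {
    // Keeps only the texts whose length lies within [minLength, maxLength].
    static List<String> filterByLength(List<String> coveredTexts, int minLength, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String text : coveredTexts) {
            if (text.length() >= minLength && text.length() <= maxLength) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "token", "supercalifragilistic");
        // With a minimum of 2 and a maximum of 10 only "token" survives.
        System.out.println(filterByLength(tokens, 2, 10));
    }
}
```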

ArkTweet Tokenizer

Short name

ArktweetTokenizer

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-arktools-gpl

Implementation

org.dkpro.core.arktools.ArktweetTokenizer

Description

ArkTweet tokenizer.

CamelCase Token Segmenter

Short name

CamelCaseTokenSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.CamelCaseTokenSegmenter

Description

Split up existing tokens again if they are camel-case text.

Parameters
deleteCover

Whether to remove the original token.

Type: Boolean  — Default value: true

markupType

Optional annotation type to markup the original covered token area when specified. This type must be a subtype of Annotation.

Optional — Type: String

Table 114. Capabilities

Inputs

Token

Outputs

Token

Languages

none specified
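
A camel-case split of the kind described above can be approximated with a zero-width regular expression. The rule used here (break between a lower-case letter and a following upper-case letter) is an assumption chosen for illustration; the component's actual splitting rule is not documented in this section, and the names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class CamelCaseSplitSketch {
    // Zero-width position between a lower-case letter and a following upper-case letter.
    private static final String BOUNDARY = "(?<=\\p{Ll})(?=\\p{Lu})";

    static List<String> split(String token) {
        return Arrays.asList(token.split(BOUNDARY));
    }

    public static void main(String[] args) {
        System.out.println(split("getFileName")); // prints [get, File, Name]
    }
}
```

With deleteCover set to true, only the parts would remain as Token annotations; otherwise the original token would be kept alongside them.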

ClearNLP Segmenter

Short name

ClearNlpSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-clearnlp-asl

Implementation

org.dkpro.core.clearnlp.ClearNlpSegmenter

Description

Tokenizer using Clear NLP.

Parameters
language

The language.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
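
The difference between strict and non-strict zoning, as described for the strictZoning and zoneTypes parameters above, can be sketched with character offsets. This is a minimal illustration with hypothetical names, not the component's code. It assumes that the document start and end always count as boundaries in non-strict mode and that an empty zone list stands for "no zone type specified".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class ZoningSketch {
    // Strict zoning: every zone [begin, end) is segmented on its own.
    // Without zones the whole document is treated as a single zone.
    static List<int[]> strictRanges(int documentLength, List<int[]> zones) {
        if (zones.isEmpty()) {
            List<int[]> whole = new ArrayList<>();
            whole.add(new int[] { 0, documentLength });
            return whole;
        }
        return new ArrayList<>(zones);
    }

    // Non-strict zoning: all zone starts and ends become boundaries and
    // segmentation happens between consecutive boundaries.
    static List<int[]> looseRanges(int documentLength, List<int[]> zones) {
        TreeSet<Integer> boundaries = new TreeSet<>();
        boundaries.add(0);
        boundaries.add(documentLength);
        for (int[] zone : zones) {
            boundaries.add(zone[0]);
            boundaries.add(zone[1]);
        }
        List<int[]> ranges = new ArrayList<>();
        Integer previous = null;
        for (Integer boundary : boundaries) {
            if (previous != null) {
                ranges.add(new int[] { previous, boundary });
            }
            previous = boundary;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Two overlapping zones of different types in a 30-character document.
        List<int[]> zones = Arrays.asList(new int[] { 5, 10 }, new int[] { 8, 20 });
        for (int[] range : looseRanges(30, zones)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

In the non-strict case, overlapping zones are harmless because only the sorted boundary offsets matter, which is why several zone types can be combined.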

Table 115. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

en

Table 116. Models
Language Variant Version

en

default

20131111.0

CoreNLP Segmenter

Short name

CoreNlpSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-corenlp-gpl

Implementation

org.dkpro.core.corenlp.CoreNlpSegmenter

Description

Tokenizer and sentence splitter using Stanford CoreNLP.

Parameters
boundaryMultiTokenRegex

A TokensRegex multi-token pattern for finding boundaries.

Optional — Type: String

boundaryToDiscard

The set of regular expressions for sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: [, NL]

boundaryTokenRegex

The set of boundary tokens.

Optional — Type: String  — Default value: [.\u3002]|[!?\uFF01\uFF1F]+

htmlElementsToDiscard

These are elements like "p" or "sent", which will be wrapped into regular expressions for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

Optional — Type: String[]

language

The language.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

newlineIsSentenceBreak

Strategy for treating newlines as sentence breaks.

Optional — Type: String  — Default value: two

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenRegexesToDiscard

The set of regular expressions for sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: []

tokenizationOption

Additional options that should be passed to the tokenizers.

Optional — Type: String

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
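
The default value listed for boundaryTokenRegex above can be tried out directly with java.util.regex. The sketch assumes that the expression has to match a whole token for that token to count as a sentence boundary; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BoundaryTokenSketch {
    // Default value listed for boundaryTokenRegex: a single full stop (ASCII or
    // ideographic) or a run of question and exclamation marks (ASCII or full-width).
    private static final Pattern BOUNDARY =
            Pattern.compile("[.\\u3002]|[!?\\uFF01\\uFF1F]+");

    // Assumption: the whole token must match the expression.
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBoundary("."));   // prints true
        System.out.println(isBoundary("?!"));  // prints true
        System.out.println(isBoundary("etc")); // prints false
    }
}
```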

Table 117. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

none specified

CoreNLP Segmenter (old API)

Short name

StanfordSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordSegmenter

Description

Stanford sentence splitter and tokenizer.

Parameters
allowEmptySentences

Whether to generate empty sentences.

Type: Boolean  — Default value: false

boundaryFollowersRegex

This is a Set of String that are matched with .equals() which are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".

Optional — Type: String  — Default value: [\\p{Pe}\\p{Pf}\"'>\uFF02\uFF07\uFF1E]|''|-R[CRS]B-

boundaryToDiscard

The set of regex for sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: [, NL]

boundaryTokenRegex

The set of boundary tokens. If null, use default.

Optional — Type: String  — Default value: [.\u3002]|[!?\uFF01\uFF1F]+

isOneSentence

Whether to treat all input as one sentence.

Type: Boolean  — Default value: false

language

The language.

Optional — Type: String

languageFallback

If this component is not configured for a specific language and if the language stored in the document metadata is not supported, use the given language as a fallback.

Optional — Type: String

newlineIsSentenceBreak

Strategy for treating newlines as sentence breaks.

Optional — Type: String  — Default value: TWO_CONSECUTIVE

regionElementRegex

A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenRegexesToDiscard

The set of regex for sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: []

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

xmlBreakElementsToDiscard

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

Optional — Type: String[]

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 118. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

en, es, fr

German Separated Particle Annotator

Short name

GermanSeparatedParticleAnnotator

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.GermanSeparatedParticleAnnotator

Description

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. In many usages of German particle verbs, the stem and the particle are separated, e.g. Wir fangen gleich an. The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an", so the proper verb lemma "anfangen" is not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of particle verbs (e.g. fangen) with the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.

Table 119. Capabilities

Inputs

POS Lemma Sentence Token

Outputs

Lemma

Languages

de
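
The lemma correction described above can be sketched on parallel arrays of POS tags and lemmas for one sentence. The sketch assumes that the separated particle carries the STTS tag PTKVZ and that it attaches to the closest preceding finite full verb (STTS tag VVFIN); the annotator's actual search strategy may differ, and the names below are hypothetical.

```java
import java.util.Arrays;

final class SeparatedParticleSketch {
    // Returns a copy of the lemmas in which each finite verb that has a
    // separated particle later in the sentence gets the combined lemma.
    static String[] mergeParticles(String[] posTags, String[] lemmas) {
        String[] result = lemmas.clone();
        for (int i = 0; i < posTags.length; i++) {
            if (!"PTKVZ".equals(posTags[i])) {
                continue; // not a separated verb particle
            }
            // Walk backwards to the closest finite full verb.
            for (int j = i - 1; j >= 0; j--) {
                if ("VVFIN".equals(posTags[j])) {
                    result[j] = lemmas[i] + lemmas[j];
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Wir fangen gleich an"
        String[] pos = { "PPER", "VVFIN", "ADV", "PTKVZ" };
        String[] lemmas = { "wir", "fangen", "gleich", "an" };
        // prints [wir, anfangen, gleich, an]
        System.out.println(Arrays.toString(mergeParticles(pos, lemmas)));
    }
}
```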

Gosen Segmenter

Short name

GosenSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-gosen-asl

Implementation

org.dkpro.core.gosen.GosenSegmenter

Description

Segmenter for Japanese text based on GoSen.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 120. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

ja

ICU Segmenter

Short name

IcuSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-icu-asl

Implementation

org.dkpro.core.icu.IcuSegmenter

Description

ICU segmenter.

Parameters
language

The language.

Optional — Type: String

splitAtApostrophe

By default, the segmenter does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered.

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 121. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

af, ak, am, ar, as, az, be, bg, bm, bn, bo, br, bs, ca, ce, cs, cy, da, de, dz, ee, el, en, eo, es, et, eu, fa, ff, fi, fo, fr, fy, ga, gd, gl, gu, gv, ha, hi, hr, hu, hy, ig, ii, is, it, ja, ka, ki, kk, kl, km, kn, ko, ks, kw, ky, lb, lg, ln, lo, lt, lu, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, nd, ne, nl, nn, om, or, os, pa, pl, ps, pt, qu, rm, rn, ro, ru, rw, se, sg, si, sk, sl, sn, so, sq, sr, sv, sw, ta, te, tg, th, ti, to, tr, tt, ug, uk, ur, uz, vi, wo, yo, zh, zu

JTok Segmenter

Short name

JTokSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-jtok-asl

Implementation

org.dkpro.core.jtok.JTokSegmenter

Description

JTok segmenter.

Parameters
language

The language.

Optional — Type: String

ptbEscaping

Use PTB-escaping when setting the token form.

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeParagraph

Create Paragraph annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 122. Capabilities

Inputs

none specified

Outputs

Paragraph Sentence Token

Languages

de, en, it

Java BreakIterator Segmenter

Short name

BreakIteratorSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.BreakIteratorSegmenter

Description

BreakIterator segmenter.

Parameters
language

The language.

Optional — Type: String

splitAtApostrophe

By default, the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered.

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 123. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

ar, be, bg, ca, cs, da, de, el, en, es, et, fi, fr, ga, hi, hr, hu, is, it, ja, ko, lt, lv, mk, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh
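
The underlying JDK facility can be tried out without UIMA. The sketch below uses java.text.BreakIterator to obtain sentences and tokens from a string; dropping whitespace-only segments is a simplification made for this example, and the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {
    // Collects the non-blank segments between consecutive break positions.
    private static List<String> segments(BreakIterator iterator, String text) {
        List<String> result = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String segment = text.substring(start, end).trim();
            if (!segment.isEmpty()) {
                result.add(segment);
            }
        }
        return result;
    }

    static List<String> sentences(String text, Locale locale) {
        return segments(BreakIterator.getSentenceInstance(locale), text);
    }

    static List<String> tokens(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        System.out.println(sentences(text, Locale.ENGLISH));
        System.out.println(tokens("Hello world.", Locale.ENGLISH));
    }
}
```

The locale passed to the factory methods plays the role of the language parameter: it selects the break rules that the JDK applies.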

Jieba Segmenter

Short name

JiebaSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-jieba-asl

Implementation

org.dkpro.core.jieba.JiebaSegmenter

Description

Segmenter for Chinese using Jieba.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 124. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

zh

LanguageTool Segmenter

Short name

LanguageToolSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-languagetool-asl

Implementation

org.dkpro.core.languagetool.LanguageToolSegmenter

Description

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 125. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Line-based Sentence Segmenter

Short name

LineBasedSentenceSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.LineBasedSentenceSegmenter

Description

Annotates each line in the source text as a sentence. This segmenter is not capable of creating tokens! All respective parameters have no functionality.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 126. Capabilities

Inputs

none specified

Outputs

Sentence

Languages

none specified
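
Line-based sentence segmentation amounts to computing one begin/end offset pair per line. The sketch below shows one way to do that on a plain string. Skipping empty lines and excluding the line terminator from the span are assumptions made for this example, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LineSentenceSketch {
    // Returns [begin, end) offsets for every non-empty line of the text.
    static List<int[]> lineSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == '\n') {
                int end = i;
                if (end > begin && text.charAt(end - 1) == '\r') {
                    end--; // drop the CR of a Windows line break
                }
                if (end > begin) {
                    spans.add(new int[] { begin, end });
                }
                begin = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "First line\nSecond line\r\n\nThird";
        for (int[] span : lineSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```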

LingPipe Segmenter

Short name

LingPipeSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-lingpipe-gpl

Implementation

org.dkpro.core.lingpipe.LingPipeSegmenter

Description

LingPipe segmenter.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 127. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

none specified

NLP4J Segmenter

Short name

Nlp4JSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-nlp4j-asl

Implementation

org.dkpro.core.nlp4j.Nlp4JSegmenter

Description

Segmenter using Emory NLP4J.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 128. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

none specified

OpenNLP Segmenter

Short name

OpenNlpSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpSegmenter

Description

Tokenizer and sentence splitter using OpenNLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

segmentationModelLocation

Load the segmentation model from this location instead of locating the model automatically.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenizationModelLocation

Load the tokenization model from this location instead of locating the model automatically.

Optional — Type: String

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 129. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

see available models

Table 130. Models
Language Variant Version

da

maxent

20120616.1

da

maxent

20120616.1

de

maxent

20120616.1

de

maxent

20120616.1

en

maxent

20120616.1

en

maxent

20120616.1

it

maxent

20130618.0

it

maxent

20130618.0

nb

maxent

20120131.1

nb

maxent

20120131.1

nl

maxent

20120616.1

nl

maxent

20120616.1

pt

maxent

20120616.1

pt

maxent

20120616.1

sv

maxent

20120616.1

sv

maxent

20120616.1

OpenNLP Sentence Splitter Trainer

Short name

OpenNlpSentenceTrainer

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpSentenceTrainer

Description

Train a sentence splitter model for OpenNLP.

Parameters
abbreviationDictionaryEncoding

Encoding of the abbreviation dictionary.

Type: String  — Default value: UTF-8

abbreviationDictionaryLocation

Location of the abbreviation dictionary.

Optional — Type: String

algorithm

Training algorithm.

Type: String  — Default value: MAXENT

cutoff

Frequency cut-off.

Type: Integer  — Default value: 5

eosCharacters

End-of-sentence characters.

Optional — Type: String[]

iterations

Number of training iterations.

Type: Integer  — Default value: 100

language

Store this language to the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

targetLocation

Location to which the output is written.

Type: String

trainerType

Trainer type.

Type: String  — Default value: Event

Table 131. Capabilities

Inputs

Sentence

Outputs

none specified

Languages

none specified

OpenNLP Tokenizer Trainer

Short name

OpenNlpTokenTrainer

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpTokenTrainer

Description

Train a tokenizer model for OpenNLP.

Parameters
abbreviationDictionaryEncoding

Encoding of the abbreviation dictionary.

Type: String  — Default value: UTF-8

abbreviationDictionaryLocation

Location of the abbreviation dictionary.

Optional — Type: String

algorithm

Training algorithm.

Type: String  — Default value: MAXENT

alphaNumericPattern

Regular expression to detect alpha numerics.

Optional — Type: String  — Default value: ^[A-Za-z0-9]+$

cutoff

Frequency cut-off.

Type: Integer  — Default value: 5

iterations

Number of training iterations.

Type: Integer  — Default value: 100

language

Store this language in the model instead of the document language.

Type: String

numThreads

Number of parallel threads.

Type: Integer  — Default value: 1

targetLocation

Location to which the output is written.

Type: String

trainerType

Trainer type.

Type: String  — Default value: Event

useAlphaNumericOptimization

If true, alphanumerics are skipped.

Type: Boolean  — Default value: true

Table 132. Capabilities

Inputs

Token

Outputs

none specified

Languages

none specified

Paragraph Splitter

Short name

ParagraphSplitter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.ParagraphSplitter

Description

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

Parameters
splitPattern

A regular expression used to detect paragraph splits.

Type: String  — Default value: ((\r\n\r\n)+(\r\n))|((\n\n)+(\n))

Table 133. Capabilities

Inputs

none specified

Outputs

Paragraph

Languages

none specified
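The splitting idea described for the Paragraph Splitter can be illustrated with plain java.util.regex. This is a minimal sketch, not the component's implementation: the class name ParagraphSplitSketch and the pattern (\r?\n){2,} are assumptions chosen to match "two or more line-breaks (Unix and Windows)", and the pattern is not the component's default value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphSplitSketch {
    // Hypothetical pattern: two or more consecutive line breaks (Unix or Windows).
    static final Pattern SPLIT = Pattern.compile("(\\r?\\n){2,}");

    /** Returns the [begin, end) offsets of the text stretches between split matches. */
    static List<int[]> paragraphs(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = SPLIT.matcher(text);
        int begin = 0;
        while (m.find()) {
            if (m.start() > begin) {
                spans.add(new int[] { begin, m.start() });
            }
            begin = m.end();
        }
        // Text after the last split match forms the final paragraph.
        if (begin < text.length()) {
            spans.add(new int[] { begin, text.length() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\nStill first.\n\nSecond paragraph.\r\n\r\nThird.";
        for (int[] s : paragraphs(text)) {
            System.out.println(s[0] + "-" + s[1] + ": "
                    + text.substring(s[0], s[1]).replace("\n", "\\n"));
        }
    }
}
```

A single line break stays inside a paragraph; only a run of two or more breaks separates paragraphs in this sketch.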

Pattern-based Token Segmenter

Short name

PatternBasedTokenSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.PatternBasedTokenSegmenter

Description

Splits existing tokens again at particular split characters. The prefix states whether the split characters should be added as separate Tokens. If #INCLUDE_PREFIX precedes the split pattern, the text matched by the pattern is included as a Token. Conversely, text matched by patterns following #EXCLUDE_PREFIX is not added as a Token.

Parameters
deleteCover

Whether to remove the original token.

Type: Boolean  — Default value: true

patterns

A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed.

Type: String[]

Table 134. Capabilities

Inputs

Token

Outputs

Token

Languages

none specified
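The include/exclude behavior described for the Pattern-based Token Segmenter can be sketched in plain Java. This is an illustration under assumptions: the class PatternSplitSketch is hypothetical, and the prefix strings "+|" and "-|" are placeholders for this sketch only; the actual values should be taken from the INCLUDE_PREFIX and EXCLUDE_PREFIX constants of the component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternSplitSketch {
    // Placeholder prefix strings for this sketch only (assumption).
    static final String INCLUDE = "+|";
    static final String EXCLUDE = "-|";

    /** Splits one token text at the matches of a (possibly prefixed) pattern. */
    static List<String> split(String token, String prefixedPattern) {
        boolean include = prefixedPattern.startsWith(INCLUDE);
        String regex = prefixedPattern;
        if (include) {
            regex = prefixedPattern.substring(INCLUDE.length());
        } else if (prefixedPattern.startsWith(EXCLUDE)) {
            regex = prefixedPattern.substring(EXCLUDE.length());
        }
        // Without a prefix, the pattern is treated as excluded.
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(token);
        int begin = 0;
        while (m.find()) {
            if (m.start() > begin) {
                parts.add(token.substring(begin, m.start()));
            }
            if (include && m.end() > m.start()) {
                parts.add(m.group()); // the split characters become a part of their own
            }
            begin = m.end();
        }
        if (begin < token.length()) {
            parts.add(token.substring(begin));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("e-mail", "+|-"));
        System.out.println(split("e-mail", "-|-"));
    }
}
```

With the include prefix the hyphen is kept as a separate part; with the exclude prefix, or with no prefix, it is dropped.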

Regex Segmenter

Short name

RegexSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.RegexSegmenter

Description

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behavior is to split sentences by a line break and tokens by whitespace.

Parameters
language

The language.

Optional — Type: String

sentenceBoundaryRegex

Define the sentence boundary.

Type: String  — Default value: ``

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenBoundaryRegex

Defines the pattern that is used as token end boundary.

When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n].

Type: String  — Default value: [\\s\n]+

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 135. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

none specified
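The note on tokenBoundaryRegex for the Regex Segmenter (adding the newline to the boundary characters) can be tried out with java.util.regex alone. The following is a minimal sketch, not the segmenter's code: the class TokenBoundarySketch is hypothetical and simply splits a string at boundary matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class TokenBoundarySketch {
    /** Splits text at matches of the boundary pattern and drops empty pieces. */
    static List<String> tokens(String text, String tokenBoundaryRegex) {
        return Arrays.stream(Pattern.compile(tokenBoundaryRegex).split(text))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The line ends with a line break rather than with the boundary character '-'.
        String line = "tokenized-text\n";
        System.out.println(tokens(line, "[-\\n]")); // newline is part of the boundary
        System.out.println(tokens(line, "-"));      // newline stays attached to "text"
    }
}
```

With the pattern [-\n] the pieces are "tokenized" and "text". With "-" alone, the second piece still carries the trailing line break, which is the situation the parameter description warns about.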

Token Merger

Short name

TokenMerger

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.TokenMerger

Description

Merges any Tokens that are covered by a given annotation type. For example, this component can be used to create a single token from all tokens that constitute a multi-token named entity.

Parameters
POSMappingLocation

Override the tagset mapping.

Optional — Type: String

annotationType

Annotation type for which tokens should be merged.

Type: String

constraint

A constraint on the annotations that should be considered, in the form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.

Optional — Type: String

cposValue

Set a new coarse POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

lemmaMode

Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma or LEAVE the lemma of the first token as-is.

Type: String  — Default value: JOIN

posType

Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.

Optional — Type: String

posValue

Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Optional — Type: String

Table 136. Capabilities

Inputs

POS Lemma Token

Outputs

Lemma

Languages

none specified
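The three lemmaMode options of the Token Merger (JOIN, REMOVE, LEAVE) can be summarized in a small sketch. The class LemmaModeSketch and its enum are hypothetical stand-ins written for illustration; they follow the parameter description and are not the component's own types.

```java
import java.util.List;

final class LemmaModeSketch {
    enum LemmaMode { JOIN, REMOVE, LEAVE }

    /** Returns the lemma for the merged token, or null if the lemma is removed. */
    static String mergedLemma(List<String> lemmas, LemmaMode mode) {
        switch (mode) {
        case JOIN:
            return String.join(" ", lemmas); // space-separated lemmata
        case REMOVE:
            return null;
        case LEAVE:
            return lemmas.isEmpty() ? null : lemmas.get(0); // lemma of the first token, as-is
        default:
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        List<String> lemmas = List.of("New", "York");
        for (LemmaMode mode : LemmaMode.values()) {
            System.out.println(mode + " -> " + mergedLemma(lemmas, mode));
        }
    }
}
```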

UDPipe Segmenter

Short name

UDPipeSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-udpipe-asl

Implementation

org.dkpro.core.udpipe.UDPipeSegmenter

Description

Tokenizer and sentence splitter using UDPipe.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 137. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

see available models

Table 138. Models
Language Variant Version

en

ud

20160523.1

no

ud

20160523.1
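The modelArtifactUri parameter of the UDPipe Segmenter takes a URI made of the mvn: scheme followed by the group ID, artifact ID, and version, separated by colons. The sketch below only assembles and checks such a string; the class ModelArtifactUriSketch and the coordinates used in main are hypothetical and do not name a real model artifact.

```java
final class ModelArtifactUriSketch {
    /** Builds a URI of the form mvn:groupId:artifactId:version. */
    static String toUri(String groupId, String artifactId, String version) {
        return "mvn:" + groupId + ":" + artifactId + ":" + version;
    }

    /** Splits a mvn: URI into { groupId, artifactId, version }. */
    static String[] parse(String uri) {
        if (!uri.startsWith("mvn:")) {
            throw new IllegalArgumentException("Not a mvn: URI: " + uri);
        }
        String[] parts = uri.substring("mvn:".length()).split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected groupId:artifactId:version in " + uri);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, for illustration only.
        String uri = toUri("org.example", "example-model-en-ud", "20160523.1");
        System.out.println(uri);
        System.out.println(parse(uri)[1]);
    }
}
```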

Whitespace Segmenter

Short name

WhitespaceSegmenter

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.WhitespaceSegmenter

Description

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 139. Capabilities

Inputs

none specified

Outputs

Sentence Token

Languages

none specified
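The strictZoning description shared by the segmenters says that, with strict zoning turned off, a list of all zone boundaries (start and end) is created and segmentation happens between them. A rough sketch of that idea follows. The class ZoneBoundarySketch is hypothetical, and treating the document start and end as additional boundaries is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ZoneBoundarySketch {
    /** Returns the [begin, end) stretches between consecutive zone boundaries. */
    static List<int[]> stretches(int documentLength, List<int[]> zones) {
        // Sorted, duplicate-free collection of all boundaries.
        TreeSet<Integer> boundaries = new TreeSet<>();
        boundaries.add(0);              // assumption: document start counts as a boundary
        boundaries.add(documentLength); // assumption: document end counts as a boundary
        for (int[] zone : zones) {
            boundaries.add(zone[0]);
            boundaries.add(zone[1]);
        }
        List<int[]> out = new ArrayList<>();
        Integer previous = null;
        for (int boundary : boundaries) {
            if (previous != null) {
                out.add(new int[] { previous, boundary });
            }
            previous = boundary;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two overlapping zones of possibly different types.
        List<int[]> zones = List.of(new int[] { 5, 10 }, new int[] { 8, 20 });
        for (int[] s : stretches(30, zones)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

Each resulting stretch would then be segmented on its own, so no sentence or token crosses a zone boundary.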

org.dkpro.core.tokit.TokenTrimmer

Short name

TokenTrimmer

Category

Segmenter

Group ID

org.dkpro.core

Artifact ID

dkpro-core-tokit-asl

Implementation

org.dkpro.core.tokit.TokenTrimmer

Description

Remove prefixes and suffixes from tokens.

Parameters
prefixes

List of prefixes to remove.

Type: String[]

suffixes

List of suffixes to remove.

Type: String[]

Table 140. Capabilities

Inputs

Token

Outputs

Token

Languages

none specified
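What the TokenTrimmer does to a single token text can be sketched as follows. The class TokenTrimSketch is hypothetical, and two details are assumptions of this sketch rather than documented behavior: only the first matching prefix and the first matching suffix are removed, and each is removed at most once.

```java
import java.util.List;

final class TokenTrimSketch {
    /** Removes the first matching prefix and the first matching suffix from the token text. */
    static String trim(String token, List<String> prefixes, List<String> suffixes) {
        String result = token;
        for (String prefix : prefixes) {
            if (!prefix.isEmpty() && result.startsWith(prefix)) {
                result = result.substring(prefix.length());
                break;
            }
        }
        for (String suffix : suffixes) {
            if (!suffix.isEmpty() && result.endsWith(suffix)) {
                result = result.substring(0, result.length() - suffix.length());
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(trim("#hashtag.", List.of("#", "@"), List.of(".", ",")));
    }
}
```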

Semantic role labeler

Table 141. Analysis Components in category Semantic role labeler (2)
Component Description

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

MateSemanticRoleLabeler

Annotator for the MateTools Semantic Role Labeler.

ClearNLP Semantic Role Labeler

Short name

ClearNlpSemanticRoleLabeler

Category

Semantic role labeler

Group ID

org.dkpro.core

Artifact ID

dkpro-core-clearnlp-asl

Implementation

org.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler

Description

ClearNLP semantic role labeller.

Parameters
expandArguments

Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.

Warning: this parameter should be used with caution! First, if the descendants of a head word cover a non-contiguous region of the text, this information is lost: the arguments will appear to span a contiguous region. Second, the arguments may overlap with each other. For example, if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

predModelLocation

Location from which the predicate identifier model is read.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

roleModelLocation

Location from which the roleset classification model is read.

Optional — Type: String

srlModelLocation

Location from which the semantic role labeling model is read.

Optional — Type: String

Table 142. Capabilities

Inputs

POS Lemma Sentence Token Dependency

Outputs

SemArg SemPred

Languages

see available models

Table 143. Models
Language Variant Version

en

mayo

20131111.0

en

ontonotes

20131128.0
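The expandArguments option of the ClearNLP Semantic Role Labeler expands an argument to the text covered by the minimal and maximal token offsets of the head word and its descendants. The sketch below shows that offset computation and why gaps are lost. The class ExpandArgumentSketch and the offsets are made up for illustration.

```java
import java.util.List;

final class ExpandArgumentSketch {
    /** Covers the head and all descendants: smallest begin to largest end. */
    static int[] expand(int[] head, List<int[]> descendants) {
        int begin = head[0];
        int end = head[1];
        for (int[] d : descendants) {
            begin = Math.min(begin, d[0]);
            end = Math.max(end, d[1]);
        }
        return new int[] { begin, end };
    }

    public static void main(String[] args) {
        int[] head = { 10, 15 };
        // The descendants lie on both sides of the head, with gaps in between.
        List<int[]> descendants = List.of(new int[] { 0, 3 }, new int[] { 20, 25 });
        int[] span = expand(head, descendants);
        // The expanded span is contiguous, so the gaps 3-10 and 15-20 are covered as well.
        System.out.println(span[0] + "-" + span[1]);
    }
}
```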

Mate Tools Semantic Role Labeler

Short name

MateSemanticRoleLabeler

Category

Semantic role labeler

Group ID

org.dkpro.core

Artifact ID

dkpro-core-matetools-gpl

Implementation

org.dkpro.core.matetools.MateSemanticRoleLabeler

Description

Annotator for the MateTools Semantic Role Labeler.

If you use the semantic role labeler, please cite the following paper: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43-48, Boulder, June 4-5, 2009.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 144. Capabilities

Inputs

POS Lemma Sentence Token Dependency

Outputs

SemArg SemPred

Languages

see available models

Table 145. Models
Language Variant Version

de

tiger

20130105.0

en

conll2009

20130117.0

es

conll2009

20130320.0

zh

conll2009

20130117.0

Stemmer

Table 146. Analysis Components in category Stemmer (6)
Component Description

CisStemmer

UIMA wrapper for the CISTEM algorithm.

SmileLancasterStemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

LancasterStemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

MyStemStemmer

This MyStem stemmer implementation only works with the Russian language.

OpenNlpSnowballStemmer

UIMA wrapper for the Snowball stemmer included with OpenNLP.

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

CIS Stemmer

Short name

CisStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-cisstem-asl

Implementation

org.dkpro.core.cisstem.CisStemmer

Description

UIMA wrapper for the CISTEM algorithm.

CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that also performs stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Optional — Type: Boolean  — Default value: false

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 147. Capabilities

Inputs

none specified

Outputs

Stem

Languages

de

Lancaster Stemmer

Short name

SmileLancasterStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-smile-asl

Implementation

org.dkpro.core.smile.SmileLancasterStemmer

Description

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Specifies the language supported by the stemming model. Default value is "en" (English).

Type: String  — Default value: en

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Specifies a URL that should resolve to a location from which to load custom rules. If the location starts with classpath:, it is interpreted as a classpath location, e.g. "classpath:my/path/to/the/rules". Otherwise it is tried as a URL, then as a file, and finally as a UIMA resource.

Optional — Type: String

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

stripPrefix

If true, the stemmer strips prefixes such as kilo, micro, milli, intra, ultra, mega, nano, pico, pseudo.

Type: Boolean  — Default value: false

Table 148. Capabilities

Inputs

Token

Outputs

Stem

Languages

en

Lancaster Stemmer

Short name

LancasterStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core

Implementation

org.dkpro.core.lancaster.LancasterStemmer

Description

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Specifies the language supported by the stemming model. Default value is "en" (English).

Type: String  — Default value: en

modelArtifactUri

URI of the model artifact. This can be used to override the default model resolving mechanism and directly address a particular model.

The URI format is mvn:${groupId}:${artifactId}:${version}. Remember to set the variant parameter to match the artifact. If the artifact contains the model in a non-default location, you also have to specify the model location parameter, e.g. classpath:/model/path/in/artifact/model.bin.

Optional — Type: String

modelLocation

Specifies a URL that should resolve to a location from which to load custom rules. If the location starts with classpath:, it is interpreted as a classpath location, e.g. "classpath:my/path/to/the/rules". Otherwise it is tried as a URL, then as a file, and finally as a UIMA resource.

Optional — Type: String

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

stripPrefix

If true, the stemmer strips prefixes such as kilo, micro, milli, intra, ultra, mega, nano, pico, pseudo.

Type: Boolean  — Default value: false

Table 149. Capabilities

Inputs

Token

Outputs

Stem

Languages

en

MyStem Stemmer

Short name

MyStemStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mystem-asl

Implementation

org.dkpro.core.mystem.MyStemStemmer

Description

This MyStem stemmer implementation only works with the Russian language.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 150. Capabilities

Inputs

Token

Outputs

Stem

Languages

ru

OpenNLP Snowball Stemmer

Short name

OpenNlpSnowballStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-opennlp-asl

Implementation

org.dkpro.core.opennlp.OpenNlpSnowballStemmer

Description

UIMA wrapper for the Snowball stemmer included with OpenNLP. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that also performs stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples
Input        false (default)  true
EDUCATIONAL  EDUCATIONAL      educ
Educational  Educat           educ
educational  educ             educ

Optional — Type: Boolean  — Default value: false

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 151. Capabilities

Inputs

none specified

Outputs

Stem

Languages

ar, da, de, el, en, es, fi, fr, ga, hu, it, nl, no, pt, ro, ru, sv, tr

Snowball Stemmer

Short name

SnowballStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-snowball-asl

Implementation

org.dkpro.core.snowball.SnowballStemmer

Description

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline that also performs stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples
Input        false (default)  true
EDUCATIONAL  EDUCATIONAL      educ
Educational  Educat           educ
educational  educ             educ

Optional — Type: Boolean  — Default value: false

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 152. Capabilities

Inputs

none specified

Outputs

Stem

Languages

da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, sv, tr

Topic Model

Topic modeling is a statistical approach to discover abstract topics in a collection of documents. A topic is characterized by a probability distribution over the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability with which a document belongs to each of the topics in the model.

Table 153. Analysis Components in category Topic Model (2)
Component Description

MalletLdaTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MalletLdaTopicModelTrainer

Estimate an LDA topic model using Mallet and write it to a file.

Mallet LDA Topic Model Inferencer

Short name

MalletLdaTopicModelInferencer

Category

Topic Model

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mallet-asl

Implementation

org.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer

Description

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameters
burnIn

The number of iterations before hyper-parameter optimization begins.

Type: Integer  — Default value: 1

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

maxTopicAssignments

Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is used.

Type: Integer  — Default value: 0

minTokenLength

Ignore tokens (or lemmas, respectively) that are shorter than the given value.

Type: Integer  — Default value: 3

minTopicProb

Minimum topic proportion for the document-topic assignment.

Type: Float  — Default value: 0.2

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Type: String

nIterations

The number of iterations during inference.

Type: Integer  — Default value: 100

thinning

The number of iterations between saved samples.

Type: Integer  — Default value: 5

tokenFeaturePath

The annotation type to use for the model. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

typeName

The annotation type to use as tokens.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Table 154. Capabilities

Inputs

Token

Outputs

TopicDistribution

Languages

none specified
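The interplay of maxTopicAssignments and minTopicProb in the Mallet LDA Topic Model Inferencer can be pictured with a small selection routine. This is a sketch under assumptions, not the inferencer's code: the class TopicAssignmentSketch is hypothetical, and it assumes that topics below minTopicProb are dropped first and that the remaining topics are then ranked by proportion and cut off at the maximum.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class TopicAssignmentSketch {
    /** Returns the indices of the assigned topics, highest proportion first. */
    static List<Integer> assign(double[] distribution, int maxTopicAssignments,
            double minTopicProb) {
        // If the maximum is not set (<= 0), use the number of topics divided by 10.
        int limit = maxTopicAssignments > 0 ? maxTopicAssignments : distribution.length / 10;
        return IntStream.range(0, distribution.length).boxed()
                .filter(i -> distribution[i] >= minTopicProb)
                .sorted((a, b) -> Double.compare(distribution[b], distribution[a]))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] distribution = new double[20]; // a model with 20 topics
        distribution[3] = 0.5;
        distribution[7] = 0.3;
        distribution[1] = 0.15;
        System.out.println(assign(distribution, 0, 0.2)); // limit is 20 / 10 = 2
        System.out.println(assign(distribution, 1, 0.2));
    }
}
```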

Mallet LDA Topic Model Trainer

Short name

MalletLdaTopicModelTrainer

Category

Topic Model

Group ID

org.dkpro.core

Artifact ID

dkpro-core-mallet-asl

Implementation

org.dkpro.core.mallet.lda.MalletLdaTopicModelTrainer

Description

Estimate an LDA topic model using Mallet and write it to a file. All incoming CASes are stored as Mallet Instances before the model is estimated using a ParallelTopicModel.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters
alphaSum

The sum of alphas over all topics.

Another recommended value is 50 / T (number of topics).

Type: Float  — Default value: 1.0

beta

Beta for a single dimension of the Dirichlet prior.

Type: Float  — Default value: 0.01

burninPeriod

The number of iterations before hyper-parameter optimization begins.

Type: Integer  — Default value: 100

compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

coveringAnnotationType

If specified, the text contained in annotations of the given segmentation type is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.sentence. Text that is not within such annotations is ignored.

By default, the full text is used as a document.

Type: String  — Default value: ``

displayInterval

The interval in which to display the estimated topics.

Type: Integer  — Default value: 50

displayNTopicWords

The number of top words to display during estimation.

Type: Integer  — Default value: 7

escapeFilename

URL-encode the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: false

filterRegex

Regular expression of tokens to be filtered.

Type: String  — Default value: ``

filterRegexReplacement

Value with which tokens matching the regular expression are replaced.

Type: String  — Default value: ``

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

minTokenLength

Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value.

Type: Integer  — Default value: 3

nIterations

The number of iterations during model estimation.

Type: Integer  — Default value: 1000

nTopics

The number of topics to estimate.

Type: Integer  — Default value: 10

numThreads

The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int).

Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating.

Type: Integer  — Default value: 0

optimizeInterval

Interval for optimizing Dirichlet hyper-parameters.

Type: Integer  — Default value: 50

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

paramStopwordsFile

The location of the stopwords file.

Type: String  — Default value: ``

paramStopwordsReplacement

If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP).

Type: String  — Default value: ``

randomSeed

Set random seed. If set to -1 (default), uses random generator.

Type: Integer  — Default value: -1

saveInterval

Define how frequently an intermediate serialized model is saved to disk during estimation.

Type: Integer  — Default value: 0

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

tokenFeaturePath

The annotation type to use as input tokens for the model estimation. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

useCharacters

If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored.

Type: Boolean  — Default value: false

useDocumentId

Use the document ID as file name even if a relative path information is present.

Type: Boolean  — Default value: false

useSymmetricAlpha

Use a symmetric alpha value during model estimation?

Type: Boolean  — Default value: false

Transformer

Table 155. Analysis Components in category Transformer (15)
Component Description

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

CapitalizationNormalizer

Takes a text and replaces wrong capitalization

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

ExpressiveLengtheningNormalizer

Takes a text and shortens extra long words

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

HyphenationRemover

Simple dictionary-based hyphenation remover.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expression.

ReplacementFileNormalizer

Takes a text and replaces desired expressions.

SharpSNormalizer

Takes a text and replaces sharp s

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

UmlautNormalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

CAS Transformation - Apply

Short name

ApplyChangesAnnotator

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-castransformation-asl

Implementation

org.dkpro.core.castransformation.ApplyChangesAnnotator

Description

Applies changes annotated using a SofaChangeAnnotation.

Table 156. Capabilities

Inputs

DocumentMetaData SofaChangeAnnotation

Outputs

DocumentMetaData SofaChangeAnnotation

Languages

none specified

CAS Transformation - Map back

Short name

Backmapper

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-castransformation-asl

Implementation

org.dkpro.core.castransformation.Backmapper

Description

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

This annotator is able to resume the mapping after a CAS restore from any point after the cleaned view has been created, as long as no changes were made to SofaChangeAnnotations in the original view.

Parameters
Chain

Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].

Optional — Type: String[]  — Default value: [source, target]
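
Mapping annotations back from a cleaned view to the original view amounts to translating character offsets through the recorded changes. The following is a minimal sketch of that idea using only the Java standard library; the class ChangeMappingSketch and its Change type are hypothetical stand-ins for illustration and are not the DKPro Core implementation or the SofaChangeAnnotation type.

```java
import java.util.List;

final class ChangeMappingSketch {
    /** Hypothetical stand-in for one change: replace source[begin, end) with replacement. */
    static final class Change {
        final int begin;
        final int end;
        final String replacement;

        Change(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    /** Applies non-overlapping changes, sorted by begin offset, to the source text. */
    static String apply(String source, List<Change> changes) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Change c : changes) {
            out.append(source, cursor, c.begin).append(c.replacement);
            cursor = c.end;
        }
        return out.append(source, cursor, source.length()).toString();
    }

    /** Maps an offset in the changed text back to the corresponding source offset. */
    static int mapBack(int targetOffset, List<Change> changes) {
        int shift = 0; // length gained so far in the changed text
        for (Change c : changes) {
            int targetBegin = c.begin + shift;
            int targetEnd = targetBegin + c.replacement.length();
            if (targetOffset < targetBegin) {
                break; // offset lies before this change
            }
            if (targetOffset < targetEnd) {
                return c.begin; // inside replaced text: snap to the start of the change
            }
            shift += c.replacement.length() - (c.end - c.begin);
        }
        return targetOffset - shift;
    }

    public static void main(String[] args) {
        List<Change> changes = List.of(new Change(2, 5, "love"));
        String cleaned = apply("I luv NLP", changes);
        System.out.println(cleaned + " / offset 7 maps back to " + mapBack(7, changes));
    }
}
```

Offsets that fall inside replaced text have no exact counterpart in the source, so this sketch snaps them to the start of the change; other policies are possible.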

Capitalization Normalizer

Short name

CapitalizationNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer

Description

Takes a text and replaces wrong capitalization

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 157. Capabilities

Inputs

Token

Outputs

none specified

Languages

none specified

Chinese Traditional/Simplified Converter

Short name

CjfNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-languagetool-asl

Implementation

org.dkpro.core.languagetool.CjfNormalizer

Description

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameters
direction

Direction in which to perform the conversion (Direction#TO_TRADITIONAL or Direction#TO_SIMPLIFIED);

Type: String  — Default value: TO_SIMPLIFIED

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 158. Capabilities

Inputs

none specified

Outputs

none specified

Languages

zh

Dictionary-based Token Transformer

Short name

DictionaryBasedTokenTransformer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer

Description

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.

Parameters
commentMarker

Lines starting with this character (or String) are ignored.

Type: String  — Default value: #

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Type: String

separator

Separator for mappings file.

Type: String  — Default value: ``

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
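
To illustrate the mapping-file semantics described above (two columns, comment lines skipped), here is a minimal standalone sketch. The class and method names are hypothetical, and the parsing details (empty lines skipped, malformed lines ignored) are assumptions rather than documented behavior of the component.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class DictionaryTransformSketch {
    /** Parses mapping lines; lines starting with the comment marker are skipped. */
    static Map<String, String> parse(List<String> lines, String separator, String commentMarker) {
        Map<String, String> mappings = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith(commentMarker)) {
                continue;
            }
            // Quote the separator so it is treated literally, not as a regex.
            String[] columns = line.split(Pattern.quote(separator), 2);
            if (columns.length == 2) {
                mappings.put(columns[0], columns[1]);
            }
        }
        return mappings;
    }

    /** Replaces every token found in the first column by its second-column value. */
    static List<String> transform(List<String> tokens, Map<String, String> mappings) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(mappings.getOrDefault(token, token));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = parse(List.of("# chat slang", "u\tyou", "gr8\tgreat"), "\t", "#");
        System.out.println(transform(List.of("c", "u", "gr8"), mappings));
    }
}
```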

Expressive Lengthening Normalizer

Short name

ExpressiveLengtheningNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer

Description

Takes a text and shortens extra long words

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 159. Capabilities

Inputs

Token

Outputs

none specified

Languages

none specified

File-based Token Transformer

Short name

FileBasedTokenTransformer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer

Description

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameters
ignoreCase

Match tokens against the dictionary without considering case.

Type: Boolean  — Default value: false

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Type: String

replacement

The value by which the matching tokens should be replaced.

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Hyphenation Remover

Short name

HyphenationRemover

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.transformation.HyphenationRemover

Description

Simple dictionary-based hyphenation remover.

Parameters
modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Regex-based Token Transformer

Short name

RegexBasedTokenTransformer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer

Description

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expression.

The parameter #PARAM_REGEX defines the regular expression to search for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.

Parameters
regex

Define the regular expression to be replaced

Type: String

replacement

Define the string to replace matching tokens with

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 160. Capabilities

Inputs

Token

Outputs

none specified

Languages

none specified
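
The effect of the regex and replacement parameters can be pictured with plain java.util.regex. This sketch assumes that every match inside a token's text is replaced; the class name is hypothetical and the real component operates on Token annotations in a CAS rather than on strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegexTokenReplaceSketch {
    /** Replaces all matches of the regex within each token text. */
    static List<String> transform(List<String> tokens, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(pattern.matcher(token).replaceAll(replacement));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(transform(List.of("colour", "color", "4ever"), "ou", "o"));
    }
}
```

Note that in java.util.regex the replacement string treats `$` and `\` specially, so literal replacements containing those characters would need Matcher.quoteReplacement.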

Replacement File Normalizer

Short name

ReplacementFileNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.ReplacementFileNormalizer

Description

Takes a text and replaces desired expressions. This class should not work on tokens as some expressions might span several tokens.

Parameters
modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location of a file which contains all replacing characters

Type: String

srcExpressionSurroundings

Pattern describing valid left/right context of the source expression.

Type: String  — Default value: IRRELEVANT

targetExpressionSurroundings

Left/right context of the replacement.

Type: String  — Default value: NOTHING

Table 161. Capabilities

Inputs

Token

Outputs

SofaChangeAnnotation

Languages

none specified

Sharp S (ß) Normalizer

Short name

SharpSNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.frequency.SharpSNormalizer

Description

Takes a text and replaces sharp s

Parameters
minFrequencyThreshold

Minimum frequency count.

Type: Integer  — Default value: 100

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 162. Capabilities

Inputs

none specified

Outputs

none specified

Languages

de

Spelling Normalizer

Short name

SpellingNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.SpellingNormalizer

Description

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 163. Capabilities

Inputs

SpellingAnomaly

Outputs

none specified

Languages

none specified

Stanford Penn Treebank Normalizer

Short name

StanfordPtbTransformer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordPtbTransformer

Description

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Token Case Transformer

Short name

TokenCaseTransformer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.transformation.TokenCaseTransformer

Description

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameters
tokenCase

The case to convert tokens to:
  • UPPERCASE: uppercase everything.
  • LOWERCASE: lowercase everything.
  • NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
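
The three casing modes can be sketched as follows. This is an illustrative reading of the descriptions above, with hypothetical class and enum names; in particular, NORMALCASE is taken to leave the first character and any character directly after a hyphen unchanged while lowercasing the rest.

```java
import java.util.Locale;

final class TokenCaseSketch {
    enum Mode { UPPERCASE, LOWERCASE, NORMALCASE }

    static String convert(String token, Mode mode) {
        switch (mode) {
            case UPPERCASE:
                return token.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return token.toLowerCase(Locale.ROOT);
            default:
                return normalCase(token);
        }
    }

    /** Keeps the first character and characters right after a hyphen; lowercases the rest. */
    static String normalCase(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        boolean keep = true; // true for the first character and after each hyphen
        for (char c : token.toCharArray()) {
            sb.append(keep ? c : Character.toLowerCase(c));
            keep = (c == '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("McDonald-SMITH", Mode.NORMALCASE));
    }
}
```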

Umlaut Normalizer

Short name

UmlautNormalizer

Category

Transformer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.frequency.UmlautNormalizer

Description

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

Parameters
minFrequencyThreshold

Minimum frequency count.

Type: Integer  — Default value: 100

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 164. Capabilities

Inputs

Token

Outputs

none specified

Languages

de
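
One way to picture a frequency-based umlaut decision is sketched below. The decision rule (accept the umlaut form only if it reaches the minimum frequency and is more frequent than the original spelling) and the Map used as a frequency source are assumptions made for illustration; the actual frequency model and rule used by the component are not described here. For simplicity the sketch handles only lowercase "ae", "oe" and "ue" and replaces all occurrences at once.

```java
import java.util.Map;

final class UmlautSketch {
    /** Returns the umlaut form if the (hypothetical) frequency table supports it. */
    static String normalize(String token, Map<String, Long> frequencies, long minFrequency) {
        String candidate = token
                .replace("ae", "\u00e4")  // a-umlaut
                .replace("oe", "\u00f6")  // o-umlaut
                .replace("ue", "\u00fc"); // u-umlaut
        if (candidate.equals(token)) {
            return token; // nothing to normalize
        }
        long candidateFrequency = frequencies.getOrDefault(candidate, 0L);
        long tokenFrequency = frequencies.getOrDefault(token, 0L);
        boolean accept = candidateFrequency >= minFrequency && candidateFrequency > tokenFrequency;
        return accept ? candidate : token;
    }

    public static void main(String[] args) {
        Map<String, Long> frequencies = Map.of("M\u00e4dchen", 5000L, "Maedchen", 3L, "Israel", 9000L);
        System.out.println(normalize("Maedchen", frequencies, 100));
        System.out.println(normalize("Israel", frequencies, 100));
    }
}
```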

Other

Table 165. Analysis Components in category Other (17)
Component Description

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

CompoundAnnotator

Annotates compound parts and linking morphemes.

StanfordSentimentAnalyzer

Experimental wrapper for edu.stanford.nlp.pipeline.SentimentAnnotator which assigns 5 scores to each sentence.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

MauiKeywordAnnotator

The Maui tool assigns keywords to documents.

NGramAnnotator

N-gram annotator.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

PhraseAnnotator

Annotate phrases in a sentence.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

NorvigSpellingCorrector

Identifies spelling errors using Norvig's algorithm.

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfIdfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Annotation-By-Text Filter

Short name

AnnotationByTextFilter

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter

Description

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameters
ignoreCase

If true, annotation texts are filtered case-independently (i.e. words that occur in the list with different casing are not filtered out).

Type: Boolean  — Default value: true

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Type: String

typeName

Annotation type to filter.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
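
The retain-only-listed-words behavior, including the effect of ignoreCase, can be sketched with plain collections. Class and method names are hypothetical, and the real component removes annotations from the CAS rather than returning a list.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TextFilterSketch {
    /** Keeps only tokens whose text occurs in the word list. */
    static List<String> retain(List<String> tokens, Collection<String> words, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String word : words) {
            lookup.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (lookup.contains(key)) {
                out.add(token); // the original casing of the token is preserved
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(retain(List.of("The", "cat", "sat"), List.of("the", "sat"), true));
    }
}
```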

Compound Annotator

Short name

CompoundAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-decompounding-asl

Implementation

org.dkpro.core.decompounding.uima.annotator.CompoundAnnotator

Description

Annotates compound parts and linking morphemes.

Table 166. Capabilities

Inputs

Token

Outputs

Compound CompoundPart LinkingMorpheme Split

Languages

none specified

CoreNLP Sentiment Analyzer

Short name

StanfordSentimentAnalyzer

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-stanfordnlp-gpl

Implementation

org.dkpro.core.stanfordnlp.StanfordSentimentAnalyzer

Description

Experimental wrapper for edu.stanford.nlp.pipeline.SentimentAnnotator which assigns 5 scores to each sentence. NOTE: This component is currently very slow because it runs the full Stanford pipeline and does not take any existing DKPro annotations into account.

Table 167. Capabilities

Inputs

Sentence Token

Outputs

StanfordSentimentAnnotation

Languages

none specified

Corrections Contextualizer

Short name

CorrectionsContextualizer

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-jazzy-asl

Implementation

org.dkpro.core.jazzy.CorrectionsContextualizer

Description

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses n-gram frequencies from a frequency provider in order to rank the provided corrections.

Maui Keyword Annotator

Short name

MauiKeywordAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-maui-gpl

Implementation

org.dkpro.core.maui.MauiKeywordAnnotator

Description

The Maui tool assigns keywords to documents. The keywords can optionally come from a controlled vocabulary. The keywords are stored in DKPro Core MetaDataStringField annotations with the key http://purl.org/dc/terms/subject.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

maxTopics

Maximum number of keywords to assign to a document.

Type: Integer  — Default value: 10

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

scoreThreshold

Minimum similarity score required to count as a match (0-1).

Type: Float  — Default value: 0.5

vocabularyEncoding

Encoding of the vocabulary file. Normally, this information is obtained from the key vocabulary.encoding in the model metadata.

Optional — Type: String  — Default value: UTF-8

vocabularyFormat

Format of the vocabulary file. Normally, this information is obtained from the key vocabulary.format in the model metadata. Only skos and leaving the parameter unset (i.e. no vocabulary) are currently supported.

Optional — Type: String

vocabularyLocation

Location of the vocabulary file. Normally, this location is derived from the model location by replacing the model extension .ser with .rdf.gz.

Optional — Type: String

Table 168. Capabilities

Inputs

none specified

Outputs

MetaDataStringField

Languages

none specified

N-Gram Annotator

Short name

NGramAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-ngrams-asl

Implementation

org.dkpro.core.ngrams.NGramAnnotator

Description

N-gram annotator.

Parameters
N

The length of the n-grams to generate (the "n" in n-gram).

Type: Integer  — Default value: 3

Table 169. Capabilities

Inputs

Sentence Token

Outputs

NGram

Languages

none specified
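
As a rough illustration of n-gram generation over the tokens of one sentence, the sketch below slides a window of size n over a token list. It emits only n-grams of exactly length n; whether the component additionally emits shorter n-grams is not stated here, so treat that as an assumption. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramSketch {
    /** Returns all contiguous n-grams of exactly length n, joined by single spaces. */
    static List<String> ngrams(List<String> sentenceTokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= sentenceTokens.size(); i++) {
            out.add(String.join(" ", sentenceTokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("a", "b", "c", "d"), 3));
    }
}
```

Running the window per sentence, as the Sentence and Token inputs suggest, keeps n-grams from crossing sentence boundaries.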

POS Filter

Short name

PosFilter

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-posfilter-asl

Implementation

org.dkpro.core.posfilter.PosFilter

Description

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameters
adj

Keep/remove adjectives (true: keep, false: remove)

Type: Boolean  — Default value: false

adp

Keep/remove adpositions (true: keep, false: remove)

Type: Boolean  — Default value: false

adv

Keep/remove adverbs (true: keep, false: remove)

Type: Boolean  — Default value: false

aux

Keep/remove auxiliary verbs (true: keep, false: remove)

Type: Boolean  — Default value: false

conj

Keep/remove conjunctions (true: keep, false: remove)

Type: Boolean  — Default value: false

det

Keep/remove articles (true: keep, false: remove)

Type: Boolean  — Default value: false

intj

Keep/remove interjections (true: keep, false: remove)

Type: Boolean  — Default value: false

noun

Keep/remove nouns (true: keep, false: remove)

Type: Boolean  — Default value: false

num

Keep/remove numerals (true: keep, false: remove)

Type: Boolean  — Default value: false

part

Keep/remove particles (true: keep, false: remove)

Type: Boolean  — Default value: false

pron

Keep/remove pronouns (true: keep, false: remove)

Type: Boolean  — Default value: false

propn

Keep/remove proper nouns (true: keep, false: remove)

Type: Boolean  — Default value: false

punct

Keep/remove punctuation (true: keep, false: remove)

Type: Boolean  — Default value: false

sconj

Keep/remove subordinating conjunctions (true: keep, false: remove)

Type: Boolean  — Default value: false

sym

Keep/remove symbols (true: keep, false: remove)

Type: Boolean  — Default value: false

typeToRemove

The fully qualified name of the type that should be filtered.

Type: String

verb

Keep/remove verbs (true: keep, false: remove)

Type: Boolean  — Default value: false

x

Keep/remove other (true: keep, false: remove)

Type: Boolean  — Default value: false

Table 170. Capabilities

Inputs

POS

Outputs

none specified

Languages

none specified
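
Since every keep flag defaults to false, only the parts of speech explicitly switched on survive filtering. The sketch below models that with a set of kept coarse POS labels; the class name, the use of Map.Entry pairs for (text, POS), and the label strings are illustrative assumptions, not the component's data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PosFilterSketch {
    /** Keeps the text of tokens whose POS label is in the kept set. */
    static List<String> keep(List<Map.Entry<String, String>> taggedTokens, Set<String> keptPos) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> token : taggedTokens) {
            if (keptPos.contains(token.getValue())) {
                out.add(token.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> sentence = List.of(
                Map.entry("The", "DET"), Map.entry("dog", "NOUN"), Map.entry("barks", "VERB"));
        // Corresponds to setting noun=true and verb=true and leaving all other flags false.
        System.out.println(keep(sentence, Set.of("NOUN", "VERB")));
    }
}
```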

POS Mapper

Short name

PosMapper

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-posfilter-asl

Implementation

org.dkpro.core.posfilter.PosMapper

Description

Maps existing POS tags from one tagset to another using a user provided properties file.

Parameters
dkproMappingLocation

A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes.
If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.

Optional — Type: String

mappingFile

A properties file containing POS tagset mappings.

Type: String

Table 171. Capabilities

Inputs

POS Token

Outputs

POS Token

Languages

none specified
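
A properties-based tag mapping can be sketched with java.util.Properties. The example tagset entries are invented for illustration, and leaving unmapped tags unchanged is an assumption of this sketch, not documented behavior.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class PosMappingSketch {
    /** Loads "oldTag=newTag" lines in Java properties syntax. */
    static Properties load(String propertiesText) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Maps each tag through the properties; unmapped tags are kept (an assumption). */
    static List<String> map(List<String> tags, Properties mapping) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            out.add(mapping.getProperty(tag, tag));
        }
        return out;
    }

    public static void main(String[] args) {
        Properties mapping = load("NN=NOUN\nVBZ=VERB\n");
        System.out.println(map(List.of("NN", "VBZ", "FW"), mapping));
    }
}
```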

Phrase Annotator

Short name

PhraseAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-frequency-asl

Implementation

org.dkpro.core.frequency.phrasedetection.PhraseAnnotator

Description

Annotate phrases in a sentence. Depending on the provided n-grams and the threshold, these comprise either one or two annotations (tokens, lemmas, ...).

In order to identify longer phrases, run the FrequencyWriter and this annotator multiple times, each time taking the results of the previous run as input. From the second run on, set phrases in the feature path parameter #PARAM_FEATURE_PATH.

Parameters
PARAM_LOWERCASE

If true, lowercase everything.

Type: Boolean  — Default value: false

coveringType

Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences.

Optional — Type: String

discount

Discount applied to prevent the formation of too many phrases that consist of very infrequent words. A typical value is the minimum count set during model creation (FrequencyWriter#PARAM_MIN_COUNT), which defaults to 5.

Type: Integer  — Default value: 5

featurePath

The feature path to use for building bigrams.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

filterRegex

Regular expression of tokens to be filtered.

Type: String  — Default value: ``

modelLocation

The file providing the uni-grams and bi-grams to use.

Type: String

regexReplacement

Value with which tokens matching the regular expression are replaced.

Type: String  — Default value: ``

stopwordsFile

Path of a file containing stopwords, one word per line.

Type: String  — Default value: ``

stopwordsReplacement

Stopwords are replaced by this value.

Type: String  — Default value: ``

threshold

The threshold score for phrase construction. Default is 100. Lower values result in fewer phrases. The value strongly depends on the size of the corpus and the token unigrams.

Type: Float  — Default value: 100.0

Readability Annotator

Short name

ReadabilityAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-readability-asl

Implementation

org.dkpro.core.readability.ReadabilityAnnotator

Description

Assign a set of popular readability scores to the text.

Table 172. Capabilities

Inputs

Sentence Token

Outputs

ReadabilityScore

Languages

none specified

Regex Token Filter

Short name

RegexTokenFilter

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.annotations.RegexTokenFilter

Description

Remove every token that does or does not match a given regular expression.

Parameters
mustMatch

If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.

Type: Boolean  — Default value: true

regex

Every token that does or does not match this regular expression will be removed.

Type: String

Table 173. Capabilities

Inputs

Token

Outputs

none specified

Languages

none specified
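
The interaction of regex and mustMatch can be summarized in a few lines. This sketch assumes that "match" means the whole token text matches the expression (Matcher.matches); if the component uses partial matching instead, the condition would use find(). The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegexTokenFilterSketch {
    /** mustMatch=true keeps matching tokens; mustMatch=false keeps non-matching tokens. */
    static List<String> filter(List<String> tokens, String regex, boolean mustMatch) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (pattern.matcher(token).matches() == mustMatch) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("abc", "123", "a1");
        System.out.println(filter(tokens, "[a-z]+", true));
        System.out.println(filter(tokens, "[a-z]+", false));
    }
}
```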

Simple Spelling Corrector

Short name

NorvigSpellingCorrector

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-norvig-asl

Implementation

org.dkpro.core.norvig.NorvigSpellingCorrector

Description

Identifies spelling errors using Norvig's algorithm.

Parameters
modelLocation

Location from which the model is read. This is either a local path or a classpath location. In the latter case, the model artifact (if any) is searched as well.

Optional — Type: String

Table 174. Capabilities

Inputs

Token

Outputs

SofaChangeAnnotation

Languages

none specified

Stop Word Remover

Short name

StopWordRemover

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core

Implementation

org.dkpro.core.stopwordremover.StopWordRemover

Description

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameters
Paths

Feature paths for annotations that should be matched/removed. The default is

StopWord.class.getName()
Token.class.getName()
Lemma.class.getName()+"/value"

Optional — Type: String[]

StopWordType

Anything annotated with this type will be removed even if it does not match any word in the lists.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is used only for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"

Type: String[]

Table 175. Capabilities

Inputs

StopWord

Outputs

none specified

Languages

none specified
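
The language-prefix convention for modelLocation entries can be sketched as a small parser. The class and method names are hypothetical and the regular expression is one possible reading of the format described above, not the component's own parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StopWordLocationSketch {
    private static final Pattern PREFIXED = Pattern.compile("^\\[([^\\]]+)\\](.+)$");

    /** Returns {language, location}; entries without a prefix get the language "*". */
    static String[] parse(String entry) {
        Matcher matcher = PREFIXED.matcher(entry);
        if (matcher.matches()) {
            return new String[] { matcher.group(1), matcher.group(2) };
        }
        return new String[] { "*", entry };
    }

    /** True if the list applies to a document in the given language. */
    static boolean appliesTo(String entry, String documentLanguage) {
        String language = parse(entry)[0];
        return language.equals("*") || language.equals(documentLanguage);
    }

    public static void main(String[] args) {
        String[] parsed = parse("[de]classpath:/stopwords/de.txt");
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```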

Stopwatch

Short name

Stopwatch

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-performance-asl

Implementation

org.dkpro.core.performance.Stopwatch

Description

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.

Parameters
timerName

Name of the timer pair. Upstream and downstream timer need to use the same name.

Type: String

timerOutputFile

Name of the file to which the timer output is written.

Optional — Type: String

Table 176. Capabilities

Inputs

TimerAnnotation

Outputs

TimerAnnotation

Languages

none specified
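
The paired-timer idea (one instance upstream, one downstream, both sharing a timer name) can be sketched as follows. This is a simplified standalone model with a hypothetical class and an injectable clock; the actual component records its state via TimerAnnotation in the CAS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class StopwatchSketch {
    private final Map<String, Long> startTimes = new HashMap<>();
    private final LongSupplier clock;

    StopwatchSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** The first call for a name starts the timer and returns -1; the second returns the elapsed time. */
    long mark(String timerName) {
        long now = clock.getAsLong();
        Long start = startTimes.remove(timerName);
        if (start == null) {
            startTimes.put(timerName, now);
            return -1;
        }
        return now - start;
    }

    public static void main(String[] args) {
        StopwatchSketch stopwatch = new StopwatchSketch(System::nanoTime);
        stopwatch.mark("tagging");                 // upstream position
        long elapsed = stopwatch.mark("tagging");  // downstream position
        System.out.println("elapsed ns: " + elapsed);
    }
}
```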

TF/IDF Annotator

Short name

TfIdfAnnotator

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-frequency-asl

Implementation

org.dkpro.core.frequency.tfidf.TfIdfAnnotator

Description

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfIdfWriter.

Parameters
featurePath

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

Type: String

lowercase

If set to true, the whole text is handled in lower case.

Optional — Type: Boolean  — Default value: false

tfdfPath

Provide the path to the Df-Model. When a shared SharedDfModel is bound to this annotator, this is ignored.

Optional — Type: String

weightingModeIdf

The model for inverse document frequency weighting.
Invoke toString() on an enum of WeightingModeIdf for setup.

Default value is "NORMAL" yielding an unweighted idf.

Optional — Type: String  — Default value: NORMAL

weightingModeTf

The model for term frequency weighting.
Invoke toString() on an enum of WeightingModeTf for setup.

Default value is "NORMAL" yielding an unweighted tf.

Optional — Type: String  — Default value: NORMAL

Table 177. Capabilities

Inputs

none specified

Outputs

Tfidf

Languages

none specified
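
As a worked illustration of combining term frequency with document frequency, the sketch below multiplies the raw term count by the plain inverse document frequency (1/df). Reading the "NORMAL" modes as raw tf and 1/df is an assumption of this sketch; the class name and the Map standing in for a DfStore are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    /** Computes tf * (1 / df) per term; terms with unknown df get weight 0. */
    static Map<String, Double> weights(List<String> terms, Map<String, Integer> documentFrequency) {
        Map<String, Integer> termFrequency = new HashMap<>();
        for (String term : terms) {
            termFrequency.merge(term, 1, Integer::sum);
        }
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFrequency.entrySet()) {
            int df = documentFrequency.getOrDefault(entry.getKey(), 0);
            out.put(entry.getKey(), df == 0 ? 0.0 : entry.getValue() / (double) df);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(weights(List.of("a", "b", "a"), Map.of("a", 2, "b", 4)));
    }
}
```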

Trailing Character Remover

Short name

TrailingCharacterRemover

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover

Description

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

Parameters
minTokenLength

All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed.

Shorter tokens that do not have trailing chars removed are always retained, regardless of their length.

Type: Integer  — Default value: 1

pattern

A regex to be trimmed from the end of tokens.

Type: String  — Default value: [\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+
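The documented interaction between pattern and minTokenLength can be sketched on plain strings. This is a minimal illustration, not the component's code: the class TrailingTrimSketch is hypothetical, it works on strings rather than Token annotations, and the regex in the example is a simplified stand-in for the default pattern shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; this is not the DKPro Core implementation.
class TrailingTrimSketch {

    static List<String> trimTokens(List<String> tokens, String trailingRegex,
                                   int minTokenLength) {
        // Anchor the supplied regex so that it only matches at the end of a token.
        Pattern trailing = Pattern.compile("(?:" + trailingRegex + ")$");
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String trimmed = trailing.matcher(token).replaceFirst("");
            if (trimmed.equals(token)) {
                // Nothing was removed: the token is kept regardless of its length.
                result.add(token);
            } else if (trimmed.length() >= minTokenLength) {
                // Something was removed and the remainder is long enough.
                result.add(trimmed);
            }
            // Otherwise the trimmed token is too short and is dropped.
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified pattern: trailing punctuation only.
        System.out.println(trimTokens(List.of("Hello!!", "world", "...", "I"), "[.,!?]+", 1));
    }
}
```

In this sketch, "..." is dropped with a minimum length of 1 because nothing remains after trimming, while "I" is kept because nothing was removed from it. With a minimum length of 0, the empty remainder would be kept.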

Table 178. Capabilities

Inputs

Token

Outputs

Token

Languages

none specified

org.dkpro.core.textnormalizer.util.JCasHolder

Short name

JCasHolder

Category

Other

Group ID

org.dkpro.core

Artifact ID

dkpro-core-textnormalizer-asl

Implementation

org.dkpro.core.textnormalizer.util.JCasHolder

Description

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Appendix

Table 179. Producers and consumers by type
Type Producer Consumer

GrammarAnomaly

LanguageToolChecker

SpellingAnomaly

JazzyChecker

SpellingNormalizer

SuggestedAction

JazzyChecker

CoreferenceChain

CoreNlpCoreferenceResolver StanfordCoreferenceResolver

CoreferenceLink

CoreNlpCoreferenceResolver StanfordCoreferenceResolver

Tfidf

TfIdfAnnotator

Morpheme

MateMorphTagger

MorphologicalFeatures

MateMorphTagger RfTagger SfstAnnotator UDPipePosTagger

UDPipeParser

POS

ArktweetPosTagger ClearNlpPosTagger CoreNlpPosTagger HepplePosTagger HunPosTagger LingPipePosTagger MatePosTagger MeCabTagger Nlp4JPosTagger OpenNlpPosTagger PosMapper RfTagger SfstAnnotator StanfordPosTagger TreeTaggerPosTagger UDPipePosTagger

ArktweetPosTaggerTrainer ClearNlpLemmatizer ClearNlpParser ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpParser GermanSeparatedParticleAnnotator IxaLemmatizer MaltParser MateParser MateSemanticRoleLabeler MorphaLemmatizer MstParser Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer OpenNlpChunker OpenNlpLemmatizer OpenNlpPosTaggerTrainer PosFilter PosMapper SemanticFieldAnnotator StanfordCoreferenceResolver StanfordLemmatizer StanfordParser StanfordPosTaggerTrainer TokenMerger TreeTaggerChunker UDPipeParser

DocumentMetaData

ApplyChangesAnnotator

ApplyChangesAnnotator

MetaDataStringField

MauiKeywordAnnotator

NamedEntity

CoreNlpNamedEntityRecognizer LingPipeNamedEntityRecognizer Nlp4JNamedEntityRecognizer OpenNlpNamedEntityRecognizer SemanticFieldAnnotator StanfordNamedEntityRecognizer

CoreNlpCoreferenceResolver OpenNlpNamedEntityRecognizerTrainer StanfordCoreferenceResolver StanfordNamedEntityRecognizerTrainer

PhoneticTranscription

ColognePhoneticTranscriptor DoubleMetaphonePhoneticTranscriptor MetaphonePhoneticTranscriptor SoundexPhoneticTranscriptor

Compound

CompoundAnnotator

CompoundPart

CompoundAnnotator

Lemma

ClearNlpLemmatizer CoreNlpLemmatizer GateLemmatizer GermanSeparatedParticleAnnotator IxaLemmatizer LanguageToolLemmatizer MateLemmatizer MeCabTagger MorphaLemmatizer Nlp4JLemmatizer OpenNlpLemmatizer StanfordLemmatizer TokenMerger TreeTaggerPosTagger UDPipePosTagger

ClearNlpParser ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver GermanSeparatedParticleAnnotator MaltParser MateMorphTagger MateSemanticRoleLabeler Nlp4JNamedEntityRecognizer SemanticFieldAnnotator StanfordCoreferenceResolver TokenMerger UDPipeParser

LinkingMorpheme

CompoundAnnotator

NGram

NGramAnnotator

Paragraph

JTokSegmenter ParagraphSplitter

Sentence

BreakIteratorSegmenter ClearNlpSegmenter CoreNlpSegmenter GosenSegmenter IcuSegmenter JTokSegmenter JiebaSegmenter LanguageToolSegmenter LineBasedSentenceSegmenter LingPipeSegmenter MeCabTagger Nlp4JSegmenter OpenNlpSegmenter RegexSegmenter StanfordSegmenter UDPipeSegmenter WhitespaceSegmenter

ArktweetPosTaggerTrainer BerkeleyParser ClearNlpLemmatizer ClearNlpParser ClearNlpPosTagger ClearNlpSemanticRoleLabeler CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpNamedEntityRecognizer CoreNlpParser CoreNlpPosTagger DictionaryAnnotator GermanSeparatedParticleAnnotator HepplePosTagger HunPosTagger IxaLemmatizer LanguageToolLemmatizer LingPipePosTagger MaltParser MateLemmatizer MateMorphTagger MateParser MatePosTagger MateSemanticRoleLabeler MorphaLemmatizer MstParser NGramAnnotator Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer Nlp4JPosTagger OpenNlpChunker OpenNlpLemmatizer OpenNlpNamedEntityRecognizerTrainer OpenNlpParser OpenNlpPosTagger OpenNlpPosTaggerTrainer OpenNlpSentenceTrainer ReadabilityAnnotator RfTagger SfstAnnotator StanfordCoreferenceResolver StanfordNamedEntityRecognizer StanfordNamedEntityRecognizerTrainer StanfordParser StanfordPosTagger StanfordPosTaggerTrainer StanfordSentimentAnalyzer UDPipeParser UDPipePosTagger

Split

CompoundAnnotator

Stem

CisStemmer LancasterStemmer MyStemStemmer OpenNlpSnowballStemmer SmileLancasterStemmer SnowballStemmer

StopWord

StopWordRemover

Token

BreakIteratorSegmenter CamelCaseTokenSegmenter ClearNlpSegmenter CoreNlpSegmenter GosenSegmenter IcuSegmenter JTokSegmenter JiebaSegmenter LanguageToolSegmenter LingPipeSegmenter Nlp4JSegmenter OpenNlpSegmenter PatternBasedTokenSegmenter PosMapper RegexSegmenter StanfordSegmenter TokenTrimmer TrailingCharacterRemover UDPipeSegmenter WhitespaceSegmenter

ArktweetPosTagger ArktweetPosTaggerTrainer BerkeleyParser CamelCaseTokenSegmenter CapitalizationNormalizer ClearNlpLemmatizer ClearNlpParser ClearNlpPosTagger ClearNlpSemanticRoleLabeler ColognePhoneticTranscriptor CompoundAnnotator CoreNlpCoreferenceResolver CoreNlpDependencyParser CoreNlpLemmatizer CoreNlpNamedEntityRecognizer CoreNlpParser CoreNlpPosTagger DictionaryAnnotator DoubleMetaphonePhoneticTranscriptor ExpressiveLengtheningNormalizer GateLemmatizer GermanSeparatedParticleAnnotator HepplePosTagger HunPosTagger IxaLemmatizer JazzyChecker LancasterStemmer LanguageToolLemmatizer LingPipeNamedEntityRecognizer LingPipePosTagger MalletEmbeddingsAnnotator MalletLdaTopicModelInferencer MaltParser MateLemmatizer MateMorphTagger MateParser MatePosTagger MateSemanticRoleLabeler MetaphonePhoneticTranscriptor MorphaLemmatizer MstParser MyStemStemmer NGramAnnotator Nlp4JDependencyParser Nlp4JLemmatizer Nlp4JNamedEntityRecognizer Nlp4JPosTagger NorvigSpellingCorrector OpenNlpChunker OpenNlpLemmatizer OpenNlpNamedEntityRecognizer OpenNlpNamedEntityRecognizerTrainer OpenNlpParser OpenNlpPosTagger OpenNlpPosTaggerTrainer OpenNlpTokenTrainer PatternBasedTokenSegmenter PosMapper ReadabilityAnnotator RegexBasedTokenTransformer RegexTokenFilter ReplacementFileNormalizer RfTagger SemanticFieldAnnotator SfstAnnotator SmileLancasterStemmer SoundexPhoneticTranscriptor StanfordCoreferenceResolver StanfordDependencyConverter StanfordLemmatizer StanfordNamedEntityRecognizer StanfordNamedEntityRecognizerTrainer StanfordParser StanfordPosTagger StanfordPosTaggerTrainer StanfordSentimentAnalyzer TokenMerger TokenTrimmer TrailingCharacterRemover TreeTaggerPosTagger UDPipeParser UDPipePosTagger UmlautNormalizer

SemArg

ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler

SemPred

ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler

PennTree

BerkeleyParser OpenNlpParser

Chunk

OpenNlpChunker TreeTaggerChunker

Constituent

BerkeleyParser CoreNlpParser OpenNlpParser StanfordParser

CoreNlpCoreferenceResolver StanfordCoreferenceResolver StanfordDependencyConverter

Dependency

ClearNlpParser CoreNlpDependencyParser CoreNlpParser MaltParser MateParser MstParser Nlp4JDependencyParser StanfordDependencyConverter StanfordParser UDPipeParser

ClearNlpSemanticRoleLabeler MateSemanticRoleLabeler

SofaChangeAnnotation

ApplyChangesAnnotator NorvigSpellingCorrector ReplacementFileNormalizer

ApplyChangesAnnotator

TopicDistribution

MalletLdaTopicModelInferencer

WordEmbedding

MalletEmbeddingsAnnotator

JapaneseToken

MeCabTagger

StanfordSentimentAnnotation

StanfordSentimentAnalyzer

ReadabilityScore

ReadabilityAnnotator

TimerAnnotation

Stopwatch

Stopwatch
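One way to use the producer and consumer listings above is to check whether a planned component order supplies every type before a later component reads it. The sketch below is a hypothetical helper and not part of DKPro Core: the class PipelineOrderSketch and its Component type are invented for illustration, and the three entries are copied by hand from the Sentence, Token, POS and Dependency rows. The table does not say whether a consumed type is mandatory or optional, so a reported gap is only a hint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical helper, not part of DKPro Core.
class PipelineOrderSketch {

    static final class Component {
        final String name;
        final Set<String> consumes;
        final Set<String> produces;

        Component(String name, Set<String> consumes, Set<String> produces) {
            this.name = name;
            this.consumes = consumes;
            this.produces = produces;
        }
    }

    // Entries transcribed from the table rows for Sentence, Token, POS and Dependency.
    static final Component SEGMENTER = new Component(
            "BreakIteratorSegmenter", Set.of(), Set.of("Sentence", "Token"));
    static final Component TAGGER = new Component(
            "OpenNlpPosTagger", Set.of("Sentence", "Token"), Set.of("POS"));
    static final Component PARSER = new Component(
            "CoreNlpDependencyParser", Set.of("Sentence", "Token", "POS"), Set.of("Dependency"));

    // Returns one message per consumed type that no earlier component produced.
    static List<String> missingInputs(List<Component> pipeline) {
        Set<String> available = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (Component c : pipeline) {
            // TreeSet gives a stable, alphabetical order for the messages.
            for (String type : new TreeSet<>(c.consumes)) {
                if (!available.contains(type)) {
                    problems.add(c.name + " needs " + type);
                }
            }
            available.addAll(c.produces);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(missingInputs(List.of(SEGMENTER, TAGGER, PARSER)));
        System.out.println(missingInputs(List.of(TAGGER, SEGMENTER, PARSER)));
    }
}
```

The first order yields no messages. In the second, the tagger runs before any segmenter, so its Sentence and Token inputs are reported as missing.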