The document provides detailed information about the DKPro Core UIMA components.

Overview

Analysis components

Table 1. Analysis Components (130)
Component Description

Annotation-By-Length Filter

Removes annotations that do not conform to minimum or maximum length constraints.

Annotation-By-Text Filter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

CAS Transformation - Apply

Applies changes annotated using a SofaChangeAnnotation.

ArkTweet POS-Tagger

Wrapper for Twitter Tokenizer and POS Tagger.

ArkTweet Tokenizer

ArkTweet tokenizer.

CAS Transformation - Map back

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

Berkeley Parser

Berkeley Parser annotator.

Java BreakIterator Segmenter

BreakIterator segmenter.

CamelCase Token Segmenter

Split up existing tokens again if they are camel-case text.

Capitalization Normalizer

Takes a text and replaces wrong capitalization.

CIS Stemmer

UIMA wrapper for the CISTEM algorithm.

Chinese Traditional/Simplified Converter

Converts traditional Chinese to simplified Chinese or vice-versa.

ClearNLP Lemmatizer

Lemmatizer using Clear NLP.

ClearNLP Parser

CLEAR parser annotator.

ClearNLP POS-Tagger

Part-of-Speech annotator using Clear NLP.

ClearNLP Segmenter

Tokenizer using Clear NLP.

ClearNLP Semantic Role Labeler

ClearNLP semantic role labeler.

Commons Codec Cologne Phonetic Transcriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

Compound Annotator

Annotates compound parts and linking morphemes.

CoreNLP Coreference Resolver

Deterministic coreference annotator from CoreNLP.

CoreNLP Dependency Parser

Dependency parser from CoreNLP.

CoreNLP Lemmatizer

Lemmatizer from CoreNLP.

CoreNLP Named Entity Recognizer

Named entity recognizer from CoreNLP.

CoreNLP Parser

Parser from CoreNLP.

CoreNLP POS-Tagger

Part-of-speech tagger from CoreNLP.

CoreNLP Segmenter

Tokenizer and sentence splitter from CoreNLP.

Corrections Contextualizer

This component assumes that some spell checker has already been applied upstream.

Dictionary Annotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

Dictionary-based Token Transformer

Reads a tab-separated file containing mappings from one token to another.

Commons Codec Double-Metaphone Phonetic Transcriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

Expressive Lengthening Normalizer

Takes a text and shortens extra-long words.

File-based Token Transformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

FlexTag POS-Tagger

Flexible part-of-speech tagger.

Frequency Count Writer

Count unigrams and bigrams in a collection.

GATE Lemmatizer

Wrapper for the GATE rule based lemmatizer.

German Separated Particle Annotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

Gosen Segmenter

Segmenter for Japanese text based on GoSen.

GATE Hepple POS-Tagger

GATE Hepple part-of-speech tagger.

HunPos POS-Tagger

Part-of-Speech annotator using HunPos.

Hyphenation Remover

Simple dictionary-based hyphenation remover.

ICU Segmenter

ICU segmenter.

IXA Lemmatizer

Lemmatizer using the OpenNLP-based Ixa implementation.

IXA POS-Tagger

Part-of-Speech annotator using OpenNLP with IXA extensions.

de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

JTok Segmenter

JTok segmenter.

Jazzy Spellchecker

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

Lancaster Stemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

LangDetect

Langdetect language identifier based on character n-grams.

Web1T Language Detector

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

TextCat Language Identifier (Character N-Gram-based)

Detection based on character n-grams.

LanguageTool Grammar Checker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

LanguageTool Lemmatizer

Naive lexicon-based lemmatizer.

LanguageTool Segmenter

Segmenter using LanguageTool to do the heavy lifting.

Line-based Sentence Segmenter

Annotates each line in the source text as a sentence.

LingPipe Named Entity Recognizer

LingPipe named entity recognizer.

LingPipe Named Entity Recognizer Trainer

LingPipe named entity recognizer trainer.

LingPipe POS-Tagger

LingPipe part-of-speech tagger.

LingPipe Segmenter

LingPipe segmenter.

Mallet Embeddings Annotator

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

Mallet Embeddings Trainer

Compute word embeddings from the given collection using skip-grams.

Mallet LDA Topic Model Inferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Mallet LDA Topic Model Trainer

Estimate an LDA topic model using Mallet and write it to a file.

MaltParser Dependency Parser

Dependency parsing using MaltParser.

Mate Tools Lemmatizer

DKPro Annotator for the MateToolsLemmatizer.

Mate Tools Morphological Analyzer

DKPro Annotator for the MateToolsMorphTagger.

Mate Tools Dependency Parser

DKPro Annotator for the MateToolsParser.

Mate Tools POS-Tagger

DKPro Annotator for the MateToolsPosTagger.

Mate Tools Semantic Role Labeler

DKPro Annotator for the MateTools Semantic Role Labeler.

MeCab POS-Tagger

Annotator for the MeCab Japanese POS Tagger.

Commons Codec Metaphone Phonetic Transcriptor

Metaphone phonetic transcription based on Apache Commons Codec.

Morpha Lemmatizer

Lemmatize based on a finite-state machine.

MSTParser Dependency Parser

Dependency parsing using MSTParser.

N-Gram Annotator

N-gram annotator.

NLP4J Dependency Parser

Emory NLP4J dependency parser.

NLP4J Lemmatizer

Emory NLP4J lemmatizer.

NLP4J Named Entity Recognizer

Emory NLP4J name finder wrapper.

NLP4J POS-Tagger

Part-of-Speech annotator using Emory NLP4J.

NLP4J Segmenter

Segmenter using Emory NLP4J.

Simple Spelling Corrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

OpenNLP Chunker

Chunk annotator using OpenNLP.

OpenNLP Chunker Trainer

Train a chunker model for OpenNLP.

OpenNLP Lemmatizer

Lemmatizer using OpenNLP.

OpenNLP Lemmatizer Trainer

Train a lemmatizer model for OpenNLP.

OpenNLP Named Entity Recognizer

OpenNLP name finder wrapper.

OpenNLP Named Entity Recognizer Trainer

Train a named entity recognizer model for OpenNLP.

OpenNLP Parser

OpenNLP parser.

OpenNLP POS-Tagger

Part-of-Speech annotator using OpenNLP.

OpenNLP POS-Tagger Trainer

Train a POS tagging model for OpenNLP.

OpenNLP Segmenter

Tokenizer and sentence splitter using OpenNLP.

OpenNLP Sentence Splitter Trainer

Train a sentence splitter model for OpenNLP.

OpenNLP Tokenizer Trainer

Train a tokenizer model for OpenNLP.

Paragraph Splitter

This class creates paragraph annotations for the given input document.

Pattern-based Token Segmenter

Split up existing tokens again at particular split-chars.

Phrase Annotator

Annotate phrases in a sentence.

POS Filter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

POS Mapper

Maps existing POS tags from one tagset to another using a user provided properties file.

Readability Annotator

Assign a set of popular readability scores to the text.

Regex-based Token Transformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

Regex Segmenter

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

Regex Token Filter

Remove every token that does or does not match a given regular expression.

Replacement File Normalizer

Takes a text and replaces desired expressions.

RFTagger Morphological Analyzer

RFTagger morphological analyzer.

Semantic Field Annotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

SFST Morphological Analyzer

SFST morphological analyzer.

Sharp S (ß) Normalizer

Takes a text and replaces the sharp s (ß).

Snowball Stemmer

UIMA wrapper for the Snowball stemmer.

Commons Codec Soundex Phonetic Transcriptor

Soundex phonetic transcription based on Apache Commons Codec.

Spelling Normalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

CoreNLP Coreference Resolver (old API)

No description

CoreNLP Dependency Converter

Converts a constituency structure into a dependency structure.

CoreNLP Lemmatizer (old API)

Stanford Lemmatizer component.

CoreNLP Named Entity Recognizer (old API)

Stanford Named Entity Recognizer component.

CoreNLP Named Entity Recognizer Trainer

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

CoreNLP Parser (old API)

Stanford Parser component.

CoreNLP POS-Tagger (old API)

Stanford Part-of-Speech tagger component.

CoreNLP POS-Tagger Trainer

Train a POS tagging model for the Stanford POS tagger.

Stanford Penn Treebank Normalizer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

CoreNLP Segmenter (old API)

Stanford sentence splitter and tokenizer.

Stop Word Remover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TF/IDF Annotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TF/IDF Model Writer

This consumer builds a DfModel.

Token Case Transformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Token Merger

Merges any Tokens that are covered by a given annotation type.

Token Trimmer

Remove prefixes and suffixes from tokens.

Trailing Character Remover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

TreeTagger Chunker

Chunk annotator using TreeTagger.

TreeTagger POS-Tagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

Umlaut Normalizer

Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.

Whitespace Segmenter

A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.
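The 'normal case' option of the Token Case Transformer above (lowercase everything but the first character of a token and the characters immediately following a hyphen) can be sketched as follows. This is a conceptual illustration of the rule, not the DKPro implementation:

```python
def normal_case(token: str) -> str:
    """Lowercase every character except the first one and any character
    that immediately follows a hyphen, which are kept unchanged."""
    out = []
    for i, c in enumerate(token):
        keep = (i == 0) or (token[i - 1] == "-")
        out.append(c if keep else c.lower())
    return "".join(out)
```

For example, an all-caps hyphenated token such as "FOO-BAR" becomes "Foo-Bar".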

Checker

Table 2. Analysis Components in category Checker (2)
Component Description

JazzyChecker

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

LanguageToolChecker

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Jazzy Spellchecker

Short name

JazzyChecker

Category

Checker

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.jazzy-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker

Description

This annotator uses Jazzy for the decision whether a word is spelled correctly or not.

Parameters
modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location from which the model is read. The model file is a simple word-list with one word per line.

Type: String

scoreThreshold

Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within edit distance 1 of the original word.

Type: Integer  — Default value: 1
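The scoreThreshold parameter bounds the Levenshtein edit distance of suggestions. The filtering idea can be sketched as follows; this is a conceptual illustration, not Jazzy's implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def filter_suggestions(word, candidates, score_threshold=1):
    """Keep only candidates within the given edit distance of the misspelled word."""
    return [c for c in candidates if edit_distance(word, c) <= score_threshold]
```

With score_threshold=1, a misspelling like "helo" keeps "hello" and "halo" but drops distant words.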

Table 3. Capabilities

Inputs

Outputs

Languages

none specified

LanguageTool Grammar Checker

Short name

LanguageToolChecker

Category

Checker

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.languagetool-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolChecker

Description

Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

Table 4. Capabilities

Inputs

none specified

Outputs

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Chunker

Table 5. Analysis Components in category Chunker (3)
Component Description

OpenNlpChunker

Chunk annotator using OpenNLP.

OpenNlpChunkerTrainer

Train a chunker model for OpenNLP.

TreeTaggerChunker

Chunk annotator using TreeTagger.

OpenNLP Chunker

Short name

OpenNlpChunker

Category

Chunker

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker

Description

Chunk annotator using OpenNLP.

Parameters
ChunkMappingLocation

Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 6. Capabilities

Inputs

Outputs

Languages

see available models

Table 7. Models
Language Variant Version

en

default

20100908.1

en

perceptron-ixa

20160205.1

OpenNLP Chunker Trainer

Short name

OpenNlpChunkerTrainer

Category

Chunker

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunkerTrainer

Description

Train a chunker model for OpenNLP.

Parameters
algorithm

Type: String  — Default value: MAXENT

beamSize

Type: Integer  — Default value: 3

cutoff

Type: Integer  — Default value: 5

iterations

Type: Integer  — Default value: 100

language

Type: String

numThreads

Type: Integer  — Default value: 1

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

TreeTagger Chunker

Short name

TreeTaggerChunker

Category

Chunker

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.treetagger-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker

Description

Chunk annotator using TreeTagger.

Parameters
ChunkMappingLocation

Location of the mapping file for chunk tags to UIMA types.

Optional — Type: String

executablePath

Use this TreeTagger executable instead of trying to locate the executable automatically.

Optional — Type: String

flushSequence

A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically just a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n....

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

performanceMode

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Type: Boolean  — Default value: false

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 8. Capabilities

Inputs

Outputs

Languages

see available models

Table 9. Models
Language Variant Version

de

le

20110429.1

en

iso8859-le

20090824.1

en

le

20140520.1

fr

le

20141218.2

Coreference resolver

Table 10. Analysis Components in category Coreference resolver (2)
Component Description

CoreNlpCoreferenceResolver

Deterministic coreference annotator from CoreNLP.

StanfordCoreferenceResolver

No description

CoreNLP Coreference Resolver

Short name

CoreNlpCoreferenceResolver

Category

Coreference resolver

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpCoreferenceResolver

Description

Deterministic coreference annotator from CoreNLP.

Parameters
maxDist

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

Type: Integer  — Default value: -1

postprocessing

DCoRef parameter: Do post-processing

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

score

DCoRef parameter: Scoring the output of the system

Type: Boolean  — Default value: false

sieves

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

Type: String  — Default value: MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

singleton

DCoRef parameter: setting singleton predictor

Type: Boolean  — Default value: true

Table 11. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP Coreference Resolver (old API)

Short name

StanfordCoreferenceResolver

Category

Coreference resolver

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordCoreferenceResolver

Description

No description

Parameters
maxDist

DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)

Type: Integer  — Default value: -1

postprocessing

DCoRef parameter: Do post-processing

Type: Boolean  — Default value: false

score

DCoRef parameter: Scoring the output of the system

Type: Boolean  — Default value: false

sieves

DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.

Type: String  — Default value: MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch

singleton

DCoRef parameter: setting singleton predictor

Type: Boolean  — Default value: true

Table 12. Capabilities

Inputs

Outputs

Languages

see available models

Table 13. Models
Language Variant Version

en

default

${core.version}.1

Embeddings

Table 14. Analysis Components in category Embeddings (2)
Component Description

MalletEmbeddingsAnnotator

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

MalletEmbeddingsTrainer

Compute word embeddings from the given collection using skip-grams.

Mallet Embeddings Annotator

Short name

MalletEmbeddingsAnnotator

Category

Embeddings

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.wordembeddings.MalletEmbeddingsAnnotator

Description

Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.

Parameters
annotateUnknownTokens
Specify how to handle unknown tokens:
  1. If this parameter is not specified, unknown tokens are not annotated.
  2. If an empty float[] is passed, a random vector is generated that is used for each unknown token.
  3. If a float[] is passed, each unknown token is annotated with that vector. The float must have the same length as the vectors in the model file.

Type: Boolean  — Default value: false

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

modelHasHeader

If set to true (default: false), the first line is interpreted as header line containing the number of entries and the dimensionality. This should be set to true for models generated with Word2Vec.

Type: Boolean  — Default value: false

modelIsBinary

Type: Boolean  — Default value: false

modelLocation

The file containing the word embeddings.

Currently only supports text file format.

Type: String

tokenFeaturePath

The annotation type to use for the model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
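The modelHasHeader parameter refers to the word2vec text format: an optional first line "<entries> <dimensions>", then one token and its vector per line. A sketch of reading such a file (a hypothetical helper, not the Mallet/DKPro reader):

```python
def read_text_embeddings(lines, has_header=False):
    """Parse word2vec-style text embeddings: optionally a header line
    '<num_entries> <dimensions>', then '<token> <v1> <v2> ...' per line."""
    it = iter(lines)
    if has_header:
        _entries, _dim = map(int, next(it).split())  # header is skipped after parsing
    vectors = {}
    for line in it:
        token, *values = line.rstrip().split(" ")
        vectors[token] = [float(v) for v in values]
    return vectors
```

Models generated with Word2Vec carry the header line, hence modelHasHeader should be true for them.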

Table 15. Capabilities

Inputs

Outputs

Languages

none specified

Mallet Embeddings Trainer

Short name

MalletEmbeddingsTrainer

Category

Embeddings

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.wordembeddings.MalletEmbeddingsTrainer

Description

Compute word embeddings from the given collection using skip-grams.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters
compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

coveringAnnotationType

If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the embeddings estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.

By default, the full text is used as a document.

Type: String  — Default value: ``

dimensions

The dimensionality of the output word embeddings (default: 50).

Type: Integer  — Default value: 50

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

exampleWord

An example word that is output with its nearest neighbours once in a while (default: null, i.e. none).

Optional — Type: String

filterRegex

Filter out all tokens matching that regular expression.

Type: String  — Default value: ``

filterRegexReplacement

Type: String  — Default value: ``

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

minDocumentLength

Ignore documents with fewer tokens than this value (default: 10).

Type: Integer  — Default value: 10

minTokenLength

Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Default: 3.

Type: Integer  — Default value: 3

numNegativeSamples

The number of negative samples to be generated for each token (default: 5).

Type: Integer  — Default value: 5

numThreads

The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int).

Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating.

Type: Integer  — Default value: 0

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

paramStopwordsFile

The location of the stopwords file.

Type: String  — Default value: ``

paramStopwordsReplacement

If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP).

Type: String  — Default value: ``

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

tokenFeaturePath

The annotation type to use as input tokens for the model estimation. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, for instance, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

useCharacters

If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored.

Type: Boolean  — Default value: false

useDocumentId

Use the document ID as file name even if a relative path information is present.

Type: Boolean  — Default value: false

windowSize

The context size when generating embeddings (default: 5).

Type: Integer  — Default value: 5
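The windowSize parameter controls which (target, context) pairs the skip-gram model is trained on: each token is paired with every other token at most windowSize positions away. A conceptual sketch of pair generation (not Mallet's implementation):

```python
def skipgram_pairs(tokens, window_size=5):
    """Generate (target, context) training pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((target, tokens[j]))
    return pairs
```

The numNegativeSamples parameter then adds, for each such pair, a number of randomly drawn non-context tokens as negative training examples.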

Gazeteer

Table 16. Analysis Components in category Gazeteer (1)
Component Description

DictionaryAnnotator

Takes a plain text file with phrases as input and annotates the phrases in the CAS file.

Dictionary Annotator

Short name

DictionaryAnnotator

Category

Gazeteer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.DictionaryAnnotator

Description

Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line, tokens are separated by space:

this is a phrase
another phrase
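Conceptually, the matching over such a phrase list can be sketched as a greedy longest-match scan over the token sequence. This is an illustration of the idea, not the DKPro implementation:

```python
def load_phrases(text):
    """One phrase per line, tokens separated by a single space."""
    return {tuple(line.split(" ")) for line in text.splitlines() if line}

def annotate_phrases(tokens, phrases):
    """Greedy longest-match scan; returns (start, end) token offsets of matches."""
    max_len = max((len(p) for p in phrases), default=0)
    matches, i = [], 0
    while i < len(tokens):
        # try the longest possible phrase at position i first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in phrases:
                matches.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return matches
```

For the two phrases above, scanning "we saw another phrase here" yields a single match covering tokens 2-3.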

Parameters
annotationType

The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

The file must contain one phrase per line; phrases are split at " " (space).

Type: String

value

The value to set the feature configured in #PARAM_VALUE_FEATURE to.

Optional — Type: String

valueFeature

Set this feature on the created annotations.

Optional — Type: String  — Default value: value

Table 17. Capabilities

Inputs

Outputs

none specified

Languages

none specified

Language Identifier

Table 18. Analysis Components in category Language Identifier (3)
Component Description

LangDetectLanguageIdentifier

Langdetect language identifier based on character n-grams.

LanguageIdentifier

Detection based on character n-grams.

LanguageDetectorWeb1T

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

LangDetect

Short name

LangDetectLanguageIdentifier

Category

Language Identifier

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.langdetect-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier

Description

Langdetect language identifier based on character n-grams. Due to the way LangDetect is implemented, this component does not support being instantiated multiple times with different model locations. Only a single model location can be active at a time over all instances of this component.

Parameters
modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

seed

The random seed.

Optional — Type: String

Table 19. Models
Language Variant Version

any

socialmedia

20141013.1

any

wikipedia

20141013.1

TextCat Language Identifier (Character N-Gram-based)

Short name

LanguageIdentifier

Category

Language Identifier

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textcat-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textcat.LanguageIdentifier

Description

Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.

References

  • Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
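The Cavnar-Trenkle technique ranks the most frequent character n-grams of a text and compares that ranking against per-language profiles with an "out-of-place" measure; the language with the smallest distance wins. A minimal sketch (illustrative only, not the Java Text Categorizing Library):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Ranked list of the most frequent character n-grams (lengths 1..n_max)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive the maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - rank[gram]) if gram in rank else penalty
               for r, gram in enumerate(doc_profile))
```

A document compared against its own profile has distance 0; profiles sharing no n-grams get the maximum penalty for every entry.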

Web1T Language Detector

Short name

LanguageDetectorWeb1T

Category

Language Identifier

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.ldweb1t-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.ldweb1t.LanguageDetectorWeb1T

Description

Language detector based on n-gram frequency counts, e.g. as provided by Web1T.

Parameters
maxNGramSize

The maximum n-gram size that should be considered. Default is 3.

Type: Integer  — Default value: 3

minNGramSize

The minimum n-gram size that should be considered. Default is 1.

Type: Integer  — Default value: 1
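Conceptually, a frequency-count detector scores the text's character n-grams (between minNGramSize and maxNGramSize) against per-language counts such as a Web1T-style frequency list provides, and picks the best-scoring language. A hedged sketch with add-one smoothing (not the DKPro implementation):

```python
from collections import Counter
import math

def char_ngrams(text, n_min=1, n_max=3):
    """Yield all character n-grams of the text with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def score_language(text, freq_counts, n_min=1, n_max=3):
    """Log-probability-style score of the text under per-language n-gram
    counts, with add-one smoothing for unseen n-grams."""
    total = sum(freq_counts.values()) + 1
    return sum(math.log((freq_counts.get(g, 0) + 1) / total)
               for g in char_ngrams(text, n_min, n_max))
```

Counts built from an English sample then score English input higher than counts built from, say, a German sample.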

Lemmatizer

Table 20. Analysis Components in category Lemmatizer (11)
Component Description

ClearNlpLemmatizer

Lemmatizer using Clear NLP.

CoreNlpLemmatizer

Lemmatizer from CoreNLP.

StanfordLemmatizer

Stanford Lemmatizer component.

GateLemmatizer

Wrapper for the GATE rule based lemmatizer.

IxaLemmatizer

Lemmatizer using the OpenNLP-based Ixa implementation.

LanguageToolLemmatizer

Naive lexicon-based lemmatizer.

MateLemmatizer

DKPro Annotator for the MateToolsLemmatizer.

MorphaLemmatizer

Lemmatize based on a finite-state machine.

Nlp4JLemmatizer

Emory NLP4J lemmatizer.

OpenNlpLemmatizer

Lemmatizer using OpenNLP.

OpenNlpLemmatizerTrainer

Train a lemmatizer model for OpenNLP.

ClearNLP Lemmatizer

Short name

ClearNlpLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.clearnlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpLemmatizer

Description

Lemmatizer using Clear NLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String  — Default value: en

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 21. Capabilities

Inputs

Outputs

Languages

see available models

Table 22. Models
Language Variant Version

en

default

20131111.0

CoreNLP Lemmatizer

Short name

CoreNlpLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpLemmatizer

Description

Lemmatizer from CoreNLP.

Parameters
ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true
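PTB3 escaping replaces bracket characters with their Penn Treebank token names before text is sent to the model. A minimal illustration of that substitution (the mapping is the standard PTB bracket set; this is not the CoreNLP escaper itself):

```java
import java.util.*;

public class Ptb3Escaping {
    // Standard Penn Treebank escape tokens for bracket characters.
    private static final Map<String, String> PTB = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    // Return the PTB escape token for brackets; other tokens pass through.
    public static String escape(String token) {
        return PTB.getOrDefault(token, token);
    }
}
```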

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 23. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP Lemmatizer (old API)

Short name

StanfordLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordLemmatizer

Description

Stanford Lemmatizer component. The Stanford Morphology class computes the base form of English words by removing just inflections (not derivational morphology). That is, it only handles noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html

This only works for ENGLISH.
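To make the "inflections only" behavior concrete, the toy rules below strip a few inflectional endings while leaving derivational morphology untouched. These rules are invented for illustration and are far cruder than the actual flex-based finite-state transducer:

```java
public class InflectionOnly {
    // Toy inflection stripping: plural -s/-ies and verb -ing/-ed only.
    // Derivational suffixes (e.g. -ness, -ity) are deliberately left alone.
    public static String base(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("s") && !word.endsWith("ss")) return word.substring(0, word.length() - 1);
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.endsWith("ed")) return word.substring(0, word.length() - 2);
        return word;
    }
}
```

Note how "happiness" stays unchanged: the derived nominal is not reduced to "happy", matching the component's stated scope.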

Parameters
ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 24. Capabilities

Inputs

Outputs

Languages

en

GATE Lemmatizer

Short name

GateLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.gate-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.gate.GateLemmatizer

Description

Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 25. Capabilities

Inputs

Outputs

Languages

see available models

Table 26. Models
Language Variant Version

en

default

20160531.0

IXA Lemmatizer

Short name

IxaLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.ixa-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.ixa.IxaLemmatizer

Description

Lemmatizer using the OpenNLP-based Ixa implementation.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 27. Capabilities

Inputs

Outputs

Languages

see available models

Table 28. Models
Language Variant Version

de

perceptron-conll09

20160213.1

en

perceptron-conll09

20160211.1

en

perceptron-ud

20160214.1

en

xlemma-perceptron-ud

20160214.1

es

perceptron-ancora-2.0

20160211.1

eu

perceptron-ud

20160212.1

fr

perceptron-sequoia

20160215.1

gl

perceptron-autodict05-ctag

20160212.1

it

perceptron-ud

20160213.1

nl

perceptron-alpino

20160215.1

LanguageTool Lemmatizer

Short name

LanguageToolLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.languagetool-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer

Description

Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.
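The lookup strategy described above can be sketched as follows. The class and lexicon are toy assumptions for illustration; the real component consults LanguageTool's wordform lexicons:

```java
import java.util.*;

public class NaiveLemmatizer {
    // Toy wordform lexicon: surface form -> candidate lemmas with frequencies.
    private final Map<String, Map<String, Integer>> lexicon = new HashMap<>();

    public void addReading(String form, String lemma, int freq) {
        lexicon.computeIfAbsent(form.toLowerCase(), k -> new HashMap<>())
                .merge(lemma, freq, Integer::sum);
    }

    // Pick the most frequent lemma among the readings; if no reading is
    // found, the original text is assigned as the lemma.
    public String lemmatize(String form) {
        Map<String, Integer> readings = lexicon.get(form.toLowerCase());
        if (readings == null) return form;
        return Collections.max(readings.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```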

Parameters
sanitize

Type: Boolean  — Default value: true

sanitizeChars

Type: String[]  — Default value: [(, ), [, ]]

Table 29. Capabilities

Inputs

Outputs

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Mate Tools Lemmatizer

Short name

MateLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.matetools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer

Description

DKPro Annotator for the MateToolsLemmatizer.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

uppercase

Try to reconstruct proper casing for lemmas. This is useful for German but produces odd results for languages such as English.

Type: Boolean  — Default value: false

variant

Override the default variant used to locate the model.

Optional — Type: String

Table 30. Capabilities

Inputs

Outputs

Languages

see available models

Table 31. Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.1

es

conll2009

20130117.1

fr

ftb

20130918.0

Morpha Lemmatizer

Short name

MorphaLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.morpha-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.morpha.MorphaLemmatizer

Description

Lemmatize based on a finite-state machine. Uses the Java port of Morpha.

References:

  • Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.
Parameters
readPOS

Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.

Type: Boolean  — Default value: false

Table 32. Capabilities

Inputs

Outputs

Languages

en

NLP4J Lemmatizer

Short name

Nlp4JLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.nlp4j-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JLemmatizer

Description

Emory NLP4J lemmatizer. This is a lower-casing lemmatizer.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

Table 33. Capabilities

Inputs

Outputs

Languages

none specified

OpenNLP Lemmatizer

Short name

OpenNlpLemmatizer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpLemmatizer

Description

Lemmatizer using OpenNLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 34. Capabilities

Inputs

Outputs

Languages

none specified

OpenNLP Lemmatizer Trainer

Short name

OpenNlpLemmatizerTrainer

Category

Lemmatizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpLemmatizerTrainer

Description

Train a lemmatizer model for OpenNLP.

Parameters
algorithm

Type: String  — Default value: MAXENT

beamSize

Type: Integer  — Default value: 3

cutoff

Type: Integer  — Default value: 5

iterations

Type: Integer  — Default value: 100

language

Type: String

numThreads

Type: Integer  — Default value: 1

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

Morphological analyzer

Table 35. Analysis Components in category Morphological analyzer (3)
Component Description

MateMorphTagger

DKPro Annotator for the MateToolsMorphTagger.

RfTagger

RFTagger morphological analyzer.

SfstAnnotator

SFST morphological analyzer.

Mate Tools Morphological Analyzer

Short name

MateMorphTagger

Category

Morphological analyzer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.matetools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger

Description

DKPro Annotator for the MateToolsMorphTagger.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 36. Capabilities

Inputs

Outputs

Languages

see available models

Table 37. Models
Language Variant Version

de

tiger

20121024.1

es

conll2009

20130117.1

fr

ftb

20130918.0

RFTagger Morphological Analyzer

Short name

RfTagger

Category

Morphological analyzer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.rftagger-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.rftagger.RfTagger

Description

RFTagger morphological analyzer.

Parameters
MorphMappingLocation

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

Table 38. Capabilities

Inputs

Outputs

Languages

see available models

Table 39. Models
Language Variant Version

cz

cac

20150728.1

de

tiger

20150928.1

hu

szeged

20150728.1

ru

ric

20150728.1

sk

snk

20150728.1

sl

jos

20150728.1

SFST Morphological Analyzer

Short name

SfstAnnotator

Category

Morphological analyzer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.sfst-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator

Description

SFST morphological analyzer.

Parameters
MorphMappingLocation

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

mode

Type: String  — Default value: FIRST

modelEncoding

Specifies the model encoding.

Type: String  — Default value: UTF-8

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

writeLemma

Write lemma information. Default: true

Type: Boolean  — Default value: true

writePOS

Write part-of-speech information. Default: true

Type: Boolean  — Default value: true

Table 40. Capabilities

Inputs

Outputs

Languages

see available models

Table 41. Models
Language Variant Version

de

morphisto-ca

20110202.1

de

smor-ca

20140801.1

de

zmorge-newlemma-ca

20140521.1

de

zmorge-orig-ca

20140521.1

it

pippi-ca

20090223.1

tr

trmorph-ca

20130219.1

Named Entity Recognizer

Table 42. Analysis Components in category Named Entity Recognizer (8)
Component Description

StanfordNamedEntityRecognizer

Stanford Named Entity Recognizer component.

CoreNlpNamedEntityRecognizer

Named entity recognizer from CoreNLP.

StanfordNamedEntityRecognizerTrainer

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

LingPipeNamedEntityRecognizer

LingPipe named entity recognizer.

LingPipeNamedEntityRecognizerTrainer

LingPipe named entity recognizer trainer.

Nlp4JNamedEntityRecognizer

Emory NLP4J name finder wrapper.

OpenNlpNamedEntityRecognizer

OpenNLP name finder wrapper.

OpenNlpNamedEntityRecognizerTrainer

Train a named entity recognizer model for OpenNLP.

CoreNLP Named Entity Recognizer (old API)

Short name

StanfordNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

Description

Stanford Named Entity Recognizer component.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 43. Capabilities

Inputs

Outputs

Languages

see available models

Table 44. Models
Language Variant Version

de

dewac_175m_600.crf

20150130.1

de

hgc_175m_600.crf

20161213.1

de

nemgp

20141024.1

en

all.3class.caseless.distsim.crf

20161213.0

en

all.3class.distsim.crf

20161213.1

en

all.3class.nodistsim.crf

20160110.1

en

conll.4class.caseless.distsim.crf

20160110.0

en

conll.4class.distsim.crf

20150420.1

en

conll.4class.nodistsim.crf

20160110.1

en

freme-wikiner

20150925.1

en

muc.7class.caseless.distsim.crf

20150129.0

en

muc.7class.distsim.crf

20150129.1

en

muc.7class.nodistsim.crf

20160110.1

en

nowiki.3class.caseless.distsim.crf

20161213.0

en

nowiki.3class.nodistsim.crf

20160110.0

es

ancora.distsim.s512.crf

20161211.1

es

freme-wikiner

20150925.1

fr

freme-wikiner

20150925.1

it

freme-wikiner

20150925.1

nl

freme-wikiner

20150925.1

ru

freme-wikiner

20160726.1

CoreNLP Named Entity Recognizer

Short name

CoreNlpNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpNamedEntityRecognizer

Description

Named entity recognizer from CoreNLP.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

applyNumericClassifiers

Type: Boolean  — Default value: true

augmentRegexNER

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true
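The effect of String#intern() referred to above can be demonstrated in plain Java: interned strings with equal content collapse to a single pooled instance, so thousands of identical tag strings share one object instead of each holding its own copy:

```java
public class TagInterning {
    public static void main(String[] args) {
        // Two distinct String objects with the same content...
        String a = new String("NN").intern();
        String b = new String("NN").intern();
        // ...now reference the same pooled instance.
        System.out.println(a == b); // prints "true"
    }
}
```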

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Type: Integer  — Default value: 2147483647

maxTime

Type: Integer  — Default value: -1

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

useSUTime

Type: Boolean  — Default value: false

Table 45. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP Named Entity Recognizer Trainer

Short name

StanfordNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizerTrainer

Description

Train a NER model for Stanford CoreNLP Named Entity Recognizer.

Parameters
acceptedTagsRegex

Regex used to filter named entities by their de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() type.

Optional — Type: String

entitySubClassification

Optional — Type: String  — Default value: noprefix

propertiesFile

Properties file containing the training parameters. The trainFile or trainFileList and serializeTo parameters in this file are ignored/overridden.

Optional — Type: String

retainClassification

Flag to keep the label set specified by PARAM_LABEL_SET. If set to false, representation is mapped to IOB1 on output. Default: true

Optional — Type: Boolean  — Default value: true

targetLocation

Location of the target model file.

Type: String

LingPipe Named Entity Recognizer

Short name

LingPipeNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeNamedEntityRecognizer

Description

LingPipe named entity recognizer.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 46. Capabilities

Inputs

Outputs

Languages

see available models

Table 47. Models
Language Variant Version

en

bio-genetag

20110623.1

en

bio-genia

20110623.1

en

news-muc6

20110623.1

LingPipe Named Entity Recognizer Trainer

Short name

LingPipeNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeNamedEntityRecognizerTrainer

Description

LingPipe named entity recognizer trainer.

Parameters
acceptedTagsRegex

Regex used to filter named entities by their de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() type.

Optional — Type: String

targetLocation

Type: String

NLP4J Named Entity Recognizer

Short name

Nlp4JNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.nlp4j-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JNamedEntityRecognizer

Description

Emory NLP4J name finder wrapper.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component. Default: false

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 48. Capabilities

Inputs

Outputs

Languages

see available models

Table 49. Models
Language Variant Version

en

default

20160802.0

OpenNLP Named Entity Recognizer

Short name

OpenNlpNamedEntityRecognizer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer

Description

OpenNLP name finder wrapper.

Parameters
NamedEntityMappingLocation

Location of the mapping file for named entity tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Type: String  — Default value: person

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 50. Capabilities

Inputs

Outputs

Languages

see available models

Table 51. Models
Language Variant Version

de

nemgp

20141024.1

en

date

20100907.0

en

location

20100907.0

en

money

20100907.0

en

organization

20100907.0

en

percentage

20100907.0

en

person

20130624.1

en

time

20100907.0

es

location

20100908.0

es

misc

20100908.0

es

organization

20100908.0

es

person

20100908.0

nl

location

20100908.0

nl

misc

20100908.0

nl

organization

20100908.0

nl

person

20100908.0

OpenNLP Named Entity Recognizer Trainer

Short name

OpenNlpNamedEntityRecognizerTrainer

Category

Named Entity Recognizer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizerTrainer

Description

Train a named entity recognizer model for OpenNLP.

Parameters
acceptedTagsRegex

Regex used to filter named entities by their de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() type.

Optional — Type: String

algorithm

Type: String  — Default value: PERCEPTRON

beamSize

Type: Integer  — Default value: 3

cutoff

Type: Integer  — Default value: 0

featureGen

Optional — Type: String

iterations

Type: Integer  — Default value: 300

language

Type: String

numThreads

Type: Integer  — Default value: 1

sequenceEncoding

Type: String  — Default value: BILOU
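BILOU sequence encoding marks each token of an entity span as Beginning, Inside, Last, or Unit-length, with O for tokens outside any entity. A sketch of encoding one entity span over a token sequence (hypothetical helper class, not the OpenNLP codec itself):

```java
import java.util.*;

public class BilouEncoder {
    // Encode a single entity span [begin, end) over numTokens tokens as BILOU tags.
    public static List<String> encode(int numTokens, int begin, int end, String type) {
        List<String> tags = new ArrayList<>(Collections.nCopies(numTokens, "O"));
        int len = end - begin;
        if (len == 1) {
            tags.set(begin, "U-" + type);        // single-token (Unit) entity
        } else if (len > 1) {
            tags.set(begin, "B-" + type);        // Beginning
            for (int i = begin + 1; i < end - 1; i++) {
                tags.set(i, "I-" + type);        // Inside
            }
            tags.set(end - 1, "L-" + type);      // Last
        }
        return tags;
    }
}
```

Compared to plain BIO, the explicit L and U tags give the model a sharper signal about where entities end, which is why BILOU is the default here.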

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

Parser

Table 52. Analysis Components in category Parser (11)
Component Description

BerkeleyParser

Berkeley Parser annotator.

ClearNlpParser

CLEAR parser annotator.

StanfordDependencyConverter

Converts a constituency structure into a dependency structure.

CoreNlpDependencyParser

Dependency parser from CoreNLP.

CoreNlpParser

Parser from CoreNLP.

StanfordParser

Stanford Parser component.

MstParser

Dependency parsing using MSTParser.

MaltParser

Dependency parsing using MaltParser.

MateParser

DKPro Annotator for the MateToolsParser.

Nlp4JDependencyParser

Emory NLP4J dependency parser.

OpenNlpParser

OpenNLP parser.

Berkeley Parser

Short name

BerkeleyParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.berkeleyparser-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser

Description

Berkeley Parser annotator. Requires sentences to be annotated beforehand.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

accurate

Set thresholds for accuracy.

Default: false (set thresholds for efficiency)

Type: Boolean  — Default value: false

binarize

Output binarized trees.

Default: false

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

keepFunctionLabels

Retain predicted function labels. Model must have been trained with function labels.

Default: false

Type: Boolean  — Default value: false

language

Use this language instead of the language set in the CAS to locate the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

readPOS

Whether to use already existing POS tags from another annotator for the parsing process.

Default: true

Type: Boolean  — Default value: true

scores

Output inside scores (only for binarized viterbi trees).

Default: false

Type: Boolean  — Default value: false

substates

Output sub-categories (only for binarized Viterbi trees).

Default: false

Type: Boolean  — Default value: false

variational

Use variational rule score approximation instead of max-rule.

Default: false

Type: Boolean  — Default value: false

viterbi

Compute Viterbi derivation instead of max-rule tree.

Default: false (max-rule)

Type: Boolean  — Default value: false

writePOS

Whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Type: Boolean  — Default value: false

Table 53. Capabilities

Inputs

Outputs

Languages

see available models

Table 54. Models
Language Variant Version

ar

sm5

20090917.1

bg

sm5

20090917.1

de

sm5

20090917.1

en

sm6

20100819.1

fr

sm5

20090917.1

zh

sm5

20090917.1

ClearNLP Parser

Short name

ClearNlpParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.clearnlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpParser

Description

CLEAR parser annotator.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

Table 55. Capabilities

Inputs

Outputs

Languages

see available models

Table 56. Models
Language Variant Version

en

mayo

20131111.0

en

ontonotes

20131128.0

CoreNLP Dependency Converter

Short name

StanfordDependencyConverter

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordDependencyConverter

Description

Converts a constituency structure into a dependency structure.

Parameters
language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

mode

Sets the kind of dependencies being created.

Default: DependenciesMode#TREE

Optional — Type: String  — Default value: TREE

originalDependencies

Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.

Type: Boolean  — Default value: true

Table 57. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP Dependency Parser

Short name

CoreNlpDependencyParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpDependencyParser

Description

Dependency parser from CoreNLP.

Parameters
DependencyMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

extraDependencies

Type: String  — Default value: NONE

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Type: Integer  — Default value: 2147483647

maxTime

Type: Integer  — Default value: -1

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of a model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true
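The traditional PTB3 bracket transforms mentioned above can be sketched as follows (an illustrative subset only; Ptb3EscapeDemo is a hypothetical helper, and the full PTB3 scheme also normalizes quotes and other characters):

```java
import java.util.Map;

// Sketch of the PTB3 bracket escaping applied to tokens before they are
// sent to the parser: round, square and curly brackets are replaced by
// their Penn Treebank placeholder tokens.
public class Ptb3EscapeDemo {
    static final Map<String, String> BRACKETS = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    // Returns the escaped form of a token, or the token itself if it is
    // not a bracket.
    static String escape(String token) {
        return BRACKETS.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        System.out.println(escape("("));   // -LRB-
        System.out.println(escape("dog")); // dog
    }
}
```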

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 58. Capabilities

Inputs

Outputs

Languages

see available models

Table 59. Models
Language Variant Version

de

ud

20161213.1

en

ptb-conll

20160119.1

en

sd

20150418.1

en

ud

20161213.1

en

wsj-sd

20150418.1

en

wsj-ud

20161213.1

fr

ud

20161211.1

zh

ctb-conll

20160119.1

zh

ptb-conll

20161223.1

zh

ud

20161223.1

CoreNLP Parser

Short name

CoreNlpParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpParser

Description

Parser from CoreNLP.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

DependencyMappingLocation

Location of the mapping file for dependency tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

extraDependencies

Type: String  — Default value: NONE

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

keepPunctuation

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Type: Integer  — Default value: 2147483647

maxTime

Type: Integer  — Default value: -1

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Type: Integer  — Default value: 0

originalDependencies

Type: Boolean  — Default value: true

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

readPOS

Sets whether to use existing POS tags.

Default: true

Type: Boolean  — Default value: true

writeConstituent

Sets whether to create constituent tags. This is required for POS-tagging and lemmatization.

Default: true

Type: Boolean  — Default value: true

writeDependency

Sets whether to create dependency annotations.

Default: true

Type: Boolean  — Default value: true

writePOS

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Type: Boolean  — Default value: false

Table 60. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP Parser (old API)

Short name

StanfordParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser

Description

Stanford Parser component.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

annotationTypeToParse

This parameter can be used to override the standard behavior, which uses the Sentence annotation as the basic unit for parsing.

If the parameter is set to the name of an annotation type x, the parser parses x annotations instead of Sentence annotations.

Default: null

Optional — Type: String

keepPunctuation

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxItems

Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser.

Default: 200000

Type: Integer  — Default value: 200000

maxSentenceLength

Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions.

Default: 130

Type: Integer  — Default value: 130
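The effect of maxSentenceLength can be sketched as a simple token-count guard (LengthGuardDemo is a hypothetical illustration, not DKPro's actual implementation):

```java
import java.util.Collections;
import java.util.List;

// Sketch of the maxSentenceLength guard: sentences whose token count
// exceeds the limit are skipped instead of being parsed, which keeps
// memory consumption bounded.
public class LengthGuardDemo {
    // Returns true if the sentence is short enough to be parsed.
    static boolean shouldParse(List<String> tokens, int maxSentenceLength) {
        return tokens.size() <= maxSentenceLength;
    }

    public static void main(String[] args) {
        System.out.println(shouldParse(List.of("Dogs", "bark", "."), 130));  // true
        System.out.println(shouldParse(Collections.nCopies(131, "x"), 130)); // false
    }
}
```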

mode

Sets the kind of dependencies being created.

Optional — Type: String  — Default value: TREE

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

readPOS

Sets whether to use already existing POS tags from another annotator for the parsing process.

Default: true

Type: Boolean  — Default value: true

writeConstituent

Sets whether to create constituent tags. This is required for POS-tagging and lemmatization.

Default: true

Type: Boolean  — Default value: true

writeDependency

Sets whether to create dependency annotations.

Default: true

Type: Boolean  — Default value: true

writePOS

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Default: false

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Type: Boolean  — Default value: false

Table 61. Capabilities

Inputs

Outputs

Languages

see available models

Table 62. Models
Language Variant Version

ar

factored

20150129.1

ar

sr

20141031.1

de

factored

20150129.1

de

pcfg

20150129.1

de

sr

20141031.1

en

factored

20150129.1

en

pcfg

20150129.1

en

pcfg.caseless

20160110.1

en

rnn

20140104.1

en

sr

20141031.1

en

sr-beam

20141031.1

en

wsj-factored

20150129.1

en

wsj-pcfg

20150129.1

en

wsj-rnn

20140104.1

es

pcfg

20161211.1

es

sr

20161211.1

es

sr-beam

20161211.1

fr

factored

20150129.1

fr

sr

20160114.1

fr

sr-beam

20141023.1

zh

factored

20150129.1

zh

pcfg

20150129.1

zh

sr

20141023.1

zh

xinhua-factored

20150129.1

zh

xinhua-pcfg

20150129.1

MSTParser Dependency Parser

Short name

MstParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mstparser-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser

Description

Dependency parsing using MSTParser.

Wrapper for the MSTParser (high memory requirements). More information about the parser can be found here.

The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.

This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CoNLL 2006 format are not generated (cf. mstparser.DependencyInstance).
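For illustration, the fields passed to MSTParser can be visualized as CoNLL 2006 rows, with unused columns left as "_" (a minimal sketch assuming the standard 10-column tab-separated layout; ConllRowDemo is a hypothetical helper, not DKPro Core code):

```java
// Sketch of the CoNLL 2006 rows this component effectively provides:
// only ID, FORM and POSTAG carry data; LEMMA, CPOSTAG, FEATS and the
// dependency columns stay unfilled ("_").
public class ConllRowDemo {
    // Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
    static String row(int id, String form, String pos) {
        return String.join("\t",
                String.valueOf(id), form, "_", "_", pos, "_", "_", "_", "_", "_");
    }

    public static void main(String[] args) {
        System.out.println(row(1, "Dogs", "NNS"));
        System.out.println(row(2, "bark", "VBP"));
    }
}
```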

Parameters
DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

order

Specifies the order/scope of features: 1 uses features over single edges only, while 2 also uses features over pairs of adjacent edges in the tree. The model must have been trained with the same order set here.

Optional — Type: Integer

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 63. Capabilities

Inputs

Outputs

Languages

see available models

Table 64. Models
Language Variant Version

en

eisner

20100416.2

en

sample

20121019.2

hr

mte5.defnpout

20130527.1

hr

mte5.pos

20130527.1

MaltParser Dependency Parser

Short name

MaltParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.maltparser-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser

Description

Dependency parsing using MaltParser.

Required annotations:

  • Token
  • Sentence
  • POS

Generated annotations:

  • Dependency (annotated over sentence-span)

Parameters
ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component. Default: false

Type: Boolean  — Default value: false

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 65. Capabilities

Inputs

Outputs

Languages

see available models

Table 66. Models
Language Variant Version

bn

linear

20120905.1

en

linear

20120312.1

en

poly

20120312.1

es

linear

20130220.0

fa

linear

20130522.1

fr

linear

20120312.1

pl

linear

20120904.1

sv

linear

20120925.2

Mate Tools Dependency Parser

Short name

MateParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.matetools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.matetools.MateParser

Description

DKPro Annotator for the MateToolsParser.

Please cite the following paper, if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.

Parameters
DependencyMappingLocation

Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 67. Capabilities

Inputs

Outputs

Languages

see available models

Table 68. Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.2

es

conll2009

20130117.1

fa

parsper

20141124.0

fr

ftb

20130918.0

zh

conll2009

20130117.1

NLP4J Dependency Parser

Short name

Nlp4JDependencyParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.nlp4j-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JDependencyParser

Description

Emory NLP4J dependency parser.

Parameters
DependencyMappingLocation

Location of the mapping file for dependency tags to UIMA types.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component. Default: false

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 69. Capabilities

Inputs

Outputs

Languages

none specified

OpenNLP Parser

Short name

OpenNlpParser

Category

Parser

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpParser

Description

OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotations if explicitly requested via #PARAM_WRITE_POS.

Parameters
ConstituentMappingLocation

Location of the mapping file for constituent tags to UIMA types.

Optional — Type: String

POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Default: false

Type: Boolean  — Default value: false

writePOS

Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.

Type: Boolean  — Default value: false

writePennTree

If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.

Default: false

Type: Boolean  — Default value: false

Table 70. Capabilities

Inputs

Outputs

Languages

see available models

Table 71. Models
Language Variant Version

en

chunking

20120616.1

en

chunking-ixa

20140426.1

es

chunking-ixa

20140426.1

Part-of-speech tagger

Table 72. Analysis Components in category Part-of-speech tagger (16)
Component Description

ArktweetPosTagger

Wrapper for Twitter Tokenizer and POS Tagger.

ClearNlpPosTagger

Part-of-Speech annotator using Clear NLP.

CoreNlpPosTagger

Part-of-speech tagger from CoreNLP.

StanfordPosTagger

Stanford Part-of-Speech tagger component.

StanfordPosTaggerTrainer

Train a POS tagging model for the Stanford POS tagger.

FlexTagPosTagger

Flexible part-of-speech tagger.

HepplePosTagger

GATE Hepple part-of-speech tagger.

HunPosTagger

Part-of-Speech annotator using HunPos.

IxaPosTagger

Part-of-Speech annotator using OpenNLP with IXA extensions.

LingPipePosTagger

LingPipe part-of-speech tagger.

MatePosTagger

DKPro Annotator for the MateToolsPosTagger.

MeCabTagger

Annotator for the MeCab Japanese POS Tagger.

Nlp4JPosTagger

Part-of-Speech annotator using Emory NLP4J.

OpenNlpPosTagger

Part-of-Speech annotator using OpenNLP.

OpenNlpPosTaggerTrainer

Train a POS tagging model for OpenNLP.

TreeTaggerPosTagger

Part-of-Speech and lemmatizer annotator using TreeTagger.

ArkTweet POS-Tagger

Short name

ArktweetPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.arktools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTagger

Description

Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

Table 73. Capabilities

Inputs

Outputs

Languages

see available models

Table 74. Models
Language Variant Version

en

default

20120919.1

en

irc

20121211.1

en

ritter

20130723.1

ClearNLP POS-Tagger

Short name

ClearNlpPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.clearnlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpPosTagger

Description

Part-of-Speech annotator using Clear NLP. Requires sentences to be annotated beforehand.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

dictLocation

Load the dictionary from this location instead of locating the dictionary automatically.

Optional — Type: String

dictVariant

Override the default variant used to locate the dictionary.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the pos-tagging model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the pos-tagging model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded.

Type: Boolean  — Default value: false

Table 75. Capabilities

Inputs

Outputs

Languages

see available models

Table 76. Models
Language Variant Version

en

mayo

20131111.0

en

ontonotes

20131128.0

CoreNLP POS-Tagger

Short name

CoreNlpPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpPosTagger

Description

Part-of-speech tagger from CoreNLP.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Type: Integer  — Default value: 2147483647

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

numThreads

Type: Integer  — Default value: 0

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 77. Capabilities

Inputs

Outputs

Languages

none specified

CoreNLP POS-Tagger (old API)

Short name

StanfordPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTagger

Description

Stanford Part-of-Speech tagger component.

Parameters
POSMappingLocation

Location of the mapping file for part-of-speech tags to UIMA types.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

maxSentenceLength

Sentences with more tokens than the specified maximum are ignored if this parameter is set to a value larger than zero. With the default value of zero, all sentences are POS-tagged.

Optional — Type: Integer

modelLocation

Location from which the model is read.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

ptb3Escaping

Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).

Type: Boolean  — Default value: true

quoteBegin

List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

quoteEnd

List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.

Optional — Type: String[]

Table 78. Capabilities

Inputs

Outputs

Languages

see available models

Table 79. Models
Language Variant Version

ar

accurate

20131112.1

de

dewac

20140827.1

de

fast

20140827.1

de

fast-caseless

20140827.0

de

hgc

20140827.1

de

ud

20161213.1

en

bidirectional-distsim

20140616.1

en

caseless-left3words-distsim

20140827.0

en

fast.41

20130730.1

en

left3words-distsim

20140616.1

en

twitter

20130730.1

en

twitter-fast

20130914.0

en

wsj-0-18-bidirectional-distsim

20160110.1

en

wsj-0-18-bidirectional-nodistsim

20131112.1

en

wsj-0-18-caseless-left3words-distsim

20140827.0

en

wsj-0-18-left3words-distsim

20140616.1

en

wsj-0-18-left3words-nodistsim

20131112.1

es

default

20161211.1

es

distsim

20161211.1

fr

default

20140616.1

zh

distsim

20140616.1

zh

nodistsim

20140616.1

CoreNLP POS-Tagger Trainer

Short name

StanfordPosTaggerTrainer

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTaggerTrainer

Description

Train a POS tagging model for the Stanford POS tagger.

Parameters
clusterFile

Distsim cluster files.

Optional — Type: String

targetLocation

Type: String

trainFile

Training file containing the parameters. The trainFile, model and encoding parameters in this file are ignored/overwritten. In the arch parameter, the string ${distsimCluster} is replaced with the path to the cluster files if #PARAM_CLUSTER_FILE is specified.

Optional — Type: String

FlexTag POS-Tagger

Short name

FlexTagPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.flextag-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.flextag.FlexTagPosTagger

Description

Flexible part-of-speech tagger.

Parameters
POSMappingLocation

Optional — Type: String

language

Optional — Type: String

modelLocation

Optional — Type: String

modelVariant

Optional — Type: String

Table 80. Models
Language Variant Version

de

tiger

20170512.1

en

wsj0-18

20170512.1

GATE Hepple POS-Tagger

Short name

HepplePosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.gate-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.gate.HepplePosTagger

Description

GATE Hepple part-of-speech tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lexiconLocation

Load the lexicon from this location instead of locating it automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

rulesetLocation

Load the ruleset from this location instead of locating it automatically.

Optional — Type: String

Table 81. Capabilities

Inputs

Outputs

Languages

see available models

Table 82. Models
Language Variant Version

en

annie

20160531.0

HunPos POS-Tagger

Short name

HunPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.hunpos-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.hunpos.HunPosTagger

Description

Part-of-Speech annotator using HunPos. Requires sentences to be annotated beforehand.

References

  • HALÁCSY, Péter; KORNAI, András; ORAVECZ, Csaba. HunPos: an open source trigram tagger. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007. pp. 209-212. (pdf) (bibtex)
Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 83. Capabilities

Inputs

Outputs

Languages

see available models

Table 84. Models
Language Variant Version

cs

pdt

20121123.2

da

ddt

20121123.2

de

tiger

20121123.2

en

wsj

20070724.2

fa

upc

20140414.0

hr

mte5.defnpout

20130509.2

hu

szeged_kr

20070724.2

pt

bosque

20121123.2

pt

mm

20130119.2

pt

tbchp

20110419.2

ru

rdt

20121123.2

sl

jos

20121123.2

sv

paroletags

20100215.2

sv

suctags

20100927.2

IXA POS-Tagger

Short name

IxaPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.ixa-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.ixa.IxaPosTagger

Description

Part-of-Speech annotator using OpenNLP with IXA extensions.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 85. Models
Language Variant Version

de

perceptron-autodict01-conll09

20160213.1

en

maxent-100-c5-baseline-autodict01-conll09

20160211.1

en

perceptron-autodict01-conll09

20160211.1

en

perceptron-autodict01-ud

20160214.1

en

xpos-perceptron-autodict01-ud

20160214.1

es

perceptron-autodict01-ancora-2.0

20160212.1

eu

perceptron-ud

20160212.1

fr

perceptron-autodict01-sequoia

20160215.1

gl

perceptron-autdict05-ctag

20160212.1

it

perceptron-autodict01-ud

20160213.1

nl

maxent-100-c5-autodict01-alpino

20160214.1

nl

perceptron-autodict01-alpino

20160214.1

LingPipe POS-Tagger

Short name

LingPipePosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipePosTagger

Description

LingPipe part-of-speech tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

uppercaseTags

LingPipe models tend to be trained on lower-case tags, while the DKPro Core POS mappings use upper-case tags; this parameter controls whether tags are upper-cased before the mapping is applied.

Type: Boolean  — Default value: true

Table 86. Capabilities

Inputs

Outputs

Languages

see available models

Table 87. Models
Language Variant Version

en

bio-genia

20110623.1

en

bio-medpost

20110623.1

en

general-brown

20110623.1
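The rationale behind the internTags parameter can be illustrated in plain Java: interning maps equal tag strings to a single canonical instance, so a corpus with millions of tokens holds only one String object per distinct tag. This is a minimal sketch; the class and method names are illustrative, not part of DKPro Core.

```java
public class InternDemo {
    // Returns true when both strings resolve to the same canonical instance.
    public static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("NN"); // two distinct heap objects...
        String b = new String("NN");
        System.out.println(a == b);                // ...with different identities
        System.out.println(sameAfterIntern(a, b)); // ...but one canonical "NN"
    }
}
```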

Mate Tools POS-Tagger

Short name

MatePosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.matetools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.matetools.MatePosTagger

Description

DKPro Core annotator for the Mate Tools POS tagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 88. Capabilities

Inputs

Outputs

Languages

see available models

Table 89. Models
Language Variant Version

de

tiger

20121024.1

en

conll2009

20130117.1

es

conll2009

20130117.1

fr

ftb

20130918.0

zh

conll2009

20130117.1

MeCab POS-Tagger

Short name

MeCabTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mecab-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mecab.MeCabTagger

Description

Annotator for the MeCab Japanese POS Tagger.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 90. Capabilities

Inputs

none specified

Outputs

Languages

ja

Table 91. Models
Language Variant Version

jp

bin-linux-x86_32

20140917.0

jp

bin-linux-x86_64

20140917.0

jp

bin-osx-x86_64

20140917.0

jp

ipadic

20070801.0

NLP4J POS-Tagger

Short name

Nlp4JPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.nlp4j-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JPosTagger

Description

Part-of-Speech annotator using Emory NLP4J. Requires Sentences to be annotated before.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

ignoreMissingFeatures

Process anyway, even if the model relies on features that are not supported by this component. Default: false

Type: Boolean  — Default value: false

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 92. Capabilities

Inputs

Outputs

Languages

see available models

Table 93. Models
Language Variant Version

en

default

20160802.0

OpenNLP POS-Tagger

Short name

OpenNlpPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger

Description

Part-of-Speech annotator using OpenNLP.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

Table 94. Capabilities

Inputs

Outputs

Languages

see available models

Table 95. Models
Language Variant Version

da

maxent

20120616.1

da

perceptron

20120616.1

de

maxent

20120616.1

de

perceptron

20120616.1

en

maxent

20120616.1

en

perceptron

20120616.1

en

perceptron-ixa

20131115.1

es

maxent

20120410.1

es

maxent-ixa

20140425.1

es

maxent-universal

20120410.1

es

perceptron

20120410.1

es

perceptron-ixa

20131115.1

es

perceptron-universal

20120410.1

it

perceptron

20130618.0

nl

maxent

20120616.1

nl

perceptron

20120616.1

pt

maxent

20120616.1

pt

mm-maxent

20130121.1

pt

mm-perceptron

20130121.1

pt

perceptron

20120616.1

sv

maxent

20120616.1

sv

perceptron

20120616.1

OpenNLP POS-Tagger Trainer

Short name

OpenNlpPosTaggerTrainer

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTaggerTrainer

Description

Train a POS tagging model for OpenNLP.

Parameters
algorithm

Type: String  — Default value: MAXENT

beamSize

Type: Integer  — Default value: 3

cutoff

Type: Integer  — Default value: 5

iterations

Type: Integer  — Default value: 100

language

Type: String

numThreads

Type: Integer  — Default value: 1

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

TreeTagger POS-Tagger

Short name

TreeTaggerPosTagger

Category

Part-of-speech tagger

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.treetagger-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger

Description

Part-of-Speech and lemmatizer annotator using TreeTagger.

Parameters
POSMappingLocation

Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.

Optional — Type: String

executablePath

Use this TreeTagger executable instead of trying to locate the executable automatically.

Optional — Type: String

internTags

Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true

Optional — Type: Boolean  — Default value: true

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

performanceMode

TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Type: Boolean  — Default value: false

printTagSet

Log the tag set(s) when a model is loaded. Default: false

Type: Boolean  — Default value: false

writeLemma

Write lemma information. Default: true

Type: Boolean  — Default value: true

writePOS

Write part-of-speech information. Default: true

Type: Boolean  — Default value: true

Table 96. Capabilities

Inputs

Outputs

Languages

see available models

Table 97. Models
Language Variant Version

bg

le

20160430.1

de

le

20170316.1

en

le

20170220.1

es

le

20161222.1

et

le

20110124.1

fi

le

20140704.1

fr

le

20100111.1

gl

le

20130516.1

gmh

le

20161107.1

it

le

20141020.1

la

le

20110819.1

mn

le

20120925.1

nl

le

20130107.1

pl

le

20150506.1

pt

le

20101115.2

ru

le

20140505.1

sk

le

20130725.1

sw

le

20130729.1

zh

le

20101115.1

Phonetic Transcriptor

Table 98. Analysis Components in category Phonetic Transcriptor (4)
Component Description

ColognePhoneticTranscriptor

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.

DoubleMetaphonePhoneticTranscriptor

Double-Metaphone phonetic transcription based on Apache Commons Codec.

MetaphonePhoneticTranscriptor

Metaphone phonetic transcription based on Apache Commons Codec.

SoundexPhoneticTranscriptor

Soundex phonetic transcription based on Apache Commons Codec.

Commons Codec Cologne Phonetic Transcriptor

Short name

ColognePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.commonscodec-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.commonscodec.ColognePhoneticTranscriptor

Description

Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.

Table 99. Capabilities

Inputs

Outputs

Languages

de

Commons Codec Double-Metaphone Phonetic Transcriptor

Short name

DoubleMetaphonePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.commonscodec-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor

Description

Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 100. Capabilities

Inputs

Outputs

Languages

none specified

Commons Codec Metaphone Phonetic Transcriptor

Short name

MetaphonePhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.commonscodec-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor

Description

Metaphone phonetic transcription based on Apache Commons Codec. Works for English.

Table 101. Capabilities

Inputs

Outputs

Languages

none specified

Commons Codec Soundex Phonetic Transcriptor

Short name

SoundexPhoneticTranscriptor

Category

Phonetic Transcriptor

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.commonscodec-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.commonscodec.SoundexPhoneticTranscriptor

Description

Soundex phonetic transcription based on Apache Commons Codec. Works for English.

Table 102. Capabilities

Inputs

Outputs

Languages

en
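The Soundex code computed by this component can be sketched in a few lines of plain Java. Note this is an illustrative re-implementation of the classic algorithm for readers unfamiliar with it; the actual component delegates to the Soundex class from Apache Commons Codec.

```java
public class SoundexSketch {
    // Classic Soundex: keep the first letter, then up to three digit codes, zero-padded.
    public static String soundex(String s) {
        if (s == null || s.isEmpty()) {
            return "";
        }
        String u = s.toUpperCase();
        StringBuilder out = new StringBuilder();
        out.append(u.charAt(0));
        char prev = code(u.charAt(0));
        for (int i = 1; i < u.length() && out.length() < 4; i++) {
            char ch = u.charAt(i);
            char c = code(ch);
            if (c == '0') {
                // Vowels separate duplicate codes; H and W do not.
                if (ch != 'H' && ch != 'W') {
                    prev = '0';
                }
                continue;
            }
            if (c != prev) {
                out.append(c);
            }
            prev = c;
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    private static char code(char ch) {
        switch (ch) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels, H, W, Y, and anything else
        }
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // R163
        System.out.println(soundex("Rupert")); // R163 - similar-sounding names collide
    }
}
```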

Segmenter

Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a single segmenter component to avoid having to reorder the components in a pipeline when replacing one segmenter with another.

Table 103. Analysis Components in category Segmenter (24)
Component Description

AnnotationByLengthFilter

Removes annotations that do not conform to minimum or maximum length constraints.

ArktweetTokenizer

ArkTweet tokenizer.

CamelCaseTokenSegmenter

Split up existing tokens again if they are camel-case text.

ClearNlpSegmenter

Tokenizer using Clear NLP.

CoreNlpSegmenter

Tokenizer and sentence splitter from CoreNLP.

StanfordSegmenter

Stanford sentence splitter and tokenizer.

GermanSeparatedParticleAnnotator

Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.

GosenSegmenter

Segmenter for Japanese text based on GoSen.

IcuSegmenter

ICU segmenter.

JTokSegmenter

JTok segmenter.

BreakIteratorSegmenter

BreakIterator segmenter.

LanguageToolSegmenter

Segmenter using LanguageTool to do the heavy lifting.

LineBasedSentenceSegmenter

Annotates each line in the source text as a sentence.

LingPipeSegmenter

LingPipe segmenter.

Nlp4JSegmenter

Segmenter using Emory NLP4J.

OpenNlpSegmenter

Tokenizer and sentence splitter using OpenNLP.

OpenNlpSentenceTrainer

Train a sentence splitter model for OpenNLP.

OpenNlpTokenTrainer

Train a tokenizer model for OpenNLP.

ParagraphSplitter

This class creates paragraph annotations for the given input document.

PatternBasedTokenSegmenter

Split up existing tokens again at particular split-chars.

RegexSegmenter

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

TokenMerger

Merges any Tokens that are covered by a given annotation type.

TokenTrimmer

Remove prefixes and suffixes from tokens.

WhitespaceSegmenter

A strict whitespace tokenizer, i.e. it tokenizes at whitespace and line breaks only.
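The behavior of a strict whitespace tokenizer can be approximated with a single regular expression in plain Java. This is an illustrative sketch, not the DKPro Core implementation; note that punctuation stays attached to the adjacent word, which is exactly what the stricter tokenizers in this category avoid doing.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenizeDemo {
    // Splits on runs of whitespace (spaces, tabs, line breaks) only.
    public static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Arrays.asList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello  world,\nthis is\ta test."));
        // Punctuation stays glued to words: [Hello, world,, this, is, a, test.]
    }
}
```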

Annotation-By-Length Filter

Short name

AnnotationByLengthFilter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter

Description

Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).

Parameters
FilterTypes

A set of annotation types that should be filtered.

Type: String[]  — Default value: []

MaxLengthFilter

Any annotation in filterTypes longer than this value will be removed.

Type: Integer  — Default value: 1000

MinLengthFilter

Any annotation in filterTypes shorter than this value will be removed.

Type: Integer  — Default value: 0

ArkTweet Tokenizer

Short name

ArktweetTokenizer

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.arktools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetTokenizer

Description

ArkTweet tokenizer.

CamelCase Token Segmenter

Short name

CamelCaseTokenSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.CamelCaseTokenSegmenter

Description

Split up existing tokens again if they are camel-case text.

Parameters
deleteCover

Whether to remove the original token. Default: true

Type: Boolean  — Default value: true

Table 104. Capabilities

Inputs

Outputs

Languages

none specified
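The core of camel-case splitting is a zero-width split at every lower-case-to-upper-case boundary. The sketch below shows that rule in plain Java; it is illustrative only and does not cover all the cases the actual component handles (e.g. all-caps runs such as "HTTPServer" are left intact here).

```java
import java.util.Arrays;
import java.util.List;

public class CamelCaseSplitDemo {
    // Splits at a lower-case-to-upper-case boundary, e.g. "getUserName" -> get | User | Name.
    public static List<String> split(String token) {
        return Arrays.asList(token.split("(?<=\\p{Ll})(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        System.out.println(split("getUserName")); // [get, User, Name]
        System.out.println(split("plain"));       // [plain] - nothing to split
    }
}
```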

ClearNLP Segmenter

Short name

ClearNlpSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.clearnlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter

Description

Tokenizer using Clear NLP.

Parameters
language

The language.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 105. Capabilities

Inputs

none specified

Outputs

Languages

en

Table 106. Models
Language Variant Version

en

default

20131111.0

CoreNLP Segmenter

Short name

CoreNlpSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.corenlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpSegmenter

Description

Tokenizer and sentence splitter from CoreNLP.

Parameters
boundaryMultiTokenRegex

Optional — Type: String

boundaryToDiscard

The set of regexes matching sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: [, NL]

boundaryTokenRegex

The set of boundary tokens. If null, use default.

Optional — Type: String  — Default value: \\.|[!?]+

htmlElementsToDiscard

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

Optional — Type: String[]

language

The language.

Optional — Type: String

newlineIsSentenceBreak

Strategy for treating newlines as sentence breaks.

Optional — Type: String  — Default value: two

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenRegexesToDiscard

The set of regexes matching sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: []

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 107. Capabilities

Inputs

none specified

Outputs

Languages

none specified

CoreNLP Segmenter (old API)

Short name

StanfordSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter

Description

Stanford sentence splitter and tokenizer.

Parameters
allowEmptySentences

Whether to generate empty sentences.

Type: Boolean  — Default value: false

boundaryFollowersRegex

A set of strings, matched with .equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".

Optional — Type: String  — Default value: [\\p{Pe}\\p{Pf}\"'>\uFF02\uFF07\uFF1E]|''|-R[CRS]B-

boundaryToDiscard

The set of regexes matching sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: [, NL]

boundaryTokenRegex

The set of boundary tokens. If null, use default.

Optional — Type: String  — Default value: \\.|[!?]+

isOneSentence

Whether to treat all input as one sentence.

Type: Boolean  — Default value: false

language

The language.

Optional — Type: String

languageFallback

If this component is not configured for a specific language and if the language stored in the document metadata is not supported, use the given language as a fallback.

Optional — Type: String

newlineIsSentenceBreak

Strategy for treating newlines as sentence breaks.

Optional — Type: String  — Default value: TWO_CONSECUTIVE

regionElementRegex

A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenRegexesToDiscard

The set of regexes matching sentence boundary tokens that should be discarded.

Optional — Type: String[]  — Default value: []

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

xmlBreakElementsToDiscard

These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.

Optional — Type: String[]

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 108. Capabilities

Inputs

none specified

Outputs

Languages

en, es, fr

German Separated Particle Annotator

Short name

GermanSeparatedParticleAnnotator

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.GermanSeparatedParticleAnnotator

Description

Annotator for post-processing German corpora that have been lemmatized and POS-tagged with TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. In many usages of German particle verbs, the stem and the particle are separated, e.g. "Wir fangen gleich an." TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an"; the proper verb lemma "anfangen" is thus not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of a particle verb (e.g. "fangen") with the proper verb lemma (e.g. "anfangen") and leaves the lemma of the separated particle unchanged.

Table 109. Capabilities

Inputs

Outputs

Languages

de

Gosen Segmenter

Short name

GosenSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.gosen-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.gosen.GosenSegmenter

Description

Segmenter for Japanese text based on GoSen.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 110. Capabilities

Inputs

none specified

Outputs

Languages

ja

ICU Segmenter

Short name

IcuSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.icu-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.icu.IcuSegmenter

Description

ICU segmenter.

Parameters
language

The language.

Optional — Type: String

splitAtApostrophe

By default, the segmenter does not split contractions like "John's" into two tokens. When this parameter is enabled, an additional token split is generated whenever an apostrophe (') is encountered.

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 111. Capabilities

Inputs

none specified

Outputs

Languages

af, ak, am, ar, as, az, be, bg, bm, bn, bo, br, bs, ca, ce, cs, cy, da, de, dz, ee, el, en, eo, es, et, eu, fa, ff, fi, fo, fr, fy, ga, gd, gl, gu, gv, ha, hi, hr, hu, hy, ig, ii, in, is, it, iw, ja, ji, ka, ki, kk, kl, km, kn, ko, ks, kw, ky, lb, lg, ln, lo, lt, lu, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, nd, ne, nl, nn, om, or, os, pa, pl, ps, pt, qu, rm, rn, ro, ru, rw, se, sg, si, sk, sl, sn, so, sq, sr, sv, sw, ta, te, th, ti, to, tr, ug, uk, ur, uz, vi, yo, zh, zu

JTok Segmenter

Short name

JTokSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.jtok-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.jtok.JTokSegmenter

Description

JTok segmenter.

Parameters
language

The language.

Optional — Type: String

ptbEscaping

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeParagraph

Create Paragraph annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 112. Capabilities

Inputs

none specified

Outputs

Languages

de, en, it

Java BreakIterator Segmenter

Short name

BreakIteratorSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter

Description

BreakIterator segmenter.

Parameters
language

The language.

Optional — Type: String

splitAtApostrophe

By default, the Java BreakIterator does not split contractions like "John's" into two tokens. When this parameter is enabled, an additional token split is generated whenever an apostrophe (') is encountered.

Type: Boolean  — Default value: false

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 113. Capabilities

Inputs

none specified

Outputs

Languages

ar, be, bg, ca, cs, da, de, el, en, es, et, fi, fr, ga, hi, hr, hu, in, is, it, iw, ja, ko, lt, lv, mk, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh
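Since this component wraps the JDK's own java.text.BreakIterator, the underlying sentence-splitting behavior can be tried directly in plain Java. The sketch below is illustrative and bypasses UIMA entirely; the DKPro Core component additionally creates Sentence and Token annotations in the CAS.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Collects the sentence spans produced by the JDK's built-in sentence iterator.
    public static List<String> sentences(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello world. How are you? Fine!", Locale.ENGLISH));
        // [Hello world., How are you?, Fine!]
    }
}
```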

LanguageTool Segmenter

Short name

LanguageToolSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.languagetool-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter

Description

Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 114. Capabilities

Inputs

none specified

Outputs

Languages

be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh

Line-based Sentence Segmenter

Short name

LineBasedSentenceSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.LineBasedSentenceSegmenter

Description

Annotates each line in the source text as a sentence. This segmenter does not create tokens; the token-related parameters have no effect.
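
The per-line behaviour can be sketched independently of the UIMA API; `line_sentence_spans` below is a hypothetical helper, not part of DKPro Core:

```python
def line_sentence_spans(text):
    """Return (begin, end) character offsets, one sentence per non-empty line."""
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        if stripped:  # empty lines produce no sentence annotation
            spans.append((offset, offset + len(stripped)))
        offset += len(line)
    return spans

print(line_sentence_spans("First sentence.\nSecond sentence.\n"))
# [(0, 15), (16, 32)]
```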

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 115. Capabilities

Inputs

none specified

Outputs

Languages

none specified

LingPipe Segmenter

Short name

LingPipeSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.lingpipe-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.lingpipe.LingPipeSegmenter

Description

LingPipe segmenter.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 116. Capabilities

Inputs

none specified

Outputs

Languages

none specified

NLP4J Segmenter

Short name

Nlp4JSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.nlp4j-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.nlp4j.Nlp4JSegmenter

Description

Segmenter using Emory NLP4J.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 117. Capabilities

Inputs

none specified

Outputs

Languages

none specified

OpenNLP Segmenter

Short name

OpenNlpSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter

Description

Tokenizer and sentence splitter using OpenNLP.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

segmentationModelLocation

Load the segmentation model from this location instead of locating the model automatically.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenizationModelLocation

Load the tokenization model from this location instead of locating the model automatically.

Optional — Type: String

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 118. Capabilities

Inputs

none specified

Outputs

Languages

see available models

Table 119. Models
Language Variant Version

da

maxent

20120616.1

da

maxent

20120616.1

de

maxent

20120616.1

de

maxent

20120616.1

en

maxent

20120616.1

en

maxent

20120616.1

it

maxent

20130618.0

it

maxent

20130618.0

nb

maxent

20120131.1

nb

maxent

20120131.1

nl

maxent

20120616.1

nl

maxent

20120616.1

pt

maxent

20120616.1

pt

maxent

20120616.1

sv

maxent

20120616.1

sv

maxent

20120616.1

OpenNLP Sentence Splitter Trainer

Short name

OpenNlpSentenceTrainer

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSentenceTrainer

Description

Train a sentence splitter model for OpenNLP.

Parameters
abbreviationDictionaryEncoding

Type: String  — Default value: UTF-8

abbreviationDictionaryLocation

Optional — Type: String

algorithm

Type: String  — Default value: MAXENT

cutoff

Type: Integer  — Default value: 5

eosCharacters

Optional — Type: String[]

iterations

Type: Integer  — Default value: 100

language

Type: String

numThreads

Type: Integer  — Default value: 1

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

OpenNLP Tokenizer Trainer

Short name

OpenNlpTokenTrainer

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.opennlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpTokenTrainer

Description

Train a tokenizer model for OpenNLP.

Parameters
abbreviationDictionaryEncoding

Type: String  — Default value: UTF-8

abbreviationDictionaryLocation

Optional — Type: String

algorithm

Type: String  — Default value: MAXENT

alphaNumericPattern

Optional — Type: String  — Default value: ^[A-Za-z0-9]+$

cutoff

Type: Integer  — Default value: 5

iterations

Type: Integer  — Default value: 100

language

Type: String

numThreads

Type: Integer  — Default value: 1

targetLocation

Type: String

trainerType

Type: String  — Default value: Event

useAlphaNumericOptimization

Type: Boolean  — Default value: true

Paragraph Splitter

Short name

ParagraphSplitter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.ParagraphSplitter

Description

This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.

Parameters
splitPattern

A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks)

Type: String  — Default value: ((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)
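
The default pattern above can be tried out directly; the following sketch uses Python's `re` module and a hypothetical `split_paragraphs` helper to show which boundaries the pattern matches:

```python
import re

# The documented default splitPattern: two consecutive line breaks
# (Windows \r\n or Unix \n), plus any additional trailing line breaks.
SPLIT = re.compile(r"((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)")

def split_paragraphs(text):
    # re.split also emits the capturing groups of the pattern; drop them
    # together with None/whitespace parts so only paragraph texts remain.
    return [p for p in SPLIT.split(text) if p and p.strip()]

text = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
print(split_paragraphs(text))
# ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```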

Table 120. Capabilities

Inputs

none specified

Outputs

Languages

none specified

Pattern-based Token Segmenter

Short name

PatternBasedTokenSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter

Description

Splits existing tokens again at particular split characters. A prefix on each pattern states whether the matched split characters should be added as separate Token annotations: matches of patterns preceded by #INCLUDE_PREFIX are included as tokens, while matches of patterns preceded by #EXCLUDE_PREFIX are not added as tokens.
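
The include/exclude semantics can be illustrated outside UIMA; `split_token` is a hypothetical sketch, not the component's actual implementation:

```python
import re

def split_token(token, pattern, include=False):
    """Split one token at matches of `pattern`. With include=True
    (#INCLUDE_PREFIX) the matched split characters become tokens of
    their own; with include=False (#EXCLUDE_PREFIX, the default)
    they are dropped."""
    parts = re.split(f"({pattern})", token)
    if not include:
        parts = parts[::2]  # drop the captured separators
    return [p for p in parts if p]

print(split_token("semi-automatic", "-"))                # ['semi', 'automatic']
print(split_token("semi-automatic", "-", include=True))  # ['semi', '-', 'automatic']
```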

Parameters
deleteCover

Whether to remove the original token. Default: true

Type: Boolean  — Default value: true

patterns

A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed.

Type: String[]

Table 121. Capabilities

Inputs

Outputs

Languages

none specified

Regex Segmenter

Short name

RegexSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.RegexSegmenter

Description

This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.

The default behaviour is to split sentences by a line break and tokens by whitespace.

Parameters
language

The language.

Optional — Type: String

sentenceBoundaryRegex

Define the sentence boundary. Default: \n (assume one sentence per line).

Type: String  — Default value: \n

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

tokenBoundaryRegex

Defines the pattern that is used as token end boundary. Default: [\s\n]+ (matching whitespace and line breaks).

When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n].

Type: String  — Default value: [\s\n]+
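
The effect of the default and of a custom boundary pattern can be sketched as follows (`tokenize` is a hypothetical helper, not DKPro code):

```python
import re

def tokenize(text, boundary=r"[\s\n]+"):
    """Split `text` at the token boundary pattern, dropping empty strings."""
    return [t for t in re.split(boundary, text) if t]

print(tokenize("one two\nthree"))               # ['one', 'two', 'three']
# Custom pattern including the newline, as recommended above:
print(tokenize("tokenized-text\n", r"[-\n]"))   # ['tokenized', 'text']
```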

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 122. Capabilities

Inputs

none specified

Outputs

Languages

none specified

Token Merger

Short name

TokenMerger

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger

Description

Merges any Tokens that are covered by a given annotation type. E.g., this component can be used to create a single token from all tokens that constitute a multi-token named entity.
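
The merging behaviour can be sketched on plain offset tuples; this is a simplified illustration that ignores POS and lemma handling:

```python
def merge_tokens(tokens, entity):
    """tokens: list of (begin, end) offsets; entity: (begin, end).
    Replace all tokens covered by the entity span with one merged token."""
    covered = [t for t in tokens if t[0] >= entity[0] and t[1] <= entity[1]]
    if not covered:
        return tokens
    merged = (covered[0][0], covered[-1][1])
    out = [t for t in tokens if t not in covered]
    out.append(merged)
    return sorted(out)

# "New York City" as three tokens, covered by one NamedEntity (0, 13):
tokens = [(0, 3), (4, 8), (9, 13), (14, 16)]
print(merge_tokens(tokens, (0, 13)))  # [(0, 13), (14, 16)]
```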

Parameters
POSMappingLocation

Override the tagset mapping.

Optional — Type: String

annotationType

Annotation type for which tokens should be merged.

Type: String

constraint

A constraint on the annotations that should be considered, in the form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.

Optional — Type: String

cposValue

Set a new coarse POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Optional — Type: String

language

Use this language instead of the document language to resolve the model and tag set mapping.

Optional — Type: String

lemmaMode

Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata into a single lemma (space-separated), to REMOVE the lemma, or to LEAVE the lemma of the first token as-is.

Type: String  — Default value: JOIN

posType

Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.

Optional — Type: String

posValue

Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).

Optional — Type: String

Table 123. Capabilities

Inputs

Outputs

Languages

none specified

Token Trimmer

Short name

TokenTrimmer

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer

Description

Remove prefixes and suffixes from tokens.

Parameters
prefixes

List of prefixes to remove.

Type: String[]

suffixes

List of suffixes to remove.

Type: String[]
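
A minimal sketch of the trimming behaviour, assuming at most one prefix and one suffix are stripped per token (the exact repetition behaviour is an assumption, not documented above):

```python
def trim(token, prefixes, suffixes):
    """Strip one matching prefix and one matching suffix, if any."""
    for p in prefixes:
        if token.startswith(p):
            token = token[len(p):]
            break
    for s in suffixes:
        if token.endswith(s):
            token = token[:-len(s)]
            break
    return token

print(trim('"quoted,"', ['"'], [',"', '"']))  # quoted
```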

Table 124. Capabilities

Inputs

Outputs

Languages

none specified

Whitespace Segmenter

Short name

WhitespaceSegmenter

Category

Segmenter

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.tokit-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.tokit.WhitespaceSegmenter

Description

A strict whitespace tokenizer, i.e. it tokenizes according to whitespace and line breaks only.

If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.

Parameters
language

The language.

Optional — Type: String

strictZoning

Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.

Type: Boolean  — Default value: false

writeForm

Create TokenForm annotations.

Type: Boolean  — Default value: true

writeSentence

Create Sentence annotations.

Type: Boolean  — Default value: true

writeToken

Create Token annotations.

Type: Boolean  — Default value: true

zoneTypes

A list of type names used for zoning.

Optional — Type: String[]  — Default value: [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]

Table 125. Capabilities

Inputs

none specified

Outputs

Languages

none specified

Semantic role labeler

Table 126. Analysis Components in category Semantic role labeler (2)
Component Description

ClearNlpSemanticRoleLabeler

ClearNLP semantic role labeller.

MateSemanticRoleLabeler

DKPro Annotator for the MateTools Semantic Role Labeler.

ClearNLP Semantic Role Labeler

Short name

ClearNlpSemanticRoleLabeler

Category

Semantic role labeler

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.clearnlp-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler

Description

ClearNLP semantic role labeller.

Parameters
expandArguments

Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.

Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-contiguous region of the text, this information is lost; the arguments will appear to span a contiguous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.

Type: Boolean  — Default value: false
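
The expansion to the minimal and maximal token offsets of the head's descendants can be sketched as a walk over the dependency tree (hypothetical helper, not DKPro code):

```python
def expand_span(head, tokens, children):
    """Expand an argument from its head token to the span covering all
    of the head's descendants (or itself).
    tokens: {index: (begin, end)}; children: {index: [child indices]}."""
    stack, nodes = [head], []
    while stack:                       # depth-first collection of descendants
        n = stack.pop()
        nodes.append(n)
        stack.extend(children.get(n, []))
    begin = min(tokens[n][0] for n in nodes)
    end = max(tokens[n][1] for n in nodes)
    return begin, end

# "the old house": 'house' (index 2) heads tokens 0 and 1
tokens = {0: (0, 3), 1: (4, 7), 2: (8, 13)}
children = {2: [0, 1]}
print(expand_span(2, tokens, children))  # (0, 13)
```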

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelVariant

Variant of the model. Used to address a specific model if there are multiple models for one language.

Optional — Type: String

predModelLocation

Location from which the predicate identifier model is read.

Optional — Type: String

printTagSet

Write the tag set(s) to the log when a model is loaded.

Type: Boolean  — Default value: false

roleModelLocation

Location from which the roleset classification model is read.

Optional — Type: String

srlModelLocation

Location from which the semantic role labeling model is read.

Optional — Type: String

Table 127. Capabilities

Inputs

Outputs

Languages

see available models

Table 128. Models
Language Variant Version

en

mayo

20131111.0

en

ontonotes

20131128.0

Mate Tools Semantic Role Labeler

Short name

MateSemanticRoleLabeler

Category

Semantic role labeler

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.matetools-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler

Description

DKPro Annotator for the MateTools Semantic Role Labeler.

Please cite the following paper if you use the semantic role labeler: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43–48, Boulder, June 4–5, 2009.

Parameters
language

Use this language instead of the document language to resolve the model.

Optional — Type: String

modelLocation

Load the model from this location instead of locating the model automatically.

Optional — Type: String

modelVariant

Override the default variant used to locate the model.

Optional — Type: String

Table 129. Capabilities

Inputs

Outputs

Languages

see available models

Table 130. Models
Language Variant Version

de

tiger

20130105.0

en

conll2009

20130117.0

es

conll2009

20130320.0

zh

conll2009

20130117.0

Stemmer

Table 131. Analysis Components in category Stemmer (3)
Component Description

CisStemmer

UIMA wrapper for the CISTEM algorithm.

LancasterStemmer

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

SnowballStemmer

UIMA wrapper for the Snowball stemmer.

CIS Stemmer

Short name

CisStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-cisstem-asl

Implementation

org.dkpro.core.cisstem.CisStemmer

Description

UIMA wrapper for the CISTEM algorithm. CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words which are not stop words are stemmed.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Optional — Type: Boolean  — Default value: false

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 132. Capabilities

Inputs

none specified

Outputs

Languages

de

Lancaster Stemmer

Short name

LancasterStemmer

Category

Stemmer

Group ID

org.dkpro.core

Artifact ID

dkpro-core-lancaster-asl

Implementation

org.dkpro.core.lancaster.LancasterStemmer

Description

This Paice/Husk Lancaster stemmer implementation only works with the English language so far.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Specifies the language supported by the stemming model. Default value is "en" (English).

Type: String  — Default value: en

modelLocation

Specifies a URL that should resolve to a location from which to load custom rules. If the location starts with "classpath:", it is interpreted as a classpath location, e.g. "classpath:my/path/to/the/rules". Otherwise it is tried as a URL, then as a file, and finally as a UIMA resource.

Optional — Type: String

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

stripPrefix

If true, the stemmer strips prefixes such as kilo, micro, milli, intra, ultra, mega, nano, pico, and pseudo.

Type: Boolean  — Default value: false

Table 133. Capabilities

Inputs

Outputs

Languages

en

Snowball Stemmer

Short name

SnowballStemmer

Category

Stemmer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.snowball-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer

Description

UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.

If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words which are not stop words are stemmed.

Parameters
filterConditionOperator

Specifies the operator for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterConditionValue

Specifies the value for a filtering condition.

It is only used if PARAM_FILTER_FEATUREPATH is set.

Optional — Type: String

filterFeaturePath

Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.

Optional — Type: String

language

Use this language instead of the document language to resolve the model.

Optional — Type: String

lowerCase

By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.

Examples

Input        false (default)  true
EDUCATIONAL  EDUCATIONAL      educ
Educational  Educat           educ
educational  educ             educ

Optional — Type: Boolean  — Default value: false

paths

Specify a path that is used for annotation. The format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.

Optional — Type: String[]

Table 134. Capabilities

Inputs

none specified

Outputs

Languages

da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, sv, tr

Topic Model

Topic modeling is a statistical approach to discovering abstract topics in a collection of documents. A topic is characterized by a probability distribution over the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents; the result of the analysis describes the probability with which a document belongs to each of the topics in the model.
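
The inference step can be illustrated with a toy example. This is not Mallet's Gibbs sampler, only the intuition of scoring an unseen document against an already-trained word-topic table; the probabilities below are made up:

```python
# Hypothetical p(word | topic) values for a trained two-topic model
# (topic 0 ~ "sports", topic 1 ~ "finance").
WORD_TOPIC = {
    "goal":  [0.08, 0.01],
    "match": [0.06, 0.01],
    "bank":  [0.01, 0.09],
    "loan":  [0.01, 0.07],
}

def topic_distribution(tokens):
    """Accumulate word-topic weights over a document and normalize."""
    scores = [0.0, 0.0]
    for w in tokens:
        for t, p in enumerate(WORD_TOPIC.get(w, [0.0, 0.0])):
            scores[t] += p
    total = sum(scores)
    return [s / total for s in scores] if total else scores

print(topic_distribution(["goal", "match", "match"]))
# topic 0 ("sports") clearly dominates
```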

Table 135. Analysis Components in category Topic Model (2)
Component Description

MalletLdaTopicModelInferencer

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

MalletLdaTopicModelTrainer

Estimate an LDA topic model using Mallet and write it to a file.

Mallet LDA Topic Model Inferencer

Short name

MalletLdaTopicModelInferencer

Category

Topic Model

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelInferencer

Description

Infers the topic distribution over documents using a Mallet ParallelTopicModel.

Parameters
burnIn

The number of iterations before hyperparameter optimization begins. Default: 1

Type: Integer  — Default value: 1

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

maxTopicAssignments

Maximum number of topics to assign. If not set (or <= 0), the number of topics in the model divided by 10 is used.

Type: Integer  — Default value: 0

minTokenLength

Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.

Type: Integer  — Default value: 3

minTopicProb

Minimum topic proportion for the document-topic assignment.

Type: Float  — Default value: 0.2

modelLocation

Type: String

nIterations

The number of iterations during inference. Default: 100.

Type: Integer  — Default value: 100

thinning

Type: Integer  — Default value: 5

tokenFeaturePath

The annotation type to use for the model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

typeName

The annotation type to use as tokens. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Table 136. Capabilities

Inputs

Outputs

Languages

none specified

Mallet LDA Topic Model Trainer

Short name

MalletLdaTopicModelTrainer

Category

Topic Model

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.mallet-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.mallet.lda.MalletLdaTopicModelTrainer

Description

Estimate an LDA topic model using Mallet and write it to a file. It stores all incoming CASes as Mallet Instances before estimating the model using a ParallelTopicModel.

Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).

Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).

Parameters
alphaSum

The sum of alphas over all topics. Default: 1.0.

Another recommended value is 50 / T (number of topics).

Type: Float  — Default value: 1.0

beta

Beta for a single dimension of the Dirichlet prior. Default: 0.01.

Type: Float  — Default value: 0.01

burninPeriod

The number of iterations before hyper-parameter optimization begins. Default: 100

Type: Integer  — Default value: 100

compression

Choose a compression method. (default: CompressionMethod#NONE)

Optional — Type: String  — Default value: NONE

coveringAnnotationType

If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.

By default, the full text is used as a document.

Type: String  — Default value: ``

displayInterval

The interval in which to display the estimated topics. Default: 50.

Type: Integer  — Default value: 50

displayNTopicWords

The number of top words to display during estimation. Default: 7.

Type: Integer  — Default value: 7

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true
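
URL-encoding as a file-name sanitizer can be demonstrated with Python's standard library:

```python
from urllib.parse import quote

# Characters that are illegal in file names on some platforms,
# such as '\' and ':', are percent-escaped.
doc_id = r"corpus\2009:part1"
print(quote(doc_id, safe=""))  # corpus%5C2009%3Apart1
```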

filterRegex

Filter out all tokens matching that regular expression.

Type: String  — Default value: ``

filterRegexReplacement

Type: String  — Default value: ``

lowercase

If set to true (default: false), all tokens are lowercased.

Type: Boolean  — Default value: false

minTokenLength

Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Default: 3.

Type: Integer  — Default value: 3

nIterations

The number of iterations during model estimation. Default: 1000.

Type: Integer  — Default value: 1000

nTopics

The number of topics to estimate.

Type: Integer  — Default value: 10

numThreads

The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int).

Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating.

Type: Integer  — Default value: 0

optimizeInterval

Interval for optimizing Dirichlet hyper-parameters. Default: 50

Type: Integer  — Default value: 50

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

paramStopwordsFile

The location of the stopwords file.

Type: String  — Default value: ``

paramStopwordsReplacement

If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP).

Type: String  — Default value: ``

randomSeed

Set the random seed. If set to -1 (default), a random seed is generated.

Type: Integer  — Default value: -1

saveInterval

Define how frequently a serialized model is saved to disk during estimation. Default: 0 (only save when estimation is done).

Type: Integer  — Default value: 0

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

tokenFeaturePath

The annotation type to use as input tokens for the model estimation. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, for instance, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

useCharacters

If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored.

Type: Boolean  — Default value: false

useDocumentId

Use the document ID as file name even if a relative path information is present.

Type: Boolean  — Default value: false

useSymmetricAlpha

Whether to use a symmetric alpha value during model estimation. Default: false.

Type: Boolean  — Default value: false

Transformer

Table 137. Analysis Components in category Transformer (15)
Component Description

ApplyChangesAnnotator

Applies changes annotated using a SofaChangeAnnotation.

Backmapper

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

CapitalizationNormalizer

Takes a text and corrects wrong capitalization.

CjfNormalizer

Converts traditional Chinese to simplified Chinese or vice-versa.

DictionaryBasedTokenTransformer

Reads a tab-separated file containing mappings from one token to another.

ExpressiveLengtheningNormalizer

Takes a text and shortens expressively lengthened words.

FileBasedTokenTransformer

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

HyphenationRemover

Simple dictionary-based hyphenation remover.

RegexBasedTokenTransformer

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

ReplacementFileNormalizer

Takes a text and replaces desired expressions.

SharpSNormalizer

Takes a text and replaces the sharp s (ß).

SpellingNormalizer

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

StanfordPtbTransformer

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.

TokenCaseTransformer

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

UmlautNormalizer

Takes a text, checks for umlauts written as "ae", "oe", or "ue", and normalizes them to "ä", "ö", or "ü" if a frequency model indicates that they really are umlauts.

CAS Transformation - Apply

Short name

ApplyChangesAnnotator

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.castransformation-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.castransformation.ApplyChangesAnnotator

Description

Applies changes annotated using a SofaChangeAnnotation.

Table 138. Capabilities

Inputs

Outputs

Languages

none specified

CAS Transformation - Map back

Short name

Backmapper

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.castransformation-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.castransformation.Backmapper

Description

After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.

Parameters
Chain

Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].

Optional — Type: String[]  — Default value: [source, target]

Capitalization Normalizer

Short name

CapitalizationNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer

Description

Takes a text and corrects wrong capitalization.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 139. Capabilities

Inputs

Outputs

none specified

Languages

none specified

Chinese Traditional/Simplified Converter

Short name

CjfNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.languagetool-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.languagetool.CjfNormalizer

Description

Converts traditional Chinese to simplified Chinese or vice-versa.

Parameters
direction

Type: String  — Default value: TO_SIMPLIFIED

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 140. Capabilities

Inputs

none specified

Outputs

none specified

Languages

zh

Dictionary-based Token Transformer

Short name

DictionaryBasedTokenTransformer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer

Description

Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.

Parameters
commentMarker

Lines starting with this character (or String) are ignored. Default: '#'

Type: String  — Default value: #

modelEncoding

Type: String  — Default value: UTF-8

modelLocation

Type: String

separator

Separator for mappings file. Default: "\t" (TAB).

Type: String  — Default value: ``

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
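
The mappings file format described above can be sketched as follows. This is an illustrative reading routine, not the actual DKPro parser; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: lines starting with the comment marker are skipped;
// the first column is the source token, the second column the replacement.
class MappingsFileSketch {
    static Map<String, String> parse(Iterable<String> lines, String commentMarker,
            String separator) {
        Map<String, String> mappings = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith(commentMarker)) {
                continue; // comment or blank line
            }
            String[] cols = line.split(separator);
            mappings.put(cols[0], cols[1]);
        }
        return mappings;
    }
}
```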

Expressive Lengthening Normalizer

Short name

ExpressiveLengtheningNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer

Description

Takes a text and shortens expressively lengthened words.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 141. Capabilities

Inputs

Outputs

none specified

Languages

none specified

File-based Token Transformer

Short name

FileBasedTokenTransformer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer

Description

Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.

Parameters
ignoreCase

Type: Boolean  — Default value: false

modelLocation

Type: String

replacement

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Hyphenation Remover

Short name

HyphenationRemover

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.HyphenationRemover

Description

Simple dictionary-based hyphenation remover.

Parameters
modelEncoding

Type: String  — Default value: UTF-8

modelLocation

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
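
To sketch what dictionary-based hyphenation removal means: a word broken across a line break is joined only if the joined form occurs in the dictionary. This is illustrative only; the actual component reads its dictionary from the modelLocation parameter:

```java
import java.util.Set;

// Illustrative sketch of dictionary-based hyphenation removal (not the
// actual DKPro implementation): "exam-" + "ple" is joined to "example"
// only if "example" is a known dictionary word.
class HyphenationSketch {
    static String join(String first, String second, Set<String> dictionary) {
        // drop the trailing hyphen and concatenate
        String joined = first.substring(0, first.length() - 1) + second;
        return dictionary.contains(joined) ? joined : first + second;
    }
}
```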

Regex-based Token Transformer

Short name

RegexBasedTokenTransformer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer

Description

A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.

The parameter #PARAM_REGEX defines the regular expression to be searched for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.

Parameters
regex

Define the regular expression to be replaced

Type: String

replacement

Define the string to replace matching tokens with

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
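
The interplay of the two parameters can be sketched as plain regex replacement on a token's text (illustrative, not the actual implementation):

```java
// Illustrative sketch: apply the transformer's regex/replacement pair
// to a single token's text.
class RegexTransformSketch {
    static String transform(String token, String regex, String replacement) {
        return token.replaceAll(regex, replacement);
    }
}
```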

Replacement File Normalizer

Short name

ReplacementFileNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.ReplacementFileNormalizer

Description

Takes a text and replaces desired expressions. This component deliberately does not operate on tokens, as some expressions may span several tokens.

Parameters
modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

Location of a file which contains the replacement expressions.

Type: String

srcExpressionSurroundings

Type: String  — Default value: IRRELEVANT

targetExpressionSurroundings

Type: String  — Default value: NOTHING

Table 142. Capabilities

Inputs

Outputs

Languages

none specified

Sharp S (ß) Normalizer

Short name

SharpSNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.SharpSNormalizer

Description

Takes a text and replaces the sharp s (ß).

Parameters
minFrequencyThreshold

Type: Integer  — Default value: 100

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 143. Capabilities

Inputs

none specified

Outputs

none specified

Languages

de

Spelling Normalizer

Short name

SpellingNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.SpellingNormalizer

Description

Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Table 144. Capabilities

Inputs

Outputs

none specified

Languages

none specified

Stanford Penn Treebank Normalizer

Short name

StanfordPtbTransformer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl

Implementation

de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPtbTransformer

Description

Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.

Parameters
typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []

Token Case Transformer

Short name

TokenCaseTransformer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.TokenCaseTransformer

Description

Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.

Parameters
tokenCase
The case to convert tokens to:
  • UPPERCASE: uppercase everything.
  • LOWERCASE: lowercase everything.
  • NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.

Type: String

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
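
The three casing modes can be sketched as follows. This is illustrative code, not the actual DKPro implementation; NORMALCASE retains the first character of the token and any character immediately following a hyphen, lowercasing everything else:

```java
// Illustrative sketch of the TokenCaseTransformer casing modes.
class TokenCaseSketch {
    static String convert(String token, String tokenCase) {
        switch (tokenCase) {
            case "UPPERCASE": return token.toUpperCase();
            case "LOWERCASE": return token.toLowerCase();
            default: return normalcase(token);
        }
    }

    // Retain the first character and any character directly after a hyphen;
    // lowercase the rest.
    static String normalcase(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean retain = i == 0 || token.charAt(i - 1) == '-';
            sb.append(retain ? c : Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```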

Umlaut Normalizer

Short name

UmlautNormalizer

Category

Transformer

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.UmlautNormalizer

Description

Takes a text, checks for umlauts written as "ae", "oe", or "ue", and normalizes them to "ä", "ö", or "ü" if a frequency model indicates that they really are umlauts.

Parameters
minFrequencyThreshold

Type: Integer  — Default value: 100

typesToCopy

A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.

Type: String[]  — Default value: []
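
The frequency-based decision can be sketched as follows. The rewrite rule and the way the threshold is applied are assumptions made for illustration; the actual UmlautNormalizer consults a frequency provider resource:

```java
// Illustrative sketch: rewrite "ae"/"oe"/"ue" to "ä"/"ö"/"ü" only when the
// umlaut variant is frequent enough and more frequent than the original.
class UmlautSketch {
    static String normalize(String word,
            java.util.function.ToLongFunction<String> frequency,
            long minFrequencyThreshold) {
        String candidate = word.replace("ae", "ä").replace("oe", "ö").replace("ue", "ü");
        if (candidate.equals(word)) {
            return word; // nothing to normalize
        }
        long f = frequency.applyAsLong(candidate);
        return (f >= minFrequencyThreshold && f > frequency.applyAsLong(word))
                ? candidate : word;
    }
}
```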

Table 145. Capabilities

Inputs

Outputs

none specified

Languages

de

Other

Table 146. Analysis Components in category Other (18)
Component Description

AnnotationByTextFilter

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

CompoundAnnotator

Annotates compound parts and linking morphemes.

CorrectionsContextualizer

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy).

FrequencyCounter

Count unigrams and bigrams in a collection.

NGramAnnotator

N-gram annotator.

PosFilter

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

PosMapper

Maps existing POS tags from one tagset to another using a user provided properties file.

PhraseAnnotator

Annotate phrases in a sentence.

ReadabilityAnnotator

Assign a set of popular readability scores to the text.

RegexTokenFilter

Remove every token that does or does not match a given regular expression.

SemanticFieldAnnotator

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.

NorvigSpellingCorrector

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

StopWordRemover

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.

Stopwatch

Can be used to measure how long the processing between two points in a pipeline takes.

TfidfAnnotator

This component adds Tfidf annotations consisting of a term and a tfidf weight.

TfidfConsumer

This consumer builds a DfModel.

TrailingCharacterRemover

Removes trailing characters (or character sequences) from tokens, e.g. punctuation.

JCasHolder

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Annotation-By-Text Filter

Short name

AnnotationByTextFilter

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter

Description

Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.

Parameters
ignoreCase

If true, annotation texts are filtered case-independently. Default: true, i.e. words that occur in the list with different casing are not filtered out.

Type: Boolean  — Default value: true

modelEncoding

Type: String  — Default value: UTF-8

modelLocation

Type: String

typeName

Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.

Type: String  — Default value: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

Compound Annotator

Short name

CompoundAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.decompounding-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.decompounding.uima.annotator.CompoundAnnotator

Description

Annotates compound parts and linking morphemes.

Table 147. Capabilities

Inputs

Outputs

Languages

none specified

Corrections Contextualizer

Short name

CorrectionsContextualizer

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.jazzy-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.jazzy.CorrectionsContextualizer

Description

This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.

Frequency Count Writer

Short name

FrequencyCounter

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.frequency-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.frequency.phrasedetection.FrequencyCounter

Description

Count unigrams and bigrams in a collection.

Parameters
compression

Choose a compression method (default: CompressionMethod#NONE).

Optional — Type: String  — Default value: NONE

coveringType

Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences.

Optional — Type: String

escapeDocumentId

URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.)

Type: Boolean  — Default value: true

featurePath

The feature path. Default: tokens.

Optional — Type: String

filterRegex

Type: String  — Default value: ``

lowercase

If true, all tokens are lowercased.

Type: Boolean  — Default value: false

minCount

Tokens occurring fewer times than this value are omitted. Default: 5.

Type: Integer  — Default value: 5

overwrite

Allow overwriting target files (ignored when writing to ZIP archives).

Type: Boolean  — Default value: false

regexReplacement

Type: String  — Default value: ``

singularTarget

Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for CoNLL-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved.

Type: Boolean  — Default value: false

sortByAlphabet

If true, sort output alphabetically.

Type: Boolean  — Default value: false

sortByCount

If true, sort output by count (descending order).

Type: Boolean  — Default value: false

stopwordsFile

Type: String  — Default value: ``

stopwordsReplacement

Type: String  — Default value: ``

stripExtension

Remove the original extension.

Type: Boolean  — Default value: false

targetLocation

Target location. If this parameter is not set, data is written to stdout.

Optional — Type: String

useDocumentId

Use the document ID as file name even if a relative path information is present.

Type: Boolean  — Default value: false
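
The core counting step can be sketched as follows. This is an illustration of unigram/bigram counting over a token sequence, not the actual component, which operates on UIMA feature paths and writes its counts to the target location:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: count each token (unigram) and each adjacent token
// pair (bigram) in a document.
class NgramCountSketch {
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            counts.merge(tokens.get(i), 1, Integer::sum); // unigram
            if (i + 1 < tokens.size()) {
                counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum); // bigram
            }
        }
        return counts;
    }
}
```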

N-Gram Annotator

Short name

NGramAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.ngrams-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.ngrams.NGramAnnotator

Description

N-gram annotator.

Parameters
N

The length of the n-grams to generate (the "n" in n-gram).

Type: Integer  — Default value: 3

Table 148. Capabilities

Inputs

Outputs

Languages

none specified

POS Filter

Short name

PosFilter

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.posfilter-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter

Description

Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.

Parameters
adj

Keep/remove adjectives (true: keep, false: remove)

Type: Boolean  — Default value: false

adp

Keep/remove adpositions (true: keep, false: remove)

Type: Boolean  — Default value: false

adv

Keep/remove adverbs (true: keep, false: remove)

Type: Boolean  — Default value: false

aux

Keep/remove auxiliary verbs (true: keep, false: remove)

Type: Boolean  — Default value: false

conj

Keep/remove conjunctions (true: keep, false: remove)

Type: Boolean  — Default value: false

det

Keep/remove determiners/articles (true: keep, false: remove)

Type: Boolean  — Default value: false

intj

Keep/remove interjections (true: keep, false: remove)

Type: Boolean  — Default value: false

noun

Keep/remove nouns (true: keep, false: remove)

Type: Boolean  — Default value: false

num

Keep/remove numerals (true: keep, false: remove)

Type: Boolean  — Default value: false

part

Keep/remove particles (true: keep, false: remove)

Type: Boolean  — Default value: false

pron

Keep/remove pronouns (true: keep, false: remove)

Type: Boolean  — Default value: false

propn

Keep/remove proper nouns (true: keep, false: remove)

Type: Boolean  — Default value: false

punct

Keep/remove punctuation (true: keep, false: remove)

Type: Boolean  — Default value: false

sconj

Keep/remove subordinating conjunctions (true: keep, false: remove)

Type: Boolean  — Default value: false

sym

Keep/remove symbols (true: keep, false: remove)

Type: Boolean  — Default value: false

typeToRemove

The fully qualified name of the type that should be filtered.

Type: String

verb

Keep/remove verbs (true: keep, false: remove)

Type: Boolean  — Default value: false

x

Keep/remove other (true: keep, false: remove)

Type: Boolean  — Default value: false

Table 149. Capabilities

Inputs

Outputs

none specified

Languages

none specified

POS Mapper

Short name

PosMapper

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.posfilter-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.posfilter.PosMapper

Description

Maps existing POS tags from one tagset to another using a user provided properties file.

Parameters
dkproMappingLocation

A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes.
If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.

Optional — Type: String

mappingFile

A properties file containing POS tagset mappings.

Type: String

Table 150. Capabilities

Inputs

Outputs

Languages

none specified

Phrase Annotator

Short name

PhraseAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.frequency-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.frequency.phrasedetection.PhraseAnnotator

Description

Annotate phrases in a sentence. Depending on the provided unigrams and the threshold, these comprise either one or two annotations (tokens, lemmas, ...).

In order to identify longer phrases, run the FrequencyCounter and this annotator multiple times, each time taking the results of the previous run as input. From the second run on, set phrases in the feature path parameter #PARAM_FEATURE_PATH.

Parameters
PARAM_LOWERCASE

If true, lowercase everything.

Type: Boolean  — Default value: false

coveringType

Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences.

Optional — Type: String

discount

The discount used to prevent too many phrases consisting of very infrequent words from being formed. A typical value is the minimum count set during model creation (FrequencyCounter#PARAM_MIN_COUNT), which is by default set to 5.

Type: Integer  — Default value: 5

featurePath

The feature path to use for building bigrams. Default: tokens.

Optional — Type: String

filterRegex

Type: String  — Default value: ``

modelLocation

The file providing the unigram and bigram counts to use.

Type: String

regexReplacement

Type: String  — Default value: ``

stopwordsFile

Type: String  — Default value: ``

stopwordsReplacement

Type: String  — Default value: ``

threshold

The threshold score for phrase construction. Default is 100. Lower values result in more phrases. The value strongly depends on the size of the corpus and the token unigrams.

Type: Float  — Default value: 100.0
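
The discount and threshold parameters suggest word2vec-style phrase scoring. The following sketch assumes the Mikolov et al. formula; whether PhraseAnnotator uses exactly this formula is an assumption made for illustration:

```java
// Illustrative sketch (assumed formula): a bigram (a, b) becomes a phrase when
//   score = (count(a,b) - discount) * vocabularySize / (count(a) * count(b))
// exceeds the threshold.
class PhraseScoreSketch {
    static double score(long bigramCount, long countA, long countB,
            int discount, long vocabularySize) {
        return (double) (bigramCount - discount) * vocabularySize
                / ((double) countA * countB);
    }

    static boolean isPhrase(double score, double threshold) {
        return score > threshold;
    }
}
```

With this rule, a larger discount suppresses phrases built from rare words, and raising the threshold yields fewer phrases.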

Readability Annotator

Short name

ReadabilityAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.readability-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.readability.ReadabilityAnnotator

Description

Assign a set of popular readability scores to the text.

Regex Token Filter

Short name

RegexTokenFilter

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.RegexTokenFilter

Description

Remove every token that does or does not match a given regular expression.

Parameters
mustMatch

If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.

Type: Boolean  — Default value: true

regex

The regular expression that tokens are matched against; whether matching or non-matching tokens are removed depends on #PARAM_MUST_MATCH.

Type: String
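
The interaction of the two parameters can be restated as a small decision rule (illustrative, not the actual implementation):

```java
// Illustrative sketch: returns true when a token should be removed.
// mustMatch = true: remove tokens that do NOT match the regex.
// mustMatch = false: remove tokens that DO match the regex.
class RegexFilterSketch {
    static boolean shouldRemove(String token, String regex, boolean mustMatch) {
        boolean matches = token.matches(regex);
        return mustMatch ? !matches : matches;
    }
}
```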

Semantic Field Annotator

Short name

SemanticFieldAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator

Description

This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.

Parameters
annotationType

Annotation types which should be annotated with semantic fields

Type: String

constraint

A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity.

Optional — Type: String

Table 151. Capabilities

Inputs

Outputs

Languages

none specified

Simple Spelling Corrector

Short name

NorvigSpellingCorrector

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.norvig-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.norvig.NorvigSpellingCorrector

Description

Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.

Table 152. Capabilities

Inputs

Outputs

Languages

none specified

Stop Word Remover

Short name

StopWordRemover

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.stopwordremover-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover

Description

Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.

Parameters
Paths

Feature paths for annotations that should be matched/removed. The default is

StopWord.class.getName()
Token.class.getName()
Lemma.class.getName()+"/value"

Optional — Type: String[]

StopWordType

Anything annotated with this type will be removed even if it does not match any word in the lists.

Optional — Type: String

modelEncoding

The character encoding used by the model.

Type: String  — Default value: UTF-8

modelLocation

A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"

Type: String[]
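
The language-prefix syntax can be sketched as follows (an illustrative parsing helper with a hypothetical file path; not the actual implementation):

```java
// Illustrative sketch: split an optional "[lang]" prefix off a stop word
// list location. "*" (or no prefix) means the list applies to every document.
class StopwordLocationSketch {
    static String[] parse(String location) {
        if (location.startsWith("[")) {
            int end = location.indexOf(']');
            String lang = location.substring(1, end);
            return new String[] { lang, location.substring(end + 1) };
        }
        return new String[] { "*", location };
    }
}
```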

Table 153. Capabilities

Inputs

Outputs

none specified

Languages

none specified

Stopwatch

Short name

Stopwatch

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.performance-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch

Description

Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.

Parameters
timerName

Name of the timer pair. Upstream and downstream timer need to use the same name.

Type: String

timerOutputFile

Optional file to which the collected timings are written.

Optional — Type: String

Table 154. Capabilities

Inputs

Outputs

Languages

none specified

TF/IDF Annotator

Short name

TfidfAnnotator

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.frequency-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator

Description

This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfidfConsumer.

Parameters
featurePath

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

Type: String

lowercase

If set to true, the whole text is handled in lower case.

Optional — Type: Boolean  — Default value: false

tfdfPath

Provide the path to the Df-Model. When a SharedDfModel resource is bound to this annotator, this parameter is ignored.

Optional — Type: String

weightingModeIdf

The model for inverse document frequency weighting.
Invoke toString() on an enum of WeightingModeIdf for setup.

Default value is "NORMAL" yielding an unweighted idf.

Optional — Type: String  — Default value: NORMAL

weightingModeTf

The model for term frequency weighting.
Invoke toString() on an enum of WeightingModeTf for setup.

Default value is "NORMAL" yielding an unweighted tf.

Optional — Type: String  — Default value: NORMAL
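
The weighting modes suggest the usual tf·idf computation. The sketch below assumes that NORMAL means raw (unweighted) values, with a LOG idf variant shown for comparison; the exact definitions of the WeightingModeTf/WeightingModeIdf enums are assumptions here:

```java
// Illustrative tf-idf sketch (assumed mode semantics, not the DKPro classes).
class TfidfSketch {
    // NORMAL idf (assumed): plain ratio of documents to document frequency.
    static double idfNormal(long numDocuments, long documentFrequency) {
        return (double) numDocuments / documentFrequency;
    }

    // LOG idf (assumed): logarithm of that ratio.
    static double idfLog(long numDocuments, long documentFrequency) {
        return Math.log((double) numDocuments / documentFrequency);
    }

    static double tfidf(long termFrequency, double idf) {
        return termFrequency * idf;
    }
}
```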

Table 155. Capabilities

Inputs

none specified

Outputs

Languages

none specified

TF/IDF Model Writer

Short name

TfidfConsumer

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.frequency-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer

Description

This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.

Parameters
featurePath

This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.

Type: String

lowercase

If set to true, the whole text is handled in lower case.

Type: Boolean  — Default value: false

targetLocation

Specifies the path and filename where the model file is written.

Type: String
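The df counting performed by this consumer amounts to tallying, for each term, the number of documents in which it occurs at least once. A self-contained sketch of that bookkeeping (hypothetical class and method names, not the DKPro DfModel):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Counts document frequencies: for each term, the number of documents
// in which it appears at least once. Hypothetical sketch, not the
// serialized DfModel used by DKPro.
public class DfCounter {
    private final Map<String, Integer> df = new HashMap<>();
    private int documentCount = 0;

    /** Processes one document, given its already tokenized terms. */
    public void addDocument(Iterable<String> terms) {
        documentCount++;
        Set<String> seen = new HashSet<>();
        for (String term : terms) {
            if (seen.add(term)) {            // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
    }

    public int getDf(String term) {
        return df.getOrDefault(term, 0);
    }

    public int getDocumentCount() {
        return documentCount;
    }
}
```

The resulting counts, together with the document count, are all that the TfidfAnnotator needs to compute idf values at annotation time.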

Trailing Character Remover

Short name

TrailingCharacterRemover

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover

Description

Removes trailing characters (or character sequences), e.g. punctuation, from tokens.

Parameters
minTokenLength

All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed.

Shorter tokens that do not have trailing chars removed are always retained, regardless of their length.

Type: Integer  — Default value: 1

pattern

A regex to be trimmed from the end of tokens.

Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (removes punctuation, special characters, and capital letters).

Type: String  — Default value: [\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+
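The behaviour of the two parameters can be sketched with a plain regex anchored at the token end, combined with the minTokenLength filter (a hypothetical helper, not the DKPro implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of trailing-character removal: strip a trailing run matching the
// pattern, then drop tokens shortened below the minimum length.
// Hypothetical helper, not the DKPro code.
public class TrailingTrimmer {
    public static List<String> trim(List<String> tokens, String pattern, int minTokenLength) {
        Pattern trailing = Pattern.compile("(" + pattern + ")$");
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String trimmed = trailing.matcher(token).replaceFirst("");
            boolean changed = !trimmed.equals(token);
            // Tokens shortened below the minimum length are removed entirely;
            // short tokens that had nothing trimmed are always retained.
            if (changed && minTokenLength > 0 && trimmed.length() < minTokenLength) {
                continue;
            }
            result.add(trimmed);
        }
        return result;
    }
}
```

With the pattern "[,!]+" and the default minimum length of 1, "word," becomes "word", while a token reduced to the empty string is dropped.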

de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Short name

JCasHolder

Category

Other

Group ID

de.tudarmstadt.ukp.dkpro.core

Artifact ID

de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl

Implementation

de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder

Description

Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.

Appendix

Table 156. Producers and consumers by type
Type Producer Consumer

GrammarAnomaly

SpellingAnomaly

SuggestedAction

CoreferenceChain

CoreferenceLink

Tfidf

Morpheme

MorphologicalFeatures

POS

DocumentMetaData

NamedEntity

PhoneticTranscription

Compound

CompoundPart

Lemma

LinkingMorpheme

NGram


Paragraph

Sentence

Split

Stem

StopWord

Token

SemArg

SemPred

PennTree

Chunk

Constituent

Dependency

SofaChangeAnnotation

TopicDistribution

WordEmbedding

JapaneseToken

TimerAnnotation