The document provides detailed information about the DKPro Core UIMA components.
Analytics components
- Removes annotations that do not conform to minimum or maximum length constraints.
- Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
- Applies changes annotated using a SofaChangeAnnotation.
- Wrapper for the Twitter tokenizer and POS tagger.
- ArkTweet tokenizer.
- After processing a file with the ApplyChangesAnnotator, this annotator can be used to map the annotations created in the cleaned view back to the original view.
- Berkeley Parser annotator.
- BreakIterator segmenter.
- Split up existing tokens again if they are camel-case text.
- Takes a text and replaces wrong capitalization.
- UIMA wrapper for the CISTEM algorithm.
- Converts traditional Chinese to simplified Chinese or vice versa.
- Lemmatizer using ClearNLP.
- ClearNLP parser annotator.
- Part-of-speech annotator using ClearNLP.
- Tokenizer using ClearNLP.
- ClearNLP semantic role labeller.
- Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.
- Annotates compound parts and linking morphemes.
- Deterministic coreference annotator from CoreNLP.
- Dependency parser from CoreNLP.
- Lemmatizer from CoreNLP.
- Named entity recognizer from CoreNLP.
- Parser from CoreNLP.
- Part-of-speech tagger from CoreNLP.
- Tokenizer and sentence splitter from CoreNLP.
- This component assumes that some spell checker has already been applied upstream.
- Takes a plain text file with phrases as input and annotates the phrases in the CAS.
- Reads a tab-separated file containing mappings from one token to another.
- Double-Metaphone phonetic transcription based on Apache Commons Codec.
- Takes a text and shortens extra-long words.
- Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
- Flexible part-of-speech tagger.
- Count unigrams and bigrams in a collection.
- Wrapper for the GATE rule-based lemmatizer.
- Annotator for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.
- Segmenter for Japanese text based on GoSen.
- GATE Hepple part-of-speech tagger.
- Part-of-speech annotator using HunPos.
- Simple dictionary-based hyphenation remover.
- ICU segmenter.
- Lemmatizer using the OpenNLP-based IXA implementation.
- Part-of-speech annotator using OpenNLP with IXA extensions.
- Utility analysis engine for use with CAS multipliers in uimaFIT pipelines (de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder).
- JTok segmenter.
- Uses Jazzy to decide whether a word is spelled correctly.
- This Paice/Husk Lancaster stemmer implementation only works with the English language so far.
- LangDetect language identifier based on character n-grams.
- Language detector based on n-gram frequency counts, e.g. as provided by Web1T.
- Detection based on character n-grams.
- Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
- Naive lexicon-based lemmatizer.
- Segmenter using LanguageTool to do the heavy lifting.
- Annotates each line in the source text as a sentence.
- LingPipe named entity recognizer.
- LingPipe named entity recognizer trainer.
- LingPipe part-of-speech tagger.
- LingPipe segmenter.
- Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.
- Compute word embeddings from the given collection using skip-grams.
- Infers the topic distribution over documents using a Mallet ParallelTopicModel.
- Estimate an LDA topic model using Mallet and write it to a file.
- Dependency parsing using MaltParser.
- DKPro Annotator for the MateToolsLemmatizer.
- DKPro Annotator for the MateToolsMorphTagger.
- DKPro Annotator for the MateToolsParser.
- DKPro Annotator for the MateToolsPosTagger.
- DKPro Annotator for the MateTools Semantic Role Labeler.
- Annotator for the MeCab Japanese POS tagger.
- Metaphone phonetic transcription based on Apache Commons Codec.
- Lemmatize based on a finite-state machine.
- Dependency parsing using MSTParser.
- N-gram annotator.
- Emory NLP4J dependency parser.
- Emory NLP4J lemmatizer.
- Emory NLP4J name finder wrapper.
- Part-of-speech annotator using Emory NLP4J.
- Segmenter using Emory NLP4J.
- Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.
- Chunk annotator using OpenNLP.
- Train a chunker model for OpenNLP.
- Lemmatizer using OpenNLP.
- Train a lemmatizer model for OpenNLP.
- OpenNLP name finder wrapper.
- Train a named entity recognizer model for OpenNLP.
- OpenNLP parser.
- Part-of-speech annotator using OpenNLP.
- Train a POS tagging model for OpenNLP.
- Tokenizer and sentence splitter using OpenNLP.
- Train a sentence splitter model for OpenNLP.
- Train a tokenizer model for OpenNLP.
- Creates paragraph annotations for the given input document.
- Split up existing tokens again at particular split characters.
- Annotate phrases in a sentence.
- Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
- Maps existing POS tags from one tagset to another using a user-provided properties file.
- Assign a set of popular readability scores to the text.
- A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.
- Splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
- Remove every token that does or does not match a given regular expression.
- Takes a text and replaces desired expressions.
- RFTagger morphological analyzer.
- This analysis engine annotates English single words with semantic field information retrieved from an ExternalResource.
- SFST morphological analyzer.
- Takes a text and replaces sharp s (ß).
- UIMA wrapper for the Snowball stemmer.
- Soundex phonetic transcription based on Apache Commons Codec.
- Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
- No description.
- Converts a constituency structure into a dependency structure.
- Stanford Lemmatizer component.
- Stanford Named Entity Recognizer component.
- Train a NER model for the Stanford CoreNLP Named Entity Recognizer.
- Stanford Parser component.
- Stanford Part-of-Speech tagger component.
- Train a POS tagging model for the Stanford POS tagger.
- Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.
- Stanford sentence splitter and tokenizer.
- Remove all of the specified types from the CAS if their covered text is in the stop-word dictionary.
- Can be used to measure how long the processing between two points in a pipeline takes.
- Adds Tfidf annotations consisting of a term and a tf-idf weight.
- This consumer builds a DfModel.
- Change tokens to follow a specific casing: all upper case, all lower case, or "normal case": lowercase everything but the first character of a token and the characters immediately following a hyphen.
- Merges any tokens that are covered by a given annotation type.
- Remove prefixes and suffixes from tokens.
- Removes trailing characters (or character sequences) from tokens, e.g. punctuation.
- Chunk annotator using TreeTagger.
- Part-of-speech and lemmatizer annotator using TreeTagger.
- Takes a text, checks for umlauts written as "ae", "oe", or "ue", and normalizes them if they really are umlauts, depending on a frequency model.
- A strict whitespace tokenizer, i.e. tokenizes according to whitespace and line breaks only.
Checker
Component | Description
---|---
Jazzy Spellchecker | Uses Jazzy to decide whether a word is spelled correctly.
LanguageTool Grammar Checker | Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
Jazzy Spellchecker
This annotator uses Jazzy to decide whether a word is spelled correctly.
modelEncoding |
The character encoding used by the model. Type: String — Default value: |
modelLocation |
Location from which the model is read. The model file is a simple word-list with one word per line. Type: String |
scoreThreshold |
Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. For example, if set to 1, suggestions are limited to words within edit distance 1 of the original word. Type: Integer — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
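The scoreThreshold parameter is defined in terms of string edit distance. As a sketch of that notion only (this is not Jazzy's implementation; the class and method names below are ours), a standard Levenshtein distance can be computed as:

```java
// Illustrative Levenshtein edit distance: the number of single-character
// insertions, deletions, and substitutions needed to turn one string
// into another. With scoreThreshold = 1, only suggestions at distance
// <= 1 from the misspelled word would be kept.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "helo" -> "hello" needs one insertion, so it survives a
        // threshold of 1; "kitten" -> "sitting" needs three edits.
        System.out.println(levenshtein("helo", "hello"));    // 1
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```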
LanguageTool Grammar Checker
Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh |
Chunker
Component | Description
---|---
OpenNLP Chunker | Chunk annotator using OpenNLP.
OpenNLP Chunker Trainer | Train a chunker model for OpenNLP.
TreeTagger Chunker | Chunk annotator using TreeTagger.
OpenNLP Chunker
Chunk annotator using OpenNLP.
ChunkMappingLocation |
Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
en | | 20100908.1
en | | 20160205.1
OpenNLP Chunker Trainer
Train a chunker model for OpenNLP.
algorithm |
Type: String — Default value: |
beamSize |
Type: Integer — Default value: |
cutoff |
Type: Integer — Default value: |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
TreeTagger Chunker
Chunk annotator using TreeTagger.
ChunkMappingLocation |
Location of the mapping file for chunk tags to UIMA types. Optional — Type: String |
executablePath |
Use this TreeTagger executable instead of trying to locate the executable automatically. Optional — Type: String |
flushSequence |
A sequence to flush the internal TreeTagger buffer and to force it to output the rest of the completed analysis. This is typically just a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n.... Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
performanceMode |
TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided. Type: Boolean — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
de | | 20110429.1
en | | 20090824.1
en | | 20140520.1
fr | | 20141218.2
Coreference resolver
Component | Description
---|---
CoreNLP Coreference Resolver | Deterministic coreference annotator from CoreNLP.
CoreNLP Coreference Resolver (old API) | No description.
CoreNLP Coreference Resolver
Deterministic coreference annotator from CoreNLP.
maxDist |
DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance) Type: Integer — Default value: |
postprocessing |
DCoRef parameter: Do post-processing Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
score |
DCoRef parameter: Scoring the output of the system Type: Boolean — Default value: |
sieves |
DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/. Type: String — Default value: |
singleton |
DCoRef parameter: setting singleton predictor Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Coreference Resolver (old API)
maxDist |
DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance) Type: Integer — Default value: |
postprocessing |
DCoRef parameter: Do post processing Type: Boolean — Default value: |
score |
DCoRef parameter: Scoring the output of the system Type: Boolean — Default value: |
sieves |
DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/. Type: String — Default value: |
singleton |
DCoRef parameter: setting singleton predictor Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
en | | ${core.version}.1
Embeddings
Component | Description
---|---
Mallet Embeddings Annotator | Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.
Mallet Embeddings Trainer | Compute word embeddings from the given collection using skip-grams.
Mallet Embeddings Annotator
Reads word embeddings from a file and adds WordEmbedding annotations to tokens/lemmas.
annotateUnknownTokens |
Specify how to handle unknown tokens.
Type: Boolean — Default value: |
lowercase |
If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: |
modelHasHeader |
If set to true (default: false), the first line is interpreted as header line containing the number of entries and the dimensionality. This should be set to true for models generated with Word2Vec. Type: Boolean — Default value: |
modelIsBinary |
Type: Boolean — Default value: |
modelLocation |
The file containing the word embeddings. Currently only supports text file format. Type: String |
tokenFeaturePath |
The annotation type to use for the model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Mallet Embeddings Trainer
Compute word embeddings from the given collection using skip-grams.
Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).
Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
coveringAnnotationType |
If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored. By default, the full text is used as a document. Type: String — Default value: `` |
dimensions |
The dimensionality of the output word embeddings (default: 50). Type: Integer — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
exampleWord |
An example word that is output with its nearest neighbours once in a while (default: null, i.e. none). Optional — Type: String |
filterRegex |
Filter out all tokens matching that regular expression. Type: String — Default value: `` |
filterRegexReplacement |
Type: String — Default value: `` |
lowercase |
If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: |
minDocumentLength |
Ignore documents with fewer tokens than this value (default: 10). Type: Integer — Default value: |
minTokenLength |
Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Default: 3. Type: Integer — Default value: |
numNegativeSamples |
The number of negative samples to be generated for each token (default: 5). Type: Integer — Default value: |
numThreads |
The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int). Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating. Type: Integer — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
paramStopwordsFile |
The location of the stopwords file. Type: String — Default value: `` |
paramStopwordsReplacement |
If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP). Type: String — Default value: `` |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
tokenFeaturePath |
The annotation type to use as input tokens for the model estimation. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, for instance, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: |
useCharacters |
If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored. Type: Boolean — Default value: |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
windowSize |
The context size when generating embeddings (default: 5). Type: Integer — Default value: |
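The windowSize parameter determines which (target, context) token pairs a skip-gram trainer sees. As a minimal illustration of symmetric-window pair generation (this is not Mallet's implementation; all names below are ours):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipGramPairs {
    // Generate (target, context) training pairs: each token is paired
    // with every other token within windowSize positions of it.
    public static List<String[]> pairs(List<String> tokens, int windowSize) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int lo = Math.max(0, i - windowSize);
            int hi = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    out.add(new String[] { tokens.get(i), tokens.get(j) });
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With windowSize = 1, "b" pairs with its neighbours "a" and "c".
        for (String[] p : pairs(Arrays.asList("a", "b", "c", "d"), 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

A larger window yields more pairs per token and therefore broader (but noisier) context for each embedding.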
Gazetteer
Component | Description
---|---
Dictionary Annotator | Takes a plain text file with phrases as input and annotates the phrases in the CAS.
Dictionary Annotator
Takes a plain text file with phrases as input and annotates the phrases in the CAS. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line; tokens are separated by a space:

this is a phrase
another phrase
annotationType |
The annotation to create on matching phrases. If nothing is specified, this defaults to NGram. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Type: String — Default value: |
modelLocation |
The file must contain one phrase per line - phrases will be split at " " Type: String |
value |
The value to set the feature configured in #PARAM_VALUE_FEATURE to. Optional — Type: String |
valueFeature |
Set this feature on the created annotations. Optional — Type: String — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
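The kind of matching the Dictionary Annotator performs can be sketched as follows: given tokenized text and a set of space-separated phrases, find the token spans that would be annotated. This is an illustration only (not the component's actual code; all names below are ours):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseMatcher {
    // Return [begin, end) token index spans matching any phrase.
    // Phrases are space-separated token sequences, one per line in the
    // model file; we model that file as a Set of strings here.
    public static List<int[]> match(List<String> tokens, Set<String> phrases) {
        // The longest phrase bounds how far ahead we need to look.
        int maxLen = 0;
        for (String p : phrases) {
            maxLen = Math.max(maxLen, p.split(" ").length);
        }
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    spans.add(new int[] { i, i + len });
                    break; // longest match wins at this position
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(
                Arrays.asList("this is a phrase", "another phrase"));
        List<String> tokens =
                Arrays.asList("well", "this", "is", "a", "phrase", "indeed");
        for (int[] s : match(tokens, phrases)) {
            System.out.println(String.join(" ", tokens.subList(s[0], s[1])));
        }
    }
}
```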
Language Identifier
Component | Description
---|---
LangDetect | Langdetect language identifier based on character n-grams.
TextCat Language Identifier | Detection based on character n-grams.
Web1T Language Detector | Language detector based on n-gram frequency counts, e.g. as provided by Web1T.
LangDetect
Langdetect language identifier based on character n-grams. Due to the way LangDetect is implemented, this component does not support being instantiated multiple times with different model locations. Only a single model location can be active at a time over all instances of this component.
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
seed |
The random seed. Optional — Type: String |
Language | Variant | Version
---|---|---
any | | 20141013.1
any | | 20141013.1
TextCat Language Identifier (Character N-Gram-based)
Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.
References
- Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
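The ranked n-gram profile and "out-of-place" distance from Cavnar and Trenkle (1994) can be sketched as follows. This is an illustration only, not the Java Text Categorizing Library's implementation; all names below are ours:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramProfile {
    // Build a ranked profile of the most frequent character bigrams and
    // trigrams, most frequent first (ties broken alphabetically).
    public static List<String> profile(String text, int maxRank) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase();
        for (int n = 2; n <= 3; n++) {
            for (int i = 0; i + n <= t.length(); i++) {
                counts.merge(t.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (ranked.size() >= maxRank) break;
            ranked.add(e.getKey());
        }
        return ranked;
    }

    // "Out-of-place" distance: sum of rank differences between the two
    // profiles; n-grams missing from the reference get the maximum penalty.
    public static int outOfPlace(List<String> doc, List<String> reference) {
        int dist = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = reference.indexOf(doc.get(i));
            dist += (j < 0) ? reference.size() : Math.abs(i - j);
        }
        return dist;
    }

    public static void main(String[] args) {
        List<String> en = profile("the quick brown fox jumps over the lazy dog the end", 50);
        List<String> de = profile("der schnelle braune fuchs springt ueber den faulen hund", 50);
        List<String> doc = profile("the dog and the fox", 50);
        // The document's profile is closer to the English profile.
        System.out.println("vs en: " + outOfPlace(doc, en));
        System.out.println("vs de: " + outOfPlace(doc, de));
    }
}
```

The language whose reference profile yields the smallest out-of-place distance is chosen.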
Web1T Language Detector
Language detector based on n-gram frequency counts, e.g. as provided by Web1T
maxNGramSize |
The maximum n-gram size that should be considered. Default is 3. Type: Integer — Default value: |
minNGramSize |
The minimum n-gram size that should be considered. Default is 1. Type: Integer — Default value: |
Lemmatizer
Component | Description
---|---
ClearNLP Lemmatizer | Lemmatizer using ClearNLP.
CoreNLP Lemmatizer | Lemmatizer from CoreNLP.
CoreNLP Lemmatizer (old API) | Stanford Lemmatizer component.
GATE Lemmatizer | Wrapper for the GATE rule-based lemmatizer.
IXA Lemmatizer | Lemmatizer using the OpenNLP-based IXA implementation.
LanguageTool Lemmatizer | Naive lexicon-based lemmatizer.
Mate Tools Lemmatizer | DKPro Annotator for the MateToolsLemmatizer.
Morpha Lemmatizer | Lemmatize based on a finite-state machine.
NLP4J Lemmatizer | Emory NLP4J lemmatizer.
OpenNLP Lemmatizer | Lemmatizer using OpenNLP.
OpenNLP Lemmatizer Trainer | Train a lemmatizer model for OpenNLP.
ClearNLP Lemmatizer
Lemmatizer using Clear NLP.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String — Default value: |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
en | | 20131111.0
CoreNLP Lemmatizer
Lemmatizer from CoreNLP.
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Lemmatizer (old API)
Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html
This only works for ENGLISH.
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
en |
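The inflection-only behaviour described above (noun plurals and verb endings, no derivational morphology) can be illustrated with a toy suffix-stripping sketch. This is not the Morphology class's finite-state transducer, its rules are far simpler than the real ones, and all names below are ours:

```java
public class InflectionStripper {
    // Toy rules covering a few regular noun plurals and verb endings.
    // Derivational morphology (e.g. "happiness" -> "happy") is deliberately
    // out of scope, mirroring the inflection-only design described above.
    public static String lemma(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y"; // studies -> study
        }
        if (w.endsWith("sses")) {
            return w.substring(0, w.length() - 2);       // glasses -> glass
        }
        if (w.endsWith("s") && !w.endsWith("ss") && !w.endsWith("us")) {
            return w.substring(0, w.length() - 1);       // dogs -> dog
        }
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);       // walking -> walk
        }
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);       // walked -> walk
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(lemma("studies")); // study
        System.out.println(lemma("walked"));  // walk
        System.out.println(lemma("dogs"));    // dog
    }
}
```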
GATE Lemmatizer
Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
en | | 20160531.0
IXA Lemmatizer
Lemmatizer using the OpenNLP-based Ixa implementation.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
de | | 20160213.1
en | | 20160211.1
en | | 20160214.1
en | | 20160214.1
es | | 20160211.1
eu | | 20160212.1
fr | | 20160215.1
gl | | 20160212.1
it | | 20160213.1
nl | | 20160215.1
LanguageTool Lemmatizer
Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.
sanitize |
Type: Boolean — Default value: |
sanitizeChars |
Type: String[] — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh |
Mate Tools Lemmatizer
DKPro Annotator for the MateToolsLemmatizer.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
uppercase |
Try reconstructing proper casing for lemmata. This is useful for German, but e.g. for English creates odd results. Type: Boolean — Default value: |
variant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
de | | 20121024.1
en | | 20130117.1
es | | 20130117.1
fr | | 20130918.0
Morpha Lemmatizer
readPOS |
Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
en |
NLP4J Lemmatizer
Emory NLP4J lemmatizer. This is a lower-casing lemmatizer.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
OpenNLP Lemmatizer
Lemmatizer using OpenNLP.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
OpenNLP Lemmatizer Trainer
Train a lemmatizer model for OpenNLP.
algorithm |
Type: String — Default value: |
beamSize |
Type: Integer — Default value: |
cutoff |
Type: Integer — Default value: |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
Morphological analyzer
Component | Description
---|---
Mate Tools Morphological Analyzer | DKPro Annotator for the MateToolsMorphTagger.
RFTagger Morphological Analyzer | RFTagger morphological analyzer.
SFST Morphological Analyzer | SFST morphological analyzer.
Mate Tools Morphological Analyzer
DKPro Annotator for the MateToolsMorphTagger.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
de | | 20121024.1
es | | 20130117.1
fr | | 20130918.0
RFTagger Morphological Analyzer
RFTagger morphological analyzer.
MorphMappingLocation |
Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version
---|---|---
cz | | 20150728.1
de | | 20150928.1
hu | | 20150728.1
ru | | 20150728.1
sk | | 20150728.1
sl | | 20150728.1
SFST Morphological Analyzer
SFST morphological analyzer.
MorphMappingLocation |
Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
mode |
Type: String — Default value: |
modelEncoding |
Specifies the model encoding. Type: String — Default value: |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: |
writeLemma |
Write lemma information. Default: true Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Default: true Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20110202.1 |
|
de |
20140801.1 |
|
de |
20140521.1 |
|
de |
20140521.1 |
|
it |
20090223.1 |
|
tr |
20130219.1 |
Named Entity Recognizer
Component | Description |
---|---|
Stanford Named Entity Recognizer component. |
|
Named entity recognizer from CoreNLP. |
|
Train a NER model for Stanford CoreNLP Named Entity Recognizer. |
|
LingPipe named entity recognizer. |
|
LingPipe named entity recognizer trainer. |
|
Emory NLP4J name finder wrapper. |
|
OpenNLP name finder wrapper. |
|
Train a named entity recognizer model for OpenNLP. |
CoreNLP Named Entity Recognizer (old API)
Stanford Named Entity Recognizer component.
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20150130.1 |
|
de |
20161213.1 |
|
de |
20141024.1 |
|
en |
20161213.0 |
|
en |
20161213.1 |
|
en |
20160110.1 |
|
en |
20160110.0 |
|
en |
20150420.1 |
|
en |
20160110.1 |
|
en |
20150925.1 |
|
en |
20150129.0 |
|
en |
20150129.1 |
|
en |
20160110.1 |
|
en |
20161213.0 |
|
en |
20160110.0 |
|
es |
20161211.1 |
|
es |
20150925.1 |
|
fr |
20150925.1 |
|
it |
20150925.1 |
|
nl |
20150925.1 |
|
ru |
20160726.1 |
CoreNLP Named Entity Recognizer
Named entity recognizer from CoreNLP.
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
applyNumericClassifiers |
Type: Boolean — Default value: |
augmentRegexNER |
Type: Boolean — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxSentenceLength |
Type: Integer — Default value: |
maxTime |
Type: Integer — Default value: |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
numThreads |
Type: Integer — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
useSUTime |
Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Named Entity Recognizer Trainer
Train a NER model for Stanford CoreNLP Named Entity Recognizer.
acceptedTagsRegex |
Regex to filter the de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() named entity by type. Optional — Type: String |
entitySubClassification |
Optional — Type: String — Default value: |
propertiesFile |
Training file containing the parameters. Optional — Type: String |
retainClassification |
Flag to keep the label set specified by PARAM_LABEL_SET. If set to false, the representation is mapped to IOB1 on output. Default: true Optional — Type: Boolean — Default value: |
targetLocation |
Location of the target model file. Type: String |
LingPipe Named Entity Recognizer
LingPipe named entity recognizer.
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20110623.1 |
|
en |
20110623.1 |
|
en |
20110623.1 |
LingPipe Named Entity Recognizer Trainer
LingPipe named entity recognizer trainer.
acceptedTagsRegex |
Regex to filter the de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() named entity by type. Optional — Type: String |
targetLocation |
Type: String |
NLP4J Named Entity Recognizer
Emory NLP4J name finder wrapper.
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
ignoreMissingFeatures |
Process anyway, even if the model relies on features that are not supported by this component. Default: false Type: Boolean — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20160802.0 |
OpenNLP Named Entity Recognizer
OpenNLP name finder wrapper.
NamedEntityMappingLocation |
Location of the mapping file for named entity tags to UIMA types. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Type: String — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20141024.1 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20130624.1 |
|
en |
20100907.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
OpenNLP Named Entity Recognizer Trainer
Train a named entity recognizer model for OpenNLP.
acceptedTagsRegex |
Regex to filter the de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity#getValue() named entity by type. Optional — Type: String |
algorithm |
Type: String — Default value: |
beamSize |
Type: Integer — Default value: |
cutoff |
Type: Integer — Default value: |
featureGen |
Optional — Type: String |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
sequenceEncoding |
Type: String — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
Parser
Component | Description |
---|---|
Berkeley Parser annotator. |
|
CLEAR parser annotator. |
|
Converts a constituency structure into a dependency structure. |
|
Dependency parser from CoreNLP. |
|
Parser from CoreNLP. |
|
Stanford Parser component. |
|
Dependency parsing using MSTParser. |
|
Dependency parsing using MaltParser. |
|
DKPro Annotator for the MateToolsParser. |
|
Emory NLP4J dependency parser. |
|
OpenNLP parser. |
Berkeley Parser
Berkeley Parser annotator. Requires sentences to be annotated beforehand.
ConstituentMappingLocation |
Location of the mapping file for constituent tags to UIMA types. Optional — Type: String |
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
accurate |
Set thresholds for accuracy. Default: false (set thresholds for efficiency) Type: Boolean — Default value: |
binarize |
Output binarized trees. Default: false Type: Boolean — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
keepFunctionLabels |
Retain predicted function labels. Model must have been trained with function labels. Default: false Type: Boolean — Default value: |
language |
Use this language instead of the language set in the CAS to locate the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
readPOS |
Whether to use already existing POS tags from another annotator for the parsing process. Default: false Type: Boolean — Default value: |
scores |
Output inside scores (only for binarized Viterbi trees). Default: false Type: Boolean — Default value: |
substates |
Output sub-categories (only for binarized Viterbi trees). Default: false Type: Boolean — Default value: |
variational |
Use variational rule score approximation instead of max-rule. Default: false Type: Boolean — Default value: |
viterbi |
Compute Viterbi derivation instead of max-rule tree. Default: false (max-rule) Type: Boolean — Default value: |
writePOS |
Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Default: true Type: Boolean — Default value: |
writePennTree |
If this parameter is set to true, each sentence is annotated with a PennTree annotation containing the whole parse tree in Penn Treebank style format. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
ar |
20090917.1 |
|
bg |
20090917.1 |
|
de |
20090917.1 |
|
en |
20100819.1 |
|
fr |
20090917.1 |
|
zh |
20090917.1 |
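A sentence annotator (here an OpenNLP segmenter, as one possible choice) must run before the parser. A hedged sketch, assuming DKPro Core 1.x class names and the `PARAM_WRITE_PENN_TREE` constant on the component:

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.PennTree;
import de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class BerkeleyParserExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The quick brown fox jumps over the lazy dog.");

        SimplePipeline.runPipeline(jcas,
                // provides the Sentence annotations the parser requires
                createEngineDescription(OpenNlpSegmenter.class),
                createEngineDescription(BerkeleyParser.class,
                        BerkeleyParser.PARAM_WRITE_PENN_TREE, true));

        // With writePennTree enabled, each sentence gets a PennTree annotation
        for (PennTree tree : select(jcas, PennTree.class)) {
            System.out.println(tree.getPennTree());
        }
    }
}
```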
ClearNLP Parser
CLEAR parser annotator.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20131111.0 |
|
en |
20131128.0 |
CoreNLP Dependency Converter
Converts a constituency structure into a dependency structure.
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
mode |
Sets the kind of dependencies being created. Default: DependenciesMode#COLLAPSED TREE Optional — Type: String — Default value: |
originalDependencies |
Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Dependency Parser
Dependency parser from CoreNLP.
DependencyMappingLocation |
Location of the mapping file for dependency tags to UIMA types. Optional — Type: String |
extraDependencies |
Type: String — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxSentenceLength |
Type: Integer — Default value: |
maxTime |
Type: Integer — Default value: |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
numThreads |
Type: Integer — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20161213.1 |
|
en |
20160119.1 |
|
en |
20150418.1 |
|
en |
20161213.1 |
|
en |
20150418.1 |
|
en |
20161213.1 |
|
fr |
20161211.1 |
|
zh |
20160119.1 |
|
zh |
20161223.1 |
|
zh |
20161223.1 |
CoreNLP Parser
Parser from CoreNLP.
ConstituentMappingLocation |
Location of the mapping file for constituent tags to UIMA types. Optional — Type: String |
DependencyMappingLocation |
Location of the mapping file for dependency tags to UIMA types. Optional — Type: String |
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
extraDependencies |
Type: String — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false Optional — Type: Boolean — Default value: |
keepPunctuation |
Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxSentenceLength |
Type: Integer — Default value: |
maxTime |
Type: Integer — Default value: |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
numThreads |
Type: Integer — Default value: |
originalDependencies |
Type: Boolean — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
readPOS |
Whether to use existing POS tags. Default: true Type: Boolean — Default value: |
writeConstituent |
Whether to create constituent tags. This is required for POS tagging and lemmatization. Default: true Type: Boolean — Default value: |
writeDependency |
Whether to create dependency annotations. Default: true Type: Boolean — Default value: |
writePOS |
Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Default: false Type: Boolean — Default value: |
writePennTree |
If this parameter is set to true, each sentence is annotated with a PennTree annotation containing the whole parse tree in Penn Treebank style format. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Parser (old API)
Stanford Parser component.
ConstituentMappingLocation |
Location of the mapping file for constituent tags to UIMA types. Optional — Type: String |
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
annotationTypeToParse |
This parameter can be used to override the standard behavior, which uses the Sentence annotation as the basic unit for parsing. If the parameter is set to the name of an annotation type x, the parser will parse x annotations instead of Sentence annotations. Default: null Optional — Type: String |
keepPunctuation |
Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxItems |
Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser. Default: 200000 Type: Integer — Default value: |
maxSentenceLength |
Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions. Default: 130 Type: Integer — Default value: |
mode |
Sets the kind of dependencies being created. Default: DependenciesMode#TREE Optional — Type: String — Default value: |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
readPOS |
Whether to use already existing POS tags from another annotator for the parsing process. Default: true Type: Boolean — Default value: |
writeConstituent |
Whether to create constituent tags. This is required for POS tagging and lemmatization. Default: true Type: Boolean — Default value: |
writeDependency |
Whether to create dependency annotations. Default: true Type: Boolean — Default value: |
writePOS |
Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Default: false Type: Boolean — Default value: |
writePennTree |
If this parameter is set to true, each sentence is annotated with a PennTree annotation containing the whole parse tree in Penn Treebank style format. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
ar |
20150129.1 |
|
ar |
20141031.1 |
|
de |
20150129.1 |
|
de |
20150129.1 |
|
de |
20141031.1 |
|
en |
20150129.1 |
|
en |
20150129.1 |
|
en |
20160110.1 |
|
en |
20140104.1 |
|
en |
20141031.1 |
|
en |
20141031.1 |
|
en |
20150129.1 |
|
en |
20150129.1 |
|
en |
20140104.1 |
|
es |
20161211.1 |
|
es |
20161211.1 |
|
es |
20161211.1 |
|
fr |
20150129.1 |
|
fr |
20160114.1 |
|
fr |
20141023.1 |
|
zh |
20150129.1 |
|
zh |
20150129.1 |
|
zh |
20141023.1 |
|
zh |
20150129.1 |
|
zh |
20150129.1 |
MSTParser Dependency Parser
Dependency parsing using MSTParser.
Wrapper for the MSTParser (high memory requirements). More information about the parser can be found here.
The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.
This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CONLL 2006 format are not generated (cf. mstparser.DependencyInstance DependencyInstance).
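Because the parser is fed the FORM and POS fields, a POS tagger must run upstream. A minimal sketch, assuming DKPro Core 1.x class names and run with a generous heap (e.g. `-Xmx4g`, given the memory note above):

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency;
import de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class MstParserExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("Dogs chase cats.");

        SimplePipeline.runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class), // tokens + sentences (FORM)
                createEngineDescription(OpenNlpPosTagger.class), // POS field for the parser
                createEngineDescription(MstParser.class));

        // The parser adds Dependency annotations linking governor and dependent tokens
        for (Dependency dep : select(jcas, Dependency.class)) {
            System.out.printf("%s(%s, %s)%n", dep.getDependencyType(),
                    dep.getGovernor().getCoveredText(),
                    dep.getDependent().getCoveredText());
        }
    }
}
```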
DependencyMappingLocation |
Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
order |
Specifies the order/scope of features. Order 1 uses features over single edges only; order 2 additionally uses features over pairs of adjacent edges in the tree. The model must have been trained with the order set here. Optional — Type: Integer |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20100416.2 |
|
en |
20121019.2 |
|
hr |
20130527.1 |
|
hr |
20130527.1 |
MaltParser Dependency Parser
Dependency parsing using MaltParser.
Required annotations:
- Token
- Sentence
- POS
- Dependency (annotated over sentence-span)
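The required annotations above map directly onto upstream components: a segmenter provides Token and Sentence, a POS tagger provides POS, and MaltParser itself adds the Dependency annotations over each sentence. A hedged sketch, assuming DKPro Core 1.x class names and OpenNLP components as one possible upstream choice:

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class MaltParserExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("She reads a book.");

        // Each required input is produced by an upstream engine:
        AnalysisEngineDescription segmenter = createEngineDescription(OpenNlpSegmenter.class); // Token, Sentence
        AnalysisEngineDescription posTagger = createEngineDescription(OpenNlpPosTagger.class); // POS
        AnalysisEngineDescription parser = createEngineDescription(MaltParser.class); // adds Dependency per sentence

        SimplePipeline.runPipeline(jcas, segmenter, posTagger, parser);
    }
}
```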
ignoreMissingFeatures |
Process anyway, even if the model relies on features that are not supported by this component. Default: false Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
bn |
20120905.1 |
|
en |
20120312.1 |
|
en |
20120312.1 |
|
es |
20130220.0 |
|
fa |
20130522.1 |
|
fr |
20120312.1 |
|
pl |
20120904.1 |
|
sv |
20120925.2 |
Mate Tools Dependency Parser
DKPro Annotator for the MateToolsParser.
Please cite the following paper if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.
DependencyMappingLocation |
Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20121024.1 |
|
en |
20130117.2 |
|
es |
20130117.1 |
|
fa |
20141124.0 |
|
fr |
20130918.0 |
|
zh |
20130117.1 |
NLP4J Dependency Parser
Emory NLP4J dependency parser.
DependencyMappingLocation |
Location of the mapping file for dependency tags to UIMA types. Optional — Type: String |
ignoreMissingFeatures |
Process anyway, even if the model relies on features that are not supported by this component. Default: false Type: Boolean — Default value: |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
OpenNLP Parser
OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotations if explicitly requested via #PARAM_WRITE_POS.
ConstituentMappingLocation |
Location of the mapping file for constituent tags to UIMA types. Optional — Type: String |
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
writePOS |
Whether to create POS tags. The creation of constituent tags must be turned on for this to work. Default: true Type: Boolean — Default value: |
writePennTree |
If this parameter is set to true, each sentence is annotated with a PennTree annotation containing the whole parse tree in Penn Treebank style format. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20120616.1 |
|
en |
20140426.1 |
|
es |
20140426.1 |
Part-of-speech tagger
Component | Description |
---|---|
Wrapper for Twitter Tokenizer and POS Tagger. |
|
Part-of-Speech annotator using Clear NLP. |
|
Part-of-speech tagger from CoreNLP. |
|
Stanford Part-of-Speech tagger component. |
|
Train a POS tagging model for the Stanford POS tagger. |
|
Flexible part-of-speech tagger. |
|
GATE Hepple part-of-speech tagger. |
|
Part-of-Speech annotator using HunPos. |
|
Part-of-Speech annotator using OpenNLP with IXA extensions. |
|
LingPipe part-of-speech tagger. |
|
DKPro Annotator for the MateToolsPosTagger |
|
Annotator for the MeCab Japanese POS Tagger. |
|
Part-of-Speech annotator using Emory NLP4J. |
|
Part-of-Speech annotator using OpenNLP. |
|
Train a POS tagging model for OpenNLP. |
|
Part-of-Speech and lemmatizer annotator using TreeTagger. |
ArkTweet POS-Tagger
Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20120919.1 |
|
en |
20121211.1 |
|
en |
20130723.1 |
ClearNLP POS-Tagger
Part-of-Speech annotator using Clear NLP. Requires sentences to be annotated beforehand.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
dictLocation |
Load the dictionary from this location instead of locating the dictionary automatically. Optional — Type: String |
dictVariant |
Override the default variant used to locate the dictionary. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the pos-tagging model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the pos-tagging model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20131111.0 |
|
en |
20131128.0 |
CoreNLP POS-Tagger
Part-of-speech tagger from CoreNLP.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This usually helps to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxSentenceLength |
Type: Integer — Default value: |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
numThreads |
Type: Integer — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP POS-Tagger (old API)
Stanford Part-of-Speech tagger component.
POSMappingLocation |
Location of the mapping file for part-of-speech tags to UIMA types. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: false Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
maxSentenceLength |
Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged. Optional — Type: Integer |
modelLocation |
Location from which the model is read. Optional — Type: String |
modelVariant |
Variant of a model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
ptb3Escaping |
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-). Type: Boolean — Default value: |
quoteBegin |
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
quoteEnd |
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser. Optional — Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
ar |
20131112.1 |
|
de |
20140827.1 |
|
de |
20140827.1 |
|
de |
20140827.0 |
|
de |
20140827.1 |
|
de |
20161213.1 |
|
en |
20140616.1 |
|
en |
20140827.0 |
|
en |
20130730.1 |
|
en |
20140616.1 |
|
en |
20130730.1 |
|
en |
20130914.0 |
|
en |
20160110.1 |
|
en |
20131112.1 |
|
en |
20140827.0 |
|
en |
20140616.1 |
|
en |
20131112.1 |
|
es |
20161211.1 |
|
es |
20161211.1 |
|
fr |
20140616.1 |
|
zh |
20140616.1 |
|
zh |
20140616.1 |
CoreNLP POS-Tagger Trainer
Train a POS tagging model for the Stanford POS tagger.
clusterFile |
Distsim cluster files. Optional — Type: String |
targetLocation |
Type: String |
trainFile |
Training file containing the parameters. Optional — Type: String |
FlexTag POS-Tagger
Flexible part-of-speech tagger.
POSMappingLocation |
Optional — Type: String |
language |
Optional — Type: String |
modelLocation |
Optional — Type: String |
modelVariant |
Optional — Type: String |
Language | Variant | Version |
---|---|---|
de |
20170512.1 |
|
en |
20170512.1 |
GATE Hepple POS-Tagger
GATE Hepple part-of-speech tagger.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
lexiconLocation |
Load the lexicon from this location instead of locating it automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
rulesetLocation |
Load the ruleset from this location instead of locating it automatically. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20160531.0 |
HunPos POS-Tagger
Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.
References
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
cs |
20121123.2 |
|
da |
20121123.2 |
|
de |
20121123.2 |
|
en |
20070724.2 |
|
fa |
20140414.0 |
|
hr |
20130509.2 |
|
hu |
20070724.2 |
|
pt |
20121123.2 |
|
pt |
20121123.2 |
|
pt |
20130119.2 |
|
pt |
20110419.2 |
|
ru |
20121123.2 |
|
sl |
20121123.2 |
|
sv |
20100215.2 |
|
sv |
20100927.2 |
IXA POS-Tagger
Part-of-Speech annotator using OpenNLP with IXA extensions.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Language | Variant | Version |
---|---|---|
de |
20160213.1 |
|
en |
20160211.1 |
|
en |
20160211.1 |
|
en |
20160214.1 |
|
en |
20160214.1 |
|
es |
20160212.1 |
|
eu |
20160212.1 |
|
fr |
20160215.1 |
|
gl |
20160212.1 |
|
it |
20160213.1 |
|
nl |
20160214.1 |
|
nl |
20160214.1 |
LingPipe POS-Tagger
LingPipe part-of-speech tagger.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
uppercaseTags |
LingPipe models tend to be trained on lower-case tags, but the DKPro Core POS mappings use upper-case tags. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20110623.1 |
|
en |
20110623.1 |
|
en |
20110623.1 |
Mate Tools POS-Tagger
DKPro Annotator for the MateToolsPosTagger
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20121024.1 |
|
en |
20130117.1 |
|
es |
20130117.1 |
|
fr |
20130918.0 |
|
zh |
20130117.1 |
MeCab POS-Tagger
Annotator for the MeCab Japanese POS Tagger.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
ja |
Language | Variant | Version |
---|---|---|
jp |
20140917.0 |
|
jp |
20140917.0 |
|
jp |
20140917.0 |
|
jp |
20070801.0 |
NLP4J POS-Tagger
Part-of-Speech annotator using Emory NLP4J. Requires Sentences to be annotated before.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
ignoreMissingFeatures |
Process anyway, even if the model relies on features that are not supported by this component. Default: false Type: Boolean — Default value: |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20160802.0 |
OpenNLP POS-Tagger
Part-of-Speech annotator using OpenNLP.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
da |
20120616.1 |
|
da |
20120616.1 |
|
de |
20120616.1 |
|
de |
20120616.1 |
|
en |
20120616.1 |
|
en |
20120616.1 |
|
en |
20131115.1 |
|
es |
20120410.1 |
|
es |
20140425.1 |
|
es |
20120410.1 |
|
es |
20120410.1 |
|
es |
20131115.1 |
|
es |
20120410.1 |
|
it |
20130618.0 |
|
nl |
20120616.1 |
|
nl |
20120616.1 |
|
pt |
20120616.1 |
|
pt |
20130121.1 |
|
pt |
20130121.1 |
|
pt |
20120616.1 |
|
sv |
20120616.1 |
|
sv |
20120616.1 |
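Like most DKPro Core analysis components, the OpenNLP POS tagger is typically instantiated via uimaFIT and run after a segmenter, since it requires Sentence and Token annotations. The following is a minimal pipeline sketch, not a definitive recipe: it assumes the standard DKPro Core 1.x package names and requires the corresponding Maven artifacts (and the English model artifact) on the classpath, so it is shown here without an executable test.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PosTagExample {
    public static void main(String[] args) throws Exception {
        // Document text and language; the language drives model lookup.
        JCas jcas = JCasFactory.createText("DKPro Core tags this sentence.", "en");

        // Sentences and tokens must be annotated before the tagger runs.
        AnalysisEngineDescription seg = createEngineDescription(OpenNlpSegmenter.class);

        // The model is resolved automatically from the document language
        // unless modelLocation or modelVariant overrides the lookup.
        AnalysisEngineDescription pos = createEngineDescription(
                OpenNlpPosTagger.class,
                OpenNlpPosTagger.PARAM_PRINT_TAGSET, true);

        SimplePipeline.runPipeline(jcas, seg, pos);

        for (Token t : JCasUtil.select(jcas, Token.class)) {
            System.out.println(t.getCoveredText() + " -> " + t.getPos().getPosValue());
        }
    }
}
```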
OpenNLP POS-Tagger Trainer
Train a POS tagging model for OpenNLP.
algorithm |
Type: String — Default value: |
beamSize |
Type: Integer — Default value: |
cutoff |
Type: Integer — Default value: |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
TreeTagger POS-Tagger
Part-of-Speech and lemmatizer annotator using TreeTagger.
POSMappingLocation |
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically. Optional — Type: String |
executablePath |
Use this TreeTagger executable instead of trying to locate the executable automatically. Optional — Type: String |
internTags |
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true Optional — Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
performanceMode |
TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided. Type: Boolean — Default value: |
printTagSet |
Log the tag set(s) when a model is loaded. Default: false Type: Boolean — Default value: |
writeLemma |
Write lemma information. Default: true Type: Boolean — Default value: |
writePOS |
Write part-of-speech information. Default: true Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
bg |
20160430.1 |
|
de |
20170316.1 |
|
en |
20170220.1 |
|
es |
20161222.1 |
|
et |
20110124.1 |
|
fi |
20140704.1 |
|
fr |
20100111.1 |
|
gl |
20130516.1 |
|
gmh |
20161107.1 |
|
it |
20141020.1 |
|
la |
20110819.1 |
|
mn |
20120925.1 |
|
nl |
20130107.1 |
|
pl |
20150506.1 |
|
pt |
20101115.2 |
|
ru |
20140505.1 |
|
sk |
20130725.1 |
|
sw |
20130729.1 |
|
zh |
20101115.1 |
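Because TreeTagger produces POS tags and lemmas in a single pass, the writeLemma and writePOS parameters control which of the two annotation types actually get written. A configuration sketch (class and constant names assumed from the usual DKPro Core 1.x TreeTagger module; the native TreeTagger binary and model artifacts are required at runtime, so no executable test is attached):

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger;

public class TreeTaggerConfig {
    public static AnalysisEngineDescription posOnly() throws Exception {
        // Switch off lemma output so only POS annotations are written;
        // the tagging pass itself is the same either way.
        return createEngineDescription(
                TreeTaggerPosTagger.class,
                TreeTaggerPosTagger.PARAM_WRITE_LEMMA, false);
    }
}
```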
Phonetic Transcriptor
Component | Description |
---|---|
Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. |
|
Double-Metaphone phonetic transcription based on Apache Commons Codec. |
|
Metaphone phonetic transcription based on Apache Commons Codec. |
|
Soundex phonetic transcription based on Apache Commons Codec. |
Commons Codec Cologne Phonetic Transcriptor
Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.
Inputs |
|
---|---|
Outputs |
|
Languages |
de |
Commons Codec Double-Metaphone Phonetic Transcriptor
Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Commons Codec Metaphone Phonetic Transcriptor
Metaphone phonetic transcription based on Apache Commons Codec. Works for English.
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Segmenter
Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a single segmenter component to avoid the need to reorder the components in a pipeline when replacing one segmenter with another.
Component | Description |
---|---|
Removes annotations that do not conform to minimum or maximum length constraints. |
|
ArkTweet tokenizer. |
|
Split up existing tokens again if they are camel-case text. |
|
Tokenizer using Clear NLP. |
|
Tokenizer and sentence splitter using CoreNLP. |
|
Stanford sentence splitter and tokenizer. |
|
Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. |
|
Segmenter for Japanese text based on GoSen. |
|
ICU segmenter. |
|
JTok segmenter. |
|
BreakIterator segmenter. |
|
Segmenter using LanguageTool to do the heavy lifting. |
|
Annotates each line in the source text as a sentence. |
|
LingPipe segmenter. |
|
Segmenter using Emory NLP4J. |
|
Tokenizer and sentence splitter using OpenNLP. |
|
Train a sentence splitter model for OpenNLP. |
|
Train a tokenizer model for OpenNLP. |
|
This class creates paragraph annotations for the given input document. |
|
Split up existing tokens again at particular split-chars. |
|
This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries. |
|
Merges any Tokens that are covered by a given annotation type. |
|
Remove prefixes and suffixes from tokens. |
|
A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only. |
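Since every segmenter produces both Sentence and Token annotations, replacing one segmenter with another is a one-line change in the pipeline, as the introduction above notes. A sketch under the usual DKPro Core 1.x package-name assumptions (Maven artifacts required, hence no executable test):

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class SegmenterChoice {
    public static AnalysisEngineDescription segmenter(boolean useOpenNlp) throws Exception {
        // Both components annotate Sentence and Token, so downstream
        // components (taggers, parsers, ...) are unaffected by the choice.
        return useOpenNlp
                ? createEngineDescription(OpenNlpSegmenter.class)
                : createEngineDescription(BreakIteratorSegmenter.class);
    }
}
```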
Annotation-By-Length Filter
Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).
FilterTypes |
A set of annotation types that should be filtered. Type: String[] — Default value: |
MaxLengthFilter |
Any annotation in filterTypes longer than this value will be removed. Type: Integer — Default value: |
MinLengthFilter |
Any annotation in filterTypes shorter than this value will be removed. Type: Integer — Default value: |
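The filter keeps only annotations whose covered text length lies within the configured bounds. A configuration sketch; the class name and package are assumed from the component title (only the parameter names above are given by this reference), and parameters are passed as plain strings, which uimaFIT supports. No executable test is attached since the DKPro Core artifacts are external dependencies.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter;

public class LengthFilterConfig {
    public static AnalysisEngineDescription tokenLengthFilter() throws Exception {
        // Remove tokens shorter than 3 or longer than 30 characters;
        // FilterTypes selects which annotation types are inspected.
        return createEngineDescription(
                AnnotationByLengthFilter.class,
                "FilterTypes", new String[] {
                    "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token" },
                "MinLengthFilter", 3,
                "MaxLengthFilter", 30);
    }
}
```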
CamelCase Token Segmenter
Split up existing tokens again if they are camel-case text.
deleteCover |
Whether to remove the original token. Default: true Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
ClearNLP Segmenter
Tokenizer using Clear NLP.
language |
The language. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
en |
Language | Variant | Version |
---|---|---|
en |
20131111.0 |
CoreNLP Segmenter
Tokenizer and sentence splitter using CoreNLP.
boundaryMultiTokenRegex |
Optional — Type: String |
boundaryToDiscard |
The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: |
boundaryTokenRegex |
The set of boundary tokens. If null, use default. Optional — Type: String — Default value: |
htmlElementsToDiscard |
These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. Optional — Type: String[] |
language |
The language. Optional — Type: String |
newlineIsSentenceBreak |
Strategy for treating newlines as sentence breaks. Optional — Type: String — Default value: |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
tokenRegexesToDiscard |
The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
CoreNLP Segmenter (old API)
Stanford sentence splitter and tokenizer.
allowEmptySentences |
Whether to generate empty sentences. Type: Boolean — Default value: |
boundaryFollowersRegex |
A set of strings, matched with String#equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")". Optional — Type: String — Default value: |
boundaryToDiscard |
The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: |
boundaryTokenRegex |
The set of boundary tokens. If null, use default. Optional — Type: String — Default value: |
isOneSentence |
Whether to treat all input as one sentence. Type: Boolean — Default value: |
language |
The language. Optional — Type: String |
languageFallback |
If this component is not configured for a specific language and if the language stored in the document metadata is not supported, use the given language as a fallback. Optional — Type: String |
newlineIsSentenceBreak |
Strategy for treating newlines as sentence breaks. Optional — Type: String — Default value: |
regionElementRegex |
A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
tokenRegexesToDiscard |
The set of regex for sentence boundary tokens that should be discarded. Optional — Type: String[] — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
xmlBreakElementsToDiscard |
These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. Optional — Type: String[] |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
en, es, fr |
German Separated Particle Annotator
Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an + fangen. In many usages of German particle verbs, the stem and the particle are separated, e.g. "Wir fangen gleich an." The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an"; the proper verb lemma "anfangen" is thus not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of particle verbs (e.g. "fangen") with the proper verb lemma (e.g. "anfangen") and leaves the lemma of the separated particle unchanged.
Inputs |
|
---|---|
Outputs |
|
Languages |
de |
Gosen Segmenter
Segmenter for Japanese text based on GoSen.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
ja |
ICU Segmenter
ICU segmenter.
language |
The language. Optional — Type: String |
splitAtApostrophe |
By default, the segmenter does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered. Type: Boolean — Default value: |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
af, ak, am, ar, as, az, be, bg, bm, bn, bo, br, bs, ca, ce, cs, cy, da, de, dz, ee, el, en, eo, es, et, eu, fa, ff, fi, fo, fr, fy, ga, gd, gl, gu, gv, ha, hi, hr, hu, hy, ig, ii, in, is, it, iw, ja, ji, ka, ki, kk, kl, km, kn, ko, ks, kw, ky, lb, lg, ln, lo, lt, lu, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, nd, ne, nl, nn, om, or, os, pa, pl, ps, pt, qu, rm, rn, ro, ru, rw, se, sg, si, sk, sl, sn, so, sq, sr, sv, sw, ta, te, th, ti, to, tr, ug, uk, ur, uz, vi, yo, zh, zu |
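The strictZoning and zoneTypes parameters described above restrict segmentation to the boundaries of zone annotations. A configuration sketch for the ICU segmenter; the class name, its package, and the use of the Div structure type as zone type are assumptions here (parameters are passed by the string names listed in the table, which uimaFIT permits), and the DKPro Core artifacts are external, so no executable test is attached.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.icu.IcuSegmenter;

public class ZonedSegmenterConfig {
    public static AnalysisEngineDescription zonedSegmenter() throws Exception {
        // With strict zoning, sentences and tokens are only created inside
        // annotations of the single configured zone type.
        return createEngineDescription(
                IcuSegmenter.class,
                "strictZoning", true,
                "zoneTypes", new String[] {
                    "de.tudarmstadt.ukp.dkpro.core.api.structure.type.Div" });
    }
}
```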
JTok Segmenter
JTok segmenter.
language |
The language. Optional — Type: String |
ptbEscaping |
Type: Boolean — Default value: |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeParagraph |
Create Paragraph annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
de, en, it |
Java BreakIterator Segmenter
BreakIterator segmenter.
language |
The language. Optional — Type: String |
splitAtApostrophe |
By default, the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered. Type: Boolean — Default value: |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
ar, be, bg, ca, cs, da, de, el, en, es, et, fi, fr, ga, hi, hr, hu, in, is, it, iw, ja, ko, lt, lv, mk, ms, mt, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh |
LanguageTool Segmenter
Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
be, br, ca, da, de, el, en, eo, es, fa, fr, gl, is, it, ja, km, lt, ml, nl, pl, pt, ro, ru, sk, sl, sv, ta, tl, uk, zh |
Line-based Sentence Segmenter
Annotates each line in the source text as a sentence. This segmenter cannot create tokens; the token-related parameters have no effect.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
LingPipe Segmenter
LingPipe segmenter.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
NLP4J Segmenter
Segmenter using Emory NLP4J.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
OpenNLP Segmenter
Tokenizer and sentence splitter using OpenNLP.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
segmentationModelLocation |
Load the segmentation model from this location instead of locating the model automatically. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
tokenizationModelLocation |
Load the tokenization model from this location instead of locating the model automatically. Optional — Type: String |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
da |
20120616.1 |
|
da |
20120616.1 |
|
de |
20120616.1 |
|
de |
20120616.1 |
|
en |
20120616.1 |
|
en |
20120616.1 |
|
it |
20130618.0 |
|
it |
20130618.0 |
|
nb |
20120131.1 |
|
nb |
20120131.1 |
|
nl |
20120616.1 |
|
nl |
20120616.1 |
|
pt |
20120616.1 |
|
pt |
20120616.1 |
|
sv |
20120616.1 |
|
sv |
20120616.1 |
OpenNLP Sentence Splitter Trainer
Train a sentence splitter model for OpenNLP.
abbreviationDictionaryEncoding |
Type: String — Default value: |
abbreviationDictionaryLocation |
Optional — Type: String |
algorithm |
Type: String — Default value: |
cutoff |
Type: Integer — Default value: |
eosCharacters |
Optional — Type: String[] |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
OpenNLP Tokenizer Trainer
Train a tokenizer model for OpenNLP.
abbreviationDictionaryEncoding |
Type: String — Default value: |
abbreviationDictionaryLocation |
Optional — Type: String |
algorithm |
Type: String — Default value: |
alphaNumericPattern |
Optional — Type: String — Default value: |
cutoff |
Type: Integer — Default value: |
iterations |
Type: Integer — Default value: |
language |
Type: String |
numThreads |
Type: Integer — Default value: |
targetLocation |
Type: String |
trainerType |
Type: String — Default value: |
useAlphaNumericOptimization |
Type: Boolean — Default value: |
Paragraph Splitter
This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.
splitPattern |
A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks) Type: String — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
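The split logic described above can be sketched with a plain regex. Note that the regex below is an assumed approximation of the component's #DOUBLE_LINE_BREAKS_PATTERN (whose literal value is not shown here), and the class name is made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the paragraph-splitting idea: two or more consecutive line breaks
// (Unix \n or Windows \r\n) delimit paragraphs. The regex is an assumption,
// not the component's actual DOUBLE_LINE_BREAKS_PATTERN constant.
public class ParagraphSplitSketch {
    public static List<String> paragraphs(String text) {
        return Arrays.asList(text.split("(\\r?\\n){2,}"));
    }

    public static void main(String[] args) {
        System.out.println(paragraphs("First paragraph.\n\nSecond paragraph.\r\n\r\nThird."));
    }
}
```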
Pattern-based Token Segmenter
Splits existing tokens at particular split characters. Each pattern carries a prefix stating whether the split characters should be added as separate Token annotations: patterns preceded by #INCLUDE_PREFIX retain the split characters as tokens, whereas patterns preceded by #EXCLUDE_PREFIX drop them.
deleteCover |
Whether to remove the original token. Default: true Type: Boolean — Default value: |
patterns |
A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed. Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
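The include/exclude distinction can be illustrated with a small re-implementation. This is a sketch of the idea only; the component defines its own #INCLUDE_PREFIX/#EXCLUDE_PREFIX constants and the class here is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative re-implementation of include/exclude splitting: with an
// "include" pattern the split character becomes a token of its own; with an
// "exclude" pattern it is dropped.
public class PatternSplitSketch {
    public static List<String> split(String token, String splitChars, boolean includeSplitChar) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("[" + Pattern.quote(splitChars) + "]").matcher(token);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(token.substring(last, m.start()));
            }
            if (includeSplitChar) {
                parts.add(m.group()); // keep the split char as its own token
            }
            last = m.end();
        }
        if (last < token.length()) {
            parts.add(token.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("foo-bar", "-", true));  // split char kept as a token
        System.out.println(split("foo-bar", "-", false)); // split char dropped
    }
}
```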
Regex Segmenter
This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
The default behaviour is to split sentences by a line break and tokens by whitespace.
language |
The language. Optional — Type: String |
sentenceBoundaryRegex |
Define the sentence boundary. Default: \n (assume one sentence per line). Type: String — Default value: `` |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
tokenBoundaryRegex |
Defines the pattern that is used as token end boundary. Default: [\s\n]+ (matching whitespace and line breaks). When setting custom patterns, take into account that the final token is often terminated by a line break rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n]. Type: String — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
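The default behaviour (one sentence per line, tokens at whitespace) can be sketched with plain Java regex splitting. The class name below is made up for illustration; the actual component additionally creates UIMA annotations with offsets:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the Regex Segmenter defaults: sentences are split at line breaks,
// tokens at runs of whitespace. Both patterns correspond to the configurable
// sentenceBoundaryRegex and tokenBoundaryRegex parameters.
public class RegexSegmenterSketch {
    public static List<String> sentences(String text) {
        return Arrays.asList(text.split("\n"));
    }

    public static List<String> tokens(String sentence) {
        return Arrays.asList(sentence.split("[\\s\\n]+"));
    }

    public static void main(String[] args) {
        String text = "One sentence per line.\nAnother sentence.";
        for (String s : sentences(text)) {
            System.out.println(tokens(s));
        }
    }
}
```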
Token Merger
Merges any Tokens that are covered by a given annotation type. For example, this component can be used to create a single token from all tokens that constitute a multi-token named entity.
POSMappingLocation |
Override the tagset mapping. Optional — Type: String |
annotationType |
Annotation type for which tokens should be merged. Type: String |
constraint |
A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity. Optional — Type: String |
cposValue |
Set a new coarse POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset). Optional — Type: String |
language |
Use this language instead of the document language to resolve the model and tag set mapping. Optional — Type: String |
lemmaMode |
Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma or LEAVE the lemma of the first token as-is. Type: String — Default value: |
posType |
Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set. Optional — Type: String |
posValue |
Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset). Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
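The core merging idea can be sketched on plain offset spans. This is a simplified illustration only (the class and record are hypothetical); the actual component operates on UIMA annotations and additionally handles POS and lemma features as described by the parameters above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the merging idea on plain offsets: all tokens whose span lies
// inside a covering annotation (e.g. a multi-token named entity) are replaced
// by a single token spanning from the first to the last covered offset.
public class TokenMergerSketch {
    public record Span(int begin, int end) {}

    public static List<Span> merge(List<Span> tokens, Span cover) {
        List<Span> result = new ArrayList<>();
        int mergedBegin = -1;
        int mergedEnd = -1;
        for (Span t : tokens) {
            if (t.begin() >= cover.begin() && t.end() <= cover.end()) {
                if (mergedBegin < 0) {
                    mergedBegin = t.begin();
                }
                mergedEnd = t.end();
            } else {
                if (mergedBegin >= 0) {
                    result.add(new Span(mergedBegin, mergedEnd));
                    mergedBegin = -1;
                }
                result.add(t);
            }
        }
        if (mergedBegin >= 0) {
            result.add(new Span(mergedBegin, mergedEnd));
        }
        return result;
    }

    public static void main(String[] args) {
        // "New York City is big": an entity annotation covers "New York City"
        List<Span> tokens = List.of(new Span(0, 3), new Span(4, 8), new Span(9, 13),
                new Span(14, 16), new Span(17, 20));
        System.out.println(merge(tokens, new Span(0, 13)));
    }
}
```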
Token Trimmer
Remove prefixes and suffixes from tokens.
prefixes |
List of prefixes to remove. Type: String[] |
suffixes |
List of suffixes to remove. Type: String[] |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
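The trimming behaviour amounts to stripping each configured prefix and suffix where it matches; a minimal sketch (the class name is made up for illustration):

```java
import java.util.List;

// Sketch of prefix/suffix trimming: each configured prefix and suffix is
// stripped from a token where it matches.
public class TokenTrimmerSketch {
    public static String trim(String token, List<String> prefixes, List<String> suffixes) {
        for (String p : prefixes) {
            if (token.startsWith(p)) {
                token = token.substring(p.length());
            }
        }
        for (String s : suffixes) {
            if (token.endsWith(s)) {
                token = token.substring(0, token.length() - s.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        // e.g. strip a leading '#' and a trailing period
        System.out.println(trim("#hashtag.", List.of("#"), List.of(".")));
    }
}
```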
Whitespace Segmenter
A strict whitespace tokenizer, i.e. it tokenizes at whitespace and line breaks only.
If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.
language |
The language. Optional — Type: String |
strictZoning |
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them. Type: Boolean — Default value: |
writeForm |
Create TokenForm annotations. Type: Boolean — Default value: |
writeSentence |
Create Sentence annotations. Type: Boolean — Default value: |
writeToken |
Create Token annotations. Type: Boolean — Default value: |
zoneTypes |
A list of type names used for zoning. Optional — Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
Semantic role labeler
Component | Description |
---|---|
ClearNLP semantic role labeller. |
|
DKPro Annotator for the MateTools Semantic Role Labeler. |
ClearNLP Semantic Role Labeler
ClearNLP semantic role labeller.
expandArguments |
Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word. Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost. The arguments will appear to be spanning a continuous region. For another, the arguments may overlap with each other. For example, if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument. Type: Boolean — Default value: |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelVariant |
Variant of the model. Used to address a specific model if there are multiple models for one language. Optional — Type: String |
predModelLocation |
Location from which the predicate identifier model is read. Optional — Type: String |
printTagSet |
Write the tag set(s) to the log when a model is loaded. Type: Boolean — Default value: |
roleModelLocation |
Location from which the roleset classification model is read. Optional — Type: String |
srlModelLocation |
Location from which the semantic role labeling model is read. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
en |
20131111.0 |
|
en |
20131128.0 |
Mate Tools Semantic Role Labeler
DKPro Annotator for the MateTools Semantic Role Labeler.
Please cite the following paper if you use the semantic role labeler: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43–48, Boulder, June 4–5, 2009.
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
modelLocation |
Load the model from this location instead of locating the model automatically. Optional — Type: String |
modelVariant |
Override the default variant used to locate the model. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
see available models |
Language | Variant | Version |
---|---|---|
de |
20130105.0 |
|
en |
20130117.0 |
|
es |
20130320.0 |
|
zh |
20130117.0 |
Stemmer
Component | Description |
---|---|
UIMA wrapper for the CISTEM algorithm. |
|
This Paice/Husk Lancaster stemmer implementation only works with the English language so far. |
|
UIMA wrapper for the Snowball stemmer. |
CIS Stemmer
UIMA wrapper for the CISTEM algorithm. CISTEM is a stemming algorithm for the German language, developed by Leonie Weißweiler and Alexander Fraser. Annotation types to be stemmed can be configured by a FeaturePath.
If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.
filterConditionOperator |
Specifies the operator for a filtering condition.
It is only used if Optional — Type: String |
filterConditionValue |
Specifies the value for a filtering condition.
It is only used if Optional — Type: String |
filterFeaturePath |
Specifies a feature path that is used in the filter. If this is set, you also have to specify
Optional — Type: String |
lowerCase |
By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer. Optional — Type: Boolean — Default value: |
paths |
Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with a IndexTermAnnotation. The value of the IndexTerm is specified by the feature path. Optional — Type: String[] |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
de |
Lancaster Stemmer
This Paice/Husk Lancaster stemmer implementation only works with the English language so far.
filterConditionOperator |
Specifies the operator for a filtering condition.
It is only used if Optional — Type: String |
filterConditionValue |
Specifies the value for a filtering condition.
It is only used if Optional — Type: String |
filterFeaturePath |
Specifies a feature path that is used in the filter. If this is set, you also have to specify
Optional — Type: String |
language |
Specifies the language supported by the stemming model. Default value is "en" (English). Type: String — Default value: |
modelLocation |
Specifies a URL that should resolve to a location from where to load custom rules. If the location starts with classpath: the location is interpreted as a classpath location, e.g. "classpath:my/path/to/the/rules". Otherwise, it is tried as a URL, then a file, and finally a UIMA resource. Optional — Type: String |
paths |
Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with a IndexTermAnnotation. The value of the IndexTerm is specified by the feature path. Optional — Type: String[] |
stripPrefix |
True if the stemmer will strip prefixes such as kilo, micro, milli, intra, ultra, mega, nano, pico, pseudo. Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
en |
Snowball Stemmer
UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.
If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.
filterConditionOperator |
Specifies the operator for a filtering condition.
It is only used if Optional — Type: String |
filterConditionValue |
Specifies the value for a filtering condition.
It is only used if Optional — Type: String |
filterFeaturePath |
Specifies a feature path that is used in the filter. If this is set, you also have to specify
Optional — Type: String |
language |
Use this language instead of the document language to resolve the model. Optional — Type: String |
lowerCase |
By default, the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.
Optional — Type: Boolean — Default value: |
paths |
Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with a IndexTermAnnotation. The value of the IndexTerm is specified by the feature path. Optional — Type: String[] |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, sv, tr |
Topic Model
Topic modeling is a statistical approach to discover abstract topics in a collection of documents. A topic is characterized by a probability distribution over the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability with which a document belongs to each of the topics in the model.
Component | Description |
---|---|
Infers the topic distribution over documents using a Mallet ParallelTopicModel. |
|
Estimate an LDA topic model using Mallet and write it to a file. |
Mallet LDA Topic Model Inferencer
Infers the topic distribution over documents using a Mallet ParallelTopicModel.
burnIn |
The number of iterations before hyperparameter optimization begins. Default: 1 Type: Integer — Default value: |
lowercase |
If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: |
maxTopicAssignments |
Maximum number of topics to assign. If not set (or <= 0), it defaults to the number of topics in the model divided by 10. Type: Integer — Default value: |
minTokenLength |
Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3. Type: Integer — Default value: |
minTopicProb |
Minimum topic proportion for the document-topic assignment. Type: Float — Default value: |
modelLocation |
Type: String |
nIterations |
The number of iterations during inference. Default: 100. Type: Integer — Default value: |
thinning |
Type: Integer — Default value: |
tokenFeaturePath |
The annotation type to use for the model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: |
typeName |
The annotation type to use as tokens. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token Type: String — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Mallet LDA Topic Model Trainer
Estimate an LDA topic model using Mallet and write it to a file. It stores all incoming CASes as Mallet Instances before estimating the model using a ParallelTopicModel.
Set #PARAM_TOKEN_FEATURE_PATH to define what is considered as a token (Tokens, Lemmas, etc.).
Set #PARAM_COVERING_ANNOTATION_TYPE to define what is considered a document (sentences, paragraphs, etc.).
alphaSum |
The sum of alphas over all topics. Default: 1.0. Another recommended value is 50 / T (number of topics). Type: Float — Default value: |
beta |
Beta for a single dimension of the Dirichlet prior. Default: 0.01. Type: Float — Default value: |
burninPeriod |
The number of iterations before hyper-parameter optimization begins. Default: 100 Type: Integer — Default value: |
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
coveringAnnotationType |
If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.sentence. Text that is not within such annotations is ignored. By default, the full text is used as a document. Type: String — Default value: `` |
displayInterval |
The interval in which to display the estimated topics. Default: 50. Type: Integer — Default value: |
displayNTopicWords |
The number of top words to display during estimation. Default: 7. Type: Integer — Default value: |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
filterRegex |
Filter out all tokens matching that regular expression. Type: String — Default value: `` |
filterRegexReplacement |
Type: String — Default value: `` |
lowercase |
If set to true (default: false), all tokens are lowercased. Type: Boolean — Default value: |
minTokenLength |
Ignore tokens (or any other annotation type, as specified by #PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Default: 3. Type: Integer — Default value: |
nIterations |
The number of iterations during model estimation. Default: 1000. Type: Integer — Default value: |
nTopics |
The number of topics to estimate. Type: Integer — Default value: |
numThreads |
The number of threads to use during model estimation. If not set, the number of threads is automatically set by ComponentParameters#computeNumThreads(int). Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating. Type: Integer — Default value: |
optimizeInterval |
Interval for optimizing Dirichlet hyper-parameters. Default: 50 Type: Integer — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
paramStopwordsFile |
The location of the stopwords file. Type: String — Default value: `` |
paramStopwordsReplacement |
If set, stopwords found in the #PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP). Type: String — Default value: `` |
randomSeed |
Set random seed. If set to -1 (default), uses random generator. Type: Integer — Default value: |
saveInterval |
Define how frequently a serialized model is saved to disk during estimation. Default: 0 (only save when estimation is done). Type: Integer — Default value: |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
tokenFeaturePath |
The annotation type to use as input tokens for the model estimation. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, for instance, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value Type: String — Default value: |
useCharacters |
If true (default: false), estimate character embeddings. #PARAM_TOKEN_FEATURE_PATH is ignored. Type: Boolean — Default value: |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
useSymmetricAlpha |
Use a symmetric alpha value during model estimation? Default: false. Type: Boolean — Default value: |
Transformer
Component | Description |
---|---|
Applies changes annotated using a SofaChangeAnnotation. |
|
After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view. |
|
Takes a text and replaces wrong capitalization |
|
Converts traditional Chinese to simplified Chinese or vice-versa. |
|
Reads a tab-separated file containing mappings from one token to another. |
|
Takes a text and shortens extra long words |
|
Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT. |
|
Simple dictionary-based hyphenation remover. |
|
A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on a regular expressions. |
|
Takes a text and replaces desired expressions. |
|
Takes a text and replaces sharp s |
|
Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation. |
|
Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. |
|
Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen. |
|
Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model. |
CAS Transformation - Apply
Applies changes annotated using a SofaChangeAnnotation.
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
CAS Transformation - Map back
After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.
Chain |
Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A]. Optional — Type: String[] — Default value: |
Capitalization Normalizer
Takes a text and replaces wrong capitalization
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
Chinese Traditional/Simplified Converter
Converts traditional Chinese to simplified Chinese or vice-versa.
direction |
Type: String — Default value: |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
none specified |
Languages |
zh |
Dictionary-based Token Transformer
Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.
commentMarker |
Lines starting with this character (or String) are ignored. Default: '#' Type: String — Default value: |
modelEncoding |
Type: String — Default value: |
modelLocation |
Type: String |
separator |
Separator for mappings file. Default: "\t" (TAB). Type: String — Default value: `` |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
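The mapping-file logic described above can be sketched in a few lines. This is an illustrative re-implementation, not the DKPro code; the class name is made up and the file is represented as a list of lines for simplicity:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the mapping-file logic: lines starting with the comment marker are
// skipped, remaining lines are split at the separator, and every token
// matching a first-column entry is replaced by the second-column entry.
public class DictionaryTransformerSketch {
    public static Map<String, String> readMappings(List<String> lines, String commentMarker,
            String separator) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.startsWith(commentMarker) || line.isBlank()) {
                continue;
            }
            String[] cols = line.split(separator, 2);
            map.put(cols[0], cols[1]);
        }
        return map;
    }

    public static List<String> transform(List<String> tokens, Map<String, String> mappings) {
        return tokens.stream().map(t -> mappings.getOrDefault(t, t)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> m = readMappings(
                List.of("# internet slang", "u\tyou", "gr8\tgreat"), "#", "\t");
        System.out.println(transform(List.of("u", "r", "gr8"), m));
    }
}
```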
Expressive Lengthening Normalizer
Takes a text and shortens extra long words
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
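One common shortening rule for expressive lengthening is to collapse runs of three or more repeated characters down to two. Note this rule is an assumption for illustration; the component's exact normalization strategy is not documented here:

```java
// Sketch of one common shortening rule (an assumption -- not necessarily the
// component's exact strategy): collapse any character repeated three or more
// times down to two occurrences.
public class LengtheningSketch {
    public static String shorten(String token) {
        return token.replaceAll("(.)\\1{2,}", "$1$1");
    }

    public static void main(String[] args) {
        System.out.println(shorten("cooooolll")); // "cooll"
    }
}
```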
File-based Token Transformer
Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
ignoreCase |
Type: Boolean — Default value: |
modelLocation |
Type: String |
replacement |
Type: String |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Hyphenation Remover
Simple dictionary-based hyphenation remover.
modelEncoding |
Type: String — Default value: |
modelLocation |
Type: String |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
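The dictionary-based idea can be sketched as follows: a word broken across a line break is rejoined only when the joined form occurs in the dictionary. This is a hypothetical illustration (class name and pattern are assumptions), not the component's actual implementation:

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of dictionary-based dehyphenation: "hyphen-\nation" is rejoined to
// "hyphenation" only if that joined form is found in the dictionary.
public class HyphenationRemoverSketch {
    private static final Pattern HYPHEN_BREAK = Pattern.compile("(\\w+)-\\r?\\n(\\w+)");

    public static String dehyphenate(String text, Set<String> dictionary) {
        Matcher m = HYPHEN_BREAK.matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String joined = m.group(1) + m.group(2);
            // keep the original text when the joined form is unknown
            m.appendReplacement(sb, Matcher.quoteReplacement(
                    dictionary.contains(joined) ? joined : m.group()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dehyphenate("hyphen-\nation stays", Set.of("hyphenation")));
    }
}
```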
Regex-based Token Transformer
A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.
The parameter #PARAM_REGEX defines the regular expression to be searched for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.
regex |
Define the regular expression to be replaced Type: String |
replacement |
Define the string to replace matching tokens with Type: String |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
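The per-token replacement described above can be sketched as follows (illustrative Python, not the component's code; the regex and replacement values in the example are hypothetical):

```python
import re

def transform_tokens(tokens, regex, replacement):
    # Apply the replacement to the text of each token, analogous to
    # PARAM_REGEX / PARAM_REPLACEMENT (the values below are examples).
    pattern = re.compile(regex)
    return [pattern.sub(replacement, t) for t in tokens]

print(transform_tokens(["colour", "favourite"], r"our", "or"))  # -> ['color', 'favorite']
```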
Replacement File Normalizer
Takes a text and replaces desired expressions. This class does not operate on tokens, since an expression may span several tokens.
modelEncoding |
The character encoding used by the model. Type: String — Default value: |
modelLocation |
Location of a file which contains all replacing characters Type: String |
srcExpressionSurroundings |
Type: String — Default value: |
targetExpressionSurroundings |
Type: String — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Sharp S (ß) Normalizer
Takes a text and replaces sharp s
minFrequencyThreshold |
Type: Integer — Default value: |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
none specified |
---|---|
Outputs |
none specified |
Languages |
de |
Spelling Normalizer
Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
Stanford Penn Treebank Normalizer
Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Token Case Transformer
Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.
tokenCase |
The case to convert tokens to:
Type: String |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
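The 'normal case' rule follows directly from the description above; a Python sketch of that rule (not the component's code):

```python
def normal_case(token):
    # Lowercase everything except the first character and any character
    # immediately following a hyphen.
    return "".join(
        ch if i == 0 or token[i - 1] == "-" else ch.lower()
        for i, ch in enumerate(token)
    )

print(normal_case("MERCEDES-BENZ"))  # -> "Mercedes-Benz"
```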
Umlaut Normalizer
Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.
minFrequencyThreshold |
Type: Integer — Default value: |
typesToCopy |
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted. Type: String[] — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
de |
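The frequency-based decision can be illustrated as follows. This is a simplified Python sketch: the `freq` dictionary is a hypothetical stand-in for the frequency model, and the real component's comparison logic may differ:

```python
def normalize_umlauts(token, freq, min_frequency=1):
    # Replace "ae"/"oe"/"ue" with the umlaut form only if the umlaut
    # variant passes the frequency threshold.
    candidate = (token.replace("ae", "ä")
                      .replace("oe", "ö")
                      .replace("ue", "ü"))
    if candidate != token and freq.get(candidate, 0) >= min_frequency:
        return candidate
    return token

counts = {"häuser": 120}  # hypothetical corpus counts
print(normalize_umlauts("haeuser", counts))  # -> "häuser"
print(normalize_umlauts("steuer", counts))   # -> "steuer" ("stüer" is not frequent)
```

The second call shows why the frequency check matters: "steuer" contains "ue" but is not a misspelled umlaut, and the implausible variant "stüer" never reaches the threshold.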
Other
Component | Description |
---|---|
Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words. |
|
Annotates compound parts and linking morphemes. |
|
This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). |
|
Count unigrams and bigrams in a collection. |
|
N-gram annotator. |
|
Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech. |
|
Maps existing POS tags from one tagset to another using a user provided properties file. |
|
Annotate phrases in a sentence. |
|
Assign a set of popular readability scores to the text. |
|
Remove every token that does or does not match a given regular expression. |
|
This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. |
|
Creates SofaChangeAnnotations containing corrections for previously identified spelling errors. |
|
Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. |
|
Can be used to measure how long the processing between two points in a pipeline takes. |
|
This component adds Tfidf annotations consisting of a term and a tfidf weight. |
|
This consumer builds a DfModel. |
|
Removes trailing characters (or character sequences) from tokens, e.g. punctuation. |
|
Utility analysis engine for use with CAS multipliers in uimaFIT pipelines. |
Annotation-By-Text Filter
Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
ignoreCase |
If true, annotation texts are matched case-insensitively, i.e. tokens that occur in the word list with different casing are still retained. Default: true. Type: Boolean — Default value: |
modelEncoding |
Type: String — Default value: |
modelLocation |
Type: String |
typeName |
Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. Type: String — Default value: |
Compound Annotator
Annotates compound parts and linking morphemes.
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Corrections Contextualizer
This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.
Frequency Count Writer
Count unigrams and bigrams in a collection.
compression |
Choose a compression method. (default: CompressionMethod#NONE) Optional — Type: String — Default value: |
coveringType |
Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences. Optional — Type: String |
escapeDocumentId |
URL-encode the document ID in the file name to avoid illegal characters (e.g. \, :, etc.) Type: Boolean — Default value: |
featurePath |
The feature path. Default: tokens. Optional — Type: String |
filterRegex |
Type: String — Default value: `` |
lowercase |
If true, all tokens are lowercased. Type: Boolean — Default value: |
minCount |
Tokens occurring fewer times than this value are omitted. Default: 5. Type: Integer — Default value: |
overwrite |
Allow overwriting target files (ignored when writing to ZIP archives). Type: Boolean — Default value: |
regexReplacement |
Type: String — Default value: `` |
singularTarget |
Treat target location as a single file name. This is particularly useful if only a single input file is processed and the result should be written to a pre-defined output file instead of deriving the file name from the document URI or document ID. It can also be useful if the user wishes to force multiple input files to be written to a single target file. The latter case does not work for all formats (e.g. binary, XMI, etc.), but can be useful, e.g. for Conll-based formats. This option has no effect if the target location points to an archive location (ZIP/JAR). The #PARAM_COMPRESSION is respected, but does not automatically add an extension. The #PARAM_STRIP_EXTENSION has no effect as the original extension is not preserved. Type: Boolean — Default value: |
sortByAlphabet |
If true, sort output alphabetically. Type: Boolean — Default value: |
sortByCount |
If true, sort output by count (descending order). Type: Boolean — Default value: |
stopwordsFile |
Type: String — Default value: `` |
stopwordsReplacement |
Type: String — Default value: `` |
stripExtension |
Remove the original extension. Type: Boolean — Default value: |
targetLocation |
Target location. If this parameter is not set, data is written to stdout. Optional — Type: String |
useDocumentId |
Use the document ID as file name even if a relative path information is present. Type: Boolean — Default value: |
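The counting logic can be sketched as follows (an illustrative Python sketch: per-sentence bigram counting stands in for the coveringType behavior, and the minCount pruning is shown explicitly):

```python
from collections import Counter

def count_ngrams(sentences, min_count=1, lowercase=False):
    # Count unigrams and bigrams; bigrams never cross sentence
    # boundaries (the effect of the coveringType parameter), and
    # counts below min_count are dropped (the minCount parameter).
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        if lowercase:
            tokens = [t.lower() for t in tokens]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    prune = lambda c: Counter({k: n for k, n in c.items() if n >= min_count})
    return prune(unigrams), prune(bigrams)

uni, bi = count_ngrams([["the", "dog"], ["the", "cat"]])
print(uni["the"], bi[("the", "dog")])  # -> 2 1
```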
N-Gram Annotator
N-gram annotator.
N |
The length of the n-grams to generate (the "n" in n-gram). Type: Integer — Default value: |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
POS Filter
Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
adj |
Keep/remove adjectives (true: keep, false: remove) Type: Boolean — Default value: |
adp |
Keep/remove adpositions (true: keep, false: remove) Type: Boolean — Default value: |
adv |
Keep/remove adverbs (true: keep, false: remove) Type: Boolean — Default value: |
aux |
Keep/remove auxiliary verbs (true: keep, false: remove) Type: Boolean — Default value: |
conj |
Keep/remove conjunctions (true: keep, false: remove) Type: Boolean — Default value: |
det |
Keep/remove articles (true: keep, false: remove) Type: Boolean — Default value: |
intj |
Keep/remove interjections (true: keep, false: remove) Type: Boolean — Default value: |
noun |
Keep/remove nouns (true: keep, false: remove) Type: Boolean — Default value: |
num |
Keep/remove numerals (true: keep, false: remove) Type: Boolean — Default value: |
part |
Keep/remove particles (true: keep, false: remove) Type: Boolean — Default value: |
pron |
Keep/remove pronouns (true: keep, false: remove) Type: Boolean — Default value: |
propn |
Keep/remove proper nouns (true: keep, false: remove) Type: Boolean — Default value: |
punct |
Keep/remove punctuation (true: keep, false: remove) Type: Boolean — Default value: |
sconj |
Keep/remove subordinating conjunctions (true: keep, false: remove) Type: Boolean — Default value: |
sym |
Keep/remove symbols (true: keep, false: remove) Type: Boolean — Default value: |
typeToRemove |
The fully qualified name of the type that should be filtered. Type: String |
verb |
Keep/remove verbs (true: keep, false: remove) Type: Boolean — Default value: |
x |
Keep/remove other (true: keep, false: remove) Type: Boolean — Default value: |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
POS Mapper
Maps existing POS tags from one tagset to another using a user provided properties file.
dkproMappingLocation |
A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes. Optional — Type: String |
mappingFile |
A properties file containing POS tagset mappings. Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Phrase Annotator
Annotate phrases in a sentence. Depending on the provided unigrams and the threshold, these comprise either one or two annotations (tokens, lemmas, ...).
In order to identify longer phrases, run the FrequencyCounter and this annotator multiple times, each time taking the results of the previous run as input. From the second run on, set phrases in the feature path parameter #PARAM_FEATURE_PATH.
PARAM_LOWERCASE |
If true, lowercase everything. Type: Boolean — Default value: |
coveringType |
Set this parameter if bigrams should only be counted when occurring within a covering type, e.g. sentences. Optional — Type: String |
discount |
The discount used to prevent the formation of phrases consisting of very infrequent words. A typical value is the minimum count set during model creation (FrequencyCounter#PARAM_MIN_COUNT), which defaults to 5. Type: Integer — Default value: |
featurePath |
The feature path to use for building bigrams. Default: tokens. Optional — Type: String |
filterRegex |
Type: String — Default value: `` |
modelLocation |
The file providing the unigram and bigram counts to use. Type: String |
regexReplacement |
Type: String — Default value: `` |
stopwordsFile |
Type: String — Default value: `` |
stopwordsReplacement |
Type: String — Default value: `` |
threshold |
The threshold score for phrase construction. Default is 100. Lower values result in more phrases. The value strongly depends on the size of the corpus and the token unigrams. Type: Float — Default value: |
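One common formulation of such a score is the word2vec-style phrase score shown below; this is an illustration of how discount and threshold interact, not necessarily the component's exact formula, and the counts in the example are hypothetical:

```python
def phrase_score(bigram_count, count_a, count_b, discount, vocab_size):
    # Bigrams scoring above the threshold are merged into a phrase;
    # the discount penalizes bigrams built from rare words.
    return (bigram_count - discount) * vocab_size / (count_a * count_b)

# hypothetical counts for a collocation such as "new york"
score = phrase_score(bigram_count=80, count_a=100, count_b=90,
                     discount=5, vocab_size=20_000)
print(score > 100)  # -> True: the pair would be merged at threshold 100
```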
Regex Token Filter
Remove every token that does or does not match a given regular expression.
mustMatch |
If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed. Type: Boolean — Default value: |
regex |
Every token that does or does not match this regular expression will be removed. Type: String |
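The filtering behavior can be sketched as follows (illustrative Python; whether a "match" means matching the whole token text or any substring is an assumption of this sketch, which uses whole-token matching):

```python
import re

def filter_tokens(tokens, regex, must_match=True):
    # Keep tokens whose text matches the regex (must_match=True),
    # or drop the matching ones (must_match=False).
    pattern = re.compile(regex)
    return [t for t in tokens if bool(pattern.fullmatch(t)) == must_match]

print(filter_tokens(["abc", "123", "a1"], r"\d+"))                    # -> ['123']
print(filter_tokens(["abc", "123", "a1"], r"\d+", must_match=False))  # -> ['abc', 'a1']
```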
Semantic Field Annotator
This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.
annotationType |
Annotation types which should be annotated with semantic fields Type: String |
constraint |
A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Simple Spelling Corrector
Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
Stop Word Remover
Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.
Paths |
Feature paths for annotations that should be matched/removed. The default is: StopWord.class.getName(), Token.class.getName(), Lemma.class.getName() + "/value". Optional — Type: String[] |
StopWordType |
Anything annotated with this type will be removed even if it does not match any word in the lists. Optional — Type: String |
modelEncoding |
The character encoding used by the model. Type: String — Default value: |
modelLocation |
A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt" Type: String[] |
Inputs |
|
---|---|
Outputs |
none specified |
Languages |
none specified |
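The language-aware lookup can be sketched as follows (illustrative Python; `stopword_lists` is a hypothetical stand-in for the lists loaded from #PARAM_MODEL_LOCATION, with language codes mirroring the "[de]…" prefix convention):

```python
def remove_stopwords(tokens, stopword_lists, language):
    # stopword_lists maps a language code (or "*" for all languages)
    # to a set of words; matching is case-insensitive here.
    stop = set()
    for lang, words in stopword_lists.items():
        if lang in ("*", language):
            stop |= {w.lower() for w in words}
    return [t for t in tokens if t.lower() not in stop]

lists = {"*": {"the"}, "de": {"der", "die", "das"}}
print(remove_stopwords(["Der", "Hund", "the", "dog"], lists, "de"))  # -> ['Hund', 'dog']
```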
Stopwatch
Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.
timerName |
Name of the timer pair. Upstream and downstream timer need to use the same name. Type: String |
timerOutputFile |
Name of the output file to which the measured times are written. Optional — Type: String |
Inputs |
|
---|---|
Outputs |
|
Languages |
none specified |
TF/IDF Annotator
This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the annotation type and string representation. It uses a pre-serialized DfStore, which can be created using the TfidfConsumer.
featurePath |
This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path. Type: String |
lowercase |
If set to true, the whole text is handled in lower case. Optional — Type: Boolean — Default value: |
tfdfPath |
Provide the path to the Df-Model. When a shared SharedDfModel is bound to this annotator, this is ignored. Optional — Type: String |
weightingModeIdf |
The model for inverse document frequency weighting. Default value is "NORMAL" yielding an unweighted idf. Optional — Type: String — Default value: |
weightingModeTf |
The model for term frequency weighting. Default value is "NORMAL" yielding an unweighted tf. Optional — Type: String — Default value: |
Inputs |
none specified |
---|---|
Outputs |
|
Languages |
none specified |
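The relationship between the df model and the resulting weights can be illustrated as follows. This Python sketch is not the DKPro implementation: the log-idf shown is one common weighting, whereas the component's "NORMAL" modes leave tf and idf unweighted:

```python
import math
from collections import Counter

def build_df_model(documents):
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df, len(documents)

def tfidf_weights(doc, df, n_docs):
    # tf * log(N / df): terms occurring in every document get weight 0.
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

docs = [["a", "b", "a"], ["b", "c"]]
df, n = build_df_model(docs)
print(tfidf_weights(docs[0], df, n))  # "b" occurs in every document -> weight 0.0
```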
TF/IDF Model Writer
This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.
featurePath |
This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path. Type: String |
lowercase |
If set to true, the whole text is handled in lower case. Type: Boolean — Default value: |
targetLocation |
Specifies the path and filename where the model file is written. Type: String |
Trailing Character Remover
Removes trailing characters (or character sequences) from tokens, e.g. punctuation.
minTokenLength |
All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed. Shorter tokens that do not have trailing chars removed are always retained, regardless of their length. Type: Integer — Default value: |
pattern |
A regex to be trimmed from the end of tokens. Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (remove punctuations, special characters and capital letters). Type: String — Default value: |
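The trimming and length-filter interplay can be sketched as follows (illustrative Python; the default pattern below is a simplified stand-in for the component's default, which also strips special characters, digits, and capital letters):

```python
import re

def remove_trailing(tokens, pattern=r"[.,!?\"')]+", min_token_length=1):
    # Trim the trailing pattern from each token. Tokens that fall below
    # min_token_length after trimming are dropped entirely; tokens with
    # nothing to trim are always retained, regardless of length.
    trailing = re.compile("(?:" + pattern + ")$")
    result = []
    for token in tokens:
        trimmed = trailing.sub("", token)
        if trimmed == token:
            result.append(token)
        elif len(trimmed) >= min_token_length:
            result.append(trimmed)
    return result

print(remove_trailing(["end.", "!!", "ok"]))  # -> ['end', 'ok']
```

With the default minimum length of 1, tokens consisting only of trailing characters ("!!") are removed completely, matching the behavior described above.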