The document provides detailed information about the DKPro Core UIMA components.
Analytics components
The component names of this overview table were lost during extraction; the descriptions are:

- Removes annotations that do not conform to minimum or maximum length constraints.
- Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
- Applies changes annotated using a SofaChangeAnnotation.
- Wrapper for Twitter Tokenizer and POS Tagger.
- ArkTweet tokenizer.
- After processing a file with the ApplyChangesAnnotator, this annotator can be used to map the annotations created in the cleaned view back to the original view.
- Berkeley Parser annotator.
- BreakIterator segmenter.
- Split up existing tokens again if they are camel-case text.
- Takes a text and replaces wrong capitalization.
- Converts traditional Chinese to simplified Chinese or vice versa.
- Lemmatizer using Clear NLP.
- Clear parser annotator.
- Part-of-Speech annotator using Clear NLP.
- Tokenizer using Clear NLP.
- ClearNLP semantic role labeller.
- Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec.
- Annotates compound parts and linking morphemes.
- This component assumes that some spell checker has already been applied upstream (e.g. …).
- Takes a plain text file with phrases as input and annotates the phrases in the CAS file.
- Reads a tab-separated file containing mappings from one token to another.
- Double-Metaphone phonetic transcription based on Apache Commons Codec.
- Takes a text and shortens extra-long words.
- Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
- Wrapper for the GATE rule-based lemmatizer.
- Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset.
- GATE Hepple part-of-speech tagger.
- Part-of-Speech annotator using HunPos.
- Simple dictionary-based hyphenation remover.
- Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.
- JTok segmenter.
- This annotator uses Jazzy for the decision whether a word is spelled correctly or not.
- Langdetect language identifier based on character n-grams.
- Language detector based on n-gram frequency counts, e.g. as provided by Web1T.
- Detection based on character n-grams.
- Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
- Naive lexicon-based lemmatizer.
- Segmenter using LanguageTool to do the heavy lifting.
- Annotates each line in the source text as a sentence.
- Estimate an LDA topic model using Mallet and write it to a file.
- Infers the topic distribution over documents using a Mallet ParallelTopicModel.
- Dependency parsing using MaltParser.
- DKPro Annotator for the MateToolsLemmatizer.
- DKPro Annotator for the MateToolsMorphTagger.
- DKPro Annotator for the MateToolsParser.
- DKPro Annotator for the MateToolsPosTagger.
- DKPro Annotator for the MateTools Semantic Role Labeler.
- Annotator for the MeCab Japanese POS Tagger.
- Metaphone phonetic transcription based on Apache Commons Codec.
- Lemmatize based on a finite-state machine.
- Dependency parsing using MSTParser.
- N-gram annotator.
- Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.
- Chunk annotator using OpenNLP.
- OpenNLP name finder wrapper.
- OpenNLP parser.
- Part-of-Speech annotator using OpenNLP.
- Tokenizer and sentence splitter using OpenNLP.
- This class creates paragraph annotations for the given input document.
- Split up existing tokens again at particular split-chars.
- Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
- Maps existing POS tags from one tagset to another using a user-provided properties file.
- Assign a set of popular readability scores to the text.
- A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.
- Remove every token that does or does not match a given regular expression.
- This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
- Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.
- Rftagger morphological analyzer.
- This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource.
- Sfst morphological analyzer.
- Takes a text and replaces sharp s.
- UIMA wrapper for the Snowball stemmer.
- Soundex phonetic transcription based on Apache Commons Codec.
- Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
- No description
- Converts a constituency structure into a dependency structure.
- Stanford Lemmatizer component.
- Stanford Named Entity Recognizer component.
- Stanford Parser component.
- Stanford Part-of-Speech tagger component.
- Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style.
- No description
- Remove all of the specified types from the CAS if their covered text is in the stop word dictionary.
- Can be used to measure how long the processing between two points in a pipeline takes.
- This component adds Tfidf annotations consisting of a term and a tfidf weight.
- This consumer builds a DfModel.
- Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.
- Merges any Tokens that are covered by a given annotation type.
- Remove prefixes and suffixes from tokens.
- Removes trailing character sequences from tokens, e.g. punctuation.
- Chunk annotator using TreeTagger.
- Part-of-Speech and lemmatizer annotator using TreeTagger.
- Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts, depending on a frequency model.
- A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.
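The 'normal case' rule described above (lowercase everything but the first character of a token and the characters immediately following a hyphen) can be sketched generically; this is an illustration of the rule, not the actual DKPro component code:

```java
public class NormalCase {
    // 'Normal case': lowercase every character except the first one
    // and any character immediately following a hyphen. Characters in
    // those positions keep their original case.
    public static String normalCase(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean keepCase = (i == 0) || token.charAt(i - 1) == '-';
            sb.append(keepCase ? c : Character.toLowerCase(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalCase("MERRY-GO-ROUND")); // Merry-Go-Round
        System.out.println(normalCase("HELLO"));          // Hello
    }
}
```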
Checker
Component | Description |
---|---|
JazzyChecker | This annotator uses Jazzy for the decision whether a word is spelled correctly or not. |
LanguageToolChecker | Detect grammatical errors in text using LanguageTool, a rule-based grammar checker. |
JazzyChecker
Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.JazzyChecker
This annotator uses Jazzy for the decision whether a word is spelled correctly or not.
Parameters
ScoreThreshold
(Integer) =1
-
Determines the maximum edit distance (as an int value) that a suggestion for a spelling error may have. E.g., if set to 1, suggestions are limited to words within edit distance 1 of the original word.
modelEncoding
(String) =UTF-8
-
The character encoding used by the model.
modelLocation
(String)-
Location from which the model is read. The model file is a simple word-list with one word per line.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
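The ScoreThreshold parameter above is an edit-distance cutoff. As a generic illustration of the Levenshtein (edit) distance it refers to (not Jazzy's actual implementation), the classic dynamic-programming formulation is:

```java
public class EditDistance {
    // Levenshtein distance: the minimum number of insertions,
    // deletions, and substitutions needed to turn a into b.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // With ScoreThreshold=1, a suggestion like "word" (distance 1
        // from the misspelling "wird") would pass the filter, while
        // more distant words would not.
        System.out.println(distance("wird", "word"));      // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```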
LanguageToolChecker
Role: Checker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolChecker
Detect grammatical errors in text using LanguageTool, a rule-based grammar checker.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
Chunker
Component | Description |
---|---|
OpenNlpChunker | Chunk annotator using OpenNLP. |
TreeTaggerChunker | Chunk annotator using TreeTagger. |
OpenNlpChunker
Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpChunker
Chunk annotator using OpenNLP.
Parameters
ChunkMappingLocation
(String) [optional]-
Load the chunk tag to UIMA type mapping from this location instead of locating the mapping automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
20100908.1 |
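The internTags parameter above relies on String#intern(), which maps equal strings to one shared canonical instance, so thousands of tag annotations end up referencing only a handful of String objects. A small self-contained demonstration:

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two distinct String objects with equal content, as would be
        // produced when a tagger emits the same tag for many tokens.
        String tag1 = new String("NP");
        String tag2 = new String("NP");

        // Separate heap objects, even though the content is equal.
        System.out.println(tag1 == tag2);                   // false

        // intern() returns the single canonical copy from the string
        // pool, so both references now point to the same object.
        System.out.println(tag1.intern() == tag2.intern()); // true
    }
}
```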
TreeTaggerChunker
Role: Chunker
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker
Chunk annotator using TreeTagger.
Parameters
ChunkMappingLocation
(String) [optional]-
Location of the mapping file for chunk tags to UIMA types.
executablePath
(String) [optional]-
Use this TreeTagger executable instead of trying to locate the executable automatically.
flushSequence
(String) [optional]-
A sequence to flush the internal TreeTagger buffer and force it to output the rest of the completed analysis. This is typically just a sequence of 5-10 full stops (".") separated by newline characters. However, some models may require a different flush sequence, e.g. a short sentence in the respective language. For chunker models, mind that the sentence must also be POS tagged, e.g. Nous-PRO:PER\n....
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
performanceMode
(Boolean) =false
-
TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20110429.1 |
|
en |
20090824.1 |
|
en |
20140520.1 |
|
fr |
20141218.2 |
Coreference resolver
Component | Description |
---|---|
StanfordCoreferenceResolver | No description |
StanfordCoreferenceResolver
Role: Coreference resolver
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordCoreferenceResolver
Parameters
maxDist
(Integer) =-1
-
DCoRef parameter: Maximum sentence distance between two mentions for resolution (-1: no constraint on the distance)
postprocessing
(Boolean) =false
-
DCoRef parameter: Do post processing
score
(Boolean) =false
-
DCoRef parameter: Scoring the output of the system
sieves
(String) =MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, RelaxedHeadMatch, PronounMatch
-
DCoRef parameter: Sieve passes - each class is defined in dcoref/sievepasses/.
singleton
(Boolean) =true
-
DCoRef parameter: setting singleton predictor
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
${core.version}.1 |
Language Identifier
Component | Description |
---|---|
LangDetectLanguageIdentifier | Langdetect language identifier based on character n-grams. |
LanguageDetectorWeb1T | Language detector based on n-gram frequency counts, e.g. as provided by Web1T. |
LanguageIdentifier | Detection based on character n-grams. |
LangDetectLanguageIdentifier
Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.langdetect-asl
Class: de.tudarmstadt.ukp.dkpro.core.langdetect.LangDetectLanguageIdentifier
Langdetect language identifier based on character n-grams.
Parameters
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of the model. Used to address a specific model if there are multiple models for one language.
Models
Language | Variant | Version |
---|---|---|
any |
20141013.1 |
|
any |
20141013.1 |
LanguageDetectorWeb1T
Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ldweb1t-asl
Class: de.tudarmstadt.ukp.dkpro.core.ldweb1t.LanguageDetectorWeb1T
Language detector based on n-gram frequency counts, e.g. as provided by Web1T
Parameters
maxNGramSize
(Integer) =3
-
The maximum n-gram size that should be considered. Default is 3.
minNGramSize
(Integer) =1
-
The minimum n-gram size that should be considered. Default is 1.
LanguageIdentifier
Role: Language Identifier
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textcat-asl
Class: de.tudarmstadt.ukp.dkpro.core.textcat.LanguageIdentifier
Detection based on character n-grams. Uses the Java Text Categorizing Library based on a technique by Cavnar and Trenkle.
References:
- Cavnar, W. B. and J. M. Trenkle (1994). N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
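The Cavnar & Trenkle technique cited above builds a frequency-ranked profile of character n-grams per language and classifies text by profile similarity. A minimal sketch of the profile-building step (a generic illustration, not the Java Text Categorizing Library's implementation):

```java
import java.util.*;
import java.util.stream.Collectors;

public class NGramProfile {
    // Extract all character n-grams of the given sizes from the text
    // (whitespace replaced by '_' and the text padded, as is common
    // in this technique) and return the `top` most frequent ones.
    public static List<String> profile(String text, int minN, int maxN, int top) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry::getKey))
                .limit(top)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Profiles of samples from different languages differ in their
        // top-ranked n-grams; classification picks the language whose
        // stored profile is closest to the document's profile.
        System.out.println(profile("the quick brown fox", 1, 3, 5));
    }
}
```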
Lemmatizer
Component | Description |
---|---|
ClearNlpLemmatizer | Lemmatizer using Clear NLP. |
GateLemmatizer | Wrapper for the GATE rule-based lemmatizer. |
LanguageToolLemmatizer | Naive lexicon-based lemmatizer. |
MateLemmatizer | DKPro Annotator for the MateToolsLemmatizer. |
MorphaLemmatizer | Lemmatize based on a finite-state machine. |
StanfordLemmatizer | Stanford Lemmatizer component. |
ClearNlpLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpLemmatizer
Lemmatizer using Clear NLP.
Parameters
language
(String) =en
[optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
20130715.0 |
GateLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.GateLemmatizer
Wrapper for the GATE rule based lemmatizer. Based on code by Asher Stern from the BIUTEE textual entailment tool.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
LanguageToolLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolLemmatizer
Naive lexicon-based lemmatizer. The words are looked up using the wordform lexicons of LanguageTool. Multiple readings are produced. The annotator simply takes the most frequent lemma from those readings. If no readings could be found, the original text is assigned as lemma.
Parameters
sanitize
(Boolean) =true
sanitizeChars
(String[]) =[(, ), [, ]]
Inputs and outputs
Inputs |
|
---|---|
Outputs |
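The selection strategy described above (take the most frequent lemma among the lexicon readings, fall back to the original text when there are no readings) can be sketched generically; this is an illustration of the strategy, not LanguageToolLemmatizer's actual code:

```java
import java.util.*;

public class MostFrequentLemma {
    // Pick the lemma occurring most often among the readings returned
    // by a lexicon lookup; fall back to the original token text when
    // no readings were found.
    public static String select(String tokenText, List<String> readings) {
        if (readings == null || readings.isEmpty()) {
            return tokenText;
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String lemma : readings) {
            counts.merge(lemma, 1, Integer::sum);
        }
        // Collections.max keeps the first maximum in insertion order,
        // so ties resolve deterministically.
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        System.out.println(select("saw", List.of("see", "saw", "see"))); // see
        System.out.println(select("foo", List.of()));                    // foo
    }
}
```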
MateLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateLemmatizer
DKPro Annotator for the MateToolsLemmatizer.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
uppercase
(Boolean) =false
-
Try reconstructing proper casing for lemmata. This is useful for German, but e.g. for English creates odd results.
variant
(String) [optional]-
Override the default variant used to locate the model.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20121024.1 |
|
en |
20130117.1 |
|
es |
20130117.1 |
|
fr |
20130918.0 |
MorphaLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.morpha-asl
Class: de.tudarmstadt.ukp.dkpro.core.morpha.MorphaLemmatizer
Lemmatize based on a finite-state machine. Uses the Java port of Morpha.
References:
- Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English, Natural Language Engineering, 7(3). 207-223.
Parameters
readPOS
(Boolean) =false
-
Pass part-of-speech information on to Morpha. Since we currently do not know in which format the part-of-speech tags are expected by Morpha, we just pass on the actual pos tag value we get from the token. This may produce worse results than not passing on pos tags at all, so this is disabled by default.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
StanfordLemmatizer
Role: Lemmatizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordLemmatizer
Stanford Lemmatizer component. The Stanford Morphology-class computes the base form of English words, by removing just inflections (not derivational morphology). That is, it only does noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex and publicly available. See: http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html
This only works for ENGLISH.
Parameters
ptb3Escaping
(Boolean) =true
-
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).
quoteBegin
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.
quoteEnd
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
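To make "removing just inflections (not derivational morphology)" concrete, here is a deliberately tiny toy that strips a few regular English endings. The actual Stanford Morphology class is a finite-state transducer covering far more (irregular forms, pronoun case, etc.); this sketch is NOT its implementation, only an illustration of inflection-only normalization:

```java
public class InflectionToy {
    // Toy inflection stripper: handles a few regular noun plurals and
    // verb endings only. Length guards avoid mangling short words.
    public static String lemma(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(lemma("parties")); // party
        System.out.println(lemma("walked"));  // walk
        System.out.println(lemma("dogs"));    // dog
    }
}
```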
Morphological analyzer
Component | Description |
---|---|
RfTagger | Rftagger morphological analyzer. |
SfstAnnotator | Sfst morphological analyzer. |
RfTagger
Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.rftagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.rftagger.RfTagger
Rftagger morphological analyzer.
Parameters
MorphMappingLocation
(String) [optional]
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelEncoding
(String) [optional]-
The character encoding used by the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Write the tag set(s) to the log when a model is loaded.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
cz |
20150728.1 |
|
de |
20150928.1 |
|
hu |
20150728.1 |
|
ru |
20150728.1 |
|
sk |
20150728.1 |
|
sl |
20150728.1 |
SfstAnnotator
Role: Morphological analyzer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.sfst-gpl
Class: de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator
Sfst morphological analyzer.
Parameters
MorphMappingLocation
(String) [optional]
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
mode
(String) =FIRST
modelEncoding
(String) =UTF-8
-
Specifies the model encoding.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Write the tag set(s) to the log when a model is loaded.
writeLemma
(Boolean) =true
-
Write lemma information. Default: true
writePOS
(Boolean) =true
-
Write part-of-speech information. Default: true
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20110202.1 |
|
de |
20140801.1 |
|
de |
20140521.1 |
|
de |
20140521.1 |
|
it |
20090223.1 |
|
tr |
20130219.1 |
Named Entity Recognizer
Component | Description |
---|---|
OpenNlpNamedEntityRecognizer | OpenNLP name finder wrapper. |
StanfordNamedEntityRecognizer | Stanford Named Entity Recognizer component. |
OpenNlpNamedEntityRecognizer
Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer
OpenNLP name finder wrapper.
Parameters
NamedEntityMappingLocation
(String) [optional]-
Location of the mapping file for named entity tags to UIMA types.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) =person
-
Variant of the model. Used to address a specific model if there are multiple models for one language.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20100907.0 |
|
en |
20130624.1 |
|
en |
20100907.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
es |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
|
nl |
20100908.0 |
StanfordNamedEntityRecognizer
Role: Named Entity Recognizer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer
Stanford Named Entity Recognizer component.
Parameters
NamedEntityMappingLocation
(String) [optional]-
Location of the mapping file for named entity tags to UIMA types.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of the model. Used to address a specific model if there are multiple models for one language.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded.
ptb3Escaping
(Boolean) =true
-
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).
quoteBegin
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.
quoteEnd
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20150130.1 |
|
de |
20150130.1 |
|
en |
20160110.0 |
|
en |
20150420.1 |
|
en |
20160110.1 |
|
en |
20160110.0 |
|
en |
20150420.1 |
|
en |
20160110.1 |
|
en |
20150129.0 |
|
en |
20150129.1 |
|
en |
20160110.1 |
|
en |
20160110.0 |
|
en |
20160110.0 |
|
es |
20140826.1 |
Parser
Component | Description |
---|---|
BerkeleyParser | Berkeley Parser annotator. |
ClearNlpParser | Clear parser annotator. |
MaltParser | Dependency parsing using MaltParser. |
MateParser | DKPro Annotator for the MateToolsParser. |
MstParser | Dependency parsing using MSTParser. |
OpenNlpParser | OpenNLP parser. |
StanfordParser | Stanford Parser component. |
BerkeleyParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.berkeleyparser-gpl
Class: de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser
Berkeley Parser annotator. Requires sentences to be annotated beforehand.
Parameters
ConstituentMappingLocation
(String) [optional]-
Location of the mapping file for constituent tags to UIMA types.
POSMappingLocation
(String) [optional]-
Location of the mapping file for part-of-speech tags to UIMA types.
accurate
(Boolean) =false
-
Set thresholds for accuracy.
Default: false (set thresholds for efficiency)
binarize
(Boolean) =false
-
Output binarized trees.
Default: false
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
keepFunctionLabels
(Boolean) =false
-
Retain predicted function labels. Model must have been trained with function labels.
Default: false
language
(String) [optional]-
Use this language instead of the language set in the CAS to locate the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
readPOS
(Boolean) =true
-
Sets whether to use already existing POS tags from another annotator for the parsing process.
Default: true
scores
(Boolean) =false
-
Output inside scores (only for binarized viterbi trees).
Default: false
substates
(Boolean) =false
-
Output sub-categories (only for binarized Viterbi trees).
Default: false
variational
(Boolean) =false
-
Use variational rule score approximation instead of max-rule
Default: false
viterbi
(Boolean) =false
-
Compute Viterbi derivation instead of max-rule tree.
Default: false (max-rule)
writePOS
(Boolean) =false
-
Sets whether to create POS tags. The creation of constituent tags must be turned on for this to work.
Default: false
writePennTree
(Boolean) =false
-
If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.
Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
ar |
20090917.1 |
|
bg |
20090917.1 |
|
de |
20090917.1 |
|
en |
20100819.1 |
|
fr |
20090917.1 |
|
zh |
20090917.1 |
ClearNlpParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpParser
Clear parser annotator.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of the model. Used to address a specific model if there are multiple models for one language.
printTagSet
(Boolean) =false
-
Write the tag set(s) to the log when a model is loaded.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
MaltParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.maltparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser
Dependency parsing using MaltParser.
Required annotations:
- Token
- Sentence
- POS
- Dependency (annotated over sentence-span)
Parameters
ignoreMissingFeatures
(Boolean) =false
-
Process anyway, even if the model relies on features that are not supported by this component. Default: false
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
bn |
20120905.1 |
|
en |
20120312.1 |
|
en |
20120312.1 |
|
es |
20130220.0 |
|
fa |
20130522.1 |
|
fr |
20120312.1 |
|
pl |
20120904.1 |
|
sv |
20120925.2 |
MateParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateParser
DKPro Annotator for the MateToolsParser.
Please cite the following paper, if you use the parser: Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.
Parameters
DependencyMappingLocation
(String) [optional]-
Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20121024.1 |
|
en |
20130117.2 |
|
es |
20130117.1 |
|
fr |
20130918.0 |
|
zh |
20130117.1 |
MstParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mstparser-asl
Class: de.tudarmstadt.ukp.dkpro.core.mstparser.MstParser
Dependency parsing using MSTParser.
Wrapper for the MSTParser (high memory requirements). More information about the parser can be found on the MSTParser website.
The MSTParser models tend to be very large, e.g. the Eisner model is about 600 MB uncompressed. With this model, parsing a simple sentence with MSTParser requires about 3 GB heap memory.
This component feeds MSTParser only with the FORM (token) and POS (part-of-speech) fields. LEMMA, CPOS, and other columns from the CONLL 2006 format are not generated (cf. mstparser.DependencyInstance).
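In CoNLL-2006 (CoNLL-X) terms, feeding only FORM and POS corresponds to filling just those columns of the ten-column format and leaving the rest underspecified. A rough sketch of such a line (hypothetical token data; this is an illustration of the column layout, not DKPro's internal feeding code):

```java
public class ConllSketch {
    // Emit a CoNLL-2006-style line with only ID, FORM, and POSTAG
    // filled; the remaining columns (LEMMA, CPOSTAG, FEATS, HEAD,
    // DEPREL, PHEAD, PDEPREL) are left as "_".
    public static String line(int id, String form, String pos) {
        return id + "\t" + form + "\t_\t_\t" + pos + "\t_\t_\t_\t_\t_";
    }

    public static void main(String[] args) {
        System.out.println(line(1, "Dogs", "NNS"));
        System.out.println(line(2, "bark", "VBP"));
    }
}
```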
Parameters
DependencyMappingLocation
(String) [optional]-
Load the dependency to UIMA type mapping from this location instead of locating the mapping automatically.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
order
(Integer) [optional]-
Specifies the order/scope of features. 1 only has features over single edges and 2 has features over pairs of adjacent edges in the tree. The model must have been trained with the respective order set here.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
20100416.2 |
|
en |
20121019.2 |
|
hr |
20130527.1 |
|
hr |
20130527.1 |
OpenNlpParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpParser
OpenNLP parser. The parser ignores existing POS tags and internally creates new ones. However, these tags are only added as annotation if explicitly requested via #PARAM_WRITE_POS.
Parameters
ConstituentMappingLocation
(String) [optional]-
Location of the mapping file for constituent tags to UIMA types.
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.
Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded.
Default: false
writePOS
(Boolean) =false
-
Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.
Default: false
writePennTree
(Boolean) =false
-
If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.
Default: false
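The effect of the internTags parameter above can be illustrated with plain Java, independent of DKPro: interning maps equal tag strings to a single shared instance, so thousands of annotations carrying the same tag reference only a few distinct String objects. A minimal stdlib sketch (the method name is illustrative):

```java
public class InternDemo {
    // Simulates tagging: each call allocates a fresh String, as if decoded from a model.
    public static String tag(String raw, boolean intern) {
        String tag = new String(raw);
        return intern ? tag.intern() : tag;
    }

    public static void main(String[] args) {
        // Without interning: two equal but distinct heap objects.
        System.out.println(tag("NN", false) == tag("NN", false)); // false
        // With interning: both calls yield the same pooled instance.
        System.out.println(tag("NN", true) == tag("NN", true));   // true
    }
}
```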
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
en |
20120616.1 |
StanfordParser
Role: Parser
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser
Stanford Parser component.
Parameters
ConstituentMappingLocation
(String) [optional]-
Location of the mapping file for constituent tags to UIMA types.
POSMappingLocation
(String) [optional]-
Location of the mapping file for part-of-speech tags to UIMA types.
annotationTypeToParse
(String) [optional]-
This parameter can be used to override the standard behavior which uses the Sentence annotation as the basic unit for parsing.
If the parameter is set with the name of an annotation type x, the parser will no longer parse Sentence-annotations, but x-Annotations.
Default: null
language
(String) [optional]-
Use this language instead of the document language to resolve the model and tag set mapping.
maxItems
(Integer) =200000
-
Controls when the factored parser considers a sentence to be too complex and falls back to the PCFG parser.
Default: 200000
maxSentenceLength
(Integer) =130
-
Maximum number of tokens in a sentence. Longer sentences are not parsed. This is to avoid out of memory exceptions.
Default: 130
mode
(String) =TREE
[optional]-
Sets the kind of dependencies being created.
Default: DependenciesMode#TREE
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of a model. Used to address a specific model if there are multiple models for one language.
printTagSet
(Boolean) =false
-
Write the tag set(s) to the log when a model is loaded.
ptb3Escaping
(Boolean) =true
-
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).
quoteBegin
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.
quoteEnd
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.
readPOS
(Boolean) =true
-
Sets whether to use or not to use already existing POS tags from another annotator for the parsing process.
Default: true
writeConstituent
(Boolean) =true
-
Sets whether to create or not to create constituent tags. This is required for POS-tagging and lemmatization.
Default: true
writeDependency
(Boolean) =true
-
Sets whether to create or not to create dependency annotations.
Default: true
writePOS
(Boolean) =false
-
Sets whether to create or not to create POS tags. The creation of constituent tags must be turned on for this to work.
Default: false
writePennTree
(Boolean) =false
-
If this parameter is set to true, each sentence is annotated with a PennTree-Annotation, containing the whole parse tree in Penn Treebank style format.
Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
ar |
20150129.1 |
|
ar |
20141031.1 |
|
de |
20150129.1 |
|
de |
20150129.1 |
|
de |
20141031.1 |
|
en |
20150129.1 |
|
en |
20150129.1 |
|
en |
20160110.1 |
|
en |
20140104.1 |
|
en |
20141031.1 |
|
en |
20141031.1 |
|
en |
20150129.1 |
|
en |
20150129.1 |
|
en |
20140104.1 |
|
es |
20150108.1 |
|
es |
20141023.1 |
|
es |
20141023.1 |
|
fr |
20150129.1 |
|
fr |
20160114.1 |
|
fr |
20141023.1 |
|
zh |
20150129.1 |
|
zh |
20150129.1 |
|
zh |
20141023.1 |
|
zh |
20150129.1 |
|
zh |
20150129.1 |
Part-of-speech tagger
Component | Description |
---|---|
Wrapper for Twitter Tokenizer and POS Tagger. |
|
Part-of-Speech annotator using Clear NLP. |
|
GATE Hepple part-of-speech tagger. |
|
Part-of-Speech annotator using HunPos. |
|
DKPro Annotator for the MateToolsMorphTagger. |
|
DKPro Annotator for the MateToolsPosTagger |
|
Annotator for the MeCab Japanese POS Tagger. |
|
Part-of-Speech annotator using OpenNLP. |
|
Stanford Part-of-Speech tagger component. |
|
Part-of-Speech and lemmatizer annotator using TreeTagger. |
ArktweetPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetPosTagger
Wrapper for Twitter Tokenizer and POS Tagger. As described in: Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters In Proceedings of NAACL 2013.
Parameters
POSMappingLocation
(String) [optional]-
Location of the mapping file for part-of-speech tags to UIMA types.
language
(String) [optional]-
Use this language instead of the document language to resolve the model and tag set mapping.
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of a model. Used to address a specific model if there are multiple models for one language.
Models
Language | Variant | Version |
---|---|---|
en |
20120919.1 |
|
en |
20121211.1 |
|
en |
20130723.1 |
ClearNlpPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpPosTagger
Part-of-Speech annotator using Clear NLP. Requires Sentences to be annotated before.
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
dictLocation
(String) [optional]-
Load the dictionary from this location instead of locating the dictionary automatically.
dictVariant
(String) [optional]-
Override the default variant used to locate the dictionary.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the pos-tagging model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the pos-tagging model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
HepplePosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.gate-gpl
Class: de.tudarmstadt.ukp.dkpro.core.gate.HepplePosTagger
GATE Hepple part-of-speech tagger.
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
lexiconLocation
(String) [optional]-
Load the lexicon from this location instead of locating it automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
rulesetLocation
(String) [optional]-
Load the ruleset from this location instead of locating it automatically.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
HunPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.hunpos-asl
Class: de.tudarmstadt.ukp.dkpro.core.hunpos.HunPosTagger
Part-of-Speech annotator using HunPos. Requires Sentences to be annotated before.
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
cs |
20121123.2 |
|
da |
20121123.2 |
|
de |
20121123.2 |
|
en |
20070724.2 |
|
fa |
20140414.0 |
|
hr |
20130509.2 |
|
hu |
20070724.2 |
|
pt |
20121123.2 |
|
pt |
20121123.2 |
|
pt |
20130119.2 |
|
pt |
20110419.2 |
|
ru |
20121123.2 |
|
sl |
20121123.2 |
|
sv |
20100215.2 |
|
sv |
20100927.2 |
MateMorphTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateMorphTagger
DKPro Annotator for the MateToolsMorphTagger.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
MatePosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MatePosTagger
DKPro Annotator for the MateToolsPosTagger
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20121024.1 |
|
en |
20130117.1 |
|
es |
20130117.1 |
|
fr |
20130918.0 |
|
zh |
20130117.1 |
MeCabTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mecab-asl
Class: de.tudarmstadt.ukp.dkpro.core.mecab.MeCabTagger
Annotator for the MeCab Japanese POS Tagger.
Parameters
language
(String) [optional]-
The language.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
jp |
. |
|
jp |
. |
|
jp |
. |
|
jp |
. |
OpenNlpPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger
Part-of-Speech annotator using OpenNLP. Requires Sentences to be annotated before.
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
da |
20120616.1 |
|
da |
20120616.1 |
|
de |
20120616.1 |
|
de |
20120616.1 |
|
en |
20120616.1 |
|
en |
20120616.1 |
|
en |
20131115.1 |
|
es |
20120410.1 |
|
es |
20140425.1 |
|
es |
20120410.1 |
|
es |
20120410.1 |
|
es |
20131115.1 |
|
es |
20120410.1 |
|
it |
20130618.0 |
|
nl |
20120616.1 |
|
nl |
20120616.1 |
|
pt |
20120616.1 |
|
pt |
20130121.1 |
|
pt |
20130121.1 |
|
pt |
20120616.1 |
|
sv |
20120616.1 |
|
sv |
20120616.1 |
StanfordPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPosTagger
Stanford Part-of-Speech tagger component.
Parameters
POSMappingLocation
(String) [optional]-
Location of the mapping file for part-of-speech tags to UIMA types.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model and tag set mapping.
maxSentenceLength
(Integer) [optional]-
Sentences with more tokens than the specified max amount will be ignored if this parameter is set to a value larger than zero. The default value zero will allow all sentences to be POS tagged.
modelLocation
(String) [optional]-
Location from which the model is read.
modelVariant
(String) [optional]-
Variant of a model. Used to address a specific model if there are multiple models for one language.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
ptb3Escaping
(Boolean) =true
-
Enable all traditional PTB3 token transforms (like -LRB-, -RRB-).
quoteBegin
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like opening quotes and escaped accordingly before being sent to the parser.
quoteEnd
(String[]) [optional]-
List of extra token texts (usually single character strings) that should be treated like closing quotes and escaped accordingly before being sent to the parser.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
ar |
20131112.1 |
|
de |
20140827.1 |
|
de |
20140827.1 |
|
de |
20140827.0 |
|
de |
20140827.1 |
|
en |
20140616.1 |
|
en |
20140827.0 |
|
en |
20130730.1 |
|
en |
20140616.1 |
|
en |
20130730.1 |
|
en |
20130914.0 |
|
en |
20160110.1 |
|
en |
20131112.1 |
|
en |
20140827.0 |
|
en |
20140616.1 |
|
en |
20131112.1 |
|
es |
20151014.1 |
|
es |
20150108.1 |
|
fr |
20140616.1 |
|
zh |
20140616.1 |
|
zh |
20140616.1 |
TreeTaggerPosTagger
Role: Part-of-speech tagger
Artifact ID: de.tudarmstadt.ukp.dkpro.core.treetagger-asl
Class: de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosTagger
Part-of-Speech and lemmatizer annotator using TreeTagger.
Parameters
POSMappingLocation
(String) [optional]-
Load the part-of-speech tag to UIMA type mapping from this location instead of locating the mapping automatically.
executablePath
(String) [optional]-
Use this TreeTagger executable instead of trying to locate the executable automatically.
internTags
(Boolean) =true
[optional]-
Use the String#intern() method on tags. This is usually a good idea to avoid spamming the heap with thousands of strings representing only a few different tags. Default: true
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelEncoding
(String) [optional]-
The character encoding used by the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
performanceMode
(Boolean) =false
-
TT4J setting: Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.
printTagSet
(Boolean) =false
-
Log the tag set(s) when a model is loaded. Default: false
writeLemma
(Boolean) =true
-
Write lemma information. Default: true
writePOS
(Boolean) =true
-
Write part-of-speech information. Default: true
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
bg |
20160430.1 |
|
de |
20121207.1 |
|
en |
20151119.1 |
|
es |
20150724.1 |
|
et |
20110124.1 |
|
fi |
20140704.1 |
|
fr |
20100111.1 |
|
gl |
20130516.1 |
|
it |
20141020.1 |
|
la |
20110819.1 |
|
mn |
20120925.1 |
|
nl |
20130107.1 |
|
pl |
20150506.1 |
|
pt |
20101115.2 |
|
ru |
20140505.1 |
|
sk |
20130725.1 |
|
sw |
20130729.1 |
|
zh |
20101115.1 |
Phonetic Transcriptor
Component | Description |
---|---|
Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. |
|
Double-Metaphone phonetic transcription based on Apache Commons Codec. |
|
Metaphone phonetic transcription based on Apache Commons Codec. |
|
Soundex phonetic transcription based on Apache Commons Codec. |
ColognePhoneticTranscriptor
Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.ColognePhoneticTranscriptor
Cologne phonetic (Kölner Phonetik) transcription based on Apache Commons Codec. Works for German.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
DoubleMetaphonePhoneticTranscriptor
Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.DoubleMetaphonePhoneticTranscriptor
Double-Metaphone phonetic transcription based on Apache Commons Codec. Works for English.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
MetaphonePhoneticTranscriptor
Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.MetaphonePhoneticTranscriptor
Metaphone phonetic transcription based on Apache Commons Codec. Works for English.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
SoundexPhoneticTranscriptor
Role: Phonetic Transcriptor
Artifact ID: de.tudarmstadt.ukp.dkpro.core.commonscodec-asl
Class: de.tudarmstadt.ukp.dkpro.core.commonscodec.SoundexPhoneticTranscriptor
Soundex phonetic transcription based on Apache Commons Codec. Works for English.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Segmenter
Segmenter components identify sentence boundaries and tokens. The order in which sentence splitting and tokenization are done differs between the integrated NLP libraries. Thus, we chose to integrate both steps into a single segmenter component to avoid having to reorder the components in a pipeline when replacing one segmenter with another.
Component | Description |
---|---|
Removes annotations that do not conform to minimum or maximum length constraints. |
|
ArkTweet tokenizer. |
|
BreakIterator segmenter. |
|
Split up existing tokens again if they are camel-case text. |
|
Tokenizer using Clear NLP. |
|
Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. |
|
JTok segmenter. |
|
Segmenter using LanguageTool to do the heavy lifting. |
|
Annotates each line in the source text as a sentence. |
|
Tokenizer and sentence splitter using OpenNLP. |
|
This class creates paragraph annotations for the given input document. |
|
Split up existing tokens again at particular split-chars. |
|
This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries. |
|
No description |
|
Merges any Tokens that are covered by a given annotation type. |
|
Remove prefixes and suffixes from tokens. |
|
A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only. |
AnnotationByLengthFilter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.AnnotationByLengthFilter
Removes annotations that do not conform to minimum or maximum length constraints. (This was previously called TokenFilter).
Parameters
FilterTypes
(String[]) =[]
-
A set of annotation types that should be filtered.
MaxLengthFilter
(Integer) =1000
-
Any annotation in filterTypes longer than this value will be removed.
MinLengthFilter
(Integer) =0
-
Any annotation in filterTypes shorter than this value will be removed.
ArktweetTokenizer
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.arktools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.arktools.ArktweetTokenizer
ArkTweet tokenizer.
BreakIteratorSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter
BreakIterator segmenter.
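The underlying mechanism is the JDK's java.text.BreakIterator, which provides locale-sensitive sentence and word boundaries. A minimal stdlib sketch of the boundary scan (the actual component additionally maps the boundaries to UIMA Sentence/Token annotations with offsets):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Collect sentence substrings using the locale's sentence instance.
    public static List<String> sentences(String text, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);
        for (int start = bi.first(), end = bi.next();
                end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("This is one. This is two.", Locale.ENGLISH));
    }
}
```

The same idiom with BreakIterator.getWordInstance yields token boundaries.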
Parameters
language
(String) [optional]-
The language.
splitAtApostrophe
(Boolean) =false
-
Per default the Java BreakIterator does not split off contractions like John's into two tokens. When this parameter is enabled, a non-default token split is generated when an apostrophe (') is encountered.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
CamelCaseTokenSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.CamelCaseTokenSegmenter
Split up existing tokens again if they are camel-case text.
Parameters
deleteCover
(Boolean) =true
-
Whether to remove the original token. Default: true
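The split itself can be sketched with a stdlib regex that breaks at lowercase-to-uppercase transitions (a simplified illustration; the component operates on Token annotations in the CAS and handles further transition kinds):

```java
import java.util.Arrays;
import java.util.List;

public class CamelCaseSplitDemo {
    // Split where a lower-case letter or digit is followed by an upper-case letter.
    public static List<String> split(String token) {
        return Arrays.asList(token.split("(?<=[\\p{Ll}\\p{Nd}])(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        System.out.println(split("getModelLocation")); // [get, Model, Location]
    }
}
```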
Inputs and outputs
Inputs |
|
---|---|
Outputs |
ClearNlpSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter
Tokenizer using Clear NLP.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
GermanSeparatedParticleAnnotator
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.GermanSeparatedParticleAnnotator
Annotator to be used for post-processing of German corpora that have been lemmatized and POS-tagged with the TreeTagger, based on the STTS tagset. This Annotator deals with German particle verbs. Particle verbs consist of a particle and a stem, e.g. anfangen = an+fangen There are many usages of German particle verbs where the stem and the particle are separated, e.g., Wir fangen gleich an. The TreeTagger lemmatizes the verb stem as "fangen" and the separated particle as "an", the proper verblemma "anfangen" is thus not available as an annotation. The GermanSeparatedParticleAnnotator replaces the lemma of the stem of particle-verbs (e.g., fangen) by the proper verb lemma (e.g. anfangen) and leaves the lemma of the separated particle unchanged.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
JTokSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jtok-asl
Class: de.tudarmstadt.ukp.dkpro.core.jtok.JTokSegmenter
JTok segmenter.
Parameters
language
(String) [optional]-
The language.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeParagraph
(Boolean) =true
-
Create Paragraph annotations.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
LanguageToolSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter
Segmenter using LanguageTool to do the heavy lifting. LanguageTool internally uses different strategies for tokenization.
Parameters
language
(String) [optional]-
The language.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
LineBasedSentenceSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.LineBasedSentenceSegmenter
Annotates each line in the source text as a sentence. This segmenter does not create tokens; the token-related parameters have no effect.
Parameters
language
(String) [optional]-
The language.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
OpenNlpSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.opennlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter
Tokenizer and sentence splitter using OpenNLP.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
segmentationModelLocation
(String) [optional]-
Load the segmentation model from this location instead of locating the model automatically.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
tokenizationModelLocation
(String) [optional]-
Load the tokenization model from this location instead of locating the model automatically.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
da |
20120616.1 |
|
da |
20120616.1 |
|
de |
20120616.1 |
|
de |
20120616.1 |
|
en |
20120616.1 |
|
en |
20120616.1 |
|
it |
20130618.0 |
|
it |
20130618.0 |
|
nb |
20120131.1 |
|
nb |
20120131.1 |
|
nl |
20120616.1 |
|
nl |
20120616.1 |
|
pt |
20120616.1 |
|
pt |
20120616.1 |
|
sv |
20120616.1 |
|
sv |
20120616.1 |
ParagraphSplitter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.ParagraphSplitter
This class creates paragraph annotations for the given input document. It searches for the occurrence of two or more line-breaks (Unix and Windows) and regards this as the boundary between paragraphs.
Parameters
splitPattern
(String) =((\r\n\r\n)(\r\n)*)|((\n\n)(\n)*)
-
A regular expression used to detect paragraph splits. Default: #DOUBLE_LINE_BREAKS_PATTERN (split on two consecutive line breaks)
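The default split pattern can be exercised directly with the stdlib regex engine (a sketch; the component itself creates Paragraph annotations with correct offsets rather than returning substrings):

```java
public class ParagraphSplitDemo {
    // Default pattern: two or more consecutive line breaks (Windows or Unix).
    static final String DOUBLE_LINE_BREAKS = "((\\r\\n\\r\\n)(\\r\\n)*)|((\\n\\n)(\\n)*)";

    public static String[] paragraphs(String text) {
        return text.split(DOUBLE_LINE_BREAKS);
    }

    public static void main(String[] args) {
        String[] p = paragraphs("First paragraph.\n\nSecond paragraph.\n\n\nThird.");
        System.out.println(p.length); // 3
    }
}
```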
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
PatternBasedTokenSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.PatternBasedTokenSegmenter
Split up existing tokens again at particular split-chars. The prefix states whether the split chars should be added as separate Tokens. If the #INCLUDE_PREFIX precedes the split pattern, the pattern is included. Consequently, patterns following the #EXCLUDE_PREFIX will not be added as a Token.
Parameters
deleteCover
(Boolean) =true
-
Whether to remove the original token. Default: true
patterns
(String[])-
A list of regular expressions, prefixed with #INCLUDE_PREFIX or #EXCLUDE_PREFIX. If neither of the prefixes is used, #EXCLUDE_PREFIX is assumed.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
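The include/exclude behaviour described above can be sketched as follows. This is an illustration of the idea only, not DKPro's actual implementation (the real component attaches the #INCLUDE_PREFIX/#EXCLUDE_PREFIX markers to the patterns themselves):

```python
import re

def split_token(token, split_char, include):
    # Split a token at split_char; with include=True the split character
    # is kept as a separate token, otherwise it is dropped.
    parts = []
    for piece in re.split("(%s)" % re.escape(split_char), token):
        if not piece:
            continue
        if piece == split_char and not include:
            continue
        parts.append(piece)
    return parts

print(split_token("foo-bar", "-", include=True))   # → ['foo', '-', 'bar']
print(split_token("foo-bar", "-", include=False))  # → ['foo', 'bar']
```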
RegexTokenizer
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.RegexTokenizer
This segmenter splits sentences and tokens based on regular expressions that define the sentence and token boundaries.
The default behaviour is to split sentences by a line break and tokens by whitespace.
Parameters
language
(String) [optional]-
The language.
sentenceBoundaryRegex
(String) = ``-
Define the sentence boundary. Default: \n (assume one sentence per line).
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
tokenBoundaryRegex
(String) =[\\s\n]+
-
Defines the pattern that is used as the token end boundary. Default: [\s\n]+ (matching whitespace and line breaks).
When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n].
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
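The defaults described above (one sentence per line, tokens split on whitespace) and the [-\n] caveat can be reproduced with plain regular expressions. This is an illustration, not the component's source:

```python
import re

def regex_tokenize(text, sentence_boundary=r"\n", token_boundary=r"[\s\n]+"):
    # Split into sentences, then each sentence into tokens, dropping empties.
    sentences = [s for s in re.split(sentence_boundary, text) if s]
    return [[t for t in re.split(token_boundary, s) if t] for s in sentences]

print(regex_tokenize("Hello world .\nSecond line ."))
# → [['Hello', 'world', '.'], ['Second', 'line', '.']]

# Custom boundary: include \n in the character class so the final token
# is not glued to a trailing line break.
print(regex_tokenize("tokenized-text", token_boundary=r"[-\n]"))
# → [['tokenized', 'text']]
```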
StanfordSegmenter
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordSegmenter
Parameters
allowEmptySentences
(Boolean) =false
-
Whether to generate empty sentences.
boundaryFollowers
(String[]) =[), ], }, \", ', '', \u2019, \u201D, -RRB-, -RSB-, -RCB-, ), ], }]
[optional]-
A set of strings, matched with .equals(), that are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".
boundaryToDiscard
(String[]) =[, NL]
[optional]-
The set of regex for sentence boundary tokens that should be discarded.
boundaryTokenRegex
(String) =\\.|[!?]+
[optional]-
The set of boundary tokens. If null, use default.
isOneSentence
(Boolean) =false
-
Whether to treat all input as one sentence.
language
(String) [optional]-
The language.
languageFallback
(String) [optional]-
newlineIsSentenceBreak
(String) =TWO_CONSECUTIVE
[optional]-
Strategy for treating newlines as paragraph breaks.
regionElementRegex
(String) [optional]-
A regular expression for element names containing a sentence region. Only tokens in such elements will be included in sentences. The start and end tags themselves are not included in the sentence.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
tokenRegexesToDiscard
(String[]) =[]
[optional]-
The set of regexes matching tokens that should be discarded.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
xmlBreakElementsToDiscard
(String[]) [optional]-
These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
TokenMerger
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenMerger
Merges any Tokens that are covered by a given annotation type. E.g. this component can be used to create a single token from all tokens that constitute a multi-token named entity.
Parameters
POSMappingLocation
(String) [optional]-
Override the tagset mapping.
annotationType
(String)-
Annotation type for which tokens should be merged.
constraint
(String) [optional]-
A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to merge only tokens that are part of a location named entity.
language
(String) [optional]-
Use this language instead of the document language to resolve the model and tag set mapping.
lemmaMode
(String) =JOIN
-
Configure what should happen to the lemma of the merged tokens. It is possible to JOIN the lemmata to a single lemma (space separated), to REMOVE the lemma or LEAVE the lemma of the first token as-is.
posType
(String) [optional]-
Set a new POS tag for the new merged token. This is the mapped type. If this is specified, tag set mapping will not be performed. This parameter has no effect unless PARAM_POS_VALUE is also set.
posValue
(String) [optional]-
Set a new POS value for the new merged token. This is the actual tag set value and is subject to tagset mapping. For example when merging tokens for named entities, the new POS value may be set to "NNP" (English/Penn Treebank Tagset).
Inputs and outputs
Inputs |
|
---|---|
Outputs |
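The merging behaviour, including the JOIN lemma mode, can be sketched like this. It is a simplified illustration; tokens and spans are plain tuples here, not UIMA annotations:

```python
def merge_tokens(tokens, span):
    # tokens: list of (begin, end, text, lemma); span: (begin, end).
    # Tokens covered by the span are merged into one; their lemmas are
    # joined with a space, mirroring the JOIN lemma mode.
    merged, covered = [], []
    for tok in tokens:
        begin, end = tok[0], tok[1]
        if begin >= span[0] and end <= span[1]:
            covered.append(tok)
        else:
            merged.append(tok)
    if covered:
        merged.append((covered[0][0], covered[-1][1],
                       " ".join(t[2] for t in covered),
                       " ".join(t[3] for t in covered)))
        merged.sort()
    return merged

tokens = [(0, 3, "New", "new"), (4, 8, "York", "york"), (9, 11, "is", "be")]
print(merge_tokens(tokens, (0, 8)))
# → [(0, 8, 'New York', 'new york'), (9, 11, 'is', 'be')]
```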
TokenTrimmer
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.TokenTrimmer
Remove prefixes and suffixes from tokens.
Parameters
prefixes
(String[])-
List of prefixes to remove.
suffixes
(String[])-
List of suffixes to remove.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
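A minimal sketch of the trimming logic (illustration only; whether the real component strips more than one prefix/suffix per token is an assumption made here, and it operates on Token annotations rather than strings):

```python
def trim_token(token, prefixes, suffixes):
    # Remove at most one matching prefix and one matching suffix.
    for p in prefixes:
        if token.startswith(p):
            token = token[len(p):]
            break
    for s in suffixes:
        if token.endswith(s):
            token = token[:-len(s)]
            break
    return token

print(trim_token('"quoted,"', ['"'], [',"', '"']))  # → quoted
```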
WhitespaceTokenizer
Role: Segmenter
Artifact ID: de.tudarmstadt.ukp.dkpro.core.tokit-asl
Class: de.tudarmstadt.ukp.dkpro.core.tokit.WhitespaceTokenizer
A strict whitespace tokenizer, i.e. tokenizes according to whitespaces and linebreaks only.
If PARAM_WRITE_SENTENCES is set to true, one sentence per line is assumed. Otherwise, no sentences are created.
Parameters
language
(String) [optional]-
The language.
strictZoning
(Boolean) =false
-
Strict zoning causes the segmentation to be applied only within the boundaries of a zone annotation. This works only if a single zone type is specified (the zone annotations should NOT overlap) or if no zone type is specified - in which case the whole document is taken as a zone. If strict zoning is turned off, multiple zone types can be specified. A list of all zone boundaries (start and end) is created and segmentation happens between them.
writeSentence
(Boolean) =true
-
Create Sentence annotations.
writeToken
(Boolean) =true
-
Create Token annotations.
zoneTypes
(String[]) =[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Div]
[optional]-
A list of type names used for zoning.
Semantic role labeler
Component | Description |
---|---|
ClearNLP semantic role labeller. |
|
DKPro Annotator for the MateTools Semantic Role Labeler. |
ClearNlpSemanticRoleLabeler
Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.clearnlp-asl
Class: de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSemanticRoleLabeler
ClearNLP semantic role labeller.
Parameters
expandArguments
(Boolean) =false
-
Normally the arguments point only to the head words of arguments in the dependency tree. With this option enabled, they are expanded to the text covered by the minimal and maximal token offsets of all descendants (or self) of the head word.
Warning: this parameter should be used with caution! For one, if the descendants of a head word cover a non-continuous region of the text, this information is lost. The arguments will appear to span a continuous region. For another, the arguments may overlap with each other. E.g. if a sentence contains a relative clause with a verb, the subject of the main clause may be recognized as a dependent of the verb and may cause the whole main clause to be recorded in the argument.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelVariant
(String) [optional]-
Variant of a model. Used to address a specific model if there are multiple models for one language.
predModelLocation
(String) [optional]-
Location from which the predicate identifier model is read.
printTagSet
(Boolean) =false
-
Write the tag set(s) to the log when a model is loaded.
roleModelLocation
(String) [optional]-
Location from which the roleset classification model is read.
srlModelLocation
(String) [optional]-
Location from which the semantic role labeling model is read.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
MateSemanticRoleLabeler
Role: Semantic role labeler
Artifact ID: de.tudarmstadt.ukp.dkpro.core.matetools-gpl
Class: de.tudarmstadt.ukp.dkpro.core.matetools.MateSemanticRoleLabeler
DKPro Annotator for the MateTools Semantic Role Labeler.
Please cite the following paper if you use the semantic role labeler: Anders Björkelund, Love Hafdell, and Pierre Nugues. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43–48, Boulder, June 4–5 2009.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
modelLocation
(String) [optional]-
Load the model from this location instead of locating the model automatically.
modelVariant
(String) [optional]-
Override the default variant used to locate the model.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Models
Language | Variant | Version |
---|---|---|
de |
20130105.0 |
|
en |
20130117.0 |
|
es |
20130320.0 |
|
zh |
20130117.0 |
Stemmer
Component | Description |
---|---|
UIMA wrapper for the Snowball stemmer. |
SnowballStemmer
Role: Stemmer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.snowball-asl
Class: de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer
UIMA wrapper for the Snowball stemmer. Annotation types to be stemmed can be configured by a FeaturePath.
If you use this component in a pipeline which uses stop word removal, make sure that it runs after the stop word removal step, so that only words that are not stop words are stemmed.
Parameters
filterConditionOperator
(String) [optional]-
Specifies the operator for a filtering condition. It is only used if PARAM_FILTER_FEATUREPATH is set.
filterConditionValue
(String) [optional]-
Specifies the value for a filtering condition. It is only used if PARAM_FILTER_FEATUREPATH is set.
filterFeaturePath
(String) [optional]-
Specifies a feature path that is used in the filter. If this is set, you also have to specify PARAM_FILTER_CONDITION_OPERATOR and PARAM_FILTER_CONDITION_VALUE.
language
(String) [optional]-
Use this language instead of the document language to resolve the model.
lowerCase
(Boolean) =false
[optional]-
Per default the stemmer runs in case-sensitive mode. If this parameter is enabled, tokens are lower-cased before being passed to the stemmer.
Examples:
Input | false (default) | true |
---|---|---|
EDUCATIONAL | EDUCATIONAL | educ |
Educational | Educat | educ |
educational | educ | educ |
paths
(String[]) [optional]-
Specify a path that is used for annotation. Format is de.type.name/feature/path. All type objects will be annotated with an IndexTermAnnotation. The value of the IndexTerm is specified by the feature path.
Inputs and outputs
Inputs |
none specified |
---|---|
Outputs |
Topic Model
Topic modeling is a statistical approach to discover abstract topics in a collection of documents. A topic is characterized by a probability distribution over the words in the document collection. Once a topic model has been generated, it can be used to analyze unseen documents. The result of the analysis describes the probability with which a document belongs to each of the topics in the model.
Component | Description |
---|---|
Estimate an LDA topic model using Mallet and write it to a file. |
|
Infers the topic distribution over documents using a Mallet ParallelTopicModel. |
MalletTopicModelEstimator
Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelEstimator
Estimate an LDA topic model using Mallet and write it to a file. It stores all incoming CASes as Mallet Instances before estimating the model, using a ParallelTopicModel.
Parameters
alphaSum
(Float) =1.0
-
The sum of alphas over all topics. Default: 1.0.
Another recommended value is 50 / T (number of topics).
beta
(Float) =0.01
-
Beta for a single dimension of the Dirichlet prior. Default: 0.01.
burninPeriod
(Integer) =100
-
The number of iterations before hyperparameter optimization begins. Default: 100
displayInterval
(Integer) =50
-
The interval in which to display the estimated topics. Default: 50.
displayNTopicWords
(Integer) =7
-
The number of top words to display during estimation. Default: 7.
minTokenLength
(Integer) =3
-
Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.
modelEntityType
(String) [optional]-
If specified, the text contained in the given segmentation type annotations is fed as separate units to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.
By default, the full document text is used as a document.
nIterations
(Integer) =1000
-
The number of iterations during model estimation. Default: 1000.
nThreads
(Integer) =1
-
The number of threads to use during model estimation. Default: 1.
nTopics
(Integer) =10
-
The number of topics to estimate for the topic model.
optimizeInterval
(Integer) =50
-
Interval for optimizing Dirichlet hyperparameters. Default: 50
randomSeed
(Integer) =-1
-
Set random seed. If set to -1 (default), uses random generator.
saveInterval
(Integer) =0
-
Define how often to save a serialized model during estimation. Default: 0 (only save when estimation is done).
targetLocation
(String)-
The target model file location.
typeName
(String) =de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
-
The annotation type to use for the topic model. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.
useLemma
(Boolean) =false
-
If set, uses lemmas instead of original text as features.
useSymmetricAlph
(Boolean) =false
-
Use a symmetric alpha value during model estimation? Default: false.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
MalletTopicModelInferencer
Role: Topic Model
Artifact ID: de.tudarmstadt.ukp.dkpro.core.mallet-asl
Class: de.tudarmstadt.ukp.dkpro.core.mallet.topicmodel.MalletTopicModelInferencer
Infers the topic distribution over documents using a Mallet ParallelTopicModel.
Parameters
burnIn
(Integer) =1
-
The number of iterations before hyperparameter optimization begins. Default: 1
maxTopicAssignments
(Integer) =0
-
Maximum number of topics to assign. If not set (or <= 0), it defaults to the number of topics in the model divided by 10.
minTokenLength
(Integer) =3
-
Ignore tokens (or lemmas, respectively) that are shorter than the given value. Default: 3.
minTopicProb
(Float) =0.2
-
Minimum topic proportion for the document-topic assignment.
modelLocation
(String)
nIterations
(Integer) =10
-
The number of iterations during inference. Default: 10.
thinning
(Integer) =5
typeName
(String) =de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
-
The annotation type to use as tokens. Default: Token
useLemma
(Boolean) =false
-
If set, uses lemmas instead of original text as features.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Transformer
Component | Description |
---|---|
Takes a text and replaces wrong capitalization |
|
Converts traditional Chinese to simplified Chinese or vice-versa. |
|
Reads a tab-separated file containing mappings from one token to another. |
|
Takes a text and shortens extra long words |
|
Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT. |
|
Simple dictionary-based hyphenation remover. |
|
A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions. |
|
Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens. |
|
Takes a text and replaces sharp s |
|
Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation. |
|
Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. |
|
Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen. |
|
Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model. |
CapitalizationNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.CapitalizationNormalizer
Takes a text and replaces wrong capitalization.
Parameters
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
CjfNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.languagetool-asl
Class: de.tudarmstadt.ukp.dkpro.core.languagetool.CjfNormalizer
Converts traditional Chinese to simplified Chinese or vice-versa.
Parameters
direction
(String) =TO_SIMPLIFIED
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
DictionaryBasedTokenTransformer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.DictionaryBasedTokenTransformer
Reads a tab-separated file containing mappings from one token to another. All tokens that match an entry in the first column are changed to the corresponding token in the second column.
Parameters
commentMarker
(String) =#
-
Lines starting with this character (or String) are ignored. Default: '#'
modelEncoding
(String) =UTF-8
modelLocation
(String)
separator
(String) = ``-
Separator for mappings file. Default: "\t" (TAB).
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
ExpressiveLengtheningNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.ExpressiveLengtheningNormalizer
Takes a text and shortens extra-long words.
Parameters
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
FileBasedTokenTransformer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.FileBasedTokenTransformer
Replaces all tokens that are listed in the file in #PARAM_MODEL_LOCATION by the string specified in #PARAM_REPLACEMENT.
Parameters
ignoreCase
(Boolean) =false
modelLocation
(String)
replacement
(String)
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
HyphenationRemover
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.HyphenationRemover
Simple dictionary-based hyphenation remover.
Parameters
modelEncoding
(String) =UTF-8
modelLocation
(String)
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
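The dictionary-based idea can be sketched as: join a word broken across a line break only if the joined form is a known word. This is an illustration; `words` below is a stand-in for the dictionary file supplied via modelLocation:

```python
import re

def remove_hyphenation(text, dictionary):
    # Join "xxx-\nyyy" into "xxxyyy" only when the joined form is known.
    def join(match):
        candidate = match.group(1) + match.group(2)
        return candidate if candidate.lower() in dictionary else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

words = {"hyphenation"}
print(remove_hyphenation("remove hyphen-\nation here", words))
# → remove hyphenation here
```

A genuinely hyphenated compound whose joined form is not in the dictionary is left untouched.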
RegexBasedTokenTransformer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.RegexBasedTokenTransformer
A JCasTransformerChangeBased_ImplBase implementation that replaces tokens based on regular expressions.
The parameter #PARAM_REGEX defines the regular expression to be searched for; #PARAM_REPLACEMENT defines the string with which matching patterns are replaced.
Parameters
regex
(String)-
Define the regular expression to be replaced
replacement
(String)-
Define the string to replace matching tokens with
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
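The regex/replacement pair can be illustrated like this. It is a sketch; whether the real component matches whole tokens or substrings is not stated above, so whole-token matching is assumed here:

```python
import re

def transform_tokens(tokens, regex, replacement):
    # Replace every token that fully matches the regex.
    pattern = re.compile(regex)
    return [replacement if pattern.fullmatch(t) else t for t in tokens]

print(transform_tokens(["I", "paid", "42", "dollars"], r"\d+", "<NUM>"))
# → ['I', 'paid', '<NUM>', 'dollars']
```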
ReplacementFileNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.ReplacementFileNormalizer
Takes a text and replaces desired expressions. This class should not work on tokens, as some expressions might span several tokens.
Parameters
modelLocation
(String)-
Location of a file which contains all replacing characters
srcExpressionSurroundings
(String) =IRRELEVANT
targetExpressionSurroundings
(String) =NOTHING
Inputs and outputs
Inputs |
|
---|---|
Outputs |
SharpSNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.SharpSNormalizer
Takes a text and replaces sharp s.
Parameters
MinFrequencyThreshold
(Integer) =100
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
SpellingNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.SpellingNormalizer
Converts annotations of the type SpellingAnomaly into a SofaChangeAnnotation.
Parameters
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
StanfordPtbTransformer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordPtbTransformer
Uses the normalizing tokenizer of the Stanford CoreNLP tools to escape the text PTB-style. This component operates directly on the text and does not require prior segmentation.
Parameters
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
TokenCaseTransformer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.transformation.TokenCaseTransformer
Change tokens to follow a specific casing: all upper case, all lower case, or 'normal case': lowercase everything but the first character of a token and the characters immediately following a hyphen.
Parameters
tokenCase
(String)-
The case to convert tokens to:
- UPPERCASE: uppercase everything.
- LOWERCASE: lowercase everything.
- NORMALCASE: retain first letter in word and after hyphens, lowercase everything else.
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
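The NORMALCASE rule described above can be sketched as:

```python
def normal_case(token):
    # Lowercase everything except the first character of the token and
    # any character immediately following a hyphen, whose original
    # casing is retained.
    out = []
    for i, ch in enumerate(token):
        if i == 0 or token[i - 1] == "-":
            out.append(ch)
        else:
            out.append(ch.lower())
    return "".join(out)

print(normal_case("McGRAW-HILL"))  # → Mcgraw-Hill
```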
UmlautNormalizer
Role: Transformer
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.frequency.UmlautNormalizer
Takes a text and checks for umlauts written as "ae", "oe", or "ue" and normalizes them if they really are umlauts depending on a frequency model.
Parameters
MinFrequencyThreshold
(Integer) =100
typesToCopy
(String[]) =[]
-
A list of fully qualified type names that should be copied to the transformed CAS where available. By default, no types are copied apart from DocumentMetaData, i.e. all other annotations are omitted.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
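The frequency-based decision can be sketched as follows. This is an illustration only; `freq` is a stand-in for the frequency model, and the counts are invented:

```python
def normalize_umlauts(token, freq):
    # Rewrite "ae"/"oe"/"ue" to the umlaut form only if the rewritten
    # variant is more frequent than the original in the frequency model.
    candidate = (token.replace("ae", "ä")
                      .replace("oe", "ö")
                      .replace("ue", "ü"))
    if candidate != token and freq.get(candidate, 0) > freq.get(token, 0):
        return candidate
    return token

freq = {"über": 1000, "ueber": 3, "Steuer": 800}
print(normalize_umlauts("ueber", freq))   # → über
print(normalize_umlauts("Steuer", freq))  # "ue" here is not an umlaut
```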
Other
Component | Description |
---|---|
Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words. |
|
Applies changes annotated using a SofaChangeAnnotation. |
|
After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view. |
|
Annotates compound parts and linking morphemes. |
|
This component assumes that some spell checker has already been applied upstream (e.g. |
|
Takes a plain text file with phrases as input and annotates the phrases in the CAS file. |
|
Utility analysis engine for use with CAS multipliers in uimaFIT pipelines. |
|
N-gram annotator. |
|
Creates SofaChangeAnnotations containing corrections for previously identified spelling errors. |
|
Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech. |
|
Maps existing POS tags from one tagset to another using a user provided properties file. |
|
Assign a set of popular readability scores to the text. |
|
Remove every token that does or does not match a given regular expression. |
|
This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. |
|
Converts a constituency structure into a dependency structure. |
|
Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. |
|
Can be used to measure how long the processing between two points in a pipeline takes. |
|
This component adds Tfidf annotations consisting of a term and a tfidf weight. |
|
This consumer builds a DfModel. |
|
Removing trailing character (sequences) from tokens, e.g. punctuation. |
AnnotationByTextFilter
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.AnnotationByTextFilter
Reads a list of words from a text file (one token per line) and retains only tokens or other annotations that match any of these words.
Parameters
ignoreCase
(Boolean) =true
-
If true, annotation texts are filtered case-independently. Default: true, i.e. words that occur in the list with different casing are not filtered out.
modelEncoding
(String) =UTF-8
modelLocation
(String)
typeName
(String) =de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
-
Annotation type to filter. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token.
ApplyChangesAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.ApplyChangesAnnotator
Applies changes annotated using a SofaChangeAnnotation.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
Backmapper
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.castransformation-asl
Class: de.tudarmstadt.ukp.dkpro.core.castransformation.Backmapper
After processing a file with the ApplyChangesAnnotator this annotator can be used to map the annotations created in the cleaned view back to the original view.
Parameters
Chain
(String[]) =[source, target]
[optional]-
Chain of views for backmapping. This should be the reverse of the chain of views that the ApplyChangesAnnotator has used. For example, if view A has been mapped to B using ApplyChangesAnnotator, then this parameter should be set using an array containing [B, A].
CompoundAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.decompounding-asl
Class: de.tudarmstadt.ukp.dkpro.core.decompounding.uima.annotator.CompoundAnnotator
Annotates compound parts and linking morphemes.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
CorrectionsContextualizer
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.jazzy-asl
Class: de.tudarmstadt.ukp.dkpro.core.jazzy.CorrectionsContextualizer
This component assumes that some spell checker has already been applied upstream (e.g. Jazzy). It then uses ngram frequencies from a frequency provider in order to rank the provided corrections.
DictionaryAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.DictionaryAnnotator
Takes a plain text file with phrases as input and annotates the phrases in the CAS file. The annotation type defaults to NGram, but can be changed. The component requires that Tokens and Sentences are annotated in the CAS. The format of the phrase file is one phrase per line; tokens are separated by a space:
this is a phrase
another phrase
Parameters
annotationType
(String) [optional]-
The annotation to create on matching phrases. If nothing is specified, this defaults to NGram.
modelEncoding
(String) =UTF-8
-
The character encoding used by the model.
modelLocation
(String)-
The file must contain one phrase per line - phrases will be split at " "
value
(String) [optional]-
The value to set the feature configured in #PARAM_VALUE_FEATURE to.
valueFeature
(String) =value
[optional]-
Set this feature on the created annotations.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
none specified |
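A minimal phrase matcher illustrating the file format above (one phrase per line, tokens separated by spaces). This is a sketch, not the component's matching algorithm:

```python
def find_phrases(tokens, phrase_lines):
    # Each phrase line is split into tokens; matches are reported as
    # (start, end) token index spans.
    phrases = [line.split(" ") for line in phrase_lines if line.strip()]
    hits = []
    for start in range(len(tokens)):
        for phrase in phrases:
            if tokens[start:start + len(phrase)] == phrase:
                hits.append((start, start + len(phrase)))
    return hits

tokens = "this is a phrase indeed".split()
print(find_phrases(tokens, ["this is a phrase", "another phrase"]))
# → [(0, 4)]
```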
JCasHolder
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.util.JCasHolder
Utility analysis engine for use with CAS multipliers in uimaFIT pipelines.
NGramAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.ngrams-asl
Class: de.tudarmstadt.ukp.dkpro.core.ngrams.NGramAnnotator
N-gram annotator.
Parameters
N
(Integer) =3
-
The length of the n-grams to generate (the "n" in n-gram).
Inputs and outputs
Inputs |
|
---|---|
Outputs |
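For N=3, the annotator's output corresponds to the token trigrams of the input, sketched here for a single token list. Whether the component also emits shorter n-grams is not stated above, so only fixed-length n-grams are shown:

```python
def ngrams(tokens, n):
    # All contiguous n-grams of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "quick", "brown", "fox"], 3))
# → [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```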
NorvigSpellingCorrector
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.norvig-asl
Class: de.tudarmstadt.ukp.dkpro.core.norvig.NorvigSpellingCorrector
Creates SofaChangeAnnotations containing corrections for previously identified spelling errors.
Inputs and outputs
Inputs |
|
---|---|
Outputs |
PosFilter
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosFilter
Removes all tokens/lemmas/stems/POS tags (depending on the "Mode" setting) that do not match the given parts of speech.
Parameters
Verbs
(Boolean) =false
-
Keep/remove verbs (true: keep, false: remove)
adj
(Boolean) =false
-
Keep/remove adjectives (true: keep, false: remove)
adv
(Boolean) =false
-
Keep/remove adverbs (true: keep, false: remove)
art
(Boolean) =false
-
Keep/remove articles (true: keep, false: remove)
card
(Boolean) =false
-
Keep/remove cardinal numbers (true: keep, false: remove)
conj
(Boolean) =false
-
Keep/remove conjunctions (true: keep, false: remove)
n
(Boolean) =false
-
Keep/remove nouns (true: keep, false: remove)
o
(Boolean) =false
-
Keep/remove "others" (true: keep, false: remove)
pp
(Boolean) =false
-
Keep/remove prepositions (true: keep, false: remove)
pr
(Boolean) =false
-
Keep/remove pronouns (true: keep, false: remove)
punc
(Boolean) =false
-
Keep/remove punctuation (true: keep, false: remove)
typeToRemove
(String)-
The fully qualified name of the type that should be filtered.
Inputs and outputs
Inputs | |
---|---|
Outputs | none specified |
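The keep/remove switches above boil down to one rule: a token survives only if its coarse POS class is among the classes the user chose to keep. A minimal sketch of that rule (hypothetical names; the real component operates on UIMA annotations, not string pairs):

```java
import java.util.*;

class PosFilterSketch {
    // Retain only tokens whose POS class is in the "keep" set.
    // Each entry is {token text, coarse POS class}.
    static List<String> filter(List<String[]> taggedTokens, Set<String> keep) {
        List<String> out = new ArrayList<>();
        for (String[] t : taggedTokens) {
            if (keep.contains(t[1])) out.add(t[0]);
        }
        return out;
    }
}
```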
PosMapper
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.posfilter-asl
Class: de.tudarmstadt.ukp.dkpro.core.posfilter.PosMapper
Maps existing POS tags from one tagset to another using a user provided properties file.
Parameters
dkproMappingLocation
(String) [optional]-
A properties file containing mappings from the new tagset to (fully qualified) DKPro POS classes.
If such a file is not supplied, the DKPro POS classes stay the same regardless of the new POS tag value, and only the value is changed.
mappingFile
(String)-
A properties file containing POS tagset mappings.
Inputs and outputs
Inputs | |
---|---|
Outputs | |
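The mapping idea can be sketched with a standard java.util.Properties lookup. The fallback to the original tag for unmapped entries is an assumption made for this sketch; the real component's handling of unmapped tags may differ:

```java
import java.util.Properties;

class PosMapSketch {
    // Look up the new tag for an old tag in a properties-style mapping;
    // tags without a mapping pass through unchanged (assumption).
    static String map(Properties mapping, String tag) {
        return mapping.getProperty(tag, tag);
    }
}
```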
ReadabilityAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.readability-asl
Class: de.tudarmstadt.ukp.dkpro.core.readability.ReadabilityAnnotator
Assign a set of popular readability scores to the text.
RegexTokenFilter
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.RegexTokenFilter
Remove every token that does or does not match a given regular expression.
Parameters
mustMatch
(Boolean) =true
-
If this parameter is set to true (default), retain only tokens that match the regex given in #PARAM_REGEX. If set to false, all tokens that match the given regex are removed.
regex
(String)-
Every token that does or does not match this regular expression will be removed.
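The interaction of the two parameters can be sketched compactly; the names here are illustrative, not the component's code:

```java
import java.util.*;
import java.util.regex.Pattern;

class RegexFilterSketch {
    // mustMatch = true: keep only tokens matching the regex.
    // mustMatch = false: remove tokens matching the regex.
    static List<String> filter(List<String> tokens, String regex, boolean mustMatch) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (p.matcher(t).matches() == mustMatch) out.add(t);
        }
        return out;
    }
}
```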
SemanticFieldAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator-asl
Class: de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.SemanticFieldAnnotator
This Analysis Engine annotates English single words with semantic field information retrieved from an ExternalResource. This could be a lexical resource such as WordNet or a simple key-value map. The annotation is stored in the SemanticField annotation type.
Parameters
annotationType
(String)-
Annotation types which should be annotated with semantic fields
constraint
(String) [optional]-
A constraint on the annotations that should be considered in form of a JXPath statement. Example: set #PARAM_ANNOTATION_TYPE to a NamedEntity type and set the #PARAM_CONSTRAINT to ".[value = 'LOCATION']" to annotate only tokens with semantic fields that are part of a location named entity.
Inputs and outputs
Inputs | |
---|---|
Outputs | |
StanfordDependencyConverter
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl
Class: de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordDependencyConverter
Converts a constituency structure into a dependency structure.
Parameters
language
(String) [optional]-
Use this language instead of the document language to resolve the model and tag set mapping.
mode
(String) =TREE
[optional]-
Sets the kind of dependencies being created.
Default: DependenciesMode#TREE
originalDependencies
(Boolean) =true
-
Create original dependencies. If this is disabled, universal dependencies are created. The default is to create the original dependencies.
Inputs and outputs
Inputs | |
---|---|
Outputs | |
StopWordRemover
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.stopwordremover-asl
Class: de.tudarmstadt.ukp.dkpro.core.stopwordremover.StopWordRemover
Remove all of the specified types from the CAS if their covered text is in the stop word dictionary. Also remove any other of the specified types that is covered by a matching instance.
Parameters
Paths
(String[]) [optional]-
Feature paths for annotations that should be matched/removed. The default is:
StopWord.class.getName()
Token.class.getName()
Lemma.class.getName() + "/value"
StopWordType
(String) [optional]-
Anything annotated with this type will be removed even if it does not match any word in the lists.
modelEncoding
(String) =UTF-8
-
The character encoding used by the model.
modelLocation
(String[])-
A list of URLs from which to load the stop word lists. If a URL is prefixed with a language code in square brackets, the stop word list is only used for documents in that language. Using no prefix or the prefix "[*]" causes the list to be used for every document. Example: "[de]classpath:/stopwords/en_articles.txt"
Inputs and outputs
Inputs | |
---|---|
Outputs | none specified |
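Parsing the documented location syntax (optional language code in square brackets in front of the URL) can be sketched as follows; the class name and the returned pair format are illustrative:

```java
import java.util.regex.*;

class StopListLocationSketch {
    // Split "[de]classpath:/..." into {language, url}; locations without
    // a bracket prefix apply to every language ("*").
    static String[] parse(String location) {
        Matcher m = Pattern.compile("^\\[([^\\]]+)\\](.*)$").matcher(location);
        if (m.matches()) return new String[]{m.group(1), m.group(2)};
        return new String[]{"*", location};
    }
}
```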
Stopwatch
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.performance-asl
Class: de.tudarmstadt.ukp.dkpro.core.performance.Stopwatch
Can be used to measure how long the processing between two points in a pipeline takes. For that purpose, the AE needs to be added two times, before and after the part of the pipeline that should be measured.
Parameters
timerName
(String)-
Name of the timer pair. Upstream and downstream timer need to use the same name.
timerOutputFile
(String) [optional]-
Name of the file to which the timer results are written.
Inputs and outputs
Inputs | |
---|---|
Outputs | |
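The paired-timer mechanism can be sketched as follows: the first instance seen for a timer name records a start time, and the second instance accumulates the elapsed time under that name. This is a simplified model of the behavior described above, not the component's code:

```java
import java.util.*;

class StopwatchSketch {
    static final Map<String, Long> starts = new HashMap<>();
    static final Map<String, Long> totals = new HashMap<>();

    // Called once per document by each of the two pipeline instances
    // sharing the same timer name.
    static void hit(String timerName, long nowMillis) {
        Long start = starts.remove(timerName);
        if (start == null) {
            starts.put(timerName, nowMillis);                       // upstream instance
        } else {
            totals.merge(timerName, nowMillis - start, Long::sum);  // downstream instance
        }
    }
}
```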
TfidfAnnotator
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfAnnotator
This component adds Tfidf annotations consisting of a term and a tfidf weight.
The annotator is type agnostic concerning the input annotation, so you have to specify the
annotation type and string representation. It uses a pre-serialized DfStore, which can be
created using the TfidfConsumer.
Parameters
featurePath
(String)-
This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.
lowercase
(Boolean) =false
[optional]-
If set to true, the whole text is handled in lower case.
tfdfPath
(String) [optional]-
Provide the path to the Df-Model. When a shared SharedDfModel is bound to this annotator, this is ignored.
weightingModeIdf
(String) =NORMAL
[optional]-
The model for inverse document frequency weighting.
Invoke toString() on an enum of WeightingModeIdf for setup. Default value is "NORMAL", yielding an unweighted idf.
weightingModeTf
(String) =NORMAL
[optional]-
The model for term frequency weighting.
Invoke toString() on an enum of WeightingModeTf for setup. Default value is "NORMAL", yielding an unweighted tf.
Inputs and outputs
Inputs | none specified |
---|---|
Outputs | |
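The weight a Tfidf annotation carries is a combination of the term's frequency in the document and its inverse document frequency from the DfStore. The exact formulas behind the weighting modes are not spelled out above, so this sketch only shows the tf × idf combination with raw, unweighted tf and idf, loosely matching the documented "NORMAL" defaults (the real component's formulas are an assumption here):

```java
class TfidfSketch {
    // termFreq: occurrences of the term in the current document.
    // docCount: total documents in the collection (from the DfStore).
    // docFreq:  documents containing the term (from the DfStore).
    static double tfidf(int termFreq, int docCount, int docFreq) {
        double idf = (double) docCount / docFreq; // unweighted idf
        return termFreq * idf;                    // unweighted tf
    }
}
```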
TfidfConsumer
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.frequency-asl
Class: de.tudarmstadt.ukp.dkpro.core.frequency.tfidf.TfidfConsumer
This consumer builds a DfModel. It collects the df (document frequency) counts for the processed collection. The counts are serialized as a DfModel-object.
Parameters
featurePath
(String)-
This annotator is type agnostic, so it is mandatory to specify the type of the working annotation and how to obtain the string representation with the feature path.
lowercase
(Boolean) =false
-
If set to true, the whole text is handled in lower case.
targetLocation
(String)-
Specifies the path and filename where the model file is written.
TrailingCharacterRemover
Role: Other
Artifact ID: de.tudarmstadt.ukp.dkpro.core.textnormalizer-asl
Class: de.tudarmstadt.ukp.dkpro.core.textnormalizer.annotations.TrailingCharacterRemover
Removes trailing characters (or character sequences) from tokens, e.g. punctuation.
Parameters
minTokenLength
(Integer) =1
-
All tokens that are shorter than the minimum token length after removing trailing chars are completely removed. By default (1), empty tokens are removed. Set to 0 or a negative value if no tokens should be removed.
Shorter tokens that do not have trailing chars removed are always retained, regardless of their length.
pattern
(String) =[\\Q,-\u201C^\u00BB*\u2019()&/\"'\u00A9\u00A7'\u2014\u00AB\u00B7=\\E0-9A-Z]+
-
A regex to be trimmed from the end of tokens.
Default: "[\\Q,-“^»*’()&/\"'©§'—«·=\\E0-9A-Z]+" (remove punctuations, special characters and capital letters).
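The interaction of pattern and minTokenLength described above can be sketched as follows; the class and the simplified pattern in the test are illustrative, not the component's implementation:

```java
import java.util.*;
import java.util.regex.*;

class TrailingCharSketch {
    // Trim a trailing match of the pattern from each token; drop tokens
    // that fall below the minimum length after trimming. Tokens that had
    // nothing trimmed are always retained, regardless of their length.
    static List<String> process(List<String> tokens, String pattern, int minLen) {
        Pattern trailing = Pattern.compile("(" + pattern + ")$");
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            Matcher m = trailing.matcher(t);
            String trimmed = m.find() ? t.substring(0, m.start()) : t;
            if (trimmed.equals(t) || trimmed.length() >= minLen) out.add(trimmed);
        }
        return out;
    }
}
```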