public abstract class MalletModelTrainer extends JCasFileWriter_ImplBase
Creates a Mallet InstanceList from the input documents so that inheriting estimators can build a model from it. Subclasses typically do this by overriding the JCasFileWriter_ImplBase.collectionProcessComplete() method.
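The division of labour described above (collect instances per document, build the model once at the end) can be sketched with plain Java stand-ins. Everything here is hypothetical: `AccumulatingTrainerSketch` and `TokenCountTrainerSketch` are illustration-only classes that do not use UIMA or Mallet types, and the "model" is just a token frequency table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the trainer base class: it only collects documents.
abstract class AccumulatingTrainerSketch {
    private final List<List<String>> instances = new ArrayList<>();

    // Called once per document; stores the tokens as one training instance.
    void process(List<String> tokens) {
        instances.add(new ArrayList<>(tokens));
    }

    List<List<String>> getInstanceList() {
        return instances;
    }

    // Called once after the last document; subclasses build their model here.
    abstract void collectionProcessComplete();
}

// Hypothetical "estimator": the model is a simple token frequency table.
class TokenCountTrainerSketch extends AccumulatingTrainerSketch {
    final Map<String, Integer> counts = new TreeMap<>();

    @Override
    void collectionProcessComplete() {
        for (List<String> instance : getInstanceList()) {
            for (String token : instance) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }
}
```

In a real pipeline the framework, not user code, calls `process` and `collectionProcessComplete`; the sketch only shows why the model-building step belongs in the latter.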
MalletEmbeddingsTrainer, MalletLdaTopicModelTrainer
JCasFileWriter_ImplBase.NamedOutputStream
Modifier and Type | Field and Description |
---|---|
static String | PARAM_COVERING_ANNOTATION_TYPE If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the topic model estimator. |
static String | PARAM_FILTER_REGEX Filter out all tokens matching that regular expression. |
static String | PARAM_FILTER_REGEX_REPLACEMENT |
static String | PARAM_LOWERCASE If set to true (default: false), all tokens are lowercased. |
static String | PARAM_MIN_TOKEN_LENGTH Ignore tokens (or any other annotation type, as specified by PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. |
static String | PARAM_NUM_THREADS The number of threads to use during model estimation. |
static String | PARAM_STOPWORDS_FILE The location of the stopwords file. |
static String | PARAM_STOPWORDS_REPLACEMENT If set, stopwords found in the PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP). |
static String | PARAM_TOKEN_FEATURE_PATH The annotation type to use as input tokens for the model estimation. |
static String | PARAM_USE_CHARACTERS If true (default: false), estimate character embeddings. |
JAR_PREFIX, PARAM_COMPRESSION, PARAM_ESCAPE_DOCUMENT_ID, PARAM_OVERWRITE, PARAM_SINGULAR_TARGET, PARAM_STRIP_EXTENSION, PARAM_TARGET_LOCATION, PARAM_USE_DOCUMENT_ID
Constructor and Description |
---|
MalletModelTrainer() |
Modifier and Type | Method and Description |
---|---|
cc.mallet.types.InstanceList | getInstanceList() |
protected int | getNumThreads() |
void | initialize(org.apache.uima.UimaContext context) |
void | process(org.apache.uima.jcas.JCas aJCas) |
collectionProcessComplete, getCompressionMethod, getOutputStream, getOutputStream, getRelativePath, getTargetLocation, isStripExtension, isUseDocumentId
getRequiredCasInterface, process
getCasInstancesRequired, hasNext, next
public static final String PARAM_TOKEN_FEATURE_PATH
The annotation type to use as input tokens for the model estimation. Default: de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. For lemmas, for instance, use de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token/lemma/value
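To make the shape of such a value concrete, the sketch below splits a feature path string into a type name and a chain of feature names. It is an illustration only: `FeaturePathSketch` is a hypothetical helper, and the assumption that the first `/` separates the type from the features is inferred from the example above, not taken from the library's actual parser.

```java
// Hypothetical helper: decomposes "some.Type/feature/subfeature" into its parts.
final class FeaturePathSketch {
    final String typeName;
    final String[] features;

    FeaturePathSketch(String path) {
        int slash = path.indexOf('/');
        if (slash < 0) {
            // No feature part: the annotation itself is used.
            typeName = path;
            features = new String[0];
        } else {
            // Assumption: everything after the first '/' is a chain of feature names.
            typeName = path.substring(0, slash);
            features = path.substring(slash + 1).split("/");
        }
    }
}
```

Under this reading, the lemma example resolves to the type `Token` with the feature chain `lemma`, then `value`.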
public static final String PARAM_NUM_THREADS
The number of threads to use during model estimation. See ComponentParameters.computeNumThreads(int).
Warning: do not set this to more than 1 when using very small (test) data sets on MalletEmbeddingsTrainer! This might prevent the process from terminating.
public static final String PARAM_MIN_TOKEN_LENGTH
Ignore tokens (or any other annotation type, as specified by PARAM_TOKEN_FEATURE_PATH) that are shorter than the given value. Default: 3.
public static final String PARAM_COVERING_ANNOTATION_TYPE
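If specified, the text contained in the given segmentation type annotations is fed as separate units ("documents") to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored. By default, the full text is used as a document.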
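The effect of a covering annotation type can be illustrated without any UIMA classes. In this sketch, `CoveringUnitsSketch` is a hypothetical helper, and each covering annotation is represented as an `int[]{begin, end}` pair of character offsets with an exclusive end. Passing `null` stands in for "no covering type configured".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one text into the units handed to the estimator.
final class CoveringUnitsSketch {
    static List<String> units(String text, List<int[]> coveringSpans) {
        List<String> result = new ArrayList<>();
        if (coveringSpans == null) {
            // No covering type: the full text is a single document.
            result.add(text);
            return result;
        }
        for (int[] span : coveringSpans) {
            // Each covered span becomes its own unit; uncovered text is skipped.
            result.add(text.substring(span[0], span[1]));
        }
        return result;
    }
}
```

For the text `One. Two. tail` with spans `[0, 4)` and `[5, 9)`, the sketch yields two units and drops the trailing `tail`, which lies outside every span.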
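public static final String PARAM_USE_CHARACTERS
If true (default: false), estimate character embeddings. PARAM_TOKEN_FEATURE_PATH is ignored.
public static final String PARAM_LOWERCASE
If set to true (default: false), all tokens are lowercased.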
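public static final String PARAM_STOPWORDS_FILE
The location of the stopwords file.
public static final String PARAM_STOPWORDS_REPLACEMENT
If set, stopwords found in the PARAM_STOPWORDS_FILE location are not removed, but replaced by the given string (e.g. STOP).
public static final String PARAM_FILTER_REGEX
Filter out all tokens matching that regular expression.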
public static final String PARAM_FILTER_REGEX_REPLACEMENT
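How the token-level parameters might interact is sketched below in plain Java. This is not the library's implementation: `TokenFilterSketch` is a hypothetical helper, the order of the steps (lowercase, length check, stopwords, regex) is an assumption, the regex is assumed to match the whole token, and PARAM_FILTER_REGEX_REPLACEMENT is assumed to behave like PARAM_STOPWORDS_REPLACEMENT because this page gives no description for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the documented token filter parameters.
final class TokenFilterSketch {
    static List<String> filter(List<String> tokens, int minTokenLength, boolean lowercase,
            Set<String> stopwords, String stopwordsReplacement,
            String filterRegex, String filterRegexReplacement) {
        Pattern pattern = filterRegex == null ? null : Pattern.compile(filterRegex);
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String t = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            if (t.length() < minTokenLength) {
                continue; // shorter than the minimum: ignored
            }
            if (stopwords.contains(t)) {
                // Stopwords are dropped unless a replacement string is set.
                if (stopwordsReplacement != null) {
                    out.add(stopwordsReplacement);
                }
                continue;
            }
            if (pattern != null && pattern.matcher(t).matches()) {
                // Assumed analogous to the stopword replacement behaviour.
                if (filterRegexReplacement != null) {
                    out.add(filterRegexReplacement);
                }
                continue;
            }
            out.add(t);
        }
        return out;
    }
}
```

With a minimum length of 3, lowercasing enabled, the stopword `the`, the replacement `STOP`, and the regex `\d+`, the tokens `The cat is on 2018 Mats` reduce to `STOP cat mats` in this sketch.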
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
Specified by: initialize in interface org.apache.uima.analysis_component.AnalysisComponent
Overrides: initialize in class org.apache.uima.fit.component.JCasConsumer_ImplBase
Throws: org.apache.uima.resource.ResourceInitializationException
public void process(org.apache.uima.jcas.JCas aJCas) throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
Specified by: process in class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Throws: org.apache.uima.analysis_engine.AnalysisEngineProcessException
protected int getNumThreads()
public cc.mallet.types.InstanceList getInstanceList()
Copyright © 2007–2018 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.