public class MalletTopicModelEstimator
extends org.apache.uima.fit.component.JCasAnnotator_ImplBase
Collects all documents as Mallet Instances before estimating the model, using a ParallelTopicModel.

Modifier and Type | Field and Description |
---|---|
static String |
PARAM_ALPHA_SUM
The sum of alphas over all topics.
|
static String |
PARAM_BETA
Beta for a single dimension of the Dirichlet prior.
|
static String |
PARAM_BURNIN_PERIOD
The number of iterations before hyperparameter optimization begins.
|
static String |
PARAM_DISPLAY_INTERVAL
The interval in which to display the estimated topics.
|
static String |
PARAM_DISPLAY_N_TOPIC_WORDS
The number of top words to display during estimation.
|
static String |
PARAM_MIN_TOKEN_LENGTH
Ignore tokens (or lemmas, respectively) that are shorter than the given value.
|
static String |
PARAM_MODEL_ENTITY_TYPE
If specified, the text contained in the given segmentation type annotations is fed as
separate units to the topic model estimator.
|
static String |
PARAM_N_ITERATIONS
The number of iterations during model estimation.
|
static String |
PARAM_N_THREADS
The number of threads to use during model estimation.
|
static String |
PARAM_N_TOPICS
The number of topics to estimate for the topic model.
|
static String |
PARAM_OPTIMIZE_INTERVAL
Interval for optimizing Dirichlet hyperparameters.
|
static String |
PARAM_RANDOM_SEED
Set random seed.
|
static String |
PARAM_SAVE_INTERVAL
Define how often to save a serialized model during estimation.
|
static String |
PARAM_TARGET_LOCATION
The target model file location.
|
static String |
PARAM_TYPE_NAME
The annotation type to use for the topic model.
|
static String |
PARAM_USE_LEMMA
If set, uses lemmas instead of original text as features.
|
static String |
PARAM_USE_SYMMETRIC_ALPHA
Use a symmetric alpha value during model estimation? Default: false.
|
protected boolean |
useLemma |
Constructor and Description |
---|
MalletTopicModelEstimator() |
Modifier and Type | Method and Description |
---|---|
void |
collectionProcessComplete() |
protected static cc.mallet.types.TokenSequence |
generateTokenSequence(org.apache.uima.jcas.JCas aJCas,
org.apache.uima.cas.Type tokenType,
boolean useLemma,
int minTokenLength)
Generate a TokenSequence from the whole document.
|
protected Collection<cc.mallet.types.TokenSequence> |
generateTokenSequences(org.apache.uima.jcas.JCas aJCas)
Generate one or multiple TokenSequences from the given document.
|
void |
initialize(org.apache.uima.UimaContext context) |
void |
process(org.apache.uima.jcas.JCas aJCas) |
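Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase: getRequiredCasInterface, process
Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase: getCasInstancesRequired, hasNext, next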
public static final String PARAM_TYPE_NAME
public static final String PARAM_TARGET_LOCATION
public static final String PARAM_N_TOPICS
public static final String PARAM_N_THREADS
public static final String PARAM_N_ITERATIONS
public static final String PARAM_USE_LEMMA
protected boolean useLemma
public static final String PARAM_BURNIN_PERIOD
public static final String PARAM_OPTIMIZE_INTERVAL
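PARAM_BURNIN_PERIOD and PARAM_OPTIMIZE_INTERVAL work together: hyperparameter optimization does not start until the burn-in period is over, and then repeats at a fixed interval. The following is a minimal sketch of such a schedule in plain Java, not the estimator's actual code. The class and method names are hypothetical. It assumes that iterations are counted from 1, that optimization runs on multiples of the interval strictly after the burn-in period, and that a non-positive interval disables optimization.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating a burn-in plus interval schedule. */
public class OptimizeSchedule {

    /**
     * Lists the iterations on which hyperparameters would be optimized.
     * Assumption: iterations run from 1 to nIterations, and optimization
     * happens on multiples of optimizeInterval after the burn-in period.
     */
    public static List<Integer> optimizationIterations(
            int nIterations, int burninPeriod, int optimizeInterval) {
        List<Integer> result = new ArrayList<>();
        if (optimizeInterval <= 0) {
            // Assumed convention: a non-positive interval disables optimization.
            return result;
        }
        for (int iteration = 1; iteration <= nIterations; iteration++) {
            if (iteration > burninPeriod && iteration % optimizeInterval == 0) {
                result.add(iteration);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 100 iterations, burn-in of 20, optimize every 25 iterations.
        System.out.println(optimizationIterations(100, 20, 25));
    }
}
```

Under these assumptions, a burn-in period that is not a multiple of the interval simply delays the first optimization to the next multiple.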
public static final String PARAM_RANDOM_SEED
public static final String PARAM_SAVE_INTERVAL
public static final String PARAM_USE_SYMMETRIC_ALPHA
public static final String PARAM_DISPLAY_INTERVAL
public static final String PARAM_DISPLAY_N_TOPIC_WORDS
public static final String PARAM_MIN_TOKEN_LENGTH
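PARAM_MIN_TOKEN_LENGTH and PARAM_USE_LEMMA together decide which strings reach the topic model: the lemma or the original text is chosen first, and values shorter than the minimum length are then ignored. The sketch below illustrates that selection with plain Java collections. It is a stand-in for illustration only: SimpleToken, TokenFeatureFilter, and toFeatures are hypothetical names, not part of DKPro Core, UIMA, or Mallet, and skipping tokens without a lemma is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for feature selection; not the DKPro or Mallet API. */
public class TokenFeatureFilter {

    /** Simplified token holding its surface text and its lemma (may be null). */
    public static final class SimpleToken {
        final String text;
        final String lemma;

        public SimpleToken(String text, String lemma) {
            this.text = text;
            this.lemma = lemma;
        }
    }

    /**
     * Picks the lemma or the surface text of each token and keeps only
     * values with at least minTokenLength characters.
     */
    public static List<String> toFeatures(
            List<SimpleToken> tokens, boolean useLemma, int minTokenLength) {
        List<String> features = new ArrayList<>();
        for (SimpleToken token : tokens) {
            String value = useLemma ? token.lemma : token.text;
            // Assumption: tokens without a value (e.g. no lemma) are skipped.
            if (value != null && value.length() >= minTokenLength) {
                features.add(value);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        List<SimpleToken> tokens = List.of(
                new SimpleToken("Dogs", "dog"),
                new SimpleToken("are", "be"),
                new SimpleToken("barking", "bark"));
        System.out.println(toFeatures(tokens, false, 4));
        System.out.println(toFeatures(tokens, true, 3));
    }
}
```

Note that the length check applies to whichever form is selected, so the same token can pass as surface text but be dropped as a lemma.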
public static final String PARAM_MODEL_ENTITY_TYPE
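If specified, the text contained in the given segmentation type annotations is fed as separate units to the topic model estimator, e.g. de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence. Text that is not within such annotations is ignored.
By default, the full document text is used as a document.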
public static final String PARAM_ALPHA_SUM
The sum of alphas over all topics. Besides the default, another recommended value is 50 / T, where T is the number of topics.
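As a worked example of the 50 / T value mentioned for PARAM_ALPHA_SUM, the sketch below computes it for a given number of topics. The class and method names are hypothetical. The second method additionally assumes a symmetric prior in which the alpha sum is spread evenly over all topics; that assumption is for illustration and is not stated in this documentation.

```java
/** Hypothetical helper for deriving Dirichlet prior values. */
public class AlphaHeuristic {

    /** Returns 50 / T for the given number of topics T. */
    public static double alphaSum(int nTopics) {
        if (nTopics <= 0) {
            throw new IllegalArgumentException("nTopics must be positive");
        }
        return 50.0 / nTopics;
    }

    /**
     * Returns the per-topic alpha, assuming the sum is divided evenly
     * across all topics (symmetric prior).
     */
    public static double symmetricAlphaPerTopic(double alphaSum, int nTopics) {
        if (nTopics <= 0) {
            throw new IllegalArgumentException("nTopics must be positive");
        }
        return alphaSum / nTopics;
    }

    public static void main(String[] args) {
        int nTopics = 10;
        double sum = alphaSum(nTopics);
        System.out.println("alpha sum: " + sum);
        System.out.println("alpha per topic: " + symmetricAlphaPerTopic(sum, nTopics));
    }
}
```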
public static final String PARAM_BETA
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
Specified by:
initialize in interface org.apache.uima.analysis_component.AnalysisComponent
Overrides:
initialize in class org.apache.uima.fit.component.JCasAnnotator_ImplBase
Throws:
org.apache.uima.resource.ResourceInitializationException
public void process(org.apache.uima.jcas.JCas aJCas) throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
Specified by:
process in class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
Throws:
org.apache.uima.analysis_engine.AnalysisEngineProcessException
protected Collection<cc.mallet.types.TokenSequence> generateTokenSequences(org.apache.uima.jcas.JCas aJCas) throws FeaturePathException
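Generate one or multiple TokenSequences from the given document. If PARAM_MODEL_ENTITY_TYPE is set, an instance is generated from each segment annotated with the given type. Otherwise, one instance is generated from the whole document.
Parameters:
aJCas - a CAS holding the document
Throws:
FeaturePathException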
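The two cases described for generateTokenSequences can be sketched without UIMA or Mallet. In the example below, character offsets stand in for annotations: with segments given, each segment yields one sequence of the tokens it covers and tokens outside every segment are ignored. Without segments, the whole document yields a single sequence. Span, SegmentSequences, and tokenSequences are hypothetical names, and passing null to mean "no segmentation type configured" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical offset-based sketch; not the UIMA or Mallet API. */
public class SegmentSequences {

    /** A text span with character offsets; text is unused for segments. */
    public static final class Span {
        final int begin;
        final int end;
        final String text;

        public Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Returns one token sequence per segment, or a single sequence for the
     * whole document when segments is null.
     */
    public static List<List<String>> tokenSequences(
            List<Span> tokens, List<Span> segments) {
        List<List<String>> sequences = new ArrayList<>();
        if (segments == null) {
            // No segmentation configured: the whole document is one unit.
            sequences.add(covered(tokens, 0, Integer.MAX_VALUE));
            return sequences;
        }
        for (Span segment : segments) {
            sequences.add(covered(tokens, segment.begin, segment.end));
        }
        return sequences;
    }

    /** Collects the text of all tokens lying fully inside [begin, end]. */
    private static List<String> covered(List<Span> tokens, int begin, int end) {
        List<String> result = new ArrayList<>();
        for (Span token : tokens) {
            if (token.begin >= begin && token.end <= end) {
                result.add(token.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> tokens = List.of(
                new Span(0, 1, "A"), new Span(2, 3, "B"),
                new Span(4, 5, "C"), new Span(6, 7, "D"));
        List<Span> segments = List.of(new Span(0, 3, null), new Span(4, 5, null));
        System.out.println(tokenSequences(tokens, segments));
        System.out.println(tokenSequences(tokens, null));
    }
}
```

In this sketch the token "D" lies outside both segments, so it only appears when the whole document is used.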
protected static cc.mallet.types.TokenSequence generateTokenSequence(org.apache.uima.jcas.JCas aJCas, org.apache.uima.cas.Type tokenType, boolean useLemma, int minTokenLength) throws FeaturePathException
Generate a TokenSequence from the whole document.
Parameters:
aJCas - a CAS holding the document
tokenType - this type will be used as token, e.g. Token, N-gram etc.
useLemma - if this is true, use lemmas
minTokenLength - the minimum token length to use
Returns:
a TokenSequence
Throws:
FeaturePathException - if the annotation type specified in PARAM_TYPE_NAME cannot be extracted.
public void collectionProcessComplete() throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
Specified by:
collectionProcessComplete in interface org.apache.uima.analysis_component.AnalysisComponent
Overrides:
collectionProcessComplete in class org.apache.uima.analysis_component.AnalysisComponent_ImplBase
Throws:
org.apache.uima.analysis_engine.AnalysisEngineProcessException
Copyright © 2007–2016 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.