public class RegexSegmenter extends SegmenterBase
The default behavior is to split sentences by a line break and tokens by whitespace.
| Modifier and Type | Field and Description |
|---|---|
| static String | PARAM_SENTENCE_BOUNDARY_REGEX: Defines the sentence boundary. |
| static String | PARAM_TOKEN_BOUNDARY_REGEX: Defines the pattern that is used as token end boundary. |

Inherited fields: PARAM_LANGUAGE, PARAM_STRICT_ZONING, PARAM_WRITE_FORM, PARAM_WRITE_SENTENCE, PARAM_WRITE_TOKEN, PARAM_ZONE_TYPES

| Constructor and Description |
|---|
| RegexSegmenter() |

| Modifier and Type | Method and Description |
|---|---|
| void | initialize(org.apache.uima.UimaContext context) |
| protected void | process(org.apache.uima.jcas.JCas aJCas, String text, int zoneBegin) |

Inherited methods: createSentence, createToken, createToken, createToken, getLanguage, getLocale, getZoneTypes, isEmpty, isStrictZoning, isWriteSentence, isWriteToken, limit, process, getRequiredCasInterface, process, getCasInstancesRequired, hasNext, next

public static final String PARAM_TOKEN_BOUNDARY_REGEX
When setting custom patterns, take into account that the final token is often terminated by a
linebreak rather than the boundary character. Therefore, the newline typically has to be
added to the group of matching characters, e.g. "tokenized-text" is correctly
tokenized with the pattern [-\n].
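
The note above can be illustrated with plain java.util.regex. This is only a sketch, not the RegexSegmenter implementation: the class TokenBoundarySketch and its tokenize helper are hypothetical, and the sketch assumes that a token is emitted only when a boundary match closes it, which is what the description of a "token end boundary" suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenBoundarySketch {

    // Hypothetical helper, not part of RegexSegmenter: every match of
    // boundaryRegex ends the token that started after the previous match.
    // Assumption: text after the last match is never closed and is dropped.
    static List<String> tokenize(String text, String boundaryRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(boundaryRegex).matcher(text);
        int begin = 0;
        while (m.find()) {
            if (m.start() > begin) { // skip empty spans between adjacent boundaries
                tokens.add(text.substring(begin, m.start()));
            }
            begin = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "tokenized-text\n";
        // Hyphen only: the final token ends at the newline, so it is never closed.
        System.out.println(tokenize(text, "[-]"));    // prints [tokenized]
        // Hyphen or newline: both tokens are closed by a boundary match.
        System.out.println(tokenize(text, "[-\\n]")); // prints [tokenized, text]
    }
}
```

Under this assumption, leaving the newline out of the character class silently loses the last token of each line, which is the pitfall the parameter description warns about.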
public static final String PARAM_SENTENCE_BOUNDARY_REGEX
public void initialize(org.apache.uima.UimaContext context)
throws org.apache.uima.resource.ResourceInitializationException
Specified by: initialize in interface org.apache.uima.analysis_component.AnalysisComponent
Overrides: initialize in class org.apache.uima.fit.component.JCasAnnotator_ImplBase
Throws: org.apache.uima.resource.ResourceInitializationException

protected void process(org.apache.uima.jcas.JCas aJCas,
String text,
int zoneBegin)
throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
Specified by: process in class SegmenterBase
Throws: org.apache.uima.analysis_engine.AnalysisEngineProcessException

Copyright © 2007–2019 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.