public class RegexSegmenter extends SegmenterBase
The default behaviour is to split sentences by a line break and tokens by whitespace.
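The default behaviour described above can be approximated with plain `java.util.regex`. This is only a rough sketch: the class name `DefaultSplitSketch` and its `segment` method are hypothetical, it uses the two documented default patterns (`\n` and `[\s\n]+`), and it does not reproduce how RegexSegmenter creates UIMA annotations or handles zones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: mimics the documented defaults with java.util.regex.
// It is not RegexSegmenter's implementation and creates no UIMA annotations.
final class DefaultSplitSketch {
    // Documented default patterns.
    static final Pattern SENTENCE_BOUNDARY = Pattern.compile("\\n");
    static final Pattern TOKEN_BOUNDARY = Pattern.compile("[\\s\\n]+");

    // Splits the text into sentences, then each sentence into tokens.
    static List<List<String>> segment(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String sentence : SENTENCE_BOUNDARY.split(text)) {
            List<String> tokens = new ArrayList<>();
            for (String token : TOKEN_BOUNDARY.split(sentence)) {
                if (!token.isEmpty()) { // skip the empty piece produced by leading whitespace
                    tokens.add(token);
                }
            }
            if (!tokens.isEmpty()) { // skip blank lines
                sentences.add(tokens);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints [[The, first, line, .], [The, second, line, .]]
        System.out.println(segment("The first line .\nThe second line ."));
    }
}
```

Each inner list corresponds to one line of input, which is what "one sentence per line" means in practice.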
Modifier and Type | Field and Description |
---|---|
static String | PARAM_SENTENCE_BOUNDARY_REGEX: Defines the sentence boundary. |
static String | PARAM_TOKEN_BOUNDARY_REGEX: Defines the pattern that is used as token end boundary. |
Fields inherited from class SegmenterBase: PARAM_LANGUAGE, PARAM_STRICT_ZONING, PARAM_WRITE_FORM, PARAM_WRITE_SENTENCE, PARAM_WRITE_TOKEN, PARAM_ZONE_TYPES
Constructor and Description |
---|
RegexSegmenter() |
Modifier and Type | Method and Description |
---|---|
void | initialize(org.apache.uima.UimaContext context) |
protected void | process(org.apache.uima.jcas.JCas aJCas, String text, int zoneBegin) |
Methods inherited from class SegmenterBase: createSentence, createToken, createToken, createToken, getLanguage, getLocale, getZoneTypes, isEmpty, isStrictZoning, isWriteSentence, isWriteToken, limit, process, trim, trimChar
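Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase: getRequiredCasInterface, process
Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase: getCasInstancesRequired, hasNext, next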
public static final String PARAM_TOKEN_BOUNDARY_REGEX
Defines the pattern that is used as token end boundary. Default: [\s\n]+ (matching whitespace and linebreaks).
When setting custom patterns, take into account that the final token is often terminated by a linebreak rather than the boundary character. Therefore, the newline typically has to be added to the group of matching characters, e.g. "tokenized-text" is correctly tokenized with the pattern [-\n].
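The effect of leaving the newline out of a custom pattern can be seen with a small stand-alone sketch. Here `String.split` stands in for the segmenter's own boundary matching, which is an assumption made purely for illustration, and the class name `TokenBoundarySketch` is hypothetical. The input string and the pattern `[-\n]` are the ones from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: String.split is used as a stand-in for the
// segmenter's matching loop, which may behave differently in detail.
final class TokenBoundarySketch {
    static List<String> split(String text, String boundaryRegex) {
        return Arrays.asList(text.split(boundaryRegex));
    }

    public static void main(String[] args) {
        String line = "tokenized-text\n";
        // Hyphen only: the trailing line break stays attached to the final piece.
        System.out.println(split(line, "-").equals(List.of("tokenized", "text\n")));    // prints true
        // Hyphen or newline: both tokens come out clean.
        System.out.println(split(line, "[-\\n]").equals(List.of("tokenized", "text"))); // prints true
    }
}
```

Note that the Java string literal `"[-\\n]"` corresponds to the regular expression `[-\n]`; the hyphen is placed first in the character class so that it is read literally.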
public static final String PARAM_SENTENCE_BOUNDARY_REGEX
Defines the sentence boundary. Default: \n (assume one sentence per line).
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
Specified by: initialize in interface org.apache.uima.analysis_component.AnalysisComponent
Overrides: initialize in class org.apache.uima.fit.component.JCasAnnotator_ImplBase
Throws: org.apache.uima.resource.ResourceInitializationException
protected void process(org.apache.uima.jcas.JCas aJCas, String text, int zoneBegin) throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
Specified by: process in class SegmenterBase
Throws: org.apache.uima.analysis_engine.AnalysisEngineProcessException
Copyright © 2007–2018 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.