CoreNlpSegmenter (DKPro Core 1.9.0 API)

java.lang.Object
- org.apache.uima.analysis_component.AnalysisComponent_ImplBase
  - org.apache.uima.analysis_component.Annotator_ImplBase
    - org.apache.uima.analysis_component.JCasAnnotator_ImplBase
      - org.apache.uima.fit.component.JCasAnnotator_ImplBase
        - de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase
          - de.tudarmstadt.ukp.dkpro.core.corenlp.CoreNlpSegmenter

All Implemented Interfaces:

org.apache.uima.analysis_component.AnalysisComponent
```
public class CoreNlpSegmenter
extends SegmenterBase
```
Tokenizer and sentence splitter using CoreNLP.

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`PARAM_BOUNDARIES_TO_DISCARD` The set of regex for sentence boundary tokens that should be discarded.
`static String`	`PARAM_BOUNDARY_MULTI_TOKEN_REGEX`
`static String`	`PARAM_BOUNDARY_TOKEN_REGEX` The set of boundary tokens.
`static String`	`PARAM_HTML_ELEMENTS_TO_DISCARD` These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching.
`static String`	`PARAM_NEWLINE_IS_SENTENCE_BREAK` Strategy for treating newlines as sentence breaks.
`static String`	`PARAM_TOKEN_REGEXES_TO_DISCARD` The set of regexes for tokens that should be discarded.

Fields inherited from class de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase
PARAM_LANGUAGE, PARAM_STRICT_ZONING, PARAM_WRITE_FORM, PARAM_WRITE_SENTENCE, PARAM_WRITE_TOKEN, PARAM_ZONE_TYPES

Constructor Summary

Constructors
Constructor and Description
`CoreNlpSegmenter()`

Method Summary

Modifier and Type	Method and Description
`void`	`initialize(org.apache.uima.UimaContext aContext)`
`protected void`	`process(org.apache.uima.jcas.JCas aJCas, String aText, int aZoneBegin)`

Methods inherited from class de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase
createSentence, createToken, createToken, createToken, getLanguage, getLocale, getZoneTypes, isEmpty, isStrictZoning, isWriteSentence, isWriteToken, limit, process, trim, trimChar

Methods inherited from class org.apache.uima.fit.component.JCasAnnotator_ImplBase
getLogger

Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
getRequiredCasInterface, process

Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase
getCasInstancesRequired, hasNext, next

Methods inherited from class org.apache.uima.analysis_component.AnalysisComponent_ImplBase
batchProcessComplete, collectionProcessComplete, destroy, getContext, getResultSpecification, reconfigure, setResultSpecification

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PARAM_BOUNDARY_TOKEN_REGEX
```
public static final String PARAM_BOUNDARY_TOKEN_REGEX
```
    The set of boundary tokens. If null, use default.
    
    See Also:
    
    WordToSentenceProcessor.WordToSentenceProcessor(java.lang.String, java.lang.String, java.util.Set<java.lang.String>, java.util.Set<java.lang.String>, java.lang.String, edu.stanford.nlp.process.WordToSentenceProcessor.NewlineIsSentenceBreak, edu.stanford.nlp.ling.tokensregex.SequencePattern<? super IN>, java.util.Set<java.lang.String>, boolean, boolean), Constant Field Values
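    To illustrate what a boundary token regex does, here is a minimal sketch that closes a sentence after every token fully matching the regex. It is not the CoreNLP or DKPro Core implementation; the class name `BoundaryTokenSketch`, the `split` helper, and the regex value used in `main` are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; not the CoreNLP or DKPro Core implementation.
final class BoundaryTokenSketch {

    // Groups tokens into sentences, closing a sentence after every token
    // that fully matches the boundary regex. The boundary token is kept.
    static List<List<String>> split(List<String> tokens, String boundaryRegex) {
        Pattern boundary = Pattern.compile(boundaryRegex);
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (boundary.matcher(token).matches()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a boundary
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical regex: a single period, or a run of '!' and '?'.
        String regex = "\\.|[!?]+";
        List<String> tokens = Arrays.asList("Hello", "world", ".", "Really", "?!", "Yes");
        // prints [[Hello, world, .], [Really, ?!], [Yes]]
        System.out.println(split(tokens, regex));
    }
}
```

    The sketch matches whole tokens rather than searching inside them, so a token such as `U.S.` is not treated as a boundary by the regex above.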
  - PARAM_BOUNDARY_MULTI_TOKEN_REGEX
```
public static final String PARAM_BOUNDARY_MULTI_TOKEN_REGEX
```
    See Also:
    
    Constant Field Values
  - PARAM_HTML_ELEMENTS_TO_DISCARD
```
public static final String PARAM_HTML_ELEMENTS_TO_DISCARD
```
    These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.
    
    See Also:
    
    Constant Field Values
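    The idea of wrapping an element name into a regex for approximate XML matching can be sketched as follows. The regex built in `tagPattern` is a simplified pattern written for this example, not the one CoreNLP uses, and `HtmlElementDiscardSketch` and its methods are hypothetical names. Matching tags are dropped from the output and end the current sentence, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; the regex is an assumption, not CoreNLP's pattern.
final class HtmlElementDiscardSketch {

    // Wraps an element name such as "p" into a regex that matches opening,
    // closing and self-closing tags, with optional attributes.
    // The name is inserted as a regex fragment, not escaped.
    static Pattern tagPattern(String elementName) {
        String regex = "<\\s*/?\\s*(?:" + elementName + ")(?:\\s+[^>]*)?\\s*/?\\s*>";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Drops every token that matches one of the element patterns and
    // starts a new sentence at that position.
    static List<List<String>> split(List<String> tokens, List<String> elements) {
        List<Pattern> patterns = new ArrayList<>();
        for (String element : elements) {
            patterns.add(tagPattern(element));
        }
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            boolean isTag = false;
            for (Pattern p : patterns) {
                if (p.matcher(token).matches()) {
                    isTag = true;
                    break;
                }
            }
            if (!isTag) {
                current.add(token);
            } else if (!current.isEmpty()) {
                sentences.add(current); // the tag forces a sentence boundary
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "First", "part", "</p>", "<p class=\"x\">", "Second", "</p>");
        // prints [[First, part], [Second]]
        System.out.println(split(tokens, Arrays.asList("p", "sent")));
    }
}
```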
  - PARAM_BOUNDARIES_TO_DISCARD
```
public static final String PARAM_BOUNDARIES_TO_DISCARD
```
    The set of regex for sentence boundary tokens that should be discarded.
    
    See Also:
    
    WordToSentenceProcessor.DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD, Constant Field Values
  - PARAM_NEWLINE_IS_SENTENCE_BREAK
```
public static final String PARAM_NEWLINE_IS_SENTENCE_BREAK
```
    Strategy for treating newlines as sentence breaks.
    
    See Also:
    
    Constant Field Values
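    The page does not list the available strategies. As a rough illustration, the sketch below assumes three strategies named after the constants of CoreNLP's `WordToSentenceProcessor.NewlineIsSentenceBreak` enum (`NEVER`, `ALWAYS`, `TWO_CONSECUTIVE`). The local `Strategy` enum, the `NewlineBreakSketch` class, and the use of `"\n"` as a newline token are assumptions for this example, and the string values the parameter actually accepts are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the CoreNLP or DKPro Core implementation.
final class NewlineBreakSketch {

    // Local stand-in for the strategies; names are assumed, see lead-in.
    enum Strategy { NEVER, ALWAYS, TWO_CONSECUTIVE }

    // Splits a token stream in which "\n" marks a newline. Newline tokens
    // are never copied into the output sentences.
    static List<List<String>> split(List<String> tokens, Strategy strategy) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int newlineRun = 0; // number of newline tokens seen in a row
        for (String token : tokens) {
            if (token.equals("\n")) {
                newlineRun++;
                boolean breakHere = strategy == Strategy.ALWAYS
                        || (strategy == Strategy.TWO_CONSECUTIVE && newlineRun >= 2);
                if (breakHere && !current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                newlineRun = 0;
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "\n", "c", "\n", "\n", "d");
        System.out.println(split(tokens, Strategy.NEVER));           // [[a, b, c, d]]
        System.out.println(split(tokens, Strategy.ALWAYS));          // [[a, b], [c], [d]]
        System.out.println(split(tokens, Strategy.TWO_CONSECUTIVE)); // [[a, b, c], [d]]
    }
}
```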
  - PARAM_TOKEN_REGEXES_TO_DISCARD
```
public static final String PARAM_TOKEN_REGEXES_TO_DISCARD
```
    The set of regexes for tokens that should be discarded.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - CoreNlpSegmenter
```
public CoreNlpSegmenter()
```
- Method Detail
  - initialize
```
public void initialize(org.apache.uima.UimaContext aContext)
                throws org.apache.uima.resource.ResourceInitializationException
```
    Specified by:
    
    initialize in interface org.apache.uima.analysis_component.AnalysisComponent
    
    Overrides:
    
    initialize in class org.apache.uima.fit.component.JCasAnnotator_ImplBase
    
    Throws:
    
    org.apache.uima.resource.ResourceInitializationException
  - process
```
protected void process(org.apache.uima.jcas.JCas aJCas,
                       String aText,
                       int aZoneBegin)
                throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
```
    Specified by:
    
    process in class SegmenterBase
    
    Throws:
    
    org.apache.uima.analysis_engine.AnalysisEngineProcessException
