public class Conll2002Reader extends JCasResourceCollectionReader_ImplBase
By default, reads the CoNLL 2002 named entity format.
The reader is also compatible with the CoNLL-based GermEval 2014 named entity format, in which the columns are separated by tabs, the first column holds the token number, and an extra column carries embedded named entities (see below). Additional parameters determine the column separator, whether there is a leading token number column, and whether there is an extra column for embedded named entities. (Note: Currently, the reader only reads the outer named entities, not the embedded ones.)
The following snippet shows an example of the TSV format:
# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1 Aufgrund O O
2 seiner O O
3 Initiative O O
4 fand O O
5 2001/2002 O O
6 in O O
7 Stuttgart B-LOC O
8 , O O
9 Braunschweig B-LOC O
10 und O O
11 Bonn B-LOC O
12 eine O O
13 große O O
14 und O O
15 publizistisch O O
16 vielbeachtete O O
17 Troia-Ausstellung B-LOCpart O
18 statt O O
19 , O O
20 „ O O
21 Troia B-OTH B-LOC
22 - I-OTH O
23 Traum I-OTH O
24 und I-OTH O
25 Wirklichkeit I-OTH O
26 “ O O
27 . O O
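To make the column layout concrete, here is a minimal sketch of how the outer named entity column of such lines could be decoded into spans. It is not the reader's actual implementation: the class `GermEvalOuterSpans` and its methods are hypothetical, it assumes tab-separated columns (token number, token, outer tag, embedded tag), and it ignores the embedded column, in line with the note above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the reader: decodes the outer BIO column. */
final class GermEvalOuterSpans {
    private final List<String> spans = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String type; // type of the currently open span, or null if none

    /** Closes the open span, if any, and records it as "TYPE:covered text". */
    private void flush() {
        if (type != null) {
            spans.add(type + ":" + text);
        }
        type = null;
        text.setLength(0);
    }

    private void accept(String line) {
        // Header lines ("# ...") and blank lines end any open span.
        if (line.startsWith("#") || line.trim().isEmpty()) {
            flush();
            return;
        }
        String[] cols = line.split("\t");
        if (cols.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 columns: " + line);
        }
        String token = cols[1];
        String tag = cols[2]; // outer named entity column only
        if (tag.startsWith("I-") && tag.substring(2).equals(type)) {
            text.append(' ').append(token); // continue the open span
            return;
        }
        flush();
        if (tag.startsWith("B-") || tag.startsWith("I-")) {
            type = tag.substring(2); // open a new span
            text.append(token);
        }
    }

    static List<String> decode(List<String> lines) {
        GermEvalOuterSpans decoder = new GermEvalOuterSpans();
        for (String line : lines) {
            decoder.accept(line);
        }
        decoder.flush();
        return decoder.spans;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]",
                "7\tStuttgart\tB-LOC\tO",
                "8\t,\tO\tO",
                "21\tTroia\tB-OTH\tB-LOC",
                "22\t-\tI-OTH\tO",
                "23\tTraum\tI-OTH\tO");
        System.out.println(decode(lines)); // prints [LOC:Stuttgart, OTH:Troia - Traum]
    }
}
```

Treating an `I-` tag without a preceding `B-` as the start of a span is a leniency chosen for this sketch; a stricter decoder could reject such input instead.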
Modifier and Type | Class and Description |
---|---|
static class | Conll2002Reader.ColumnSeparators: Column separators. |
ResourceCollectionReaderBase.Resource
Modifier and Type | Field and Description |
---|---|
static String | PARAM_COLUMN_SEPARATOR: Column separator parameter. |
static String | PARAM_HAS_EMBEDDED_NAMED_ENTITY: Has embedded named entity extra column. |
static String | PARAM_HAS_HEADER: Indicates that there is a header line before the sentence. |
static String | PARAM_HAS_TOKEN_NUMBER: Token number flag. |
static String | PARAM_INTERN_TAGS: Use the String.intern() method on tags. |
static String | PARAM_NAMED_ENTITY_MAPPING_LOCATION: Location of the mapping file for named entity tags to UIMA types. |
static String | PARAM_READ_NAMED_ENTITY: Read named entity information. |
static String | PARAM_SOURCE_ENCODING: Character encoding of the input data. |
EXCLUDE_PREFIX, INCLUDE_PREFIX, JAR_PREFIX, KEY_RESOURCE_RESOLVER, PARAM_INCLUDE_HIDDEN, PARAM_LANGUAGE, PARAM_LOG_FREQ, PARAM_PATH, PARAM_PATTERNS, PARAM_SOURCE_LOCATION, PARAM_USE_DEFAULT_EXCLUDES
Constructor and Description |
---|
Conll2002Reader() |
Modifier and Type | Method and Description |
---|---|
void | getNext(org.apache.uima.jcas.JCas aJCas): Subclasses implement this method rather than JCasResourceCollectionReader_ImplBase.getNext(CAS). |
void | initialize(org.apache.uima.UimaContext aContext) |
getNext, initCas, initCas
getBase, getBase, getDefaultExcludes, getLanguage, getProgress, getResolver, getResourceIterator, getResources, getSourceLocation, hasNext, initCas, initCas, isSingleLocation, locationToUrl, nextFile, scan
close, getLogger, initialize
destroy, getCasInitializer, getProcessingResourceMetaData, initialize, isConsuming, reconfigure, setCasInitializer, typeSystemInit
getConfigParameterValue, getConfigParameterValue, setConfigParameterValue, setConfigParameterValue
getCasManager, getMetaData, getRelativePathResolver, getResourceManager, getUimaContext, getUimaContextAdmin, setLogger, setMetaData
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public static final String PARAM_COLUMN_SEPARATOR
Column separator parameter. Accepted values are defined by Conll2002Reader.ColumnSeparators, for example Conll2002Reader.ColumnSeparators.TAB.getName().
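The parameter takes the name of a separator rather than the separator character itself. The following sketch illustrates that pattern with a hypothetical stand-in enum. `SeparatorSketch`, its constants' string values, `getValue()` and `byName()` are assumptions made for illustration; only the existence of a `TAB` constant with a `getName()` method is taken from the description above.

```java
/** Hypothetical stand-in for Conll2002Reader.ColumnSeparators; values are assumed. */
enum SeparatorSketch {
    SPACE("space", " "),
    TAB("tab", "\t");

    private final String label; // the string passed as the parameter value
    private final String value; // the character sequence that splits columns

    SeparatorSketch(String label, String value) {
        this.label = label;
        this.value = value;
    }

    String getName() {
        return label;
    }

    String getValue() {
        return value;
    }

    /** Resolves a configured parameter value back to the separator it names. */
    static SeparatorSketch byName(String label) {
        for (SeparatorSketch s : values()) {
            if (s.label.equals(label)) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown column separator: " + label);
    }
}
```

With such an enum, a reader would receive `SeparatorSketch.TAB.getName()` as its configuration value, look it up with `byName`, and split each line on the result of `getValue()`.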
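public static final String PARAM_HAS_TOKEN_NUMBER
Token number flag.
public static final String PARAM_HAS_HEADER
Indicates that there is a header line before the sentence.
public static final String PARAM_SOURCE_ENCODING
Character encoding of the input data.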
public static final String PARAM_INTERN_TAGS
Use the String.intern() method on tags. This is usually a good idea, because it avoids filling the heap with thousands of strings that represent only a few different tags.
Default: true
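The effect of interning can be seen in a small standalone sketch. `InternTagsSketch` and `readTag` are hypothetical names; `readTag` merely simulates a tag string being created anew for every input line.

```java
/** Hypothetical demo of why interning tag strings saves heap space. */
final class InternTagsSketch {

    /** Simulates reading a tag from a line: every call yields a new String object. */
    static String readTag(String raw) {
        return new String(raw);
    }

    public static void main(String[] args) {
        String a = readTag("B-LOC");
        String b = readTag("B-LOC");
        System.out.println(a == b);                   // false: two separate objects
        System.out.println(a.intern() == b.intern()); // true: one shared instance
    }
}
```

Without interning, each occurrence of a tag keeps its own copy alive; with interning, all equal tags refer to a single pooled string.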
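public static final String PARAM_READ_NAMED_ENTITY
Read named entity information.
Default: true
public static final String PARAM_HAS_EMBEDDED_NAMED_ENTITY
Has embedded named entity extra column.
Default: false
public static final String PARAM_NAMED_ENTITY_MAPPING_LOCATION
Location of the mapping file for named entity tags to UIMA types.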
public void initialize(org.apache.uima.UimaContext aContext) throws org.apache.uima.resource.ResourceInitializationException
Overrides: initialize in class ResourceCollectionReaderBase
Throws: org.apache.uima.resource.ResourceInitializationException
public void getNext(org.apache.uima.jcas.JCas aJCas) throws IOException, org.apache.uima.collection.CollectionException
Description copied from class: JCasResourceCollectionReader_ImplBase
Subclasses implement this method rather than JCasResourceCollectionReader_ImplBase.getNext(CAS).
Specified by: getNext in class JCasResourceCollectionReader_ImplBase
Parameters: aJCas - the JCas.
Throws: IOException - if an I/O error occurs reading the data.
Throws: org.apache.uima.collection.CollectionException - if another type of error occurs.
Copyright © 2007–2018 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.