DKPro TC works with two data types that are expected to be present when an experiment is executed: the TextClassificationTarget and the corresponding actual value of this target, the TextClassificationOutcome. For sequence classification an additional TextClassificationSequence is necessary that explicitly marks the span of the sequence. The data readers provided by DKPro TC in the package dkpro-tc-io set these values directly in the reader. When one of the many DKPro Core data format readers is used, this information is not set, since the required data types exist only in DKPro TC and are not used by DKPro Core.
Consequently, an additional step is needed that adds the required annotations. This is most easily done by adding a Preprocessing step to a DKPro TC experiment. Below is an example of how it can be used for Part-of-Speech (PoS) tagging. In PoS tagging, each token (TextClassificationTarget) of a sentence (TextClassificationSequence) is assigned a single label (TextClassificationOutcome). In order to be used in the Preprocessing step, the class has to inherit from JCasAnnotator_ImplBase:
public class SequenceOutcomeAnnotator
    extends JCasAnnotator_ImplBase
{
    private int tcId = 0;

    @Override
    public void process(JCas aJCas)
        throws AnalysisEngineProcessException
    {
        // Iterate all sentences
        for (Sentence sent : JCasUtil.select(aJCas, Sentence.class)) {
            // Each sentence is a classification sequence
            TextClassificationSequence sequence = new TextClassificationSequence(aJCas,
                    sent.getBegin(), sent.getEnd());
            sequence.addToIndexes();

            // Iterate all tokens in the span of the sentence
            List<Token> tokens = JCasUtil.selectCovered(aJCas, Token.class, sent);
            for (Token token : tokens) {
                // Each token is a classification target, i.e. we want to predict
                // a label for each word in the sentence/sequence
                TextClassificationTarget target = new TextClassificationTarget(aJCas,
                        token.getBegin(), token.getEnd());
                target.setId(tcId++);
                target.setSuffix(token.getCoveredText());
                target.addToIndexes();

                // The outcome annotation defines the `true` value that shall be predicted.
                // The outcome shares the same span as the token above to keep annotations aligned.
                TextClassificationOutcome outcome = new TextClassificationOutcome(aJCas,
                        token.getBegin(), token.getEnd());
                outcome.setOutcome(getTextClassificationOutcome(aJCas, target));
                outcome.addToIndexes();
            }
        }
    }

    public String getTextClassificationOutcome(JCas jcas, TextClassificationTarget target)
    {
        // Select the POS annotation in range of the target (the token in this case)
        List<POS> posList = JCasUtil.selectCovered(jcas, POS.class, target);
        return posList.get(0).getPosValue().replaceAll(" ", "_"); // return this value as the expected outcome
    }
}
To add the Preprocessing to your experiment, only a minor modification to your code is necessary:
ExperimentTrainTest experiment = new ExperimentTrainTest("CrfExperiment");
// The following line adds the above class as the preprocessing component
experiment.setPreprocessing(createEngineDescription(SequenceOutcomeAnnotator.class));
experiment.setParameterSpace(pSpace);
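The createEngineDescription call used above and below is the standard uimaFIT factory method. If your class does not already import it, a minimal sketch of the required static import looks as follows (the package of ExperimentTrainTest itself varies between DKPro TC versions, so it is omitted here):
// uimaFIT factory for building analysis engine descriptions
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;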
The Preprocessing is not limited to a single component. Assuming we read plain text with the reader, we additionally need tokenization (to split the text into sentences and words) and a PoS tagger that provides the expected outcomes in order to train a sequence classifier. In practice, you probably do not want to train a model on automatically annotated tags, but for the sake of this example, let's assume you do. In this case, the preprocessing could look like shown below. Note the order of the preprocessing steps: PoS tagging requires tokens, which is why the segmentation step comes first; the SequenceOutcomeAnnotator requires both tokens and PoS annotations and is consequently the last component.
experiment.setPreprocessing(createEngineDescription(
        createEngineDescription(BreakIteratorSegmenter.class),
        createEngineDescription(OpenNlpPosTagger.class),
        createEngineDescription(SequenceOutcomeAnnotator.class)
));
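The components in the chain can also be configured through their uimaFIT parameters. The sketch below is an assumption-based example, not part of the experiment above: it relies on the standard DKPro Core packages (de.tudarmstadt.ukp.dkpro.core.tokit and de.tudarmstadt.ukp.dkpro.core.opennlp) and the common PARAM_LANGUAGE parameter; check the documentation of the component versions you use and adapt the language code to your data.
// Sketch: pin the segmenter and the tagger to English input.
// PARAM_LANGUAGE is the usual DKPro Core language parameter (assumption for your version).
experiment.setPreprocessing(createEngineDescription(
        createEngineDescription(BreakIteratorSegmenter.class,
                BreakIteratorSegmenter.PARAM_LANGUAGE, "en"),
        createEngineDescription(OpenNlpPosTagger.class,
                OpenNlpPosTagger.PARAM_LANGUAGE, "en"),
        createEngineDescription(SequenceOutcomeAnnotator.class)
));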