DKPro Core - Working with n-grams

DKPro components for handling n-grams.

Annotating n-grams

If n-grams are required by some downstream component, the NGramAnnotator can be used to add NGram annotations to the CAS.

Quickly Creating n-grams from Annotations

If the n-grams are needed only locally and not by downstream components, it is more efficient to use NGramIterable, which constructs the n-grams from an arbitrary set of annotations without adding them to the CAS.

If you are just interested in the String representation of the n-grams, you can also use NGramStringIterable.

String[] tokens = StringUtils.split("This is a simple example sentence .");
for (String ngram : new NGramStringIterable(tokens, 2, 2)) {
    System.out.println(ngram);
}
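
To make clear what kind of output to expect from such an iteration, the following is a minimal, dependency-free sketch of n-gram construction over a token array. It is not the DKPro implementation: the class name NGramSketch and the method ngrams are hypothetical, and the order in which n-grams of different lengths are returned is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of DKPro Core.
class NGramSketch {

    /** Returns all n-grams with minN <= n <= maxN, tokens joined by single spaces. */
    static List<String> ngrams(String[] tokens, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("require 1 <= minN <= maxN");
        }
        List<String> result = new ArrayList<>();
        // For each start position, emit every n-gram that still fits into the array.
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                String[] window = Arrays.copyOfRange(tokens, start, start + n);
                result.add(String.join(" ", window));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "This is a simple example sentence .".split(" ");
        // Bigrams only, mirroring the (tokens, 2, 2) call above.
        for (String ngram : ngrams(tokens, 2, 2)) {
            System.out.println(ngram);
        }
    }
}
```

For the seven tokens of the example sentence, this sketch yields six bigrams, starting with "This is" and ending with "sentence .".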

n-gram Frequency Counts

Many applications need to determine how often a certain n-gram (or phrase) occurs in a collection. DKPro supports this by providing special external resources.

A component that wants to use frequency counts should declare this by specifying an external resource:

public static final String FREQUENCY_COUNT_RESOURCE = "FrequencyProvider";
@ExternalResource(key = FREQUENCY_COUNT_RESOURCE)
private FrequencyCountProvider frequencyProvider;

The frequency count can then be accessed using:

long freq = frequencyProvider.getFrequency(phrase);
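
As an illustration of how a component might use such a lookup, the sketch below picks the most frequent phrase from a list of candidates. It deliberately avoids the DKPro types: the provider is modelled as a plain ToLongFunction<String>, and the class name FrequencyLookupSketch, the method mostFrequent, and the toy counts are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical consumer of frequency counts; the provider is a stand-in function.
class FrequencyLookupSketch {

    /** Returns the candidate with the highest frequency; on ties the earlier one wins. */
    static String mostFrequent(List<String> candidates, ToLongFunction<String> frequencyOf) {
        String best = null;
        long bestFreq = -1;
        for (String candidate : candidates) {
            long freq = frequencyOf.applyAsLong(candidate);
            if (freq > bestFreq) {
                best = candidate;
                bestFreq = freq;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy counts standing in for a real frequency resource.
        Map<String, Long> toyCounts = Map.of("a lot", 120L, "alot", 3L);
        ToLongFunction<String> provider = phrase -> toyCounts.getOrDefault(phrase, 0L);
        System.out.println(mostFrequent(List.of("alot", "a lot"), provider));
    }
}
```

In a real component, the function argument would be replaced by a call to the injected frequency provider shown above.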

The user of this component then just needs to add the resource as a configuration parameter:

AnalysisEngineDescription desc = AnalysisEngineFactory.createPrimitiveDescription(
        Annotator.class,
        Annotator.FREQUENCY_COUNT_RESOURCE,
        ExternalResourceFactory.createExternalResourceDescription(
                Web1TFrequencyCountResource.class,
                Web1TFrequencyCountResource.PARAM_MIN_NGRAM_LEVEL, "1",
                Web1TFrequencyCountResource.PARAM_MAX_NGRAM_LEVEL, "3",
                Web1TFrequencyCountResource.PARAM_INDEX_PATH, indexPath));

The Web1TFrequencyCountResource of DKPro Core directly supports the format of the Google Web1T web size n-gram corpus.
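
For orientation, a Web1T data file stores one n-gram per line: the tokens separated by single spaces, followed by a tab and the count. The sketch below parses one such line. It assumes exactly this layout; the class Web1TLineSketch and its nested Entry type are hypothetical and not part of DKPro Core.

```java
// Hypothetical parser for a single "n-gram<TAB>count" line; for illustration only.
class Web1TLineSketch {

    /** One parsed line: the n-gram text and its frequency count. */
    static final class Entry {
        public final String ngram;
        public final long count;

        Entry(String ngram, long count) {
            this.ngram = ngram;
            this.count = count;
        }
    }

    /** Splits the line at its last tab; the text before it is the n-gram. */
    static Entry parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator in: " + line);
        }
        String ngram = line.substring(0, tab);
        long count = Long.parseLong(line.substring(tab + 1).trim());
        return new Entry(ngram, count);
    }

    public static void main(String[] args) {
        Entry entry = parse("simple example\t42");
        System.out.println(entry.ngram + " -> " + entry.count);
    }
}
```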

Creating Web1T data files

You can create your own n-gram frequency count models using the Web1TFormatWriter provided by DKPro.

The following example shows how to create an n-gram model from the ACL Anthology corpus:

public class CreateAclNgrams
{
    private static final String OUTPUT_PATH = "target/ngrams/";

    public static void main(String[] args) throws Exception
    {
        File aclPath = DKProContext.getContext().getWorkspace("acl_anthology");

        // read all text files below the corpus directory
        CollectionReader reader = createCollectionReader(
                AclAnthologyReader.class,
                AclAnthologyReader.PARAM_PATH, aclPath.getAbsolutePath(),
                AclAnthologyReader.PARAM_PATTERNS, new String[] { "[+]**/*.txt" });

        // the writer builds n-grams from Token annotations, so segment first
        AnalysisEngineDescription segmenter = createPrimitiveDescription(
                BreakIteratorSegmenter.class);

        // write 1- to 3-grams that meet the minimum frequency of 2
        AnalysisEngineDescription ngramWriter = createPrimitiveDescription(
                Web1TFormatWriter.class,
                Web1TFormatWriter.PARAM_TARGET_LOCATION, OUTPUT_PATH,
                Web1TFormatWriter.PARAM_INPUT_TYPES, new String[] { Token.class.getName() },
                Web1TFormatWriter.PARAM_MIN_NGRAM_LENGTH, 1,
                Web1TFormatWriter.PARAM_MAX_NGRAM_LENGTH, 3,
                Web1TFormatWriter.PARAM_MIN_FREQUENCY, 2);

        SimplePipeline.runPipeline(reader, segmenter, ngramWriter);

        // create the necessary indexes in the directory the writer just filled
        JWeb1TIndexer indexCreator = new JWeb1TIndexer(OUTPUT_PATH, 3);
        indexCreator.create();
    }
}
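
The effect of the n-gram length and minimum frequency settings can be illustrated with a small, dependency-free sketch that counts n-grams in a token array, drops the rare ones, and emits sorted "n-gram<TAB>count" lines. This is not how Web1TFormatWriter is implemented: the class MinFrequencySketch and the method toLines are hypothetical, and the sketch assumes that the frequency threshold is inclusive and that lines are sorted in plain String order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of counting and filtering; not the DKPro writer.
class MinFrequencySketch {

    /** Counts all n-grams with minN <= n <= maxN and keeps those seen at least minFreq times. */
    static List<String> toLines(String[] tokens, int minN, int maxN, int minFreq) {
        // TreeMap keeps the n-grams sorted by natural String order.
        Map<String, Integer> counts = new TreeMap<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, start, start + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minFreq) { // inclusive threshold (assumption)
                lines.add(entry.getKey() + "\t" + entry.getValue());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        for (String line : toLines(tokens, 1, 3, 2)) {
            System.out.println(line);
        }
    }
}
```

With the toy input "to be or not to be", only "be", "to", and "to be" occur twice, so every trigram and all other n-grams are filtered out.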