DKPro Core - Working with n-grams

Analytics

DKPro components for handling n-grams.

Annotating n-grams

If n-grams are required by some downstream component, the NGramAnnotator can be used to add NGram annotations to the CAS.

Quickly Creating n-grams from Annotations

If the n-grams are needed only locally and not by downstream components, it is more efficient to use NGramIterable, which constructs the n-grams from an arbitrary set of annotations without adding them to the CAS.

If you are interested only in the String representation of the n-grams, you can use NGramStringIterable instead. The following example prints the bigrams of a tokenized sentence (minimum and maximum n are both 2):

String[] tokens = StringUtils.split("This is a simple example sentence .");
for (String ngram : new NGramStringIterable(tokens, 2, 2)) {
    System.out.println(ngram);
}
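
To make the behaviour of such an iterable concrete, the following is a minimal plain-Java sketch of a sliding-window n-gram generator. It is not the DKPro implementation: the class name NGramSketch and the method ngrams are hypothetical, and joining tokens with a single space as well as the order of the results are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; NOT DKPro's NGramStringIterable.
final class NGramSketch {

    /** Returns all n-grams with minN <= n <= maxN as space-joined strings. */
    static List<String> ngrams(String[] tokens, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("require 1 <= minN <= maxN");
        }
        List<String> result = new ArrayList<>();
        // Slide over every start position and emit each n-gram that still fits.
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                String[] window = Arrays.copyOfRange(tokens, start, start + n);
                result.add(String.join(" ", window));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "This is a simple example sentence .".split(" ");
        for (String ngram : ngrams(tokens, 2, 2)) {
            System.out.println(ngram);
        }
    }
}
```

For the seven tokens above, a window of size 2 yields six bigrams, starting with "This is" and ending with "sentence .".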

n-gram Frequency Counts

Many applications need to determine how often a certain n-gram (or phrase) occurs in a collection. DKPro supports this by providing special external resources.

A component that wants to use frequency counts should declare this by specifying an external resource:

public static final String FREQUENCY_COUNT_RESOURCE = "FrequencyProvider";
@ExternalResource(key = FREQUENCY_COUNT_RESOURCE)
private FrequencyCountProvider frequencyProvider;

The frequency count can then be accessed using:

long freq = frequencyProvider.getFrequency(phrase);

A user of this component then only needs to pass the resource as a configuration parameter:

AnalysisEngineDescription desc = AnalysisEngineFactory.createPrimitiveDescription(
    Annotator.class,
    Annotator.FREQUENCY_COUNT_RESOURCE, ExternalResourceFactory.createExternalResourceDescription(
        Web1TFrequencyCountResource.class,
        Web1TFrequencyCountResource.PARAM_MIN_NGRAM_LEVEL, "1",
        Web1TFrequencyCountResource.PARAM_MAX_NGRAM_LEVEL, "3",
        Web1TFrequencyCountResource.PARAM_INDEX_PATH, indexPath
    )
);

The Web1TFrequencyCountResource of DKPro Core directly supports the format of the Google Web1T web size n-gram corpus.
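
As a rough illustration of what a frequency lookup over such data involves, the sketch below loads lines in the Web1T layout (the n-gram's tokens separated by spaces, followed by a tab and the count) into memory. The class InMemoryFrequencyLookup is a hypothetical stand-in, for example for small tests, and is not part of DKPro Core; returning 0 for unseen phrases is a choice made for this sketch and says nothing about how the real resource behaves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in; NOT part of DKPro Core.
final class InMemoryFrequencyLookup {

    private final Map<String, Long> counts = new HashMap<>();

    /** Loads lines of the form "token1 token2<TAB>count". */
    InMemoryFrequencyLookup(List<String> lines) {
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a tab-separated count
            }
            String ngram = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(ngram, count);
        }
    }

    /** Returns the stored count, or 0 if the phrase was never loaded. */
    long getFrequency(String phrase) {
        return counts.getOrDefault(phrase, 0L);
    }

    public static void main(String[] args) {
        InMemoryFrequencyLookup lookup = new InMemoryFrequencyLookup(
            Arrays.asList("simple example\t42", "example\t100"));
        System.out.println(lookup.getFrequency("simple example"));
    }
}
```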

Creating Web1T data files

You can create your own n-gram frequency count models using the Web1TFormatWriter provided by DKPro.

The following example shows how to create an n-gram model from the ACL Anthology corpus:

public class CreateAclNgrams
{
    private static final String OUTPUT_PATH = "target/ngrams/";
    
    public static void main(String[] args) throws Exception
    {       
        File aclPath = DKProContext.getContext().getWorkspace("acl_anthology");
        
        CollectionReader reader = createCollectionReader(
            AclAnthologyReader.class,
            AclAnthologyReader.PARAM_PATH, aclPath.getAbsolutePath(),
            AclAnthologyReader.PARAM_PATTERNS, new String[] { "[+]**/*.txt" });
 
        AnalysisEngineDescription segmenter = createPrimitiveDescription(
            BreakIteratorSegmenter.class);
 
        AnalysisEngineDescription ngramWriter = createPrimitiveDescription(
            Web1TFormatWriter.class,
            Web1TFormatWriter.PARAM_TARGET_LOCATION, OUTPUT_PATH,
            Web1TFormatWriter.PARAM_INPUT_TYPES, new String[] { Token.class.getName() },
            Web1TFormatWriter.PARAM_MIN_NGRAM_LENGTH, 1,
            Web1TFormatWriter.PARAM_MAX_NGRAM_LENGTH, 3,
            Web1TFormatWriter.PARAM_MIN_FREQUENCY, 2);
        
        SimplePipeline.runPipeline(reader, segmenter, ngramWriter);

        // create the necessary indexes over the n-gram files written to OUTPUT_PATH
        JWeb1TIndexer indexCreator = new JWeb1TIndexer(OUTPUT_PATH, 3);
        indexCreator.create();
    }
}
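
To show what the length and frequency parameters in the pipeline above amount to, here is a plain-Java sketch that counts all n-grams of length 1 to 3 over a few tokenized sentences and keeps those seen at least twice. It does not use Web1TFormatWriter; the class NGramCountSketch and the method countLines are hypothetical, and both the "n-gram<TAB>count" line layout and the reading of the minimum frequency as "count >= threshold" are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; NOT DKPro's Web1TFormatWriter.
final class NGramCountSketch {

    /** Counts n-grams of length minN..maxN and returns sorted "ngram<TAB>count" lines. */
    static List<String> countLines(List<String[]> sentences, int minN, int maxN, long minFreq) {
        // A TreeMap keeps the n-grams sorted, which makes the output deterministic.
        Map<String, Long> counts = new TreeMap<>();
        for (String[] tokens : sentences) {
            for (int start = 0; start < tokens.length; start++) {
                for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                    String ngram = String.join(" ", Arrays.copyOfRange(tokens, start, start + n));
                    counts.merge(ngram, 1L, Long::sum);
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() >= minFreq) { // drop n-grams below the threshold
                lines.add(entry.getKey() + "\t" + entry.getValue());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> sentences = new ArrayList<>();
        sentences.add("this is a test".split(" "));
        sentences.add("this is another test".split(" "));
        for (String line : countLines(sentences, 1, 3, 2)) {
            System.out.println(line);
        }
    }
}
```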