Index the Google Web1T corpus in Lucene.
All values are stored in the index. The fields are:

* gram: The n-gram
* freq: The frequency of the n-gram in the corpus
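
As an illustration of how a corpus line maps onto these two fields, here is a minimal sketch. It assumes the usual Web1T layout of one n-gram per line, with the tokens separated by spaces and the count after a tab. The class `Web1TLine` and its `parse` method are hypothetical names made up for this example; the indexer's own parsing code may look different, and no Lucene calls are shown.

```java
import java.util.Optional;

/** Hypothetical helper: splits one Web1T line into the two values the index stores. */
final class Web1TLine {
    final String gram; // value for the "gram" field
    final long freq;   // value for the "freq" field

    private Web1TLine(String gram, long freq) {
        this.gram = gram;
        this.freq = freq;
    }

    /**
     * Parses a line of the assumed form "token1 token2 ...<TAB>count".
     * Returns an empty Optional for lines that do not match that shape.
     */
    static Optional<Web1TLine> parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab <= 0 || tab == line.length() - 1) {
            return Optional.empty(); // no tab, empty n-gram, or empty count
        }
        try {
            long freq = Long.parseLong(line.substring(tab + 1).trim());
            return Optional.of(new Web1TLine(line.substring(0, tab), freq));
        } catch (NumberFormatException e) {
            return Optional.empty(); // count is not a number
        }
    }

    public static void main(String[] args) {
        // prints: gram=guten Morgen freq=12345
        parse("guten Morgen\t12345").ifPresent(
            l -> System.out.println("gram=" + l.gram + " freq=" + l.freq));
    }
}
```

The count is kept as a `long` here because frequencies in a web-scale corpus can be large; whether the index stores it as a number or as text is not specified above.
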
Note: This was only tested with the German Web1T corpus. The English one is much bigger, and
Lucene can only handle about Integer.MAX_VALUE (2,147,483,647) documents per index. Each n-gram is a
separate document, so the English corpus may exceed that limit.
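
One way to check in advance whether a corpus stays under the document limit is to count its n-gram lines. The following is a rough sketch, assuming the extracted files are plain text with one n-gram per line, each terminated by a newline. `NGramCapacityCheck` and its methods are hypothetical and not part of the indexer.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical pre-flight check, not part of the indexer. */
final class NGramCapacityCheck {
    /** Upper bound named in the note; Lucene's actual hard limit may be slightly lower. */
    static final long MAX_DOCS = Integer.MAX_VALUE;

    /** Counts newline bytes in one file (assumes every n-gram line ends with '\n'). */
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    lines++;
                }
            }
        }
        return lines;
    }

    /** Sums the line counts of all regular files below root. */
    static long countNGrams(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long total = 0;
        for (Path f : files) {
            total += countLines(f);
        }
        return total;
    }

    /** True if that many n-grams, one document each, stay within the bound. */
    static boolean fitsInOneIndex(long nGrams) {
        return nGrams <= MAX_DOCS;
    }

    public static void main(String[] args) throws IOException {
        long total = countNGrams(Paths.get(args[0]));
        System.out.println(total + " n-grams, fits in one index: " + fitsInOneIndex(total));
    }
}
```

Run it with the folder of extracted n-gram files as its only argument. If the total exceeds the bound, the corpus would have to be split across several indexes or filtered first; neither option is covered here.
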
The /bin folder contains a script to run the indexer. Simply run:
./bin/web1TLuceneIndexer.sh \
    --web1t PATH/TO/FOLDER/WITH/ALL/EXTRACTED/N-GRAM/FILES \