DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
Consult the official C4CorpusTools documentation which contains
Please cite DKPro C4CorpusTools itself as:
Habernal, I., Zayed, O., & Gurevych, I. (2016). C4Corpus: Multilingual Web-size corpus with free license. In N. Calzolari et al. (Eds.), Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 914–922). Portorož, Slovenia: European Language Resources Association (ELRA).
This project is licensed under the Apache Software License (ASL), version 2. Its dependencies may be licensed under different terms.