DKPro C4CorpusTools - Welcome

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

  • DKPro C4CorpusTools (or C4CorpusTools) refers to the project source code
  • C4Corpus refers to the preprocessed CommonCrawl data set (C4 = Creative Commons from Common Crawl)

Consult the official C4CorpusTools documentation, which contains:

  • C4Corpus User’s Guide
    • How to access C4Corpus on S3
    • Running boilerplate removal outside Hadoop
    • Examples of simple search in C4Corpus
  • C4Corpus Developer’s Guide
    • How to run the full processing pipeline on CommonCrawl
  • Corpus statistics reported in the LREC article

How to cite

Please cite DKPro C4CorpusTools itself as:

Habernal, I., Zayed, O., & Gurevych, I. (2016). C4Corpus: Multilingual Web-size corpus with free license. In N. Calzolari et al. (Eds.), Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 914–922). Portorož, Slovenia: European Language Resources Association (ELRA).

License

This project is licensed under the Apache Software License (ASL), version 2. Its dependencies may be covered by different licenses.

About DKPro C4CorpusTools

  • Contact person: Ivan Habernal, habernal@ukp.informatik.tu-darmstadt.de
  • UKP Lab: http://www.ukp.tu-darmstadt.de/
  • TU Darmstadt: http://www.tu-darmstadt.de/