DKPro Similarity is an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. DKPro Similarity comprises a wide variety of measures, ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. To promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out of the box and which future systems can build upon.
Check out our Getting started guide. You may also want to have a closer look at our ACL 2013 system demonstration paper which summarizes the architecture, the available text similarity measures, and the existing experimental setups.
The project contains a ready-made experiment with the most common evaluation datasets for word pair similarity. Learn more …
Pipeline and datasets for word choice / TOEFL Synonym Question experiments. See the Wiki page in the ACL Wiki on the topic.
Pipelines and datasets for RTE 1-5 experiments. See the [http://aclweb.org/aclwiki/index.php?title=Recognizing_Textual_Entailment Wiki page in the ACL Wiki] on the topic.
For all users interested in the Shared Task of the *SEM 2013 conference, we describe here one of the task's official baseline systems, which roughly corresponds to the best-ranked system in the SemEval-2012 exercises.
If you plan to refer to DKPro Similarity in your publications, please cite
Daniel Bär, Torsten Zesch, and Iryna Gurevych. DKPro Similarity: An Open Source Framework for Text Similarity, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121-126, August 2013, Sofia, Bulgaria. (pdf) (bib)
In this example, we want to compute the similarity between two given texts which are already lemmatized. We assume that lemmatization has already been done, e.g. with a DKPro Core pipeline. As the similarity measure, we choose a popular word n-gram model by Lyon et al. (2004). Moreover, make sure that both the *.algorithms.api-asl and the *.algorithms.lexical-asl dependency modules have been added to your pom.xml, as described in the Getting Started Guide.
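To make the idea concrete, here is a minimal, self-contained sketch of word n-gram containment in the spirit of Lyon et al. (2004), operating on lists of lemmas; the class and method names are our own for illustration and are not the actual DKPro Similarity API.

```java
import java.util.*;

// Illustrative sketch only: word n-gram containment in the spirit of
// Lyon et al. (2004). Not the actual DKPro Similarity measure class.
public class NGramContainment {

    // Build the set of word n-grams from a list of lemmas.
    static Set<String> ngrams(List<String> lemmas, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= lemmas.size(); i++) {
            result.add(String.join(" ", lemmas.subList(i, i + n)));
        }
        return result;
    }

    // Containment: fraction of the first text's n-grams also found in the second.
    static double similarity(List<String> lemmas1, List<String> lemmas2, int n) {
        Set<String> a = ngrams(lemmas1, n);
        Set<String> b = ngrams(lemmas2, n);
        if (a.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / a.size();
    }

    public static void main(String[] args) {
        List<String> t1 = Arrays.asList("the", "quick", "brown", "fox", "jump");
        List<String> t2 = Arrays.asList("the", "quick", "brown", "dog", "jump");
        // 2 of 4 bigrams of t1 occur in t2 ("the quick", "quick brown")
        System.out.println(similarity(t1, t2, 2)); // prints 0.5
    }
}
```

In DKPro Similarity itself, the measure is obtained from the lexical module added above and applied to the lemma lists produced by the preprocessing pipeline.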
The algorithms collected in this framework implement one of the following interfaces:

getSimilarity(Collection<String>, Collection<String>) - Similarity between two collections of strings representing whole documents.
getSimilarity(String[], String[]) - Similarity between two arrays of strings representing whole documents.
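The following sketch shows what implementing such an interface looks like. The interface and class names are assumptions made up for this example (the real DKPro Similarity interfaces differ in name and detail), and the toy implementation is a simple Jaccard word overlap, used here only to illustrate the contract.

```java
import java.util.*;

// Hypothetical interface modeled on the signatures described above;
// not the real DKPro Similarity API.
interface SimilarityMeasure {
    double getSimilarity(Collection<String> terms1, Collection<String> terms2);
    double getSimilarity(String[] terms1, String[] terms2);
}

// Toy measure: Jaccard word overlap, just to show how a measure plugs in.
class WordOverlapMeasure implements SimilarityMeasure {

    public double getSimilarity(Collection<String> terms1, Collection<String> terms2) {
        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        if (union.isEmpty()) {
            return 1.0; // two empty documents are trivially identical
        }
        Set<String> common = new HashSet<>(terms1);
        common.retainAll(terms2);
        return (double) common.size() / union.size();
    }

    // The array variant simply delegates to the collection variant.
    public double getSimilarity(String[] terms1, String[] terms2) {
        return getSimilarity(Arrays.asList(terms1), Arrays.asList(terms2));
    }
}
```

Because all measures share these entry points, client code can swap one measure for another without changing the surrounding pipeline.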
|| Module || Contents ||
|| algorithms.lexical || GreedyStringTiling, Jaro, Levenshtein, LongestCommonSubsequence, MongeElkan, NGramBased, … ||
|| algorithms.lsr || Measures based on lexical-semantic resources such as WordNet or Wikipedia, e.g. GlossOverlap, JiangConrath, LeacockChodorow, Lin, Resnik, WuPalmerComparator ||
|| algorithms.style || FunctionWordFrequency, MTLD, TypeTokenRatio ||
|| algorithms.vsm || Vector space models, e.g. ESA ||
|| algorithms.wikipedia || Special Wikipedia measures like WikipediaLinkMeasure, or measures based on the CategoryGraph ||
|| dkpro.core || UIMA resources for the core algorithms ||
|| dkpro.io || UIMA readers for common similarity datasets: Meter, RTE, SemEval, WebisCPC11 ||
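To give a flavor of the measures in the lexical module, here is a self-contained sketch of a normalized Levenshtein similarity, i.e. edit distance rescaled to [0,1]; this is our own illustrative code, not the DKPro Similarity class of the same name.

```java
// Illustrative sketch of a normalized Levenshtein similarity;
// not the actual DKPro Similarity implementation.
public class LevenshteinSketch {

    // Classic dynamic-programming edit distance using two rolling rows.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Rescale the edit distance to a similarity score in [0,1].
    static double similarity(String s, String t) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(s, t) / max;
    }
}
```

Measures like this operate directly on surface strings, in contrast to the lsr and vsm modules, which require external resources.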