DKPro Similarity

Word Pair Similarity


DKPro Similarity comes with a ready-made experiment pipeline for evaluating word pair similarity/relatedness. The most common evaluation datasets are already included (see below).
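
To make the evaluation idea concrete: word pair experiments of this kind are commonly scored by correlating a measure's output with the human judgments in a dataset, often with Spearman's rank correlation. The sketch below illustrates that scoring step only. It does not use the DKPro Similarity API, and whether the bundled pipeline reports exactly this statistic is an assumption. The class name `WordPairEvalSketch` and the toy scores in `main` are hypothetical and are not taken from any of the datasets listed on this page.

```java
import java.util.Arrays;
import java.util.Comparator;

public class WordPairEvalSketch {

    /** 1-based ranks; tied values receive the mean of the positions they occupy. */
    public static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Sort indices by the value they point to (the sort is stable).
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> values[idx]));

        double[] result = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            // Extend the group while the next value is tied with the current one.
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                result[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return result;
    }

    /** Pearson correlation of two equally long arrays. */
    public static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int k = 0; k < x.length; k++) {
            double dx = x[k] - meanX;
            double dy = y[k] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Spearman correlation: Pearson correlation computed on the ranks. */
    public static double spearman(double[] gold, double[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("Score arrays must have the same length");
        }
        return pearson(ranks(gold), ranks(predicted));
    }

    public static void main(String[] args) {
        // Hypothetical toy scores for four word pairs (illustration only).
        double[] gold = {3.9, 3.1, 0.4, 1.2};          // human judgments
        double[] predicted = {0.95, 0.60, 0.30, 0.10}; // output of some measure
        System.out.printf("Spearman correlation: %.3f%n", spearman(gold, predicted));
    }
}
```

Rank correlation is a natural fit here because similarity measures and human annotators usually score on different scales, so only the ordering of the pairs is compared.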


The following datasets are included in the experiment:

English


  • Rubenstein & Goodenough (RG65)
    • the classical similarity dataset
    • Rubenstein, H., & Goodenough, J. B. (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10), 627-633.
  • Miller & Charles (MC30)
    • subset of the Rubenstein & Goodenough dataset
    • Miller, G. A., & Charles, W. G. (1991). Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 6(1), 1-28.
  • Finkelstein et al. (WS353)
    • the full Finkelstein dataset, as well as the two parts that were annotated by different groups of annotators
    • Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1), 116-131.
  • Gerz et al. (SimVerb-3500)
    • verb similarity
    • Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A. (2016). SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. Proceedings of EMNLP 2016.
  • Hill et al. (Sim999)
    • semantic similarity
    • Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Preprint published on arXiv. arXiv:1408.3456
  • Li et al. (MWE300)
    • multi-word similarity
  • Yang & Powers (YP130)
    • verb similarity
    • Yang, D., & Powers, D. M. W. (2006). Verb Similarity on the Taxonomy of WordNet. Proceedings of the Third International WordNet Conference (GWC-06) (pp. 121-128). Jeju Island, Korea.
  • Szumlanski et al. (SGS130)
    • relatedness dataset
    • Note: the pair leaves/rake has been changed to leaf/rake, because the plural is not found in WordNet and most other pairs are singular as well.
    • Szumlanski, S., Gomez, F. & Sims, V. K. (2013). A New Set of Norms for Semantic Relatedness Measures. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 890-895). Sofia, Bulgaria.

German


  • Translated and re-annotated data for Miller & Charles (Gur30) and Rubenstein & Goodenough (Gur65), as well as 350 cross POS word pairs (Gur350)
    • Gurevych, I. (2005). Using the Structure of a Conceptual Network in Computing Semantic Relatedness. Proceedings of IJCNLP (pp. 767-778).
  • 222 German word pairs annotated on the sense level (ZG222)
    • Zesch, T., & Gurevych, I. (2006). Automatically Creating Datasets for Measures of Semantic Relatedness. Proceedings of the Workshop on Linguistic Distances (pp. 16-24). Sydney, Australia.

Arabic, Romanian, Spanish

  • Translated and re-annotated data for the Miller & Charles and Finkelstein et al. datasets
    • Hassan, S., & Mihalcea, R. (2009). Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 1192-1201).

Is your dataset missing from the list? Contact us and we will be glad to add it.