This document provides information about the datasets available through the DKPro Core DatasetFactory class.
The factory automatically downloads the datasets and maintains a local cache to avoid redundant downloads. Datasets are validated against checksums stored in the dataset descriptions included with DKPro Core to ensure that the descriptions match the datasets. While we try to maintain a good quality of the descriptions, they may not be perfect. Please use the Edit on GitHub links next to the descriptions in the document below or the issue tracker to report or fix any problems you may notice.
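To make the workflow concrete, the following is a minimal sketch of loading a dataset through the factory. It is written against a recent DKPro Core version; the package names, the cache directory, and the dataset ID `germeval2014-de` are illustrative assumptions, so check the DatasetFactory Javadoc of your DKPro Core version for the authoritative API.

```java
import java.nio.file.Paths;

import org.dkpro.core.api.datasets.Dataset;
import org.dkpro.core.api.datasets.DatasetFactory;

public class LoadDatasetExample
{
    public static void main(String[] args) throws Exception
    {
        // The factory caches downloads under this directory; on later runs,
        // artifacts that are already present and pass checksum validation
        // are not downloaded again.
        DatasetFactory loader = new DatasetFactory(Paths.get("target/datasets"));

        // Dataset ID assumed here for illustration; the IDs are defined by
        // the dataset descriptions shipped with DKPro Core.
        Dataset ds = loader.load("germeval2014-de");

        System.out.println("Loaded: " + ds.getName());
    }
}
```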
… more datasets?
This is not an exhaustive list of the datasets supported by DKPro Core. Any dataset in a format supported by DKPro Core can be used. For more details, refer to the Format Reference. If you are missing any datasets from the list, please tell us by opening an issue in our issue tracker. You can also simply create a new dataset description yourself and submit a pull request. For details on describing new datasets, please refer to the User Guide.
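Since datasets are delivered in their native formats, a loaded dataset is typically wired into a pipeline through the matching DKPro Core format reader. The sketch below assumes the CoNLL 2006 reader and a hypothetical dataset ID `conll2006-nl`; the split accessors follow the API as described in the User Guide, but verify them against your DKPro Core version.

```java
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import java.nio.file.Paths;

import org.apache.uima.collection.CollectionReaderDescription;
import org.dkpro.core.api.datasets.Dataset;
import org.dkpro.core.api.datasets.DatasetFactory;
import org.dkpro.core.io.conll.Conll2006Reader;

public class ReadDatasetExample
{
    public static void main(String[] args) throws Exception
    {
        DatasetFactory loader = new DatasetFactory(Paths.get("target/datasets"));
        Dataset ds = loader.load("conll2006-nl"); // hypothetical ID

        // Point the format reader at the training files of the dataset's
        // default train/dev/test split.
        CollectionReaderDescription reader = createReaderDescription(
                Conll2006Reader.class,
                Conll2006Reader.PARAM_PATTERNS, ds.getDefaultSplit().getTrainingFiles(),
                Conll2006Reader.PARAM_LANGUAGE, ds.getLanguage());

        // ... combine with analysis engines, e.g. via SimplePipeline.runPipeline(reader, ...)
    }
}
```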
Overview
Dataset | Version | Language | Encoding | License |
---|---|---|---|---|
AQMAR Arabic Wikipedia Named Entity Corpus | 1.0 | ar | UTF-8 | CC-BY-SA 3.0 |
Alpino2conll | 20100114 | nl | UTF-8 | unknown |
Ancient Greek and Latin Dependency Treebank (Greek) | 2.1 | el | UTF-8 | CC-BY-SA 3.0 |
Ancient Greek and Latin Dependency Treebank (Latin) | 2.1 | la | ISO-8859-1 | CC-BY-SA 3.0 |
Brown Corpus (TEI XML) | 20081013 | en | ISO-8859-1 | Brown Corpus License (?) |
CoNLL-2000 Chunking Shared Task Data (English) | 20000221 | en | ISO-8859-1 | WSJ Corpus License (?) |
CoNLL-2002 NER Shared Task Data (Dutch) | 20021107 | nl | ISO-8859-1 | unknown |
CoNLL-2002 NER Shared Task Data (Spanish) | 20020522 | es | ISO-8859-1 | unknown |
CoNLL-2006 Shared Task (Portuguese) | 20100302 | pt | UTF-8 | Floresta Sintá(c)tica License |
CoNLL-2009 Shared Task (Catalan) | 2.1 | ca | UTF-8 | GPLv3 (?) |
CoNLL-2009 Shared Task (German) | 1.1 | de | UTF-8 | multiple |
CoNLL-2009 Shared Task (Japanese) | 1.0 | ja | UTF-8 | unknown |
CoNLL-2009 Shared Task (Spanish) | 2.1 | es | UTF-8 | GPLv3 (?) |
Copenhagen Dependency Treebank | 1 | da | UTF-8 | GPLv2 |
Coptic Treebank | 1.0 | cop | UTF-8 | CC-BY 4.0 |
Deep Sequoia (Surface) | 7.0 | fr | UTF-8 | LGPL-LR |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
FinnTreeBank | 3.1 | nfi | UTF-8 | CC-BY 3.0 |
Georgetown University Multilayer Corpus | 5.0.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 2.3.2 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 3.0.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 2.2.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus (UD) | 4.1.0 | en | UTF-8 | multiple |
GermEval 2014 Named Entity Recognition Shared Task | 20200808 | de | UTF-8 | CC-BY 4.0 |
GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5 | 20151025 | en | UTF-8 | Open Data Commons Public Domain Dedication and License (PDDL) |
Hamburg Dependency Treebank | 1.0.1 | de | UTF-8 | multiple |
IULA Spanish LSP Treebank | 1 | es | UTF-8 | CC-BY 3.0 |
JOS - jos100k | 2.0 | sl | UTF-8 | CC-BY-NC 3.0 |
MASC-CONLL | 20080522 | en | ISO-8859-1 | unknown |
NAIST/NTT TED Treebank | 1.0 | en | UTF-8 | CC-BY-NC-SA 3.0 (?) |
Named Entity Model for German, Politics (NEMGP) | 0.1 | de | UTF-8 | CC-BY-SA 3.0 |
Norwegian Dependency Treebank (Norwegian Bokmål) | 1.01 | nb | UTF-8 | CC0 1.0 |
Norwegian Dependency Treebank (Norwegian Nynorsk) | 1.01 | nn | UTF-8 | CC0 1.0 |
Polish Constituency Treebank | 0.5 | pl | UTF-8 | GPL 3.0 |
Polish Dependency Bank | 0.5 | pl | UTF-8 | GPL 3.0 |
SETimes.HR dependency treebank | 1 | hr | UTF-8 | CC-BY-SA 3.0 |
SETimes.HR+ Croatian dependency treebank | 20160613 | hr | UTF-8 | multiple |
Slovene Dependency Treebank 0.1 | 0.1 | sl | UTF-8 | SDT CoNLL-X |
Slovene Dependency Treebank 0.4 | 0.4 | sl | UTF-8 | SDT License |
Stanford POS Tagger Distsim Clusters | 20130608 | en | UTF-8 | unknown |
Talbanken05 DEP | 1.1 | sv | UTF-8 | Talbanken05 License |
Talbanken05 DPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
Talbanken05 FPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
Turin University Treebank | 20101122 | it | UTF-8 | CC-BY-NC-SA 2.5 |
Universal Dependencies 1.4 Treebanks | 1.4 | en | UTF-8 | CC-BY-SA 4.0 |
Uppsala Persian Dependency Treebank | 1.3 | fa | UTF-8 | CC-BY 3.0 |
Datasets
ar
AQMAR Arabic Wikipedia Named Entity Corpus
73,853 tokens in 28 Arabic Wikipedia articles hand-annotated for named entities.
(This description has been partially copied from the corpus website).
Artifact | SHA1 |
---|---|
 | 54977f4065ec070057e99b4b446273e5c8f071d2 |
 | 4fa2c37d7673bb456c6e382566a091545531d85f |
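The SHA1 values above (and in the artifact tables throughout this document) are the checksums against which the factory validates downloads. Purely as an illustration of that check, and not the factory's actual implementation, verifying a downloaded artifact by hand could look like the sketch below; the file name is hypothetical, and HexFormat requires Java 17 or newer.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChecksumCheck
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical local copy of the first AQMAR artifact listed above
        Path artifact = Path.of("AQMAR-corpus.zip");
        String expected = "54977f4065ec070057e99b4b446273e5c8f071d2";

        // Compute the SHA1 of the file and compare it to the expected value
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        String actual = HexFormat.of().formatHex(sha1.digest(Files.readAllBytes(artifact)));

        System.out.println(actual.equals(expected) ? "checksum OK" : "checksum MISMATCH");
    }
}
```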
ca
CoNLL-2009 Shared Task (Catalan)
This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) which was used in the CoNLL-2009 shared task on syntactic and semantic dependencies in multiple languages.
496,672 lexical tokens; training: 390,302; development: 53,015; test: 53,355
(This description has been partially copied from the README file included with the corpus).
The description states that the data was extracted from the AnCora corpus, but it does not say from which version. One might assume it comes from AnCora Catalan dependency 1.0.1. However, this version does not include a license file. The next version is AnCora Catalan 2.0.0, which was released under GPL 3.0. Thus, one might conclude that this data can also be used under these conditions.
Artifact | SHA1 |
---|---|
 | 500cbb81709012cce4d23bfa72d93c320b0b7e6f |
cop
Coptic Treebank
The Coptic Treebank from the Coptic SCRIPTORIUM corpora (http://copticscriptorium.org/).
Artifact | SHA1 |
---|---|
 | 3015e20629818d25c34527d59808e716fd0d8ced |
 | 8c363df27408cb14cb42f3869916c1575fe1625a |
da
Copenhagen Dependency Treebank
Version 1 (the directory "da") was originally called the Danish Dependency Treebank. It was used in the CoNLL 2006 shared task on dependency parsing, but has since been updated with bug fixes and an improved CoNLL conversion which includes a decomposition of the PAROLE part-of-speech tags into the underlying features for number, gender, etc.
(This description has been sourced from the corpus README file).
Artifact | SHA1 |
---|---|
 | 0e5aad9553dc0ed784ec220bb09e22d52fefbb8b |
 | 11313d405abb0f268247a2d5420afa413eb244e7 |
de
CoNLL-2009 Shared Task (German)
This dataset contains the basic information regarding the German corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages" (http://ufal.mff.cuni.cz/conll2009-st/). The data of this distribution is derived from the TIGER Treebank and the SALSA Corpus, converted into the syntactic and semantic dependencies compatible with the CoNLL-2009 shared task.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | null |
GermEval 2014 Named Entity Recognition Shared Task
The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties:
- The data was sampled from German Wikipedia and News Corpora as a collection of citations.
- The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
- The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 9c5bee7a22ab39ad6c19ab29ea9e94ac5874f9c6 |
 | 827edc0232f813fb1344e06924a46e9344ec2f61 |
Hamburg Dependency Treebank
Contains annotated text from the German technical news website www.heise.de.
License | Comment |
---|---|
 | Annotation |
 | Text |
Artifact | SHA1 |
---|---|
 | 7f893542ae74df4c277b98278ad9e3ad6c09e690 |
LICENSE-HZSK-ACA.txt (generated) | 6594e5cd48966db7dac04f2b5ff948eb2bcadf37 |
Named Entity Model for German, Politics (NEMGP)
The Named Entity Model for German, Politics (NEMGP) is a collection of texts from Wikipedia and WikiNews, manually annotated with named entity information.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | f2a1fd54df9232741a3a1892d1ffb0a4d7205991 |
el
Ancient Greek and Latin Dependency Treebank (Greek)
The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University and Tufts University.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 140eee6d2e3e83745f95d3d5274d9e965d898980 |
en
Brown Corpus (TEI XML)
This version derives directly from "A Standard Corpus of Present-Day Edited American English, for use with Digital Computers" by W. N. Francis and H. Kucera (1964), Department of Linguistics, Brown University, Providence, Rhode Island, USA (revised 1971, revised and amplified 1979; see http://www.hit.uib.no/icame/brown/bcm.html), as distributed with NLTK (version 0.9.2).
(This description has been taken from the README file included with the corpus).
We did not find license information included with this dataset. One might assume the TEI version of the Brown Corpus is provided under the same conditions as the original Brown Corpus.
Artifact | SHA1 |
---|---|
LICENSE.txt (generated) | 1e4eadeb358f6f7e6ac9b3677a82f4353bbe91ed |
CoNLL-2000 Chunking Shared Task Data (English)
This is the data from the CoNLL-2000 shared task on text chunking. The data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211,727 tokens) and section 20 as test data (47,377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands. Instead of using the part-of-speech tags of the WSJ corpus, the dataset used tags generated by the Brill tagger.
(This description has been partially copied from the corpus website).
We did not find any license information for this dataset. However, as the texts appear to come from the WSJ corpus, the WSJ corpus license probably applies here.
Artifact | SHA1 |
---|---|
 | 9f31cf936554cebf558d07cce923dca0b7f31864 |
 | dc57527f1f60eeafad03da51235185141152f849 |
English Word Sense and Semantic Role Datasets (WaSR)
English Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | ef7ccf5cb23da63003bdb19d99b15b0ea2821e55 |
English Word Sense and Semantic Role Datasets (WaSR)
English Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | ef7ccf5cb23da63003bdb19d99b15b0ea2821e55 |
 | 0a9c98cbf1fe02841edf52e963444a7e38986577 |
 | 9c0cc79ecab9140f82683d39ed6acb51b148f9f7 |
English Word Sense and Semantic Role Datasets (WaSR)
German Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | b706711ae6fffc94409f80b635595bd45d8c2ece |
Georgetown University Multilayer Corpus
GUM is an open source multilayer corpus of richly annotated web texts from eight text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (mostly Creative Commons licenses), so that new texts can be annotated and published with ease.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews/interviews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use); Wikipedia biographies (Source: https://en.wikipedia.org/wiki/Wikipedia:Copyrights) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons); Fiction texts (Source: http://smallbeerpress.com/creative-commons/) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/); Academic texts (various sources, see LICENSE.txt file) |
Artifact | SHA1 |
---|---|
 | null |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | 471c3a35c2a0e9aee4bbff9a9cf05441fce3ef21 |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
The CPOS column of the files contains an extended POS tagset as it is used by the English TreeTagger models. The POS column contains the regular PTB tagset.
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | b590dbe3f4ae198ca500618a53491f75c221e98b |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | b17e276998ced83153be605d8157afacf1f10fdc |
Georgetown University Multilayer Corpus (UD)
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
The CPOS column of the files contains an extended POS tagset as it is used by the English TreeTagger models. The POS column contains the regular PTB tagset.
Note that this dataset does not include the Reddit data as it can only be obtained by running a Python script which comes with GUM.
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage and Biographies texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow and Fiction texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | 91ded1ba5b6c05fe8e70e42a0a36ee0d20556888 |
GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | b64e54f1877d2f735bdd000c1d7d771e25c7dfdc |
MASC-CONLL
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format. This data set was used in the CoNLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | d9f53a05c659204a3223e901c450fe8ffa5fa9fa |
NAIST/NTT TED Treebank
The NAIST-NTT TED Talk Treebank is a manually annotated treebank of TED talks that was created through a joint research project of NAIST and the NTT CS Lab. All treebank annotation follows the Penn Treebank standard.
(This description has been sourced from the corpus website/README file in the corpus).
The website does not state which version of the CC-BY-NC-SA license applies. One might assume it is version 3.0, which is also used for the TED talks themselves.
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | 89c6495bd64c4b3e699b4c478b47a0c827ea46ea |
Stanford POS Tagger Distsim Clusters
Distributional similarity clusters that can be used, e.g., with the Stanford POS tagger.
The clusters are a feature extracted from a larger body of untagged text, grouping the words into classes of similar distribution.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 3f1352641a46e985c07d0023c0ada7e5be97e527 |
Universal Dependencies 1.4 Treebanks
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 1c41c28b000935ffa6c63b9ff17c48e892c56597 |
es
CoNLL-2002 NER Shared Task Data (Spanish)
This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000. The annotation was carried out by the TALP Research Center (http://www.talp.upc.es/) of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC, http://clic.fil.ub.es/) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 686ef8fed3125a1d8aefe1351ff0e619fe9c34cb |
CoNLL-2009 Shared Task (Spanish)
This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) which was used in the CoNLL-2009 shared task on syntactic and semantic dependencies in multiple languages.
528,440 lexical tokens; training: 427,442; development: 50,368; test: 50,630
(This description has been partially copied from the README file included with the corpus).
The description states that the data was extracted from the AnCora corpus, but it does not say from which version. One might assume it comes from AnCora Spanish dependency 1.0.1. However, this version does not include a license file. The next version is AnCora Spanish 2.0.0, which was released under GPL 3.0. Thus, one might conclude that this data can also be used under these conditions.
Artifact | SHA1 |
---|---|
 | ef36c3369bd05966609b4b13d6bf78884c23ece1 |
IULA Spanish LSP Treebank
IULA Spanish LSP Treebank is a Spanish treebank containing the syntactic annotation of 42,000 sentences (almost 590,000 tokens). It has been developed within the frame of the Metanet4U project (Enhancing the European Linguistic Infrastructure, GA 270893).
The sentences in IULA Spanish LSP Treebank are extracted from the Corpus Tècnic de l’IULA, a collection of written texts from the fields of Law, Economy, Genomics, Medicine, and Environment, as well as a contrastive corpus from the press.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 67e2ce3327501605b7c9f0844cc4982070612222 |
fa
Uppsala Persian Dependency Treebank
Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015, Chapter 5, pp. 97-146) is a dependency-based syntactically annotated corpus.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 336ba453635ff079ab2ae9a5349247efa11acdf8 |
fr
Deep Sequoia (Surface)
Deep-sequoia is a corpus of French sentences annotated with both surface and deep syntactic dependency structures.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
LICENSE.txt (generated) | 9f53475f809ef1032a92adedf262226da1615051 |
hr
SETimes.HR dependency treebank
The corpus is based on the Croatian part of the SETimes parallel corpus.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 0faebfe55136692f83dcddd4cf659a8b59655d62 |
SETimes.HR+ Croatian dependency treebank
The treebank is a result of an effort in providing free-culture language resources for Croatian by the NLP group at FF Zagreb.
(This description has been sourced from the corpus website).
License | Comment |
---|---|
 | SETimes.HR dataset (set.hr.conll) |
 | web.hr.conll and news.hr.conll datasets |
Artifact | SHA1 |
---|---|
 | 9c5bee7a22ab39ad6c19ab29ea9e94ac5874f9c6 |
 | 54cc324681563e5ede8088f020f0b21e35d37fb9 |
 | a52d13cfa91589c0d93fe0a90333a4f0e997b7cf |
it
Turin University Treebank
TUT is a morpho-syntactically annotated collection of Italian sentences, which includes texts from different text genres and domains, released in several annotation formats.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 3d9b22d8ebf533aa1d6d39d417316c30900b9a0e |
 | 2278e6e770ddc4a8eea5e045c4a77a5df2ae0977 |
 | 9cf9c0a9c652b3df6564d1fa0ca97c2d7905faa3 |
 | 72a6e55627481ff99930b59714cfc0909ccf60e1 |
 | a421f488859324e3e12687b9a3067652248eb8df |
ja
CoNLL-2009 Shared Task (Japanese)
This file contains the basic information regarding the Japanese corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages". The current version corresponds to the release of the training data sets.
The data of this distribution uses portions of the Kyoto University Text Corpus 4.0. The Kyoto University Text Corpus is freely available at http://nlp.kuee.kyoto-u.ac.jp/nl-resource/corpus-e.html.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 8c96a1eda2527a9ba1bf37dd4125cc6af11e7dd4 |
la
Ancient Greek and Latin Dependency Treebank (Latin)
The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University and Tufts University.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 140eee6d2e3e83745f95d3d5274d9e965d898980 |
nb
Norwegian Dependency Treebank (Norwegian Bokmål)
The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referansegrammatikk ('Norwegian Reference Grammar').
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | ae02a3ca7e000d6cc98f07d3a8aa017f38900499 |
 | 97935c225f98119aa94d53f37aa64762cba332f3 |
nfi
FinnTreeBank
The FinnTreeBank project is creating a treebank and a parsebank for Finnish. This work is licensed under a Creative Commons Attribution 3.0 license.
The first and second versions of the treebank are annotated by hand and based on 17,000 model sentences in the Large Grammar of Finnish (VISK - Iso suomen kielioppi). Brief samples of text from other sources, e.g. news items and literature, are also available in the second version. A parsebank for Finnish based on the Europarl and JRC-Acquis corpora will be available in June 2012.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 7c58064bf9995980cea08e84035c0414adc54f06 |
nl
Alpino2conll
Training and test datasets for Dutch in retagged CoNLL format. The data was converted from Alpino XML into CoNLL format based on an adapted version of Erwin Marsi’s conversion software, but PoS tags were replaced by automatically assigned Alpino tags.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | f5e1517383f4489c8cb0c75ad202ac57c21874fc |
 | c055154ae56dfa8c29d304ed852af90aedf00a5d |
CoNLL-2002 NER Shared Task Data (Dutch)
This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data was annotated as a part of the Atranos project (http://atranos.esat.kuleuven.ac.be/) at the University of Antwerp.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 686ef8fed3125a1d8aefe1351ff0e619fe9c34cb |
nn
Norwegian Dependency Treebank (Norwegian Nynorsk)
The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referansegrammatikk ('Norwegian Reference Grammar').
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | ae02a3ca7e000d6cc98f07d3a8aa017f38900499 |
 | 97935c225f98119aa94d53f37aa64762cba332f3 |
pl
Polish Constituency Treebank
The Polish constituency treebank (Składnica frazowa), version 0.5. Trees in the Tiger XML format containing only parse trees selected by dendrologists (one interpretation per sentence).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 8b0cb355ed76e07cc7c876fec58341c2940cfee7 |
 | c8977d436d218b726d657224305bced178071dcf |
Polish Dependency Bank
The dependency treebank (Składnica zależnościowa), version 0.5, is a result of an automatic conversion of manually disambiguated constituency trees into dependency structures.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 8b0cb355ed76e07cc7c876fec58341c2940cfee7 |
 | 187424608e91b271957dabcf140a7274f1c88d63 |
pt
CoNLL-2006 Shared Task (Portuguese)
This is the Portuguese part of the CoNLL-X Shared Task. The data was derived from the Floresta Sintá(c)tica Bosque 7.3 by Sabine Buchholz.
(This description has been partially sourced from the README file included with the corpus).
We did not find license information for this dataset. One might assume the license of this dataset is equivalent to that of the Floresta Sintá(c)tica from which it was derived.
Artifact | SHA1 |
---|---|
 | 10da89fed0ecb888c8fc7fad350b1a11bb9050d7 |
 | 29e630e207c74a42e0d2999193aa25d73f262920 |
 | fabcfbd73a531e21786af9b8233f1a4aa78dfddb |
 | e399cdc1203df1ff43816f3f934223cb9a625992 |
sl
JOS - jos100k
The jos100k corpus contains 100,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a reference annotated corpus of Slovene: its manually validated annotations cover three levels of linguistic description.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 23e82cd9f77862b5a26bf268aba9822784a9ab6a |
 | 9f330ffd102cc5d5734fdaecbbf67751c84a1339 |
Slovene Dependency Treebank 0.1
The Slovene Dependency Treebank project built a small syntactically annotated corpus of Slovene texts. The corpus was annotated with dependency analyses, taking the Prague Dependency Treebank as the model. The Slovene Dependency Treebank is annotated with Analytic Tree Structures and contains a part of the morphosyntactically annotated Slovene component of the parallel MULTEXT-East corpus, i.e. the first third of the Slovene translation of the novel "1984" by G. Orwell, containing 30,000 words.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 2bd85ad77c35d0c305a6afb7ee092676d5d22a35 |
Slovene Dependency Treebank 0.4
This is the preliminary release of the Slovene Dependency Treebank, SDT V0.4, which contains the Prague Dependency Treebank-like annotation of the first part of the Slovene translation of Orwell's "1984", taken from the MULTEXT-East parallel corpus, V3.0 (cf. http://ufal.mff.cuni.cz/pdt/, http://nl.ijs.si/ME/V3/, http://nl.ijs.si/ME/V3/doc/index.html#mtev3-doc-div2-id2305296).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 9d047377eb96aa896461544cd1117b11b812809f |
 | 16cfa8a20ebf8ed0e4f13c0119c7aa76a2498b1f |
sv
Talbanken05 DEP
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
Dep: Dependency structure annotation (CoNLL-X shared task format in UTF-8).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |
Talbanken05 DPS
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
DPS: Deepened phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |
Talbanken05 FPS
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
FPS: Flat phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |