This document describes the datasets available through the DKPro Core DatasetFactory class.

The factory downloads datasets automatically and keeps a local cache to avoid redundant downloads. Each download is validated against the checksums stored in the dataset descriptions included with DKPro Core, which ensures that a description matches the data it describes. While we try to keep the descriptions accurate, they may not be perfect.[1] [2] To report or fix a problem, please use the Edit on GitHub link next to the affected description below or the issue tracker.
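
As a rough illustration of the checksum validation described above, the following Python sketch computes the SHA-1 digest of a local file and compares it with an expected value. It is not DKPro Core's implementation; the function names `sha1_of_file` and `is_valid_artifact` are made up for this example.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_artifact(path, expected_sha1):
    """True if the file exists and its digest matches the expected one."""
    path = Path(path)
    return path.is_file() and sha1_of_file(path) == expected_sha1.lower()
```

The digest da39a3ee5e6b4b0d3255bfef95601890afd80709, which is listed for several LICENSE.txt artifacts in the tables below, is the SHA-1 of empty input, so `sha1_of_file` returns it for any zero-length file.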

…​ more datasets?

This is not an exhaustive list of the datasets supported by DKPro Core. Any dataset in a format supported by DKPro Core can be used. For more details, refer to the Format Reference. If you are missing any datasets from the list, please tell us by opening an issue in our issue tracker. You can also simply create a new dataset description yourself and submit a pull request. For details on describing new datasets, please refer to the User Guide.

Overview

Table 1. Datasets (46)
| Dataset | Version | Language | Encoding | License [2] |
|---|---|---|---|---|
| AQMAR Arabic Wikipedia Named Entity Corpus | 1.0 | ar | UTF-8 | CC-BY-SA 3.0 |
| Alpino2conll | 20100114 | nl | UTF-8 | unknown |
| Ancient Greek and Latin Dependency Treebank (Greek) | 2.1 | el | UTF-8 | CC-BY-SA 3.0 |
| Ancient Greek and Latin Dependency Treebank (Latin) | 2.1 | la | ISO-8859-1 | CC-BY-SA 3.0 |
| Brown Corpus (TEI XML) | 20081013 | en | ISO-8859-1 | Brown Corpus License (?) |
| CoNLL-2000 Chunking Shared Task Data (English) | 20000221 | en | ISO-8859-1 | WSJ Corpus License (?) |
| CoNLL-2002 NER Shared Task Data (Dutch) | 20021107 | nl | ISO-8859-1 | unknown |
| CoNLL-2002 NER Shared Task Data (Spanish) | 20020522 | es | ISO-8859-1 | unknown |
| CoNLL-2006 Shared Task (Portuguese) | 20100302 | pt | UTF-8 | Floresta Sintá(c)tica License |
| CoNLL-2009 Shared Task (Catalan) | 2.1 | ca | UTF-8 | GPLv3 (?) |
| CoNLL-2009 Shared Task (German) | 1.1 | de | UTF-8 | multiple |
| CoNLL-2009 Shared Task (Japanese) | 1.0 | ja | UTF-8 | unknown |
| CoNLL-2009 Shared Task (Spanish) | 2.1 | es | UTF-8 | GPLv3 (?) |
| Copenhagen Dependency Treebank | 1 | da | UTF-8 | GPLv2 |
| Coptic Treebank | 1.0 | cop | UTF-8 | CC-BY 4.0 |
| Deep Sequoia (Surface) | 7.0 | fr | UTF-8 | LGPL-LR |
| English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
| English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
| English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
| FinnTreeBank | 3.1 | fi | UTF-8 | CC-BY 3.0 |
| Georgetown University Multilayer Corpus | 2.2.0 | en | UTF-8 | multiple |
| Georgetown University Multilayer Corpus | 2.3.2 | en | UTF-8 | multiple |
| Georgetown University Multilayer Corpus | 3.0.0 | en | UTF-8 | multiple |
| GermEval 2014 Named Entity Recognition Shared Task | 20140911 | de | UTF-8 | CC-BY 4.0 |
| GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5 | 20151025 | en | UTF-8 | Open Data Commons Public Domain Dedication and License (PDDL) |
| Hamburg Dependency Treebank | 1.0.1 | de | UTF-8 | multiple |
| IULA Spanish LSP Treebank | 1 | es | UTF-8 | CC-BY 3.0 |
| JOS - jos100k | 2.0 | sl | UTF-8 | CC-BY-NC 3.0 |
| MASC-CONLL | 20080522 | en | ISO-8859-1 | unknown |
| NAIST/NTT TED Treebank | 1.0 | en | UTF-8 | CC-BY-NC-SA 3.0 (?) |
| Named Entity Model for German, Politics (NEMGP) | 0.1 | de | UTF-8 | CC-BY-SA 3.0 |
| Norwegian Dependency Treebank (Norwegian Bokmål) | 1.01 | nb | UTF-8 | CC0 1.0 |
| Norwegian Dependency Treebank (Norwegian Nynorsk) | 1.01 | nn | UTF-8 | CC0 1.0 |
| Polish Constituency Treebank | 0.5 | pl | UTF-8 | GPL 3.0 |
| Polish Dependency Bank | 0.5 | pl | UTF-8 | GPL 3.0 |
| SETimes.HR dependency treebank | 1 | hr | UTF-8 | CC-BY-SA 3.0 |
| SETimes.HR+ Croatian dependency treebank | 20160613 | hr | UTF-8 | multiple |
| Slovene Dependency Treebank 0.1 | 0.1 | sl | UTF-8 | SDT CoNLL-X |
| Slovene Dependency Treebank 0.4 | 0.4 | sl | UTF-8 | SDT License |
| Stanford POS Tagger Distsim Clusters | 20130608 | en | UTF-8 | unknown |
| Talbanken05 DEP | 1.1 | sv | UTF-8 | Talbanken05 License |
| Talbanken05 DPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
| Talbanken05 FPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
| Turin University Treebank | 20101122 | it | UTF-8 | CC-BY-NC-SA 2.5 |
| Universal Dependencies 1.4 Treebanks | 1.4 | en | UTF-8 | CC-BY-SA 4.0 |
| Uppsala Persian Dependency Treebank | 1.3 | fa | UTF-8 | CC-BY 3.0 |
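
The Encoding value matters when the files are read directly rather than through a DKPro Core reader. The sketch below uses a made-up helper, `decode_artifact`, and a made-up sample string to show that bytes written as ISO-8859-1 decode cleanly with the declared encoding but are rejected when read as UTF-8.

```python
def decode_artifact(raw_bytes, declared_encoding):
    """Decode raw file content using the encoding declared for the dataset."""
    return raw_bytes.decode(declared_encoding)


# Hypothetical sample: a word with a non-ASCII character, stored as ISO-8859-1.
sample = "Sintáctica".encode("iso-8859-1")

print(decode_artifact(sample, "iso-8859-1"))

try:
    decode_artifact(sample, "utf-8")
except UnicodeDecodeError as error:
    # The single byte used for "á" in ISO-8859-1 is not valid UTF-8 here.
    print("not valid UTF-8:", error.reason)
```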

Datasets

ar

AQMAR Arabic Wikipedia Named Entity Corpus

ID

aqmar-ar-1.0

Version

1.0

Media type

text/x.org.dkpro.conll-2000

Language

ar

Encoding

UTF-8

URL

http://www.cs.cmu.edu/~ark/ArabicNER/

Attribution[1]

By Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah Smith as part of the AQMAR project.

License[2]

CC-BY-SA 3.0

Description

73,853 tokens in 28 Arabic Wikipedia articles hand-annotated for named entities.

(This description has been partially copied from the corpus website).

Table 2. Artifacts for AQMAR Arabic Wikipedia Named Entity Corpus
Artifact SHA1

LICENSE.txt

43f4082fb8432ad86d927bdff687f9406db43d0f

data.zip

4fa2c37d7673bb456c6e382566a091545531d85f

ca

CoNLL-2009 Shared Task (Catalan)

ID

conll2009-ca

Version

2.1

Media type

text/x.org.dkpro.conll-2009

Language

ca

Encoding

UTF-8

URL

http://ufal.mff.cuni.cz/conll2009-st/

Attribution[1]

Lluís Màrquez, Ma. Antònia Martí, Mariona Taulé, Manuel Bertran, Oriol Borrega

License[2]

GPLv3 (?)

Description

This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) that was used in the CoNLL-2009 shared task on extracting syntactic and semantic dependencies in multiple languages.

496,672 lexical tokens; training: 390,302; development: 53,015; test: 53,355

(This description has been partially copied from the README file included with the corpus).

The description states that the data was extracted from the AnCora corpus but does not say from which version. One might assume it comes from AnCora Catalan dependency 1.0.1; however, that version does not include a license file. The next version, AnCora Catalan 2.0.0, was released under GPL 3.0, so one might conclude that this data can also be used under those conditions.

Table 3. Artifacts for CoNLL-2009 Shared Task (Catalan)
Artifact SHA1

data.zip

500cbb81709012cce4d23bfa72d93c320b0b7e6f
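
The token counts given in the description of this dataset add up to the stated total. The short sketch below checks that sum and derives each split's share of the corpus; the helper name `split_shares` is invented for this example.

```python
def split_shares(counts):
    """Return each split's share of the total token count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Token counts as stated in the dataset description.
catalan = {"training": 390_302, "development": 53_015, "test": 53_355}

# The three splits account for all 496,672 lexical tokens.
assert sum(catalan.values()) == 496_672

for name, share in split_shares(catalan).items():
    print(f"{name}: {share:.1%}")
```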

cop

Coptic Treebank

ID

coptictb-conll-cop-1.0

Version

1.0

Media type

text/x.org.dkpro.conll-2006

Language

cop

Encoding

UTF-8

URL

http://copticscriptorium.org

Attribution[1]

Amir Zeldes

License[2]

CC-BY 4.0

Description

The Coptic Treebank from the Coptic SCRIPTORIUM corpora (http://copticscriptorium.org/).

Table 4. Artifacts for Coptic Treebank
Artifact SHA1

LICENSE.txt

fc0bdc662ce901ac2c631f9574c9aa8b54ebf8c7

coptic.treebank.conll10

8c363df27408cb14cb42f3869916c1575fe1625a

da

Copenhagen Dependency Treebank

ID

cdt-conll-da-1

Version

1

Media type

text/x.org.dkpro.conll-2006

Language

da

Encoding

UTF-8

URL

http://mbkromann.github.io/copenhagen-dependency-treebank

Attribution[1]

Matthias Trautner Kromann, 2003. The Danish Dependency Treebank and the DTAG treebank tool. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), 14-15 November, Växjö. pp. 217-220. (PDF)

License[2]

GPLv2

Description

Version 1 (the directory "da") was originally called the Danish Dependency Treebank. It was used in the CoNLL 2006 shared task on dependency parsing, but has since been updated with bug fixes and an improved CoNLL conversion (which includes a decomposition of the PAROLE part-of-speech tags into the underlying features for number, gender, etc.).

(This description has been sourced from the corpus README file).

Table 5. Artifacts for Copenhagen Dependency Treebank
Artifact SHA1

LICENSE.txt

4cc77b90af91e615a64ae04893fdffa7939db84c

data.zip

11313d405abb0f268247a2d5420afa413eb244e7

de

CoNLL-2009 Shared Task (German)

ID

conll2009-de

Version

1.1

Media type

text/x.org.dkpro.conll-2009

Language

de

Encoding

UTF-8

URL

http://ufal.mff.cuni.cz/conll2009-st/

Attribution[1]

Yi Zhang, Sebastian Pado

License[2]

TIGER Corpus License, SALSA Corpus License

Description

This dataset contains the basic information regarding the German corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages" (http://ufal.mff.cuni.cz/conll2009-st/). The data of this distribution is derived from the TIGER Treebank and the SALSA Corpus, converted into the syntactic and semantic dependencies compatible with the CoNLL-2009 shared task.

(This description has been sourced from the README file included with the corpus).

Table 6. Artifacts for CoNLL-2009 Shared Task (German)
Artifact SHA1

data.zip

ad4c03c3c4e4668c8beb34c399e71f539e6d633d

GermEval 2014 Named Entity Recognition Shared Task

ID

germeval2014-de

Version

20140911

Media type

text/x.org.dkpro.germeval-2014

Language

de

Encoding

UTF-8

URL

https://sites.google.com/site/germeval2014ner/

Attribution[1]

D. Benikova, C. Biemann, M. Reznicek. NoSta-D Named Entity Annotation for German: Guidelines and Dataset. Proceedings of LREC 2014, Reykjavik, Iceland

License[2]

CC-BY 4.0

Description

The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties:

  • The data was sampled from German Wikipedia and News Corpora as a collection of citations.

  • The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.

  • The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].
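
To make the embedded annotation in the last item concrete, here is a small Python sketch that turns the bracket notation used above into a word list and a set of labelled spans. The bracket notation is only how this description writes the example; the function name `parse_nested` and the span representation are assumptions made for illustration and say nothing about how the dataset files themselves are laid out.

```python
def parse_nested(text):
    """Parse '[LABEL word ... [LABEL word]]' into words and labelled spans.

    Spans are (label, start, end) tuples over word positions, end exclusive.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    words, spans, open_spans = [], [], []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "[":
            # The token right after an opening bracket is the entity label.
            open_spans.append((tokens[i + 1], len(words)))
            i += 2
        elif token == "]":
            label, start = open_spans.pop()
            spans.append((label, start, len(words)))
            i += 1
        else:
            words.append(token)
            i += 1
    return words, spans


words, spans = parse_nested("[ORG FC Kickers [LOC Darmstadt]]")
print(words)
print(spans)
```

For the example string this yields the words FC, Kickers, and Darmstadt, with a LOC span over the last word nested inside an ORG span over all three.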

(This description has been sourced from the dataset website).

Table 7. Artifacts for GermEval 2014 Named Entity Recognition Shared Task
Artifact SHA1

LICENSE.txt

1167f0e28fe2db01e38e883aaf1e749fb09f9ceb

NER-de-dev.tsv

70aba5d247f51ec22e0bcc671c7fb325e4ff4277

NER-de-test.tsv

214deaf091e01567af2e958aac87863bf685342a

NER-de-train.tsv

7644cb09676050c0a2836e06fa0aeb8509b9e1cb

Hamburg Dependency Treebank

ID

hdt-de-conll-1.0.1

Version

1.0.1

Media type

text/x.org.dkpro.conll-2006

Language

de

Encoding

UTF-8

URL

https://corpora.uni-hamburg.de/drupal/de/islandora/object/treebank:hdt

Attribution[1]

Wolfgang Menzel

License[2]

CC-BY-SA 4.0, HZSK-ACA

Description

Contains annotated text from the German technical news website www.heise.de.

Table 8. License comments for Hamburg Dependency Treebank
License Comment

CC-BY-SA 4.0

Annotation

HZSK-ACA

Text

Table 9. Artifacts for Hamburg Dependency Treebank
Artifact SHA1

LICENSE-CC-BY-SA.txt

8f551a766d1f4556d1a2596365c0fc2191366efa

LICENSE-HZSK-ACA.txt

generated

hamburgDepTreebank.tar.xz

6594e5cd48966db7dac04f2b5ff948eb2bcadf37

Named Entity Model for German, Politics (NEMGP)

ID

nemgp-de-0.1

Version

0.1

Media type

unknown

Language

de

Encoding

UTF-8

URL

http://www.thomas-zastrow.de/nlp/

Attribution[1]

Thomas Zastrow

License[2]

CC-BY-SA 3.0

Description

The Named Entity Model for German, Politics (NEMGP) is a collection of texts from Wikipedia and WikiNews, manually annotated with named entity information.

(This description has been sourced from the dataset website).

Table 10. Artifacts for Named Entity Model for German, Politics (NEMGP)
Artifact SHA1

LICENSE.txt

fb41626a3005c2b6e14b8b3f5d9d0b19b5faaa51

data.zip

f2a1fd54df9232741a3a1892d1ffb0a4d7205991

el

Ancient Greek and Latin Dependency Treebank (Greek)

ID

perseus-el-2.1

Version

2.1

Media type

unknown

Language

el

Encoding

UTF-8

URL

https://perseusdl.github.io/treebank_data/

Attribution[1]

Giuseppe G. A. Celano, Gregory Crane, Bridget Almas et al.

License[2]

CC-BY-SA 3.0

Description

The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University-Tufts University.

(This description has been sourced from the dataset website).

Table 11. Artifacts for Ancient Greek and Latin Dependency Treebank (Greek)
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

perseus.zip

140eee6d2e3e83745f95d3d5274d9e965d898980

en

Brown Corpus (TEI XML)

ID

brown-en-teixml

Version

20081013

Media type

application/tei+xml

Language

en

Encoding

ISO-8859-1

URL

http://www.nltk.org/nltk_data/

Attribution[1]

W. N. Francis and H. Kucera. Converted to TEI by Lou Burnard.

License[2]

Brown Corpus License (?)

Description

This version derives directly from

"A Standard Corpus of Present-Day Edited American English, for use with Digital Computers." by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html

as distributed with NLTK (version 0.9.2)

(This description has been taken from the README file included with the corpus).

We did not find license information included with this dataset. One might assume the TEI version of the Brown Corpus is provided under the same conditions as the original Brown Corpus.

Table 12. Artifacts for Brown Corpus (TEI XML)
Artifact SHA1

LICENSE.txt

generated

brown.zip

1e4eadeb358f6f7e6ac9b3677a82f4353bbe91ed

CoNLL-2000 Chunking Shared Task Data (English)

ID

conll2000-en

Version

20000221

Media type

text/x.org.dkpro.conll-2000

Language

en

Encoding

ISO-8859-1

URL

http://www.cnts.ua.ac.be/conll2000/chunking/

Attribution[1]

unknown

License[2]

WSJ Corpus License (?)

Description

This is the data from the CoNLL-2000 shared task on text chunking. The data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands. Instead of using the part-of-speech tags of the WSJ corpus, the data set used tags generated by the Brill tagger.

(This description has been partially copied from the corpus website).

We did not find any license information for this dataset. However, as the texts appear to come from the WSJ corpus, the WSJ corpus license probably applies here.

Table 13. Artifacts for CoNLL-2000 Chunking Shared Task Data (English)
Artifact SHA1

train.txt.gz

9f31cf936554cebf558d07cce923dca0b7f31864

test.txt.gz

dc57527f1f60eeafad03da51235185141152f849
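
The two artifacts above are gzip-compressed text files. In the CoNLL-2000 format each line holds a word, its part-of-speech tag, and its chunk tag separated by spaces, and a blank line ends a sentence. A minimal reader might look like the sketch below. The function name and the sample sentences are invented for illustration, and the sample is compressed in memory so that the example runs without the real data.

```python
import gzip
import io


def read_conll2000(stream):
    """Yield sentences as lists of (word, pos, chunk) tuples."""
    sentence = []
    for line in stream:
        line = line.strip()
        if not line:
            # A blank line marks the end of the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, chunk = line.split()
        sentence.append((word, pos, chunk))
    if sentence:
        yield sentence


# Hypothetical two-sentence sample, encoded as the dataset declares.
sample = "He PRP B-NP\nreckons VBZ B-VP\n\nIt PRP B-NP\nfell VBD B-VP\n"
compressed = gzip.compress(sample.encode("iso-8859-1"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="iso-8859-1") as stream:
    sentences = list(read_conll2000(stream))

print(len(sentences), "sentences")
```

To read the real data, the in-memory buffer would be replaced by the path to train.txt.gz or test.txt.gz.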

English Word Sense and Semantic Role Datasets (WaSR)

ID

wasr-de-1.00

Version

1.0

Media type

text/x.org.dkpro.conll-2009

Language

en

Encoding

UTF-8

URL

https://www.ukp.tu-darmstadt.de/data/semantic-role-resources/knowledge-based-semantic-role-labeling/

Attribution[1]

Silvana Hartmann, Judith Eckle-Kohler, and Iryna Gurevych. Generating Training Data for Semantic Role Labeling based on Label Transfer from Linked Lexical Resources. In: Transactions of the Association for Computational Linguistics, vol. 4, no. 1, p. (to appear), 2016. (PDF)

License[2]

CC-BY-NC-ND 3.0

Description

German Frame and Role Annotations.

(This description has been sourced from the README file included with the corpus).

Table 14. Artifacts for English Word Sense and Semantic Role Datasets (WaSR)
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

data.tar.bz2

b706711ae6fffc94409f80b635595bd45d8c2ece

English Word Sense and Semantic Role Datasets (WaSR)

ID

wasr-l-en-1.00

Version

1.0

Media type

text/x.org.dkpro.conll-2009

Language

en

Encoding

UTF-8

URL

https://www.ukp.tu-darmstadt.de/data/semantic-role-resources/knowledge-based-semantic-role-labeling/

Attribution[1]

Silvana Hartmann, Judith Eckle-Kohler, and Iryna Gurevych. Generating Training Data for Semantic Role Labeling based on Label Transfer from Linked Lexical Resources. In: Transactions of the Association for Computational Linguistics, vol. 4, no. 1, p. (to appear), 2016. (PDF)

License[2]

CC-BY-NC-ND 3.0

Description

English Frame and Role Annotations.

(This description has been sourced from the README file included with the corpus).

Table 15. Artifacts for English Word Sense and Semantic Role Datasets (WaSR)
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

part1.tar.bz2

ef7ccf5cb23da63003bdb19d99b15b0ea2821e55

English Word Sense and Semantic Role Datasets (WaSR)

ID

wasr-xl-en-1.00

Version

1.0

Media type

text/x.org.dkpro.conll-2009

Language

en

Encoding

UTF-8

URL

https://www.ukp.tu-darmstadt.de/data/semantic-role-resources/knowledge-based-semantic-role-labeling/

Attribution[1]

Silvana Hartmann, Judith Eckle-Kohler, and Iryna Gurevych. Generating Training Data for Semantic Role Labeling based on Label Transfer from Linked Lexical Resources. In: Transactions of the Association for Computational Linguistics, vol. 4, no. 1, p. (to appear), 2016. (PDF)

License[2]

CC-BY-NC-ND 3.0

Description

English Frame and Role Annotations.

(This description has been sourced from the README file included with the corpus).

Table 16. Artifacts for English Word Sense and Semantic Role Datasets (WaSR)
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

part1.tar.bz2

ef7ccf5cb23da63003bdb19d99b15b0ea2821e55

part2.tar.bz2

0a9c98cbf1fe02841edf52e963444a7e38986577

part3.tar.bz2

9c0cc79ecab9140f82683d39ed6acb51b148f9f7

Georgetown University Multilayer Corpus

ID

gum-en-conll-2.2.0

Version

2.2.0

Media type

text/x.org.dkpro.conll-2006

Language

en

Encoding

UTF-8

URL

https://corpling.uis.georgetown.edu/gum/

Attribution[1]

Zeldes, Amir (2016) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation. For the GUM annotation team, see https://corpling.uis.georgetown.edu/gum/

License[2]

CC-BY 2.5, CC-BY-SA 3.0, CC-BY-NC-SA 3.0, CC-BY 4.0

Description

This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.

(This description has been sourced from the dataset website).

Table 17. License comments for Georgetown University Multilayer Corpus
License Comment

CC-BY 2.5

Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright)

CC-BY-SA 3.0

WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use)

CC-BY-NC-SA 3.0

wikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons)

CC-BY 4.0

Annotations (Source: https://corpling.uis.georgetown.edu/gum/)

Table 18. Artifacts for Georgetown University Multilayer Corpus
Artifact SHA1

gum.zip

b17e276998ced83153be605d8157afacf1f10fdc

Georgetown University Multilayer Corpus

ID

gum-en-conll-2.3.2

Version

2.3.2

Media type

text/x.org.dkpro.conll-2006

Language

en

Encoding

UTF-8

URL

https://corpling.uis.georgetown.edu/gum/

Attribution[1]

Zeldes, Amir (2016) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation. For the GUM annotation team, see https://corpling.uis.georgetown.edu/gum/

License[2]

CC-BY 2.5, CC-BY-SA 3.0, CC-BY-NC-SA 3.0, CC-BY 4.0

Description

This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.

(This description has been sourced from the dataset website).

Table 19. License comments for Georgetown University Multilayer Corpus
License Comment

CC-BY 2.5

Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright)

CC-BY-SA 3.0

WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use)

CC-BY-NC-SA 3.0

wikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons)

CC-BY 4.0

Annotations (Source: https://corpling.uis.georgetown.edu/gum/)

Table 20. Artifacts for Georgetown University Multilayer Corpus
Artifact SHA1

gum.zip

471c3a35c2a0e9aee4bbff9a9cf05441fce3ef21

Georgetown University Multilayer Corpus

ID

gum-en-conll-3.0.0

Version

3.0.0

Media type

text/x.org.dkpro.conll-2006

Language

en

Encoding

UTF-8

URL

https://corpling.uis.georgetown.edu/gum/

Attribution[1]

Zeldes, Amir (2016) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation. For the GUM annotation team, see https://corpling.uis.georgetown.edu/gum/

License[2]

CC-BY 2.5, CC-BY-SA 3.0, CC-BY-NC-SA 3.0, CC-BY 4.0

Description

This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.

(This description has been sourced from the dataset website).

The CPOS column of the files contains an extended POS tagset as it is used by the English TreeTagger models. The POS column contains the regular PTB tagset.
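
The two tag columns mentioned above can be picked out by position. The CoNLL-2006 format uses ten tab-separated columns per token, with the coarse tag (CPOSTAG) in the fourth and the fine tag (POSTAG) in the fifth. The sketch below maps one row to named fields. The sample row is made up for illustration and is not taken from the corpus.

```python
# Column order of the CoNLL-2006 (CoNLL-X) format.
CONLL2006_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_token(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLL2006_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLL2006_COLUMNS, values))


# Hypothetical row: an extended tag in CPOSTAG and a PTB tag in POSTAG.
row = "1\tgives\tgive\tVVZ\tVBZ\t_\t0\troot\t_\t_"
token = parse_token(row)
print(token["cpostag"], token["postag"])
```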

Table 21. License comments for Georgetown University Multilayer Corpus
License Comment

CC-BY 2.5

Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright)

CC-BY-SA 3.0

WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use)

CC-BY-NC-SA 3.0

wikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons)

CC-BY 4.0

Annotations (Source: https://corpling.uis.georgetown.edu/gum/)

Table 22. Artifacts for Georgetown University Multilayer Corpus
Artifact SHA1

gum.zip

b590dbe3f4ae198ca500618a53491f75c221e98b

GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5

ID

glove.6B-en-20151025

Version

20151025

Media type

unknown

Language

en

Encoding

UTF-8

URL

https://nlp.stanford.edu/projects/glove/

Attribution[1]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

License[2]

Open Data Commons Public Domain Dedication and License (PDDL)

Description

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

(This description has been sourced from the dataset website).

Table 23. Artifacts for GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5
Artifact SHA1

glove.6B.zip

b64e54f1877d2f735bdd000c1d7d771e25c7dfdc
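
The "linear substructures" mentioned in the description can be illustrated with vector arithmetic. GloVe text files store one word per line followed by its vector components, separated by spaces. The sketch below parses lines in that layout and compares an offset vector with a target word by cosine similarity. The three-dimensional vectors are toy values chosen so that the arithmetic works out exactly; real GloVe vectors have many more dimensions, and such relations hold only approximately.

```python
import math


def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    word, *values = line.rstrip().split(" ")
    return word, [float(value) for value in values]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# Toy vectors, NOT taken from glove.6B.zip.
toy_lines = [
    "man 0.5 0.25 0.25",
    "woman 0.5 0.25 0.75",
    "king 0.5 0.75 0.25",
    "queen 0.5 0.75 0.75",
]
vectors = dict(parse_glove_line(line) for line in toy_lines)

# king - man + woman should point in the same direction as queen.
offset = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]
print(cosine(offset, vectors["queen"]))
```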

MASC-CONLL

ID

masc-conll-en-20080522

Version

20080522

Media type

text/x.org.dkpro.conll-2008

Language

en

Encoding

ISO-8859-1

URL

http://www.anc.org/data/masc/

Attribution[1]

unknown

License[2]

unknown

Description

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).

A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format. This data set was used in the CoNLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.

(This description has been sourced from the dataset website).

Table 24. Artifacts for MASC-CONLL
Artifact SHA1

data.zip

d9f53a05c659204a3223e901c450fe8ffa5fa9fa

NAIST/NTT TED Treebank

ID

tedtreebank-conll-en-1.0

Version

1.0

Media type

text/x.org.dkpro.conll-2006

Language

en

Encoding

UTF-8

URL

http://ahclab.naist.jp/resource/tedtreebank/

Attribution[1]

Graham Neubig, Katsuhito Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukada, Masaaki Nagata. The NAIST-NTT Ted Talk Treebank. In proceedings of International Workshop on Spoken Language Translation (IWSLT). Lake Tahoe, USA. December 2014. (PDF) (bib)

License[2]

CC-BY-NC-SA 3.0 (?)

Description

The NAIST-NTT Ted Talk Treebank is a manually annotated treebank of TED talks that was created through a joint research project of NAIST and the NTT CS Lab. All treebank annotation follows the Penn Treebank standard.

(This description has been sourced from the corpus website/README file in the corpus).

The website does not state which version of CC-BY-NC-SA applies. One might assume it is version 3.0, which is also used for the TED talks themselves.

Table 25. Artifacts for NAIST/NTT TED Treebank
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

data.tar.gz

89c6495bd64c4b3e699b4c478b47a0c827ea46ea

Stanford POS Tagger Distsim Clusters

ID

stanford-egw4-reut-512-clusters-20130608

Version

20130608

Media type

unknown

Language

en

Encoding

UTF-8

URL

http://nlp.stanford.edu/software/pos-tagger-faq.shtml#distsim

Attribution[1]

unknown

License[2]

unknown

Description

Distributional similarity clusters that can be used e.g. with the Stanford POS tagger.

These clusters are a feature extracted from larger, untagged text which clusters the words into similar classes.

(This description has been sourced from the dataset website).

Table 26. Artifacts for Stanford POS Tagger Distsim Clusters
Artifact SHA1

egw4-reut.512.clusters

3f1352641a46e985c07d0023c0ada7e5be97e527

Universal Dependencies 1.4 Treebanks

ID

ud-en-conllu-1.4

Version

1.4

Media type

text/x.org.dkpro.conll-u

Language

en

Encoding

UTF-8

URL

https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827

Attribution[1]

Silveira, N., Dozat, T., De Marneffe, M. C., Bowman, S. R., Connor, M., Bauer, J., & Manning, C. D. (2014, May). A Gold Standard Dependency Corpus for English. In LREC (pp. 2897-2904). (pdf)

License[2]

CC-BY-SA 4.0

Description

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).

(This description has been sourced from the dataset website).

Table 27. Artifacts for Universal Dependencies 1.4 Treebanks
Artifact SHA1

data.tgz

1c41c28b000935ffa6c63b9ff17c48e892c56597

es

CoNLL-2002 NER Shared Task Data (Spanish)

ID

conll2002-es

Version

20020522

Media type

text/x.org.dkpro.conll-2002

Language

es

Encoding

ISO-8859-1

URL

http://www.clips.ua.ac.be/conll2002/ner/

Attribution[1]

unknown

License[2]

unknown

Description

This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000. The annotation was carried out by the TALP Research Center (http://www.talp.upc.es/) of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC, http://clic.fil.ub.es/) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).

(This description has been sourced from the README file included with the corpus).

Table 28. Artifacts for CoNLL-2002 NER Shared Task Data (Spanish)
Artifact SHA1

data.tgz

686ef8fed3125a1d8aefe1351ff0e619fe9c34cb

CoNLL-2009 Shared Task (Spanish)

ID

conll2009-es

Version

2.1

Media type

text/x.org.dkpro.conll-2009

Language

es

Encoding

UTF-8

URL

http://ufal.mff.cuni.cz/conll2009-st/

Attribution[1]

Lluís Màrquez, Ma. Antònia Martí, Mariona Taulé, Manuel Bertran, Oriol Borrega

License[2]

GPLv3 (?)

Description

This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) that was used in the CoNLL-2009 shared task on extracting syntactic and semantic dependencies in multiple languages.

528,440 lexical tokens; training: 427,442; development: 50,368; test: 50,630

(This description has been partially copied from the README file included with the corpus).

The description states that the data was extracted from the AnCora corpus but does not say from which version. One might assume it comes from AnCora Spanish dependency 1.0.1; however, that version does not include a license file. The next version, AnCora Spanish 2.0.0, was released under GPL 3.0, so one might conclude that this data can also be used under those conditions.

Table 29. Artifacts for CoNLL-2009 Shared Task (Spanish)
Artifact SHA1

data.zip

ef36c3369bd05966609b4b13d6bf78884c23ece1

IULA Spanish LSP Treebank

ID

iulatb-es-1

Version

1

Media type

text/x.org.dkpro.conll-2006

Language

es

Encoding

UTF-8

URL

https://www.iula.upf.edu/recurs01_tbk_uk.htm

Attribution[1]

Marimon, Montserrat; Fisas, Beatriz; Bel, Núria; Arias, Blanca; Vázquez, Silvia; Vivaldi, Jorge; Torner, Sergi; Villegas, Marta; Lorente, Mercè (2012). "The IULA Treebank" in Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA). p. 1920-1926. (PDF)

License[2]

CC-BY 3.0

Description

IULA Spanish LSP Treebank is a Spanish treebank containing the syntactic annotation of 42,000 sentences (almost 590,000 tokens). It has been developed within the framework of the Metanet4U project (Enhancing the European Linguistic Infrastructure, GA 270893).

The sentences in IULA Spanish LSP Treebank are extracted from the Corpus Tècnic de l’IULA, a collection of written texts from the fields of Law, Economy, Genomics, Medicine, and Environment, as well as a contrastive corpus from the press.

(This description has been sourced from the corpus website).

Table 30. Artifacts for IULA Spanish LSP Treebank
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

data.rar

67e2ce3327501605b7c9f0844cc4982070612222

fa

Uppsala Persian Dependency Treebank

ID

updt-fa-1.3

Version

1.3

Media type

text/x.org.dkpro.conll-2006

Language

fa

Encoding

UTF-8

URL

http://stp.lingfil.uu.se/%7Emojgan/UPDT.html

Attribution[1]

Mojgan Seraji, under the supervision of Joakim Nivre and Carina Jahani.

License[2]

CC-BY 3.0

Description

Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015, Chapter 5, pp. 97-146) is a dependency-based syntactically annotated corpus.

(This description has been sourced from the dataset website).

Table 31. Artifacts for Uppsala Persian Dependency Treebank
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

train-conll.tar.gz

6ace1d1132b121b09d0b88f53749d28a59843cd5

dev-conll.tar.gz

e96a06b399bb1f565e16e49fb4dfe7da241f5d75

test-conll.tar.gz

ec79e91413dd2c49883bfbbd1a207f68377ac683

fr

Deep Sequoia (Surface)

ID

sequoia-surf-conll-fr-7.0

Version

7.0

Media type

text/x.org.dkpro.conll-2006

Language

fr

Encoding

UTF-8

URL

https://deep-sequoia.inria.fr

Attribution[1]

Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Éric de la Clergerie. (2014) Deep Syntax Annotation of the Sequoia French Treebank. Proc. of LREC 2014, Reykjavik, Iceland.

License[2]

LGPL-LR

Description

Deep-sequoia is a corpus of French sentences annotated with both surface and deep syntactic dependency structures.

(This description has been sourced from the dataset website).

Table 32. Artifacts for Deep Sequoia (Surface)
Artifact SHA1

LICENSE.txt

generated

sequoia.tgz

9f53475f809ef1032a92adedf262226da1615051

hr

SETimes.HR dependency treebank

ID

sethr-hr-1

Version

1

Media type

text/x.org.dkpro.conll-2006

Language

hr

Encoding

UTF-8

URL

http://nlp.ffzg.hr/resources/corpora/setimes-hr/

Attribution[1]

unknown

License[2]

CC-BY-SA 3.0

Description

The corpus is based on the Croatian part of the SETimes parallel corpus.

(This description has been sourced from the corpus website).

Table 33. Artifacts for SETimes.HR dependency treebank
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

setimes.hr.v1.conllx.gz

0faebfe55136692f83dcddd4cf659a8b59655d62

SETimes.HR+ Croatian dependency treebank

ID

sethrplus-hr-20160613

Version

20160613

Media type

text/x.org.dkpro.conll-u

Language

hr

Encoding

UTF-8

URL

https://github.com/ffnlp/sethr

Attribution[1]

Agić and Ljubešić (2014) (PDF) (bib)

License[2]

CC-BY 4.0, CC-BY-NC-SA 4.0

Description

The treebank is a result of an effort in providing free-culture language resources for Croatian by the NLP group at FF Zagreb.

(This description has been sourced from the corpus website).

Table 34. License comments for SETimes.HR+ Croatian dependency treebank
License Comment

CC-BY 4.0

SETimes.HR dataset (set.hr.conll)

CC-BY-NC-SA 4.0

web.hr.conll and news.hr.conll datasets

Table 35. Artifacts for SETimes.HR+ Croatian dependency treebank
Artifact SHA1

LICENSE-CC-BY.txt

1167f0e28fe2db01e38e883aaf1e749fb09f9ceb

LICENSE-CC-BY-NC-SA.txt

5d572362228001e9dbc0c8802f49121ceb78ace2

data.zip

a52d13cfa91589c0d93fe0a90333a4f0e997b7cf

it

Turin University Treebank

ID

tut-conll-it-20101122

Version

20101122

Media type

text/x.org.dkpro.conll-2006

Language

it

Encoding

UTF-8

URL

http://www.di.unito.it/~tutreeb/treebanks.html

Attribution[1]

Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo, Alessandro Mazzei, Livio Robaldo

License[2]

CC-BY-NC-SA 2.5

Description

TUT is a morpho-syntactically annotated collection of Italian sentences, which includes texts from different text genres and domains, released in several annotation formats.

(This description has been sourced from the corpus website).

Table 36. Artifacts for Turin University Treebank
Artifact SHA1

NEWS.zip

3d9b22d8ebf533aa1d6d39d417316c30900b9a0e

VEDCH.zip

2278e6e770ddc4a8eea5e045c4a77a5df2ae0977

CODICECIVILE.zip

9cf9c0a9c652b3df6564d1fa0ca97c2d7905faa3

EUDIR.zip

72a6e55627481ff99930b59714cfc0909ccf60e1

WIKI.zip

a421f488859324e3e12687b9a3067652248eb8df

ja

CoNLL-2009 Shared Task (Japanese)

ID

conll2009-ja

Version

1.0

Media type

text/x.org.dkpro.conll-2009

Language

ja

Encoding

UTF-8

URL

http://ufal.mff.cuni.cz/conll2009-st/

Attribution[1]

Daisuke Kawahara

License[2]

unknown

Description

This file contains the basic information regarding the Japanese corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages". The current version corresponds to the release of the training data sets.

The data of this distribution uses portions of the Kyoto University Text Corpus 4.0. The Kyoto University Text Corpus is freely available at http://nlp.kuee.kyoto-u.ac.jp/nl-resource/corpus-e.html.

(This description has been sourced from the README file included with the corpus).

Table 37. Artifacts for CoNLL-2009 Shared Task (Japanese)
Artifact SHA1

data.zip

8c96a1eda2527a9ba1bf37dd4125cc6af11e7dd4

la

Ancient Greek and Latin Dependency Treebank (Latin)

ID

perseus-la-2.1

Version

2.1

Media type

unknown

Language

la

Encoding

ISO-8859-1

URL

https://perseusdl.github.io/treebank_data/

Attribution[1]

Giuseppe G. A. Celano, Gregory Crane, Bridget Almas et al.

License[2]

CC-BY-SA 3.0

Description

The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University-Tufts University.

(This description has been sourced from the dataset website).

Table 38. Artifacts for Ancient Greek and Latin Dependency Treebank (Latin)
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

perseus.zip

140eee6d2e3e83745f95d3d5274d9e965d898980

nb

Norwegian Dependency Treebank (Norwegian Bokmål)

ID

ndt-nb-1.01

Version

1.01

Media type

text/x.org.dkpro.conll-2006

Language

nb

Encoding

UTF-8

URL

http://www.nb.no/sprakbanken/show?serial=sbr-10

Attribution[1]

CLARINO NB – Språkbanken

License[2]

CC0 1.0

Description

The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referensegrammatikk ‘Norwegian Reference Grammar’.

(This description has been sourced from the dataset website).

Table 39. Artifacts for Norwegian Dependency Treebank (Norwegian Bokmål)
Artifact SHA1

LICENSE_NDT.txt

a2f433206f421c0d630b3bec5fad01334673b765

20140328_NDT_1-01.tar.gz

97935c225f98119aa94d53f37aa64762cba332f3

fi

FinnTreeBank

ID

finntb-fi-3.1

Version

3.1

Media type

text/x.org.dkpro.conll-2006

Language

fi

Encoding

UTF-8

URL

http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/

Attribution[1]

unknown

License[2]

CC-BY 3.0

Description

The FinnTreeBank project is creating a treebank and a parsebank for Finnish. This work is licensed under a Creative Commons Attribution 3.0 license.

The first and second versions of the treebank are annotated by hand and based on 17,000 model sentences in the Large Grammar of Finnish VISK - Iso Suomen Kielioppi. Brief samples of text from other sources, e.g. news items and literature, are also available in the second version. A parsebank for Finnish based on the Europarl and the JRC-Acquis will be available in June 2012.

(This description has been sourced from the dataset website).

Table 40. Artifacts for FinnTreeBank
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

ftb3.1.conllx.gz

7c58064bf9995980cea08e84035c0414adc54f06

nl

Alpino2conll

ID

alpino-conll-nl-20100114

Version

20100114

Media type

text/x.org.dkpro.conll-2006

Language

nl

Encoding

UTF-8

URL

http://www.let.rug.nl/~bplank/alpino2conll/

Attribution[1]

Barbara Plank. Improved statistical measures to assess natural language parser performance across domains. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC2010), Valletta, Malta, May 2010.

License[2]

unknown

Description

Training and test datasets for Dutch in retagged CoNLL format. The data was converted from Alpino XML into CoNLL format based on an adapted version of Erwin Marsi’s conversion software, but PoS tags were replaced by automatically assigned Alpino tags.

(This description has been sourced from the corpus website).

Table 41. Artifacts for Alpino2conll
Artifact SHA1

cdb.conll.utf8

11313d405abb0f268247a2d5420afa413eb244e7

conll2006-test.conll

11313d405abb0f268247a2d5420afa413eb244e7

CoNLL-2002 NER Shared Task Data (Dutch)

ID

conll2002-nl

Version

20021107

Media type

text/x.org.dkpro.conll-2002

Language

nl

Encoding

ISO-8859-1

URL

http://www.clips.ua.ac.be/conll2002/ner/

Attribution[1]

unknown

License[2]

unknown

Description

This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data was annotated as a part of the Atranos project (http://atranos.esat.kuleuven.ac.be/) at the University of Antwerp.

(This description has been sourced from the README file included with the corpus).

Table 42. Artifacts for CoNLL-2002 NER Shared Task Data (Dutch)
Artifact SHA1

data.tgz

686ef8fed3125a1d8aefe1351ff0e619fe9c34cb

nn

Norwegian Dependency Treebank (Norwegian Nynorsk)

ID

ndt-nn-1.01

Version

1.01

Media type

text/x.org.dkpro.conll-2006

Language

nn

Encoding

UTF-8

URL

http://www.nb.no/sprakbanken/show?serial=sbr-10

Attribution[1]

CLARINO NB – Språkbanken

License[2]

CC0 1.0

Description

The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referensegrammatikk ‘Norwegian Reference Grammar’.

(This description has been sourced from the dataset website).

Table 43. Artifacts for Norwegian Dependency Treebank (Norwegian Nynorsk)
Artifact SHA1

LICENSE_NDT.txt

a2f433206f421c0d630b3bec5fad01334673b765

20140328_NDT_1-01.tar.gz

97935c225f98119aa94d53f37aa64762cba332f3

pl

Polish Constituency Treebank

ID

poltb-pl-0.5

Version

0.5

Media type

application/x.org.dkpro.tiger+xml

Language

pl

Encoding

UTF-8

URL

http://zil.ipipan.waw.pl/Składnica

Attribution[1]

unknown

License[2]

GPL 3.0

Description

The Polish constituency treebank (Składnica frazowa), version 0.5. Trees in the Tiger XML format containing only parse trees selected by dendrologists (one interpretation per sentence).

(This description has been sourced from the corpus website).

Table 44. Artifacts for Polish Constituency Treebank
Artifact SHA1

LICENSE.txt

8624bcdae55baeef00cd11d5dfcfa60f68710a02

poltb-0.5-tiger.xml.gz

c8977d436d218b726d657224305bced178071dcf

Polish Dependency Bank

ID

poldb-pl-0.5

Version

0.5

Media type

text/x.org.dkpro.conll-2006

Language

pl

Encoding

UTF-8

URL

http://zil.ipipan.waw.pl/Składnica

Attribution[1]

unknown

License[2]

GPL 3.0

Description

The dependency treebank (Składnica zależnościowa), version 0.5, is a result of an automatic conversion of manually disambiguated constituency trees into dependency structures.

(This description has been sourced from the corpus website).

Table 45. Artifacts for Polish Dependency Bank
Artifact SHA1

LICENSE.txt

8624bcdae55baeef00cd11d5dfcfa60f68710a02

poldb-0.5.conll.gz

187424608e91b271957dabcf140a7274f1c88d63

pt

CoNLL-2006 Shared Task (Portuguese)

ID

conll2006-pt

Version

20100302

Media type

text/x.org.dkpro.conll-2006

Language

pt

Encoding

UTF-8

URL

http://ilk.uvt.nl/conll/

Attribution[1]

Diana Santos, Eckhard Bick

License[2]

Floresta Sintá(c)tica License

Description

This is the Portuguese part of the CoNLL-X Shared Task. The data was derived from the Floresta Sintá(c)tica Bosque 7.3 by Sabine Buchholz.

(This description has been partially sourced from the README file included with the corpus).

We did not find license information for this dataset. One might assume the license of this dataset is equivalent to that of the Floresta Sintá(c)tica from which it was derived.

Table 46. Artifacts for CoNLL-2006 Shared Task (Portuguese)
Artifact SHA1

README.txt

7afe672cba645d22fc037d8f6e2bf9d501d0aee6

portuguese_bosque_train.conll

29e630e207c74a42e0d2999193aa25d73f262920

portuguese_bosque_test_blind.conll

fabcfbd73a531e21786af9b8233f1a4aa78dfddb

portuguese_bosque_test.conll

e399cdc1203df1ff43816f3f934223cb9a625992

sl

JOS - jos100k

ID

jos100k-conll-sl-2.0

Version

2.0

Media type

text/x.org.dkpro.conll-2006

Language

sl

Encoding

UTF-8

URL

http://nl.ijs.si/jos/jos100k-en.html

Attribution[1]

Tomaž Erjavec, Darja Fišer, Simon Krek, Nina Ledinek: The JOS Linguistically Tagged Corpus of Slovene. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Malta, 2010. (PDF)

License[2]

CC-BY-NC 3.0

Description

The jos100k corpus contains 100,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a reference annotated corpus of Slovene: its manually-validated annotations cover three levels of linguistic description.

(This description has been sourced from the corpus website).

Table 47. Artifacts for JOS - jos100k
Artifact SHA1

LICENSE.txt

da39a3ee5e6b4b0d3255bfef95601890afd80709

data.zip

9f330ffd102cc5d5734fdaecbbf67751c84a1339

Slovene Dependency Treebank 0.1

ID

sdt-conll-sl-0.1

Version

0.1

Media type

text/x.org.dkpro.conll-2006

Language

sl

Encoding

UTF-8

URL

http://nl.ijs.si/sdt/

Attribution[1]

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdenek Žabokrtský, Andreja Žele: Towards a Slovene Dependency Treebank. In Proceedings of Fifth International Conference on Language Resources and Evaluation, LREC'06, 24-26 May 2006. Genoa. (PDF)

License[2]

SDT CoNLL-X

Description

The Slovene Dependency Treebank project built a small syntactically annotated corpus of Slovene texts. The corpus was annotated with dependency analyses, taking the Prague Dependency Treebank as the model. The Slovene Dependency Treebank is annotated with Analytic Tree Structures and contains a part of the morphosyntactically annotated Slovene component of the parallel MULTEXT-East corpus, i.e. the first third of the Slovene translation of the novel "1984" by G. Orwell, containing 30,000 words.

(This description has been sourced from the corpus website).

Table 48. Artifacts for Slovene Dependency Treebank 0.1
Artifact SHA1

data.zip

2bd85ad77c35d0c305a6afb7ee092676d5d22a35

Slovene Dependency Treebank 0.4

ID

sdt-conll-sl-0.4

Version

0.4

Media type

text/x.org.dkpro.conll-2006

Language

sl

Encoding

UTF-8

URL

http://nl.ijs.si/sdt/

Attribution[1]

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdenek Žabokrtský, Andreja Žele: Towards a Slovene Dependency Treebank. In Proceedings of Fifth International Conference on Language Resources and Evaluation, LREC'06, 24-26 May 2006. Genoa. (PDF)

License[2]

SDT License

Description

This is the preliminary release of the Slovene Dependency Treebank, SDT V0.4, which contains the Prague Dependency Treebank-like annotation of the first part of the Slovene translation of Orwell’s "1984", taken from the MULTEXT-East parallel corpus, V3.0; cf. http://ufal.mff.cuni.cz/pdt/, http://nl.ijs.si/ME/V3/ and http://nl.ijs.si/ME/V3/doc/index.html#mtev3-doc-div2-id2305296

(This description has been sourced from the corpus website).

Table 49. Artifacts for Slovene Dependency Treebank 0.4
Artifact SHA1

README.txt

d2ac8d9f8b45ceae34ce77f57b354662292bd609

sdt-conll.tbl

16cfa8a20ebf8ed0e4f13c0119c7aa76a2498b1f

sv

Talbanken05 DEP

ID

talkbanken05-dep-sv-1.1

Version

1.1

Media type

text/x.org.dkpro.conll-2006

Language

sv

Encoding

UTF-8

URL

http://stp.lingfil.uu.se/%7Enivre/research/Talbanken05.html

Attribution[1]

Joakim Nivre, Jens Nilsson and Johan Hall (2006) Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC2006), May 24-26, 2006, Genoa, Italy. (pdf)

License[2]

Talbanken05 License

Description

Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).

Dep: Dependency structure annotation (CoNLL-X shared task format in UTF-8).

(This description has been sourced from the corpus website).

Table 50. Artifacts for Talbanken05 DEP
Artifact SHA1

data.tar.gz

bc836ab364ba37522e2989481104bad2eb96a92e
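
The CoNLL-X (CoNLL-2006) format mentioned above stores one token per line in ten tab-separated columns, with a blank line between sentences. The sketch below shows one way to read such data with plain Python; it is an illustration only and not how DKPro Core reads the format. The function name `read_conllx` is hypothetical, and the two-token sample, including its tags, is invented rather than taken from Talbanken05.

```python
# Column names of the CoNLL-X (CoNLL-2006) format, in order.
CONLLX_COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                  "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_conllx(lines):
    """Yield sentences as lists of dicts, one dict per token."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        sentence.append(dict(zip(CONLLX_COLUMNS, fields)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample; the tag and relation labels are placeholders.
SAMPLE = (
    "1\tHon\t_\tPO\tPO\t_\t2\tSS\t_\t_\n"
    "2\tsover\t_\tVV\tVV\t_\t0\tROOT\t_\t_\n"
    "\n"
)
sentences = list(read_conllx(SAMPLE.splitlines(keepends=True)))
heads = [(tok["FORM"], int(tok["HEAD"])) for tok in sentences[0]]
print(heads)
```

When reading an actual file, open it with the encoding listed for the dataset (UTF-8 for this variant) and pass the file object to `read_conllx`.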

Talbanken05 DPS

ID

talkbanken05-dps-sv-1.1

Version

1.1

Media type

application/x.org.dkpro.tiger+xml

Language

sv

Encoding

ISO-8859-1

URL

http://stp.lingfil.uu.se/%7Enivre/research/Talbanken05.html

Attribution[1]

Joakim Nivre, Jens Nilsson and Johan Hall (2006) Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC2006), May 24-26, 2006, Genoa, Italy. (pdf)

License[2]

Talbanken05 License

Description

Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).

DPS: Deepened phrase structure annotation (TIGER-XML encoding in ISO-8859-1)

(This description has been sourced from the corpus website).

Table 51. Artifacts for Talbanken05 DPS
Artifact SHA1

data.tar.gz

bc836ab364ba37522e2989481104bad2eb96a92e
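
The DPS variant is listed with ISO-8859-1 encoding, whereas the DEP variant is listed as UTF-8, so the encoding passed to a reader has to match the variant being read. The snippet below is a small illustration of why this matters, using only the Python standard library; the sample string is just the place name from the description above, not corpus data.

```python
text = "Växjö"

# Bytes as they would appear in an ISO-8859-1 encoded file.
latin1_bytes = text.encode("iso-8859-1")

# Decoding with the declared encoding restores the original text.
roundtrip = latin1_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails, because 0xE4 ("ä" in ISO-8859-1)
# starts a multi-byte sequence that the following byte does not continue.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(roundtrip == text, utf8_failed)
```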

Talbanken05 FPS

ID

talkbanken05-fps-sv-1.1

Version

1.1

Media type

application/x.org.dkpro.tiger+xml

Language

sv

Encoding

ISO-8859-1

URL

http://stp.lingfil.uu.se/%7Enivre/research/Talbanken05.html

Attribution[1]

Joakim Nivre, Jens Nilsson and Johan Hall (2006) Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC2006), May 24-26, 2006, Genoa, Italy. (pdf)

License[2]

Talbanken05 License

Description

Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).

FPS: Flat phrase structure annotation (TIGER-XML encoding in ISO-8859-1)

(This description has been sourced from the corpus website).

Table 52. Artifacts for Talbanken05 FPS
Artifact SHA1

data.tar.gz

bc836ab364ba37522e2989481104bad2eb96a92e


1. Provided attribution information is indicative and typically for the annotations. Additional attributions may be due, e.g. for the underlying texts.
2. Provided license information is indicative and typically for the annotations. Additional licenses may apply, e.g. for the underlying texts.