This document provides information about the datasets available through the DKPro Core DatasetFactory class.
The factory automatically downloads the datasets and maintains a local cache to avoid redundant downloads. Datasets are validated against checksums stored in the dataset descriptions included with DKPro Core to ensure that the descriptions match the datasets. While we try to maintain a good quality of the descriptions, they may not be perfect. Please use the Edit on GitHub links next to the descriptions in the document below or the issue tracker to report or fix any problems you may notice.
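To make the workflow concrete, the following is a minimal sketch of loading a dataset through the factory. It is written against a recent DKPro Core version; the package names, the cache directory, and the dataset ID `germeval2014-de` are illustrative assumptions, so check the DatasetFactory Javadoc of your DKPro Core version for the authoritative API.

```java
import java.nio.file.Paths;

import org.dkpro.core.api.datasets.Dataset;
import org.dkpro.core.api.datasets.DatasetFactory;

public class LoadDatasetExample
{
    public static void main(String[] args) throws Exception
    {
        // The factory caches downloads under this directory; on later runs,
        // artifacts that are already present and pass checksum validation
        // are not downloaded again.
        DatasetFactory loader = new DatasetFactory(Paths.get("target/datasets"));

        // Dataset ID assumed here for illustration; the IDs are defined by
        // the dataset descriptions shipped with DKPro Core.
        Dataset ds = loader.load("germeval2014-de");

        System.out.println("Loaded: " + ds.getName());
    }
}
```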
… more datasets?
This is not an exhaustive list of the datasets supported by DKPro Core. Any dataset in a format supported by DKPro Core can be used. For more details, refer to the Format Reference. If you are missing any datasets from the list, please tell us by opening an issue in our issue tracker. You can also simply create a new dataset description yourself and submit a pull request. For details on describing new datasets, please refer to the User Guide.
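Since datasets are delivered in their native formats, a loaded dataset is typically wired into a pipeline through the matching DKPro Core format reader. The sketch below assumes the CoNLL 2006 reader and a hypothetical dataset ID `conll2006-nl`; the split accessors follow the API as described in the User Guide, but verify them against your DKPro Core version.

```java
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import java.nio.file.Paths;

import org.apache.uima.collection.CollectionReaderDescription;
import org.dkpro.core.api.datasets.Dataset;
import org.dkpro.core.api.datasets.DatasetFactory;
import org.dkpro.core.io.conll.Conll2006Reader;

public class ReadDatasetExample
{
    public static void main(String[] args) throws Exception
    {
        DatasetFactory loader = new DatasetFactory(Paths.get("target/datasets"));
        Dataset ds = loader.load("conll2006-nl"); // hypothetical ID

        // Point the format reader at the training files of the dataset's
        // default train/dev/test split.
        CollectionReaderDescription reader = createReaderDescription(
                Conll2006Reader.class,
                Conll2006Reader.PARAM_PATTERNS, ds.getDefaultSplit().getTrainingFiles(),
                Conll2006Reader.PARAM_LANGUAGE, ds.getLanguage());

        // ... combine with analysis engines, e.g. via SimplePipeline.runPipeline(reader, ...)
    }
}
```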
Overview
Dataset | Version | Language | Encoding | License |
---|---|---|---|---|
AQMAR Arabic Wikipedia Named Entity Corpus | 1.0 | ar | UTF-8 | CC-BY-SA 3.0 |
Alpino2conll | 20100114 | nl | UTF-8 | unknown |
Ancient Greek and Latin Dependency Treebank (Greek) | 2.1 | el | UTF-8 | CC-BY-SA 3.0 |
Ancient Greek and Latin Dependency Treebank (Latin) | 2.1 | la | ISO-8859-1 | CC-BY-SA 3.0 |
Brown Corpus (TEI XML) | 20081013 | en | ISO-8859-1 | Brown Corpus License (?) |
CoNLL-2000 Chunking Shared Task Data (English) | 20000221 | en | ISO-8859-1 | WSJ Corpus License (?) |
CoNLL-2002 NER Shared Task Data (Dutch) | 20021107 | nl | ISO-8859-1 | unknown |
CoNLL-2002 NER Shared Task Data (Spanish) | 20020522 | es | ISO-8859-1 | unknown |
CoNLL-2006 Shared Task (Portuguese) | 20100302 | pt | UTF-8 | Floresta Sintá(c)tica License |
CoNLL-2009 Shared Task (Catalan) | 2.1 | ca | UTF-8 | GPLv3 (?) |
CoNLL-2009 Shared Task (German) | 1.1 | de | UTF-8 | multiple |
CoNLL-2009 Shared Task (Japanese) | 1.0 | ja | UTF-8 | unknown |
CoNLL-2009 Shared Task (Spanish) | 2.1 | es | UTF-8 | GPLv3 (?) |
Copenhagen Dependency Treebank | 1 | da | UTF-8 | GPLv2 |
Coptic Treebank | 1.0 | cop | UTF-8 | CC-BY 4.0 |
Deep Sequoia (Surface) | 7.0 | fr | UTF-8 | LGPL-LR |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
English Word Sense and Semantic Role Datasets (WaSR) | 1.0 | en | UTF-8 | CC-BY-NC-ND 3.0 |
FinnTreeBank | 3.1 | nfi | UTF-8 | CC-BY 3.0 |
Georgetown University Multilayer Corpus | 5.0.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 2.3.2 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 3.0.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus | 2.2.0 | en | UTF-8 | multiple |
Georgetown University Multilayer Corpus (UD) | 4.1.0 | en | UTF-8 | multiple |
GermEval 2014 Named Entity Recognition Shared Task | 20200808 | de | UTF-8 | CC-BY 4.0 |
GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5 | 20151025 | en | UTF-8 | Open Data Commons Public Domain Dedication and License (PDDL) |
Hamburg Dependency Treebank | 1.0.1 | de | UTF-8 | multiple |
IULA Spanish LSP Treebank | 1 | es | UTF-8 | CC-BY 3.0 |
JOS - jos100k | 2.0 | sl | UTF-8 | CC-BY-NC 3.0 |
MASC-CONLL | 20080522 | en | ISO-8859-1 | unknown |
NAIST/NTT TED Treebank | 1.0 | en | UTF-8 | CC-BY-NC-SA 3.0 (?) |
Named Entity Model for German, Politics (NEMGP) | 0.1 | de | UTF-8 | CC-BY-SA 3.0 |
Norwegian Dependency Treebank (Norwegian Bokmål) | 1.01 | nb | UTF-8 | CC0 1.0 |
Norwegian Dependency Treebank (Norwegian Nynorsk) | 1.01 | nn | UTF-8 | CC0 1.0 |
Polish Constituency Treebank | 0.5 | pl | UTF-8 | GPL 3.0 |
Polish Dependency Bank | 0.5 | pl | UTF-8 | GPL 3.0 |
SETimes.HR dependency treebank | 1 | hr | UTF-8 | CC-BY-SA 3.0 |
SETimes.HR+ Croatian dependency treebank | 20160613 | hr | UTF-8 | multiple |
Slovene Dependency Treebank 0.1 | 0.1 | sl | UTF-8 | SDT CoNLL-X |
Slovene Dependency Treebank 0.4 | 0.4 | sl | UTF-8 | SDT License |
Stanford POS Tagger Distsim Clusters | 20130608 | en | UTF-8 | unknown |
Talbanken05 DEP | 1.1 | sv | UTF-8 | Talbanken05 License |
Talbanken05 DPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
Talbanken05 FPS | 1.1 | sv | ISO-8859-1 | Talbanken05 License |
Turin University Treebank | 20101122 | it | UTF-8 | CC-BY-NC-SA 2.5 |
Universal Dependencies 1.4 Treebanks | 1.4 | en | UTF-8 | CC-BY-SA 4.0 |
Uppsala Persian Dependency Treebank | 1.3 | fa | UTF-8 | CC-BY 3.0 |
Datasets
ar
AQMAR Arabic Wikipedia Named Entity Corpus
73,853 tokens in 28 Arabic Wikipedia articles hand-annotated for named entities.
(This description has been partially copied from the corpus website).
Artifact | SHA1 |
---|---|
 | 54977f4065ec070057e99b4b446273e5c8f071d2 |
 | 4fa2c37d7673bb456c6e382566a091545531d85f |
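The SHA1 values above (and in the artifact tables throughout this document) are the checksums against which the factory validates downloads. Purely as an illustration of that check, and not the factory's actual implementation, verifying a downloaded artifact by hand could look like the sketch below; the file name is hypothetical, and HexFormat requires Java 17 or newer.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChecksumCheck
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical local copy of the first AQMAR artifact listed above
        Path artifact = Path.of("AQMAR-corpus.zip");
        String expected = "54977f4065ec070057e99b4b446273e5c8f071d2";

        // Compute the SHA1 of the file and compare it to the expected value
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        String actual = HexFormat.of().formatHex(sha1.digest(Files.readAllBytes(artifact)));

        System.out.println(actual.equals(expected) ? "checksum OK" : "checksum MISMATCH");
    }
}
```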
ca
CoNLL-2009 Shared Task (Catalan)
This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) which was used in the CoNLL-2009 shared task on syntactic and semantic dependencies in multiple languages.
496,672 lexical tokens; training: 390,302; development: 53,015; test: 53,355
(This description has been partially copied from the README file included with the corpus).
The description states that the data was extracted from the AnCora corpus, but it does not say from which version. One might assume it comes from AnCora Catalan dependency 1.0.1. However, this version does not include a license file. The next version is AnCora Catalan 2.0.0, which was released under GPL 3.0. Thus, one might conclude that this data can also be used under these conditions.
Artifact | SHA1 |
---|---|
 | 500cbb81709012cce4d23bfa72d93c320b0b7e6f |
cop
Coptic Treebank
The Coptic Treebank from the Coptic SCRIPTORIUM corpora (http://copticscriptorium.org/).
Artifact | SHA1 |
---|---|
 | 3015e20629818d25c34527d59808e716fd0d8ced |
 | 8c363df27408cb14cb42f3869916c1575fe1625a |
da
Copenhagen Dependency Treebank
Version 1 (the directory "da") was originally called the Danish Dependency Treebank. It was used in the CoNLL 2006 shared task on dependency parsing, but has since been updated with bug fixes and an improved CoNLL conversion which includes a decomposition of the PAROLE part-of-speech tags into the underlying features for number, gender, etc.
(This description has been sourced from the corpus README file).
Artifact | SHA1 |
---|---|
 | 0e5aad9553dc0ed784ec220bb09e22d52fefbb8b |
 | 11313d405abb0f268247a2d5420afa413eb244e7 |
de
CoNLL-2009 Shared Task (German)
This dataset contains the basic information regarding the German corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages" (http://ufal.mff.cuni.cz/conll2009-st/). The data of this distribution is derived from the TIGER Treebank and the SALSA Corpus, converted into the syntactic and semantic dependencies compatible with the CoNLL-2009 shared task.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | null |
GermEval 2014 Named Entity Recognition Shared Task
The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties:
- The data was sampled from German Wikipedia and News Corpora as a collection of citations.
- The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
- The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 9c5bee7a22ab39ad6c19ab29ea9e94ac5874f9c6 |
 | 827edc0232f813fb1344e06924a46e9344ec2f61 |
Hamburg Dependency Treebank
Contains annotated text from the German technical news website www.heise.de.
License | Comment |
---|---|
 | Annotation |
 | Text |
Artifact | SHA1 |
---|---|
 | 7f893542ae74df4c277b98278ad9e3ad6c09e690 |
LICENSE-HZSK-ACA.txt (generated) | 6594e5cd48966db7dac04f2b5ff948eb2bcadf37 |
Named Entity Model for German, Politics (NEMGP)
The Named Entity Model for German, Politics (NEMGP) is a collection of texts from Wikipedia and WikiNews, manually annotated with named entity information.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | f2a1fd54df9232741a3a1892d1ffb0a4d7205991 |
el
Ancient Greek and Latin Dependency Treebank (Greek)
The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University and Tufts University.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 140eee6d2e3e83745f95d3d5274d9e965d898980 |
en
Brown Corpus (TEI XML)
This version derives directly from "A Standard Corpus of Present-Day Edited American English, for use with Digital Computers" by W. N. Francis and H. Kucera (1964), Department of Linguistics, Brown University, Providence, Rhode Island, USA (revised 1971, revised and amplified 1979; see http://www.hit.uib.no/icame/brown/bcm.html), as distributed with NLTK (version 0.9.2).
(This description has been taken from the README file included with the corpus).
We did not find license information included with this dataset. One might assume the TEI version of the Brown Corpus is provided under the same conditions as the original Brown Corpus.
Artifact | SHA1 |
---|---|
LICENSE.txt (generated) | 1e4eadeb358f6f7e6ac9b3677a82f4353bbe91ed |
CoNLL-2000 Chunking Shared Task Data (English)
This is the data from the CoNLL-2000 shared task on text chunking. The data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211,727 tokens) and section 20 as test data (47,377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands. Instead of using the part-of-speech tags of the WSJ corpus, the dataset used tags generated by the Brill tagger.
(This description has been partially copied from the corpus website).
We did not find any license information for this dataset. However, as the texts appear to come from the WSJ corpus, the WSJ corpus license probably applies here.
Artifact | SHA1 |
---|---|
 | 9f31cf936554cebf558d07cce923dca0b7f31864 |
 | dc57527f1f60eeafad03da51235185141152f849 |
English Word Sense and Semantic Role Datasets (WaSR)
English Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | ef7ccf5cb23da63003bdb19d99b15b0ea2821e55 |
English Word Sense and Semantic Role Datasets (WaSR)
English Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | ef7ccf5cb23da63003bdb19d99b15b0ea2821e55 |
 | 0a9c98cbf1fe02841edf52e963444a7e38986577 |
 | 9c0cc79ecab9140f82683d39ed6acb51b148f9f7 |
English Word Sense and Semantic Role Datasets (WaSR)
German Frame and Role Annotations.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | b706711ae6fffc94409f80b635595bd45d8c2ece |
Georgetown University Multilayer Corpus
GUM is an open source multilayer corpus of richly annotated web texts from eight text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (mostly Creative Commons licenses), so that new texts can be annotated and published with ease.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews/interviews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use); Wikipedia biographies (Source: https://en.wikipedia.org/wiki/Wikipedia:Copyrights) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons); Fiction texts (Source: http://smallbeerpress.com/creative-commons/) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/); Academic texts (various sources, see LICENSE.txt file) |
Artifact | SHA1 |
---|---|
 | null |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | 471c3a35c2a0e9aee4bbff9a9cf05441fce3ef21 |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
The CPOS column of the files contains an extended POS tagset as it is used by the English TreeTagger models. The POS column contains the regular PTB tagset.
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | b590dbe3f4ae198ca500618a53491f75c221e98b |
Georgetown University Multilayer Corpus
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | b17e276998ced83153be605d8157afacf1f10fdc |
Georgetown University Multilayer Corpus (UD)
This dataset contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.
(This description has been sourced from the dataset website).
The CPOS column of the files contains an extended POS tagset as it is used by the English TreeTagger models. The POS column contains the regular PTB tagset.
Note that this dataset does not include the Reddit data as it can only be obtained by running a Python script which comes with GUM.
License | Comment |
---|---|
 | Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright) |
 | WikiVoyage and Biographies texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use) |
 | WikiHow and Fiction texts (Source: http://www.wikihow.com/wikiHow:Creative-Commons) |
 | Annotations (Source: https://corpling.uis.georgetown.edu/gum/) |
Artifact | SHA1 |
---|---|
 | 91ded1ba5b6c05fe8e70e42a0a36ee0d20556888 |
GloVe pre-trained vectors - Wikipedia 2014 + Gigaword 5
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | b64e54f1877d2f735bdd000c1d7d771e25c7dfdc |
MASC-CONLL
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
A 40K subset of MASC1 data with annotations for Penn Treebank syntactic dependencies and semantic dependencies from NomBank and PropBank in CONLL IOB format. This data set was used in the CoNLL 2008 shared task on Joint Parsing of Syntactic and Semantic Dependencies.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | d9f53a05c659204a3223e901c450fe8ffa5fa9fa |
NAIST/NTT TED Treebank
The NAIST-NTT TED Talk Treebank is a manually annotated treebank of TED talks that was created through a joint research project of NAIST and the NTT CS Lab. All treebank annotation follows the Penn Treebank standard.
(This description has been sourced from the corpus website/README file in the corpus).
The website does not state which version of the CC-BY-NC-SA license applies. One might assume it is version 3.0, which is also used for the TED talks themselves.
Artifact | SHA1 |
---|---|
 | 90490d92475de1dc68502b6cdb317187c4336b36 |
 | 89c6495bd64c4b3e699b4c478b47a0c827ea46ea |
Stanford POS Tagger Distsim Clusters
Distributional similarity clusters that can be used, e.g., with the Stanford POS tagger.
The clusters are a feature extracted from a larger body of untagged text, grouping the words into classes of similar distribution.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 3f1352641a46e985c07d0023c0ada7e5be97e527 |
Universal Dependencies 1.4 Treebanks
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | 1c41c28b000935ffa6c63b9ff17c48e892c56597 |
es
CoNLL-2002 NER Shared Task Data (Spanish)
This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000. The annotation was carried out by the TALP Research Center (http://www.talp.upc.es/) of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC, http://clic.fil.ub.es/) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 686ef8fed3125a1d8aefe1351ff0e619fe9c34cb |
CoNLL-2009 Shared Task (Spanish)
This is a subset of the AnCora corpus (see http://clic.ub.edu/ancora) which was used in the CoNLL-2009 shared task on syntactic and semantic dependencies in multiple languages.
528,440 lexical tokens; training: 427,442; development: 50,368; test: 50,630
(This description has been partially copied from the README file included with the corpus).
The description states that the data was extracted from the AnCora corpus, but it does not say from which version. One might assume it comes from AnCora Spanish dependency 1.0.1. However, this version does not include a license file. The next version is AnCora Spanish 2.0.0, which was released under GPL 3.0. Thus, one might conclude that this data can also be used under these conditions.
Artifact | SHA1 |
---|---|
 | ef36c3369bd05966609b4b13d6bf78884c23ece1 |
IULA Spanish LSP Treebank
IULA Spanish LSP Treebank is a Spanish treebank containing the syntactic annotation of 42,000 sentences (almost 590,000 tokens). It has been developed within the frame of the Metanet4U project (Enhancing the European Linguistic Infrastructure, GA 270893).
The sentences in IULA Spanish LSP Treebank are extracted from the Corpus Tècnic de l’IULA, a collection of written texts from the fields of Law, Economy, Genomics, Medicine, and Environment, as well as a contrastive corpus from the press.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 67e2ce3327501605b7c9f0844cc4982070612222 |
fa
Uppsala Persian Dependency Treebank
Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015, Chapter 5, pp. 97-146) is a dependency-based syntactically annotated corpus.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 336ba453635ff079ab2ae9a5349247efa11acdf8 |
fr
Deep Sequoia (Surface)
Deep-sequoia is a corpus of French sentences annotated with both surface and deep syntactic dependency structures.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
LICENSE.txt (generated) | 9f53475f809ef1032a92adedf262226da1615051 |
hr
SETimes.HR dependency treebank
The corpus is based on the Croatian part of the SETimes parallel corpus.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 0faebfe55136692f83dcddd4cf659a8b59655d62 |
SETimes.HR+ Croatian dependency treebank
The treebank is a result of an effort in providing free-culture language resources for Croatian by the NLP group at FF Zagreb.
(This description has been sourced from the corpus website).
License | Comment |
---|---|
 | SETimes.HR dataset (set.hr.conll) |
 | web.hr.conll and news.hr.conll datasets |
Artifact | SHA1 |
---|---|
 | 9c5bee7a22ab39ad6c19ab29ea9e94ac5874f9c6 |
 | 54cc324681563e5ede8088f020f0b21e35d37fb9 |
 | a52d13cfa91589c0d93fe0a90333a4f0e997b7cf |
it
Turin University Treebank
TUT is a morpho-syntactically annotated collection of Italian sentences, which includes texts from different text genres and domains, released in several annotation formats.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 3d9b22d8ebf533aa1d6d39d417316c30900b9a0e |
 | 2278e6e770ddc4a8eea5e045c4a77a5df2ae0977 |
 | 9cf9c0a9c652b3df6564d1fa0ca97c2d7905faa3 |
 | 72a6e55627481ff99930b59714cfc0909ccf60e1 |
 | a421f488859324e3e12687b9a3067652248eb8df |
ja
CoNLL-2009 Shared Task (Japanese)
This file contains the basic information regarding the Japanese corpus provided for the CoNLL-2009 shared task on "Syntactic and Semantic Dependencies in Multiple Languages". The current version corresponds to the release of the training data sets.
The data of this distribution uses portions of the Kyoto University Text Corpus 4.0. The Kyoto University Text Corpus is freely available at http://nlp.kuee.kyoto-u.ac.jp/nl-resource/corpus-e.html.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 8c96a1eda2527a9ba1bf37dd4125cc6af11e7dd4 |
la
Ancient Greek and Latin Dependency Treebank (Latin)
The Ancient Greek and Latin Dependency Treebank (AGLDT) is the earliest treebank for Ancient Greek and Latin. The project started at Tufts University in 2006 and is currently under development and maintenance at Leipzig University and Tufts University.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | fb6f31be27fed5efbcd4c2e1e64c50de470364b1 |
 | 140eee6d2e3e83745f95d3d5274d9e965d898980 |
nb
Norwegian Dependency Treebank (Norwegian Bokmål)
The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referansegrammatikk ('Norwegian Reference Grammar').
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | ae02a3ca7e000d6cc98f07d3a8aa017f38900499 |
 | 97935c225f98119aa94d53f37aa64762cba332f3 |
nfi
FinnTreeBank
The FinnTreeBank project is creating a treebank and a parsebank for Finnish. This work is licensed under a Creative Commons Attribution 3.0 license.
The first and second versions of the treebank are annotated by hand and based on 17,000 model sentences in the Large Grammar of Finnish (VISK - Iso suomen kielioppi). Brief samples of text from other sources, e.g. news items and literature, are also available in the second version. A parsebank for Finnish based on the Europarl and JRC-Acquis corpora will be available in June 2012.
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | aaf1a43d7cf20483321212f54bff33132a070ec0 |
 | 7c58064bf9995980cea08e84035c0414adc54f06 |
nl
Alpino2conll
Training and test datasets for Dutch in retagged CoNLL format. The data was converted from Alpino XML into CoNLL format based on an adapted version of Erwin Marsi’s conversion software, but PoS tags were replaced by automatically assigned Alpino tags.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | f5e1517383f4489c8cb0c75ad202ac57c21874fc |
 | c055154ae56dfa8c29d304ed852af90aedf00a5d |
CoNLL-2002 NER Shared Task Data (Dutch)
This is the data from the CoNLL-2002 shared task on language independent named entity recognition. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data was annotated as a part of the Atranos project (http://atranos.esat.kuleuven.ac.be/) at the University of Antwerp.
(This description has been sourced from the README file included with the corpus).
Artifact | SHA1 |
---|---|
 | 686ef8fed3125a1d8aefe1351ff0e619fe9c34cb |
nn
Norwegian Dependency Treebank (Norwegian Nynorsk)
The Norwegian Dependency Treebank (NDT) consists of text which is manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar. With a few exceptions, the syntactic analysis follows Norsk referansegrammatikk ('Norwegian Reference Grammar').
(This description has been sourced from the dataset website).
Artifact | SHA1 |
---|---|
 | ae02a3ca7e000d6cc98f07d3a8aa017f38900499 |
 | 97935c225f98119aa94d53f37aa64762cba332f3 |
pl
Polish Constituency Treebank
The Polish constituency treebank (Składnica frazowa), version 0.5. Trees in the Tiger XML format containing only parse trees selected by dendrologists (one interpretation per sentence).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 8b0cb355ed76e07cc7c876fec58341c2940cfee7 |
 | c8977d436d218b726d657224305bced178071dcf |
Polish Dependency Bank
The dependency treebank (Składnica zależnościowa), version 0.5, is a result of an automatic conversion of manually disambiguated constituency trees into dependency structures.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 8b0cb355ed76e07cc7c876fec58341c2940cfee7 |
 | 187424608e91b271957dabcf140a7274f1c88d63 |
pt
CoNLL-2006 Shared Task (Portuguese)
This is the Portuguese part of the CoNLL-X Shared Task. The data was derived from the Floresta Sintá(c)tica Bosque 7.3 by Sabine Buchholz.
(This description has been partially sourced from the README file included with the corpus).
We did not find license information for this dataset. One might assume the license of this dataset is equivalent to that of the Floresta Sintá(c)tica from which it was derived.
Artifact | SHA1 |
---|---|
 | 10da89fed0ecb888c8fc7fad350b1a11bb9050d7 |
 | 29e630e207c74a42e0d2999193aa25d73f262920 |
 | fabcfbd73a531e21786af9b8233f1a4aa78dfddb |
 | e399cdc1203df1ff43816f3f934223cb9a625992 |
sl
JOS - jos100k
The jos100k corpus contains 100,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a reference annotated corpus of Slovene: its manually validated annotations cover three levels of linguistic description.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 23e82cd9f77862b5a26bf268aba9822784a9ab6a |
 | 9f330ffd102cc5d5734fdaecbbf67751c84a1339 |
Slovene Dependency Treebank 0.1
The Slovene Dependency Treebank project built a small syntactically annotated corpus of Slovene texts. The corpus was annotated with dependency analyses, taking the Prague Dependency Treebank as the model. The Slovene Dependency Treebank is annotated with Analytic Tree Structures and contains a part of the morphosyntactically annotated Slovene component of the parallel MULTEXT-East corpus, i.e. the first third of the Slovene translation of the novel "1984" by G. Orwell, containing 30,000 words.
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 2bd85ad77c35d0c305a6afb7ee092676d5d22a35 |
Slovene Dependency Treebank 0.4
This is the preliminary release of the Slovene Dependency Treebank, SDT V0.4, which contains the Prague Dependency Treebank-like annotation of the first part of the Slovene translation of Orwell's "1984", taken from the MULTEXT-East parallel corpus, V3.0 (cf. http://ufal.mff.cuni.cz/pdt/, http://nl.ijs.si/ME/V3/, http://nl.ijs.si/ME/V3/doc/index.html#mtev3-doc-div2-id2305296).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | 9d047377eb96aa896461544cd1117b11b812809f |
 | 16cfa8a20ebf8ed0e4f13c0119c7aa76a2498b1f |
sv
Talbanken05 DEP
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
Dep: Dependency structure annotation (CoNLL-X shared task format in UTF-8).
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |
Talbanken05 DPS
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
DPS: Deepened phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |
Talbanken05 FPS
Talbanken05 is a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).
FPS: Flat phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
(This description has been sourced from the corpus website).
Artifact | SHA1 |
---|---|
 | bc836ab364ba37522e2989481104bad2eb96a92e |