The document targets users employing the DKPro Core framework to build analysis pipelines.

Introduction

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many powerful and state-of-the-art NLP tools are already freely available in the NLP research community. New and improved tools are being developed and released continuously. The tools cover the whole range of NLP-related processing tasks. DKPro Core provides UIMA components wrapping these tools so they can be used interchangeably in processing pipelines. DKPro Core builds heavily on uimaFIT which allows for rapid and easy development of UIMA-based NLP processing pipelines.

What can DKPro Core do for me?

Many NLP tasks require a dataset that has been preprocessed with several other NLP tools. For example, a corpus for coreference resolution must first be processed to break the text into words and sentences (segmentation), to add part-of-speech labels (POS tagging), and to identify noun phrases (chunking). Carefully developed state-of-the-art NLP tools exist for each of these preprocessing tasks, but when they are used independently, their input and output formats may not line up well, and if annotations are not stored as offsets, information may be lost.

DKPro Core integrates many state-of-the-art NLP tools as uimaFIT components so that they can be seamlessly combined into an experiment pipeline.

Check out the dkpro-core examples repository to see some working Java examples.

Setup

OS X

DKPro Core should work fine on OS X without any additional setup.

Linux

Some DKPro Core modules may make use of native executables. Since some executables are only available as 32-bit binaries, you may have to install the 32-bit libraries on a 64-bit Linux system. For example, on Ubuntu this can be done using the following commands:

Ubuntu 12.04 LTS and before (64-bit)
$ sudo apt-get install ia32-libs
Ubuntu 13.04 (64-bit) and later
$ sudo apt-get install lib32z1

In both cases, you will have to enter your password, and you should restart your computer afterwards. For other Linux distributions, please consult your distribution's documentation on how to install the respective libraries.

Windows

Some DKPro Core modules may make use of native executables. In order to use them, you may have to install the Microsoft Visual C++ 2010 SP1 Redistributable Package. Since some executables are only available as 32-bit binaries, you may have to install the 32-bit version of this package even if you have a 64-bit Windows.

Analytics

Segmentation

Tokenization and sentence boundary detection

Tokenization and sentence boundary detection are usually realized in a single segmenter component in DKPro Core. The reason is that in some cases a sentence splitter needs to run before a tokenizer, and in other cases it is the other way around. By always running both steps in a single component, the user does not have to worry about the order in which the steps are run internally. However, it is still possible to configure segmenters such that they produce only token annotations or only sentence annotations.

DKPro Core assumes that tokens are non-overlapping. Many components additionally assume that tokens are located inside sentences and that they do not extend beyond the sentence boundaries.
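
For illustration, a segmenter can be limited to producing only sentence annotations roughly as follows. This is a sketch using the BreakIteratorSegmenter; the PARAM_WRITE_TOKEN and PARAM_WRITE_SENTENCE switches are assumed to be available on the segmenter you use (check the component reference).

Example: segmenter producing only sentence annotations (sketch)
// Assumes: import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
AnalysisEngineDescription sentenceSplitter = createEngineDescription(
        BreakIteratorSegmenter.class,
        BreakIteratorSegmenter.PARAM_WRITE_TOKEN, false,    // assumed switch: do not create Token annotations
        BreakIteratorSegmenter.PARAM_WRITE_SENTENCE, true); // assumed switch: create Sentence annotations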

Normalizing tokens

Some tokenizers apply a basic normalization of the token text. Since UIMA does not allow changing the document text, the Token annotation has a feature called form which can be used to store the normalized form of the token text. The method setText should be used to set the form of a token; it internally determines whether a TokenForm annotation is necessary (i.e. whether the form differs from the covered text of the token) or not. Likewise, the method getText should be used to retrieve the token text that should be used for further analysis. The method getText should not be used by writers - they should write the original document text (i.e. getCoveredText) unless there is a special field in the output for the normalized token text.
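
As a minimal sketch (assuming a JCas in which a segmenter has already created Token annotations), the normalized form can be set and read back as follows; getCoveredText always returns the original document text.

Example: setting and reading the token form (sketch)
// Assumes: import static org.apache.uima.fit.util.JCasUtil.select;
// Token is de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token
for (Token token : select(jcas, Token.class)) {
    if ("``".equals(token.getCoveredText())) {
        token.setText("\"");                       // a TokenForm is only created if the form differs
    }
    String analysisText = token.getText();         // normalized text to be used for further analysis
    String originalText = token.getCoveredText();  // original document text - what writers should output
}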

Headings

Headings typically do not end with a sentence end marker (i.e. a full stop). As a consequence, segmenters can be confused and consider a heading and the first sentence of the following paragraph to be a single sentence. This can be resolved by adding Div annotations (or one of their subtypes) to the document and configuring the segmenters to respect them. Such annotations can be obtained, for example, by using:

  • a reader component that creates Paragraph and Heading annotations (e.g. the PdfReader)

  • an analysis component that detects Paragraph boundaries (e.g. the ParagraphSplitter)

Segmenters are by default configured to respect Div-type annotations (the default value for PARAM_ZONE_TYPES is Div). This means that segmenters will ensure that sentences and tokens do not overlap with zone boundaries. If desired, PARAM_STRICT_ZONING can be set to true (default: false) to ensure that tokens and sentences are only created within the boundaries of the zone types.
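
For example, a segmenter could be restricted to creating tokens and sentences only within headings and paragraphs roughly as follows. This is a sketch: PARAM_ZONE_TYPES takes type names, and the Div subtypes from the DKPro Core structure type system are assumed here.

Example: strict zoning with Paragraph and Heading zones (sketch)
// Assumes: import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
AnalysisEngineDescription segmenter = createEngineDescription(
        BreakIteratorSegmenter.class,
        BreakIteratorSegmenter.PARAM_ZONE_TYPES, new String[] {
                "de.tudarmstadt.ukp.dkpro.core.api.structure.type.Paragraph",
                "de.tudarmstadt.ukp.dkpro.core.api.structure.type.Heading" },
        BreakIteratorSegmenter.PARAM_STRICT_ZONING, true); // only segment inside the listed zones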

Hyphenation

Hyphenated words cannot be properly processed by NLP tools. The HyphenationRemover component can be used to join hyphenated words.

To begin selecting tools to use in your pipeline, you may wish to use the same tools you were running on your data before switching to uimaFIT, or use the same tool specifications as others working on your NLP task. With DKPro Core, it is easy to import the exact tool version and model version you need to replicate the work of others. Or, you may wish to conduct a comparison of multiple tools with the same function, to see which performs best on your data. With DKPro Core, it is simple to switch between multiple tools with the same function, or between multiple models for a tool.

Because DKPro Core does not alter any of the integrated tools, however, it is up to the user to make sure that sequential components or models use matching tokenizations or tagsets, and that the component or model is appropriate for the user’s data (e.g., a POS tagger trained on news text may perform badly when applied to Twitter data).

Lemmatization

Lemmatizing multiwords

If you use lemma information in your pipeline, you should bear in mind that multiword expressions, in particular discontinuous multiwords, might not be lemmatized as one word (or expression); rather, each part of the multiword might be lemmatized separately. In languages such as German, there are verbs with separable particles, such as anfangen (an occurs separately from fangen in certain sentence constructions). Therefore - depending on your use case - you might consider postprocessing the output of the lemmatizer in order to obtain the true lemmas (which you might need, e.g. in order to look up information in a lexical resource).

Dictionaries and other lexical resources

If you use components in your pipeline that access dictionaries or other lexical resources, it might be essential to include a Lemmatizer in your pipeline: Many dictionaries and well-known lexical resources such as WordNet require at minimum a lemma form as a search word in order to return information about that word. For large-scale lexical resources, e.g. for Wiktionary, additional information about POS is very helpful in order to reduce the ambiguity of a given lemma form.
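
As an illustration, a preprocessing pipeline that provides POS tags and lemmas before a lexical resource lookup might be assembled roughly as follows. This is only a sketch: the MateLemmatizer from the matetools module is used as an example lemmatizer, and the final lookup component is left as a placeholder.

Example: providing lemmas before a lexical resource lookup (sketch)
// Assumes the usual uimaFIT static imports (createReaderDescription, createEngineDescription, runPipeline).
CollectionReaderDescription reader = createReaderDescription(
        TextReader.class,
        TextReader.PARAM_SOURCE_LOCATION, "input/**/*.txt",
        TextReader.PARAM_LANGUAGE, "en");

runPipeline(reader,
        createEngineDescription(BreakIteratorSegmenter.class),
        createEngineDescription(OpenNlpPosTagger.class),  // POS tags reduce lemma ambiguity
        createEngineDescription(MateLemmatizer.class)     // provides Lemma annotations
        /* add your dictionary / lexical resource component here */);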

Compatibility of Components

When selecting components for your pipeline you should make sure that the components are compatible regarding the annotation types they expect or offer.

  • if a component expects an annotation type that is not provided by the preceding component, that may lead to an error or simply to no results

  • if a component (e.g. a reader which adds sentence annotations) provides an annotation that is added again by a subsequent component (e.g. a segmenter), this will result in undefined behaviour of other components when they iterate over annotations that have been added more than once.

To check whether components are compatible, you can look at the @TypeCapability annotation which is available in most DKPro Core components. Mind that many components can be configured with regard to which types they consume or produce, so the @TypeCapability should be taken as a rough indicator, not as definitive information. It is also important to note that the @TypeCapability does not say anything about the tagset being consumed or produced by a component. E.g. if a POS tagger uses a model that produces POS tags from tagset X and a dependency parser uses a model that requires POS tags from tagset Y, then the two models are not semantically compatible - even though the POS tagger and dependency parser components are compatible on the level of the type system.

Morphologically Rich Languages

  • Parsing: Morphologically rich languages (e.g. Czech, German, and Hungarian) pose a particular challenge to parser components (Tsarfaty et al. 2013).

  • Morphological analysis: for languages with case syncretism (displaying forms that are ambiguous regarding their case, e.g. Frauen in German can be nominative or genitive or dative or accusative), it might be better to leave case underspecified at the morphosyntactic level and leave disambiguation to the components at the syntactic level. Otherwise errors might be introduced that will then be propagated to the next pipeline component (Seeker and Kuhn 2013).

Domain-specific and other non-standard data

Most components (sentence splitters, POS taggers, parsers, etc.) are trained on (standard) newspaper text. As a consequence, you might encounter a significant performance drop if you apply the components to domain-specific or other non-standard data (scientific abstracts, Twitter data, etc.) without adaptation.

  • Tokenizing: adapting the tokenizer to your specific domain is crucial, since tokenizer errors propagate to all subsequent components in the pipeline and worsen their performance. For example, you might adapt your tokenizer to become aware of emoticons or chemical formulae in order to process social media data or text from the biochemical domain.
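
For example, when processing tweets, the standard segmenter can simply be swapped for a tokenizer trained on social media text. The sketch below assumes the ArktweetTokenizer from the arktools module; check the component reference for the exact name and module of the Twitter-aware tokenizer you want to use.

Example: swapping the segmenter for a Twitter-aware tokenizer (sketch)
// Assumes: import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
AnalysisEngineDescription newsSegmenter = createEngineDescription(BreakIteratorSegmenter.class);
AnalysisEngineDescription tweetSegmenter = createEngineDescription(ArktweetTokenizer.class); // assumed component
// Use tweetSegmenter instead of newsSegmenter when building a pipeline for social media data.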

Shallow processing and POS tagsets

While more advanced semantic processing (e.g. discourse analysis) typically depends on the output of a parser component, there might be settings where you prefer to perform shallow processing (i.e. POS tagging and chunking).

For shallow processing, it might be necessary to become familiar with the original POS tagsets of the POS taggers rather than relying on the uniform, but coarse-grained DKPro Core POS tags (because the original fine-grained POS tags carry more information).

Although many POS taggers for a given language are trained on the same POS tagset (e.g. the Penn Treebank tagset for English, the STTS tagset for German), the individual POS taggers might output variants of this tagset. You should be aware that, in the DKPro Core version of the tagger, the original POS tagger output may have been mapped to a version that is compatible with the corresponding original tagset. (Example)
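
After POS tagging, both the original fine-grained tag and the coarse-grained DKPro Core category are accessible on each POS annotation, as in the following sketch:

Example: accessing fine-grained and coarse-grained POS information (sketch)
// Assumes: import static org.apache.uima.fit.util.JCasUtil.select;
// POS is de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS
for (POS pos : select(jcas, POS.class)) {
    String originalTag = pos.getPosValue();               // fine-grained tag from the original tagset, e.g. "NNS"
    String coarseCategory = pos.getType().getShortName(); // coarse-grained DKPro Core category, e.g. "NN"
}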

Adding components as dependencies (Maven)

In order to start using an integrated tool from DKPro Core, we can add it as a Maven dependency to our experiment.

As an example, we take the OpenNlpPosTagger component. To make it available in a pipeline, we add the following dependency to our POM file:

<properties>
  <dkpro.core.version>1.9.2</dkpro.core.version>
</properties>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>${dkpro.core.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-asl</artifactId>
  </dependency>
</dependencies>

The dependency on DKPro Core declared in the dependency management section fixes the version of all DKPro Core dependencies that are added to the POM. Hence, it is not necessary to declare the version for each dependency. When upgrading to a new DKPro Core version, it is sufficient to change the value of the dkpro.core.version property in the properties section.

If you use a multi-module project, the properties and dependencyManagement sections should go into the parent-pom of your project, while the dependencies section should be added to the respective module requiring the dependency.
If you want to use GPLed components, you have to add an additional dependency declaration in the dependency management section referring to the de.tudarmstadt.ukp.dkpro.core-gpl artifact.

Adding resources as dependencies (Maven)

Most components (i.e., tools such as OpenNlpPosTagger) require resources such as models (e.g. opennlp-model-tagger-en-maxent) in order to operate. Since components and resources are versioned separately, it can be non-trivial to find the right version of a resource for a particular version of a component. For this reason, DKPro Core components each maintain a list of resources known to be compatible with them. This information can be accessed in a Maven POM, thus avoiding the need to manually specify the version of the models. Consequently, when you upgrade to a new version of DKPro Core, all models are automatically upgraded as well. This is usually the desired solution, although it can mean that your pipelines may produce slightly different results.

As an example, we take the OpenNlpPosTagger component. In the previous section, we have seen how to make it available in a pipeline. Now we also add the model for English.

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
  </dependency>
</dependencies>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-asl</artifactId>
      <version>${dkpro.core.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

The dependency on the DKPro Core OpenNLP module declared in the dependency management section fixes the version of all known OpenNLP models. Thus, it is not necessary to declare a version on each model dependency. When upgrading to a new DKPro Core version, it is sufficient to change the value of the dkpro.core.version property in the properties section.

Models are presently maintained in a separate repository that needs to be explicitly added to your POM:

<repositories>
  <repository>
    <id>ukp-oss-model-releases</id>
    <url>http://zoidberg.ukp.informatik.tu-darmstadt.de/artifactory/public-model-releases-local</url>
  </repository>
</repositories>
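
With the component and model dependencies in place, the tagger can be used in a pipeline right away. The following sketch (using uimaFIT) reads plain text files, segments them, and tags them; the English maxent model declared above is resolved automatically based on the document language.

Example of using the OpenNlpPosTagger in a pipeline (sketch)
// Assumes the usual uimaFIT static imports:
//   import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
//   import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
//   import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;
CollectionReaderDescription reader = createReaderDescription(
        TextReader.class,
        TextReader.PARAM_SOURCE_LOCATION, "input/**/*.txt",
        TextReader.PARAM_LANGUAGE, "en");

runPipeline(reader,
        createEngineDescription(BreakIteratorSegmenter.class),
        createEngineDescription(OpenNlpPosTagger.class));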

I/O

This section gives an overview of the I/O components. The components are organized into one module per file format. Each module typically contains one reader and/or one writer component.

All readers initialize the CAS with a DocumentMetaData annotation.

Most readers and writers do not support all features of the respective formats. Additionally, readers and writers may only support a specific variant of a format.

Reading data

DKPro Core aims to provide a consistent API for reading and writing annotated data. Most of our readers and writers are implemented as resource readers (RR) and file writers (FW), respectively, and support a common set of parameters which are explained below.

Table 1. Resource reader parameters

PARAM_SOURCE_LOCATION (optional)
  Location to read from.

PARAM_PATTERNS (optional)
  Include/exclude patterns.

PARAM_USE_DEFAULT_EXCLUDES (default: true)
  Enable default excludes for versioning systems like Subversion, git, etc.

PARAM_INCLUDE_HIDDEN (default: false)
  Include hidden files.

PARAM_LANGUAGE (optional)
  Two-letter ISO language code.

Either PARAM_SOURCE_LOCATION or PARAM_PATTERNS or both must be set.

Read all files in the folder files/texts
PARAM_SOURCE_LOCATION, "files/texts"
Recursively read all .txt files in the folder files/texts (embedded pattern)
PARAM_SOURCE_LOCATION, "files/texts/**/*.txt"
Recursively read all .txt files in the folder files/texts (detached pattern)
PARAM_SOURCE_LOCATION, "files/texts"
PARAM_PATTERNS, "*.txt"
Excluding some files (detached pattern)
PARAM_SOURCE_LOCATION, "files/texts"
PARAM_PATTERNS, new String[] {"*.txt", "[-]broken*.txt"}
Read from the classpath
PARAM_SOURCE_LOCATION, "classpath*:texts/*.txt"

Writing data

Table 2. File writer parameters

PARAM_TARGET_LOCATION (mandatory)
  Location to write to.

PARAM_COMPRESSION (default: NONE)
  Compression algorithm to use when writing output. A file suffix is automatically added depending on the algorithm. Supported are: NONE, GZIP, BZIP2, and XZ (see class CompressionMethod).

PARAM_FILENAME_EXTENSION (default depends on the writer)
  Append this extension to file names. If PARAM_STRIP_EXTENSION is set to true, the original extension is replaced.

PARAM_STRIP_EXTENSION (default: false)
  Whether to remove the original file extension when writing. E.g. with the XmiWriter, an input file MyText.txt would be written as MyText.txt.xmi without extension stripping and as MyText.xmi with stripping.

PARAM_USE_DOCUMENT_ID (default: false)
  Use the document ID as the file name, even if an original file name is present in the document URI.

PARAM_ESCAPE_DOCUMENT_ID (default: false)
  Escape the document ID in case it contains characters that are not valid in a file name.

PARAM_SINGULAR_TARGET (default: false)
  Treat the target location as a single file name.

PARAM_OVERWRITE (default: false)
  Allow overwriting existing files.
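
For example, an XmiWriter could be configured with some of these common parameters roughly as follows (a sketch; the exact set of supported parameters may vary between writers):

Example: configuring a file writer (sketch)
// Assumes: import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
AnalysisEngineDescription writer = createEngineDescription(
        XmiWriter.class,
        XmiWriter.PARAM_TARGET_LOCATION, "target/xmi",
        XmiWriter.PARAM_STRIP_EXTENSION, true,  // MyText.txt is written as MyText.xmi
        XmiWriter.PARAM_OVERWRITE, true);       // replace existing output files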

Working with ZIP archives

Most formats can be read from and written to ZIP archives.

Read from a ZIP archive
PARAM_SOURCE_LOCATION, "jar:file:archive.zip!texts/**/*.txt"

Most file writers write multiple files, so PARAM_TARGET_LOCATION is treated as a directory name. A few only write a single file (e.g. NegraExportWriter), in which case the parameter is treated as the file name. Instead of writing to a directory, it is possible to write to a ZIP archive:

Write to a ZIP archive
PARAM_TARGET_LOCATION, "jar:file:archive.zip"
Write to a folder inside a ZIP archive
PARAM_TARGET_LOCATION, "jar:file:archive.zip!folder/within/zip"
It is not possible to write into an existing ZIP file; a new file is created in every case. If a ZIP file with the same name already exists, it is overwritten.
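
A complete round trip that reads text files and writes the analysis results as XMI into a ZIP archive could thus look roughly as follows (sketch):

Example: writing analysis results into a ZIP archive (sketch)
// Assumes the usual uimaFIT static imports (createReaderDescription, createEngineDescription, runPipeline).
runPipeline(
        createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "files/texts/**/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
        createEngineDescription(BreakIteratorSegmenter.class),
        createEngineDescription(XmiWriter.class,
                XmiWriter.PARAM_TARGET_LOCATION, "jar:file:archive.zip"));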

Models and Resources

Packaging models

Most models used by DKPro Core are available through our Maven repository. However, in some cases, we cannot redistribute the models. For these cases, we provide Ant-based build.xml scripts that automatically download and package the models for use with DKPro Core.

For any given module supporting packaged resources, there is always the build.xml in SVN trunk as well as the ones from previous releases (in the tags folder). Which one should you use?

You should always use only the build.xml files belonging to the version of DKPro Core that you are using. From time to time, we change the metadata within these files, and DKPro Core may be unable to properly resolve models belonging to a different version of DKPro Core. The files are contained in the src/scripts folder of the respective modules in SVN. We do not ship the build.xml files in any other way than via SVN.

That said, it might be necessary to modify a build.xml file if it refers to files that have changed upstream. E.g. the TreeTagger models tend to change without their name or version changing. Also, upstream files may sometimes become unavailable. In such cases, you have to update the MD5 hash for the model in the build.xml file or even comment the model out entirely.

In case you need to update the MD5 sum, you should also update the upstreamVersion to correspond to the date of the new model. A good way to determine the date of the latest change is using the curl tool, e.g.:

curl -I http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin

In the output, locate the Last-Modified line:

Last-Modified: Thu, 09 Sep 2010 06:57:11 GMT

So, here the upstreamVersion for en-pos-maxent.bin should be set to 20100909 (YYYYMMDD).

Datasets

Datasets are an important asset in NLP, e.g. to train models or to evaluate them in a meaningful and comparable way. The DatasetFactory of DKPro Core provides a convenient way of obtaining and using standard datasets. This section explains how to use it and how to describe datasets so that they can be obtained through the DatasetFactory.

Usage

The DatasetFactory class provides uniform access to datasets that are automatically downloaded from their providers' websites, verified, and cached locally. This facilitates the use of datasets, e.g. for training or evaluating models.

Example of obtaining and using a dataset in Java code
// Obtain dataset
DatasetFactory loader = new DatasetFactory(cache);
Dataset ds = loader.load("gum-en-conll-2.3.2");
Split split = ds.getSplit(0.8);

// Train model
System.out.println("Training model from training data");
CollectionReaderDescription trainReader = createReaderDescription(
        Conll2006Reader.class,
        Conll2006Reader.PARAM_PATTERNS, split.getTrainingFiles(),
        Conll2006Reader.PARAM_USE_CPOS_AS_POS, true,
        Conll2006Reader.PARAM_LANGUAGE, ds.getLanguage());

Integrated Datasets

An overview of the datasets already integrated with DKPro Core can be found in the Dataset Reference.

Describing Datasets

Datasets are described by a YAML 1.1 file. YAML is very human-readable and allows us to embed markup in asciidoc format without any trouble.

Coordinates
groupId

Used to group multiple related datasets together, e.g. datasets from a shared task.

datasetId

A unique name of the dataset within the group.

version

The version of the dataset. If there is no official version, then a date in YYYYMMDD notation is used. Typically this is the date of the most recent file in the dataset - or, if the dataset contains only one file, the date of the latest change of the file on the remote host. The remote file date can be obtained using curl -I <url>.

language

The ISO 639-1 two-letter language code for the dataset language. If an upstream source provides a dataset in multiple languages, multiple dataset descriptions should be created, one per language.

Informational metadata
name

The official name of the dataset.

url

Link to a website where information about the dataset can be obtained.

attribution

Some kind of reference to the authors or to a related publication. Asciidoc markup can be used to format the reference or to embed links to publications or bibtex files.

description

A short description of the dataset, typically obtained from the dataset’s website or from a readme file that ships with the dataset. The source should be stated as part of the description.

licenses

License information relevant to the dataset. A list of relevant licenses can be provided. Each license should state a name and url. The url should point to a canonical description of the license. Additionally, a comment can be provided, e.g. to indicate whether a license applies to the annotations or to the underlying text.

Example license section
licenses:
  - name: CC-BY 2.5
    url: http://creativecommons.org/licenses/by/2.5/
    comment: "Wikinews texts (Source: https://en.wikinews.org/wiki/Wikinews:Copyright)"
  - name: CC-BY-SA 3.0
    url: https://creativecommons.org/licenses/by-sa/3.0/
    comment: "WikiVoyage texts (Source: https://wikimediafoundation.org/wiki/Terms_of_Use)"
artifacts

A list of artifacts that make up the dataset. The relevant artifacts are not limited to the data files themselves; they may also include license texts or readme files if these are not part of a dataset archive. If a dataset is not distributed as an archive but rather as a set of files, each of the files should be listed here. To describe an artifact, the name, url, and sha1 checksum are required. The name of the artifact should correspond to the filename part of the URL from which the artifact is downloaded. However, sometimes it is convenient to use a simpler name, e.g. data.zip. In any case, the extension should always be preserved; this is particularly important for archives that need to be extracted. For more information, refer to the [sect_datasets_actions] section below.
If an artifact contains multiple datasets, it can be shared to avoid downloading and caching it redundantly. See [sect_datasets_sharing] below.

Example artifacts section
artifacts:
  gum.zip:
    url: "https://github.com/amir-zeldes/gum/archive/V2.2.0.zip"
    sha1: b17e276998ced83153be605d8157afacf1f10fdc
    actions:
      - action: explode
        configuration: { includes: ["dep/*", "LICENSE.txt", "README.md"], strip: 1 }
roles

Defines the roles of the files in the dataset. Here, files can refer to an artifact file or to files extracted from an artifact that is an archive. Mind that the paths are specified relative to the root of the dataset cache. So files extracted from an archive must be prefixed by the archive name (without any extensions). The role license should be assigned to all files containing licensing information. The role data should be assigned to all data files. If the dataset is already split into training, test, and/or development sets, then these should be indicated by assigning the training, testing, and development roles to these files. In this case, assigning the data role is not necessary.

Example roles section
roles:
  # Here the files have been extracted from an artifact named "data.zip", so the names
  # are all prefixed with "data/"
  licenses:
    - data/license-salsa.html
    - data/license-tiger.html
  training:
    - data/CoNLL2009-ST-German-train.txt
  development:
    - data/CoNLL2009-ST-German-development.txt
  testing:
    - data/CoNLL2009-ST-German-trial.txt
Description of a dataset
# Dataset coordinates - they uniquely identify the dataset
groupId: org.dkpro.core.datasets.conll2009
datasetId: conll2009
version: 1.1
language: de
mediaType: text/x.org.dkpro.conll-2009
encoding: UTF-8

# Informational metadata
name: CoNLL-2009 Shared Task (German)
url: http://ufal.mff.cuni.cz/conll2009-st/
attribution: Yi Zhang, Sebastian Pado
description: |
  This dataset contains the basic information regarding the German corpus
  provided for the CoNLL-2009 shared task on "Syntactic and Semantic
  Dependencies in Multiple Languages"
  (http://ufal.mff.cuni.cz/conll2009-st/). The data of this distribution
  is derived from the TIGER Treebank and the SALSA Corpus, converted
  into the syntactic and semantic dependencies compatible with the
  CoNLL-2009 shared task.

  (This description has been sourced from the README file included with the corpus).

# Indicative license information
licenses:
  - name: TIGER Corpus License
    url: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/index.html
  - name: SALSA Corpus License
    url: http://www.coli.uni-saarland.de/projects/salsa/corpus/doc/license.html

# List of artifacts for the dataset
artifacts:
  # Name of the artifact
  data.zip:
    # URL of the artifact
    url: "http://ufal.mff.cuni.cz/conll2009-st/data/CoNLL2009-ST-German-traindevB.zip"
    # Checksum used to validate download and cache integrity
    sha1: ad4c03c3c4e4668c8beb34c399e71f539e6d633d
    actions:
      - action: explode                # Extract archive after downloading
        configuration: { strip: 1 }    # Remove one leading path element while extracting

roles:
  licenses:
    - data/license-salsa.html
    - data/license-tiger.html
  training:
    - data/CoNLL2009-ST-German-train.txt
  development:
    - data/CoNLL2009-ST-German-development.txt
  testing:
    - data/CoNLL2009-ST-German-trial.txt

Artifact Actions

Some artifacts need to be post-processed after downloading. E.g. if an artifact is an archive, it typically needs to be extracted to disk so that the data files can be conveniently accessed. Such post-processing can be declared via the actions section of an artifact description.

explode

This action extracts an archive. The archive format is automatically detected (ensure that the artifact name extension is retained if you assign an alternative name for an archive). Supported formats are: zip, jar, tar, tar.gz (tgz), tar.bz2, tar.xz, xz, rar, 7z, … (and possibly more formats supported by Apache Commons Compress).

The explode action can be specified multiple times on an artifact, e.g. to extract nested archives. In this case, for all but the first, the additional parameter file needs to be specified.

Files are always extracted into a subfolder of the dataset cache which has the name of the artifact without any extensions. E.g. if the artifact name is data.zip, then its contents are extracted into a folder called data. This is important when assigning roles to extracted files.

The action supports optional configuration parameters.

strip

Number of leading path elements to strip while extracting archive. For example, setting this parameter to 2 strips corpus/en/train.conll down to train.conll. This is useful to avoid long path names when assigning file roles.

includes

This can be a single pattern string or a list of pattern strings. Supported wildcards are single asterisks to match arbitrary parts of a file or folder name, and double asterisks to match arbitrary intermediate folders. Mind that the patterns are relative to the archive root and that stripping is performed before matching the files. So, taking the above example for the strip parameter, the include pattern should be *.conll and not corpus/en/*.conll.

excludes

Same patterns as for includes. If includes and excludes are both specified, the excludes apply only to the included files. The order in which the includes/excludes are specified in the YAML file is irrelevant.

file

When extracting nested archives, this parameter points to the archive to be exploded. E.g. if the result of extracting the archive data.zip is a set of new files data/part1.zip and data/part2.zip, then these nested archives can be exploded using additional explode actions that specify the file parameter.

Example explode configuration
artifacts:
  gum.zip:
    url: "https://github.com/amir-zeldes/gum/archive/V2.2.0.zip"
    sha1: b17e276998ced83153be605d8157afacf1f10fdc
    actions:
      - action: explode
        configuration: { includes: ["dep/*", "LICENSE.txt", "README.md"], strip: 1 }

Sharing artifacts

If an artifact contains multiple datasets, it can be shared to avoid downloading and caching it redundantly. A typical example is a dataset that comes in multiple languages, because we require a separate dataset description for each language variant.

To mark an artifact as shared, simply add shared: true to its description.

Example shared artifact
perseus.zip:
  url: "https://github.com/PerseusDL/treebank_data/archive/f56a35f65ef15ac454f6fbd2cfc6ea97bf2ca9b8.zip"
  sha1: 140eee6d2e3e83745f95d3d5274d9e965d898980
  shared: true # Artifact is shared
  actions:
    - action: explode
      configuration: { strip: 1, includes: [ "README.md", "v2.1/Greek/**/*" ] }
The name, url, sha1, and shared settings must be exactly the same in all dataset descriptions that make use of the shared artifact. The actions can differ, so e.g. each dataset description for a multi-lingual dataset can extract only the files of a specific language from the shared artifact.

Shared artifacts are stored under a special folder structure in the dataset cache that includes the checksum of the artifacts. However, files extracted from a shared archive artifact are placed under the folder of the dataset along with any additional unshared artifacts that the dataset might declare.

Example folder structure for shared artifacts
<CACHE-ROOT>
  shared
    140eee6d2e3e83745f95d3d5274d9e965d898980    <=- shared artifact folder
      perseus.zip                               <=- shared artifact
  perseus-la-2.1                                <=- dataset folder
    perseus                                     <=- extracted from shared artifact
      v2.1
        ...
    LICENSE.txt                                 <=- unshared artifact

Registering Datasets

In order to obtain a dataset through the DatasetFactory, the dataset must be made discoverable. This is a two-step process:

  1. Placing the dataset YAML file into the classpath

  2. Creating a datasets.txt file pointing to the dataset YAML file

This example assumes that you are using Maven or at least the standard Maven project layout.

First place your dataset YAML file into a subfolder of the src/main/resources folder in your project, e.g. src/main/resources/my/experiment/datasets/my-dataset-en-1.0.0.yaml. Replace my/experiment/datasets with a package name suitable for your project, following the conventions that you also use for your Java classes.

Second, create the datasets.txt file at src/main/resources/META-INF/org.dkpro.core/datasets.txt in your project with the following content:

datasets.txt example file
classpath*:my/experiment/datasets/my-dataset-en-1.0.0.yaml

Again, substitute my/experiment/datasets suitably, as you did before. If you have multiple datasets, you can add multiple lines to the file. It is possible to use wildcards, e.g. *.yaml, but it is not recommended to do so.

Now, the dataset should be available through the DatasetFactory.

Example of obtaining and using a dataset in Java code
DatasetFactory loader = new DatasetFactory(cache);
Dataset ds = loader.load("my-dataset-en-1.0.0");
Currently, the id of the dataset is the filename of the dataset YAML file without the extension. Make sure to choose a unique name to avoid name collisions!

References

Here are some further references that might be helpful when deciding which tools to use:

  • Giesbrecht, Eugenie and Evert, Stefan (2009). Part-of-speech tagging - a solved task? An evaluation of POS taggers for the Web as corpus. In I. Alegria, I. Leturia, and S. Sharoff, editors, Proceedings of the 5th Web as Corpus Workshop (WAC5), San Sebastian, Spain. PDF

  • Reut Tsarfaty, Djamé Seddah, Sandra Kübler, and Joakim Nivre. 2013. Parsing morphologically rich languages: Introduction to the special issue. Comput. Linguist. 39, 1 (March 2013), 15-22. PDF

  • Wolfgang Seeker and Jonas Kuhn. 2013. Morphological and syntactic case in statistical dependency parsing. Comput. Linguist. 39, 1 (March 2013), 23-55. PDF