This document is targeted at users who employ the DKPro Core framework to build analysis pipelines.

Introduction

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many powerful and state-of-the-art NLP tools are already freely available in the NLP research community. New and improved tools are being developed and released continuously. The tools cover the whole range of NLP-related processing tasks. DKPro Core provides UIMA components wrapping these tools so they can be used interchangeably in processing pipelines. DKPro Core builds heavily on uimaFIT which allows for rapid and easy development of UIMA-based NLP processing pipelines.

What can DKPro Core do for me?

Many NLP tasks require a dataset that has been preprocessed with several other NLP tools. For example, a corpus for coreference resolution must first be processed to break the text into words and sentences (segmentation), to add part of speech labels (POS tagging), and to identify noun phrases (chunking). Carefully developed state-of-the-art NLP tools exist for each of these preprocessing tasks, but used independently, input-output formats may not line up well, and if the annotations are not stored as offsets, information may be lost.

DKPro Core integrates many state-of-the-art NLP tools as uimaFIT components so that they can be seamlessly combined into an experiment pipeline.

Check out the dkpro-core examples repository to see some working Java examples.
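
For instance, a minimal pipeline that reads plain text files, segments them, adds part-of-speech tags, and writes the results as XMI might look roughly as follows (a sketch in the spirit of the examples repository; the input and output paths are placeholders):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class PosTaggingPipeline {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // Read plain text files (placeholder path)
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // Break the text into sentences and tokens
            createEngineDescription(OpenNlpSegmenter.class),
            // Add part-of-speech annotations
            createEngineDescription(OpenNlpPosTagger.class),
            // Write the annotated documents as XMI (placeholder path)
            createEngineDescription(XmiWriter.class,
                XmiWriter.PARAM_TARGET_LOCATION, "output"));
    }
}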

Analytics

To begin selecting tools to use in your pipeline, you may wish to use the same tools you were running on your data before switching to uimaFIT, or the same tool specifications as others working on your NLP task. With DKPro Core, it is easy to import the exact tool version and model version you need to replicate the work of others. Alternatively, you may wish to compare multiple tools with the same function to see which performs best on your data. With DKPro Core, it is simple to switch between multiple tools with the same function, or between multiple models for a tool.

However, because DKPro Core does not alter any of the integrated tools, it is up to the user to make sure that sequential components or models use matching tokenizations or tagsets, and that the component or model is appropriate for the user’s data (e.g., a POS tagger trained on news text may perform badly when applied to Twitter data).

Compatibility of Components

When selecting components for your pipeline you should make sure that the components are compatible regarding the annotation types they expect or offer.

  • if a component expects an annotation type that is not provided by the preceding component, that may lead to an error or simply to no results

  • if a component (e.g. a reader which adds sentence annotations) provides an annotation that is added again by a subsequent component (e.g. a segmenter), this results in duplicate annotations and thus in undefined behaviour of downstream components that iterate over the annotation type that was added more than once.

To check whether components are compatible, you can look at the @TypeCapability annotation which is available in most DKPro Core components. Mind that many components can be configured with regard to which types they consume or produce, so the @TypeCapability should be taken as a rough indicator, not as definitive information. It is also important to note that the @TypeCapability does not say anything about the tagset being consumed or produced by a component. For example, if a POS tagger uses a model that produces POS tags from tagset X and a dependency parser uses a model that requires POS tags from tagset Y, then the two models are not semantically compatible - even though the POS tagger and dependency parser components are compatible on the level of the type system.
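
For illustration, a POS tagger component typically declares its capabilities roughly along the following lines (MyPosTagger is a hypothetical class; the type names are the DKPro Core segmentation and POS types):

import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.TypeCapability;
import org.apache.uima.jcas.JCas;

// Hypothetical component: consumes sentences and tokens, produces POS annotations
@TypeCapability(
    inputs = {
        "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence",
        "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token" },
    outputs = {
        "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
public class MyPosTagger extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas aJCas) {
        // tagging logic would go here
    }
}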

Dictionaries and other lexical resources

If you use components in your pipeline that access dictionaries or other lexical resources, it might be essential to include a Lemmatizer in your pipeline: Many dictionaries and well-known lexical resources such as WordNet require at minimum a lemma form as a search word in order to return information about that word. For large-scale lexical resources, e.g. for Wiktionary, additional information about POS is very helpful in order to reduce the ambiguity of a given lemma form.
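
For example, a pipeline that later consults a lexical resource could run a POS tagger and a lemmatizer before the lookup step. Below is a minimal sketch, assuming the imports from the pipeline example above plus MateLemmatizer from the matetools module; reader and dictionaryLookup stand in for your actual reader and lookup component:

SimplePipeline.runPipeline(
    reader,                                           // your reader
    createEngineDescription(OpenNlpSegmenter.class),
    createEngineDescription(OpenNlpPosTagger.class),  // POS information reduces lemma ambiguity
    createEngineDescription(MateLemmatizer.class),    // adds Lemma annotations
    dictionaryLookup);                                // your lexical resource lookup component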

Lemmatizing multiwords

If you use lemma information in your pipeline, you should bear in mind that multiword expressions, in particular discontinuous multiwords, might not be lemmatized as one word (or expression); rather, each part of the multiword might be lemmatized separately. In languages such as German, there are verbs with separable particles, such as anfangen (an occurs separately from fangen in particular sentence constructions). Therefore - depending on your use case - you might consider postprocessing the output of the lemmatizer in order to obtain the true lemmas (which you might need, e.g., to look up information in a lexical resource).

Morphologically Rich Languages

  • Parsing: Morphologically rich languages (e.g. Czech, German, and Hungarian) pose a particular challenge to parser components (Tsarfaty et al. 2013).

  • Morphological analysis: for languages with case syncretism (displaying forms that are ambiguous regarding their case, e.g. Frauen in German can be nominative, genitive, dative, or accusative), it might be better to leave case underspecified at the morphosyntactic level and defer disambiguation to components at the syntactic level. Otherwise, errors might be introduced that are then propagated to the next pipeline component (Seeker and Kuhn 2013).

Domain-specific and other non-standard data

Most components (sentence splitters, POS taggers, parsers, etc.) are trained on (standard) newspaper text. As a consequence, you might encounter a significant performance drop if you apply the components to domain-specific or other non-standard data (scientific abstracts, Twitter data, etc.) without adaptation.

  • Tokenizing: adapting the tokenizer to your specific domain is crucial, since tokenizer errors propagate to all subsequent components in the pipeline and worsen their performance. For example, you might adapt your tokenizer to become aware of emoticons or chemical formulae in order to process social media data or text from the biochemical domain.

Shallow processing and POS tagsets

While more advanced semantic processing (e.g. discourse analysis) typically depends on the output of a parser component, there might be settings where you prefer to perform shallow processing (i.e. POS tagging and chunking).

For shallow processing, it might be necessary to become familiar with the original POS tagsets of the POS taggers rather than relying on the uniform, but coarse-grained DKPro Core POS tags (because the original fine-grained POS tags carry more information).

Although many POS taggers for a given language are trained on the same POS tagset (e.g. the Penn Treebank tagset for English, the STTS tagset for German), the individual POS taggers might output variants of this tagset. You should be aware that in the DKPro Core version of the tagger, the original POS tagger output may have been mapped to a variant that is compatible with the corresponding original tagset. (Example)
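
If you need the original fine-grained tags, they can be read from the PosValue feature of the POS annotations, while the coarse-grained DKPro Core category corresponds to the POS subtype. A minimal sketch, assuming a jCas that has already passed through a segmenter and a POS tagger:

import static org.apache.uima.fit.util.JCasUtil.select;

import de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS;

// Print the original (fine-grained) tag next to the coarse-grained DKPro Core category
for (POS pos : select(jCas, POS.class)) {
    System.out.printf("%s [%s -> %s]%n",
        pos.getCoveredText(),
        pos.getPosValue(),              // original tagset value, e.g. "NNS"
        pos.getType().getShortName());  // mapped DKPro Core type, e.g. "NN"
}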

Headings

Headings typically do not end with a sentence end marker (i.e. a full stop). As a consequence, segmenters are confused and consider a heading and the first sentence of the following paragraph to be a single sentence unit. This can be resolved by adding Div annotations (or annotations of its subtypes) to the document and configuring the segmenters to respect them. Such annotations can be obtained, for example, by using:

  • a reader component that creates Paragraph and Heading annotations (e.g. the PdfReader)

  • an analysis component that detects Paragraph boundaries (e.g. the ParagraphSplitter)

Segmenters are by default configured to respect Div-type annotations (the default value for PARAM_ZONE_TYPES is Div). This means that segmenters will ensure that sentences and tokens do not overlap with the boundaries of these zones. If desired, PARAM_STRICT_ZONING can be set to true (default: false) to ensure that tokens and sentences are only created within the boundaries of the zone types.
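
A minimal sketch of such a setup (reader and writer are placeholders; ParagraphSplitter is the paragraph boundary detector mentioned above, and OpenNlpSegmenter serves as an example segmenter):

SimplePipeline.runPipeline(
    reader,  // e.g. a reader providing the document text
    createEngineDescription(ParagraphSplitter.class),  // adds Paragraph (a Div subtype) annotations
    createEngineDescription(OpenNlpSegmenter.class,
        // Div zones are respected by default; strict zoning additionally restricts
        // sentences and tokens to the insides of the zones
        OpenNlpSegmenter.PARAM_STRICT_ZONING, true),
    writer);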

Hyphenation

Hyphenated words cannot be properly processed by NLP tools. The HyphenationRemover component can be used to join hyphenated words.

References

Here are some further references that might be helpful when deciding which tools to use:

  • Giesbrecht, Eugenie and Evert, Stefan (2009). Part-of-speech tagging - a solved task? An evaluation of POS taggers for the Web as corpus. In I. Alegria, I. Leturia, and S. Sharoff, editors, Proceedings of the 5th Web as Corpus Workshop (WAC5), San Sebastian, Spain. PDF

  • Reut Tsarfaty, Djamé Seddah, Sandra Kübler, and Joakim Nivre. 2013. Parsing morphologically rich languages: Introduction to the special issue. Comput. Linguist. 39, 1 (March 2013), 15-22. PDF

  • Wolfgang Seeker and Jonas Kuhn. 2013. Morphological and syntactic case in statistical dependency parsing. Comput. Linguist. 39, 1 (March 2013), 23-55. PDF

Adding components as dependencies (Maven)

In order to start using an integrated tool from DKPro Core, we can add it as a Maven dependency to our experiment.

As an example, we take the OpenNlpPosTagger component. To make it available in a pipeline, we add the following dependency to our POM file:

<properties>
  <dkpro.core.version>1.8.0</dkpro.core.version>
</properties>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
      <version>${dkpro.core.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-asl</artifactId>
  </dependency>
</dependencies>

The dependency on DKPro Core declared in the dependency management section fixes the version of all DKPro Core dependencies that are added to the POM. Hence, it is not necessary to declare the version for each dependency. When upgrading to a new DKPro Core version, it is sufficient to change the value of the dkpro.core.version property in the properties section.

If you use a multi-module project, the properties and dependencyManagement sections should go into the parent POM of your project, while the dependencies section should be added to the respective module requiring the dependency.

Adding resources as dependencies (Maven)

Most components (e.g. the OpenNlpPosTagger) require resources such as models (e.g. opennlp-model-tagger-en-maxent) in order to operate. Since components and resources are versioned separately, it can be non-trivial to find the right version of a resource for a particular version of a component. For this reason, each DKPro Core component maintains a list of resources known to be compatible with it. This information can be accessed in a Maven POM, thus avoiding the need to manually specify the version of the models. Consequently, when you upgrade to a new version of DKPro Core, all models are automatically upgraded as well. This is usually the desired behaviour, although it can mean that your pipelines produce slightly different results.

As an example, we take the OpenNlpPosTagger component. In the previous section, we have seen how to make it available in a pipeline. Now we also add the model for English.

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
  </dependency>
</dependencies>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-asl</artifactId>
      <version>${dkpro.core.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

The dependency on the DKPro Core OpenNLP module declared in the dependency management section fixes the version of all known OpenNLP models. Thus, it is not necessary to declare a version on each model dependency. When upgrading to a new DKPro Core version, it is sufficient to change the value of the dkpro.core.version property in the properties section.

I/O

This section gives an overview of the I/O components. The components are organized into one module per file type. These modules typically contain one reader and/or one writer component.

All readers initialize the CAS with a DocumentMetaData annotation.

Most readers and writers do not support all features of the respective formats. Additionally, readers and writers may only support a specific variant of a format.

Reading data

DKPro Core aims to provide a consistent API for reading and writing annotated data. Most of the readers are resource readers (RR) and most of the writers are file writers (FW); these support a common set of parameters which are explained below.

Table 1. Resource reader parameters

PARAM_SOURCE_LOCATION (optional)
    Location to read from.

PARAM_PATTERNS (optional)
    Include/exclude patterns.

PARAM_USE_DEFAULT_EXCLUDES (default: true)
    Enable default excludes for versioning systems like Subversion, git, etc.

PARAM_INCLUDE_HIDDEN (default: false)
    Include hidden files.

PARAM_LANGUAGE (optional)
    Two letter ISO code.

Either PARAM_SOURCE_LOCATION or PARAM_PATTERNS or both must be set.

Read all files in the folder files/texts
PARAM_SOURCE_LOCATION, "files/texts"
Recursively read all .txt files in the folder files/texts (embedded pattern)
PARAM_SOURCE_LOCATION, "files/texts/**/*.txt"
Recursively read all .txt files in the folder files/texts (detached pattern)
PARAM_SOURCE_LOCATION, "files/texts"
PARAM_PATTERNS, "*.txt"
Excluding some files (detached pattern)
PARAM_SOURCE_LOCATION, "files/texts"
PARAM_PATTERNS, new String[] {"*.txt", "[-]broken*.txt"}
Read from the classpath
PARAM_SOURCE_LOCATION, "classpath*:texts/*.txt"
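
In Java code, these parameters are passed when creating the reader description. A minimal sketch using the plain text TextReader as an example (assuming the usual uimaFIT imports; location and patterns are placeholders):

CollectionReaderDescription reader = createReaderDescription(TextReader.class,
    TextReader.PARAM_SOURCE_LOCATION, "files/texts",
    TextReader.PARAM_PATTERNS, new String[] { "*.txt", "[-]broken*.txt" },
    TextReader.PARAM_LANGUAGE, "en");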

Writing data

Table 2. File writer parameters

PARAM_TARGET_LOCATION (mandatory)
    Location to write to.

PARAM_COMPRESSION (default: NONE)
    Compression algorithm to use when writing output. File suffix automatically added depending on algorithm. Supported are: NONE, GZIP, BZIP2, and XZ (see class CompressionMethod).

PARAM_STRIP_EXTENSION (default: false)
    Whether to remove the original file extension when writing. E.g. with the XmiWriter without extension stripping, an input file MyText.txt would be written as MyText.txt.xmi - with stripping it would be MyText.xmi.

PARAM_USE_DOCUMENT_ID (default: false)
    Use the document ID as the file name, even if an original file name is present in the document URI.

PARAM_ESCAPE_DOCUMENT_ID (default: false)
    Escape the document ID in case it contains characters that are not valid in a filename.

PARAM_SINGULAR_TARGET (default: false)
    Treat target location as a single file name.

PARAM_OVERWRITE (default: false)
    Allow overwriting existing files.
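
In Java code, a writer is configured in the same way. A minimal sketch using the XmiWriter as an example (assuming the usual uimaFIT imports; CompressionMethod is the class referenced in the table above, and the target path is a placeholder):

AnalysisEngineDescription writer = createEngineDescription(XmiWriter.class,
    XmiWriter.PARAM_TARGET_LOCATION, "output",
    XmiWriter.PARAM_COMPRESSION, CompressionMethod.GZIP, // the .gz suffix is added automatically
    XmiWriter.PARAM_OVERWRITE, true);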

Working with ZIP archives

Most formats can be read from and written to ZIP archives.

Read from a ZIP archive
PARAM_SOURCE_LOCATION, "jar:file:archive.zip!texts/**/*.txt"

Most file writers write multiple files, so PARAM_TARGET_LOCATION is treated as a directory name. A few only write a single file (e.g. NegraExportWriter), in which case the parameter is treated as the file name. Instead of writing to a directory, it is possible to write to a ZIP archive:

Write to a ZIP archive
PARAM_TARGET_LOCATION, "jar:file:archive.zip"
Write to a folder inside a ZIP archive
PARAM_TARGET_LOCATION, "jar:file:archive.zip!folder/within/zip"
It is not possible to write into an existing ZIP file. A new file is created in every case; if a ZIP file with the same name already exists, it is overwritten.

Models and Resources

Packaging models

Most models used by DKPro Core are available through our Maven repository. However, in some cases, we cannot redistribute the models. For these cases, we provide Ant-based build.xml scripts that automatically download and package the models for use with DKPro Core.

For any given module supporting packaged resources, there is always the build.xml file in SVN trunk as well as the ones from previous releases (in the SVN tags folder). Which one should you use?

You should always use only the build.xml files belonging to the version of DKPro Core that you are using. From time to time, we change the metadata within these files, and DKPro Core may be unable to properly resolve models belonging to a different version of DKPro Core. The files are contained in the src/scripts folder of the respective modules in SVN. We do not ship the build.xml files in any other way than via SVN.

That said, it might be necessary to make modifications to a build.xml file if it refers to files that have changed upstream. For example, the TreeTagger models tend to change without their name or version changing. Also, sometimes upstream files may become unavailable. In such cases, you have to update the MD5 hash for the model in the build.xml file or even comment it out entirely.

In case you need to update the MD5 sum, you should also update the upstreamVersion to correspond to the date of the new model. A good way to determine the date of the latest change is to use the curl tool, e.g.:

curl -I http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin

In the output, locate the Last-Modified line:

Last-Modified: Thu, 09 Sep 2010 06:57:11 GMT

So, here the upstreamVersion for en-pos-maxent.bin should be set to 20100909 (YYYYMMDD).