This document targets developers of DKPro Core components.

Setup

This section explains the setup for developers. Please also refer to the Setup section in the User’s Guide as it may contain additional information necessary to run DKPro Core components (e.g. special libraries that need to be installed depending on your operating system).

1. GIT

All DKPro Core files are stored using UNIX line endings. If you develop on Windows, you have to set the core.autocrlf configuration setting to input to avoid accidentally committing Windows line endings to the repository. Using input is a good strategy in most cases, so you should consider setting it as a global (add --global) or even as a system (--system) setting.

Configure git line ending treatment
C:\> git config --global core.autocrlf input

After changing this setting, it is best to do a fresh clone and check-out of DKPro Core.
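As an alternative (or complement) to the per-user autocrlf setting, line-ending normalization can also be enforced per repository via a .gitattributes file at the repository root. This is a general Git mechanism; whether DKPro Core ships such a file is not covered here, so treat this as an optional sketch:

```
# Normalize all files Git detects as text to LF in the repository
* text=auto
```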

2. Eclipse

2.1. Use a JDK

On Linux or OS X, the following setting is not necessary: having a full JDK installed on your system is generally sufficient, and you can skip ahead to the next section.

On Windows, you need to edit the eclipse.ini file and add the following two lines directly before the -vmargs line. Make sure to replace C:/Program Files/Java/jdk1.8.0_144 with the actual location of the JDK on your system. Without this setting, Eclipse will complain that the jdk.tools:jdk.tools artifact is missing.

Force Eclipse to run on a JDK
-vm
C:/Program Files/Java/jdk1.8.0_144/jre/bin/server/jvm.dll

2.2. Required Plugins

2.3. Workspace Preferences

The following settings are recommended for the Eclipse Workspace Preferences:

Setting                                                      Value

General → Workspace → Text file encoding                     UTF-8
General → Workspace → New text file line delimiter           Unix
General → Editors → Text Editors → Displayed tab width       2
General → Editors → Text Editors → Insert spaces for tabs    true
General → Editors → Text Editors → Show print margin         true
General → Editors → Text Editors → Print margin column       100
XML → XML Files → Editor → Line width                        100
XML → XML Files → Editor → Format comments                   false
XML → XML Files → Editor → Indent using spaces               selected
XML → XML Files → Editor → Indentation size                  2

2.4. Import

In Eclipse, go to File → Import, choose Existing Maven projects, and select the folder to which you have cloned DKPro Core. Eclipse should automatically detect all modules. Mind that DKPro Core is a large project and it takes significant time until all dependencies have been downloaded and until the first build is complete.

Adding Modules

DKPro Core consists of a number of Maven modules. The actual components (readers, writers, and processing components) as well as the DKPro Core types and APIs reside within these modules.

3. Module Naming Scheme

The name is the first thing to consider when creating a new module.

Although the modules are technically all the same, the naming scheme distinguishes between the following types of modules:

  • API modules (dkpro-core-api-NAME-asl) - these modules contain common base classes, utility classes, type system definitions and JCas classes. Since API modules are used in many places, they must be licensed under the Apache License.

  • IO modules (dkpro-core-io-NAME-LIC) - these modules contain reader and writer components. They are usually named after the file format (e.g. lif) or family of file formats they support (e.g. conll).

  • FS modules (dkpro-core-fs-NAME-LIC) - these modules contain support for specific file systems. They are usually named after the file system type they support (e.g. hdfs).

  • Component modules (dkpro-core-NAME-LIC) - these modules contain processing components. They are usually named after the tool or library that is wrapped (e.g. treetagger or corenlp).

In addition to these four categories, there are a few unique modules which do not fall into any of them, e.g. de.tudarmstadt.ukp.dkpro.core.testing-asl or de.tudarmstadt.ukp.dkpro.core.doc-asl.

DKPro Core is in a transition phase from the traditional but deprecated naming scheme (de.tudarmstadt.ukp.dkpro.core…​) to the new naming scheme (org.dkpro.core…​). Many modules still make use of the old naming scheme.

The naming scheme applies in several places:

  • module folder - the sub-folder within the DKPro Core project which contains the module

  • artifactId - the Maven artifactId as recorded in the pom.xml file. The groupId should always be org.dkpro.core.

  • Java packages - the module name translates roughly into the Java package names, e.g. the root Java package in the module dkpro-core-io-lif-asl is org.dkpro.core.io.lif.
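Putting these pieces together for the dkpro-core-io-lif-asl example mentioned above, the scheme resolves as follows:

```
module folder: dkpro-core-io-lif-asl
groupId:       org.dkpro.core
artifactId:    dkpro-core-io-lif-asl
Java package:  org.dkpro.core.io.lif
```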

4. Module Hierarchy

Once you have decided on a name for the new module, you proceed by creating a new module folder. Module folders are created directly under the root of the DKPro Core source tree.

Although the folder structure of DKPro Core appears to be a flat list of modules, there is actually a shallow module hierarchy (i.e. the folder hierarchy does not correspond to the Maven module hierarchy).

The DKPro Core Parent POM is at the root of the module and of the folder hierarchy. Its parent is the DKPro Parent POM which is maintained in a separate source tree and which follows its own release cycle. The DKPro Parent POM defines a set of default settings, profiles, and managed dependencies useful for all DKPro projects. The DKPro Core Parent POM defines settings specific to DKPro Core.

DKPro Parent POM
  DKPro Core Parent POM
    DKPro Core ASL Parent POM
      ... DKPro Core ASL modules ...
    DKPro Core GPL Parent POM
      ... DKPro Core GPL modules ...
    DKPro Core Documentation

New modules are added either in the <modules> section of the DKPro Core ASL Parent POM or of the DKPro Core GPL Parent POM, depending on whether the new module can be licensed under the Apache License or whether it has to be licensed under the GPL due to a GPL dependency. These two parent POMs configure different sets of license checkers: for ASL modules, the Apache RAT Maven Plugin is used; for GPL modules, the License Maven Plugin is used.

Note that the <modules> section in these POMs points to the folders which contain the respective modules. Since the folder hierarchy is flat (unlike the module hierarchy), the module names here need to be prefixed with ../.

Excerpt from the DKPro Core ASL Parent POM modules section
<modules>
  <!-- API modules -->
  <module>../dkpro-core-api-anomaly-asl</module>
  <module>../dkpro-core-api-coref-asl</module>
  ...
  <!-- FS modules -->
  <module>../dkpro-core-fs-hdfs-asl</module>
  <!-- IO modules -->
  <module>../dkpro-core-io-aclanthology-asl</module>
  <module>../dkpro-core-io-ancora-asl</module>
  ...
  <!-- Processing modules -->
  <module>../dkpro-core-castransformation-asl</module>
  <module>../dkpro-core-cisstem-asl</module>
  ...
</modules>

In addition to adding a new module to the <modules> section of the respective parent POM, it also needs to be added to the <dependencyManagement> section of this POM:

Excerpt from the DKPro Core ASL Parent POM dependency management section
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.dkpro.core</groupId>
      <artifactId>dkpro-core-foo-asl</artifactId>
      <version>1.9.0</version>
    </dependency>
    ...
  </dependencies>
</dependencyManagement>
If you create a GPLed module, copy the .license-header.txt file from another GPLed module into your new module in order to properly configure the license checker. Mind that the GPL license text in XML files must be indented by 4 spaces; it will not be recognized otherwise. You may have to adjust the text, depending on whether the module can be licensed under GPLv3 or has to be licensed under GPLv2.
If not all of the dependencies of your new module are available from Maven Central or JCenter, then add the module within the <modules> and <dependencyManagement> sections located under the deps-not-on-maven-central profile within the respective parent POM. Also add the required third-party repositories there if necessary.
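A sketch of such a profile in the parent POM (the module name and repository URL are hypothetical placeholders):

```xml
<profile>
  <id>deps-not-on-maven-central</id>
  <modules>
    <module>../dkpro-core-foo-asl</module>
  </modules>
  <repositories>
    <repository>
      <id>some-third-party-repo</id>
      <url>https://example.org/maven-repository</url>
    </repository>
  </repositories>
</profile>
```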

5. Basic POM

Next, you create a basic POM inside your new module folder. Below is an example of a minimal POM for a new Apache-licensed component module. If you create a GPL-licensed module instead, replace the -asl suffixes with -gpl and copy the license header from another GPLed module.

Minimal sample POM for a new Apache-licensed component module
<!--
  Licensed to the Technische Universität Darmstadt under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The Technische Universität Darmstadt
  licenses this file to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
    <version>1.9.0</version>
    <relativePath>../dkpro-core-asl</relativePath>
  </parent>
  <groupId>org.dkpro.core</groupId>
  <artifactId>dkpro-core-foo-asl</artifactId>
  <packaging>jar</packaging>
  <name>DKPro Core ASL - Foo NLP Suite (v ${foo.version})</name>
  <properties>
    <foo.version>1.8.2</foo.version>
  </properties>
  <dependencies>
  </dependencies>
</project>

6. Library Dependencies

In order to avoid unpleasant surprises, DKPro Core uses the Maven Dependency Plugin to check whether all dependencies used directly in the code of a module are also explicitly declared in the module POM. If this is not the case, the automated builds fail (they run with -DfailOnWarning). This means that you have to declare dependencies for all libraries that you use directly from your code in the <dependencies> section. If a dependency is only required during testing, it must be marked with <scope>test</scope>. Below, you will find a few typical libraries used in many modules. Note that there is no version defined for these dependencies: the versions of many libraries used by multiple modules in DKPro Core are defined in the DKPro Core Parent POM. Only libraries that are specific to a particular module, e.g. the specific NLP library being wrapped, should have their versions defined within the module POM.
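To run the same check locally before pushing, the Maven Dependency Plugin's analyze goal can be invoked directly; the failOnWarning property matches the flag mentioned above:

```shell
mvn dependency:analyze -DfailOnWarning
```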

Typical dependencies section
<dependency>
  <groupId>org.apache.uima</groupId>
  <artifactId>uimaj-core</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.uima</groupId>
  <artifactId>uimafit-core</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
  <groupId>commons-io</groupId>
  <artifactId>commons-io</artifactId>
</dependency>
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.api.parameter-asl</artifactId>
</dependency>
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
  <artifactId>de.tudarmstadt.ukp.dkpro.core.testing-asl</artifactId>
  <scope>test</scope>
</dependency>

You may notice the foo.version property in the minimal POM example above. This property should be used to set the version of the wrapped NLP library. It should appear in the name of the module as well as in the dependency for the wrapped library.

Dependency on the wrapped NLP library
<dependency>
  <groupId>org.foo.nlp</groupId>
  <artifactId>foo-nlp-suite</artifactId>
  <version>${foo.version}</version>
</dependency>

7. Model Dependencies

When you package models for your new component, they need special treatment in the POM. First, although it is a good idea to create unit tests based on the models, most often you do not want to download all models and run all unit tests during a normal developer build (some models are very large and may quickly fill up your hard disk). Second, the Maven Dependency Plugin is unable to detect that your code or tests make use of the models, and it needs to be configured specially to allow the build to pass even though it considers the model dependencies unnecessary.

So assuming you have a model for your component, first add it to the <dependencyManagement> section of the POM - here you specify the version but not the scope. All models are added to this section, irrespective of whether they are used for testing or not.

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
      <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
      <version>20120616.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>

If you also want to use the model for testing, then also add it to the <dependencies> section of the POM. Here you specify the scope but not the version. You also have to configure the Maven Dependency Plugin to accept the presence of the dependency.

<dependencies>
  <dependency>
    <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
    <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
    <scope>test</scope>
  </dependency>
</dependencies>
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <configuration>
          <usedDependencies>
            <!-- Models not detected by byte-code analysis -->
            <usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</usedDependency>
          </usedDependencies>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>

As mentioned before, if you have many models for your component, it is a good idea to use only a small set for regular testing. If you want to create tests for additional models or even for all of your models, then it is best to add the dependencies for these under a profile called use-full-resources. This profile is enabled for automated builds and can also be enabled on demand by developers who wish to run all tests. In the example below, we add an additional test dependency on a German model if the profile use-full-resources is enabled. Note that the Maven Dependency Plugin is again configured within the profile and that the combine.children="append" attribute is used to merge the configuration with the one already present for the default build.

<profiles>
  <profile>
    <id>use-full-resources</id>
    <dependencies>
      <dependency>
        <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
        <artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-de-maxent</artifactId>
        <scope>test</scope>
      </dependency>
    </dependencies>
    <build>
      <pluginManagement>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <configuration>
              <usedDependencies combine.children="append">
                <!-- Models not detected by byte-code analysis -->
                <usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-de-maxent</usedDependency>
              </usedDependencies>
            </configuration>
          </plugin>
        </plugins>
      </pluginManagement>
    </build>
  </profile>
</profiles>
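To run the full test suite locally, the profile can be activated from the Maven command line; a typical invocation might look like this:

```shell
mvn clean verify -Puse-full-resources
```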

To conditionally run a test only if the required model is available, you can use the AssumeResource class from the DKPro Core testing module.

AssumeResource.assumeResource(OpenNlpPosTagger.class, "tagger", language, variant);

8. LICENSE.txt

Every module must contain a file called LICENSE.txt at its root which contains the license text. Copy this file from another Apache-licensed or GPL-licensed module (again check if you need to use GPLv2 or v3). If this file is not present, the build will fail.

9. NOTICE.txt

If the module contains code or resources from a third party (e.g. a source or test file which you copied from some other code repository or obtained from some website), then you need to add a file called NOTICE.txt next to the LICENSE.txt file. For every third-party file (or set of files, if multiple files were obtained from the same source under the same conditions), the NOTICE.txt must contain a statement which makes it possible to identify the files, states where they were obtained from, and gives a copyright and license statement. Check the license of the original files to see whether you have to include the full license text and potentially some specific attribution (possibly from an upstream NOTICE file).
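A hypothetical NOTICE.txt entry following these rules might look as shown below; the file path, origin, copyright holder, and license are all placeholders:

```
The file src/test/resources/texts/sample.txt was obtained from
https://example.org/corpus and is used here for testing purposes.

Copyright (c) 2010 Example Authors
Licensed under the Creative Commons Attribution 4.0 License.
```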

Implementing Components

10. General

10.1. Capabilities

All components should declare the types they consume by default as input and the types they produce by default as output. Some components may not know before runtime what they produce or consume; in such cases, nothing can be declared.

Example of declaring input/output types on a component
@TypeCapability(
        inputs = {
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token",
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence" },
        outputs = {
            "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
public class OpenNlpPosTagger
        extends JCasAnnotator_ImplBase
{

11. Analysis components

11.1. Base classes

The base classes for analysis components are provided by uimaFIT: JCasAnnotator_ImplBase and CasAnnotator_ImplBase.

11.2. Models

The ModelProviderBase class offers convenient support for working with model resources. The following code is taken from the OpenNlpPosTagger component. It shows how the POS Tagger model is addressed using a parametrized classpath URL with parameters for language and variant.

Model provider setup in OpenNlpPosTagger.initialize() (shortened)
// Use ModelProviderBase convenience constructor to set up a model provider that
// auto-detects most of its settings and is configured to use default variants.
// Auto-detection inspects the configuration parameter fields (@ConfigurationParameter)
// of the analysis engine class and looks for default parameters such as PARAM_LANGUAGE,
// PARAM_VARIANT, and PARAM_MODEL_LOCATION.
      modelProvider = new ModelProviderBase<POSTaggerME>(this, "tagger")
      {
          @Override
          protected POSTaggerME produceResource(InputStream aStream)
              throws Exception
          {
              // Load the POS tagger model from the location the model provider offers
              POSModel model = new POSModel(aStream);
              // Create a new POS tagger instance from the loaded model
              return new POSTaggerME(model);
          }
      };

The produceResource() method is called with the URL of the model once it has been located by CasConfigurableProviderBase.

Model provider use in OpenNlpPosTagger.process() (shortened)
CAS cas = aJCas.getCas();

// Document-specific configuration of model and mapping provider in process()
modelProvider.configure(cas);
Collection<Token> tokens = index.get(sentence);
String[] tokenTexts = toText(tokens).toArray(new String[tokens.size()]);
fixEncoding(tokenTexts);

// Fetch the OpenNLP pos tagger instance configured with the right model and use it to
// tag the text
String[] tags = modelProvider.getResource().tag(tokenTexts);

11.3. Type mappings

The DKPro type system design provides two levels of abstraction on most annotations:

  • a generic annotation type, e.g. POS (part of speech) with a feature value containing the original tag produced by an analysis component, e.g. TreeTagger

  • a set of high-level types for very common categories, e.g. N (noun), V (verb), etc.

DKPro maintains mappings for commonly used tagsets, e.g. in the module dkpro-core-api-lexmorph-asl. They are named:

Naming scheme for tag mapping files
{language}-{tagset}-{layer}.map

The following values are commonly used for layer:

  • pos - part-of-speech tag mapping

  • morph - morphological features mapping

  • constituency - constituent tag mapping

  • dependency - dependency relation mapping
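Applying this scheme, a mapping for English part-of-speech tags from the Penn Treebank tagset would be named as follows (the ptb and stts tagset identifiers are given as plausible examples):

```
en-ptb-pos.map     (English, Penn Treebank tagset, POS layer)
de-stts-pos.map    (German, STTS tagset, POS layer)
```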

The mapping provider is created in the initialize() method of the UIMA component after the respective model provider. This order is necessary because the mapping provider obtains the tagset information for the current model from the model provider.

Mapping provider setup in OpenNlpPosTagger.initialize() (shortened)
// General setup of the mapping provider in initialize()
mappingProvider = MappingProviderFactory.createPosMappingProvider(posMappingLocation,
        language, modelProvider);

In the process() method, the mapping provider is used to create a UIMA annotation. First, it is configured for the current document and model. Then, it is invoked for each tag produced by the tagger to obtain the UIMA type for the annotation to be created. If there is no mapping file, the mapping provider falls back to a suitable default annotation, e.g. POS for part-of-speech tags or NamedEntity.

Mapping provider use in OpenNlpPosTagger.process() (shortened)
// Mind the mapping provider must be configured after the model provider as it uses the
// model metadata
mappingProvider.configure(cas);
// Convert the tag produced by the tagger to an UIMA type, create an annotation
// of this type, and add it to the document.
Type posTag = mappingProvider.getTagType(tag);
POS posAnno = (POS) cas.createAnnotation(posTag, t.getBegin(), t.getEnd());
// To save memory, we typically intern() tag strings
posAnno.setPosValue(internTags ? tag.intern() : tag);
POSUtils.assignCoarseValue(posAnno);
posAnno.addToIndexes();
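The intern() trick mentioned in the comment above can be illustrated in isolation. The following standalone snippet (not DKPro-specific) shows why interning saves memory: equal strings are collapsed onto one canonical object, which matters when the same tag occurs many thousands of times in a corpus.

```java
// Standalone illustration of String.intern(): interning makes equal
// strings share one canonical object instead of many duplicates.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("NN"); // explicitly creates a new object
        String b = new String("NN"); // another, distinct object
        System.out.println(a == b);                   // false: two separate objects
        System.out.println(a.intern() == b.intern()); // true: one canonical object
    }
}
```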

11.4. Default variants

It is possible that a different default variant needs to be used depending on the language. This can be configured by placing a properties file in the classpath and setting its location using setDefaultVariantsLocation(String). The key in the properties file is the language and the value is used as the default variant. These files should always reside in the lib sub-package of a component and use the naming convention:

{tool}-default-variants.map

The default variant file is a Java properties file which defines for each language which variant should be assumed as default. It is possible to declare a catch-all variant using *. This is used if none of the other default variants apply.

OpenNLP POS tagger default variants configuration
it=perceptron
*=maxent

Use the convenience constructor of ModelProviderBase to create model providers that are already correctly set up to use default variants:

public ModelProviderBase(Object aObject, String aShortName, String aType)
{
    setContextObject(aObject);

    setDefault(ARTIFACT_ID, "${groupId}." + aShortName + "-model-" + aType
            + "-${language}-${variant}");
    setDefault(LOCATION,
            "classpath:/${package}/lib/"+aType+"-${language}-${variant}.properties");
    setDefaultVariantsLocation("${package}/lib/"+aType+"-default-variants.map");
    setDefault(VARIANT, "default");

    addAutoOverride(ComponentParameters.PARAM_MODEL_LOCATION, LOCATION);
    addAutoOverride(ComponentParameters.PARAM_VARIANT, VARIANT);
    addAutoOverride(ComponentParameters.PARAM_LANGUAGE, LANGUAGE);

    applyAutoOverrides(aObject);
}
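As a rough worked example of how these placeholders resolve (matching the OpenNLP model artifact shown in the dependency management example earlier, assuming aShortName="opennlp", aType="tagger", language en, and variant maxent):

```
ARTIFACT_ID → de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent
LOCATION    → classpath:/de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.properties
```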

12. I/O components

12.1. Base classes

The base classes for I/O components are located in the dkpro-core-api-io-asl module.

Most reader components are derived from JCasResourceCollectionReaderBase or ResourceCollectionReaderBase. These classes offer support for many common functionalities, e.g.:

  • common parameters like PARAM_SOURCE_LOCATION and PARAM_PATTERNS

  • reading from the file system, classpath, or ZIP archives

  • file-based compression (GZ, BZIP2, XZ)

  • handling of DocumentMetaData (initCas methods)

  • Ant-like include/exclude patterns

  • handling of default excludes and hidden files

  • progress reporting support

  • extensibility with own Spring resource resolvers

Most writer components are derived from JCasFileWriter_ImplBase. This class offers support for common functionality such as:

  • common parameters like PARAM_TARGET_LOCATION

  • writing to the file system, classpath, or ZIP archives (getOutputStream methods)

  • file-based compression (GZ, BZIP2, XZ)

  • properly interpreting DocumentMetaData (getOutputStream methods)

  • overwrite protection

  • replacing file extensions

If an I/O component interacts with a different data source, e.g. a database, the base classes above are not suitable. Such readers should derive from the uimaFIT JCasCollectionReader_ImplBase (or CasCollectionReader_ImplBase) and writers from JCasAnnotator_ImplBase (or CasAnnotator_ImplBase). However, the developer should ensure that the component’s parameters reflect the standard DKPro Core reader/writer parameters defined in ComponentParameters (dkpro-core-api-parameters-asl module).

Testing

The testing module offers a convenient way to create unit tests for UIMA components.

13. Basic test setup

There are a couple of things useful in every unit test:

  • Redirecting the UIMA logging through log4j - DKPro Core uses log4j for logging in unit tests.

  • Printing the name of the test to the console before every test

  • Enabling extended index checks in UIMA (uima.exception_when_fs_update_corrupts_index)

To avoid repeating longish boilerplate setup code in every unit test, add the following lines to your unit test class:

@Rule
public DkproTestContext testContext = new DkproTestContext();

Additional benefits you get from this testContext are:

  • getting the class name of the current test (getClassName())

  • getting the method name of the current test (getMethodName())

  • getting the name of a folder you can use to store test results (getTestOutputFolder()).

14. Unit test example

A typical unit test class consists of two parts:

  1. the test cases

  2. a runTest method, which sets up the pipeline required by the test and then calls TestRunner.runTest().

In the following example, mind that the text must be provided with spaces separating the tokens (thus there must be a space before the full stop at the end of the sentence) and with newline characters (\n) separating the sentences:

Typical unit test for an analysis component from the OpenNlpNamedEntityRecognizer test (shortened)
@Test
public void testEnglish()
    throws Exception
{
    // Run the test pipeline. Note the full stop at the end of a sentence is preceded by a
    // whitespace. This is necessary for it to be detected as a separate token!
    JCas jcas = runTest("en", "person", "SAP where John Doe works is in Germany .");

    // Define the reference data that we expect to get back from the test
    String[] namedEntity = { "[ 10, 18]NamedEntity(person) (John Doe)" };

    // Compare the annotations created in the pipeline to the reference data
    AssertAnnotations.assertNamedEntity(namedEntity, select(jcas, NamedEntity.class));
}

// Auxiliary method that sets up the analysis engine or pipeline used in the test.
// Typically, we have multiple tests per unit test file that each invoke this method.
private JCas runTest(String language, String variant, String testDocument)
    throws Exception
{
    AssumeResource.assumeResource(OpenNlpNamedEntityRecognizer.class, "ner", language, variant);

    AnalysisEngine engine = createEngine(OpenNlpNamedEntityRecognizer.class,
            OpenNlpNamedEntityRecognizer.PARAM_VARIANT, variant,
            OpenNlpNamedEntityRecognizer.PARAM_PRINT_TAGSET, true);

    // Here we invoke the TestRunner which performs basic whitespace tokenization and
    // sentence splitting, creates a CAS, runs the pipeline, etc. TestRunner explicitly
    // disables automatic model loading. Thus, models used in unit tests must be explicitly
    // made dependencies in the pom.xml file.
    return TestRunner.runTest(engine, language, testDocument);
}

Test cases for segmenter components should not make use of the TestRunner class, because it already performs tokenization and sentence splitting internally.

15. AssertAnnotations

The AssertAnnotations class offers various static methods to test if a component has properly created annotations of a certain kind. There are methods to test almost every kind of annotation supported by DKPro Core, e.g.:

  • assertToken

  • assertSentence

  • assertPOS

  • assertLemma

  • assertMorph

  • assertStem

  • assertNamedEntity

  • assertConstituents

  • assertChunks

  • assertDependencies

  • assertPennTree

  • assertSemanticPredicates

  • assertSemanticField

  • assertCoreference

  • assertTagset

  • assertTagsetMapping

  • assertValid - Tests implemented with TestRunner and IOTestRunner perform validation checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).

  • etc.

16. Testing I/O components

The IOTestRunner class offers convenient methods to test I/O components:

  • testRoundTrip can be used to test converting a format to a CAS, converting it back, and comparing the result to the original

  • testOneWay instead is useful to read data and compare it to a reference file in a different format (e.g. the CasDumpWriter format). It can also be used if a full round-trip is not possible because some information is lost or cannot be exported exactly as it was ingested from the original file.

The input file and reference file paths given to these methods are always considered relative to src/test/resources.

Example using testRoundTrip (Conll2006ReaderWriterTest)
testRoundTrip(
        Conll2006Reader.class, // the reader
        Conll2006Writer.class,  // the writer
        "conll/2006/fk003_2006_08_ZH1.conll"); // the input also used as output reference
Example using testOneWay with extra parameters (Conll2006ReaderWriterTest)
testOneWay(
        Conll2006Reader.class, // the reader
        Conll2006Writer.class,  // the writer
        "conll/2006/fi-ref.conll", // the reference file for the output
        "conll/2006/fi-orig.conll"); // the input file for the test
Example using testOneWay with extra parameters (BratReaderWriterTest)
testOneWay(
        createReaderDescription(Conll2009Reader.class), // the reader
        createEngineDescription(BratWriter.class, // the writer
                BratWriter.PARAM_WRITE_RELATION_ATTRIBUTES, true),
        "conll/2009/en-ref.ann", // the reference file for the output
        "conll/2009/en-orig.conll"); // the input file for the test

Type System

17. Types

To add a new type, first locate the relevant module. Typically types are added to an API module because types are supposed to be independent of individual analysis tools. In rare circumstances, a type may be added to an I/O or tool module, e.g. because the type is experimental and needs to be tested in the context of that module - or because the type is highly specific to that module.

Typically, there is only a single descriptor file per module, called dkpro-types.xml. Within a module, we keep this descriptor in the folder src/main/resources under the module's top-level package plus a type sub-package. E.g. for the module

dkpro-core-api-semantics-asl

the type descriptor would be

src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
For the time being, descriptors in src/main/resources/desc/type are also supported. However, this support is going to be removed in the future.

18. Type descriptors

If there is no suitable type descriptor file yet, create a new one.

When a new type system descriptor has been added to a module, it needs to be registered with uimaFIT. This happens by creating the file

src/main/resources/META-INF/org.apache.uima.fit/types.txt

consisting of a list of type system descriptor locations prefixed with classpath*:, e.g.:

classpath*:de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
The type system location corresponds to the location within the classpath at runtime, thus src/main/resources is stripped from the beginning.
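As a sketch, this stripping rule can be expressed in a few lines of Java (the class and method names here are purely illustrative, not part of DKPro Core):

```java
public class ClasspathLocation {
    // Derives the types.txt entry for a descriptor from its source path by
    // stripping the src/main/resources prefix and adding the classpath*: prefix.
    public static String toEntry(String sourcePath) {
        String prefix = "src/main/resources/";
        if (!sourcePath.startsWith(prefix)) {
            throw new IllegalArgumentException("Not under " + prefix);
        }
        return "classpath*:" + sourcePath.substring(prefix.length());
    }

    public static void main(String[] args) {
        System.out.println(toEntry(
                "src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml"));
        // prints: classpath*:de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
    }
}
```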

19. Documentation

19.1. Type descriptors

To play nicely with the automatic documentation generation system, the following points should be observed when creating a new type descriptor file:

Name

field of the type descriptor corresponds to the section under which the types declared in the descriptor appear. If a type descriptor name field is e.g. Syntax, all types declared in the file will appear under that heading in the documentation. Multiple descriptors can declare the same name and the types declared in them are listed in the documentation in alphabetical order.

Description

field should be empty. Instead, create a sectionIntroXXX.adoc file under src/main/asciidoc/typesystem-reference in the dkpro-core-doc module (XXX is the name of the section - see Name above).

Version

should be set to ${version}. If it does not exist yet, create a file src/filter/filter.properties in the module that creates the new type descriptor with the following content:

version=${project.version}
timestamp=${maven.build.timestamp}

Also add the following section to the pom.xml file in the respective module:

<resources>
  <resource>
    <filtering>false</filtering>
    <directory>src/main/resources</directory>
    <excludes>
      <exclude>desc/type/**/*</exclude>
    </excludes>
  </resource>
  <resource>
    <filtering>true</filtering>
    <directory>src/main/resources</directory>
    <includes>
      <include>desc/type/**/*</include>
    </includes>
  </resource>
</resources>
Replace the pattern inside the include and exclude elements with the location of your type descriptor file, e.g. de/tudarmstadt/ukp/dkpro/core/api/semantics/type/*.xml.

19.2. Types

When creating a new type or feature, you can use HTML tags to format the description. Line breaks, indentation, etc. will not be preserved. Mind that the description will be placed into the JavaDoc for the generated JCas classes as well as into the auto-generated DKPro Core documentation.

20. JCas classes

Instead of pre-generating the JCas classes and storing them in version control, we use the jcasgen-maven-plugin to automatically generate JCas classes at build time. The automatic generation of JCas classes needs to be explicitly enabled for modules containing types. This is done by placing a file called .activate-run-jcasgen in the module root with the content

Marker to activate run-jcasgen profile.
Actually the content is irrelevant, but it is a good idea to place a note here regarding the purpose of the file.

However, in some cases we have customized the JCas classes, e.g. we added the method DocumentMetaData.get(JCas). Such classes are excluded from being generated automatically by placing them in a second descriptor file called dkpro-types-customized.xml, e.g.

src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types-customized.xml
The dkpro-types-customized.xml descriptor must be also registered with uimaFIT in the types.txt file.

21. Compliance validation

Often a type comes with a certain policy. For example, root nodes in a dependency relation tree should have the type ROOT and the features governor and dependent should point to the same token. Another example would be that if a constituent is a child of another constituent, then its parent feature should be set accordingly.

To ensure that all components adhere to such policies, it is a good idea to implement checks for them. This can be done simply by placing a new check implementation into the package de.tudarmstadt.ukp.dkpro.core.testing.validation.checks in the testing module. Tests implemented with TestRunner and IOTestRunner run these checks automatically. All other tests should invoke AssertAnnotations.assertValid(jcas).

Example check ensuring that parent of constituents and tokens is properly set (shortened)
@Override
public boolean check(JCas aJCas, List<Message> aMessages)
{
    for (Constituent parent : select(aJCas, Constituent.class)) {
        Collection<Annotation> children = select(parent.getChildren(), Annotation.class);
        for (Annotation child : children) {
            Annotation declParent = FSUtil.getFeature(child, "parent", Annotation.class);

            if (declParent == null) {
                aMessages.add(new Message(this, ERROR, String.format(
                        "Child without parent set: %s", child)));

            }
            else if (declParent != parent) {
                aMessages.add(new Message(this, ERROR, String.format(
                        "Child points to wrong parent: %s", child)));

            }
        }
    }

    return aMessages.stream().noneMatch(m -> m.level == ERROR);
}

Models and Resources

This section explains how resources, such as models, are packaged, distributed, and used within DKPro Core.

22. Architecture

The architecture for resources (e.g. parser models, POS tagger models, etc.) in DKPro Core is still work in progress. However, a couple of cornerstones have already been established.

  • REQ-1 - Addressable by URL: Resources must be addressable using a URL, typically a classpath URL (classpath:/de/tudarmstadt/…​/model.bin) or a file URL (file:///home/model.bin). Remote URLs like HTTP should not be used and may not be supported.

  • REQ-2 - Maven compatible: Resources are packaged in JARs and can be downloaded from our Maven repositories (if the license permits).

  • REQ-3 - Document-sensitive: A component should dynamically determine at runtime which resource to use based on properties of a processed document, e.g. based on the document language. This may change from one document to the next.

  • REQ-4 - Overridable: The user should be able to override the model or provide additional information as to which specific variant of a resource should be used. E.g. if there are two resources for the language de, de-fast and de-accurate, the component could use de-fast by default unless the user specifies the variant accurate or specifies a different model altogether.

  • REQ-5 - Loadable from classpath: Due to REQ-1, REQ-2, and REQ-3, models must be resolvable from the classpath. The following mechanisms can be used for this:

    • ResourceUtils.resolveLocation(String, Object, UimaContext)

    • Resource Providers (see below)

    • PathMatchingResourcePatternResolver
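To illustrate REQ-3 and REQ-4 together, the following sketch shows how a component might derive a classpath model location from the document language and a user-overridable variant. All names here are hypothetical; the actual DKPro Core components use the resource provider infrastructure mentioned above:

```java
public class ModelSelector {
    // Sketch: derive a classpath URL for a model from the document language
    // (REQ-3) and a variant that the user may override (REQ-4).
    public static String modelLocation(String artifactIdBase, String tool,
            String language, String userVariant, String defaultVariant) {
        String variant = userVariant != null ? userVariant : defaultVariant;
        return "classpath:/" + artifactIdBase.replace('.', '/') + "/lib/"
                + tool + "-" + language + "-" + variant + ".bin";
    }

    public static void main(String[] args) {
        // No user override: the default variant "fast" is used.
        System.out.println(modelLocation(
                "de.tudarmstadt.ukp.dkpro.core.example", "tagger", "de",
                null, "fast"));
        // User override: the "accurate" variant wins.
        System.out.println(modelLocation(
                "de.tudarmstadt.ukp.dkpro.core.example", "tagger", "de",
                "accurate", "fast"));
    }
}
```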

22.1. Versioning scheme

To version our packaged models, we use a date (yyyymmdd) and a counter (x). We use a date because often no (reliable) upstream version is available. E.g. with the Stanford NLP tools, the same model is sometimes included in different packages with different versions (e.g. parser models are included with the CoreNLP package and the parser package). TreeTagger models are not versioned at all. For OpenNLP, we are not sure if the models are versioned - they seem to be versioned only for compatibility with a particular OpenNLP version (e.g. 1.5) but appear to have no proper version of their own. If we know it, we use the date when the model was last changed; otherwise, we use the date when we first packaged the model and update it when we observe a model change.

We include additional metadata with the packaged model (e.g. which tagset is used) and we sometimes want to release packaged models with new metadata, although the upstream model itself has not changed. In such cases, we increment the counter. The counter starts at 0 if a new model is incorporated.

Thus, a model version has the format "yyyymmdd.x".
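A simple validity check for this format can be sketched as follows (ModelVersion is an illustrative name, not a DKPro Core class):

```java
import java.util.regex.Pattern;

public class ModelVersion {
    // The "yyyymmdd.x" scheme: an 8-digit date followed by a counter.
    private static final Pattern FORMAT = Pattern.compile("\\d{8}\\.\\d+");

    public static boolean isValid(String version) {
        return FORMAT.matcher(version).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("20120616.1")); // true - packaged model version
        System.out.println(isValid("1.5"));        // false - upstream-style version
    }
}
```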

23. Packaging resources

Resources needed by DKPro components (e.g. parser models or POS tagger models) are not packaged with the corresponding analysis components, but as separate JARs, one per language and model variant.

Due to license restrictions, we may not redistribute all of these resources. However, we offer Ant scripts to automatically download the resources and package them as DKPro-compatible JARs. When the license permits, we upload these to our public Maven repository.

If you need a non-redistributable resource (e.g. TreeTagger models) or just want to package the models yourself, here is how you do it.

23.1. Installing Ant in Eclipse

Our build.xml scripts require Ant 1.8.x. If you use an older Eclipse version, you may have to manually download and register a recent Ant version:

  • Download the latest Ant binaries from the website and unpack them in a directory of your choice.

  • Start Eclipse and go to Window > Preferences > Ant > Runtime and press Ant Home…​.

  • Select the Ant directory you just unpacked, then confirm.

23.2. Implementing a build.xml script

Models are usually large and we therefore package them separately from the components that use them. Each model becomes a JAR that is uploaded to our Maven repositories and added as a dependency in the projects that use them.

Often, models are single files, e.g. serialized Java objects that represent a parser model, POS tagger model, etc. The simplest case is that these files are distributed from some website. We then use an Ant script to download the file and package it as a JAR. We have defined custom Ant macros like install-model-file that make the process very convenient. The following code shows how we import the custom macros and define two targets, local-maven and separate-jars. The first just sets a property to cause install-model-file to copy the finished JAR into the local Maven repository (~/.m2/repository).

The versioning scheme for models is "yyyymmdd.x" where "yyyymmdd" is the date of the last model change (if known) or the date of packaging, and "x" is a counter unique per date, starting at 0. Please refer to the versioning scheme documentation for more information.

The model-building Ant script goes into src/scripts/build.xml within the project.

DKPro Core provides a set of Ant macros that help in packaging models. Typically, you will need one of the following two:

  • install-stub-and-upstream-file - if your model consists of a single file

  • install-stub-and-upstream-folder - if your model consists of multiple files.

When using install-stub-and-upstream-folder, the outputPackage property must end in lib, otherwise the generated artifacts will remain empty.

The ant-macros.xml file itself contains additional documentation on the macros and additional properties that can be set.

<project basedir="../.." default="separate-jars">
  <import>
    <url url="http://dkpro-core-asl.googlecode.com/svn/built-ant-macros/
      tags/0.7.0/ant-macros.xml"/>
  </import>

  <!--
      - Output package configuration
    -->
  <property name="outputPackage"
     value="de/tudarmstadt/ukp/dkpro/core/opennlp/lib"/>

  <target name="local-maven">
    <property name="install-artifact-mode" value="local"/>
    <antcall target="separate-jars"/>
   </target>

  <target name="remote-maven">
    <property name="install-artifact-mode" value="remote"/>
    <antcall target="separate-jars"/>
  </target>

  <target name="separate-jars">
    <mkdir dir="target/download"/>

    <!-- FILE: models-1.5/en-pos-maxent.bin - - - - - - - - - - - - - -
      - 2012-06-16 | now        | db2cd70395b9e2e4c6b9957015a10607
      -->
    <get
      src="http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin"
      dest="target/download/en-pos-maxent.bin"
      skipexisting="true"/>
    <install-stub-and-upstream-file
      file="target/download/en-pos-maxent.bin"
      md5="db2cd70395b9e2e4c6b9957015a10607"
      groupId="de.tudarmstadt.ukp.dkpro.core"
      artifactIdBase="de.tudarmstadt.ukp.dkpro.core.opennlp"
      upstreamVersion="20120616"
      metaDataVersion="1"
      tool="tagger"
      language="en"
      variant="maxent"
      extension="bin" >
        <metadata>
          <entry key="pos.tagset" value="ptb"/>
        </metadata>
    </install-stub-and-upstream-file>
  </target>
</project>

The model file en-pos-maxent.bin is downloaded from the OpenNLP website and stored in a local cache directory, target/download/en-pos-maxent.bin. From there, install-stub-and-upstream-file picks it up and packages it as two JARs: 1) a JAR containing the DKPro Core metadata and a POM referencing the second JAR, and 2) a JAR containing the actual model file(s). The JAR file names derive from the artifactIdBase, tool, language, variant, upstreamVersion and metaDataVersion parameters. These parameters, along with the extension parameter, are also used to determine the package name and file name of the model in the JAR. They are determined as follows (mind that dots in the artifactIdBase turn into slashes, e.g. de.tud turns into de/tud):

Pattern used to place a resource within a JAR
{artifactIdBase}/lib/{tool}-{language}-{variant}.{extension}
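This placement pattern can be sketched in Java as follows (ModelPath is an illustrative name, purely for demonstration):

```java
public class ModelPath {
    // Builds the in-JAR location of a model from the packaging parameters;
    // dots in artifactIdBase become slashes.
    public static String resourcePath(String artifactIdBase, String tool,
            String language, String variant, String extension) {
        return artifactIdBase.replace('.', '/') + "/lib/"
                + tool + "-" + language + "-" + variant + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(resourcePath(
                "de.tudarmstadt.ukp.dkpro.core.opennlp",
                "tagger", "en", "maxent", "bin"));
        // prints: de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin
    }
}
```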

The following values are commonly used for tool:

  • token - tokenizer

  • sentence - sentence splitter

  • segmenter - tokenizer & sentence splitter combined

  • lemma - lemmatizer (also sometimes lemmatizer is used, but should not be used for new models)

  • tagger - part-of-speech tagger

  • morphtagger - morphological analyzer

  • ner - named-entity recognizer

  • parser - syntactic parser

  • depparser - dependency parser (also sometimes parser is used, but should not be used for new models)

  • coref - coreference resolver

The values for variant are very tool-dependent. Typically, the variant encodes parameters that were used during the creation of a model, e.g. which machine learning algorithm was used, which parameters it had, and on which data set it has been trained.

An md5 sum for the remote file must be specified to make sure we notice if the remote file changes or if the download is corrupt.
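The checksum comparison itself is plain MD5 hashing, as sketched below using the JDK's MessageDigest (the class name Md5Check is illustrative, not part of the build scripts):

```java
import java.security.MessageDigest;

public class Md5Check {
    // Computes the lower-case hex MD5 digest of the given bytes, comparable
    // to the md5 attribute recorded in build.xml.
    public static String md5(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // MD5 of empty input, as a sanity check of the implementation.
        System.out.println(md5(new byte[0]));
        // prints: d41d8cd98f00b204e9800998ecf8427e
    }
}
```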

The metadata added for the models is currently used to store tagset information, which is used to drive the tag-to-DKPro-UIMA-type mapping. The following values are commonly used as keys:

  • pos.tagset - part-of-speech tagset (ptb, ctb, stts, …​)

  • dependency.tagset - dependency relation labels, aka. syntactic functions (negra, ancora, …​)

  • constituent.tagset - constituent labels, aka. syntactic categories (ptb, negra, …​)

23.3. Running the build.xml script

For those modules where we support packaging resources as JARs, we provide an Ant script called build.xml, which is located in the corresponding module in the source repository.

build.xml is a script that can be run with Apache Ant (version 1.8.x or higher) and requires an internet connection.

You can find this script in the src/scripts folder of the module.

Depending on the script, various build targets are supported. Three of them are particularly important: separate-jars, local-maven, and remote-maven:

  • separate-jars downloads all resources from the internet, validates them against MD5 checksums, and packages them as DKPro-compatible JARs. The JARs are stored in the target folder. You can easily upload them to an Artifactory Maven repository. Artifactory automatically recognizes their group ID, artifact ID, and version. This may not work with other Maven repositories.

  • local-maven additionally installs the JARs into the local Maven repository on your computer. It assumes the default location of the repository at ~/.m2/repository. If you keep your repository in a different folder, specify it via the alt.maven.repo.path system property.

  • remote-maven additionally installs the JARs into a remote Maven repository. The repository to deploy to can be controlled via the system property alt.maven.repo.url. If the remote repository also requires authentication, use the system property alt.maven.repo.id to configure the credentials from the settings.xml that should be used. An alternative settings file can be configured using alt.maven.settings.

This target requires that you have installed maven-ant-tasks-2.1.3.jar in ~/.ant/lib.

It is recommended to open the build.xml file in Eclipse, run the local-maven target, and then restart Eclipse. Upon restart, Eclipse should automatically scan your local Maven repository. Thus, the new resource JARs should be available in the search dialog when you add dependencies in the POM editor.

23.4. Example: how to package TreeTagger binaries and models

TreeTagger and its models cannot be re-distributed with DKPro Core, you need to download it yourself. For your convenience, we included an Apache Ant script called build.xml in the src/scripts folder of the TreeTagger module. This script downloads the TreeTagger binaries and models and packages them as artifacts, allowing you to simply add them as dependencies in Maven.

To run the script, you need to have Ant 1.8.x installed and configured in Eclipse. This is already the case with Eclipse 3.7.x. If you use an older Eclipse version, please see the section below on installing Ant in Eclipse.

Now to build the TreeTagger artifacts:

  • Locate the Ant build script (build.xml) in the scripts directory (src/scripts) of the dkpro-core-treetagger-asl module.

  • Right-click it and choose Run As > External Tools Configurations. In the Targets tab, select local-maven, then run.

  • Read the license in the Ant console and - if you care - accept the license terms.

  • Wait for the build process to finish.

  • Restart Eclipse

To use the packaged TreeTagger resources, add them as Maven dependencies to your project (or add them to the classpath if you do not use Maven).

Note that in order to use TreeTagger you must have added at least the JAR with the TreeTagger binaries and one JAR with the model for the language you want to work with.

24. Updating a model

Whenever an existing model has a new release, it is good to update the build.xml file, changing:

  • URL for retrieving the model (if it has changed)

  • The version of the model (the date when the model was created, in yyyymmdd format)

After that, run the Ant script with the local-maven target, add the JARs to your project classpath, and check whether the existing unit tests work with the updated model. If they do, run the script again, this time with the remote-maven target. Then change the versions of the models in the dependency management section of the project's POM file, commit those changes, and move the new models from staging into the model repository on zoidberg.

24.1. MD5 checksum check fails

Not all of the resources are properly versioned by their maintainers (in particular TreeTagger binaries and models). We have observed that resources changed from one day to the next without any announcement or increase of the version number (if present at all). Thus, we validate all resources against an MD5 checksum stored in the build.xml file. This way, we can recognize if a remote resource has been changed. When this happens, we add a note to the build.xml file indicating when we noticed the MD5 change, and we update the version of the corresponding resource.

Since we do not test the build.xml files every day, you may get an MD5 checksum error when you try to package the resources yourself. If this happens, open the build.xml file with a text editor, locate the MD5 checksum that fails, update it and update the version of the corresponding resource. You can also tell us on the DKPro Core User Group and we will update the build.xml file.

25. Metadata

Typical metadata items for a model.

Almost all models should declare at least one tagset. We currently declare only the tagsets that a model produces, not those that it consumes.

Table 1. Tagsets
Entry Description

constituent.tagset

chunk.tagset

dependency.tagset

pos.tagset

morph.tagset

Table 2. Model properties
Entry Description

encoding

Deprecated, use model.encoding instead

model.encoding

The character encoding of the model. In particular relevant for native tools, e.g. TreeTagger or Sfst, as we communicate with them as external processes through pipes or files.

The Dublin Core (DC) metadata items are not (yet) widely used throughout the models. This might change in the future.

Table 3. Dublin Core metadata
Entry Description

DC.title

DC.creator

DC.identifier

DC.rights

Table 4. Component-specific metadata
Entry Description

mstparser.param.order

Used by the MstParser component to indicate the type of model

flushSequence

Used by the TreeTagger components to mark the boundary between two documents.

pos.tagset.tagSplitPattern

pos.tag.map.XXX

25.1. Low-level tagset mapping

Some models require minimalistic tag mapping, e.g. because the tags in the model are upper-case, while the tagset normally uses lower-case tags - or because there is a typo in a tag, etc. For this reason, DKPro Core offers a way of mapping tags directly in the model metadata provider such that DKPro Core components do not see the original model tags, but the fixed mapped tags.

A tag mapping is defined as <level>.tag.map.<originalTag>=<newTag>. For example:

Example tag mapping in properties file
pos.tag.map.nn=NN
Example tag mapping in build.xml
<entry key="pos.tag.map.nn" value="NN"/>
This functionality is being added successively and may not yet be available for all types of models.
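Applying such a mapping amounts to a simple property lookup with the original tag as fallback, as sketched below (TagMapping is an illustrative name, not the actual DKPro Core implementation):

```java
import java.util.Properties;

public class TagMapping {
    // Applies a <level>.tag.map.<originalTag>=<newTag> entry from the model
    // metadata; tags without a mapping pass through unchanged.
    public static String mapTag(Properties metadata, String level, String tag) {
        return metadata.getProperty(level + ".tag.map." + tag, tag);
    }

    public static void main(String[] args) {
        Properties metadata = new Properties();
        metadata.setProperty("pos.tag.map.nn", "NN");

        System.out.println(mapTag(metadata, "pos", "nn")); // NN - mapped
        System.out.println(mapTag(metadata, "pos", "VB")); // VB - unchanged
    }
}
```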