This document targets developers of DKPro Core components.

Implementing Components

1. General

1.1. Capabilities

All components should declare the types they consume by default as input and the types they produce by default as output. Some components may not know before runtime what they produce or consume; in such cases, nothing can be declared here.

Example of declaring input/output types on a component
@TypeCapability(
        inputs = {
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token",
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence" },
        outputs = {
            "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
public class OpenNlpPosTagger
        extends JCasAnnotator_ImplBase
{
    // Parameters, initialize(), and process() omitted
}

2. Analysis components

2.1. Base classes

The base classes for analysis components are provided by uimaFIT: JCasAnnotator_ImplBase and CasAnnotator_ImplBase.

2.2. Models

The ModelProviderBase class offers convenient support for working with model resources. The following code is taken from the OpenNlpPosTagger component. It shows how the POS Tagger model is addressed using a parametrized classpath URL with parameters for language and variant.

Model provider setup in OpenNlpPosTagger.initialize() (shortened)
// Use ModelProviderBase convenience constructor to set up a model provider that
// auto-detects most of its settings and is configured to use default variants.
// Auto-detection inspects the configuration parameter fields (@ConfigurationParameter)
// of the analysis engine class and looks for default parameters such as PARAM_LANGUAGE,
// PARAM_VARIANT, and PARAM_MODEL_LOCATION.
      modelProvider = new ModelProviderBase<POSTagger>(this, "opennlp", "tagger")
      {
          @Override
          protected POSTagger produceResource(InputStream aStream)
              throws Exception
          {
              // Load the POS tagger model from the location the model provider offers
              POSModel model = new POSModel(aStream);
              // Create a new POS tagger instance from the loaded model
              return new POSTaggerME(model);
          }
      };

The produceResource() method is called once the model has been located by CasConfigurableProviderBase. In the example above, it receives an InputStream opened on the model location.

Model provider use in OpenNlpPosTagger.process() (shortened)
CAS cas = aJCas.getCas();

// Document-specific configuration of model and mapping provider in process()
modelProvider.configure(cas);
List<Token> tokens = selectCovered(aJCas, Token.class, sentence);
String[] tokenTexts = toText(tokens).toArray(new String[tokens.size()]);

// Fetch the OpenNLP pos tagger instance configured with the right model and use it to
// tag the text
String[] tags = modelProvider.getResource().tag(tokenTexts);

2.3. Type mappings

The DKPro type system design provides two levels of abstraction on most annotations:

  • a generic annotation type, e.g. POS (part of speech) with a feature value containing the original tag produced by an analysis component, e.g. TreeTagger

  • a set of high-level types for very common categories, e.g. N (noun), V (verb), etc.

DKPro maintains mappings for commonly used tagsets, e.g. in the module dkpro-core-api-lexmorph-asl. They are named:

Naming scheme for tag mapping files
{language}-{tagset}-{layer}.map

The following values are commonly used for layer:

  • pos - part-of-speech tag mapping

  • morph - morphological features mapping

  • constituency - constituent tag mapping

  • dependency - dependency relation mapping

The mapping provider is created in the initialize() method of the UIMA component after the respective model provider. This is necessary because the mapping provider obtains the tagset information for the current model from the model provider.

Mapping provider setup in OpenNlpPosTagger.initialize() (shortened)
// General setup of the mapping provider in initialize()
mappingProvider = MappingProviderFactory.createPosMappingProvider(posMappingLocation,
        language, modelProvider);

In the process() method, the mapping provider is used to create a UIMA annotation. First, it is configured for the current document and model. Then, it is invoked for each tag produced by the tagger to obtain the UIMA type for the annotation to be created. If there is no mapping file, the mapping provider falls back to a suitable default annotation, e.g. POS for part-of-speech tags or NamedEntity for named entities.

Mapping provider use in OpenNlpPosTagger.process() (shortened)
// Mind that the mapping provider must be configured after the model provider as it uses
// the model metadata
mappingProvider.configure(cas);
// Convert the tag produced by the tagger to a UIMA type, create an annotation
// of this type, and add it to the document.
Type posTag = mappingProvider.getTagType(tag);
POS posAnno = (POS) cas.createAnnotation(posTag, t.getBegin(), t.getEnd());
// To save memory, we typically intern() tag strings
posAnno.setPosValue(internTags ? tag.intern() : tag);
posAnno.addToIndexes();

2.4. Default variants

A different default variant may be needed depending on the language. This can be configured by placing a properties file in the classpath and setting its location using setDefaultVariantsLocation(String). The key in the properties file is the language and the value is used as the default variant. These files should always reside in the lib sub-package of a component and use the naming convention:

{tool}-default-variants.map

The default variant file is a Java properties file which defines for each language which variant should be assumed as default. It is possible to declare a catch-all variant using *. This is used if none of the other default variants apply.

OpenNLP POS tagger default variants configuration
it=perceptron
*=maxent
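
The lookup rule described above (language-specific entry first, then the catch-all *) can be sketched in plain Java. This is only an illustration under the assumption that the file is read with java.util.Properties; the class DefaultVariants is hypothetical and is not the DKPro Core implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration, not part of DKPro Core: resolves the default
// variant for a language from the content of a default-variants file.
final class DefaultVariants
{
    private final Properties variants = new Properties();

    DefaultVariants(String propertiesText)
    {
        try {
            // Properties.load() parses the key=value lines of the file
            variants.load(new StringReader(propertiesText));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the variant registered for the language, or the catch-all (*)
    // variant if the language has no entry of its own. Returns null if
    // neither is defined.
    String resolve(String language)
    {
        String variant = variants.getProperty(language);
        return variant != null ? variant : variants.getProperty("*");
    }
}
```

With the OpenNLP configuration shown above, resolve("it") yields perceptron, while any other language falls through to maxent.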

Use the convenience constructor of ModelProviderBase to create model providers that are already correctly set up to use default variants:

public ModelProviderBase(Object aObject, String aShortName, String aType)
{
    setContextObject(aObject);

    setDefault(ARTIFACT_ID, "${groupId}." + aShortName + "-model-" + aType
            + "-${language}-${variant}");
    setDefault(LOCATION,
            "classpath:/${package}/lib/"+aType+"-${language}-${variant}.properties");
    setDefaultVariantsLocation("${package}/lib/"+aType+"-default-variants.map");

    addAutoOverride(ComponentParameters.PARAM_MODEL_LOCATION, LOCATION);
    addAutoOverride(ComponentParameters.PARAM_VARIANT, VARIANT);
    addAutoOverride(ComponentParameters.PARAM_LANGUAGE, LANGUAGE);

    applyAutoOverrides(aObject);
}
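
The defaults in this constructor contain placeholders such as ${language} and ${variant}. As a rough sketch of what such a substitution step could look like, the following standalone class replaces ${name} placeholders with values from a map. The class PlaceholderResolver and its behavior on unknown placeholders are assumptions made for illustration; the actual resolution logic in ModelProviderBase may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of DKPro Core: replaces ${name}
// placeholders in a template with values taken from a map.
final class PlaceholderResolver
{
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private PlaceholderResolver()
    {
        // No instances
    }

    static String resolve(String template, Map<String, String> values)
    {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = values.get(name);
            if (value == null) {
                // Assumption: an unknown placeholder is treated as an error
                throw new IllegalArgumentException("No value for placeholder: " + name);
            }
            // Copy the literal text before the placeholder, then its value
            result.append(template, last, matcher.start()).append(value);
            last = matcher.end();
        }
        // Copy the literal text after the last placeholder
        result.append(template, last, template.length());
        return result.toString();
    }
}
```

Applied to the LOCATION default for the type tagger, with language en and variant maxent, the template resolves to a classpath URL ending in lib/tagger-en-maxent.properties.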

3. I/O components

3.1. Base classes

The base classes for I/O components are located in the dkpro-core-api-io-asl module.

Most reader components are derived from JCasResourceCollectionReaderBase or ResourceCollectionReaderBase. These classes offer support for many common functionalities, e.g.:

  • common parameters like PARAM_SOURCE_LOCATION and PARAM_PATTERNS

  • reading from the file system, classpath, or ZIP archives

  • file-based compression (GZ, BZIP2, XZ)

  • handling of DocumentMetaData (initCas methods)

  • Ant-like include/exclude patterns

  • handling of default excludes and hidden files

  • progress reporting support

  • extensibility with custom Spring resource resolvers

Most writer components are derived from JCasFileWriter_ImplBase. This class offers support for common functionality such as:

  • common parameters like PARAM_TARGET_LOCATION

  • writing to the file system, classpath, or ZIP archives (getOutputStream methods)

  • file-based compression (GZ, BZIP2, XZ)

  • properly interpreting DocumentMetaData (getOutputStream methods)

  • overwrite protection

  • replacing file extensions

If an I/O component interacts with a different data source, e.g. a database, the base classes above are not suitable. Such readers should derive from the uimaFIT JCasCollectionReader_ImplBase (or CasCollectionReader_ImplBase) and writers from JCasAnnotator_ImplBase (or CasAnnotator_ImplBase). However, the developer should ensure that the component’s parameters reflect the standard DKPro Core reader/writer parameters defined in ComponentParameters (dkpro-core-api-parameters-asl module).

Testing

The testing module offers a convenient way to create unit tests for UIMA components.

4. Basic test setup

There are a couple of things useful in every unit test:

  • Redirecting the UIMA logging through log4j - DKPro Core uses log4j for logging in unit tests.

  • Printing the name of the test to the console before every test

  • Enabling extended index checks in UIMA (uima.exception_when_fs_update_corrupts_index)

To avoid repeating lengthy setup boilerplate in every unit test, add the following lines to your unit test class:

@Rule
public DkproTestContext testContext = new DkproTestContext();

Additional benefits you get from this testContext are:

  • getting the class name of the current test (getClassName())

  • getting the method name of the current test (getMethodName())

  • getting the name of a folder you can use to store test results (getTestOutputFolder()).

5. Unit test example

A typical unit test class consists of two parts:

  1. the test cases

  2. a runTest method - which sets up the pipeline required by the test and then calls TestRunner.runTest().

In the following example, mind that the text must be provided with spaces separating the tokens (thus there must be a space before the full stop at the end of the sentence) and with newline characters (\n) separating the sentences:

Typical unit test for an analysis component from the OpenNlpNamedEntityRecognizer test (shortened)
@Test
public void testEnglish()
    throws Exception
{
    // Run the test pipeline. Note the full stop at the end of a sentence is preceded by a
    // whitespace. This is necessary for it to be detected as a separate token!
    JCas jcas = runTest("en", "person", "SAP where John Doe works is in Germany .");

    // Define the reference data that we expect to get back from the test
    String[] namedEntity = { "[ 10, 18]NamedEntity(person) (John Doe)" };

    // Compare the annotations created in the pipeline to the reference data
    AssertAnnotations.assertNamedEntity(namedEntity, select(jcas, NamedEntity.class));
}

// Auxiliary method that sets up the analysis engine or pipeline used in the test.
// Typically, we have multiple tests per unit test file that each invoke this method.
private JCas runTest(String language, String variant, String testDocument)
    throws Exception
{
    AnalysisEngine engine = createEngine(OpenNlpNamedEntityRecognizer.class,
            OpenNlpNamedEntityRecognizer.PARAM_VARIANT, variant,
            OpenNlpNamedEntityRecognizer.PARAM_PRINT_TAGSET, true);

    // Here we invoke the TestRunner which performs basic whitespace tokenization and
    // sentence splitting, creates a CAS, runs the pipeline, etc. TestRunner explicitly
    // disables automatic model loading. Thus, models used in unit tests must be explicitly
    // made dependencies in the pom.xml file.
    return TestRunner.runTest(engine, language, testDocument);
}

Test cases for segmenter components should not make use of the TestRunner class, which already performs tokenization and sentence splitting internally.
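
The input convention for test documents (tokens separated by spaces, sentences separated by newlines) can be illustrated with a small standalone sketch. The class WhitespaceSegmenter below is hypothetical and is not the TestRunner implementation; it only shows how character offsets follow from this convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of DKPro Core: whitespace tokenization
// and newline-based sentence splitting with character offsets.
final class WhitespaceSegmenter
{
    // A span of text with begin (inclusive) and end (exclusive) offsets
    static final class Span
    {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text)
        {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // Tokens are maximal runs of non-whitespace characters
    static List<Span> tokens(String document)
    {
        List<Span> result = new ArrayList<>();
        int begin = -1;
        for (int i = 0; i <= document.length(); i++) {
            boolean boundary = i == document.length()
                    || Character.isWhitespace(document.charAt(i));
            if (!boundary && begin < 0) {
                begin = i;
            }
            else if (boundary && begin >= 0) {
                result.add(new Span(begin, i, document.substring(begin, i)));
                begin = -1;
            }
        }
        return result;
    }

    // Every non-empty line is treated as one sentence
    static List<Span> sentences(String document)
    {
        List<Span> result = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= document.length(); i++) {
            if (i == document.length() || document.charAt(i) == '\n') {
                if (i > begin) {
                    result.add(new Span(begin, i, document.substring(begin, i)));
                }
                begin = i + 1;
            }
        }
        return result;
    }
}
```

For the test sentence used above, the tokens John and Doe cover the offsets 10 to 18, which matches the reference annotation in the example. Because the full stop is preceded by a space, it becomes a token of its own.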

6. AssertAnnotations

The AssertAnnotations class offers various static methods to test if a component has properly created annotations of a certain kind. There are methods to test almost every kind of annotation supported by DKPro Core, e.g.:

  • assertToken

  • assertSentence

  • assertPOS

  • assertLemma

  • assertMorph

  • assertStem

  • assertNamedEntity

  • assertConstituents

  • assertChunks

  • assertDependencies

  • assertPennTree

  • assertSemanticPredicates

  • assertSemanticField

  • assertCoreference

  • assertTagset

  • assertTagsetMapping

  • assertValid - Tests implemented with TestRunner and IOTestRunner perform validation checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).

  • etc.

7. Testing I/O components

The IOTestRunner class offers convenient methods to test I/O components:

  • testRoundTrip can be used to test converting a format to CAS, converting it back and comparing it to the original

  • testOneWay instead is useful to read data and compare it to a reference file in a different format (e.g. CasDumpWriter format). It can also be used if a full round-trip is not possible because some information is lost or cannot be exported exactly as ingested from the original file.

The input file and reference file paths given to these methods are always considered relative to src/test/resources.

Example using testRoundTrip (Conll2006ReaderWriterTest)
testRoundTrip(
        Conll2006Reader.class, // the reader
        Conll2006Writer.class,  // the writer
        "conll/2006/fk003_2006_08_ZH1.conll"); // the input also used as output reference

Example using testOneWay (Conll2006ReaderWriterTest)
testOneWay(
        Conll2006Reader.class, // the reader
        Conll2006Writer.class,  // the writer
        "conll/2006/fi-ref.conll", // the reference file for the output
        "conll/2006/fi-orig.conll"); // the input file for the test

Example using testOneWay with extra parameters (BratReaderWriterTest)
testOneWay(
        createReaderDescription(Conll2009Reader.class), // the reader
        createEngineDescription(BratWriter.class, // the writer
                BratWriter.PARAM_WRITE_RELATION_ATTRIBUTES, true),
        "conll/2009/en-ref.ann", // the reference file for the output
        "conll/2009/en-orig.conll"); // the input file for the test

Type System

8. Types

To add a new type, first locate the relevant module. Typically types are added to an API module because types are supposed to be independent of individual analysis tools. In rare circumstances, a type may be added to an I/O or tool module, e.g. because the type is experimental and needs to be tested in the context of that module - or because the type is highly specific to that module.

Typically, there is only a single descriptor file called dkpro-types.xml. Within a module, we keep this descriptor in the folder src/main/resources, under the top-level package of the module followed by type. E.g. for the module

dkpro-core-api-semantics-asl

the type descriptor would be

src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml

For the time being, descriptors in src/main/resources/desc/type are also supported. However, this support is going to be removed in the future.

9. Type descriptors

If there is no suitable type descriptor file yet, create a new one.

When a new type system descriptor has been added to a module, it needs to be registered with uimaFIT. This happens by creating the file

src/main/resources/META-INF/org.apache.uima.fit/types.txt

consisting of a list of type system descriptor locations prefixed with classpath*:, e.g.:

classpath*:de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml

The type system location corresponds to the location within the classpath at runtime, thus src/main/resources is stripped from the beginning.
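
The relation between the source location of a descriptor and its entry in types.txt can be sketched as a simple string transformation. The helper class TypesTxtEntries is hypothetical and only illustrates the rule stated above: strip src/main/resources and prefix the rest with classpath*:.

```java
// Hypothetical illustration, not part of DKPro Core or uimaFIT: derives the
// types.txt entry for a type descriptor from its location in the module.
final class TypesTxtEntries
{
    private static final String RESOURCE_ROOT = "src/main/resources/";
    private static final String PREFIX = "classpath*:";

    private TypesTxtEntries()
    {
        // No instances
    }

    static String toEntry(String sourcePath)
    {
        if (!sourcePath.startsWith(RESOURCE_ROOT)) {
            throw new IllegalArgumentException(
                    "Descriptor is not located under " + RESOURCE_ROOT + ": " + sourcePath);
        }
        // The classpath location is the path relative to the resource root
        return PREFIX + sourcePath.substring(RESOURCE_ROOT.length());
    }
}
```

Passing the descriptor path of the dkpro-core-api-semantics-asl module to toEntry produces the classpath*: line shown in the example above.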

10. Documentation

10.1. Type descriptors

To play nicely with the automatic documentation generation system, the following points should be observed when creating a new type descriptor file:

Name

The name field of the type descriptor corresponds to the section under which the types declared in the descriptor appear. If the name field of a type descriptor is e.g. Syntax, all types declared in the file will appear under that heading in the documentation. Multiple descriptors can declare the same name; the types declared in them are listed in the documentation in alphabetical order.

Description

The description field should be empty. Instead, create a sectionIntroXXX.adoc file under src/main/asciidoc/typesystem-reference in the dkpro-core-doc module (XXX is the name of the section - see Name above).

Version

The version field should be set to ${version}. If it does not exist yet, create a file src/filter/filter.properties in the module that contains the new type descriptor with the following content:

version=${project.version}
timestamp=${maven.build.timestamp}

Also add the following section to the pom.xml file in the respective module:

<build>
        <resources>
                <resource>
                        <filtering>false</filtering>
                        <directory>src/main/resources</directory>
                        <excludes>
                                <exclude>desc/type/**/*</exclude>
                        </excludes>
                </resource>
                <resource>
                        <filtering>true</filtering>
                        <directory>src/main/resources</directory>
                        <includes>
                                <include>desc/type/**/*</include>
                        </includes>
                </resource>
        </resources>
        <pluginManagement>
                <plugins>
                        <plugin>
                                <groupId>org.apache.maven.plugins</groupId>
                                <artifactId>maven-dependency-plugin</artifactId>
                                <configuration>
                                        <usedDependencies>
                                                <usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.api.parameter-asl</usedDependency>
                                        </usedDependencies>
                                </configuration>
                        </plugin>
                </plugins>
        </pluginManagement>
</build>

Replace the pattern inside the include and exclude elements with the location of your type descriptor file, e.g. de/tudarmstadt/ukp/dkpro/core/api/semantics/type/*.xml.

10.2. Types

When creating a new type or feature, you can use HTML tags to format the description. Line breaks, indentation, etc. will not be preserved. Mind that the description will be placed into the JavaDoc for the generated JCas classes as well as into the auto-generated DKPro Core documentation.

11. JCas classes

Instead of pre-generating the JCas classes and storing them in version control, we use the jcasgen-maven-plugin to automatically generate JCas classes at build time. The automatic generation of JCas classes needs to be explicitly enabled for modules containing types. This is done by placing a file called .activate-run-jcasgen in the module root with the content

Marker to activate run-jcasgen profile.
Actually the content is irrelevant, but it is a good idea to place a note here regarding the purpose of the file.

However, in some cases we customized the JCas classes, e.g. we added the method DocumentMetaData.get(JCas). Such classes are excluded from automatic generation by placing their types in a second descriptor file called dkpro-types-customized.xml, e.g.

src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types-customized.xml

The dkpro-types-customized.xml descriptor must also be registered with uimaFIT in the types.txt file.

12. Compliance validation

Often a type comes with a certain policy. For example, root nodes in a dependency relation tree should have the type ROOT and the features governor and dependent should point to the same token. Another example would be that if a constituent is a child of another constituent, then its parent feature should be set accordingly.

To ensure that all components adhere to such policies, it is a good idea to implement checks for them. This can be done simply by placing a new check implementation into the package de.tudarmstadt.ukp.dkpro.core.testing.validation.checks in the testing module. Tests implemented with TestRunner and IOTestRunner use these checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).

Example check ensuring that parent of constituents and tokens is properly set (shortened)
@Override
public boolean check(JCas aJCas, List<Message> aMessages)
{
    for (Constituent parent : select(aJCas, Constituent.class)) {
        Collection<Annotation> children = select(parent.getChildren(), Annotation.class);
        for (Annotation child : children) {
            Annotation declParent = FSUtil.getFeature(child, "parent", Annotation.class);

            if (declParent == null) {
                aMessages.add(new Message(this, ERROR, String.format(
                        "Child without parent set: %s", child)));

            }
            else if (declParent != parent) {
                aMessages.add(new Message(this, ERROR, String.format(
                        "Child points to wrong parent: %s", child)));

            }
        }
    }

    return aMessages.stream().anyMatch(m -> m.level == ERROR);
}

Models and Resources

This section explains how resources, such as models, are packaged, distributed, and used within DKPro Core.

13. Architecture

The architecture for resources (e.g. parser models, POS tagger models, etc.) in DKPro is still a work in progress. However, a couple of cornerstones have already been established.

  • REQ-1 - Addressable by URL: Resources must be addressable using a URL, typically a classpath URL (classpath:/de/tudarmstadt/…​/model.bin) or a file URL (file:///home/model.bin). Remote URLs like HTTP should not be used and may not be supported.

  • REQ-2 - Maven compatible: Resources are packaged in JARs and can be downloaded from our Maven repositories (if the license permits).

  • REQ-3 - Document-sensitive: A component should dynamically determine at runtime which resource to use based on properties of a processed document, e.g. based on the document language. This may change from one document to the next.

  • REQ-4 - Overridable: The user should be able to override the model or provide additional information as to what specific variant of a resource should be used. For example, if there are two resources for the language de, de-fast and de-accurate, the component could use de-fast by default unless the user selects the variant accurate or specifies a different model altogether.

  • REQ-5 - Loadable from classpath: Due to REQ-1, REQ-2, and REQ-3 models must be resolvable from the classpath.

    • ResourceUtils.resolveLocation(String, Object, UimaContext)

    • Resource Providers (see below)

    • PathMatchingResourcePatternResolver
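
To illustrate REQ-1 and REQ-5, the following standalone sketch resolves a location string that is either a classpath URL or a regular URL such as a file URL. It is an assumption-laden illustration using only the JDK; the class ResourceLocations is hypothetical and does not reproduce the behavior of ResourceUtils.resolveLocation or of the resource providers.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical illustration, not part of DKPro Core: resolves a location
// given either as classpath:/... or as a regular URL (e.g. file:///...).
final class ResourceLocations
{
    private static final String CLASSPATH_PREFIX = "classpath:";

    private ResourceLocations()
    {
        // No instances
    }

    static URL resolve(String location)
    {
        if (location.startsWith(CLASSPATH_PREFIX)) {
            String path = location.substring(CLASSPATH_PREFIX.length());
            // Resource names passed to a ClassLoader have no leading slash
            while (path.startsWith("/")) {
                path = path.substring(1);
            }
            URL url = ResourceLocations.class.getClassLoader().getResource(path);
            if (url == null) {
                throw new IllegalArgumentException("Not found on classpath: " + location);
            }
            return url;
        }
        try {
            // Any other location is expected to be an absolute URL
            return URI.create(location).toURL();
        }
        catch (MalformedURLException e) {
            throw new IllegalArgumentException("Unsupported location: " + location, e);
        }
    }
}
```

A classpath location only resolves if the JAR containing the resource is actually on the classpath, which is why models are added as Maven dependencies.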

13.1. Versioning scheme

To version our packaged models, we use a date (yyyymmdd) and a counter (x). We use a date because often no (reliable) upstream version is available. E.g. with the Stanford NLP tools, the same model is sometimes included in different packages with different versions (e.g. parser models are included with the CoreNLP package and the parser package). TreeTagger models are not versioned at all. With the OpenNLP models, we are not sure if they are versioned - it seems they are just versioned for compatibility with a particular OpenNLP version (e.g. 1.5.) but have no proper version of their own. If we know it, we use the date when the model was last changed; otherwise we use the date when we first packaged the model and update it when we observe a model change.

We include additional metadata with the packaged model (e.g. which tagset is used) and we sometimes want to release packaged models with new metadata, although the upstream model itself has not changed. In such cases, we increment the counter. The counter starts at 0 if a new model is incorporated.

Thus, a model version has the format "yyyymmdd.x".
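
As a small sketch of this format, the following helper builds a version string from a date and a counter. The class ModelVersions is hypothetical; it only demonstrates the "yyyymmdd.x" layout using the JDK date formatter.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration, not part of DKPro Core: formats a model version
// as yyyymmdd.x from the model date and the metadata counter.
final class ModelVersions
{
    private ModelVersions()
    {
        // No instances
    }

    static String format(LocalDate modelDate, int counter)
    {
        if (counter < 0) {
            throw new IllegalArgumentException("Counter must not be negative: " + counter);
        }
        // BASIC_ISO_DATE renders a LocalDate without separators, e.g. 20120616
        return modelDate.format(DateTimeFormatter.BASIC_ISO_DATE) + "." + counter;
    }
}
```

For a model last changed on 16 June 2012 and packaged with counter 1, format(LocalDate.of(2012, 6, 16), 1) gives 20120616.1.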

14. Packaging resources

Resources needed by DKPro components (e.g. parser models or POS tagger models) are not packaged with the corresponding analysis components, but as separate JARs, one per language and model variant.

Due to license restrictions, we may not redistribute all of these resources. However, we offer Ant scripts to automatically download the resources and package them as DKPro-compatible JARs. When the license permits, we upload these to our public Maven repository.

If you need a non-redistributable resource (e.g. TreeTagger models) or just want to package the models yourself, here is how you do it.

14.1. Installing Ant in Eclipse

Our build.xml scripts require Ant 1.8.x. If you use an older Eclipse version, you may have to manually download and register a recent Ant version:

  • Download the latest Ant binaries from the website and unpack them in a directory of your choice.

  • Start Eclipse and go to Window > Preferences > Ant > Runtime and press Ant Home…​.

  • Select the Ant directory you just unpacked, then confirm.

14.2. Implementing a build.xml script

Models are usually large and we therefore package them separately from the components that use them. Each model becomes a JAR that is uploaded to our Maven repositories and added as a dependency in the projects that use them.

Often, models are single files, e.g. serialized Java objects that represent a parser model, POS tagger model, etc. In the simplest case, these files are distributed from some website. We then use an Ant script to download the file and package it as a JAR. We defined custom Ant macros like install-stub-and-upstream-file that make the process very convenient. The build.xml example in this section shows how we import the custom macros and define the targets local-maven, remote-maven, and separate-jars. The first just sets a property to cause install-stub-and-upstream-file to copy the finished JAR into the local Maven repository (~/.m2/repository).

The versioning scheme for models is "yyyymmdd.x" where "yyyymmdd" is the date of the last model change (if known) or the date of packaging and "x" is a counter unique per date starting at 0. Please refer to the versioning scheme documentation for more information.

The model-building Ant script is placed at src/scripts/build.xml within the project.

DKPro Core provides a set of Ant macros that help in packaging models. Typically, you will need one of the following two:

  • install-stub-and-upstream-file - if your model consists of a single file

  • install-stub-and-upstream-folder - if your model consists of multiple files.

When using install-stub-and-upstream-folder, the outputPackage property must end in lib, otherwise the generated artifacts will remain empty.

The ant-macros.xml file itself contains additional documentation on the macros and additional properties that can be set.

<project basedir="../.." default="separate-jars">
  <import>
    <url url="http://dkpro-core-asl.googlecode.com/svn/built-ant-macros/tags/0.7.0/ant-macros.xml"/>
  </import>

  <!--
      - Output package configuration
    -->
  <property name="outputPackage"
     value="de/tudarmstadt/ukp/dkpro/core/opennlp/lib"/>

  <target name="local-maven">
    <property name="install-artifact-mode" value="local"/>
    <antcall target="separate-jars"/>
   </target>

  <target name="remote-maven">
    <property name="install-artifact-mode" value="remote"/>
    <antcall target="separate-jars"/>
  </target>

  <target name="separate-jars">
    <mkdir dir="target/download"/>

    <!-- FILE: models-1.5/en-pos-maxent.bin - - - - - - - - - - - - - -
      - 2012-06-16 | now        | db2cd70395b9e2e4c6b9957015a10607
      -->
    <get
      src="http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin"
      dest="target/download/en-pos-maxent.bin"
      skipexisting="true"/>
    <install-stub-and-upstream-file
      file="target/download/en-pos-maxent.bin"
      md5="db2cd70395b9e2e4c6b9957015a10607"
      groupId="de.tudarmstadt.ukp.dkpro.core"
      artifactIdBase="de.tudarmstadt.ukp.dkpro.core.opennlp"
      upstreamVersion="20120616"
      metaDataVersion="1"
      tool="tagger"
      language="en"
      variant="maxent"
      extension="bin" >
        <metadata>
          <entry key="pos.tagset" value="ptb"/>
        </metadata>
    </install-stub-and-upstream-file>
  </target>
</project>

The model file en-pos-maxent.bin is downloaded from the OpenNLP website and stored in the local cache directory as target/download/en-pos-maxent.bin. From there, install-stub-and-upstream-file picks it up and packages it as two JARs: 1) a JAR containing the DKPro Core metadata and a POM referencing the second JAR, and 2) a JAR containing the actual model file(s). The JAR file names derive from the artifactIdBase, tool, language, variant, upstreamVersion and metaDataVersion parameters. These parameters, along with the extension parameter, are also used to determine the package name and file name of the model in the JAR. They are determined as follows (mind that dots in the artifactIdBase turn into slashes, e.g. de.tud turns into de/tud):

Pattern used to place a resource within a JAR
{artifactIdBase}/lib/{tool}-{language}-{variant}.{extension}
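
The pattern can be sketched as a string transformation. The class ModelResourcePaths is hypothetical and is not the logic of the Ant macros; it only applies the pattern above, including the conversion of dots in the artifactIdBase to slashes.

```java
// Hypothetical illustration, not part of DKPro Core: computes the location of
// a model inside its JAR from the parameters passed to the packaging macro.
final class ModelResourcePaths
{
    private ModelResourcePaths()
    {
        // No instances
    }

    static String resourcePath(String artifactIdBase, String tool, String language,
            String variant, String extension)
    {
        // Dots in the artifactIdBase become slashes, e.g. de.tud becomes de/tud
        String packagePath = artifactIdBase.replace('.', '/');
        return packagePath + "/lib/" + tool + "-" + language + "-" + variant + "."
                + extension;
    }
}
```

With the values from the build.xml example, the model ends up at de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin, which is consistent with the outputPackage property defined there.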

The following values are commonly used for tool:

  • token - tokenizer

  • sentence - sentence splitter

  • lemmatizer - lemmatizer

  • tagger - part-of-speech tagger

  • morphtagger - morphological analyzer

  • ner - named-entity recognizer

  • parser - syntactic or dependency parser

  • coref - coreference resolver

The values for variant are very tool-dependent. Typically, the variant encodes parameters that were used during the creation of a model, e.g. which machine learning algorithm was used, which parameters it had, and on which data set it was created.

An MD5 checksum for the remote file must be specified so that we notice if the remote file changes or if the download is corrupt.
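
The kind of check this enables can be sketched with the JDK alone. The class Md5Check is hypothetical and is not the code used by the Ant macros; it computes the MD5 checksum of some bytes as a hexadecimal string and compares it to an expected value.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration, not part of DKPro Core: verifies data against an
// expected MD5 checksum given as a hexadecimal string.
final class Md5Check
{
    private Md5Check()
    {
        // No instances
    }

    static String md5Hex(byte[] data)
    {
        try {
            byte[] hash = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value and print as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static boolean matches(byte[] data, String expectedMd5)
    {
        return md5Hex(data).equalsIgnoreCase(expectedMd5.trim());
    }
}
```

To check a downloaded file, its content could be read with java.nio.file.Files.readAllBytes and passed to matches together with the checksum recorded in build.xml.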

The metadata added for the models is currently used to store tagset information, which is used to drive the tag-to-DKPro-UIMA-type mapping. The following values are commonly used as keys:

  • pos.tagset - part-of-speech tagset (ptb, ctb, stts, …​)

  • dependency.tagset - dependency relation labels, aka. syntactic functions (negra, ancora, …​)

  • constituent.tagset - constituent labels, aka. syntactic categories (ptb, negra, …​)

14.3. Running the build.xml script

For those modules where we support packaging resources as JARs, we provide an Ant script called build.xml which is located in the corresponding module in the SVN.

build.xml is a script that can be run with Apache Ant (version 1.8.x or higher) and requires an internet connection.

You can find this script in the src/scripts folder of the module.

Depending on the script, various build targets are supported. Three of them are particularly important: separate-jars, local-maven, and remote-maven:

  • separate-jars downloads all resources from the internet, validates them against MD5 checksums and packages them as DKPro-compatible JARs. The JARs are stored in the target folder. You can easily upload them to an Artifactory Maven repository. Artifactory automatically recognizes their group ID, artifact ID and version. This may not work with other Maven repositories.

  • local-maven additionally installs the JARs into the local Maven repository on your computer. It assumes the default location of the repository at ~/.m2/repository. If you keep your repository in a different folder, specify it via the alt.maven.repo.path system property.

  • remote-maven additionally installs the JARs into a remote Maven repository. The repository to deploy to can be controlled via the system property alt.maven.repo.url. If the remote repository also requires authentication, use the system property alt.maven.repo.id to configure the credentials from the settings.xml that should be used. An alternative settings file can be configured using alt.maven.settings.

This target requires that you have installed maven-ant-tasks-2.1.3.jar in ~/.ant/lib.

It is recommended to open the build.xml file in Eclipse, run the local-maven target, and then restart Eclipse. Upon restart, Eclipse should automatically scan your local Maven repository. Thus, the new resource JARs should be available in the search dialog when you add dependencies in the POM editor.

14.4. Example: how to package TreeTagger binaries and models

TreeTagger and its models cannot be redistributed with DKPro Core; you need to download them yourself. For your convenience, we included an Apache Ant script called build.xml in the src/scripts folder of the TreeTagger module. This script downloads the TreeTagger binaries and models and packages them as artifacts, allowing you to simply add them as dependencies in Maven.

To run the script, you need to have Ant 1.8.x installed and configured in Eclipse. This is already the case with Eclipse 3.7.x. If you use an older Eclipse version, please see the section above on installing Ant in Eclipse.

Now to build the TreeTagger artifacts:

  • Locate the Ant build script (build.xml) in the scripts directory (src/scripts) of the dkpro-core-treetagger-asl module.

  • Right-click the file and choose Run As > External Tools Configurations. In the Target tab, select local-maven, then run the configuration.

  • Read the license in the Ant console and, if you agree, accept the license terms.

  • Wait for the build process to finish.

  • Restart Eclipse.

To use the packaged TreeTagger resources, add them as Maven dependencies to your project (or add them to the classpath if you do not use Maven).

Note that in order to use TreeTagger you must have added at least the JAR with the TreeTagger binaries and one JAR with the model for the language you want to work with.

15. Updating a model

Whenever a new release of an existing model becomes available, update the build.xml by changing:

  • the URL for retrieving the model (if it has changed)

  • the version of the model (the date on which the model was created, in yyyymmdd format)

After that, run the Ant script with the local-maven target, add the JARs to your project classpath, and check that the existing unit tests pass with the updated model. If they do, run the script again, this time with the remote-maven target. Then change the model versions in the dependency management section of the project’s POM file, commit those changes, and move the new models from staging into the model repository on zoidberg.
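As a small aid for the version convention mentioned above, the following Java sketch turns a model's creation date into a yyyymmdd version string and back. The class name and the example date are hypothetical. Note that in Java date patterns the month is written MM, since mm stands for minutes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; not part of DKPro Core.
final class ModelVersion
{
    // Four-digit year, two-digit month, two-digit day, without separators.
    private static final DateTimeFormatter YYYYMMDD = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Formats the creation date of a model as a version string.
    static String fromDate(LocalDate created)
    {
        return created.format(YYYYMMDD);
    }

    // Parses a version string back into a date; throws DateTimeParseException
    // if the string is not a valid date in this format.
    static LocalDate toDate(String version)
    {
        return LocalDate.parse(version, YYYYMMDD);
    }

    public static void main(String[] args)
    {
        // Example date chosen for illustration only.
        System.out.println(fromDate(LocalDate.of(2011, 4, 12)));
    }
}
```

Parsing the string back with toDate is a cheap way to catch typos such as an impossible month before the version is written into build.xml.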

15.1. MD5 checksum check fails

Not all resources are properly versioned by their maintainers (in particular the TreeTagger binaries and models). We have observed resources changing from one day to the next without any announcement or increase of the version number (if there is one at all). Therefore, we validate all resources against an MD5 checksum stored in the build.xml file. This way, we can recognize when a remote resource has changed. When this happens, we add a note to the build.xml file indicating when we noticed that the MD5 changed, and we update the version of the corresponding resource.

Since we do not test the build.xml files every day, you may get an MD5 checksum error when you try to package the resources yourself. If this happens, open the build.xml file with a text editor, locate the MD5 checksum that fails, update it and update the version of the corresponding resource. You can also tell us on the DKPro Core User Group and we will update the build.xml file.
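The checksum comparison described above can be sketched in Java as follows. This is a minimal illustration using the standard java.security.MessageDigest class, not the code used by the Ant scripts. The class name Md5Check and its command-line arguments are hypothetical. It can also be used to compute the new checksum of a downloaded file before updating build.xml.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of DKPro Core or of the Ant build scripts.
final class Md5Check
{
    // Returns the MD5 digest of the data as 32 lowercase hexadecimal digits.
    static String md5Hex(byte[] data)
    {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Compares the actual checksum with the expected one, ignoring case and
    // surrounding whitespace in the expected value.
    static boolean matches(byte[] data, String expectedMd5)
    {
        return md5Hex(data).equalsIgnoreCase(expectedMd5.trim());
    }

    public static void main(String[] args)
        throws Exception
    {
        if (args.length != 2) {
            System.err.println("Usage: Md5Check <file> <expected-md5>");
            return;
        }
        // Reads the whole file into memory, which is fine for a sketch.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(data, args[1])
                ? "OK"
                : "MISMATCH, actual MD5 is " + md5Hex(data));
    }
}
```

For very large model files, a streaming variant that feeds the digest in chunks would avoid loading the whole file at once.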

16. Metadata

Typical metadata items for a model.

Almost all models should declare at least one tagset. We currently declare only the tagsets that a model produces, not those that it consumes.

Table 1. Tagsets
Entry                 Description
constituent.tagset
chunk.tagset
dependency.tagset
pos.tagset
morph.tagset

Table 2. Model properties
Entry             Description
encoding          Deprecated, use model.encoding instead.
model.encoding    The character encoding of the model. This is particularly relevant for native tools, e.g. TreeTagger and Sfst, because we communicate with them as external processes through pipes or files.
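A consumer of these properties might prefer model.encoding and fall back to the deprecated encoding entry for older models. The following Java sketch shows that lookup order. It assumes the metadata is available as a java.util.Properties object, and the class name, method name, and default value are hypothetical. How DKPro Core itself reads the metadata is not shown here.

```java
import java.util.Properties;

// Hypothetical helper; not part of DKPro Core.
final class ModelMetadata
{
    // Looks up the model encoding: model.encoding first, then the deprecated
    // encoding entry, then the caller-supplied default.
    static String encoding(Properties meta, String defaultEncoding)
    {
        String value = meta.getProperty("model.encoding");
        if (value == null) {
            value = meta.getProperty("encoding"); // deprecated entry
        }
        return value != null ? value : defaultEncoding;
    }

    public static void main(String[] args)
    {
        Properties meta = new Properties();
        meta.setProperty("encoding", "ISO-8859-1"); // an older model's metadata
        System.out.println(encoding(meta, "UTF-8"));
    }
}
```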

The Dublin Core (DC) metadata items are not (yet) widely used throughout the models. This might change in the future.

Table 3. Dublin Core metadata
Entry            Description
DC.title
DC.creator
DC.identifier
DC.rights

Table 4. Component-specific metadata
Entry                         Description
mstparser.param.order         Used by the MstParser component to indicate the type of model.
flushSequence                 Used by the TreeTagger components to mark the boundary between two documents.
pos.tagset.tagSplitPattern
pos.tag.map.XXX