This document targets developers of DKPro Core components.
Setup
1. GIT
All DKPro Core files are stored using UNIX line endings. If you develop on Windows, you have to set the core.autocrlf configuration setting to input to avoid accidentally submitting Windows line endings to the repository. Using input is a good strategy in most cases, thus you should consider setting this as a global (add --global) or even as a system (--system) setting.
C:\> git config --global core.autocrlf input
After changing this setting, it is best to do a fresh clone and check-out of DKPro Core.
2. Eclipse
2.1. Use a JDK
On Linux or OS X, the following setting is not necessary; having a full JDK installed on your system is generally sufficient. You can skip to the next section.
On Windows, you need to edit the eclipse.ini file and add the following two lines directly before the -vmargs line. Make sure to replace C:/Program Files/Java/jdk1.8.0_144 with the actual location of the JDK on your system. Without this, Eclipse will complain that the jdk.tools:jdk.tools artifact is missing.
-vm
C:/Program Files/Java/jdk1.8.0_144/jre/bin/server/jvm.dll
2.2. Required Plugins
- Maven Integration: m2e, which already comes pre-installed with the Eclipse IDE for Java Developers. If you use another edition of Eclipse which does not have m2e pre-installed, go to Help → Install New Software, select --All available sites-- and choose Collaboration → m2e - Maven Integration for Eclipse
- Apache UIMA tools: Update site: http://www.apache.org/dist/uima/eclipse-update-site/
- Groovy: Find the applicable update site here: https://github.com/groovy/groovy-eclipse/wiki. Make sure to install at least the Groovy Eclipse Feature, the Groovy Compiler (2.4), and the Groovy M2E Integration.
- Checkstyle Eclipse plugin: Update site: http://eclipse-cs.sf.net/update
- Checkstyle configuration plugin for M2Eclipse: Update site: http://m2e-code-quality.github.com/m2e-code-quality/site/latest/
2.3. Workspace Preferences
The following settings are recommended for the Eclipse Workspace Preferences:
Setting | Value
---|---
General → Workspace → Text file encoding | UTF-8
General → Workspace → New text file line delimiter | Unix
General → Editors → Text Editors → Displayed tab width | 2
General → Editors → Text Editors → Insert spaces for tabs | true
General → Editors → Text Editors → Show print margin | true
General → Editors → Text Editors → Print margin column | 100
XML → XML Files → Editor → Line width | 100
XML → XML Files → Editor → Format comments | false
XML → XML Files → Editor → Indent using spaces | selected
XML → XML Files → Editor → Indentation size | 2
2.4. Import
In Eclipse, go to File → Import, choose Existing Maven projects, and select the folder to which you have cloned DKPro Core. Eclipse should automatically detect all modules. Mind that DKPro Core is a large project and it takes significant time until all dependencies have been downloaded and until the first build is complete.
Adding Modules
3. Module Naming Scheme
The name is the first thing to consider when creating a new module.
Although the modules are technically all the same, the naming scheme distinguishes between the following types of modules:
- API modules (dkpro-core-api-NAME-asl) - these modules contain common base classes, utility classes, type system definitions and JCas classes. Since API modules are used in many places, they must be licensed under the Apache License.
- IO modules (dkpro-core-io-NAME-LIC) - these modules contain reader and writer components. They are usually named after the file format (e.g. lif) or family of file formats they support (e.g. conll).
- FS modules (dkpro-core-fs-NAME-LIC) - these modules contain support for specific file systems. They are usually named after the file system type they support (e.g. hdfs).
- Component modules (dkpro-core-NAME-LIC) - these modules contain processing components. They are usually named after the tool or library that is wrapped (e.g. treetagger or corenlp).
In addition to the four categories, there are a few unique modules which do not fall into any of these categories, e.g. de.tudarmstadt.ukp.dkpro.core.testing-asl or de.tudarmstadt.ukp.dkpro.core.doc-asl.
DKPro Core is in a transition phase from the traditional but deprecated naming scheme
(de.tudarmstadt.ukp.dkpro.core… ) to the new naming scheme (org.dkpro.core… ). Many modules
still make use of the old naming scheme.
|
The naming scheme applies on several occasions:
- module folder - the sub-folder within the DKPro Core project which contains the module
- artifactId - the Maven artifactId as recorded in the pom.xml file. The groupId should always be org.dkpro.core.
- Java packages - the module name translates roughly into the Java package names, e.g. the root Java package in the module dkpro-core-io-lif-asl is org.dkpro.core.io.lif.
4. Module Hierarchy
Once you have decided on a name for the new module, you proceed by creating a new module folder. Module folders are created directly under the root of the DKPro Core source tree.
Although the folder structure of DKPro Core appears as if there were a flat list of modules, there is actually a shallow hierarchy of modules (i.e. the folder hierarchy does not correspond to the Maven module hierarchy).
The DKPro Core Parent POM is at the root of the module and of the folder hierarchy. Its parent is the DKPro Parent POM which is maintained in a separate source tree and which follows its own release cycle. The DKPro Parent POM defines a set of default settings, profiles, and managed dependencies useful for all DKPro projects. The DKPro Core Parent POM defines settings specific to DKPro Core.
DKPro Parent POM
  DKPro Core Parent POM
    DKPro Core ASL Parent POM
      ... DKPro Core ASL modules ...
    DKPro Core GPL Parent POM
      ... DKPro Core GPL modules ...
    DKPro Core Documentation
New modules are added either to the <modules> section of the DKPro Core ASL Parent POM or to that of the DKPro Core GPL Parent POM, depending on whether the new module can be licensed under the Apache License or whether it has to be licensed under the GPL due to a GPL dependency. These two parent POMs configure different sets of license checkers: for ASL modules, the Apache RAT Maven Plugin is used; for the GPL modules, the License Maven Plugin is used.
Note that the <modules>
section in these POMs points to the folders which contain the respective
modules. Since the folder hierarchy is flat (unlike the module hierarchy), the module names here
need to be prefixed with ../
.
<modules>
<!-- API modules -->
<module>../dkpro-core-api-anomaly-asl</module>
<module>../dkpro-core-api-coref-asl</module>
...
<!-- FS modules -->
<module>../dkpro-core-fs-hdfs-asl</module>
<!-- IO modules -->
<module>../dkpro-core-io-aclanthology-asl</module>
<module>../dkpro-core-io-ancora-asl</module>
...
<!-- Processing modules -->
<module>../dkpro-core-castransformation-asl</module>
<module>../dkpro-core-cisstem-asl</module>
...
</modules>
In addition to adding a new module to the <modules> section of the respective parent POM, it also needs to be added to the <dependencyManagement> section of this POM:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.dkpro.core</groupId>
<artifactId>dkpro-core-foo-asl</artifactId>
<version>2.2.0</version>
</dependency>
...
</dependencies>
</dependencyManagement>
If you create a GPLed module, copy the .license-header.txt file from another GPLed module
to your new module in order to properly configure the license checker. Mind that the GPL license
text in XML files must be indented by 4 spaces and will not be recognized otherwise. You may
have to adjust the text, depending on whether the module can be licensed under GPLv3 or has to
be licensed under GPLv2.
|
If not all of the dependencies of your new module are available from Maven Central or JCenter, then add the module within the <modules> and <dependencyManagement> sections located under the deps-not-on-maven-central profile within the respective parent POM. Also add the required third-party repositories there if necessary.
|
5. Basic POM
Next, you create a basic POM inside your new module folder. Below is an example of a minimal
POM for a new Apache-licensed component module. If you create a GPL-licensed module instead,
replace the -asl
suffixes with -gpl
and copy the license header from another GPLed module.
<!--
Licensed to the Technische Universität Darmstadt under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The Technische Universität Darmstadt
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core-asl</artifactId>
<version>2.2.0</version>
<relativePath>../dkpro-core-asl</relativePath>
</parent>
<groupId>org.dkpro.core</groupId>
<artifactId>dkpro-core-foo-asl</artifactId>
<packaging>jar</packaging>
<name>DKPro Core ASL - Foo NLP Suite (v ${foo.version})</name>
<properties>
<foo.version>1.8.2</foo.version>
</properties>
<dependencies>
</dependencies>
</project>
6. Library Dependencies
In order to avoid unpleasant surprises, DKPro Core uses the Maven Dependency Plugin to check if
all dependencies used directly within the code of a module are also explicitly declared in the
module POM. If this is not the case, the automated builds fail (they run with -DfailOnWarning
).
This means you have to declare dependencies for all libraries that you are using directly
from your code in the <dependencies>
section. If a dependency is only required during testing,
it must be marked with <scope>test</scope>
. Below, you find a few typical libraries used in many
modules. Note that there is no version defined for these dependencies. The versions for many
libraries used by multiple modules in DKPro Core are defined in the DKPro Core Parent POM.
Only libraries that are specific to a particular module, e.g. the specific NLP library wrapped,
should have their versions defined within the module POM.
<dependency>
<groupId>org.apache.uima</groupId>
<artifactId>uimaj-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.uima</groupId>
<artifactId>uimafit-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
</dependency>
<dependency>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core.api.parameter-asl</artifactId>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core.testing-asl</artifactId>
<scope>test</scope>
</dependency>
You may notice the foo.version property in the minimal POM example above. This property should be used to set the version of the wrapped NLP library. It should appear in the name of the module as well as in the specific dependency for the wrapped library.
<dependency>
<groupId>org.foo.nlp</groupId>
<artifactId>foo-nlp-suite</artifactId>
<version>${foo.version}</version>
</dependency>
7. Model Dependencies
When you package models for your new component, they need special treatment in the POM. First, although it is a good idea to create unit tests based on the models, most often you do not want to download all models and run all unit tests during a normal developer build (some models are very large and may quickly fill up your hard disk). Second, the Maven Dependency Plugin is unable to detect that your code or tests make use of the models, and it needs to be configured in a special way to allow the build to pass even though it considers the model dependencies as unnecessary.
So assuming you have a model for your component, first add it to the <dependencyManagement> section of the POM - here you specify the version but not the scope. All models you have get added to this section, irrespective of whether you want to use them for testing or not.
<dependencyManagement>
<dependencies>
<dependency>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
<version>20120616.1</version>
</dependency>
</dependencies>
</dependencyManagement>
If you also want to use the model for testing, then you also add it to the <dependencies> section of the POM. Here you specify the scope but not the version. Then you also have to configure the Maven Dependency Plugin to accept the presence of the dependency.
<dependencies>
<dependency>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<configuration>
<usedDependencies>
<!-- Models not detected by byte-code analysis -->
<usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-en-maxent</usedDependency>
</usedDependencies>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
As said before, if you have many models for your component, it is a good idea to use only a small
set for regular testing. If you want to create tests for additional models or even for all of
your models, then it is best to add the dependencies for these under a profile called use-full-resources.
This profile is enabled for automated builds or can be enabled on demand by developers who wish to run all tests. In the example below, we add an additional test dependency on a German model if the profile use-full-resources is enabled. Note that the Maven Dependency Plugin is configured again within the profile and that the combine.children="append" parameter is used to merge this configuration with the one already present for the default build.
<profiles>
<profile>
<id>use-full-resources</id>
<dependencies>
<dependency>
<groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
<artifactId>de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-de-maxent</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<configuration>
<usedDependencies combine.children="append">
<!-- Models not detected by byte-code analysis -->
<usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.opennlp-model-tagger-de-maxent</usedDependency>
</usedDependencies>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</profile>
</profiles>
To conditionally run a test only if the required model is available, you can use the AssumeResource class from the DKPro Core testing module.
AssumeResource.assumeResource(OpenNlpPosTagger.class, "tagger", language, variant);
8. LICENSE.txt
Every module must contain a file called LICENSE.txt
at its root which contains the license text.
Copy this file from another Apache-licensed or GPL-licensed module (again check if you need to
use GPLv2 or v3). If this file is not present, the build will fail.
9. NOTICE.txt
If the module contains code or resources from a third party (e.g. a source or test file which you
copied from some other code repository or obtained from some website), then you need to add a
file called NOTICE.txt
next to the LICENSE.txt
file. For every third-party file (or set of files
if mutiple files were obtained from the same source under the same conditions), the NOTICE.txt
must contain a statement which allows to identify the files, identify from where these files were
obtained, and contain a copyright and license statement. Check the license of the original files
for whether you have to include the full license text and potentially some specific attribution
(possibly from an upstream NOTICE
file).
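An entry in such a NOTICE.txt file could look like the following (purely illustrative; adapt it to the actual files, their origin, and their license):
src/test/resources/texts/example.txt
  obtained from https://example.org/texts/ (hypothetical source)
  Copyright (c) The Example Authors
  Licensed under the Apache License, Version 2.0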
Implementing Components
10. General
10.1. Capabilities
All components should declare the types they consume by default as input and the types they produce by default as output. Some components may not know before runtime what they produce or consume; in such cases, nothing can be declared.
(Example: the capabilities declaration in OpenNlpPosTagger.java from the dkpro-core-opennlp-asl module.)
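Such a declaration is typically made with the uimaFIT @TypeCapability annotation on the component class. The following sketch shows the general shape; the input and output type lists are illustrative and depend on the actual component:
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.TypeCapability;

@TypeCapability(
        // types consumed by default
        inputs = { "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence",
                "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token" },
        // types produced by default
        outputs = { "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
public class OpenNlpPosTagger extends JCasAnnotator_ImplBase {
    // ...
}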
11. Analysis components
11.1. Base classes
The base classes for analysis components are provided by uimaFIT: JCasAnnotator_ImplBase and CasAnnotator_ImplBase.
11.2. Models
The ModelProviderBase
class offers convenient support for working with model resources.
The following code is taken from the OpenNlpPosTagger
component. It shows how the POS Tagger
model is addressed using a parametrized classpath URL with parameters for language and variant.
(Example: the model provider declaration in OpenNlpPosTagger.java from the dkpro-core-opennlp-asl module.)
The produceResource()
method is called with the URL of the model once it has been located by
CasConfigurableProviderBase
.
(Example: the model provider usage in OpenNlpPosTagger.java from the dkpro-core-opennlp-asl module.)
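The following is a minimal sketch of such a provider, assuming the constants and methods of ModelProviderBase / CasConfigurableProviderBase (ARTIFACT_ID, LOCATION, LANGUAGE, VARIANT, setDefault(), setOverride(), produceResource()); the default values are illustrative, and modelLocation, language, and variant are the component's configuration parameters:
modelProvider = new ModelProviderBase<POSTaggerME>() {
    {
        setContextObject(OpenNlpPosTagger.this);

        // Parametrized classpath URL; language and variant are filled in at runtime
        setDefault(ARTIFACT_ID, "${groupId}.opennlp-model-tagger-${language}-${variant}");
        setDefault(LOCATION,
                "classpath:/${package}/lib/tagger-${language}-${variant}.properties");
        setDefault(VARIANT, "maxent");

        // User-provided parameters override the defaults
        setOverride(LOCATION, modelLocation);
        setOverride(LANGUAGE, language);
        setOverride(VARIANT, variant);
    }

    @Override
    protected POSTaggerME produceResource(URL aUrl) throws IOException
    {
        try (InputStream is = aUrl.openStream()) {
            return new POSTaggerME(new POSModel(is));
        }
    }
};
In process(), the provider is then typically configured for the current document via modelProvider.configure(cas) before the loaded resource is obtained via modelProvider.getResource().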
11.3. Type mappings
The DKPro type system design provides two levels of abstraction on most annotations:
- a generic annotation type, e.g. POS (part of speech) with a feature value containing the original tag produced by an analysis component, e.g. TreeTagger
- a set of high-level types for very common categories, e.g. N (noun), V (verb), etc.
DKPro maintains mappings for commonly used tagsets, e.g. in the module dkpro-core-api-lexmorph-asl. They are named:
{language}-{tagset}-{layer}.map
The following values are commonly used for layer:
- pos - part-of-speech tag mapping
- morph - morphological features mapping
- constituency - constituent tag mapping
- dependency - dependency relation mapping
The mapping provider is created in the initialize() method of the UIMA component after the respective model provider. This is necessary because the mapping provider obtains the tagset information for the current model from the model provider.
(Example: the mapping provider declaration in OpenNlpPosTagger.java from the dkpro-core-opennlp-asl module.)
In the process() method, the mapping provider is used to create a UIMA annotation. First, it is configured for the current document and model. Then, it is invoked for each tag produced by the tagger to obtain the UIMA type for the annotation to be created. If there is no mapping file, the mapping provider will fall back to a suitable default annotation, e.g. POS for part-of-speech tags or NamedEntity.
(Example: the mapping provider usage in OpenNlpPosTagger.java from the dkpro-core-opennlp-asl module.)
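A sketch of how this typically looks, assuming the MappingProvider API (MappingProviderFactory, configure(), getTagType()) and the fields of the surrounding component (posMappingLocation, language, modelProvider, cas) as context:
// In initialize(), after the model provider has been created:
posMappingProvider = MappingProviderFactory.createPosMappingProvider(
        posMappingLocation, language, modelProvider);

// In process(), for the current CAS and then for each tag produced by the tagger:
posMappingProvider.configure(cas);
Type posType = posMappingProvider.getTagType(tag);
POS posAnno = (POS) cas.createAnnotation(posType, begin, end);
posAnno.setPosValue(tag);
posAnno.addToIndexes();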
11.4. Default variants
It is possible that a different default variant needs to be used depending on the language. This can be configured by placing a properties file in the classpath and setting its location using setDefaultVariantsLocation(String). The key in the properties file is the language and the value is used as the default variant. These files should always reside in the lib sub-package of a component and use the naming convention:
{tool}-default-variants.map
The default variant file is a Java properties file which defines for each language which variant should be assumed as the default. It is possible to declare a catch-all variant using *. This is used if none of the other default variants apply.
(Example: tagger-default-variants.map from the dkpro-core-opennlp-asl module.)
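Such a file is a plain Java properties file mapping a language code to the variant to use by default; the entries below are illustrative:
de=maxent
en=maxent
*=maxent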
Use the convenience constructor of ModelProviderBase
to create model providers that are already
correctly set up to use default variants:
(Example: the convenience constructor in ModelProviderBase.java from the dkpro-core-api-resources-asl module.)
12. I/O components
12.1. Base classes
The base classes for I/O components are located in the dkpro-core-api-io-asl module.
Most reader components are derived from JCasResourceCollectionReaderBase or ResourceCollectionReaderBase. These classes offer support for many common functionalities, e.g.:
- common parameters like PARAM_SOURCE_LOCATION and PARAM_PATTERNS
- reading from the file system, classpath, or ZIP archives
- file-based compression (GZ, BZIP2, XZ)
- handling of DocumentMetaData (initCas methods)
- Ant-like include/exclude patterns
- handling of default excludes and hidden files
- progress reporting support
- extensibility with own Spring resource resolvers
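For illustration, the common parameters mentioned above are typically set like this when instantiating a reader via uimaFIT (a sketch; TextReader from the plain text IO module serves as the example reader, and the location and pattern values are illustrative):
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.collection.CollectionReaderDescription;
import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;

CollectionReaderDescription reader = createReaderDescription(TextReader.class,
        TextReader.PARAM_SOURCE_LOCATION, "src/test/resources/texts",
        TextReader.PARAM_PATTERNS, "[+]**/*.txt", // Ant-like include pattern
        TextReader.PARAM_LANGUAGE, "en");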
Most writer components are derived from JCasFileWriter_ImplBase
. This class offers support for
common functionality such as:
- common parameters like PARAM_TARGET_LOCATION
- writing to the file system, classpath, or ZIP archives (getOutputStream methods)
- file-based compression (GZ, BZIP2, XZ)
- properly interpreting DocumentMetaData (getOutputStream methods)
- overwrite protection
- replacing file extensions
If an I/O component interacts with a different data source, e.g. a database, the base classes above
are not suitable. Such readers should derive from the uimaFIT JCasCollectionReader_ImplBase
(or CasCollectionReader_ImplBase
) and writers from JCasAnnotator_ImplBase
(or
CasAnnotator_ImplBase
). However, the developer should ensure that the component’s parameters
reflect the standard DKPro Core reader/writer parameters defined in ComponentParameters
(dkpro-core-api-parameters-asl
module).
Testing
13. Basic test setup
There are a couple of things useful in every unit test:
- Redirecting the UIMA logging through log4j - DKPro Core uses log4j for logging in unit tests.
- Printing the name of the test to the console before every test
- Enabling extended index checks in UIMA (uima.exception_when_fs_update_corrupts_index)
To avoid repeating longish setup boilerplate code in every unit test, add the following lines to your unit test class:
@Rule
public DkproTestContext testContext = new DkproTestContext();
Additional benefits you get from this testContext
are:
- getting the class name of the current test (getClassName())
- getting the method name of the current test (getMethodName())
- getting the name of a folder you can use to store test results (getTestOutputFolder())
14. Unit test example
A typical unit test class consists of two parts:
- the test cases
- a runTest method which sets up the pipeline required by the test and then calls TestRunner.runTest().
In the following example, mind that the text must be provided with spaces
separating the tokens (thus there must be a space before the full stop at the end of the
sentence) and with newline characters (\n
) separating the sentences:
(Example: the test from OpenNlpNamedEntityRecognizerTest.java in the dkpro-core-opennlp-asl module.)
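The overall shape of such a test is sketched below; the component, parameters, and expected values are placeholders, static imports of createEngineDescription and select (uimaFIT / JCasUtil) are assumed, and the actual signatures should be checked against the OpenNLP test referenced above:
@Test
public void testEnglish() throws Exception {
    // Tokens separated by spaces, sentences separated by \n
    JCas jcas = runTest("en", null, "SAP where John Doe works is in Germany .");

    String[] namedEntities = { /* expected annotations go here */ };
    AssertAnnotations.assertNamedEntity(namedEntities, select(jcas, NamedEntity.class));
}

private JCas runTest(String language, String variant, String text) throws Exception {
    AnalysisEngineDescription engine = createEngineDescription(
            OpenNlpNamedEntityRecognizer.class,
            OpenNlpNamedEntityRecognizer.PARAM_VARIANT, variant);
    return TestRunner.runTest(engine, language, text);
}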
Test cases for segmenter components should not make use of the TestRunner class, because it already performs tokenization and sentence splitting internally.
15. AssertAnnotations
The AssertAnnotations class offers various static methods to test if a component has properly created annotations of a certain kind. There are methods to test almost every kind of annotation supported by DKPro Core, e.g.:
- assertToken
- assertSentence
- assertPOS
- assertLemma
- assertMorph
- assertStem
- assertNamedEntity
- assertConstituents
- assertChunks
- assertDependencies
- assertPennTree
- assertSemanticPredicates
- assertSemanticField
- assertCoreference
- assertTagset
- assertTagsetMapping
- assertValid - Tests implemented with TestRunner and IOTestRunner perform validation checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).
- etc.
16. Testing I/O components
The ReaderAssert and WriterAssert classes can be used to test I/O components. They allow building AssertJ-style unit tests with DKPro Core reader and writer components.
One of the simplest tests is a round-trip test where an input file is read using a reader for a particular format, then written out again using a writer for the same format.
(Example: the testRoundTrip test from Conll2006ReaderWriterTest.java in the dkpro-core-io-conll-asl module.)
The reader is set up to read the test input file. Instead of setting PARAM_SOURCE_LOCATION, it is also possible to set the input location using readingFrom(). The writer automatically makes use of a test output folder provided by a DkproTestContext, so a target location does not need to be configured explicitly.
Assuming the writer produces only a single output file, this file can be accessed for assertions using outputAsString(). If multiple output files are created, an argument can be passed to that method, e.g. outputAsString("output.txt"). This will look for a file at the target location whose name ends in output.txt. If there is none or more than one matching file, the test will fail.
If the original input file is in a different format or cannot be fully reproduced by the writer, then it is easy to set up a one way test, simply by changing the final comparison. The following example also shows how to specify additional parameters on the reader or writer.
(Example: the testOneWay test from Conll2006ReaderWriterTest.java in the dkpro-core-io-conll-asl module.)
In order to test the ability of readers to read multiple files, the asJCasList()
method can be used.
While pipelines typically re-use a single CAS which is repeatedly reset and refilled, this method
generates a list of separate CAS instances which can be individually validated after the test. To
access elements of the list use element(n)
.
Type System
17. Types
To add a new type, first locate the relevant module. Typically types are added to an API module because types are supposed to be independent of individual analysis tools. In rare circumstances, a type may be added to an I/O or tool module, e.g. because the type is experimental and needs to be tested in the context of that module - or because the type is highly specific to that module.
Typically, there is only a single descriptor file called dkpro-types.xml. Within a module, we keep this descriptor in the folder src/main/resources under the top-level package of the module plus a type sub-package. E.g. for the module dkpro-core-api-semantics-asl, the type descriptor would be
src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
For the time being, descriptors in src/main/resources/desc/type are also supported.
However, this support is going to be removed in the future.
|
18. Type descriptors
If there is no suitable type descriptor file yet, create a new one.
When a new type system descriptor has been added to a module, it needs to be registered with uimaFIT. This happens by creating the file
src/main/resources/META-INF/org.apache.uima.fit/types.txt
consisting of a list of type system descriptor locations prefixed with classpath*:
, e.g.:
classpath*:de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
The type system location corresponds to the location within the classpath at runtime, thus
src/main/resources is stripped from the beginning.
|
19. Documentation
19.1. Type descriptors
To play nicely with the automatic documentation generation system, the following points should be observed when creating a new type descriptor file:
- Name - This field of the type descriptor corresponds to the section under which the types declared in the descriptor appear. If a type descriptor's name field is e.g. Syntax, all types declared in the file will appear under that heading in the documentation. Multiple descriptors can declare the same name and the types declared in them are listed in the documentation in alphabetical order.
- Description - This field should be empty. Instead, create a sectionIntroXXX.adoc file under src/main/asciidoc/typesystem-reference in the dkpro-core-doc module (XXX is the name of the section - see Name above).
- Version - This field should be set to ${version}. If it does not exist yet, create a file src/filter/filter.properties in the module that contains the new type descriptor with the following content:
version=${project.version}
timestamp=${maven.build.timestamp}
Also add the following section to the pom.xml file in the respective module:
<resources>
  <resource>
    <filtering>false</filtering>
    <directory>src/main/resources</directory>
    <excludes>
      <exclude>desc/type/**/*</exclude>
    </excludes>
  </resource>
  <resource>
    <filtering>true</filtering>
    <directory>src/main/resources</directory>
    <includes>
      <include>desc/type/**/*</include>
    </includes>
  </resource>
</resources>
Replace the pattern inside the include and exclude elements with the location of your type descriptor file, e.g. de/tudarmstadt/ukp/dkpro/core/api/semantics/type/*.xml.
19.2. Types
When creating a new type or feature, you can use HTML tags to format the description. Line breaks, indentation, etc. will not be preserved. Mind that the description will be placed into the JavaDoc for the generated JCas classes as well as into the auto-generated DKPro Core documentation.
20. JCas classes
Instead of pre-generating the JCas classes and storing them in the version control, we use the
jcasgen-maven-plugin to automatically generate JCas classes at build time. The automatic
generation of JCas classes need to be explictily enabled for modules containing types. This
is done by placing a file called .activate-run-jcasgen
in the module root with the content
Marker to activate run-jcasgen profile.
Actually the content is irrelevant, but it is a good idea to place a note here regarding the purpose of the file. |
However, in some cases we customized the JCas classes, e.g. we added the method
DocumentMetaData.get(JCas)
. Such classes are excluded from being generated automatically by
placing them in a second descriptor file called dkpro-types-customized.xml
, e.g.
src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types-customized.xml
The dkpro-types-customized.xml descriptor must be also registered with uimaFIT in the
types.txt file.
|
21. Compliance validation
Often a type comes with a certain policy. For example, root nodes in a dependency relation tree
should have the type ROOT
and the features governor
and dependent
should point to the same
token. Another example would be that if a constituent is a child of another constituent, then its
parent
feature should be set accordingly.
To ensure that all components adhere to such policies, it is a good idea to implement checks for
them. This can be done simply by placing a new check implementation into the package
de.tudarmstadt.ukp.dkpro.core.testing.validation.checks
in the testing module. Tests implemented
with TestRunner
and IOTestRunner
use these unit tests automatically. All other checks should invoke
AssertAnnotations.assertValid(jcas)
.
(Example: ParentSetCheck.java from the dkpro-core-testing-asl module.)
Models and Resources
22. Architecture
The architecture for resources (e.g. parser models, POS tagger models, etc.) in DKPro is still work in progress. However, there are a couple of key points that have already been established.
- REQ-1 - Addressable by URL: Resources must be addressable using a URL, typically a classpath URL (classpath:/de/tudarmstadt/…/model.bin) or a file URL (file:///home/model.bin). Remote URLs like HTTP should not be used and may not be supported.
- REQ-2 - Maven compatible: Resources are packaged in JARs and can be downloaded from our Maven repositories (if the license permits).
- REQ-3 - Document-sensitive: A component should dynamically determine at runtime which resource to use based on properties of a processed document, e.g. based on the document language. This may change from one document to the next.
- REQ-4 - Overridable: The user should be able to override the model or provide additional information as to what specific variant of a resource should be used. E.g. if there are two resources for the language de, de-fast and de-accurate, the component could use de-fast per default unless the user specifies to use the accurate variant or specifies a different model altogether.
- REQ-5 - Loadable from classpath: Due to REQ-1, REQ-2, and REQ-3, models must be resolvable from the classpath.
  - ResourceUtils.resolveLocation(String, Object, UimaContext)
  - Resource Providers (see below)
  - PathMatchingResourcePatternResolver
22.1. Versioning scheme
To version our packaged models, we use a date (yyyymmdd) and a counter (x). We use a date, because often no (reliable) upstream version is available. E.g. with the Stanford NLP tools, the same model is sometimes included in different packages with different versions (e.g. parser models are included with the CoreNLP package and the parser package). TreeTagger models are not versioned at all. With the OpenNLP models, we are not sure if they are versioned - it seems they are just versioned for compatibility with a particular OpenNLP version (e.g. 1.5) but have no proper version of their own. If we know it, we use the date when the model was last changed, otherwise we use the date when we first package a new model, and we update it when we observe a model change.
We include additional metadata with the packaged model (e.g. which tagset is used) and we sometimes want to release packaged models with new metadata, although the upstream model itself has not changed. In such cases, we increment the counter. The counter starts at 0 if a new model is incorporated.
Thus, a model version has the format "yyyymmdd.x", e.g. 20120616.1 for the OpenNLP English maxent tagger model packaged in the example further below.
23. Packaging resources
Resources needed by DKPro components (e.g. parser models or POS tagger models) are not packaged with the corresponding analysis components, but as separate JARs, one per language and model variant.
Due to license restrictions, we may not redistribute all of these resources. But, we offer Ant scripts to automatically download the resources and package them as DKPro-compatible JARs. When the license permits, we upload these to our public Maven repository.
If you need a non-redistributable resource (e.g. TreeTagger models) or just want to package the models yourself, here is how you do it.
23.1. Installing Ant in Eclipse
Our build.xml scripts require Ant 1.8.x. If you use an older Eclipse version, you may have to manually download and register a recent Ant version:
- Download the latest Ant binaries from the website and unpack them in a directory of your choice.
- Start Eclipse and go to Window > Preferences > Ant > Runtime and press Ant Home….
- Select the Ant directory you just unpacked, then confirm.
23.2. Implementing a build.xml script
Models are usually large and we therefore package them separately from the components that use them. Each model becomes a JAR that is uploaded to our Maven repositories and added as a dependency in the projects that use them.
Often, models are single files, e.g. serialized Java objects that represent a parser model, POS tagger model, etc. The simplest case is that these files are distributed from some website. We then use an Ant script to download the file and package it as a JAR. We defined custom Ant macros like install-model-file that make the process very convenient. The following code shows how we import the custom macros and define two targets, local-maven and separate-jars. The first just sets a property to cause install-model-file to copy the finished JAR into the local Maven repository (~/.m2/repository).
The versioning scheme for models is "yyyymmdd.x" where "yyyymmdd" is the date of the last model change (if known) or the date of packaging, and "x" is a counter unique per date, starting at 0. Please refer to the versioning scheme documentation for more information.
The model building ANT script goes to src/scripts/build.xml
within the project.
DKPro Core provides a set of ANT macros that help in packaging models. Typically, you will need one of the following two:
- install-stub-and-upstream-file - if your model consists of a single file
- install-stub-and-upstream-folder - if your model consists of multiple files
When using install-stub-and-upstream-folder , the outputPackage property must end in lib ,
otherwise the generated artifacts will remain empty.
|
The ant-macros.xml
file itself contains additional documentation on the macros and additional
properties that can be set.
<project basedir="../.." default="separate-jars">
<import>
<url url="http://dkpro-core-asl.googlecode.com/svn/built-ant-macros/tags/0.7.0/ant-macros.xml"/>
</import>
<!--
- Output package configuration
-->
<property name="outputPackage"
value="de/tudarmstadt/ukp/dkpro/core/opennlp/lib"/>
<target name="local-maven">
<property name="install-artifact-mode" value="local"/>
<antcall target="separate-jars"/>
</target>
<target name="remote-maven">
<property name="install-artifact-mode" value="remote"/>
<antcall target="separate-jars"/>
</target>
<target name="separate-jars">
<mkdir dir="target/download"/>
<!-- FILE: models-1.5/en-pos-maxent.bin - - - - - - - - - - - - - -
- 2012-06-16 | now | db2cd70395b9e2e4c6b9957015a10607
-->
<get
src="http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin"
dest="target/download/en-pos-maxent.bin"
skipexisting="true"/>
<install-stub-and-upstream-file
file="target/download/en-pos-maxent.bin"
md5="db2cd70395b9e2e4c6b9957015a10607"
groupId="de.tudarmstadt.ukp.dkpro.core"
artifactIdBase="de.tudarmstadt.ukp.dkpro.core.opennlp"
upstreamVersion="20120616"
metaDataVersion="1"
tool="tagger"
language="en"
variant="maxent"
extension="bin" >
<metadata>
<entry key="pos.tagset" value="ptb"/>
</metadata>
</install-stub-and-upstream-file>
</target>
</project>
The model file en-pos-maxent.bin is downloaded from the OpenNLP website and stored in the local cache directory target/download/en-pos-maxent.bin. From there, install-stub-and-upstream-file picks it up and packages it as two JARs: 1) a JAR containing the DKPro Core metadata and a POM referencing the second JAR, and 2) a JAR containing the actual model file(s). The JAR file names derive from the artifactIdBase, tool, language, variant, upstreamVersion and metaDataVersion parameters. These parameters, along with the extension parameter, are also used to determine the package name and file name of the model in the JAR. They are determined as follows (mind that dots in the artifactIdBase turn into slashes, e.g. de.tud turns into de/tud):
{artifactIdBase}/lib/{tool}-{language}-{variant}.{extension}
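With the values from the example script above (artifactIdBase de.tudarmstadt.ukp.dkpro.core.opennlp, tool tagger, language en, variant maxent, extension bin), this resolves to:
de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin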
The following values are commonly used for tool:
- token - tokenizer
- sentence - sentence splitter
- segmenter - tokenizer & sentence splitter combined
- lemma - lemmatizer (sometimes lemmatizer is also used, but it should not be used for new models)
- tagger - part-of-speech tagger
- morphtagger - morphological analyzer
- ner - named-entity recognizer
- parser - syntactic parser
- depparser - dependency parser (sometimes parser is also used, but it should not be used for new models)
- coref - coreference resolver
The values for variant are very tool-dependent. Typically, the variant encodes parameters that were used during the creation of a model, e.g. which machine learning algorithm was used, which parameters it had, and on which data set is has been created.
An md5 sum for the remote file must be specified to make sure we notice if the remote file changes or if the download is corrupt.
The metadata added to the models is currently used to store tagset information, which is used to drive the tag-to-DKPro-UIMA-type mapping. The following values are commonly used as keys:
- pos.tagset - part-of-speech tagset (ptb, ctb, stts, …)
- dependency.tagset - dependency relation labels, aka syntactic functions (negra, ancora, …)
- constituent.tagset - constituent labels, aka syntactic categories (ptb, negra, …)
23.3. Running the build.xml script
For those modules where we support packaging resources as JARs, we provide an Ant script
called build.xml
which is located in the corresponding module
in the SVN.
build.xml
is a script that can be run with Apache Ant (version 1.8.x or higher) and requires an
internet connection.
You can find this script in the src/scripts
folder of the module.
Depending on the script, various build targets are supported. Three of them are particularly important: separate-jars, local-maven, and remote-maven:
- separate-jars downloads all resources from the internet, validates them against MD5 checksums and packages them as DKPro-compatible JARs. The JARs are stored to the target folder. You can easily upload them to an Artifactory Maven repository. Artifactory automatically recognizes their group ID, artifact ID and version. This may not work with other Maven repositories.
- local-maven additionally installs the JARs into the local Maven repository on your computer. It assumes the default location of the repository at ~/.m2/repository. If you keep your repository in a different folder, specify it via the alt.maven.repo.path system property.
- remote-maven additionally installs the JARs into a remote Maven repository. The repository to deploy to can be controlled via the system property alt.maven.repo.url. If the remote repo also requires authentication, use the system property alt.maven.repo.id to configure the credentials from the settings.xml that should be used. An alternative settings file can be configured using alt.maven.settings.
This target requires that you have installed maven-ant-tasks-2.1.3.jar in |
It is recommended to open the build.xml
file in
Eclipse, run the local-maven target, and then restart Eclipse.
Upon restart, Eclipse should automatically scan your local Maven repository. Thus,
the new resource JARs should be available in the search dialog when you add
dependencies in the POM editor.
23.4. Example: how to package TreeTagger binaries and models
TreeTagger and its models cannot be re-distributed with DKPro Core, so you need to download them yourself. For your convenience, we included an Apache Ant script called
build.xml
in the src/scripts
folder of
the TreeTagger module. This script downloads the TreeTagger binaries and models and
packages them as artifacts, allowing you to simply add them as dependencies in Maven.
To run the script, you need to have Ant 1.8.x installed and configured in Eclipse. This is already the case with Eclipse 3.7.x. If you use an older Eclipse version, please see the section below on installing Ant in Eclipse.
Now to build the TreeTagger artifacts:
- Locate the Ant build script (build.xml) in the scripts directory (src/scripts) of the dkpro-core-treetagger-asl module.
- Right-click, choose Run As > External Tools Configurations. In the Target tab, select local-maven, run.
- Read the license in the Ant console and - if you care - accept the license terms.
- Wait for the build process to finish.
- Restart Eclipse
To use the packaged TreeTagger resources, add them as Maven dependencies to your project (or add them to the classpath if you do not use Maven).
Note that in order to use TreeTagger you must have added at least the JAR with the TreeTagger binaries and one JAR with the model for the language you want to work with.
24. Updating a model
Whenever an existing model has a new release, it is good to update the build.xml, changing:
- the URL for retrieving the model (if it has changed)
- the version of the model (the day when the model was created, in the yyyymmdd format)
After that, run the Ant script with the local-maven target, add the JARs to your project classpath and check if the existing unit tests work with the updated model. If they do, then run the script again, this time with the remote-maven target. Then, change the versions of the models in the dependency management section of the project's POM file, commit those changes and move these new models from staging into the model repository on zoidberg.
24.1. MD5 checksum check fails
Not all of the resources are properly versioned by their maintainers (in particular
TreeTagger binaries and models). We observed that resources changed from one day to
the next without any announcement or increase of the version number (if present at
all). Thus, we validate all resources against an MD5 checksum stored in the
build.xml
file. This way, we can recognize if a remote
resource has been changed. When this happens, we add a note to the
build.xml
file indicating, when we noticed the MD5 changed
and update the version of the corresponding resource.
Since we do not test the build.xml files every day, you may get an MD5 checksum
error when you try to package the resources yourself. If this happens, open the
build.xml
file with a text editor, locate the MD5 checksum that fails, update it and
update the version of the corresponding resource. You can also tell us on the DKPro
Core User Group and we will update the build.xml
file.
25. Metadata
Typical metadata items for a model.
Almost all models should declare at least one tagset. We currently declare only the tagsets that a model produces, not those that it consumes.
Entry | Description
---|---
 | Deprecated, use
 | The character encoding of the model. In particular relevant for native tools, e.g. TreeTagger, Sfst, as we communicate with them as external processes through pipes or files.
The Dublin Core (DC) metadata items are not (yet) widely used throughout the models. This might change in the future.
Entry | Description
---|---
 | Used by the MstParser component to indicate the type of model
 | Used by the TreeTagger components to mark the boundary between two documents.
25.1. Low-level tagset mapping
Some models require minimalistic tag mapping, e.g. because the tags in the model are upper-case, while the tagset normally uses lower-case tags - or because there is a typo in a tag, etc. For this reason, DKPro Core offers a way of mapping tags directly in the model metadata provider such that DKPro Core components do not see the original model tags, but the fixed mapped tags.
A tag mapping is defined as <level>.tag.map.<originalTag>=<newTag>
. For example:
pos.tag.map.nn=NN
<entry key="pos.tag.map.nn" value="NN"/>
This functionality is being added successively and may not be available for all types of models. |