This document targets developers of DKPro Core components.
Implementing Components
1. General
1.1. Capabilities
All components should declare the types they consume by default as input and the types they produce by default as output. Some components cannot know before runtime what they produce or consume; in such cases, the capabilities may remain undeclared.
@TypeCapability(
inputs = {
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token",
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence" },
outputs = {
"de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS" })
public class OpenNlpPosTagger
extends JCasAnnotator_ImplBase
{
2. Analysis components
2.1. Base classes
The base classes for analysis components are provided by uimaFIT: JCasAnnotator_ImplBase and CasAnnotator_ImplBase.
2.2. Models
The ModelProviderBase class offers convenient support for working with model resources. The following code is taken from the OpenNlpPosTagger component. It shows how the POS tagger model is addressed using a parametrized classpath URL with parameters for language and variant.
// Use ModelProviderBase convenience constructor to set up a model provider that
// auto-detects most of its settings and is configured to use default variants.
// Auto-detection inspects the configuration parameter fields (@ConfigurationParameter)
// of the analysis engine class and looks for default parameters such as PARAM_LANGUAGE,
// PARAM_VARIANT, and PARAM_MODEL_LOCATION.
modelProvider = new ModelProviderBase<POSTagger>(this, "opennlp", "tagger")
{
@Override
protected POSTagger produceResource(InputStream aStream)
throws Exception
{
// Load the POS tagger model from the location the model provider offers
POSModel model = new POSModel(aStream);
// Create a new POS tagger instance from the loaded model
return new POSTaggerME(model);
}
};
The produceResource() method is called once the model has been located by CasConfigurableProviderBase; it receives a stream of the model data.
CAS cas = aJCas.getCas();
// Document-specific configuration of model and mapping provider in process()
modelProvider.configure(cas);
List<Token> tokens = selectCovered(aJCas, Token.class, sentence);
String[] tokenTexts = toText(tokens).toArray(new String[tokens.size()]);
// Fetch the OpenNLP pos tagger instance configured with the right model and use it to
// tag the text
String[] tags = modelProvider.getResource().tag(tokenTexts);
2.3. Type mappings
The DKPro type system design provides two levels of abstraction on most annotations:
- a generic annotation type, e.g. POS (part of speech), with a feature value containing the original tag produced by an analysis component, e.g. TreeTagger
- a set of high-level types for very common categories, e.g. N (noun), V (verb), etc.
DKPro maintains mappings for commonly used tagsets, e.g. in the module dkpro-core-api-lexmorph-asl. They are named:
{language}-{tagset}-{layer}.map
The following values are commonly used for layer:
- pos - part-of-speech tag mapping
- morph - morphological features mapping
- constituency - constituent tag mapping
- dependency - dependency relation mapping
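For illustration, the mapping file name can be derived mechanically from language, tagset, and layer. A minimal, self-contained sketch (the concrete values are illustrative):

```java
public class MappingNameDemo {
    // Build a mapping file name following the {language}-{tagset}-{layer}.map convention.
    static String mappingName(String language, String tagset, String layer) {
        return language + "-" + tagset + "-" + layer + ".map";
    }

    public static void main(String[] args) {
        // For an English model using the Penn Treebank POS tagset:
        System.out.println(mappingName("en", "ptb", "pos")); // en-ptb-pos.map
    }
}
```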
The mapping provider is created in the initialize() method of the UIMA component, after the respective model provider. This order is necessary because the mapping provider obtains the tagset information for the current model from the model provider.
// General setup of the mapping provider in initialize()
mappingProvider = MappingProviderFactory.createPosMappingProvider(posMappingLocation,
language, modelProvider);
In the process() method, the mapping provider is used to create a UIMA annotation. First, it is configured for the current document and model. Then, it is invoked for each tag produced by the tagger to obtain the UIMA type for the annotation to be created. If there is no mapping file, the mapping provider will fall back to a suitable default annotation, e.g. POS for part-of-speech tags or NamedEntity.
// Mind the mapping provider must be configured after the model provider as it uses the
// model metadata
mappingProvider.configure(cas);
// Convert the tag produced by the tagger to an UIMA type, create an annotation
// of this type, and add it to the document.
Type posTag = mappingProvider.getTagType(tag);
POS posAnno = (POS) cas.createAnnotation(posTag, t.getBegin(), t.getEnd());
// To save memory, we typically intern() tag strings
posAnno.setPosValue(internTags ? tag.intern() : tag);
posAnno.addToIndexes();
2.4. Default variants
It is possible a different default variant needs to be used depending on the language. This
can be configured by placing a properties file in the classpath and setting its
location using setDefaultVariantsLocation(String)
. The key in the
properties is the language and the value is used a default variant. These file
should always reside in the lib
sub-package of a component and use the naming
convention:
{tool}-default-variants.map
The default variant file is a Java properties file which defines for each language which variant
should be assumed as default. It is possible to declare a catch-all variant using *
. This is
used if none of the other default variants apply.
it=perceptron
*=maxent
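The lookup behavior of such a file amounts to a language-keyed lookup with a catch-all fallback. A minimal sketch using only java.util.Properties (the values mirror the example above):

```java
import java.util.Properties;

public class DefaultVariantLookup {
    // Resolve the default variant for a language, falling back to the
    // catch-all entry "*" if no language-specific entry exists.
    static String defaultVariant(Properties variants, String language) {
        return variants.getProperty(language, variants.getProperty("*"));
    }

    public static void main(String[] args) {
        Properties variants = new Properties();
        variants.setProperty("it", "perceptron");
        variants.setProperty("*", "maxent");
        System.out.println(defaultVariant(variants, "it")); // perceptron
        System.out.println(defaultVariant(variants, "en")); // maxent (catch-all)
    }
}
```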
Use the convenience constructor of ModelProviderBase
to create model providers that are already
correctly set up to use default variants:
public ModelProviderBase(Object aObject, String aShortName, String aType)
{
setContextObject(aObject);
setDefault(ARTIFACT_ID, "${groupId}." + aShortName + "-model-" + aType
+ "-${language}-${variant}");
setDefault(LOCATION,
"classpath:/${package}/lib/"+aType+"-${language}-${variant}.properties");
setDefaultVariantsLocation("${package}/lib/"+aType+"-default-variants.map");
addAutoOverride(ComponentParameters.PARAM_MODEL_LOCATION, LOCATION);
addAutoOverride(ComponentParameters.PARAM_VARIANT, VARIANT);
addAutoOverride(ComponentParameters.PARAM_LANGUAGE, LANGUAGE);
applyAutoOverrides(aObject);
}
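The ${...} placeholders in LOCATION and ARTIFACT_ID are filled in from the provider's properties (package, language, variant, etc.). Conceptually, the substitution works like the following self-contained sketch; the actual resolution in CasConfigurableProviderBase is more elaborate, and the property values here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderDemo {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${key} in the template with its value from the map.
    static String substitute(String template, Map<String, String> values) {
        Matcher m = VAR.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(values.get(m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("package", "de/tudarmstadt/ukp/dkpro/core/opennlp");
        props.put("language", "en");
        props.put("variant", "maxent");
        System.out.println(substitute(
                "classpath:/${package}/lib/tagger-${language}-${variant}.properties", props));
        // classpath:/de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.properties
    }
}
```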
3. I/O components
3.1. Base classes
The base classes for I/O components are located in the dkpro-core-api-io-asl
module.
Most reader components are derived from JCasResourceCollectionReaderBase or ResourceCollectionReaderBase. These classes offer support for many common functionalities, e.g.:
- common parameters like PARAM_SOURCE_LOCATION and PARAM_PATTERNS
- reading from the file system, classpath, or ZIP archives
- file-based compression (GZ, BZIP2, XZ)
- handling of DocumentMetaData (initCas methods)
- Ant-like include/exclude patterns
- handling of default excludes and hidden files
- progress reporting support
- extensibility with own Spring resource resolvers
Most writer components are derived from JCasFileWriter_ImplBase. This class offers support for common functionality such as:
- common parameters like PARAM_TARGET_LOCATION
- writing to the file system, classpath, or ZIP archives (getOutputStream methods)
- file-based compression (GZ, BZIP2, XZ)
- properly interpreting DocumentMetaData (getOutputStream methods)
- overwrite protection
- replacing file extensions
If an I/O component interacts with a different data source, e.g. a database, the base classes above are not suitable. Such readers should derive from the uimaFIT JCasCollectionReader_ImplBase (or CasCollectionReader_ImplBase) and writers from JCasAnnotator_ImplBase (or CasAnnotator_ImplBase). However, the developer should ensure that the component's parameters reflect the standard DKPro Core reader/writer parameters defined in ComponentParameters (dkpro-core-api-parameters-asl module).
Testing
4. Basic test setup
There are a couple of things useful in every unit test:
- Redirecting the UIMA logging through log4j - DKPro Core uses log4j for logging in unit tests.
- Printing the name of the test to the console before every test
- Enabling extended index checks in UIMA (uima.exception_when_fs_update_corrupts_index)
To avoid repeating a longish setup boilerplate code in every unit test, add the following lines to your unit test class:
@Rule
public DkproTestContext testContext = new DkproTestContext();
Additional benefits you get from this testContext are:
- getting the class name of the current test (getClassName())
- getting the method name of the current test (getMethodName())
- getting the name of a folder you can use to store test results (getTestOutputFolder())
5. Unit test example
A typical unit test class consists of two parts:
- the test cases
- a runTest method, which sets up the pipeline required by the test and then calls TestRunner.runTest().
In the following example, mind that the text must be provided with spaces separating the tokens (thus there must be a space before the full stop at the end of the sentence) and with newline characters (\n) separating the sentences:
@Test
public void testEnglish()
throws Exception
{
// Run the test pipeline. Note the full stop at the end of a sentence is preceded by a
// whitespace. This is necessary for it to be detected as a separate token!
JCas jcas = runTest("en", "person", "SAP where John Doe works is in Germany .");
// Define the reference data that we expect to get back from the test
String[] namedEntity = { "[ 10, 18]NamedEntity(person) (John Doe)" };
// Compare the annotations created in the pipeline to the reference data
AssertAnnotations.assertNamedEntity(namedEntity, select(jcas, NamedEntity.class));
}
// Auxiliary method that sets up the analysis engine or pipeline used in the test.
// Typically, we have multiple tests per unit test file that each invoke this method.
private JCas runTest(String language, String variant, String testDocument)
throws Exception
{
AnalysisEngine engine = createEngine(OpenNlpNamedEntityRecognizer.class,
OpenNlpNamedEntityRecognizer.PARAM_VARIANT, variant,
OpenNlpNamedEntityRecognizer.PARAM_PRINT_TAGSET, true);
// Here we invoke the TestRunner which performs basic whitespace tokenization and
// sentence splitting, creates a CAS, runs the pipeline, etc. TestRunner explicitly
// disables automatic model loading. Thus, models used in unit tests must be explicitly
// made dependencies in the pom.xml file.
return TestRunner.runTest(engine, language, testDocument);
}
Test cases for segmenter components should not make use of the TestRunner
class, which already performs tokenization and sentence splitting internally.
6. AssertAnnotations
The AssertAnnotations class offers various static methods to test if a component has properly created annotations of a certain kind. There are methods to test almost every kind of annotation supported by DKPro Core, e.g.:
- assertToken
- assertSentence
- assertPOS
- assertLemma
- assertMorph
- assertStem
- assertNamedEntity
- assertConstituents
- assertChunks
- assertDependencies
- assertPennTree
- assertSemanticPredicates
- assertSemanticField
- assertCoreference
- assertTagset
- assertTagsetMapping
- assertValid - Tests implemented with TestRunner and IOTestRunner perform validation checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).
- etc.
7. Testing I/O components
The IOTestRunner class offers convenient methods to test I/O components:
- testRoundTrip can be used to test converting a format to a CAS, converting it back, and comparing the result to the original
- testOneWay instead is useful to read data and compare it to a reference file in a different format (e.g. the CasDumpWriter format). It can also be used if a full round-trip is not possible because some information is lost or cannot be exported exactly as ingested from the original file.
The input file and reference file paths given to these methods are always considered relative to src/test/resources.
testRoundTrip with extra parameters (Conll2006ReaderWriterTest):
testRoundTrip(
    Conll2006Reader.class, // the reader
    Conll2006Writer.class, // the writer
    "conll/2006/fk003_2006_08_ZH1.conll"); // the input, also used as output reference
testOneWay with extra parameters (Conll2006ReaderWriterTest):
testOneWay(
    Conll2006Reader.class, // the reader
    Conll2006Writer.class, // the writer
    "conll/2006/fi-ref.conll", // the reference file for the output
    "conll/2006/fi-orig.conll"); // the input file for the test
testOneWay with reader/writer descriptions (BratReaderWriterTest):
testOneWay(
    createReaderDescription(Conll2009Reader.class), // the reader
    createEngineDescription(BratWriter.class, // the writer
        BratWriter.PARAM_WRITE_RELATION_ATTRIBUTES, true),
    "conll/2009/en-ref.ann", // the reference file for the output
    "conll/2009/en-orig.conll"); // the input file for the test
Type System
8. Types
To add a new type, first locate the relevant module. Typically types are added to an API module because types are supposed to be independent of individual analysis tools. In rare circumstances, a type may be added to an I/O or tool module, e.g. because the type is experimental and needs to be tested in the context of that module - or because the type is highly specific to that module.
Typically, there is only a single descriptor file called dkpro-types.xml. Within a module, we keep this descriptor in the folder src/main/resources under the top-level package plus type name of the module. E.g. for the module dkpro-core-api-semantics-asl the type descriptor would be
src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
For the time being, descriptors in src/main/resources/desc/type are also supported. However, this support is going to be removed in the future.
9. Type descriptors
If there is no suitable type descriptor file yet, create a new one.
When a new type system descriptor has been added to a module, it needs to be registered with uimaFIT. This happens by creating the file
src/main/resources/META-INF/org.apache.uima.fit/types.txt
consisting of a list of type system descriptor locations prefixed with classpath*:
, e.g.:
classpath*:de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types.xml
The type system location corresponds to the location within the classpath at runtime; thus, src/main/resources is stripped from the beginning.
10. Documentation
10.1. Type descriptors
To play nicely with the automatic documentation generation system, the following points should be observed when creating a new type descriptor file:
- Name - this field of the type descriptor corresponds to the section under which the types declared in the descriptor appear. If a type descriptor name field is e.g. Syntax, all types declared in the file will appear under that heading in the documentation. Multiple descriptors can declare the same name, and the types declared in them are listed in the documentation in alphabetical order.
- Description - this field should be empty. Instead, create a sectionIntroXXX.adoc file under src/main/asciidoc/typesystem-reference in the dkpro-core-doc module (XXX is the name of the section - see Name above).
- Version - should be set to ${version}. If it does not exist yet, create a file src/filter/filter.properties in the module that contains the new type descriptor with the following content:

version=${project.version}
timestamp=${maven.build.timestamp}

Also add the following section to the pom.xml file in the respective module:

<build>
  <resources>
    <resource>
      <filtering>false</filtering>
      <directory>src/main/resources</directory>
      <excludes>
        <exclude>desc/type/**/*</exclude>
      </excludes>
    </resource>
    <resource>
      <filtering>true</filtering>
      <directory>src/main/resources</directory>
      <includes>
        <include>desc/type/**/*</include>
      </includes>
    </resource>
  </resources>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <configuration>
          <usedDependencies>
            <usedDependency>de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.api.parameter-asl</usedDependency>
          </usedDependencies>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>

Replace the pattern inside the include and exclude elements with the location of your type descriptor file, e.g. de/tudarmstadt/ukp/dkpro/core/api/semantics/type/*.xml.
10.2. Types
When creating a new type or feature, you can use HTML tags to format the description. Line breaks, indentation, etc. will not be preserved. Mind that the description will be placed into the JavaDoc for the generated JCas classes as well as into the auto-generated DKPro Core documentation.
11. JCas classes
Instead of pre-generating the JCas classes and storing them in the version control, we use the
jcasgen-maven-plugin to automatically generate JCas classes at build time. The automatic
generation of JCas classes need to be explictily enabled for modules containing types. This
is done by placing a file called .activate-run-jcasgen
in the module root with the content
Marker to activate run-jcasgen profile.
Actually the content is irrelevant, but it is a good idea to place a note here regarding the purpose of the file. |
However, in some cases we customized the JCas classes, e.g. we added the method DocumentMetaData.get(JCas). Such classes are excluded from being generated automatically by placing them in a second descriptor file called dkpro-types-customized.xml, e.g.
src/main/resources/de/tudarmstadt/ukp/dkpro/core/api/semantics/type/dkpro-types-customized.xml
The dkpro-types-customized.xml descriptor must also be registered with uimaFIT in the types.txt file.
12. Compliance validation
Often a type comes with a certain policy. For example, root nodes in a dependency relation tree
should have the type ROOT
and the features governor
and dependent
should point to the same
token. Another example would be that if a constituent is a child of another constituent, then its
parent
feature should be set accordingly.
To ensure that all components adhere to such policies, it is a good idea to implement checks for them. This can be done simply by placing a new check implementation into the package de.tudarmstadt.ukp.dkpro.core.testing.validation.checks in the testing module. Tests implemented with TestRunner and IOTestRunner run these checks automatically. All other unit tests should invoke AssertAnnotations.assertValid(jcas).
@Override
public boolean check(JCas aJCas, List<Message> aMessages)
{
for (Constituent parent : select(aJCas, Constituent.class)) {
Collection<Annotation> children = select(parent.getChildren(), Annotation.class);
for (Annotation child : children) {
Annotation declParent = FSUtil.getFeature(child, "parent", Annotation.class);
if (declParent == null) {
aMessages.add(new Message(this, ERROR, String.format(
"Child without parent set: %s", child)));
}
else if (declParent != parent) {
aMessages.add(new Message(this, ERROR, String.format(
"Child points to wrong parent: %s", child)));
}
}
}
    // The check passes only if no error-level messages were produced
    return aMessages.stream().noneMatch(m -> m.level == ERROR);
}
Models and Resources
13. Architecture
The architecture for resources (e.g. parser models, POS tagger models, etc.) in DKPro is still work in progress. However, there are a couple of corner points that have already been established.
- REQ-1 - Addressable by URL: Resources must be addressable using a URL, typically a classpath URL (classpath:/de/tudarmstadt/…/model.bin) or a file URL (file:///home/model.bin). Remote URLs like HTTP should not be used and may not be supported.
- REQ-2 - Maven compatible: Resources are packaged in JARs and can be downloaded from our Maven repositories (if the license permits).
- REQ-3 - Document-sensitive: A component should dynamically determine at runtime which resource to use based on properties of a processed document, e.g. based on the document language. This may change from one document to the next.
- REQ-4 - Overridable: The user should be able to override the model or provide additional information as to what specific variant of a resource should be used. E.g. if there are two resources for the language de, de-fast and de-accurate, the component could use de-fast per default unless the user specifies the accurate variant or a different model altogether.
- REQ-5 - Loadable from classpath: Due to REQ-1, REQ-2, and REQ-3, models must be resolvable from the classpath. This is supported by:
  - ResourceUtils.resolveLocation(String, Object, UimaContext)
  - resource providers (see below)
  - PathMatchingResourcePatternResolver
13.1. Versioning scheme
To version our packaged models, we use a date (yyyymmdd) and a counter (x). We use a date because often no (reliable) upstream version is available. E.g. with the Stanford NLP tools, the same model is sometimes included in different packages with different versions (e.g. parser models are included with the CoreNLP package and the parser package). TreeTagger models are not versioned at all. With the OpenNLP models, we are not sure if they are versioned - it seems they are just versioned for compatibility with a particular OpenNLP version (e.g. 1.5) but have no proper version of their own. If we know it, we use the date when the model was last changed; otherwise, we use the date when we first packaged the model, and we update it when we observe a model change.
We include additional metadata with the packaged model (e.g. which tagset is used) and we sometimes want to release packaged models with new metadata, although the upstream model itself has not changed. In such cases, we increment the counter. The counter starts at 0 if a new model is incorporated.
Thus, a model version has the format "yyyymmdd.x".
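The metadata-only update described above amounts to incrementing the counter while keeping the date part. A minimal sketch (the version value is hypothetical):

```java
public class ModelVersionDemo {
    // Increment only the metadata counter of a "yyyymmdd.x" model version,
    // as done when repackaging an unchanged upstream model with new metadata.
    static String bumpMetadataCounter(String version) {
        int dot = version.lastIndexOf('.');
        String date = version.substring(0, dot);
        int counter = Integer.parseInt(version.substring(dot + 1));
        return date + "." + (counter + 1);
    }

    public static void main(String[] args) {
        System.out.println(bumpMetadataCounter("20120616.0")); // 20120616.1
    }
}
```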
14. Packaging resources
Resources needed by DKPro components (e.g. parser models or POS tagger models) are not packaged with the corresponding analysis components, but as separate JARs, one per language and model variant.
Due to license restrictions, we may not redistribute all of these resources. But, we offer Ant scripts to automatically download the resources and package them as DKPro-compatible JARs. When the license permits, we upload these to our public Maven repository.
If you need a non-redistributable resource (e.g. TreeTagger models) or just want to package the models yourself, here is how you do it.
14.1. Installing Ant in Eclipse
Our build.xml scripts require Ant 1.8.x. If you use an older Eclipse version, you may have to manually download and register a recent Ant version:
-
Download the latest Ant binaries from the website and unpack them in a directory of your choice.
-
Start Eclipse and go to Window > Preferences > Ant > Runtime and press Ant Home….
-
Select the Ant directory you just unpacked, then confirm.
14.2. Implementing a build.xml script
Models are usually large and we therefore package them separately from the components that use them. Each model becomes a JAR that is uploaded to our Maven repositories and added as a dependency in the projects that use them.
Often, models are single files, e.g. serialized Java objects that represent a parser model, POS tagger model, etc. The simplest case is that these files are distributed from some website. We then use an Ant script to download the file and package it as a JAR. We defined custom Ant macros like install-model-file that make the process very convenient. The following code shows how we import the custom macros and define two targets, local-maven and separate-jars. The first just sets a property to cause install-model-file to copy the finished JAR into the local Maven repository (~/.m2/repository).
The versioning scheme for models is "yyyymmdd.x" where "yyyymmdd" is the date of the last model change (if known) or the date of packaging, and "x" is a counter unique per date, starting at 0. Please refer to the versioning scheme documentation for more information.
The model building Ant script goes into src/scripts/build.xml within the project.
DKPro Core provides a set of Ant macros that help in packaging models. Typically, you will need one of the following two:
- install-stub-and-upstream-file - if your model consists of a single file
- install-stub-and-upstream-folder - if your model consists of multiple files
When using install-stub-and-upstream-folder, the outputPackage property must end in lib, otherwise the generated artifacts will remain empty.
The ant-macros.xml
file itself contains additional documentation on the macros and additional
properties that can be set.
<project basedir="../.." default="separate-jars">
  <import>
    <url url="http://dkpro-core-asl.googlecode.com/svn/built-ant-macros/tags/0.7.0/ant-macros.xml"/>
  </import>
<!--
- Output package configuration
-->
<property name="outputPackage"
value="de/tudarmstadt/ukp/dkpro/core/opennlp/lib"/>
<target name="local-maven">
<property name="install-artifact-mode" value="local"/>
<antcall target="separate-jars"/>
</target>
<target name="remote-maven">
<property name="install-artifact-mode" value="remote"/>
<antcall target="separate-jars"/>
</target>
<target name="separate-jars">
<mkdir dir="target/download"/>
<!-- FILE: models-1.5/en-pos-maxent.bin - - - - - - - - - - - - - -
- 2012-06-16 | now | db2cd70395b9e2e4c6b9957015a10607
-->
<get
src="http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin"
dest="target/download/en-pos-maxent.bin"
skipexisting="true"/>
<install-stub-and-upstream-file
file="target/download/en-pos-maxent.bin"
md5="db2cd70395b9e2e4c6b9957015a10607"
groupId="de.tudarmstadt.ukp.dkpro.core"
artifactIdBase="de.tudarmstadt.ukp.dkpro.core.opennlp"
upstreamVersion="20120616"
metaDataVersion="1"
tool="tagger"
language="en"
variant="maxent"
extension="bin" >
<metadata>
<entry key="pos.tagset" value="ptb"/>
</metadata>
</install-stub-and-upstream-file>
</target>
</project>
The model file en-pos-maxent.bin is downloaded from the OpenNLP website and stored in a local cache directory (target/download/en-pos-maxent.bin). From there, install-stub-and-upstream-file picks it up and packages it as two JARs: 1) a JAR containing the DKPro Core metadata and a POM referencing the second JAR, and 2) a JAR containing the actual model file(s). The JAR file names derive from the artifactIdBase, tool, language, variant, upstreamVersion, and metaDataVersion parameters. These parameters, along with the extension parameter, are also used to determine the package name and file name of the model in the JAR. They are determined as follows (mind that dots in the artifactIdBase turn into slashes, e.g. de.tud turns into de/tud):
{artifactIdBase}/lib/{tool}-{language}-{variant}.{extension}
The following values are commonly used for tool:
- token - tokenizer
- sentence - sentence splitter
- lemmatizer - lemmatizer
- tagger - part-of-speech tagger
- morphtagger - morphological analyzer
- ner - named-entity recognizer
- parser - syntactic or dependency parser
- coref - coreference resolver
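The resolution of the pattern above can be sketched mechanically; the parameter values below are taken from the OpenNLP example earlier in this section:

```java
public class ModelPathDemo {
    // Derive the classpath location of a packaged model from the packaging
    // parameters, following {artifactIdBase}/lib/{tool}-{language}-{variant}.{extension}.
    // Dots in the artifactIdBase turn into slashes.
    static String modelPath(String artifactIdBase, String tool, String language,
            String variant, String extension) {
        return artifactIdBase.replace('.', '/') + "/lib/" + tool + "-" + language + "-"
                + variant + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(modelPath("de.tudarmstadt.ukp.dkpro.core.opennlp",
                "tagger", "en", "maxent", "bin"));
        // de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin
    }
}
```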
The values for variant are very tool-dependent. Typically, the variant encodes parameters that were used during the creation of a model, e.g. which machine learning algorithm was used, which parameters it had, and on which data set it has been created.
An MD5 sum for the remote file must be specified to make sure we notice if the remote file changes or if the download is corrupt.
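The integrity check boils down to comparing the MD5 digest of the downloaded bytes against the expected value recorded in build.xml. A self-contained sketch using only the JDK (the "download" here is a stand-in byte array, not a real model file):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    // Compute the lowercase hex MD5 digest of the given data.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] download = "test".getBytes(StandardCharsets.UTF_8);
        String expected = "098f6bcd4621d373cade4e832627b4f6"; // well-known MD5 of "test"
        if (!md5Hex(download).equals(expected)) {
            throw new IllegalStateException("MD5 mismatch - remote file changed or download corrupt");
        }
        System.out.println("MD5 check passed");
    }
}
```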
The metadata added for the models is currently used to store tagset information, which is used to drive the tag-to-DKPro-UIMA-type mapping. The following values are commonly used as keys:
- pos.tagset - part-of-speech tagset (ptb, ctb, stts, …)
- dependency.tagset - dependency relation labels, aka syntactic functions (negra, ancora, …)
- constituent.tagset - constituent labels, aka syntactic categories (ptb, negra, …)
14.3. Running the build.xml script
For those modules where we support packaging resources as JARs, we provide an Ant script called build.xml, which is located in the corresponding module in the SVN. You can find this script in the src/scripts folder of the module. build.xml can be run with Apache Ant (version 1.8.x or higher) and requires an internet connection.
Depending on the script, various build targets are supported. Three of them are particularly important: separate-jars, local-maven, and remote-maven:
- separate-jars downloads all resources from the internet, validates them against MD5 checksums, and packages them as DKPro-compatible JARs. The JARs are stored in the target folder. You can easily upload them to an Artifactory Maven repository. Artifactory automatically recognizes their group ID, artifact ID, and version. This may not work with other Maven repositories.
- local-maven additionally installs the JARs into the local Maven repository on your computer. It assumes the default location of the repository at ~/.m2/repository. If you keep your repository in a different folder, specify it via the alt.maven.repo.path system property.
- remote-maven additionally installs the JARs into a remote Maven repository. The repository to deploy to can be controlled via the system property alt.maven.repo.url. If the remote repo also requires authentication, use the system property alt.maven.repo.id to configure the credentials from the settings.xml that should be used. An alternative settings file can be configured using alt.maven.settings. This target requires that you have installed maven-ant-tasks-2.1.3.jar.
It is recommended to open the build.xml
file in
Eclipse, run the local-maven target, and then restart Eclipse.
Upon restart, Eclipse should automatically scan your local Maven repository. Thus,
the new resource JARs should be available in the search dialog when you add
dependencies in the POM editor.
14.4. Example: how to package TreeTagger binaries and models
TreeTagger and its models cannot be redistributed with DKPro Core; you need to download them yourself. For your convenience, we included an Apache Ant script called
build.xml
in the src/scripts
folder of
the TreeTagger module. This script downloads the TreeTagger binaries and models and
packages them as artifacts, allowing you to simply add them as dependencies in Maven.
To run the script, you need to have Ant 1.8.x installed and configured in Eclipse. This is already the case with Eclipse 3.7.x. If you use an older Eclipse version, please see the section below on installing Ant in Eclipse.
Now to build the TreeTagger artifacts:
- Locate the Ant build script (build.xml) in the scripts directory (src/scripts) of the dkpro-core-treetagger-asl module.
- Right-click, choose Run As > External Tools Configurations. In the Target tab, select local-maven, then run.
- Read the license in the Ant console and - if you care - accept the license terms.
- Wait for the build process to finish.
- Restart Eclipse.
To use the packaged TreeTagger resources, add them as Maven dependencies to your project (or add them to the classpath if you do not use Maven).
Note that in order to use TreeTagger you must have added at least the JAR with the TreeTagger binaries and one JAR with the model for the language you want to work with.
15. Updating a model
Whenever an existing model has a new release, it is good to update the build.xml, changing:
- the URL for retrieving the model (if it has changed)
- the version of the model (the day when the model was created, in yyyymmdd format)
After that, run the Ant script with the local-maven target, add the JARs to your project classpath, and check whether the existing unit tests work with the updated model. If they do, run the script again, this time with the remote-maven target. Then, change the versions of the models in the dependency management section of the project's POM file, commit those changes, and move the new models from staging into the model repository on zoidberg.
15.1. MD5 checksum check fails
Not all of the resources are properly versioned by their maintainers (in particular
TreeTagger binaries and models). We observed that resources changed from one day to
the next without any announcement or increase of the version number (if present at
all). Thus, we validate all resources against an MD5 checksum stored in the
build.xml
file. This way, we can recognize if a remote resource has been changed. When this happens, we add a note to the build.xml file indicating when we noticed the MD5 change, and we update the version of the corresponding resource.
Since we do not test the build.xml files every day, you may get an MD5 checksum
error when you try to package the resources yourself. If this happens, open the
build.xml
file with a text editor, locate the MD5 checksum that fails, update it and
update the version of the corresponding resource. You can also tell us on the DKPro
Core User Group and we will update the build.xml
file.
16. Metadata
Typical metadata items for a model.
Almost all models should declare at least one tagset. We currently declare only the tagsets that a model produces, not those that it consumes.
Entry | Description
---|---

Entry | Description
---|---
 | Deprecated, use |
 | The character encoding of the model. This is particularly relevant for native tools, e.g. TreeTagger and SFST, with which we communicate as external processes through pipes or files. |

The Dublin Core (DC) metadata items are not (yet) widely used throughout the models. This might change in the future.

Entry | Description
---|---

Entry | Description
---|---
 | Used by the MstParser component to indicate the type of model |
 | Used by the TreeTagger components to mark the boundary between two documents. |