This page provides an overview of software that we will most likely not integrate and distribute with DKPro Core for one or more of the following reasons:
- Service only - the software is only available as a web service. DKPro Core aims to be a collection of portable NLP components that can easily be deployed on any Linux, Windows, or OS X machine. Software that is only available as a service is problematic for several reasons. First, the data must be transferred to the service to be processed; for large amounts of data or for sensitive data, this may not be possible. Second, DKPro Core components are versioned, and each version remains available indefinitely (or until Maven Central goes away). Hosting artifacts requires comparatively few resources, whereas hosting a live service requires comparatively many, so it is not possible to host all versions of a service. As a result, a service may easily change due to upgrades or go away due to lack of funding.
- Not redistributable - the software may not be distributed by parties other than the original provider. All DKPro Core components are meant to be self-contained and should not require the user to install additional software on the machine where they are executed. If a DKPro Core component requires a library or tool, it should be able to acquire it automatically from Maven Central or from the UKP Lab OSS Maven repository. Some licenses prohibit redistribution, so we cannot upload the affected software to either of these Maven repositories. A typical license phrase that prohibits redistribution is: "The licensee has no right to give or sell the system to third parties without written permission from the licenser."
- Not statically compilable - the software is written in a non-Java language and cannot be compiled in such a way that it runs on a reasonably large set of machines. E.g. it may depend on too many dynamically linked libraries.
- Not portable - the software is not available for all major platforms (Linux, OS X, Windows).
- Requires runtime - the software is written in a non-Java language that requires a runtime environment which DKPro Core cannot easily deploy automatically on a machine. This includes software written in Python, Perl, or similar languages.
- Loss of information - while processing, the software loses information, in particular the information on how to relate its output back to its input. A typical example is making changes to the text (normalization) while at the same time not anchoring the resulting annotations via offsets to the original text. Fixing such problems typically requires changing the source code of the tool. If such a change is not adopted by the upstream developers, we would have to maintain a patch or fork against the upstream code, which would cause additional overhead for us.
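The offset anchoring mentioned under "Loss of information" can be illustrated with a small sketch. The Java class below is hypothetical: `OffsetPreservingNormalizer` and its methods are not part of DKPro Core or of any tool listed on this page, and the sketch assumes a normalization step that only collapses runs of whitespace into a single space. It records, for every character of the normalized text, the offset of the corresponding character in the original text, so that an annotation made on the normalized text can still be related back to the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a DKPro Core class.
final class OffsetPreservingNormalizer {
    // Text after collapsing each run of whitespace into one space.
    final String normalized;
    // originalOffsets[i] is the index in the original text of normalized.charAt(i).
    // One extra trailing entry maps the end-of-text position.
    final int[] originalOffsets;

    OffsetPreservingNormalizer(String original) {
        StringBuilder sb = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        boolean previousWasSpace = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (Character.isWhitespace(c)) {
                if (previousWasSpace) {
                    continue; // drop repeated whitespace, no offset recorded
                }
                sb.append(' ');
                previousWasSpace = true;
            } else {
                sb.append(c);
                previousWasSpace = false;
            }
            map.add(i); // remember where this kept character came from
        }
        map.add(original.length());
        normalized = sb.toString();
        originalOffsets = map.stream().mapToInt(Integer::intValue).toArray();
    }

    // Maps an inclusive begin offset in the normalized text to the original text.
    int toOriginalBegin(int normalizedBegin) {
        return originalOffsets[normalizedBegin];
    }

    // Maps an exclusive end offset: locate the last covered character and add one.
    int toOriginalEnd(int normalizedEnd) {
        return normalizedEnd == 0 ? 0 : originalOffsets[normalizedEnd - 1] + 1;
    }

    public static void main(String[] args) {
        String original = "The   quick\t\tfox";
        OffsetPreservingNormalizer n = new OffsetPreservingNormalizer(original);
        // A tool working on the normalized text reports "fox" at [10, 13).
        int begin = n.toOriginalBegin(10);
        int end = n.toOriginalEnd(13);
        System.out.println(original.substring(begin, end)); // prints "fox"
    }
}
```

A tool that normalizes its input without keeping such a mapping gives the caller no way to perform the last step, which is why integrating it usually means changing the tool's source code.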
|Enju parser||Not redistributable|
|TreeTagger||Not redistributable||DKPro Core supports TreeTagger, but users have to download binaries and models themselves. DKPro Core runs TreeTagger in a separate process, thus there are no compile-time dependencies on any part of TreeTagger.|
|LX-Parser||Not redistributable||Additionally, the LX-Parser requires an ancient version of the Stanford Parser. It is not kept up-to-date with recent Stanford releases.|
|LX-Tokenizer||Not redistributable||Additionally, the LX-Tokenizer is only available as a binary for Linux. No Windows or OS X support.|
|UAIC NLP Tools||Service only|
|AlchemyAPI||Service only||Commercial service|
|OpenCalais||Service only||Commercial service|
|SVMTool||Requires runtime (Perl)|
|PDTB Parser||Loss of information||Tokenization, text normalization (but in principle can be integrated using JRuby quite easily)|
|BART (de)||Unclear preprocessing||Preprocessing for German requires several third-party tools in various languages (Python scripts & packages, C tools). The preprocessing is sparsely documented, and the required C libraries throw a memory exception. Adding the required preprocessing to DKPro Core would take considerable effort (as of 12/2014).|