DKPro Core - TreeTagger part-of-speech tagging and lemmatizing

Analytics

Reads all .txt files from the specified directory, tags and lemmatizes them with TreeTagger, and prints each token with its part-of-speech tag and lemma to the console.

TreeTagger Installation for Linux

  • Go to the TreeTagger website
  • From the download section, download the correct tagger package, i.e. PC-Linux
    • Extract the .gz archive
    • Copy the tree-tagger-linux-3.2/bin/tree-tagger file and place it in the same folder as the script treetagger.py
  • From the parameter file section, download the correct model. For the example below, download the English parameter file (english-par-linux-3.2-utf8.bin.gz)
    • Unzip the file (e.g. gunzip english-par-linux-3.2-utf8.bin.gz)
    • Copy the file english-par-linux-3.2-utf8.bin into the same folder as the treetagger.py script. Ensure that the name for the model is english-par-linux-3.2-utf8.bin
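
If gunzip is not at hand, the unzip step above can also be done with a few lines of Python. This is only a sketch using the standard gzip and shutil modules; the function name gunzip is hypothetical and not part of DKPro Core or TreeTagger.

```python
import gzip
import shutil


def gunzip(src_path, dest_path=None):
    """Decompress a .gz file and return the path of the extracted file."""
    if dest_path is None:
        # Derive the target name by stripping the trailing ".gz"
        if not src_path.endswith(".gz"):
            raise ValueError("expected a .gz file: " + src_path)
        dest_path = src_path[:-3]
    # Stream the decompressed bytes to the target file
    with gzip.open(src_path, "rb") as src:
        with open(dest_path, "wb") as dest:
            shutil.copyfileobj(src, dest)
    return dest_path
```

Calling gunzip("english-par-linux-3.2-utf8.bin.gz") would write english-par-linux-3.2-utf8.bin next to the archive, which is the file name the script expects.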

TreeTagger Installation for Windows 7

  • Ensure that you have a program to unzip .gz files. For example, you can use 7-Zip (http://www.7-zip.org)
  • Go to the TreeTagger website
  • In the Windows section, you will find the download link for the tree-tagger-windows-3.2.zip file.
    • Extract the zip-archive
    • Copy tree-tagger-windows-3.2/bin/tree-tagger.exe into the same folder as the treetagger.py script
  • From the parameter file section, download the correct model. For the example below, download the English parameter file (english-par-linux-3.2-utf8.bin.gz)
    • Unzip the file (e.g. by using 7zip)
    • Copy the file english-par-linux-3.2-utf8.bin into the same folder as the treetagger.py script. Ensure that the name for the model is english-par-linux-3.2-utf8.bin
  • In the script below, find the line TreeTaggerPosLemmaTT4J.PARAM_EXECUTABLE_PATH, "tree-tagger" and change the value tree-tagger to tree-tagger.exe
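
Instead of editing the executable name by hand, the choice between tree-tagger and tree-tagger.exe could be made from the operating system name. The helper below is a hypothetical sketch, not part of DKPro Core. It takes the OS name as a string because under Jython os.name reports "java"; there, the name can be read with java.lang.System.getProperty("os.name").

```python
def tree_tagger_executable(os_name):
    """Return the TreeTagger binary name for the given operating system name."""
    # Windows reports names such as "Windows 7"; everything else is
    # treated as a Unix-like system here (an assumption of this sketch).
    if os_name.lower().startswith("windows"):
        return "tree-tagger.exe"
    return "tree-tagger"
```

The result could then be passed as the value of PARAM_EXECUTABLE_PATH.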

If you already have TreeTagger installed on your system, or if you want to use another model file, you can instead set the parameters PARAM_EXECUTABLE_PATH and PARAM_MODEL_PATH in the script to their respective locations.
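
Whichever locations are used, a quick check before starting the pipeline can give a clearer message than a failure inside the tagger. The following is a minimal sketch; missing_files is a hypothetical helper, and it assumes that relative paths are resolved against the current working directory.

```python
import os


def missing_files(executable_path, model_path):
    """Return those of the two paths that do not point to an existing file."""
    return [path for path in (executable_path, model_path)
            if not os.path.isfile(path)]
```

For the default setup, missing_files("tree-tagger", "english-par-linux-3.2-utf8.bin") returns an empty list when both files are in place.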

Run the script with C:\jython-2.7b1\jython treetagger.py <foldername> <language>, e.g. for English (en): C:\jython-2.7b1\jython treetagger.py C:\example_folder\ en
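
The script reads both arguments directly from sys.argv, so a missing argument shows up only as an IndexError. A small validation step could look like the sketch below; parse_args is a hypothetical helper and not part of the original script.

```python
import os


def parse_args(argv):
    """Return (folder, language) from an argv-style list or raise ValueError."""
    # argv[0] is the script name, followed by exactly two arguments
    if len(argv) != 3:
        raise ValueError("usage: jython treetagger.py <foldername> <language>")
    folder, language = argv[1], argv[2]
    if not os.path.isdir(folder):
        raise ValueError("not a directory: " + folder)
    return folder, language
```

In the script, folder and language would then replace sys.argv[1] and sys.argv[2] after calling parse_args(sys.argv).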

#!/usr/bin/env jython
# Fix classpath scanning - otherwise uimaFIT will not find the UIMA types
from java.lang import Thread
from org.python.core.imp import *
Thread.currentThread().contextClassLoader = getSyspathJavaLoader()

# Dependencies and imports for DKPro modules
from jip.embed import require
require('de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.opennlp-asl:1.6.1')
require('de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.treetagger-asl:1.6.1')
require('de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.io.text-asl:1.6.1')
from de.tudarmstadt.ukp.dkpro.core.opennlp import *
from de.tudarmstadt.ukp.dkpro.core.treetagger import *
from de.tudarmstadt.ukp.dkpro.core.io.text import *
from de.tudarmstadt.ukp.dkpro.core.api.segmentation.type import *

# uimaFIT imports
from org.apache.uima.fit.util.JCasUtil import *
from org.apache.uima.fit.pipeline.SimplePipeline import *
from org.apache.uima.fit.factory.CollectionReaderFactory import *
from org.apache.uima.fit.factory.AnalysisEngineFactory import *

# Access to commandline arguments
import sys

# Assemble and run pipeline
pipeline = iteratePipeline(
  createReaderDescription(TextReader,
    TextReader.PARAM_PATH, sys.argv[1],
    TextReader.PARAM_LANGUAGE, sys.argv[2],
    TextReader.PARAM_ENCODING, "ISO-8859-1",
    TextReader.PARAM_PATTERNS, "*.txt"),
  createEngineDescription(OpenNlpSegmenter),
  createEngineDescription(TreeTaggerPosLemmaTT4J,
    # Change "tree-tagger" to "tree-tagger.exe" if the script is executed under Windows
    TreeTaggerPosLemmaTT4J.PARAM_EXECUTABLE_PATH, "tree-tagger",
    TreeTaggerPosLemmaTT4J.PARAM_MODEL_PATH, "english-par-linux-3.2-utf8.bin",
    TreeTaggerPosLemmaTT4J.PARAM_MODEL_ENCODING, "UTF-8"))

# Print each token with its part-of-speech tag and lemma, one token per line
for jcas in pipeline:
  for token in select(jcas, Token):
    print token.coveredText + " " + token.pos.posValue + " " + token.lemma.value

Example output:

The DT the
quick JJ quick
brown JJ brown
fox NN fox
jumps NNS jump
over IN over
the DT the
lazy JJ lazy
dog NN dog
. SENT .
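
Each output line holds the token, its part-of-speech tag, and its lemma, separated by single spaces. If the output is to be processed further, it could be parsed as in the sketch below. The function names are hypothetical, and the sketch assumes that neither the tag nor the lemma contains a space.

```python
from collections import Counter


def parse_line(line):
    """Split one 'token POS lemma' output line into a (token, pos, lemma) tuple."""
    # Split from the right so the last two fields are always tag and lemma;
    # a line with fewer than three fields raises ValueError.
    token, pos, lemma = line.rstrip("\n").rsplit(" ", 2)
    return token, pos, lemma


def pos_counts(lines):
    """Count how often each part-of-speech tag occurs in the given lines."""
    return Counter(parse_line(line)[1] for line in lines if line.strip())
```

For example, parse_line("jumps NNS jump") returns ("jumps", "NNS", "jump").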
