This document targets users who employ DKPro Script to build analysis pipelines.
Introduction
DKPro Script is a domain-specific language (DSL) based on Groovy that simplifies building pipelines from DKPro Core components.
Script structure
The script starts with a mandatory preamble that sets up the DKPro Script environment:
#!/usr/bin/env groovy
@Grab('org.dkpro.script:dkpro-script-groovy:0.1.0')
@groovy.transform.BaseScript org.dkpro.script.groovy.DKProCoreScript baseScript
After the preamble follows the actual script. There are three main commands: read, apply, and write. The read command must appear first, and there can be only one read command per script. Any number of apply and write commands may follow.
read 'String' language 'en' params([
documentText: 'This is a test.'])
apply 'OpenNlpSegmenter'
apply 'OpenNlpPosTagger'
write 'Conll2006'
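Putting the preamble and the example pipeline together, a complete script file might look as follows. This is a sketch assembled from the snippets above. It assumes that Groovy is installed, that the @Grab dependency can be resolved, and that OpenNlpPosTagger is the DKPro Core name of the OpenNLP part-of-speech tagger.

```groovy
#!/usr/bin/env groovy
// Preamble: fetch DKPro Script and install its base script class
@Grab('org.dkpro.script:dkpro-script-groovy:0.1.0')
@groovy.transform.BaseScript org.dkpro.script.groovy.DKProCoreScript baseScript

// Exactly one read command, and it comes first
read 'String' language 'en' params([
    documentText: 'This is a test.'])

// Any number of apply and write commands may follow
apply 'OpenNlpSegmenter'
apply 'OpenNlpPosTagger'

// No target location is given, so the writer is expected to use standard output
write 'Conll2006'
```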
Commands
read
The read command defines the source of the data being processed by the pipeline.
read '<FORMAT>' language '<LANG>' from '<LOCATION>' params ([
<param1>: <value1>,
<param2>: <value2>,
...
])
<FORMAT> - the format of the data to be read. To get a list of the supported formats, use the inventory command.
language <LANG> (optional) - the language of the data to be read. This must be a two-letter ISO code. This is a shortcut for specifying language in params.
from <LOCATION> (optional) - the location from where the data is to be read. This is required for most formats, but not for all. For example, the String format expects the document text to be specified via params. This is a shortcut for specifying sourceLocation in params.
params([<param1>: <value1>,…]) (optional) - additional parameters to be passed to the underlying reader. To see the available parameters for a specified format, use the explain command.
read 'Xmi' language 'en' from '*.xmi'
read 'Text' language 'en' from 'textfile.txt'
read 'Conll2006' language 'en' from '**/*.conll'
read 'String' language 'en' params([
documentText: 'This is a test.'])
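Because language and from are described above as shortcuts for entries in params, the long form below should behave like the shortcut form shown in the comment. This is a sketch based on that description only. The shortcut form is kept as a comment because a script may contain just one read command.

```groovy
// Shortcut form, as used in the examples above:
//   read 'Text' language 'en' from 'textfile.txt'

// Long form: the same settings passed explicitly via params
// (assumes the shortcuts map to the 'language' and 'sourceLocation' parameters)
read 'Text' params([
    language: 'en',
    sourceLocation: 'textfile.txt'])
```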
apply
The apply command adds an analysis engine to the pipeline.
apply '<ENGINE>' params ([
<param1>: <value1>,
<param2>: <value2>,
...
])
apply 'OpenNlpSegmenter'
apply 'StanfordParser' params([
writePennTree: true])
write
The write command defines where to output the processed data. The command can appear any time after the read command but typically appears at the end of the pipeline. It can appear multiple times, either to write out different formats or to write output at different stages of processing, i.e. when write appears in between apply commands.
write '<FORMAT>' to '<LOCATION>' params ([
<param1>: <value1>,
<param2>: <value2>,
...
])
<FORMAT> - the format of the data to be written. To get a list of the supported formats, use the inventory command.
to <LOCATION> (optional) - the location to write the output to. If this is omitted, most writers write their output to standard output (if using DKPro Core 1.8.0 or higher). Most formats treat the location as a target folder and create one output file in this folder for each input file. For some formats, the location specifies a file name. This is a shortcut for specifying targetLocation in params.
params([<param1>: <value1>,…]) (optional) - additional parameters to be passed to the underlying writer. To see the available parameters for a specified format, use the explain command.
// Write result to screen in Conll2006 format
write 'Conll2006'
// Write output file per input file to the folder 'output' in Conll2006 format
write 'Conll2006' to 'output'
// Write a single aggregate file 'output.conll' in Conll2006 format
write 'Conll2006' to 'output.conll' params([
singularTarget: true
])
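To illustrate writing at different stages of processing, the sketch below writes the data once after segmentation and once after part-of-speech tagging. The input pattern and the folder names segmented and tagged are hypothetical, and the preamble is omitted for brevity.

```groovy
// Hypothetical input location
read 'Text' language 'en' from 'input/*.txt'

apply 'OpenNlpSegmenter'
// Intermediate output: contains only the annotations produced so far
write 'Xmi' to 'segmented'

apply 'OpenNlpPosTagger'
// Final output: written after part-of-speech tagging
write 'Conll2006' to 'tagged'
```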
explain
The explain command describes a component, including the parameters it accepts.
explain '<COMPONENT>'
<COMPONENT> - the component to be explained. This can be either the name of an engine (cf. apply) or the name of a format (cf. read, write). When explaining a format, the explanation may contain a section for a 'Reader' and for a 'Writer'. If only the reading or writing aspect of a format is of interest, append Reader or Writer to the format name.
// Explain the OpenNlpSegmenter engine
explain 'OpenNlpSegmenter'
// Explain the Conll2006 format (reading and writing)
explain 'Conll2006'
// Explain the Conll2006 format (reading only)
explain 'Conll2006Reader'
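By the same convention, appending Writer should restrict the explanation to the writing side of a format. A minimal sketch, assuming the suffix works symmetrically to the Reader example above:

```groovy
// Explain the Conll2006 format (writing only)
explain 'Conll2006Writer'
```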