Reads each Penn Treebank bracketed-format file from the corpus in the specified folder and writes it to the target folder in Tiger XML format.
Copy the script to a file called penn2tiger.groovy and call it, e.g., using

groovy penn2tiger.groovy pennTreebankFile.txt .

This creates a file called pennTreebankFile.xml in Tiger XML format in the current directory.
Note: If the script fails, check that every line that does not start a sentence is indented. If necessary, add a space at the beginning of each such line.
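The manual fix described in this note can be automated before running the script. The following Python sketch is not part of the recipe. It assumes that a sentence starts only when no bracket is currently open, and that literal brackets in the text are escaped (e.g. as -LRB- and -RRB-), so that every "(" and ")" is a tree bracket. The function name indent_continuation_lines is made up for this illustration.

```python
def indent_continuation_lines(text: str) -> str:
    """Prefix a space to every unindented line that continues an open tree.

    Assumption: every "(" or ")" in the file is a tree bracket, i.e. literal
    brackets in the text are escaped (for example as -LRB- / -RRB-).
    """
    depth = 0  # number of brackets that are currently open
    fixed = []
    for line in text.splitlines(keepends=True):
        # A line that appears while a tree is still open must be indented.
        if depth > 0 and not line[0].isspace():
            line = " " + line
        # Never let surplus closing brackets push the depth below zero.
        depth = max(0, depth + line.count("(") - line.count(")"))
        fixed.append(line)
    return "".join(fixed)


if __name__ == "__main__":
    sample = "(S\n(NP (DT The) (NN dog))\n  (VP (VBZ barks)))\n"
    print(indent_continuation_lines(sample), end="")
```

Only the second line of the sample gains a leading space; lines that are already indented and lines that start a new tree are left unchanged.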
Note: This script uses DKPro Core 1.7.1-SNAPSHOT because of the following two issues present in version 1.7.0:
- If the Penn Treebank file is malformed, an EmptyStackException will be thrown (Issue 613).
- A file is malformed, e.g.:
  - if the brackets do not balance
  - if the tree is not properly indented, in particular, if a line that is not indented is not the start of a tree (alleviated but not fixed by Issue 612)
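The two conditions listed above can be checked before conversion. The following Python sketch is an illustration only and is not how DKPro Core itself validates input. It assumes that every "(" and ")" in the file is a tree bracket and that a tree always starts with "(" in the first column. The function name find_problems and the message texts are made up for this example.

```python
def find_problems(text: str) -> list:
    """Return (line number, message) pairs for the two malformations above.

    Assumptions: every "(" or ")" is a tree bracket, and a tree always
    starts with "(" in the first column of a line.
    """
    problems = []
    depth = 0  # number of brackets that are currently open
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        # An unindented line is only valid as the start of a new tree.
        if not line[0].isspace() and (depth > 0 or line[0] != "("):
            problems.append((number, "unindented line does not start a tree"))
        for char in line:
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:
                    problems.append((number, "closing bracket without opening bracket"))
                    depth = 0  # recover so that later lines are still checked
    if depth > 0:
        problems.append((None, f"{depth} bracket(s) left open at end of file"))
    return problems


if __name__ == "__main__":
    for number, message in find_problems("(S (NP (PRP It))\n(VP (VBZ runs)))\n"):
        print(number, message)
```

An empty result means that neither problem was found under these assumptions; it does not guarantee that the reader accepts the file.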