Conversion Guide

Contents

Corpus format and structure
Example conversion
Tested corpora
Conversion scripts

CSniper requires corpora to be converted into special binary formats that allow fast searching. The examples below show how to convert your data.

Corpus format and structure

The main query engine for CSniper is CQP, the Corpus Query Processor from the IMS Corpus Workbench (http://cwb.sourceforge.net). Corpora used with CSniper therefore have to be converted to the CQP format. In addition, serialized CASes are used so that the context of a search result can be delivered relatively quickly.

CSniper uses the following corpora directory structure:

corpora/
	registry/
		corpus1
		corpus2
		...
	CORPUS1/
		corpus.properties
		format1/
			corpus-files for engine 1
			...
		format2/
			corpus-files for engine 2
			...
	CORPUS2/
		...

At the moment, a format can be, e.g., cqp (mandatory), bin (used for context display) or tgrep (used to search on parse trees rather than on tokens, lemmas or POS tags). Each corpus directory has to contain a corpus.properties file; an example is shown in the example conversion below.
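The layout above can be sketched as a small shell function that creates an empty skeleton for one corpus. This is only an illustration: the function name make_corpus_skeleton and its arguments are made up here and are not part of CSniper, and in practice the conversion scripts create the format directories themselves.

```shell
#!/bin/sh
# Sketch: create an empty corpora layout for one corpus.
# make_corpus_skeleton is a hypothetical helper, not part of CSniper.
# Usage: make_corpus_skeleton <corpora-dir> <CORPUS-NAME> <format>...
make_corpus_skeleton() {
    corpora=$1
    corpus=$2
    shift 2

    # Shared registry directory, holding one registry file per corpus.
    mkdir -p "$corpora/registry" || return 1

    # The corpus directory itself.
    mkdir -p "$corpora/$corpus" || return 1

    # One subdirectory per format (cqp is mandatory; bin and tgrep are optional).
    for format in "$@"; do
        mkdir -p "$corpora/$corpus/$format" || return 1
    done

    # Create an empty corpus.properties only if none exists yet.
    [ -e "$corpora/$corpus/corpus.properties" ] || : > "$corpora/$corpus/corpus.properties"
}
```

For example, make_corpus_skeleton /srv/csniper/corpora CORPUS1 cqp bin would create the registry directory, CORPUS1/cqp, CORPUS1/bin and an empty CORPUS1/corpus.properties.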

Example conversion

Let’s say we have a corpus in the NEGRA export format (a sample can be found here: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/corpus-sample.export).

To convert it, we’ll write a short conversion program:

#!/usr/bin/env groovy
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.bincas-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.negra-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl', 
	version='1.5.0')

import org.uimafit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter;
import de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter;
import de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader;
import de.tudarmstadt.ukp.dkpro.core.api.resources.CompressionMethod;
import static org.uimafit.factory.AnalysisEngineFactory.createPrimitiveDescription;
import static org.uimafit.factory.CollectionReaderFactory.createDescription;

// Collection ID
def id = args[0]

// Source file (e.g. tuebadz-5.0.anaphora.export.bz2)
def source = args[1]

// Target folder
def target = args[2];

SimplePipeline.runPipeline(
	createDescription(NegraExportReader.class,
	    NegraExportReader.PARAM_SOURCE_LOCATION, source,
	    NegraExportReader.PARAM_COLLECTION_ID, id,
	    NegraExportReader.PARAM_LANGUAGE, "de",
	    NegraExportReader.PARAM_ENCODING, "ISO-8859-15",
	    NegraExportReader.PARAM_READ_PENN_TREE, true),

	createPrimitiveDescription(SerializedCasWriter.class,
	    SerializedCasWriter.PARAM_PATH, target + "/bin",
	    SerializedCasWriter.PARAM_USE_DOCUMENT_ID, true,
	    SerializedCasWriter.PARAM_COMPRESSION, CompressionMethod.XZ),

	createPrimitiveDescription(ImsCwbWriter.class,
	    ImsCwbWriter.PARAM_TARGET_ENCODING, "UTF-8",
	    ImsCwbWriter.PARAM_TARGET_LOCATION, target + "/cqp",
	    ImsCwbWriter.PARAM_WRITE_TEXT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_DOCUMENT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_OFFSETS, true,
	    ImsCwbWriter.PARAM_WRITE_LEMMA, true,
	    ImsCwbWriter.PARAM_WRITE_DOC_ID, false));

Save it as convert-sample.groovy and run it as

groovy convert-sample sample corpus-sample.export SAMPLE

This produces a directory SAMPLE with two subdirectories, bin and cqp. Copy SAMPLE to your corpora directory (/srv/csniper/corpora in the following). Some post-processing is still needed. First, create a corpus.properties file in SAMPLE containing these lines:

name=NEGRA sample
description=Just an example corpus.
language=de

Then move the registry directory from /srv/csniper/corpora/SAMPLE/cqp/registry to /srv/csniper/corpora/registry. Edit the /srv/csniper/corpora/registry/sample file and change the lines

# path to binary data files
HOME target/SAMPLE/cqp/data

to

# path to binary data files
HOME /srv/csniper/corpora/SAMPLE/cqp

Finally, move all files from /srv/csniper/corpora/SAMPLE/cqp/data to /srv/csniper/corpora/SAMPLE/cqp, and remove the now empty data directory.
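The registry and data steps above can be sketched as a shell function. The function name postprocess_corpus and its arguments are hypothetical. The sketch also assumes that a shared registry directory may already exist, so it moves the single registry file rather than the whole registry directory, and that the corpus path contains no "|" or "&" characters (they would confuse the sed expression).

```shell
#!/bin/sh
# Sketch: post-process a converted corpus that was copied into the corpora directory.
# postprocess_corpus is a hypothetical helper, not part of CSniper.
# Usage: postprocess_corpus <corpora-dir> <CORPUS-DIR> <registry-file-name>
postprocess_corpus() {
    corpora=$1   # e.g. /srv/csniper/corpora
    corpus=$2    # e.g. SAMPLE
    regname=$3   # e.g. sample (registry file written by the conversion)

    cqp="$corpora/$corpus/cqp"
    regfile="$corpora/registry/$regname"

    # 1. Move the generated registry file into the shared registry directory.
    mkdir -p "$corpora/registry" || return 1
    mv "$cqp/registry/$regname" "$regfile" || return 1
    rmdir "$cqp/registry" 2>/dev/null || true

    # 2. Point the HOME line at the final location of the binary data files.
    sed "s|^HOME .*|HOME $cqp|" "$regfile" > "$regfile.tmp" || return 1
    mv "$regfile.tmp" "$regfile" || return 1

    # 3. Move the data files up one level and remove the now empty data directory.
    mv "$cqp/data"/* "$cqp"/ || return 1
    rmdir "$cqp/data"
}
```

For the example corpus this would be called as postprocess_corpus /srv/csniper/corpora SAMPLE sample. The corpus.properties file is not touched and still has to be created by hand.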

Note: If CSniper does not find a corpus, make sure the corpus name is written in all uppercase letters both in your directory structure and in the HOME ... line of the corresponding registry file. Also make sure your application server has file access to the corpora directory and all its subdirectories.
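The checks from this note can be sketched as a shell function. The name check_corpus and its arguments are hypothetical, and the read-access test only tells you something useful if the function is run as the user of the application server.

```shell
#!/bin/sh
# Sketch: sanity checks for a corpus that CSniper does not find.
# check_corpus is a hypothetical helper, not part of CSniper.
# Usage: check_corpus <corpora-dir> <CORPUS-DIR> <registry-file-name>
check_corpus() {
    corpora=$1
    corpus=$2
    regname=$3
    status=0

    # The corpus directory name should be all upper case.
    upper=$(printf '%s' "$corpus" | tr '[:lower:]' '[:upper:]')
    if [ "$corpus" != "$upper" ]; then
        echo "corpus directory '$corpus' is not all upper case" >&2
        status=1
    fi

    # The HOME line in the registry file should point into that directory.
    if ! grep -q "^HOME .*/$corpus/cqp" "$corpora/registry/$regname" 2>/dev/null; then
        echo "HOME line in registry/$regname does not reference $corpus/cqp" >&2
        status=1
    fi

    # The current user needs read access to the corpus directory.
    if [ ! -r "$corpora/$corpus" ]; then
        echo "no read access to $corpora/$corpus" >&2
        status=1
    fi

    return $status
}
```

For the example corpus, check_corpus /srv/csniper/corpora SAMPLE sample returns 0 if all three checks pass and prints a message for each check that fails.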

Tested corpora

We have tested CSniper with the following corpora:

Corpus	Reader	Features	Comment
British National Corpus	BncReader	POS, CPOS
deWaC	ImsCwbReader	POS, lemma	tested with the first deWaC data file (~ 1/20th)
TüBa D/Z	NegraExportReader	POS, parse trees	tested with version 5, contains no lemmas
TIGER Corpus	NegraExportReader	POS, parse trees, (lemma?)	tested with version 2 and 2.1
TextGrid Digitale Bibliothek	TEIReader

Many of the parse trees from TüBa D/Z and TIGER cannot currently be imported, because the corpora use POS tags such as “$(” which are not compatible with the bracketed parse tree representation used by TGrep.
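Why a tag like “$(” is a problem can be illustrated by counting brackets. The tree below is invented for this illustration and is not taken from either corpus; the sketch only shows that the parenthesis inside the tag leaves the brackets unbalanced, it does not reproduce what TGrep itself does.

```shell
#!/bin/sh
# Sketch: count how often a character occurs in a string.
# count_char is a hypothetical helper for this illustration.
count_char() {
    printf '%s' "$2" | tr -cd "$1" | wc -c | tr -d '[:space:]'
}

# Invented bracketed tree in which the last token carries the POS tag "$(".
tree='(S (NE Anna) (VVFIN lacht) ($( -))'

echo "opening brackets: $(count_char '(' "$tree")"
echo "closing brackets: $(count_char ')' "$tree")"
```

The script reports 5 opening and 4 closing brackets, so a reader that treats every "(" as the start of a node cannot close the tree.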

Conversion scripts

Here we supply a couple of conversion scripts written in Groovy. Nothing is required besides a Groovy installation; each script fetches all the dependencies it needs on its own. After installing Groovy, simply paste a script into a file, make the file executable, and run it. The scripts all take three arguments:

corpus ID: an ID you choose, no spaces, slashes or backslashes (e.g. “tueba5”)
source: the source file
target: a folder into which the output is generated

If you name your file e.g. convert-tueba5.groovy, run it as

groovy convert-tueba5 tueba5 tuebadz-5.0.anaphora.export.bz2 tueba5

Conversion script - TüBa D/Z 5

#!/usr/bin/env groovy
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.bincas-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.negra-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl',
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl', 
	version='1.5.0')

import org.uimafit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter;
import de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter;
import de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader;
import de.tudarmstadt.ukp.dkpro.core.io.tgrep.TGrepWriter;
import de.tudarmstadt.ukp.dkpro.core.api.resources.CompressionMethod;
import static org.uimafit.factory.AnalysisEngineFactory.createPrimitiveDescription;
import static org.uimafit.factory.CollectionReaderFactory.createDescription;

// Collection ID
def id = args[0]

// Source file (e.g. tuebadz-5.0.anaphora.export.bz2)
def source = args[1]

// Target folder
def target = args[2];

SimplePipeline.runPipeline(
	createDescription(NegraExportReader.class,
	    NegraExportReader.PARAM_SOURCE_LOCATION, source,
	    NegraExportReader.PARAM_COLLECTION_ID, id,
	    NegraExportReader.PARAM_LANGUAGE, "de",
	    NegraExportReader.PARAM_ENCODING, "ISO-8859-15",
	    NegraExportReader.PARAM_READ_PENN_TREE, true),

	createPrimitiveDescription(SerializedCasWriter.class,
	    SerializedCasWriter.PARAM_PATH, target + "/bin",
	    SerializedCasWriter.PARAM_USE_DOCUMENT_ID, true,
	    SerializedCasWriter.PARAM_COMPRESSION, CompressionMethod.XZ),
	
	createPrimitiveDescription(TGrepWriter.class,
	    TGrepWriter.PARAM_TARGET_LOCATION, target + "/tgrep",
	    TGrepWriter.PARAM_COMPRESSION, CompressionMethod.GZIP,
	    TGrepWriter.PARAM_DROP_MALFORMED_TREES, true,
	    TGrepWriter.PARAM_WRITE_COMMENTS, true,
	    TGrepWriter.PARAM_WRITE_T2C, true),
	
	createPrimitiveDescription(ImsCwbWriter.class,
	    ImsCwbWriter.PARAM_TARGET_ENCODING, "UTF-8",
	    ImsCwbWriter.PARAM_TARGET_LOCATION, target + "/cqp",
	    ImsCwbWriter.PARAM_WRITE_TEXT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_DOCUMENT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_OFFSETS, true,
	    ImsCwbWriter.PARAM_WRITE_LEMMA, true,
	    ImsCwbWriter.PARAM_WRITE_DOC_ID, false));

Conversion script - TIGER

#!/usr/bin/env groovy
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.bincas-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.negra-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.tgrep-gpl',
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl', 
	version='1.5.0')

import org.apache.uima.collection.CollectionReaderDescription;
import org.uimafit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter;
import de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter;
import de.tudarmstadt.ukp.dkpro.core.io.negra.NegraExportReader;
import de.tudarmstadt.ukp.dkpro.core.io.tgrep.TGrepWriter;
import de.tudarmstadt.ukp.dkpro.core.api.resources.CompressionMethod;
import static org.uimafit.factory.AnalysisEngineFactory.createPrimitiveDescription;
import static org.uimafit.factory.CollectionReaderFactory.createDescription;

// Collection ID
def id = args[0]

// Source file (e.g. tigercorpus2.1/corpus/tiger_release_aug07.export)
def source = args[1]

// Target folder
def target = args[2];

SimplePipeline.runPipeline(
	createDescription(NegraExportReader.class,
	    NegraExportReader.PARAM_SOURCE_LOCATION, source,
	    NegraExportReader.PARAM_COLLECTION_ID, id,
	    NegraExportReader.PARAM_LANGUAGE, "de",
	    NegraExportReader.PARAM_ENCODING, "ISO-8859-15",
	    NegraExportReader.PARAM_GENERATE_NEW_IDS, true,
	    NegraExportReader.PARAM_READ_PENN_TREE, true,
	    NegraExportReader.PARAM_DOCUMENT_UNIT, NegraExportReader.DocumentUnit.ORIGIN_NAME),

	createPrimitiveDescription(SerializedCasWriter.class,
	    SerializedCasWriter.PARAM_PATH, target + "/bin",
	    SerializedCasWriter.PARAM_USE_DOCUMENT_ID, true,
	    SerializedCasWriter.PARAM_COMPRESSION, CompressionMethod.XZ),
	
	createPrimitiveDescription(TGrepWriter.class,
	    TGrepWriter.PARAM_TARGET_LOCATION, target + "/tgrep",
	    TGrepWriter.PARAM_COMPRESSION, CompressionMethod.GZIP,
	    TGrepWriter.PARAM_DROP_MALFORMED_TREES, true,
	    TGrepWriter.PARAM_WRITE_COMMENTS, true,
	    TGrepWriter.PARAM_WRITE_T2C, true),
	
	createPrimitiveDescription(ImsCwbWriter.class,
	    ImsCwbWriter.PARAM_TARGET_ENCODING, "UTF-8",
	    ImsCwbWriter.PARAM_TARGET_LOCATION, target + "/cqp",
	    ImsCwbWriter.PARAM_WRITE_TEXT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_DOCUMENT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_OFFSETS, true,
	    ImsCwbWriter.PARAM_WRITE_LEMMA, true,
	    ImsCwbWriter.PARAM_WRITE_DOC_ID, false));

Conversion script - Digitale Bibliothek

#!/usr/bin/env groovy
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.tei-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.bincas-asl', 
	version='1.5.0')
@Grab(
	group='de.tudarmstadt.ukp.dkpro.core', 
	module='de.tudarmstadt.ukp.dkpro.core.io.imscwb-asl', 
	version='1.5.0')

import org.uimafit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter;
import de.tudarmstadt.ukp.dkpro.core.io.imscwb.ImsCwbWriter;
import de.tudarmstadt.ukp.dkpro.core.io.tei.TEIReader;
import de.tudarmstadt.ukp.dkpro.core.api.resources.CompressionMethod;
import static org.uimafit.factory.AnalysisEngineFactory.createPrimitiveDescription;
import static org.uimafit.factory.CollectionReaderFactory.createDescription;

// Collection ID
def id = args[0]

// Source file (e.g. digitale-bibliothek/literatur-nur-texte-1.zip)
def source = args[1]

// Target folder
def target = args[2];

SimplePipeline.runPipeline(
	createDescription(TEIReader.class, 
	    TEIReader.PARAM_PATH, "jar:file:/" + source + "!",
	    TEIReader.PARAM_PATTERNS, ["[+]**/*.xml"] as String[],
	    TEIReader.PARAM_USE_FILENAME_ID, true,
	    TEIReader.PARAM_LANGUAGE, "de"),

	createPrimitiveDescription(SerializedCasWriter.class,
	    SerializedCasWriter.PARAM_PATH, target + "/bin",
	    SerializedCasWriter.PARAM_USE_DOCUMENT_ID, true,
	    SerializedCasWriter.PARAM_COMPRESSION, CompressionMethod.XZ),
	
	createPrimitiveDescription(ImsCwbWriter.class,
	    ImsCwbWriter.PARAM_TARGET_ENCODING, "UTF-8",
	    ImsCwbWriter.PARAM_TARGET_LOCATION, target + "/cqp",
	    ImsCwbWriter.PARAM_WRITE_TEXT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_DOCUMENT_TAG, true,
	    ImsCwbWriter.PARAM_WRITE_OFFSETS, true,
	    ImsCwbWriter.PARAM_WRITE_LEMMA, true,
	    ImsCwbWriter.PARAM_WRITE_DOC_ID, false));