Anatomy of a DKPro TC experiment
Using the ExperimentBuilder
, as described in the basic section, wires an experiment and creates all necessary data structures to execute an experiment. We briefly show here the necessary steps without using the ExperimentBuilder
to create an experiment: All information is provided in a DKPro Lab data type called a Dimension
. Each piece of information is provided in an own dimension which are identified by their name and put into a ParameterSpace
. This parameter space is then provided to an experimental setup, which is executed by DKPro Lab
. In some cases it might become necessary to work with dimensions but for the regular setups, using the ExperimentBuilder
is sufficient.
// Defining the readers that read the data that we use in an experiment
CollectionReaderDescription readerTrain = CollectionReaderFactory.create..()
CollectionReaderDescription readerTest = CollectionReaderFactory.create..();
Map<String, Object> dimReaders = new HashMap<String, Object>();
dimReaders.put(DIM_READER_TRAIN, readerTrain);
dimReaders.put(DIM_READER_TEST, readerTest);
Dimension<Map<String, Object>> readersDimension = Dimension.createBundle("readers", dimReaders);
/* Defining the features that we extract for training a classifier model,
we use two features, the number of tokens and the 50 most frequent word ngrams */
/* Classification of documents (alternative would be sinlge words,
either independelty or as sequence modelling task) */
Dimension<String> dimFeatureMode = Dimension.create(DIM_FEATURE_MODE, FM_DOCUMENT);
/* Classification is done by predicting a single label for each document
(alternative would be regression, i.e. no label, or multi-label) */
Dimension<String> dimLearningMode = Dimension.create(DIM_LEARNING_MODE, LM_SINGLE_LABEL);
/* The feature set we use, we use two dummy features here:
Number of tokens per document and the 50 most frequent
word ngrams over all documents */
Dimension<TcFeatureSet> dimFeatureSet = Dimension.create(DIM_FEATURE_SET, new TcFeatureSet(
TcFeatureFactory.create(AvgTokenRatioPerDocument.class),
TcFeatureFactory.create(WordNGram.class,
WordNGram.PARAM_NGRAM_USE_TOP_K, 50)));
/* The configuration specifies which classifier we want to use, one can specify several
classifiers or confirgurations of the same classifier; TC will automatically execute them all */
Map<String, Object> libsvmConfig = new HashMap<String, Object>();
libsvmConfig.put(DIM_CLASSIFICATION_ARGS,
new Object[] { new LibsvmAdapter(), "-s", "0", "-c", "100" });
libsvmConfig.put(DIM_DATA_WRITER, new LibsvmAdapter().getDataWriterClass());
libsvmConfig.put(DIM_FEATURE_USE_SPARSE, new LibsvmAdapter().useSparseFeatures());
Map<String, Object> liblinearConfig = new HashMap<String, Object>();
liblinearConfig.put(DIM_CLASSIFICATION_ARGS,
new Object[] { new LiblinearAdapter(), "-s", "1"});
liblinearConfig.put(DIM_DATA_WRITER, new LiblinearAdapter().getDataWriterClass());
liblinearConfig.put(DIM_FEATURE_USE_SPARSE, new LiblinearAdapter().useSparseFeatures());
Dimension<Map<String, Object>> configs = Dimension.createBundle("config", libsvmConfig, liblinearConfig);
// Wire everything in a parameter space
ParameterSpace pSpace = new ParameterSpace(
dimLearningModem,
dimFeatureMode,
readersDimension,
dimFeatureSet,
configs
);
/* Sets the output-folder to which all data is written that is created by DKPro TC,
this includes the results of the experiments.
This environmental variable has to be set before the experiment runs, temporarily or permantely */
System.setProperty("DKPRO_HOME", System.getProperty("user.home")+"/Desktop/");
// Pass this configuration to an experiment
ExperimentTrainTest exp = new ExperimentTrainTest("ExperimentName");
exp.setPreprocessing(createEngineDescription(OutcomeAnnotator.class);
exp.addReport(new BatchTrainTestReport());
exp.setParameterSpace(pSpace);
// Run experiment
Lab.getInstance().run(exp);
Dimensions and the parameter space
An experiment consists of (i) several dimensions that are combined in a (ii) parameter space and provided to an experiment. Regarding (i), dimensions are the basic building blocks of an experimental setup. Almost every parameter that is altered in an experiment is changed or set via a dimension. The dimensions in the code declare three building blocks: First, the readers that provide the data for the experiment, second, the feature set that is used in this experiment, and third, the classification arguments that specify the classifier which is to be used (Liblinear in this case). Regarding (ii), the parameter space is main data structure which is used by DKPro TC in the background; it is important that all created dimension are added to the parameter space, otherwise they are not used.