Usage
- Attention: Only use the TimeMachine if you really need to reconstruct one or more versions of Wikipedia corresponding to past states. The data files used for this purpose are very large. If you just want a single, recent version of Wikipedia, use the DataMachine instead. If you need access to the whole revision history of all Wikipedia articles (e.g., to add revision history access to an existing JWPL database), you probably want the RevisionMachine instead, which provides access to the whole revision history for a given JWPL database.
- Learn about the different ways to get JWPL and choose the one that is right for you!
- Download the Wikipedia data from the Wikimedia Download Site. You need 3 files:
[LANGCODE]wiki-[DATE]-pages-meta-history.xml.bz2
[LANGCODE]wiki-[DATE]-pagelinks.sql.gz
[LANGCODE]wiki-[DATE]-categorylinks.sql.gz
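For example, with LANGCODE en (English) and DATE 20080101 (both values are illustrative), the three files would be:
```
enwiki-20080101-pages-meta-history.xml.bz2
enwiki-20080101-pagelinks.sql.gz
enwiki-20080101-categorylinks.sql.gz
```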
- Create a configuration file.
- You may edit one of the sample configuration files, which come with the source code of the TimeMachine.
- The configuration file must be UTF8 encoded.
- Running
org.dkpro.jwpl.wikipedia.timemachine.domain.SettingsXML
will generate a sample file that can be edited.
- The elements of the configuration file are further explained below or in the readme in the TimeMachine sources.
- Start the TimeMachine:
org.dkpro.jwpl.wikipedia.timemachine.domain.JWPLTimeMachine CONFIG_FILE
- Allocate enough heap space to speed up the execution (use the -Xmx JVM parameter to increase the heap size; e.g. -Xmx512m gives you 512 MB of heap space). A complete example invocation is shown below.
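For example, an invocation could look like this (jwpl-timemachine.jar stands in for your actual classpath and config.xml for your configuration file; both names are placeholders):
```
java -Xmx512m -cp jwpl-timemachine.jar org.dkpro.jwpl.wikipedia.timemachine.domain.JWPLTimeMachine config.xml
```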
- If everything went well (it will take a while), the extracted data files are now available in the output directory, each set in a subdirectory named with the corresponding timestamp.
- For each directory, create a database with the necessary tables using jwpl_tables.sql (an example is shown below).
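For example, assuming a MySQL server and a timestamp directory named 20080101000000 (hypothetical), the database could be created and initialized like this:
```
mysql -uUSER -p -e "CREATE DATABASE wiki_20080101000000 DEFAULT CHARACTER SET utf8;"
mysql -uUSER -p wiki_20080101000000 < jwpl_tables.sql
```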
- Import the data files into the databases:
mysqlimport -uUSER -p --local --default-character-set=utf8 {database_name} `pwd`/*.txt
Run the command from inside the directory for one timestamp; mysqlimport derives the target table name from each file name, so the generated .txt files map directly onto the tables created by jwpl_tables.sql.
- Now you are ready to use the databases with the JWPL Core API (also see JWPLCore:GettingStarted). When first connecting to a newly imported database, indexes are created. This takes some time (up to 30 minutes), depending on the server and the size of your Wikipedia. Subsequent connects won’t have this delay.
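As a sketch of that first connection (assuming the org.dkpro.jwpl.api package names; the database name, host, and credentials below are hypothetical and must match your import):
```java
import org.dkpro.jwpl.api.DatabaseConfiguration;
import org.dkpro.jwpl.api.Wikipedia;
import org.dkpro.jwpl.api.WikiConstants.Language;

public class TimeMachineExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; replace with your own.
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_20080101000000"); // one database per extracted timestamp
        dbConfig.setUser("USER");
        dbConfig.setPassword("PASSWORD");
        dbConfig.setLanguage(Language.english);

        // The first connection to a newly imported database creates the indexes,
        // which may take up to 30 minutes; subsequent connections are fast.
        Wikipedia wiki = new Wikipedia(dbConfig);
        System.out.println("Pages: " + wiki.getMetaData().getNumberOfPages());
    }
}
```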
Example configuration file
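Assuming SettingsXML uses the standard Java properties XML format (verify against a generated sample), a configuration might look like the sketch below; all values are illustrative for an English Wikipedia extraction:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <entry key="language">english</entry>
  <entry key="mainCategory">Contents</entry>
  <entry key="disambiguationCategory">Disambiguation_pages</entry>
  <entry key="fromTimestamp">20080101000000</entry>
  <entry key="toTimestamp">20080301000000</entry>
  <entry key="each">30</entry>
  <entry key="metaHistoryFile">/data/enwiki-20080301-pages-meta-history.xml.bz2</entry>
  <entry key="pageLinksFile">/data/enwiki-20080301-pagelinks.sql.gz</entry>
  <entry key="categoryLinksFile">/data/enwiki-20080301-categorylinks.sql.gz</entry>
  <entry key="outputDirectory">/data/output</entry>
  <entry key="removeInputFilesAfterProcessing">false</entry>
</properties>
```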
Parameters
Parameter | Description | Comment / Example |
---|---|---|
language | The language of the Wikipedia version used. | The language string must correspond to one of the values enumerated in WikiConstants.Language in JWPL. Examples: english, german, french, arabic. |
mainCategory | The title of the main category of the Wikipedia language version used. | For example, “Contents” for the English Wikipedia or “!Hauptkategorie” for the German Wikipedia. |
disambiguationCategory | The title of the disambiguation category of the Wikipedia language version used. | For example, “Disambiguation_pages” for the English Wikipedia or “Begriffsklärung” for the German Wikipedia. |
fromTimestamp | The timestamp of the first version to be extracted. | Format: yyyymmddhhmmss |
toTimestamp | The timestamp of the last version to be extracted. | Format: yyyymmddhhmmss |
each | The number of days used as the regular interval for extracting versions. | |
metaHistoryFile | The absolute path to the pages-meta-history file. | Only .xml and .xml.bz2 extensions are supported. |
pageLinksFile | The absolute path to the pagelinks file. | Only .sql and .sql.gz extensions are supported. |
categoryLinksFile | The absolute path to the categorylinks file. | Only .sql and .sql.gz extensions are supported. |
outputDirectory | The absolute path to the directory to which the transformed files will be written. | The outputDirectory is created if it does not exist; however, its parent directory must exist. |
removeInputFilesAfterProcessing | A boolean that specifies whether the meta-history, pagelinks, and categorylinks files should be removed after processing. | For example, true or false. |
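For instance (illustrative values), a fromTimestamp of 20080101000000, a toTimestamp of 20081231000000, and an each value of 30 would extract one version every 30 days, starting on 1 January 2008 and ending no later than 31 December 2008.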