This page gives and overview on the most important classes used by JWKTL. Make sure you followed the [GettingStarted getting started] guide before proceeding.
The main entry point is the JWKTL
class, which facilitates parsing an XML dump file and accessing the parsed data. The latter should equip you with an instance of IWiktionaryEdition
, JWKTL’s representation of a single parsed Wiktionary language edition (e.g., the English Wiktionary). Using the IWiktionaryEdition
instance, you can access the different information types encoded in Wiktionary. The following class diagram shows the most important classes in this context:
Article pages
The primary building blocks of Wiktionary are article pages, which contain lexicographic information on a specific word form (e.g., “boat”). This can include multiple parts of speech (e.g., the noun and the verb “plant”) and multiple languages (e.g., “fog” in English and Hungarian). JWKTL makes use of the IWiktionaryPage
interface to represent these article pages. They may be queried either by their unique ID or the described word form. For each page, the title (i.e., the word form), the date, revision ID, and author of the last revision, the language of the embracing Wiktionary language edition, the set of language codes occuring as interwiki-links, and the target of a possible redirection. Code example:
Lexical entries
Each combination of language and part of speech described on an article page yields a separate instance of IWiktionaryEntry
(i.e., multiple lexical entries sharing the same page). Besides the language and the part of speech, an entry encodes the etymology, gender (if applicable), pronunciations, inflected word forms, as well as links to a related entry containing more information (e.g., in case of alternative spellings). The entries can be accessed using the IWiktionaryPage
instance they are contained in or using the API directly. Note that search queries are case sensitive by default and thus return different results for looking up “rom”, “Rom”, or “ROM”. An optional parameter can be used to initiate case insensitive search. Code example:
Word senses
Finally, each lexical entry can distinguish multiple word senses, which are represented in JWKTL as IWiktionarySense
instances. This class allows accessing the sense definition (gloss), example sentences, quotations, semantic relations, translations, and references. The senses are associated with a running index in the order of their definition within a Wiktionary entry. Since there are often information items that are not associated to any word sense, there is also an unassiged sense containing all of this information. This sense has the running index 0. Code example:
Having understood this general architecture, you can now take a look at specific JWKTL use cases to extract a certain kind of information type. In addition to that, the Javadoc documentation gives further information on using the API and its different methods.