public class WiktionaryDumpParser extends XMLDumpParser implements IWiktionaryMultistreamDumpParser
An XMLDumpParser that reads the different XML tags
of the Wiktionary XML dump file format and provides a hotspot for each
type of information. Any number of IWiktionaryPageParsers can
be registered with this dump parser. The page parsers are called whenever
a piece of information has been read. Different page parsers can, for
example, handle different page types or namespaces.
Nested classes/interfaces inherited from class XMLDumpParser: XMLDumpParser.XMLDumpHandler

| Modifier and Type | Field and Description |
|---|---|
| protected DumpInfo | dumpInfo |
| protected boolean | inPage |
| protected List<IWiktionaryPageParser> | parserRegistry |
| protected DateFormat | timestampFormat |
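
The timestampFormat field and the parseTimestamp(String) method suggest that revision timestamps are parsed with a java.text.DateFormat. The following is a minimal sketch of that idea, not the library's code. The class name TimestampSketch and the pattern are assumptions: this page does not state the pattern, so the sketch uses the ISO 8601 UTC form found in MediaWiki XML exports (for example 2016-01-20T14:05:09Z).

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in; not part of the documented API.
class TimestampSketch {

    // Assumed pattern: ISO 8601 in UTC with a literal trailing 'Z'.
    static DateFormat newTimestampFormat() {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false); // reject out-of-range fields such as month 13
        return format;
    }

    // Same shape as parseTimestamp(String): returns a Date or throws ParseException.
    static Date parseTimestamp(String dateString) throws ParseException {
        // A fresh format per call, because SimpleDateFormat is not thread-safe.
        return newTimestampFormat().parse(dateString);
    }

    public static void main(String[] args) throws ParseException {
        Date parsed = parseTimestamp("2016-01-20T14:05:09Z");
        System.out.println(parsed.getTime()); // milliseconds since the epoch
    }
}
```

A DateFormat instance kept in a shared field, as the field table suggests, would need external synchronization if the parser were ever used from several threads.
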
Fields inherited from class XMLDumpParser: BZ2_FILE_EXTENSION

| Constructor and Description |
|---|
| WiktionaryDumpParser(IWiktionaryPageParser... pageParsers): Initializes the dump parser and registers the given page parsers. |
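
The registration mechanism described above (page parsers are registered, then called back as information is read) can be illustrated with a small, self-contained sketch. Every type in it is a hypothetical, reduced stand-in written for this example: PageParser is not IWiktionaryPageParser, DumpParser is not WiktionaryDumpParser, and the callback names are assumed to mirror the protected methods listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

// All types below are hypothetical stand-ins, not the library's classes.
class PageParserRegistrySketch {

    /** Reduced stand-in for a page parser (callback names are assumptions). */
    interface PageParser {
        void onPageStart();
        void setTitle(String title);
        void setText(String text);
        void onPageEnd();
    }

    /** Reduced stand-in for the dump parser: keeps a registry and fans out events. */
    static class DumpParser {
        private final List<PageParser> parserRegistry = new ArrayList<>();

        DumpParser(PageParser... pageParsers) {
            for (PageParser pageParser : pageParsers) {
                register(pageParser);
            }
        }

        void register(PageParser pageParser) {
            parserRegistry.add(pageParser);
        }

        /** Simulates reading one page; each event goes to every registered parser. */
        void emitPage(String title, String text) {
            for (PageParser p : parserRegistry) p.onPageStart();
            for (PageParser p : parserRegistry) p.setTitle(title);
            for (PageParser p : parserRegistry) p.setText(text);
            for (PageParser p : parserRegistry) p.onPageEnd();
        }
    }

    /** Example page parser that records the title of every completed page. */
    static class TitleCollector implements PageParser {
        final List<String> titles = new ArrayList<>();
        private String currentTitle;

        @Override public void onPageStart() { currentTitle = null; }
        @Override public void setTitle(String title) { currentTitle = title; }
        @Override public void setText(String text) { /* ignored in this sketch */ }
        @Override public void onPageEnd() { titles.add(currentTitle); }
    }

    public static void main(String[] args) {
        TitleCollector collector = new TitleCollector();
        DumpParser dumpParser = new DumpParser(collector);
        dumpParser.emitPage("apple", "== English ==");
        dumpParser.emitPage("banana", "== English ==");
        System.out.println(collector.titles);
    }
}
```

With the real class, the equivalent steps would presumably be to pass one or more IWiktionaryPageParser instances to the constructor (or to register(...)) and then call parse(File).
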
| Modifier and Type | Method and Description |
|---|---|
| protected void | addNamespace(String namespace) |
| IDumpInfo | getDumpInfo(): Returns information on the current dump file and its parsing progress. |
| Iterable<IWiktionaryPageParser> | getPageParsers(): Returns the list of all registered IWiktionaryPageParsers. |
| protected void | onClose() |
| protected void | onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler): Hotspot that is invoked for each closing XML element. |
| protected void | onElementStart(String name, XMLDumpParser.XMLDumpHandler handler): Hotspot that is invoked for each opening XML element. |
| protected void | onPageEnd() |
| protected void | onPageStart() |
| protected void | onParserEnd(): Hotspot that is invoked on finishing the parsing. |
| protected void | onParserStart(): Hotspot that is invoked on starting the parser. |
| protected void | onSiteInfoComplete() |
| void | parse(File dumpFile): Parses the given XML dump file. |
| void | parseMultistream(File multistreamDumpFile, File indexFile, MultistreamFilter filter): Parses a multistream XML dump file. |
| protected Date | parseTimestamp(String dateString) |
| void | register(IWiktionaryPageParser pageParser): Register the given IWiktionaryPageParser. |
| protected static ILanguage | resolveLanguage(String baseURL) |
| protected void | setAuthor(String author) |
| protected void | setBaseURL(String baseURL) |
| protected void | setPageId(long pageId) |
| protected void | setRevision(long revisionId) |
| protected void | setText(String text) |
| protected void | setTimestamp(Date timestamp) |
| protected void | setTitle(String title) |
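
parseMultistream takes an index file and a filter in addition to the dump itself. How this class uses them is not described on this page, so the following is only a sketch of the general idea. It assumes the Wikimedia multistream index layout, where each line reads offset:pageId:title and the title may itself contain colons. IndexEntry, parseLine and selectStreamOffsets are hypothetical names, and a java.util.function.Predicate stands in for MultistreamFilter, whose interface is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper types; not part of the documented API.
class MultistreamIndexSketch {

    /** One line of the index: byte offset of the bz2 stream, page id, page title. */
    static final class IndexEntry {
        final long streamOffset;
        final long pageId;
        final String title;

        IndexEntry(long streamOffset, long pageId, String title) {
            this.streamOffset = streamOffset;
            this.pageId = pageId;
            this.title = title;
        }
    }

    /** Splits on the first two colons only, so titles such as "Category:Fruit" survive. */
    static IndexEntry parseLine(String line) {
        String[] parts = line.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed index line: " + line);
        }
        return new IndexEntry(Long.parseLong(parts[0]), Long.parseLong(parts[1]), parts[2]);
    }

    /** Returns the distinct stream offsets that hold at least one accepted page. */
    static List<Long> selectStreamOffsets(List<String> indexLines, Predicate<IndexEntry> filter) {
        Set<Long> offsets = new LinkedHashSet<>();
        for (String line : indexLines) {
            IndexEntry entry = parseLine(line);
            if (filter.test(entry)) {
                offsets.add(entry.streamOffset);
            }
        }
        return new ArrayList<>(offsets);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "600:10:apple", "600:11:Category:Fruit", "1400:12:banana");
        System.out.println(selectStreamOffsets(lines, e -> e.title.startsWith("b")));
    }
}
```

Only the streams at the selected offsets would then need to be decompressed and parsed, which is the usual reason for pairing a multistream dump with its index.
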
Methods inherited from class XMLDumpParser: parseStream
protected List<IWiktionaryPageParser> parserRegistry
protected boolean inPage
protected DumpInfo dumpInfo
protected DateFormat timestampFormat
public WiktionaryDumpParser(IWiktionaryPageParser... pageParsers)
public void register(IWiktionaryPageParser pageParser)
Description copied from interface: IWiktionaryDumpParser
Registers the given IWiktionaryPageParser. The registered
parser will then be notified once a Wiktionary-related XML tag
has been processed.
Specified by: register in interface IWiktionaryDumpParser
public Iterable<IWiktionaryPageParser> getPageParsers()
Description copied from interface: IWiktionaryDumpParser
Returns the list of all registered IWiktionaryPageParsers.
Specified by: getPageParsers in interface IWiktionaryDumpParser
public void parse(File dumpFile) throws WiktionaryException
Description copied from class: XMLDumpParser
Parses the given XML dump file.
Specified by: parse in interface IWiktionaryDumpParser
Overrides: parse in class XMLDumpParser
Parameters: dumpFile - the dump file
Throws: WiktionaryException - in case of any parser errors.
public void parseMultistream(File multistreamDumpFile, File indexFile, MultistreamFilter filter) throws WiktionaryException
Description copied from interface: IWiktionaryMultistreamDumpParser
Parses a multistream XML dump file.
Specified by: parseMultistream in interface IWiktionaryMultistreamDumpParser
Parameters:
multistreamDumpFile - the multistream dump file (*-pages-articles-multistream.xml.bz2)
indexFile - the matching index file (*-pages-articles-multistream-index.txt.bz2)
filter - the filter used to constrain the parsed pages
Throws: WiktionaryException
protected void onParserStart()
Description copied from class: XMLDumpParser
Hotspot that is invoked on starting the parser.
Overrides: onParserStart in class XMLDumpParser
protected void onSiteInfoComplete()
protected void onParserEnd()
Description copied from class: XMLDumpParser
Hotspot that is invoked on finishing the parsing.
Overrides: onParserEnd in class XMLDumpParser
protected void onClose()
protected void onElementStart(String name, XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each opening XML element.
Overrides: onElementStart in class XMLDumpParser
protected void onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each closing XML element.
Overrides: onElementEnd in class XMLDumpParser
protected void onPageStart()
protected void onPageEnd()
protected void setBaseURL(String baseURL)
protected void addNamespace(String namespace)
protected void setAuthor(String author)
protected void setRevision(long revisionId)
protected void setTimestamp(Date timestamp)
protected void setPageId(long pageId)
protected void setTitle(String title)
protected void setText(String text)
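
The element hotspots and the setters above are not documented individually, so the sketch below only illustrates one plausible way such a parser could route closing XML elements to per-field setters. It uses the JDK's SAX API directly and is not the library's implementation. The element names (page, title, id, revision, text) are assumptions taken from the MediaWiki XML export format, and ElementDispatchSketch is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; field names loosely mirror the setters listed above.
class ElementDispatchSketch extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final StringBuilder contents = new StringBuilder();

    String title;
    long pageId;
    long revisionId;
    String text;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        openElements.push(qName);
        contents.setLength(0); // start collecting this element's character data
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        contents.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
        String parent = openElements.peek(); // null once the root element closes
        String value = contents.toString();
        switch (qName) {
            case "title":
                title = value;
                break;
            case "id":
                // <id> occurs under <page>, <revision> and <contributor>;
                // the parent element decides which value it is.
                if ("page".equals(parent)) {
                    pageId = Long.parseLong(value.trim());
                } else if ("revision".equals(parent)) {
                    revisionId = Long.parseLong(value.trim());
                }
                break;
            case "text":
                text = value;
                break;
            default:
                break;
        }
    }

    static ElementDispatchSketch parse(String xml) throws Exception {
        ElementDispatchSketch handler = new ElementDispatchSketch();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ElementDispatchSketch result = parse(
                "<mediawiki><page><title>apple</title><id>7</id>"
                + "<revision><id>42</id><text>== English ==</text></revision>"
                + "</page></mediawiki>");
        System.out.println(result.title + " " + result.pageId + " " + result.revisionId);
    }
}
```

Tracking the parent element matters because the same tag name can carry different meanings at different depths. A handler that looked at the tag name alone would overwrite the page ID with the revision ID.
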
protected Date parseTimestamp(String dateString) throws ParseException
Throws: ParseException
public IDumpInfo getDumpInfo()
Returns information on the current dump file and its parsing progress.
Returns: null if the parser has not
yet been started (i.e., the parse(File) method has not
been called).
Copyright © 2011-2016 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.