public class WiktionaryDumpParser extends XMLDumpParser implements IWiktionaryMultistreamDumpParser
An XMLDumpParser that reads the different XML tags of the Wiktionary XML dump file format and provides hotspots for each type of information. Any number of IWiktionaryPageParsers can be registered with this dump parser. The page parsers are called whenever a piece of information has been read. Different page parsers can, for example, handle different page types or namespaces.
Inherited nested class: XMLDumpParser.XMLDumpHandler
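The registry-and-hotspot idea described above can be illustrated with a small, self-contained sketch. Every type in it (`PageListener`, `DumpDispatcher`, `TitleCollector`) is a hypothetical stand-in and not part of JWKTL; only the callback names are borrowed from this class's method summary. The namespace check is a deliberate simplification that treats any title containing a colon as namespaced.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only: none of these types are part of JWKTL.
class HotspotSketch {

    /** Hypothetical page-level callbacks, loosely modelled on the hotspot names. */
    interface PageListener {
        void onPageStart();
        void setTitle(String title);
        void setText(String text);
        void onPageEnd();
    }

    /** Hypothetical dispatcher: keeps a registry and notifies every listener. */
    static class DumpDispatcher {
        private final List<PageListener> registry = new ArrayList<>();

        DumpDispatcher(PageListener... listeners) {
            for (PageListener listener : listeners) {
                register(listener);
            }
        }

        void register(PageListener listener) {
            registry.add(listener);
        }

        /** Simulates reading one page and fires the hotspots in order. */
        void emitPage(String title, String text) {
            for (PageListener l : registry) l.onPageStart();
            for (PageListener l : registry) l.setTitle(title);
            for (PageListener l : registry) l.setText(text);
            for (PageListener l : registry) l.onPageEnd();
        }
    }

    /** Collects titles of one namespace; "" selects titles without a prefix. */
    static class TitleCollector implements PageListener {
        final List<String> titles = new ArrayList<>();
        private final String namespace;
        private String currentTitle;

        TitleCollector(String namespace) {
            this.namespace = namespace;
        }

        @Override public void onPageStart() { currentTitle = null; }
        @Override public void setTitle(String title) { currentTitle = title; }
        @Override public void setText(String text) { /* text is ignored here */ }

        @Override public void onPageEnd() {
            if (currentTitle == null) return;
            // Simplification: any colon is treated as a namespace separator.
            boolean matches = namespace.isEmpty()
                    ? !currentTitle.contains(":")
                    : currentTitle.startsWith(namespace + ":");
            if (matches) titles.add(currentTitle);
        }
    }

    public static void main(String[] args) {
        TitleCollector entries = new TitleCollector("");
        TitleCollector categories = new TitleCollector("Category");
        DumpDispatcher dispatcher = new DumpDispatcher(entries);
        dispatcher.register(categories);
        dispatcher.emitPage("dog", "==English==");
        dispatcher.emitPage("Category:English nouns", "...");
        System.out.println(entries.titles);
        System.out.println(categories.titles);
    }
}
```

With the real library, the equivalent wiring would presumably pass IWiktionaryPageParser implementations to the constructor or to register(IWiktionaryPageParser) and then call parse(File).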
Modifier and Type | Field and Description |
---|---|
protected DumpInfo | dumpInfo |
protected boolean | inPage |
protected List<IWiktionaryPageParser> | parserRegistry |
protected DateFormat | timestampFormat |
Inherited field: BZ2_FILE_EXTENSION
Constructor and Description |
---|
WiktionaryDumpParser(IWiktionaryPageParser... pageParsers) - Initializes the dump parser and registers the given page parsers. |
Modifier and Type | Method and Description |
---|---|
protected void | addNamespace(String namespace) |
IDumpInfo | getDumpInfo() - Returns information on the current dump file and its parsing progress. |
Iterable<IWiktionaryPageParser> | getPageParsers() - Returns the list of all registered IWiktionaryPageParsers. |
protected void | onClose() |
protected void | onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler) - Hotspot that is invoked for each closing XML element. |
protected void | onElementStart(String name, XMLDumpParser.XMLDumpHandler handler) - Hotspot that is invoked for each opening XML element. |
protected void | onPageEnd() |
protected void | onPageStart() |
protected void | onParserEnd() - Hotspot that is invoked on finishing the parsing. |
protected void | onParserStart() - Hotspot that is invoked on starting the parser. |
protected void | onSiteInfoComplete() |
void | parse(File dumpFile) - Parses the given XML dump file. |
void | parseMultistream(File multistreamDumpFile, File indexFile, MultistreamFilter filter) - Parses a multistream XML dump file. |
protected Date | parseTimestamp(String dateString) |
void | register(IWiktionaryPageParser pageParser) - Registers the given IWiktionaryPageParser. |
protected static ILanguage | resolveLanguage(String baseURL) |
protected void | setAuthor(String author) |
protected void | setBaseURL(String baseURL) |
protected void | setPageId(long pageId) |
protected void | setRevision(long revisionId) |
protected void | setText(String text) |
protected void | setTimestamp(Date timestamp) |
protected void | setTitle(String title) |
Inherited method: parseStream
protected List<IWiktionaryPageParser> parserRegistry
protected boolean inPage
protected DumpInfo dumpInfo
protected DateFormat timestampFormat
public WiktionaryDumpParser(IWiktionaryPageParser... pageParsers)
public void register(IWiktionaryPageParser pageParser)
Description copied from interface: IWiktionaryDumpParser
Registers the given IWiktionaryPageParser. The registered parser will then be notified once a Wiktionary-related XML tag has been processed.
Specified by: register in interface IWiktionaryDumpParser
public Iterable<IWiktionaryPageParser> getPageParsers()
Description copied from interface: IWiktionaryDumpParser
Returns the list of all registered IWiktionaryPageParsers.
Specified by: getPageParsers in interface IWiktionaryDumpParser
public void parse(File dumpFile) throws WiktionaryException
Description copied from class: XMLDumpParser
Parses the given XML dump file.
Specified by: parse in interface IWiktionaryDumpParser
Overrides: parse in class XMLDumpParser
Parameters:
dumpFile - the dump file
Throws:
WiktionaryException - in case of any parser errors.
public void parseMultistream(File multistreamDumpFile, File indexFile, MultistreamFilter filter) throws WiktionaryException
Description copied from interface: IWiktionaryMultistreamDumpParser
Parses a multistream XML dump file.
Specified by: parseMultistream in interface IWiktionaryMultistreamDumpParser
Parameters:
multistreamDumpFile - the dump file (*-pages-articles-multistream.xml.bz2)
indexFile - the matching index file (*-pages-articles-multistream-index.txt.bz2)
filter - the filter to use to constrain the parsed pages
Throws:
WiktionaryException
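To make the role of the index file and the filter more concrete, here is a minimal sketch of how a multistream index can be used to pick streams. It relies on the published Wikimedia index layout, in which each line reads `offset:pageId:title` and the offset is the byte position of the compressed stream that contains the page. The class `MultistreamIndexSketch` and the use of `java.util.function.Predicate` in place of MultistreamFilter are assumptions made for illustration; the actual MultistreamFilter contract is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical helper, not part of JWKTL.
class MultistreamIndexSketch {

    /** One line of the index file: stream offset, page id, page title. */
    static final class IndexEntry {
        final long offset;
        final long pageId;
        final String title;

        IndexEntry(long offset, long pageId, String title) {
            this.offset = offset;
            this.pageId = pageId;
            this.title = title;
        }
    }

    /** Splits on the first two colons only, because titles may contain colons. */
    static IndexEntry parseLine(String line) {
        String[] parts = line.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed index line: " + line);
        }
        return new IndexEntry(Long.parseLong(parts[0]), Long.parseLong(parts[1]), parts[2]);
    }

    /** Returns the distinct stream offsets, in ascending order, that hold a matching page. */
    static List<Long> selectOffsets(List<String> indexLines, Predicate<IndexEntry> filter) {
        TreeSet<Long> offsets = new TreeSet<>();
        for (String line : indexLines) {
            IndexEntry entry = parseLine(line);
            if (filter.test(entry)) {
                offsets.add(entry.offset);
            }
        }
        return new ArrayList<>(offsets);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "616:10:Wiktionary:Welcome",
                "616:12:dog",
                "98765:13:cat");
        // Stand-in filter: keep only titles without a namespace prefix.
        System.out.println(selectOffsets(lines, e -> !e.title.contains(":")));
    }
}
```

Only the selected streams then need to be decompressed, which is the main benefit of the multistream format over a single monolithic bz2 file.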
protected void onParserStart()
Description copied from class: XMLDumpParser
Hotspot that is invoked on starting the parser.
Overrides: onParserStart in class XMLDumpParser
protected void onSiteInfoComplete()
protected void onParserEnd()
Description copied from class: XMLDumpParser
Hotspot that is invoked on finishing the parsing.
Overrides: onParserEnd in class XMLDumpParser
protected void onClose()
protected void onElementStart(String name, XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each opening XML element.
Overrides: onElementStart in class XMLDumpParser
protected void onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each closing XML element.
Overrides: onElementEnd in class XMLDumpParser
protected void onPageStart()
protected void onPageEnd()
protected void setBaseURL(String baseURL)
protected void addNamespace(String namespace)
protected void setAuthor(String author)
protected void setRevision(long revisionId)
protected void setTimestamp(Date timestamp)
protected void setPageId(long pageId)
protected void setTitle(String title)
protected void setText(String text)
protected Date parseTimestamp(String dateString) throws ParseException
Throws: ParseException
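As a rough illustration of what a timestamp parser of this shape can look like, the sketch below parses the `yyyy-MM-dd'T'HH:mm:ss'Z'` UTC timestamps that MediaWiki XML dumps use in their `<timestamp>` elements. The class `TimestampSketch` is hypothetical, and the pattern is an assumption: this page does not state which pattern the timestampFormat field actually holds.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of JWKTL.
class TimestampSketch {

    /** Builds a strict formatter for dump timestamps; the pattern is an assumption. */
    static DateFormat newTimestampFormat() {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        // The trailing Z is matched literally, so the zone must be set explicitly.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    /** A fresh formatter per call, because SimpleDateFormat is not thread-safe. */
    static Date parseTimestamp(String dateString) throws ParseException {
        return newTimestampFormat().parse(dateString);
    }

    public static void main(String[] args) throws ParseException {
        Date date = parseTimestamp("2011-05-20T12:34:56Z");
        System.out.println(date.getTime());
    }
}
```

A caller has to handle the checked ParseException, for example by skipping the revision or by wrapping the error in a domain-specific exception.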
public IDumpInfo getDumpInfo()
Returns information on the current dump file and its parsing progress.
Returns: null if the parser has not yet been started (i.e., the parse(File) method has not been called).
Copyright © 2011-2016 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.