public class WikipediaTemplateFilteredArticleReader extends WikipediaReaderBase
You can define just a whitelist OR just a blacklist. If both a whitelist and a blacklist are provided, the reader selects the articles that DO contain templates from the whitelist and at the same time DO NOT contain templates from the blacklist (= the intersection of the "whitelist page set" and the "blacklist page set").
This reader only works if template tables have been generated for the JWPL database using the WikipediaTemplateInfoGenerator.
NOTE: This reader directly extends WikipediaReaderBase and not WikipediaStandardReaderBase.
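The whitelist/blacklist selection rule can be illustrated with plain Java collections. This is only a sketch of the set logic, not the reader's implementation: the class `TemplateFilterSketch`, the method `select`, and the in-memory page map are hypothetical, and it assumes that a page satisfies the whitelist when it contains at least one whitelisted template and that an empty list means "not provided".

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the selection rule; not the reader's own code.
final class TemplateFilterSketch {

    /**
     * Returns the titles of the pages that pass the filter.
     * An empty whitelist or blacklist is treated as "not provided" (assumption).
     */
    static Set<String> select(Map<String, Set<String>> templatesByPage,
                              Set<String> whitelist,
                              Set<String> blacklist) {
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> page : templatesByPage.entrySet()) {
            Set<String> templates = page.getValue();
            // "Whitelist page set": pages with at least one whitelisted template.
            boolean passesWhitelist = whitelist.isEmpty()
                    || templates.stream().anyMatch(whitelist::contains);
            // "Blacklist page set": pages with no blacklisted template.
            boolean passesBlacklist = templates.stream().noneMatch(blacklist::contains);
            if (passesWhitelist && passesBlacklist) {
                selected.add(page.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pages = new LinkedHashMap<>();
        pages.put("A", Set.of("Infobox", "Stub"));
        pages.put("B", Set.of("Infobox"));
        pages.put("C", Set.of("Stub"));
        pages.put("D", Set.of());

        System.out.println(select(pages, Set.of("Infobox"), Set.of("Stub"))); // [B]
        System.out.println(select(pages, Set.of("Infobox"), Set.of()));       // [A, B]
        System.out.println(select(pages, Set.of(), Set.of("Stub")));          // [B, D]
    }
}
```

With both lists set, only page B remains: A is rejected because it also carries a blacklisted template, and C and D never satisfy the whitelist.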
Modifier and Type | Field and Description |
---|---|
static String | PARAM_DOUBLE_CHECK_ASSOCIATED_PAGES: If this option is set, discussion pages that are associated with a blacklisted article are rejected. |
static String | PARAM_EXACT_TEMPLATE_MATCHING: Defines whether to match the templates exactly or whether to match all templates that start with the String given in the respective parameter list. |
static String | PARAM_INCLUDE_DISCUSSION_PAGES: Whether the reader should also include talk pages. |
static String | PARAM_LIMIT_NUMBER_OF_ARTICLES_TO_READ: Optional parameter that defines the maximum number of articles the reader should deliver. |
static String | PARAM_ONLY_FIRST_PARAGRAPH: If set to true, only the first paragraph instead of the whole article is used. |
static String | PARAM_OUTPUT_PLAIN_TEXT: Whether the reader outputs plain text or wiki markup. |
static String | PARAM_PAGE_BUFFER: The page buffer size (#pages) of the page iterator. |
static String | PARAM_TEMPLATE_BLACKLIST: Defines templates that the articles MUST NOT contain. |
static String | PARAM_TEMPLATE_WHITELIST: Defines templates that the articles MUST contain. |
dbconfig, PARAM_CREATE_DATABASE_CONFIG_ANNOTATION, PARAM_DB, PARAM_HOST, PARAM_LANGUAGE, PARAM_PASSWORD, PARAM_USER, wiki
Constructor and Description |
---|
WikipediaTemplateFilteredArticleReader() |
Modifier and Type | Method and Description |
---|---|
void | getNext(org.apache.uima.jcas.JCas jcas) |
org.apache.uima.util.Progress[] | getProgress() |
boolean | hasNext() |
void | initialize(org.apache.uima.UimaContext context) |
close, getLogger, getNext, initialize
destroy, getCasInitializer, getProcessingResourceMetaData, initialize, isConsuming, reconfigure, setCasInitializer, typeSystemInit
getConfigParameterValue, getConfigParameterValue, setConfigParameterValue, setConfigParameterValue
getCasManager, getMetaData, getRelativePathResolver, getResourceManager, getUimaContext, getUimaContextAdmin, setLogger, setMetaData
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public static final String PARAM_ONLY_FIRST_PARAGRAPH
public static final String PARAM_OUTPUT_PLAIN_TEXT
public static final String PARAM_INCLUDE_DISCUSSION_PAGES
public static final String PARAM_DOUBLE_CHECK_ASSOCIATED_PAGES
This check is rather expensive and could take a long time. This option is not active if only a whitelist is used.
Default Value: false
public static final String PARAM_LIMIT_NUMBER_OF_ARTICLES_TO_READ
This avoids unnecessary filtering if only a small number of articles is needed.
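Why a limit saves filtering work can be shown with a small loop that stops as soon as enough articles have been accepted. This is a sketch under stated assumptions, not the reader's code: `ArticleLimitSketch`, `firstMatching`, and the `checked` counter are hypothetical names, and the reader itself may apply the limit differently (for example when querying the database).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical illustration of early termination; not the reader's own code.
final class ArticleLimitSketch {

    /**
     * Collects at most {@code limit} pages that pass {@code filter}.
     * {@code checked} counts how many pages were actually tested.
     */
    static List<String> firstMatching(Iterable<String> pages,
                                      Predicate<String> filter,
                                      int limit,
                                      AtomicInteger checked) {
        List<String> accepted = new ArrayList<>();
        for (String page : pages) {
            if (accepted.size() >= limit) {
                break; // enough articles: skip filtering the remaining pages
            }
            checked.incrementAndGet();
            if (filter.test(page)) {
                accepted.add(page);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("P0", "P1", "P2", "P3", "P4", "P5");
        AtomicInteger checked = new AtomicInteger();
        // Stand-in filter: accept pages with an even number in their name.
        Predicate<String> evenOnly = p -> Integer.parseInt(p.substring(1)) % 2 == 0;

        System.out.println(firstMatching(pages, evenOnly, 2, checked)); // [P0, P2]
        System.out.println(checked.get());                              // 3
    }
}
```

Only three of the six pages are tested before the limit of two is reached.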
public static final String PARAM_TEMPLATE_WHITELIST
If you also define a blacklist, the intersection of both sets is used (= pages that DO contain templates from the whitelist but DO NOT contain templates from the blacklist).
public static final String PARAM_TEMPLATE_BLACKLIST
If you also define a whitelist, the intersection of both sets is used (= pages that DO contain templates from the whitelist but DO NOT contain templates from the blacklist).
public static final String PARAM_EXACT_TEMPLATE_MATCHING
Default Value: true
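PARAM_EXACT_TEMPLATE_MATCHING switches between matching template names exactly and matching every template whose name starts with a configured string. The difference can be sketched as follows, assuming plain case-sensitive string comparison; the reader itself may normalize names or delegate matching to the database, and `TemplateMatchSketch` and `matches` are hypothetical names.

```java
import java.util.List;

// Hypothetical illustration of exact vs. prefix matching; not the reader's own code.
final class TemplateMatchSketch {

    /**
     * Returns true if {@code pageTemplate} matches one of the configured names.
     * With exactMatching the names must be equal; otherwise a configured name
     * only has to be a prefix of the template name.
     */
    static boolean matches(String pageTemplate,
                           List<String> configuredNames,
                           boolean exactMatching) {
        for (String name : configuredNames) {
            boolean hit = exactMatching
                    ? pageTemplate.equals(name)
                    : pageTemplate.startsWith(name);
            if (hit) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> configured = List.of("Infobox");

        System.out.println(matches("Infobox", configured, true));         // true
        System.out.println(matches("Infobox_person", configured, true));  // false
        System.out.println(matches("Infobox_person", configured, false)); // true
    }
}
```

With the default (exact matching), a list entry such as "Infobox" would not select pages that only use "Infobox_person"; prefix matching would.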
public static final String PARAM_PAGE_BUFFER
public WikipediaTemplateFilteredArticleReader()
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
initialize in class WikipediaReaderBase
Throws: org.apache.uima.resource.ResourceInitializationException
public boolean hasNext() throws IOException, org.apache.uima.collection.CollectionException
Throws: IOException, org.apache.uima.collection.CollectionException
public void getNext(org.apache.uima.jcas.JCas jcas) throws IOException, org.apache.uima.collection.CollectionException
getNext in class WikipediaReaderBase
Throws: IOException, org.apache.uima.collection.CollectionException
public org.apache.uima.util.Progress[] getProgress()
getProgress in interface org.apache.uima.collection.base_cpm.BaseCollectionReader
getProgress in class WikipediaReaderBase
Copyright © 2007–2018 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.