public abstract class PdfLayoutEventStripper
extends PDFStreamEngine
This class is based on the PDFBox 1.7.0 PDFTextStripper class and was substantially modified and enhanced for basic paragraph and heading detection. Because these enhancements could not be added through subclassing, the code was copied and adapted.
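The class reports what it detects through abstract callbacks such as startRegion, writeCharacters, processWordSeparator, processLineSeparator and endRegion, which a subclass implements. The following self-contained sketch illustrates that event-driven design without depending on PDFBox. Everything in it is a hypothetical stand-in: MiniStripper replaces the real base class, the Style enum with HEADING and PARAGRAPH replaces PdfLayoutEventStripper.Style, and plain strings replace TextPosition. The order in which the callbacks fire is an assumption, not documented behavior.

```java
import java.util.Arrays;
import java.util.List;

public class LayoutEventSketch {

    /** Hypothetical stand-in for PdfLayoutEventStripper.Style. */
    enum Style { HEADING, PARAGRAPH }

    /** Hypothetical region: a style plus lines, each line a list of words. */
    static class Region {
        final Style style;
        final List<List<String>> lines;

        Region(Style style, List<List<String>> lines) {
            this.style = style;
            this.lines = lines;
        }
    }

    /** Simplified stand-in for the abstract stripper (no PDFBox types). */
    abstract static class MiniStripper {
        protected abstract void startRegion(Style style);
        protected abstract void endRegion(Style style);
        protected abstract void writeCharacters(String text);
        protected abstract void processWordSeparator();
        protected abstract void processLineSeparator();

        /** Fires the callbacks for each region; this ordering is an assumption. */
        void process(List<Region> regions) {
            for (Region region : regions) {
                startRegion(region.style);
                for (int i = 0; i < region.lines.size(); i++) {
                    if (i > 0) {
                        processLineSeparator();
                    }
                    List<String> words = region.lines.get(i);
                    for (int j = 0; j < words.size(); j++) {
                        if (j > 0) {
                            processWordSeparator();
                        }
                        writeCharacters(words.get(j));
                    }
                }
                endRegion(region.style);
            }
        }
    }

    /** Example subclass that renders the events as simple tagged text. */
    static class TaggedTextStripper extends MiniStripper {
        final StringBuilder out = new StringBuilder();

        private static String tag(Style style) {
            return style == Style.HEADING ? "h" : "p";
        }

        @Override protected void startRegion(Style style) {
            out.append('<').append(tag(style)).append('>');
        }
        @Override protected void endRegion(Style style) {
            out.append("</").append(tag(style)).append(">\n");
        }
        @Override protected void writeCharacters(String text) { out.append(text); }
        @Override protected void processWordSeparator() { out.append(' '); }
        @Override protected void processLineSeparator() { out.append('\n'); }
    }

    /** Runs the sketch on two hand-written regions and returns the rendered text. */
    static String demo() {
        List<Region> regions = Arrays.asList(
            new Region(Style.HEADING, Arrays.asList(Arrays.asList("Chapter", "One"))),
            new Region(Style.PARAGRAPH, Arrays.asList(
                Arrays.asList("Hello", "world"),
                Arrays.asList("second", "line"))));
        TaggedTextStripper stripper = new TaggedTextStripper();
        stripper.process(regions);
        return stripper.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The base class owns the traversal and the subclass only decides how each event is rendered, so the same detection logic could feed plain text, markup, or annotations.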

Modifier and Type | Class and Description |
---|---|
static class | PdfLayoutEventStripper.Style |
static class | PdfLayoutEventStripper.Values |

Modifier and Type | Field and Description |
---|---|
protected Vector<List<TextPosition>> | charactersByArticle: The charactersByArticle is used to extract text by article divisions. |

Constructor and Description |
---|
PdfLayoutEventStripper(): Instantiate a new PdfLayoutEventStripper object. |
PdfLayoutEventStripper(Properties props): Instantiate a new PdfLayoutEventStripper object. |

Modifier and Type | Method and Description |
---|---|
protected abstract void | endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf): This method is available for subclasses of this class. |
protected abstract void | endPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page): End a page. |
protected abstract void | endRegion(PdfLayoutEventStripper.Style style): End a region. |
protected List<List<TextPosition>> | getCharactersByArticle(): Character strings are grouped by articles. |
protected int | getCurrentPageNo(): Get the current page number that is being processed. |
int | getEndPage(): This will get the last page that will be extracted. |
int | getStartPage(): This is the page that the text extraction will start on. |
protected PdfLayoutEventStripper.Style | getStyle(TextPosition pos) |
protected void | processArticle(List<TextPosition> textList): This method tries to detect headings, paragraphs, and line boundaries. |
protected abstract void | processLineSeparator() |
protected void | processPage(org.apache.pdfbox.pdmodel.PDPage page, org.apache.pdfbox.cos.COSStream content): This will process the contents of a page. |
protected void | processPages(List<org.apache.pdfbox.pdmodel.PDPage> pages): This will process all of the pages and the text that is in them. |
protected void | processTextPosition(TextPosition text): This will add a character to the list of characters to be printed to the text file. |
protected abstract void | processWordSeparator() |
void | setEndPage(int endPageValue): This will set the last page to be extracted by this class. |
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads): Set whether the text stripper should group the text output by a list of beads. |
void | setStartPage(int startPageValue): This will set the first page to be extracted by this class. |
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue): By default the text stripper will attempt to remove text that overlaps. |
boolean | shouldSeparateByBeads(): This will tell if the text stripper should separate by beads. |
boolean | shouldSuppressDuplicateOverlappingText() |
protected abstract void | startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf): This method is available for subclasses of this class. |
protected abstract void | startPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page): Start a new page. |
protected abstract void | startRegion(PdfLayoutEventStripper.Style style): Start a new region. |
protected abstract void | writeCharacters(TextPosition text): Write the string to the output stream. |
void | writeText(org.apache.pdfbox.pdmodel.PDDocument doc): This will take a PDDocument and write the text of that document to the print writer. |
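
The page-range accessors (getStartPage, setStartPage, getEndPage, setEndPage) together with getCurrentPageNo suggest that pages outside the configured range are skipped during extraction. The sketch below shows one way such a selection could work. It assumes the class keeps the convention of the PDFBox PDFTextStripper it derives from, namely 1-based page numbers with both bounds inclusive. The helper selectedPages is hypothetical and not part of this API.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRangeSketch {

    /**
     * Returns the 1-based page numbers of a document with pageCount pages
     * that fall inside the inclusive range [startPage, endPage].
     * Assumption: numbering and bounds follow the PDFTextStripper convention.
     */
    static List<Integer> selectedPages(int pageCount, int startPage, int endPage) {
        List<Integer> selected = new ArrayList<>();
        for (int currentPageNo = 1; currentPageNo <= pageCount; currentPageNo++) {
            // A page is processed only when it lies within both bounds.
            if (currentPageNo >= startPage && currentPageNo <= endPage) {
                selected.add(currentPageNo);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Restrict a five-page document to pages 2 through 4.
        System.out.println(selectedPages(5, 2, 4)); // prints [2, 3, 4]
        // An end page beyond the document length simply selects up to the last page.
        System.out.println(selectedPages(3, 1, Integer.MAX_VALUE)); // prints [1, 2, 3]
    }
}
```

Under this assumption, extracting a single page means setting the start page and the end page to the same value.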

protected Vector<List<TextPosition>> charactersByArticle

public PdfLayoutEventStripper() throws IOException
IOException - If there is an error loading the properties.

public PdfLayoutEventStripper(Properties props) throws IOException
props - The properties containing the mapping of operators to PDFOperator classes.
IOException - If there is an error reading the properties.

public void writeText(org.apache.pdfbox.pdmodel.PDDocument doc) throws IOException
doc - The document to get the data from.
IOException - If the doc is in an invalid state.

protected void processPages(List<org.apache.pdfbox.pdmodel.PDPage> pages) throws IOException
pages - The pages object in the document.
IOException - If there is an error parsing the text.

protected void processPage(org.apache.pdfbox.pdmodel.PDPage page, org.apache.pdfbox.cos.COSStream content) throws IOException
page - The page to process.
content - The contents of the page.
IOException - If there is an error processing the page.

protected void processArticle(List<TextPosition> textList) throws IOException
textList - The text.
IOException - If there is an error writing to the stream.

protected void processTextPosition(TextPosition text)
text - The description of the character to display.

protected PdfLayoutEventStripper.Style getStyle(TextPosition pos)

protected abstract void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.

protected abstract void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.

protected abstract void startRegion(PdfLayoutEventStripper.Style style) throws IOException
style - The style.
IOException - If there is any error writing to the stream.

protected abstract void endRegion(PdfLayoutEventStripper.Style style) throws IOException
style - The style.
IOException - If there is any error writing to the stream.

protected abstract void startPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page) throws IOException
firstPage - First page.
lastPage - Last page.
currentPage - Current page.
page - The page we are about to process.
IOException - If there is any error writing to the stream.

protected abstract void endPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page) throws IOException
firstPage - First page.
lastPage - Last page.
currentPage - Current page.
page - The page we are about to process.
IOException - If there is any error writing to the stream.

protected abstract void processLineSeparator() throws IOException
IOException

protected abstract void processWordSeparator() throws IOException
IOException

protected abstract void writeCharacters(TextPosition text) throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.

public int getStartPage()

public void setStartPage(int startPageValue)
startPageValue - New value of property startPage.

public int getEndPage()

public void setEndPage(int endPageValue)
endPageValue - New value of property endPage.

public boolean shouldSuppressDuplicateOverlappingText()

protected int getCurrentPageNo()

protected List<List<TextPosition>> getCharactersByArticle()

public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.

public boolean shouldSeparateByBeads()

public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.

Copyright © 2007–2018 Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt. All rights reserved.