PdfLayoutEventStripper (DKPro Core 1.9.0 API)

java.lang.Object
- PDFStreamEngine
- - de.tudarmstadt.ukp.dkpro.core.io.pdf.PdfLayoutEventStripper

Direct Known Subclasses:

Pdf2CasConverter
```
public abstract class PdfLayoutEventStripper
extends PDFStreamEngine
```
This class takes a PDF document and extracts all of its text, ignoring the formatting. Note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document.
This class is based on the PDFBox 1.7.0 PDFTextStripper class and was substantially modified and enhanced for basic paragraph and heading detection. Unfortunately, it was not possible to add these enhancements through subclassing, so the code was copied and adapted.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`PdfLayoutEventStripper.Style`
`static class`	`PdfLayoutEventStripper.Values`

Field Summary

Fields
Modifier and Type	Field and Description
`protected Vector<List<TextPosition>>`	`charactersByArticle` The charactersByArticle is used to extract text by article divisions.

Constructor Summary

Constructors
Constructor and Description
`PdfLayoutEventStripper()` Instantiate a new PdfLayoutEventStripper object.
`PdfLayoutEventStripper(Properties props)` Instantiate a new PdfLayoutEventStripper object.

Method Summary

Modifier and Type	Method and Description
`protected abstract void`	`endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)` This method is available for subclasses of this class.
`protected abstract void`	`endPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page)` End a page.
`protected abstract void`	`endRegion(PdfLayoutEventStripper.Style style)` End a region.
`protected List<List<TextPosition>>`	`getCharactersByArticle()` Character strings are grouped by articles.
`protected int`	`getCurrentPageNo()` Get the current page number that is being processed.
`int`	`getEndPage()` This will get the last page that will be extracted.
`int`	`getStartPage()` This is the page that the text extraction will start on.
`protected PdfLayoutEventStripper.Style`	`getStyle(TextPosition pos)`
`protected void`	`processArticle(List<TextPosition> textList)` This method tries to detect headings, paragraphs, and line boundaries.
`protected abstract void`	`processLineSeparator()`
`protected void`	`processPage(org.apache.pdfbox.pdmodel.PDPage page, org.apache.pdfbox.cos.COSStream content)` This will process the contents of a page.
`protected void`	`processPages(List<org.apache.pdfbox.pdmodel.PDPage> pages)` This will process all of the pages and the text that is in them.
`protected void`	`processTextPosition(TextPosition text)` This will add a character to the list of characters to be printed to the text file.
`protected abstract void`	`processWordSeparator()`
`void`	`setEndPage(int endPageValue)` This will set the last page to be extracted by this class.
`void`	`setShouldSeparateByBeads(boolean aShouldSeparateByBeads)` Set if the text stripper should group the text output by a list of beads.
`void`	`setStartPage(int startPageValue)` This will set the first page to be extracted by this class.
`void`	`setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)` By default the text stripper will attempt to remove duplicate text that overlaps.
`boolean`	`shouldSeparateByBeads()` This will tell if the text stripper should separate by beads.
`boolean`	`shouldSuppressDuplicateOverlappingText()`
`protected abstract void`	`startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)` This method is available for subclasses of this class.
`protected abstract void`	`startPage(int firstPage, int lastPage, int currentPage, org.apache.pdfbox.pdmodel.PDPage page)` Start a new page.
`protected abstract void`	`startRegion(PdfLayoutEventStripper.Style style)` Start a new region.
`protected abstract void`	`writeCharacters(TextPosition text)` Write the string to the output stream.
`void`	`writeText(org.apache.pdfbox.pdmodel.PDDocument doc)` This will take a PDDocument and write the text of that document to the print writer.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - charactersByArticle
```
protected Vector<List<TextPosition>> charactersByArticle
```
    The charactersByArticle is used to extract text by article divisions. For example, for a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are:
    
    - text before first article
    - first article text
    - text between first article and second article
    - second article text
    - text after second article
    
    Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
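    The sizing rule described for charactersByArticle can be sketched as follows. This is a minimal illustration derived only from the description (one division per bead plus the gaps before, between, and after them); `ArticleDivisionsSketch`, `divisionCount`, and `divisionLabels` are hypothetical names and are not part of DKPro Core or PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core or PDFBox.
class ArticleDivisionsSketch {

    /** One division per bead plus the gaps around them; a page without beads has one division. */
    static int divisionCount(int beads) {
        return 2 * Math.max(beads, 0) + 1;
    }

    /** Labels for the divisions in extraction order. */
    static List<String> divisionLabels(int beads) {
        List<String> labels = new ArrayList<>();
        if (beads <= 0) {
            labels.add("all text");
            return labels;
        }
        labels.add("text before article 1");
        for (int i = 1; i <= beads; i++) {
            labels.add("article " + i + " text");
            labels.add(i < beads
                    ? "text between article " + i + " and article " + (i + 1)
                    : "text after article " + beads);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Two-column newspaper example: 2 beads give 5 divisions.
        System.out.println(divisionCount(2));
        for (String label : divisionLabels(2)) {
            System.out.println(label);
        }
    }
}
```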
- Constructor Detail
  - PdfLayoutEventStripper
```
public PdfLayoutEventStripper()
                       throws IOException
```
    Instantiate a new PdfLayoutEventStripper object. This object will load properties from Resources/PDFTextStripper.properties.
    
    Throws:
    
    IOException - If there is an error loading the properties.
  - PdfLayoutEventStripper
```
public PdfLayoutEventStripper(Properties props)
                       throws IOException
```
    Instantiate a new PdfLayoutEventStripper object, loading all of the operator mappings from the properties object that is passed in.
    
    Parameters:
    
    props - The properties containing the mapping of operators to PDFOperator classes.
    
    Throws:
    
    IOException - If there is an error reading the properties.
- Method Detail
  - writeText
```
public void writeText(org.apache.pdfbox.pdmodel.PDDocument doc)
               throws IOException
```
    This will take a PDDocument and write the text of that document to the print writer.
    
    Parameters:
    
    doc - The document to get the data from.
    
    Throws:
    
    IOException - If the doc is in an invalid state.
  - processPages
```
protected void processPages(List<org.apache.pdfbox.pdmodel.PDPage> pages)
                     throws IOException
```
    This will process all of the pages and the text that is in them.
    
    Parameters:
    
    pages - The pages object in the document.
    
    Throws:
    
    IOException - If there is an error parsing the text.
  - processPage
```
protected void processPage(org.apache.pdfbox.pdmodel.PDPage page,
                           org.apache.pdfbox.cos.COSStream content)
                    throws IOException
```
    This will process the contents of a page.
    
    Parameters:
    
    page - The page to process.
    
    content - The contents of the page.
    
    Throws:
    
    IOException - If there is an error processing the page.
  - processArticle
```
protected void processArticle(List<TextPosition> textList)
                       throws IOException
```
    This method tries to detect headings, paragraphs, and line boundaries.
    
    Parameters:
    
    textList - the text.
    
    Throws:
    
    IOException - if there is an error writing to the stream.
  - processTextPosition
```
protected void processTextPosition(TextPosition text)
```
    This will add a character to the list of characters to be printed to the text file.
    
    Parameters:
    
    text - The description of the character to display.
  - getStyle
```
protected PdfLayoutEventStripper.Style getStyle(TextPosition pos)
```
  - startDocument
```
protected abstract void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
                               throws IOException
```
    This method is available for subclasses of this class. It will be called before processing of the document starts.
    
    Parameters:
    
    pdf - The PDF document that is being processed.
    
    Throws:
    
    IOException - If an IO error occurs.
  - endDocument
```
protected abstract void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
                             throws IOException
```
    This method is available for subclasses of this class. It will be called after processing of the document finishes.
    
    Parameters:
    
    pdf - The PDF document that is being processed.
    
    Throws:
    
    IOException - If an IO error occurs.
  - startRegion
```
protected abstract void startRegion(PdfLayoutEventStripper.Style style)
                             throws IOException
```
    Start a new region.
    
    Parameters:
    
    style - the style.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - endRegion
```
protected abstract void endRegion(PdfLayoutEventStripper.Style style)
                           throws IOException
```
    End a region.
    
    Parameters:
    
    style - the style.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - startPage
```
protected abstract void startPage(int firstPage,
                                  int lastPage,
                                  int currentPage,
                                  org.apache.pdfbox.pdmodel.PDPage page)
                           throws IOException
```
    Start a new page.
    
    Parameters:
    
    firstPage - first page.
    
    lastPage - last page.
    
    currentPage - current page.
    
    page - The page we are about to process.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - endPage
```
protected abstract void endPage(int firstPage,
                                int lastPage,
                                int currentPage,
                                org.apache.pdfbox.pdmodel.PDPage page)
                         throws IOException
```
    End a page.
    
    Parameters:
    
    firstPage - first page.
    
    lastPage - last page.
    
    currentPage - current page.
    
    page - The page that was just processed.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - processLineSeparator
```
protected abstract void processLineSeparator()
                                      throws IOException
```
    Throws:
    
    IOException
  - processWordSeparator
```
protected abstract void processWordSeparator()
                                      throws IOException
```
    Throws:
    
    IOException
  - writeCharacters
```
protected abstract void writeCharacters(TextPosition text)
                                 throws IOException
```
    Write the string to the output stream.
    
    Parameters:
    
    text - The text to write to the stream.
    
    Throws:
    
    IOException - If there is an error when writing the text.
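    To illustrate how a subclass might react to the abstract callbacks above, here is a self-contained sketch. It does not use the real class: `LayoutEventsSketch` is a hypothetical stand-in whose methods take plain strings instead of the PDFBox types and `PdfLayoutEventStripper.Style`, and they omit `throws IOException`. The order in which the callbacks fire (region start, words with separators, line separators between lines, region end) is an assumption made for the example, not documented behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the callback surface. The real methods take
// PDFBox types (PDDocument, PDPage, TextPosition) and a Style.
abstract class LayoutEventsSketch {
    abstract void startRegion(String style);
    abstract void endRegion(String style);
    abstract void writeCharacters(String text);
    abstract void processWordSeparator();
    abstract void processLineSeparator();

    /** Drives the callbacks for one region; the call order is an assumption. */
    final void emitRegion(String style, List<List<String>> lines) {
        startRegion(style);
        for (int l = 0; l < lines.size(); l++) {
            if (l > 0) {
                processLineSeparator();
            }
            List<String> words = lines.get(l);
            for (int w = 0; w < words.size(); w++) {
                if (w > 0) {
                    processWordSeparator();
                }
                writeCharacters(words.get(w));
            }
        }
        endRegion(style);
    }
}

// A subclass that turns the events into tagged plain text.
class PlainTextCollector extends LayoutEventsSketch {
    final StringBuilder out = new StringBuilder();

    @Override void startRegion(String style) { out.append('[').append(style).append("] "); }
    @Override void endRegion(String style) { out.append('\n'); }
    @Override void writeCharacters(String text) { out.append(text); }
    @Override void processWordSeparator() { out.append(' '); }
    @Override void processLineSeparator() { out.append('\n'); }

    public static void main(String[] args) {
        PlainTextCollector collector = new PlainTextCollector();
        collector.emitRegion("HEADING", Arrays.asList(Arrays.asList("Hello", "World")));
        System.out.print(collector.out);
    }
}
```

    A real subclass would extend PdfLayoutEventStripper itself and also implement the document and page callbacks.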
  - getStartPage
```
public int getStartPage()
```
    This is the page that the text extraction will start on. The pages start at page 1. For example in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
    
    Returns:
    
    Value of property startPage.
  - setStartPage
```
public void setStartPage(int startPageValue)
```
    This will set the first page to be extracted by this class.
    
    Parameters:
    
    startPageValue - New value of property startPage.
  - getEndPage
```
public int getEndPage()
```
    This will get the last page that will be extracted. The value is inclusive: for a 5 page PDF, an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
    
    Returns:
    
    Value of property endPage.
  - setEndPage
```
public void setEndPage(int endPageValue)
```
    This will set the last page to be extracted by this class.
    
    Parameters:
    
    endPageValue - New value of property endPage.
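    The start and end page semantics described above (1-based, inclusive, defaults of 1 and Integer.MAX_VALUE) can be sketched in isolation. `PageRangeSketch` and `pagesToExtract` are hypothetical names used only for this illustration; the sketch mirrors the documented properties and does not call the real class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors only the documented startPage/endPage semantics.
class PageRangeSketch {
    private int startPage = 1;               // documented default
    private int endPage = Integer.MAX_VALUE; // documented default

    void setStartPage(int startPageValue) { startPage = startPageValue; }
    void setEndPage(int endPageValue) { endPage = endPageValue; }

    /** Returns the 1-based page numbers inside the inclusive range [startPage, endPage]. */
    List<Integer> pagesToExtract(int pageCount) {
        List<Integer> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (page >= startPage && page <= endPage) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        PageRangeSketch range = new PageRangeSketch();
        range.setStartPage(4);
        // For a 5 page document this selects pages 4 and 5.
        System.out.println(range.pagesToExtract(5));
    }
}
```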
  - shouldSuppressDuplicateOverlappingText
```
public boolean shouldSuppressDuplicateOverlappingText()
```
    Returns:
    
    Returns the suppressDuplicateOverlappingText.
  - getCurrentPageNo
```
protected int getCurrentPageNo()
```
    Get the current page number that is being processed.
    
    Returns:
    
    A 1 based number representing the current page.
  - getCharactersByArticle
```
protected List<List<TextPosition>> getCharactersByArticle()
```
    Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List of Lists; each inner list contains the TextPosition objects of one article division.
    
    Returns:
    
    A double List of TextPositions for all text strings on the page.
  - setSuppressDuplicateOverlappingText
```
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
```
    By default the text stripper will attempt to remove duplicate text that overlaps. For example, Word paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
    
    Parameters:
    
    suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
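    The idea behind suppressing duplicate overlapping text can be sketched as below. This is not the class's actual implementation: `Glyph` is a hypothetical stand-in for a positioned character, and the tolerance of one third of the glyph width is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual DKPro Core or PDFBox code.
class OverlapSuppressionSketch {

    /** Minimal stand-in for a positioned character. */
    static final class Glyph {
        final String ch;
        final float x;
        final float y;
        final float width;

        Glyph(String ch, float x, float y, float width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    static boolean within(float first, float second, float tolerance) {
        return Math.abs(first - second) < tolerance;
    }

    /** Keeps a glyph unless the same character was already kept at nearly the same position. */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            float tolerance = g.width / 3.0f; // assumed tolerance
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch.equals(g.ch)
                        && within(k.x, g.x, tolerance)
                        && within(k.y, g.y, tolerance)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 10.0f, 10.0f, 6.0f));
        glyphs.add(new Glyph("A", 10.5f, 10.0f, 6.0f)); // "fake bold" repaint of the same character
        glyphs.add(new Glyph("B", 16.0f, 10.0f, 6.0f));
        System.out.println(suppressDuplicates(glyphs).size());
    }
}
```

    The pairwise comparison in this sketch also hints at why switching suppression off can be faster.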
  - shouldSeparateByBeads
```
public boolean shouldSeparateByBeads()
```
    This will tell if the text stripper should separate by beads.
    
    Returns:
    
    If the text will be grouped by beads.
  - setShouldSeparateByBeads
```
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
```
    Set whether the text stripper should group the text output by a list of beads. The default value is true.
    
    Parameters:
    
    aShouldSeparateByBeads - The new grouping of beads.
