A B C D E F G H I L M N O P R S T U V W Y 

A

AbstractLineSegment - Interface in de.citec.scie.pdf.structure
The AbstractLineSegment interface represents a simple line in one-dimensional space, given by a start position and an end position.
addAll(Histogramm<H>) - Method in class de.citec.scie.pdf.Histogramm
Adds all values from another histogram.
addBlock(TextBlock, PreTextBlock) - Method in class de.citec.scie.pdf.TextBlockRankEstimator
 
addDataPoint(H) - Method in class de.citec.scie.pdf.Histogramm
Adds a new data point/bin (or overwrites an existing one).
addElement(TextPosition) - Method in class de.citec.scie.pdf.PreTextLine
 
addLine(PreTextLine) - Method in class de.citec.scie.pdf.PreTextBlock
 
addTextPosition(TextPosition) - Method in class de.citec.scie.pdf.PreTextBlock
 

B

begin - Variable in class de.citec.scie.pdf.structure.LineSegment
Start position of the line.
blockCleanup(Document) - Method in class de.citec.scie.pdf.DocumentBlockCleaner
The cleanup uses a greedy heuristic: start with the short text blocks on the first page, then iterate over all other pages and try to build, for each of them, a sequence of the most similar TextBlocks.
boundariesEqual(AbstractLineSegment, AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns true if the boundaries of the two line segments are equal.
boundariesEqual(AbstractLineSegment) - Method in class de.citec.scie.pdf.structure.LineSegment
Returns true if the boundaries of the two line segments are equal.

C

calculate(String, String) - Method in class de.citec.scie.pdf.StringSimilarity
This implements an algorithm to determine the similarity between Strings by utilizing an alignment/edit distance approach.
calculateAlignment(TextPosition) - Method in class de.citec.scie.pdf.VerticalAlignmentEstimator
 
content - Variable in class de.citec.scie.pdf.PreTextLine
 
content - Variable in class de.citec.scie.pdf.structure.Document
 
content - Variable in class de.citec.scie.pdf.structure.Page
 
content - Variable in class de.citec.scie.pdf.structure.Paragraph
This is the Text content of this Paragraph.
content - Variable in class de.citec.scie.pdf.structure.TextBlock
This is the actual content of the TextBlock.
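The calculate(String, String) entry above mentions an alignment/edit distance approach without giving details. A minimal sketch of one common variant, the Levenshtein distance scaled to the range 0..1, is shown here; the class name EditDistanceSimilarity, its method names and the normalization are assumptions for illustration and not the library's StringSimilarity implementation.

```java
/** Hypothetical stand-in; not the library's StringSimilarity class. */
final class EditDistanceSimilarity {

    /** Levenshtein distance: minimum number of single-character edits. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
        System.out.println(similarity("Page 1 of 9", "Page 2 of 9"));
    }
}
```

A score like this would let near-identical strings, such as running headers that differ only in the page number, rank as highly similar.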

D

de.citec.scie.pdf - package de.citec.scie.pdf
 
de.citec.scie.pdf.structure - package de.citec.scie.pdf.structure
 
Document - Class in de.citec.scie.pdf.structure
This represents a parsed document which is defined as a sequence of pages.
Document() - Constructor for class de.citec.scie.pdf.structure.Document
 
DocumentBlockCleaner - Class in de.citec.scie.pdf
 
DocumentBlockCleaner() - Constructor for class de.citec.scie.pdf.DocumentBlockCleaner
 

E

end - Variable in class de.citec.scie.pdf.structure.LineSegment
End position of the line.
equals(Object) - Method in class de.citec.scie.pdf.structure.Document
equals(Object) - Method in class de.citec.scie.pdf.structure.Page
equals(Object) - Method in class de.citec.scie.pdf.structure.Paragraph
equals(Object) - Method in class de.citec.scie.pdf.structure.Text
equals(Object) - Method in class de.citec.scie.pdf.structure.TextBlock

F

fontHisto - Variable in class de.citec.scie.pdf.PreTextLine
 
fontSizeHisto - Variable in class de.citec.scie.pdf.PreTextLine
 

G

getAverage() - Method in class de.citec.scie.pdf.Histogramm
This only works if the given class type is a number.
getBackingMap() - Method in class de.citec.scie.pdf.Histogramm
Returns the backing HashMap.
getBegin() - Method in interface de.citec.scie.pdf.structure.AbstractLineSegment
Returns the start index of the word in the text.
getBegin() - Method in class de.citec.scie.pdf.structure.LineSegment
Returns the begin position of the line.
getEnd() - Method in interface de.citec.scie.pdf.structure.AbstractLineSegment
Returns the end index of the word in the text.
getEnd() - Method in class de.citec.scie.pdf.structure.LineSegment
Returns the end position of the line.
getFontName() - Method in class de.citec.scie.pdf.structure.Text
Get the value of fontName
getFontSize() - Method in class de.citec.scie.pdf.structure.Text
Get the value of fontSize
getMaxElement() - Method in class de.citec.scie.pdf.Histogramm
Returns the element that was counted the most.
getNumber(H) - Method in class de.citec.scie.pdf.Histogramm
Returns the current count for a given datapoint/bin.
getPageNumber() - Method in class de.citec.scie.pdf.structure.Page
Get the value of pageNumber
getRelativeFontSize() - Method in class de.citec.scie.pdf.structure.TextBlock
The font size of this TextBlock's content relative to the page-wide average.
getRelativeFontSize(TextBlock) - Method in class de.citec.scie.pdf.TextBlockRankEstimator
Returns the relative font size of this block in relation to the whole page.
getSize() - Method in class de.citec.scie.pdf.PreTextBlock
 
getText() - Method in class de.citec.scie.pdf.structure.Text
Get the value of text
getVerticalAlignment() - Method in class de.citec.scie.pdf.structure.Text
Get the value of verticalAlignment
getX_end() - Method in class de.citec.scie.pdf.PreTextLine
 
getX_start() - Method in class de.citec.scie.pdf.PreTextLine
 

H

hashCode() - Method in class de.citec.scie.pdf.structure.Document
hashCode() - Method in class de.citec.scie.pdf.structure.Page
hashCode() - Method in class de.citec.scie.pdf.structure.Paragraph
hashCode() - Method in class de.citec.scie.pdf.structure.Text
hashCode() - Method in class de.citec.scie.pdf.structure.TextBlock
hasWhiteSpace(TextPosition) - Method in class de.citec.scie.pdf.WhiteSpaceEstimator
 
Histogramm<H> - Class in de.citec.scie.pdf
A convenience implementation for histograms.
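The Histogramm entries in this index (addDataPoint, getNumber, getMaxElement, getAverage, getBackingMap) suggest a counter backed by a HashMap. The following sketch shows how such a structure might behave; CountingHistogram and its method names are hypothetical, and it assumes that adding a value increments its count, which the index does not confirm.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in; not the library's Histogramm class. */
final class CountingHistogram<H> {
    private final Map<H, Integer> counts = new HashMap<>();

    /** Counts one more occurrence of the given value (assumed semantics). */
    void add(H value) {
        counts.merge(value, 1, Integer::sum);
    }

    /** Returns how often the value was counted, or 0 if it never was. */
    int count(H value) {
        return counts.getOrDefault(value, 0);
    }

    /** Returns the most frequently counted value, or null if the histogram is empty. */
    H mostFrequent() {
        H best = null;
        int bestCount = 0;
        for (Map.Entry<H, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Count-weighted mean; throws ClassCastException if a key is not a Number. */
    double average() {
        double sum = 0;
        long n = 0;
        for (Map.Entry<H, Integer> e : counts.entrySet()) {
            sum += ((Number) e.getKey()).doubleValue() * e.getValue();
            n += e.getValue();
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    public static void main(String[] args) {
        CountingHistogram<Float> fontSizes = new CountingHistogram<>();
        fontSizes.add(10f);
        fontSizes.add(10f);
        fontSizes.add(12f);
        System.out.println(fontSizes.mostFrequent() + " " + fontSizes.average());
    }
}
```

The cast in average() illustrates why an average only makes sense when the type parameter is numeric: for any other key type the call fails at runtime.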
Histogramm() - Constructor for class de.citec.scie.pdf.Histogramm
 

I

importAsDocument(InputStream) - Static method in class de.citec.scie.pdf.PDFStructuredTextExtractor
Assumes the given InputStream to contain PDF data and parses it.
importAsInputStream(InputStream) - Static method in class de.citec.scie.pdf.PDFStructuredTextExtractor
Assumes the given InputStream to contain PDF data and parses it.
importAsString(InputStream) - Static method in class de.citec.scie.pdf.PDFStructuredTextExtractor
Assumes the given InputStream to contain PDF data and parses it.
indexedToString(int) - Method in class de.citec.scie.pdf.structure.Document
Does the same as toString but also writes the begin and end index of each object's text representation into that object's attributes (retrievable via getBegin and getEnd).
indexedToString(int) - Method in class de.citec.scie.pdf.structure.Page
Does the same as toString but also writes the begin and end index of each object's text representation into that object's attributes (retrievable via getBegin and getEnd).
indexedToString(int) - Method in class de.citec.scie.pdf.structure.Paragraph
Does the same as toString but also writes the begin and end index of each object's text representation into that object's attributes (retrievable via getBegin and getEnd).
indexedToString(int) - Method in class de.citec.scie.pdf.structure.Text
Does the same as toString but also writes the begin and end index of each object's text representation into that object's attributes (retrievable via getBegin and getEnd).
indexedToString(int) - Method in class de.citec.scie.pdf.structure.TextBlock
Does the same as toString but also writes the begin and end index of each object's text representation into that object's attributes (retrievable via getBegin and getEnd).
intersection(AbstractLineSegment, AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns a new line which is the intersection of the two lines.
isNewParagraph(PreTextLine) - Method in class de.citec.scie.pdf.ParagraphEstimator
 
isPartOfLine(TextPosition) - Method in class de.citec.scie.pdf.PreTextLine
 
isValid(AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns true if the given line is valid (its begin is smaller than or equal to its end).
isValid() - Method in class de.citec.scie.pdf.structure.LineSegment
Returns true if this line is valid (its begin is smaller than or equal to its end).
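The indexedToString(int) entries in this section describe a string conversion that also records where each object's text lands in the result. One way this could work is sketched here, assuming the int argument is the offset at which the object's text starts and that end indices are exclusive; IndexedLeaf and IndexedContainer are hypothetical stand-ins for a Text-like and a Paragraph-like object.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical leaf node; stands in for a Text-like object. */
final class IndexedLeaf {
    final String text;
    int begin;
    int end;

    IndexedLeaf(String text) {
        this.text = text;
    }

    /** Records where this leaf's text lands in the overall string, then returns it. */
    String indexedToString(int offset) {
        begin = offset;
        end = offset + text.length();
        return text;
    }
}

/** Hypothetical container; stands in for a Paragraph-like object. */
final class IndexedContainer {
    final List<IndexedLeaf> children = new ArrayList<>();
    int begin;
    int end;

    String indexedToString(int offset) {
        StringBuilder sb = new StringBuilder();
        begin = offset;
        for (IndexedLeaf child : children) {
            // Each child starts where the text built so far ends.
            sb.append(child.indexedToString(offset + sb.length()));
        }
        end = offset + sb.length();
        return sb.toString();
    }

    public static void main(String[] args) {
        IndexedContainer p = new IndexedContainer();
        p.children.add(new IndexedLeaf("Hello "));
        p.children.add(new IndexedLeaf("world"));
        p.indexedToString(100);
        IndexedLeaf second = p.children.get(1);
        System.out.println(second.begin + ".." + second.end); // 106..111
    }
}
```

Passing the running offset down the hierarchy means every level can record absolute positions without a second pass over the finished string.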

L

length() - Method in class de.citec.scie.pdf.PreTextLine
 
length(AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns the length of the line segment.
length() - Method in class de.citec.scie.pdf.structure.LineSegment
Returns the length of the line segment.
lengthHisto - Variable in class de.citec.scie.pdf.PreTextBlock
 
lines - Variable in class de.citec.scie.pdf.PreTextBlock
 
LineSegment - Class in de.citec.scie.pdf.structure
The LineSegment class implements the AbstractLineSegment interface and adds (static) utility functions that help to compare two lines.
LineSegment() - Constructor for class de.citec.scie.pdf.structure.LineSegment
Initializes the line segment as invalid, with begin being set to INF and end being set to -INF.
LineSegment(int, int) - Constructor for class de.citec.scie.pdf.structure.LineSegment
Initializes the line segment with the given begin and end.
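The default constructor above starts with begin at INF and end at -INF, which makes the segment invalid until it is given real bounds. A plausible reason for that choice, sketched here with a hypothetical SegmentSketch class, is that a segment initialized this way can be grown with min/max updates without special-casing the first position. Integer.MAX_VALUE and Integer.MIN_VALUE standing in for INF and -INF, and the include method, are assumptions.

```java
/** Hypothetical stand-in; not the library's LineSegment class. */
final class SegmentSketch {
    int begin;
    int end;

    /** Starts out invalid: begin is the largest int, end the smallest. */
    SegmentSketch() {
        this(Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    SegmentSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    /** Valid means begin is smaller than or equal to end. */
    boolean isValid() {
        return begin <= end;
    }

    /** Grows the segment just enough to contain the given position. */
    void include(int position) {
        begin = Math.min(begin, position);
        end = Math.max(end, position);
    }

    public static void main(String[] args) {
        SegmentSketch s = new SegmentSketch();
        System.out.println(s.isValid()); // false
        s.include(7);
        s.include(3);
        System.out.println(s.begin + ".." + s.end); // 3..7
    }
}
```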

M

MINIMUMBLOCKSIZE - Static variable in class de.citec.scie.pdf.PreTextBlock
 
MINIMUMPARSIZE - Static variable in class de.citec.scie.pdf.PDFStructuredTextExtractor
 

N

normalize(AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Normalizes the line segment by swapping begin and end if they are in the wrong order.
normalize() - Method in class de.citec.scie.pdf.structure.LineSegment
Swaps begin and end if the line is not valid.
normalizedBounds(AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns the boundaries of the normalized line as a two-element array; normalization means that begin and end are swapped if begin is larger than end.
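Normalization as described in this section amounts to putting the two bounds in ascending order. The sketch below shows that on plain ints; NormalizeSketch is a hypothetical helper, and the order of the returned array (begin first, end second) is an assumption.

```java
/** Hypothetical helper; not part of the library. */
final class NormalizeSketch {

    /** Returns {begin, end} ordered so that the first element is never larger than the second. */
    static int[] normalizedBounds(int begin, int end) {
        return begin <= end ? new int[] {begin, end} : new int[] {end, begin};
    }

    public static void main(String[] args) {
        int[] bounds = normalizedBounds(9, 4);
        System.out.println(bounds[0] + ".." + bounds[1]); // 4..9
    }
}
```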

O

overlaps(AbstractLineSegment, AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns true if the boundaries of the two line segments overlap.
overlaps(AbstractLineSegment) - Method in class de.citec.scie.pdf.structure.LineSegment
Returns true if the boundaries of the two line segments overlap.
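The overlaps entries above do not say whether segments that merely touch count as overlapping. The following sketch assumes closed intervals, so a shared end point counts, and it expects bounds that are already in ascending order; OverlapSketch is a hypothetical helper, not the library's code.

```java
/** Hypothetical helper; not part of the library. */
final class OverlapSketch {

    /** Closed-interval overlap test on bounds that are already in ascending order. */
    static boolean overlaps(int beginA, int endA, int beginB, int endB) {
        return beginA <= endB && beginB <= endA;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(0, 5, 5, 9)); // true, the segments share position 5
        System.out.println(overlaps(0, 4, 5, 9)); // false
    }
}
```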

P

Page - Class in de.citec.scie.pdf.structure
This represents one Page of a document, consisting of a (syntactically meaningful) sequence of TextBlock instances (e.g. columns in a two-column formatted Text).
Page() - Constructor for class de.citec.scie.pdf.structure.Page
 
Paragraph - Class in de.citec.scie.pdf.structure
This represents a paragraph of text that is defined as a sequence of Text objects that were syntactically grouped into a paragraph.
Paragraph() - Constructor for class de.citec.scie.pdf.structure.Paragraph
 
ParagraphEstimator - Class in de.citec.scie.pdf
This class estimates whether a line break also indicates a new paragraph.
ParagraphEstimator(PreTextBlock) - Constructor for class de.citec.scie.pdf.ParagraphEstimator
 
PDFStructuredTextExtractor - Class in de.citec.scie.pdf
This class takes a PDF file as input and extracts its text into an HTML-like hierarchical object structure (see the package "structure" for the classes themselves).
PDFStructuredTextExtractor() - Constructor for class de.citec.scie.pdf.PDFStructuredTextExtractor
 
PreTextBlock - Class in de.citec.scie.pdf
A PreTextBlock represents a ThreadBead with some additional information.
PreTextBlock() - Constructor for class de.citec.scie.pdf.PreTextBlock
 
PreTextLine - Class in de.citec.scie.pdf
This just aggregates all TextPosition objects that are part of one line.
PreTextLine() - Constructor for class de.citec.scie.pdf.PreTextLine
 

R

REMOVETHRESHOLD - Static variable in class de.citec.scie.pdf.DocumentBlockCleaner
 

S

setBegin(int) - Method in class de.citec.scie.pdf.structure.LineSegment
Sets the start position of the line.
setEnd(int) - Method in class de.citec.scie.pdf.structure.LineSegment
Sets the end position of the line.
setFontName(String) - Method in class de.citec.scie.pdf.structure.Text
Set the value of fontName
setFontSize(float) - Method in class de.citec.scie.pdf.structure.Text
Set the value of fontSize
setPageNumber(int) - Method in class de.citec.scie.pdf.structure.Page
Set the value of pageNumber
setRelativeFontSize(double) - Method in class de.citec.scie.pdf.structure.TextBlock
The font size of this TextBlock's content relative to the page-wide average.
setText(String) - Method in class de.citec.scie.pdf.structure.Text
Set the value of text
setVerticalAlignment(Text.VerticalAlignment) - Method in class de.citec.scie.pdf.structure.Text
Set the value of verticalAlignment
setX_End() - Method in class de.citec.scie.pdf.PreTextLine
 
SMALLBLOCKSIZE - Static variable in class de.citec.scie.pdf.DocumentBlockCleaner
 
split() - Method in class de.citec.scie.pdf.PreTextBlock
This is supposed to split a TextBlock representing a whole page into different blocks that might represent: columns in a two-column text, headings, footnotes, tables and figures, the document abstract, etc.
StringSimilarity - Class in de.citec.scie.pdf
This implements an algorithm to determine the similarity between Strings by utilizing an alignment/edit distance approach.
StringSimilarity() - Constructor for class de.citec.scie.pdf.StringSimilarity
 

T

Text - Class in de.citec.scie.pdf.structure
This is a wrapper class for text itself with additional information about the style of the text.
Text() - Constructor for class de.citec.scie.pdf.structure.Text
 
Text.VerticalAlignment - Enum in de.citec.scie.pdf.structure
 
TextBlock - Class in de.citec.scie.pdf.structure
This represents a syntactic block of Text, which can be a column on a page, a header or something similar.
TextBlock() - Constructor for class de.citec.scie.pdf.structure.TextBlock
 
TextBlockRankEstimator - Class in de.citec.scie.pdf
This estimator determines whether the usual font size of a TextBlock is larger than, equal to, or smaller than the usual font size of the whole page.
TextBlockRankEstimator() - Constructor for class de.citec.scie.pdf.TextBlockRankEstimator
 
toString() - Method in class de.citec.scie.pdf.structure.Document
Converts this object to a string by going recursively through the underlying page structure and calling their respective toString methods.
toString() - Method in class de.citec.scie.pdf.structure.Page
Converts this object to a string by going recursively through the underlying block structure and calling their respective toString methods.
toString() - Method in class de.citec.scie.pdf.structure.Paragraph
Converts this object to a string by going recursively through the underlying text objects and calling their respective toString methods.
toString() - Method in class de.citec.scie.pdf.structure.Text
Returns the text content of this Text object.
toString() - Method in class de.citec.scie.pdf.structure.TextBlock
Converts this object to a string by going recursively through the underlying paragraph structure and calling their respective toString methods.
toXML() - Method in class de.citec.scie.pdf.structure.Document
Returns an XML representation of this document by going recursively through the underlying page structure and calling their respective toXML methods.
toXML() - Method in class de.citec.scie.pdf.structure.Page
Returns an XML representation of this page by going recursively through the underlying block structure and calling their respective toXML methods.
toXML() - Method in class de.citec.scie.pdf.structure.Paragraph
Returns an XML representation of this paragraph by going recursively through the underlying text objects and calling their respective toXML methods.
toXML() - Method in class de.citec.scie.pdf.structure.Text
Returns an XML representation of this text object including its font size, font name and vertical alignment as XML attributes.
toXML() - Method in class de.citec.scie.pdf.structure.TextBlock
Returns an XML representation of this block by going recursively through the underlying paragraph structure and calling their respective toXML methods.

U

union(AbstractLineSegment, AbstractLineSegment) - Static method in class de.citec.scie.pdf.structure.LineSegment
Returns a new line which is the union of the two lines.
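For one-dimensional segments, union and its counterpart intersection reduce to min/max operations on the bounds. The sketch below works on plain ints with hypothetical names (SegmentOps); it assumes that the union of two disjoint segments also covers the gap between them, and that an empty intersection is reported as an invalid pair with begin larger than end.

```java
/** Hypothetical helper; not part of the library. */
final class SegmentOps {

    /** Smallest segment covering both inputs, as {begin, end}. */
    static int[] union(int beginA, int endA, int beginB, int endB) {
        return new int[] {Math.min(beginA, beginB), Math.max(endA, endB)};
    }

    /** Common part of both inputs, as {begin, end}; begin > end signals no overlap. */
    static int[] intersection(int beginA, int endA, int beginB, int endB) {
        return new int[] {Math.max(beginA, beginB), Math.min(endA, endB)};
    }

    public static void main(String[] args) {
        int[] u = union(0, 5, 3, 9);
        int[] i = intersection(0, 5, 3, 9);
        System.out.println(u[0] + ".." + u[1]); // 0..9
        System.out.println(i[0] + ".." + i[1]); // 3..5
    }
}
```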

V

valueOf(String) - Static method in enum de.citec.scie.pdf.structure.Text.VerticalAlignment
Returns the enum constant of this type with the specified name.
values() - Static method in enum de.citec.scie.pdf.structure.Text.VerticalAlignment
Returns an array containing the constants of this enum type, in the order they are declared.
VerticalAlignmentEstimator - Class in de.citec.scie.pdf
This just determines the vertical alignment of a given glyph in relation to the line it is part of.
VerticalAlignmentEstimator(PreTextLine) - Constructor for class de.citec.scie.pdf.VerticalAlignmentEstimator
 

W

WhiteSpaceEstimator - Class in de.citec.scie.pdf
This is based on the work of Ben Litchfield in the PDFTextStripper of Apache PDFBox.
WhiteSpaceEstimator() - Constructor for class de.citec.scie.pdf.WhiteSpaceEstimator
 

Y

yDistHisto - Variable in class de.citec.scie.pdf.PreTextBlock
 
yHisto - Variable in class de.citec.scie.pdf.PreTextLine
 

Copyright © 2014. All rights reserved.