public class Cleaner extends Object
| Constructor | Description |
|---|---|
| Cleaner() | Creates a Cleaner object with default options. |
| Cleaner(Parameters params) | Creates a Cleaner object with a given set of options. |

| Modifier and Type | Method | Description |
|---|---|---|
| protected void | checkCharacters(ITextUnit tu, Segment seg, LocaleId targetLocale) | Wrapper for removing character corruption and detecting unexpected characters. |
| protected void | markSegmentForRemoval(ITextUnit tu, Segment seg, LocaleId targetLocale) | Effectively marks the segment for removal by emptying the content for the given target. |
| protected void | matchRegexExpressions(ITextUnit tu, Segment seg, LocaleId targetLocale) | Marks for removal any segment whose text matches one of the given regular expressions. |
| protected void | normalizeMarks(ITextUnit tu, Segment seg, LocaleId targetLocale) | |
| protected void | normalizePunctuation(TextFragment srcFrag, TextFragment trgFrag) | Attempts to make punctuation and the spacing around it consistent according to standard English punctuation rules. |
| protected void | normalizeQuotation(ITextUnit tu, Segment seg, LocaleId targetLocale) | Converts all quotation marks (curly or language-specific) to straight quotes. |
| protected void | normalizeWhitespace(ITextUnit tu, Segment seg, LocaleId targetLocale) | Converts whitespace ({tab}, {space}, {CR}, {LF}) to a single space. |
| protected boolean | pruneTextUnit(ITextUnit tu, LocaleId targetLocale) | Removes from the text unit the segments marked as not containing useful information. |
| boolean | run(ITextUnit tu, LocaleId targetLocale) | Cleans the text unit according to the user-selected options. |

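
The behavior described for normalizeWhitespace and normalizeQuotation can be illustrated with a minimal, JDK-only sketch. This is not the Cleaner implementation: the class NormalizationSketch and its two helper methods are hypothetical, they operate on plain strings rather than on Okapi segments, and they assume that runs of whitespace collapse into one space and that only a small, illustrative set of quotation marks is converted.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Cleaner class.
public class NormalizationSketch {

    // One or more of: tab, space, CR, LF.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[\\t \\r\\n]+");

    // Assumption: each run of whitespace becomes exactly one space.
    public static String collapseWhitespace(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    // Assumption: only these curly and angle quotation marks are handled.
    public static String straightenQuotes(String text) {
        return text
            // Double curly quotes, low double quote, guillemets -> "
            .replaceAll("[\u201C\u201D\u201E\u00AB\u00BB]", "\"")
            // Single curly quotes, low single quote -> '
            .replaceAll("[\u2018\u2019\u201A]", "'");
    }

    public static void main(String[] args) {
        // prints: Hello, world
        System.out.println(collapseWhitespace("Hello,\t\r\n  world"));
        // prints: "Bonjour" l'eau
        System.out.println(straightenQuotes("\u201CBonjour\u201D l\u2019eau"));
    }
}
```

The sketch leaves leading and trailing whitespace in place (it is reduced to a single space, not trimmed), because the description above does not say whether trimming occurs.
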
public Cleaner()

public Cleaner(Parameters params)
params - the options to assign to this object (use null for the defaults).

public boolean run(ITextUnit tu, LocaleId targetLocale)
tu - the unit containing the text to clean
targetLocale -

protected void normalizeWhitespace(ITextUnit tu, Segment seg, LocaleId targetLocale)
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated

protected void normalizeQuotation(ITextUnit tu, Segment seg, LocaleId targetLocale)
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated

protected void normalizePunctuation(TextFragment srcFrag, TextFragment trgFrag)
srcFrag - original text to be normalized
trgFrag - target text to be normalized

protected void markSegmentForRemoval(ITextUnit tu, Segment seg, LocaleId targetLocale)
See also: pruneTextUnit(ITextUnit, LocaleId)
tu - the text unit containing the content
seg - the segment to be marked for removal
targetLocale - the locale for which the segment should be removed

protected void matchRegexExpressions(ITextUnit tu, Segment seg, LocaleId targetLocale)
tu - the text unit containing the segments to be matched
seg - the segment to analyze
targetLocale - the locale

protected boolean pruneTextUnit(ITextUnit tu, LocaleId targetLocale)
tu - text unit to be pruned of unwanted segments
targetLocale - locale of target through which to search

protected void checkCharacters(ITextUnit tu, Segment seg, LocaleId targetLocale)
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated

Copyright © 2022. All rights reserved.