Package net.sf.okapi.steps.cleanup
Class Cleaner
java.lang.Object
    net.sf.okapi.steps.cleanup.Cleaner
public class Cleaner extends Object
-
Constructor Summary
Constructor                   Description
Cleaner()                     Creates a Cleaner object with default options.
Cleaner(Parameters params)    Creates a Cleaner object with a given set of options.
-
Method Summary
Modifier and Type    Method
                     Description
protected void       checkCharacters(ITextUnit tu, Segment seg, LocaleId targetLocale)
                     Wrapper for removing character corruption and detecting unexpected characters.
protected void       markSegmentForRemoval(ITextUnit tu, Segment seg, LocaleId targetLocale)
                     Effectively marks the segment for removal by emptying the content for the given target.
protected void       matchRegexExpressions(ITextUnit tu, Segment seg, LocaleId targetLocale)
                     Marks segments for removal that contain text which match given regular expressions.
protected void       normalizeMarks(ITextUnit tu, Segment seg, LocaleId targetLocale)
protected void       normalizePunctuation(TextFragment srcFrag, TextFragment trgFrag)
                     Attempts to make punctuation and spacing around punctuation consistent according to standard English punctuation rules.
protected void       normalizeQuotation(ITextUnit tu, Segment seg, LocaleId targetLocale)
                     Converts all quotation marks (curly or language specific) to straight quotes.
protected void       normalizeWhitespace(ITextUnit tu, Segment seg, LocaleId targetLocale)
                     Converts whitespace ({tab}, {space}, {CR}, {LF}) to single space.
protected boolean    pruneTextUnit(ITextUnit tu, LocaleId targetLocale)
                     Removes segments from the text unit marked as not containing useful information.
boolean              run(ITextUnit tu, LocaleId targetLocale)
                     Performs the cleaning of the text unit according to user selected options.
-
Constructor Detail
-
Cleaner
public Cleaner()
Creates a Cleaner object with default options.
-
Cleaner
public Cleaner(Parameters params)
Creates a Cleaner object with a given set of options.
Parameters:
params - the options to assign to this object (use null for the defaults).
-
Method Detail
-
run
public boolean run(ITextUnit tu, LocaleId targetLocale)
Performs the cleaning of the text unit according to user-selected options.
Parameters:
tu - the unit containing the text to clean
targetLocale -
Returns:
true if tu should be discarded and false otherwise
-
normalizeWhitespace
protected void normalizeWhitespace(ITextUnit tu, Segment seg, LocaleId targetLocale)
Converts whitespace ({tab}, {space}, {CR}, {LF}) to a single space.
Parameters:
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated
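The effect described above can be illustrated on a plain string. The following is only a minimal sketch, not the Okapi implementation: WhitespaceSketch and collapse are hypothetical names, and it assumes that a run of consecutive whitespace characters becomes one space and that leading or trailing whitespace is not trimmed.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone helper for illustration; not part of the Okapi API.
class WhitespaceSketch {
    // One or more of tab, space, CR, or LF (assumption: runs collapse to one space).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[\\t \\r\\n]+");

    // Returns the text with every whitespace run replaced by a single space.
    static String collapse(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Hello,\t\tworld\r\n  again")); // prints "Hello, world again"
    }
}
```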
-
normalizeQuotation
protected void normalizeQuotation(ITextUnit tu, Segment seg, LocaleId targetLocale)
Converts all quotation marks (curly or language-specific) to straight quotes. All apostrophes are also converted to their straight equivalents.
Parameters:
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated
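As a rough illustration of this kind of conversion, the sketch below maps a handful of curly and language-specific marks to their straight equivalents. QuotationSketch and straighten are hypothetical names, and the set of characters handled is an assumption; the characters the Cleaner actually converts may differ.

```java
// Hypothetical stand-alone helper for illustration; not part of the Okapi API.
class QuotationSketch {
    // Replaces an assumed set of double and single quotation marks with straight ones.
    static String straighten(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\u201C': // left double quotation mark (U+201C)
                case '\u201D': // right double quotation mark (U+201D)
                case '\u201E': // low double quotation mark (U+201E)
                case '\u00AB': // left guillemet (U+00AB)
                case '\u00BB': // right guillemet (U+00BB)
                    out.append('"');
                    break;
                case '\u2018': // left single quotation mark (U+2018)
                case '\u2019': // right single quotation mark or apostrophe (U+2019)
                case '\u201A': // low single quotation mark (U+201A)
                    out.append('\'');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(straighten("\u201CIt\u2019s fine\u201D")); // prints "It's fine" in straight double quotes
    }
}
```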
-
normalizePunctuation
protected void normalizePunctuation(TextFragment srcFrag, TextFragment trgFrag)
Attempts to make punctuation and spacing around punctuation consistent according to standard English punctuation rules.
Assumptions:
1) all strings passed have consistent spacing (only single spaces)
2) quotes have been normalized
3) strings will need post-processing in order to correct spacing for languages such as French
This ignores locale and Asian punctuation.
Parameters:
srcFrag - original text to be normalized
trgFrag - target text to be normalized
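One English spacing rule of this kind is that no space precedes a comma, period, or similar mark. The sketch below applies only that single rule to a plain string, under the stated assumption that the input already uses single spaces. PunctuationSketch and tighten are hypothetical names; the rules the Cleaner actually applies to the source and target fragments are not documented here and may be broader.

```java
// Hypothetical stand-alone helper for illustration; not part of the Okapi API.
class PunctuationSketch {
    // Removes a single space that directly precedes , ; : . ! or ?
    // (assumes spacing has already been normalized to single spaces).
    static String tighten(String text) {
        return text.replaceAll(" ([,;:.!?])", "$1");
    }

    public static void main(String[] args) {
        System.out.println(tighten("Hello , world !")); // prints "Hello, world!"
    }
}
```

A result like "Bonjour !" becoming "Bonjour!" shows why the third assumption matters: French output would need its spacing restored afterwards.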
-
markSegmentForRemoval
protected void markSegmentForRemoval(ITextUnit tu, Segment seg, LocaleId targetLocale)
Effectively marks the segment for removal by emptying the content for the given target. The text unit is pruned later by a separate method, pruneTextUnit(ITextUnit, LocaleId).
Parameters:
tu - the text unit containing the content
seg - the segment to be marked for removal
targetLocale - the locale for which the segment should be removed
-
matchRegexExpressions
protected void matchRegexExpressions(ITextUnit tu, Segment seg, LocaleId targetLocale)
Marks for removal any segment whose text matches one of the user-specified regular expressions.
Parameters:
tu - the text unit containing the segments to be matched
seg - the segment to analyze
targetLocale - the locale
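The mark-by-emptying idea can be sketched with java.util.regex on plain strings. RegexMarkSketch and markIfMatched are hypothetical names, and the sketch assumes that a partial match (find) is enough to mark a segment; whether the Cleaner requires a partial or a full match is not stated here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper for illustration; not part of the Okapi API.
class RegexMarkSketch {
    // Returns an empty string (the "marked for removal" state) when any
    // pattern is found in the target text; otherwise returns the text unchanged.
    static String markIfMatched(String target, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(target).find()) { // assumption: partial match suffices
                return "";
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // Example pattern: segments consisting only of digits.
        List<Pattern> patterns = Arrays.asList(Pattern.compile("^\\d+$"));
        System.out.println("[" + markIfMatched("12345", patterns) + "]");   // prints "[]"
        System.out.println("[" + markIfMatched("Bonjour", patterns) + "]"); // prints "[Bonjour]"
    }
}
```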
-
pruneTextUnit
protected boolean pruneTextUnit(ITextUnit tu, LocaleId targetLocale)
Removes segments from the text unit marked as not containing useful information.
Parameters:
tu - text unit to be pruned of unwanted segments
targetLocale - locale of target through which to search
Returns:
true if the entire text unit is to be discarded; false if the text unit contains good translated text
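The relationship between marking and pruning can be sketched on a simple list of target strings, where an empty string stands for a segment that was marked for removal. PruneSketch and prune are hypothetical names, and the sketch models target segments only; how the Cleaner handles the corresponding source segments is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone helper for illustration; not part of the Okapi API.
class PruneSketch {
    // Removes emptied (marked) segments in place and returns true when
    // nothing is left, meaning the whole unit should be discarded.
    static boolean prune(List<String> targetSegments) {
        targetSegments.removeIf(String::isEmpty);
        return targetSegments.isEmpty();
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>(Arrays.asList("Bonjour", "", "Au revoir"));
        System.out.println(prune(segments)); // prints "false"
        System.out.println(segments);        // prints "[Bonjour, Au revoir]"
    }
}
```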
-
checkCharacters
protected void checkCharacters(ITextUnit tu, Segment seg, LocaleId targetLocale)
Wrapper for removing character corruption and detecting unexpected characters.
Parameters:
tu - the TextUnit containing the segments to update
seg - the Segment to update
targetLocale - the language for which the text should be updated
-