Class Cleaner


  • public class Cleaner
    extends Object
    • Constructor Detail

      • Cleaner

        public Cleaner()
        Creates a Cleaner object with default options.
      • Cleaner

        public Cleaner(Parameters params)
        Creates a Cleaner object with a given set of options.
        Parameters:
        params - the options to assign to this object (use null for the defaults).
    • Method Detail

      • run

        public boolean run(ITextUnit tu,
                           LocaleId targetLocale)
        Cleans the text unit according to the user-selected options.
        Parameters:
        tu - the unit containing the text to clean
        targetLocale - the target locale whose content should be cleaned
        Returns:
        true if tu should be discarded and false otherwise
      • normalizeWhitespace

        protected void normalizeWhitespace(ITextUnit tu,
                                           Segment seg,
                                           LocaleId targetLocale)
        Converts whitespace (tab, space, CR, LF) to a single space.
        Parameters:
        tu - the TextUnit containing the segments to update
        seg - the Segment to update
        targetLocale - the language for which the text should be updated
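
        The conversion described above can be illustrated on a plain string. This is a minimal sketch, not the method's actual implementation: the class name WhitespaceSketch and the method collapse are hypothetical, and it assumes that a run of consecutive whitespace characters becomes one space.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Cleaner class.
final class WhitespaceSketch {
    // One or more of the four characters named above: tab, space, CR, LF.
    private static final Pattern WS_RUN = Pattern.compile("[\\t \\r\\n]+");

    /** Replaces each run of tab/space/CR/LF with one space (assumed behavior). */
    static String collapse(String text) {
        return WS_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Hello,\t\tworld\r\nagain")); // Hello, world again
    }
}
```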
      • normalizeQuotation

        protected void normalizeQuotation(ITextUnit tu,
                                          Segment seg,
                                          LocaleId targetLocale)
        Converts all quotation marks (curly or language-specific) to straight quotes. Apostrophes are also converted to their straight equivalents.
        Parameters:
        tu - the TextUnit containing the segments to update
        seg - the Segment to update
        targetLocale - the language for which the text should be updated
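
        As a rough illustration of this kind of normalization, the sketch below maps a few common quotation characters to their straight equivalents in a plain string. The class QuoteSketch is hypothetical, and the set of characters handled here is an assumption; the characters actually covered by normalizeQuotation are not listed on this page.

```java
// Hypothetical illustration only; not part of the Cleaner class.
final class QuoteSketch {
    /** Maps an assumed set of curly and language-specific quotes to straight quotes. */
    static String straighten(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                // Double quotes: curly pair, low-9 quote, and guillemets.
                case '\u201C': case '\u201D': case '\u201E': case '\u00AB': case '\u00BB':
                    sb.append('"');
                    break;
                // Single quotes and apostrophes: curly pair, low-9 quote, single guillemets.
                case '\u2018': case '\u2019': case '\u201A': case '\u2039': case '\u203A':
                    sb.append('\'');
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```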
      • normalizePunctuation

        protected void normalizePunctuation(TextFragment srcFrag,
                                            TextFragment trgFrag)
        Attempts to make punctuation, and the spacing around it, consistent with standard English punctuation rules. Assumptions: 1) all strings passed in have consistent spacing (single spaces only); 2) quotes have already been normalized; 3) strings will need post-processing to correct spacing for languages such as French. This method ignores the locale and Asian punctuation.
        Parameters:
        srcFrag - original text to be normalized
        trgFrag - target text to be normalized
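
        One way to picture English-style spacing rules is the sketch below, which works on a plain string. The class PunctuationSketch and its two rules are assumptions chosen for illustration (drop spaces before a punctuation mark, add a space after a mark that is directly followed by a letter); the rules actually applied by normalizePunctuation are not spelled out on this page.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Cleaner class.
final class PunctuationSketch {
    // Spaces directly before a punctuation mark.
    private static final Pattern SPACE_BEFORE = Pattern.compile(" +([,.;:!?])");
    // A mark directly followed by a letter. The period is left out on purpose
    // so that abbreviations and decimals are not split.
    private static final Pattern NO_SPACE_AFTER = Pattern.compile("([,;:!?])(\\p{L})");

    static String tidy(String text) {
        String s = SPACE_BEFORE.matcher(text).replaceAll("$1");
        return NO_SPACE_AFTER.matcher(s).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(tidy("Hello , world !Bye")); // Hello, world! Bye
    }
}
```

        Removing the space before "?" or "!" is wrong for French, which is why assumption 3 above calls for locale-specific post-processing.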
      • markSegmentForRemoval

        protected void markSegmentForRemoval(ITextUnit tu,
                                             Segment seg,
                                             LocaleId targetLocale)
        Effectively marks the segment for removal by emptying the content for the given target. The text unit itself is pruned later by a separate method (pruneTextUnit(ITextUnit, LocaleId)).
        Parameters:
        tu - the text unit containing the content
        seg - the segment to be marked for removal
        targetLocale - the locale for which the segment should be removed
      • matchRegexExpressions

        protected void matchRegexExpressions(ITextUnit tu,
                                             Segment seg,
                                             LocaleId targetLocale)
        Marks for removal any segment whose text matches one of the user-specified regular expressions.
        Parameters:
        tu - the text unit containing the segments to be matched
        seg - the segment to analyze
        targetLocale - the locale
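
        The marking step can be sketched on plain strings as follows. The class RegexMarkSketch is hypothetical; it assumes that a partial match (find) is enough to mark a segment, and it models "marked for removal" as an emptied target string, in line with the description of markSegmentForRemoval.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Cleaner class.
final class RegexMarkSketch {
    /** Returns "" (marked for removal) if any pattern is found in the target text. */
    static String markIfMatched(String target, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(target).find()) {
                return ""; // emptied content stands for "marked for removal"
            }
        }
        return target; // no pattern matched: keep the text as-is
    }

    public static void main(String[] args) {
        // Example patterns are made up: digits-only text and placeholder text.
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("^\\d+$"),
                Pattern.compile("(?i)lorem ipsum"));
        System.out.println(markIfMatched("12345", patterns).isEmpty()); // true
        System.out.println(markIfMatched("Real text", patterns));       // Real text
    }
}
```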
      • pruneTextUnit

        protected boolean pruneTextUnit(ITextUnit tu,
                                        LocaleId targetLocale)
        Removes segments from the text unit marked as not containing useful information.
        Parameters:
        tu - text unit to be pruned of unwanted segments
        targetLocale - locale of target through which to search
        Returns:
        true if the entire text unit is to be discarded; false if the text unit contains good translated text
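
        The return contract can be illustrated with a list of target segment strings. This is a sketch under assumptions, not the actual implementation: the class PruneSketch is hypothetical, and it treats an empty string as a segment that was marked for removal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Cleaner class.
final class PruneSketch {
    /**
     * Removes emptied (marked) segments from a mutable list.
     * Returns true if nothing useful is left, i.e. the whole unit can be discarded.
     */
    static boolean prune(List<String> targetSegments) {
        targetSegments.removeIf(String::isEmpty);
        return targetSegments.isEmpty();
    }

    public static void main(String[] args) {
        List<String> segs = new ArrayList<>(Arrays.asList("Bonjour", "", "Merci"));
        System.out.println(prune(segs) + " " + segs); // false [Bonjour, Merci]
    }
}
```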
      • checkCharacters

        protected void checkCharacters(ITextUnit tu,
                                       Segment seg,
                                       LocaleId targetLocale)
        Wrapper for removing character corruption and detecting unexpected characters.
        Parameters:
        tu - the TextUnit containing the segments to update
        seg - the Segment to update
        targetLocale - the language for which the text should be updated
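
        This page does not say which characters count as corrupted or unexpected. The sketch below shows one plausible reading on plain strings, purely as an assumption: the hypothetical class CharCheckSketch strips the Unicode replacement character (U+FFFD) and flags control characters other than tab, CR, and LF.

```java
// Hypothetical illustration only; not part of the Cleaner class.
final class CharCheckSketch {
    /** Removes U+FFFD, which decoders commonly emit for undecodable bytes. */
    static String stripReplacementChars(String text) {
        return text.replace("\uFFFD", "");
    }

    /** Reports whether the text holds a control character other than tab, LF, or CR. */
    static boolean hasUnexpectedControlChars(String text) {
        return text.codePoints().anyMatch(cp ->
                Character.isISOControl(cp) && cp != '\t' && cp != '\n' && cp != '\r');
    }
}
```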