Package net.sf.okapi.lib.segmentation
Class SRXSegmenter
- java.lang.Object
-
- net.sf.okapi.lib.segmentation.SRXSegmenter
-
- All Implemented Interfaces:
ISegmenter
public class SRXSegmenter extends Object implements ISegmenter
Implements the ISegmenter interface for SRX rules.
-
-
Constructor Summary
Constructor - Description
SRXSegmenter() - Creates a new SRXSegmenter object.
-
Method Summary
Modifier and Type, Method - Description
protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule) - Adds a compiled rule to this segmenter.
boolean cascade() - Indicates if cascading must be applied when selecting the rules for a given language pattern.
int computeSegments(String text)
int computeSegments(TextContainer container)
LocaleId getLanguage()
Range getNextSegmentRange(TextContainer container)
List<Range> getRanges()
List<Integer> getSplitPositions()
boolean includeEndCodes()
boolean includeIsolatedCodes()
boolean includeStartCodes()
boolean oneSegmentIncludesAll()
void reset()
boolean segmentSubFlows()
protected void setCascade(boolean value) - Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
void setIncludeEndCodes(boolean includeEndCodes)
void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
void setIncludeStartCodes(boolean includeStartCodes)
void setLanguage(LocaleId languageCode)
protected void setMaskRule(String pattern) - Sets the pattern for the mask rule.
void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace) - Sets the options for this segmenter.
void setSegmentSubFlows(boolean segmentSubFlows)
void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
void setTrimCodes(boolean trimCodes)
void setTrimLeadingWS(boolean trimLeadingWS)
void setTrimTrailingWS(boolean trimTrailingWS)
void setUseJavaRegex(boolean useJavaRegex) - Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
boolean treatIsolatedCodesAsWhitespace()
boolean trimLeadingWhitespaces()
boolean trimTrailingWhitespaces()
boolean useJavaRegex() - Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
-
-
-
Method Detail
-
reset
public void reset()
- Specified by:
reset in interface ISegmenter
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace)
Sets the options for this segmenter.
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
useJavaRegex - true if the rules are for the Java regular expression engine, false if they are for ICU.
treatIsolatedCodesAsWhitespace - if true, the isolated code markers in codedText are converted to spaces so that they do not get in the way of the rules; if false, the codes are simply removed.
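
The effect of trimLeadingWS and trimTrailingWS can be pictured on a single segment range. The following is a minimal sketch, not the SRXSegmenter implementation: the class TrimSketch, its trim method, and the {start, end} array representation are hypothetical names chosen for illustration, and whitespace is assumed to mean whatever Character.isWhitespace accepts.

```java
// Sketch only: illustrates the trimming described for trimLeadingWS and
// trimTrailingWS on one segment range. Names here are hypothetical and
// are not part of the Okapi API.
class TrimSketch {

    /** Returns {start, end} (end exclusive) after the requested trimming. */
    static int[] trim(String text, int start, int end,
                      boolean trimLeadingWS, boolean trimTrailingWS) {
        if (trimLeadingWS) {
            // Move the start forward past leading whitespace.
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailingWS) {
            // Move the end backward past trailing whitespace.
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "First one.  Second one. ";
        // Assume a break was found at offset 10, so the raw second segment
        // runs from offset 10 to the end of the text.
        int[] r = trim(text, 10, text.length(), true, true);
        System.out.println("[" + text.substring(r[0], r[1]) + "]");
    }
}
```

With both flags false, the range is returned unchanged, which corresponds to keeping the white-spaces in the segment.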
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
- Specified by:
setOptions in interface ISegmenter
-
oneSegmentIncludesAll
public boolean oneSegmentIncludesAll()
- Specified by:
oneSegmentIncludesAll in interface ISegmenter
-
segmentSubFlows
public boolean segmentSubFlows()
- Specified by:
segmentSubFlows in interface ISegmenter
-
cascade
public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.
- Returns:
true if cascading must be applied, false otherwise.
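
The rule selection that this flag controls can be sketched as follows. This is a minimal illustration, assuming the usual SRX convention that each language map pairs a regular expression over language codes with a named rule set, that cascading collects every matching map in order, and that without cascading only the first match is used. The class CascadeSketch, the LanguageMap type, and selectRules are hypothetical names, not part of this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: hypothetical model of language-map selection with and
// without cascading. Not taken from the SRXSegmenter source.
class CascadeSketch {

    /** One map entry: a pattern over language codes and the rule set it names. */
    static final class LanguageMap {
        final Pattern languagePattern;
        final String ruleName;

        LanguageMap(String languagePattern, String ruleName) {
            this.languagePattern = Pattern.compile(languagePattern);
            this.ruleName = ruleName;
        }
    }

    /** Returns the names of the rule sets selected for the language code. */
    static List<String> selectRules(List<LanguageMap> maps, String languageCode,
                                    boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LanguageMap map : maps) {
            if (map.languagePattern.matcher(languageCode).matches()) {
                selected.add(map.ruleName);
                if (!cascade) {
                    break; // without cascading, stop at the first match
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LanguageMap> maps = Arrays.asList(
            new LanguageMap("[Ee][Nn].*", "English"),
            new LanguageMap(".*", "Default"));
        System.out.println(selectRules(maps, "en-US", true));
        System.out.println(selectRules(maps, "en-US", false));
    }
}
```

The order of the map entries matters in both modes: with cascading it fixes the order in which rule sets are combined, and without it the first matching entry wins.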
-
trimLeadingWhitespaces
public boolean trimLeadingWhitespaces()
- Specified by:
trimLeadingWhitespaces in interface ISegmenter
-
trimTrailingWhitespaces
public boolean trimTrailingWhitespaces()
- Specified by:
trimTrailingWhitespaces in interface ISegmenter
-
useJavaRegex
public boolean useJavaRegex()
Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Returns:
true if the rules are for the Java regular expression engine, false if they are for ICU.
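
To make "rules defined for the Java regular expression engine" concrete, here is a minimal sketch of how one SRX-style break rule could be evaluated with java.util.regex. It assumes a rule consists of a before-break pattern and an after-break pattern, and it ignores no-break (exception) rules entirely. The class BreakRuleSketch and the method findBreaks are hypothetical; this is not how SRXSegmenter is necessarily implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: evaluates a single break rule with the Java regex engine.
// Names are hypothetical and exception rules are not handled.
class BreakRuleSketch {

    /**
     * Returns the offsets at which the text before the offset matches
     * beforeBreak and the text after it matches afterBreak.
     */
    static List<Integer> findBreaks(String text, String beforeBreak, String afterBreak) {
        // The after-break part is a lookahead so it is tested but not consumed.
        Pattern rule = Pattern.compile("(?:" + beforeBreak + ")(?=" + afterBreak + ")");
        Matcher m = rule.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            positions.add(m.end()); // the break sits right after the before-break match
        }
        return positions;
    }

    public static void main(String[] args) {
        // Break after sentence-ending punctuation when whitespace follows.
        System.out.println(findBreaks("Hello. World! Bye.", "[.!?]", "\\s"));
    }
}
```

Patterns written for ICU may use constructs that java.util.regex does not accept or interprets differently, which is presumably why the flag records which engine the rules target.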
-
treatIsolatedCodesAsWhitespace
public boolean treatIsolatedCodesAsWhitespace()
- Specified by:
treatIsolatedCodesAsWhitespace in interface ISegmenter
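
The setOptions description states that, when this option is true, isolated code markers in the coded text are converted to spaces before the rules are applied, and are removed otherwise. The sketch below shows why that choice can change the result. It is an illustration under stated assumptions: the single placeholder character MARKER, the class IsolatedCodeSketch, and the hard-coded rule (break after punctuation when whitespace follows) are all hypothetical and do not reflect the actual marker encoding.

```java
import java.util.regex.Pattern;

// Sketch only: shows how replacing or removing an isolated code marker
// can affect whether a whitespace-dependent rule matches.
class IsolatedCodeSketch {

    // Hypothetical one-character placeholder for an isolated code.
    static final char MARKER = '\uE103';

    /** Prepares coded text for rule matching according to the option. */
    static String prepare(String codedText, boolean treatAsWhitespace) {
        return treatAsWhitespace
            ? codedText.replace(MARKER, ' ')                 // marker becomes a space
            : codedText.replace(String.valueOf(MARKER), ""); // marker is dropped
    }

    /** True if punctuation followed by whitespace occurs in the prepared text. */
    static boolean hasBreak(String prepared) {
        return Pattern.compile("[.!?](?=\\s)").matcher(prepared).find();
    }

    public static void main(String[] args) {
        String coded = "End." + MARKER + "Next";
        System.out.println(hasBreak(prepare(coded, true)));
        System.out.println(hasBreak(prepare(coded, false)));
    }
}
```

In this example the rule only finds a break when the marker is treated as whitespace, because removing it leaves the two sentences directly adjacent.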
-
setUseJavaRegex
public void setUseJavaRegex(boolean useJavaRegex)
Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Parameters:
useJavaRegex - true if the rules should be treated as Java regular expressions, false for ICU.
-
includeStartCodes
public boolean includeStartCodes()
- Specified by:
includeStartCodes in interface ISegmenter
-
includeEndCodes
public boolean includeEndCodes()
- Specified by:
includeEndCodes in interface ISegmenter
-
includeIsolatedCodes
public boolean includeIsolatedCodes()
- Specified by:
includeIsolatedCodes in interface ISegmenter
-
computeSegments
public int computeSegments(String text)
- Specified by:
computeSegments in interface ISegmenter
-
computeSegments
public int computeSegments(TextContainer container)
- Specified by:
computeSegments in interface ISegmenter
-
getNextSegmentRange
public Range getNextSegmentRange(TextContainer container)
- Specified by:
getNextSegmentRange in interface ISegmenter
-
getSplitPositions
public List<Integer> getSplitPositions()
- Specified by:
getSplitPositions in interface ISegmenter
-
getRanges
public List<Range> getRanges()
- Specified by:
getRanges in interface ISegmenter
-
getLanguage
public LocaleId getLanguage()
- Specified by:
getLanguage in interface ISegmenter
-
setLanguage
public void setLanguage(LocaleId languageCode)
- Specified by:
setLanguage in interface ISegmenter
-
setCascade
protected void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
- Parameters:
value - true if cascading must be applied, false otherwise.
-
addRule
protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
Adds a compiled rule to this segmenter.
- Parameters:
compiledRule - the compiled rule to add.
-
setMaskRule
protected void setMaskRule(String pattern)
Sets the pattern for the mask rule.
- Parameters:
pattern - the new pattern to use for the mask rule.
-
setSegmentSubFlows
public void setSegmentSubFlows(boolean segmentSubFlows)
- Specified by:
setSegmentSubFlows in interface ISegmenter
-
setIncludeStartCodes
public void setIncludeStartCodes(boolean includeStartCodes)
- Specified by:
setIncludeStartCodes in interface ISegmenter
-
setIncludeEndCodes
public void setIncludeEndCodes(boolean includeEndCodes)
- Specified by:
setIncludeEndCodes in interface ISegmenter
-
setIncludeIsolatedCodes
public void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
- Specified by:
setIncludeIsolatedCodes in interface ISegmenter
-
setOneSegmentIncludesAll
public void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
- Specified by:
setOneSegmentIncludesAll in interface ISegmenter
-
setTrimLeadingWS
public void setTrimLeadingWS(boolean trimLeadingWS)
- Specified by:
setTrimLeadingWS in interface ISegmenter
-
setTrimTrailingWS
public void setTrimTrailingWS(boolean trimTrailingWS)
- Specified by:
setTrimTrailingWS in interface ISegmenter
-
setTrimCodes
public void setTrimCodes(boolean trimCodes)
- Specified by:
setTrimCodes in interface ISegmenter
-
setTreatIsolatedCodesAsWhitespace
public void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
- Specified by:
setTreatIsolatedCodesAsWhitespace in interface ISegmenter
-
-