net.sf.mmm.util.text.base
Class HyphenationPattern

java.lang.Object
  extended by net.sf.mmm.util.text.base.HyphenationPattern

public class HyphenationPattern
extends Object

A HyphenationPattern is a pattern that acts as a rule for a hyphenation algorithm.
The concept is based on the thesis Word Hy-phen-a-tion by Com-put-er by Franklin Mark Liang. A set of patterns is extracted from an entire dictionary of hyphenated words for a specific language. To allow correct results with a reasonably small set of patterns, these patterns form a chain of positive rules and exceptions. Therefore a pattern ranks a potential hyphenation-position with a number from 1 to 9. If several patterns apply to the same hyphenation-position, the highest number wins. Odd numbers indicate a hyphenation, while even numbers indicate an exception that should NOT be hyphenated. The character '.' is used at the beginning and/or end of a pattern to indicate that it should only match at the beginning or end of the word to hyphenate.
Logically, the (normalized) word to hyphenate is enclosed with '.' and, for each start-index of that word, all patterns are checked for a match (please note that the order of the patterns is important!). A pattern matches if the pattern stripped of its digits is a substring of the word at this start-index. If the pattern matches, its hyphenation-positions are applied.

Here is an example to illustrate the algorithm:
The string "Computer" is normalized to ".computer.", which matches the following patterns:

This results in co4m5pu2t3er, so the hyphenated input String is finally "Com-put-er". The challenge is to implement this algorithm in an efficient way.
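The matching and ranking described above can be sketched as follows. This is only a naive illustration, not the implementation of this class: the class name LiangSketch, its methods, and the pattern list used in the usage note are assumptions made for the example. The sketch keeps the highest rank per position, so its result does not depend on pattern order, which may differ from the real implementation.

```java
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; all names here are hypothetical. */
final class LiangSketch {

    /** Assumed terminator character, following the '.' convention described above. */
    static final char TERMINATOR = '.';

    /** Returns the pattern without its digits, e.g. "put3er" gives "puter". */
    static String wordPart(String pattern) {
        return pattern.replaceAll("[0-9]", "");
    }

    /** Merges the ranks of a matching pattern into ranks[], keeping the higher number. */
    static void apply(String pattern, int start, int[] ranks) {
        int offset = 0; // letters of the pattern consumed so far
        for (char c : pattern.toCharArray()) {
            if (Character.isDigit(c)) {
                int pos = start + offset; // gap before the character at index pos
                ranks[pos] = Math.max(ranks[pos], c - '0');
            } else {
                offset++;
            }
        }
    }

    /** Ranks every gap of the normalized word (enclosed with the terminator). */
    static int[] rank(String word, List<String> patterns) {
        String normalized = TERMINATOR + word.toLowerCase(Locale.ROOT) + TERMINATOR;
        int[] ranks = new int[normalized.length() + 1];
        for (int start = 0; start < normalized.length(); start++) {
            for (String pattern : patterns) {
                if (normalized.startsWith(wordPart(pattern), start)) {
                    apply(pattern, start, ranks);
                }
            }
        }
        return ranks;
    }

    /** Inserts '-' wherever the winning rank of a gap inside the word is odd. */
    static String hyphenate(String word, List<String> patterns) {
        int[] ranks = rank(word, patterns);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            // word index i corresponds to index i + 1 of the normalized word
            if (i > 0 && ranks[i + 1] % 2 == 1) {
                result.append('-');
            }
            result.append(word.charAt(i));
        }
        return result.toString();
    }
}
```

With the hypothetical pattern list 4m1p, pu2t, 5pute and put3er, LiangSketch.hyphenate("Computer", patterns) returns "Com-put-er", which is consistent with the ranking co4m5pu2t3er. Checking every pattern at every start-index is exactly the inefficiency a real implementation has to avoid.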

Since:
2.0.0
Author:
Joerg Hohwiller (hohwille at users.sourceforge.net)

Field Summary
private  HyphenationPatternPosition[] hyphenationPositions
           
static char TERMINATOR
          The word-terminator representing start and end of a word.
private  String wordPart
          The pattern without digits.
private  int wordPartHash
           
 
Constructor Summary
HyphenationPattern(String pattern, StringHasher hasher)
          The constructor.
 
Method Summary
protected  HyphenationPatternPosition[] getHyphenationPositions()
          This method gets the hyphenation-positions of the pattern.
 String getPattern()
          This method gets the original pattern (word-part with hyphenation-points).
 String getWordPart()
          This method gets the word-part, that is the pattern without digits.
 int getWordPartHash()
          This method gets the pre-calculated hash of word-part.
 String toString()
          This method gets the original pattern.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

wordPart

private final String wordPart
The pattern without digits.


wordPartHash

private final int wordPartHash
See Also:
getWordPartHash()

hyphenationPositions

private final HyphenationPatternPosition[] hyphenationPositions
See Also:
getHyphenationPositions()

TERMINATOR

public static final char TERMINATOR
The word-terminator representing start and end of a word.

See Also:
Constant Field Values
Constructor Detail

HyphenationPattern

public HyphenationPattern(String pattern,
                          StringHasher hasher)
The constructor.

Parameters:
pattern - is the raw pattern.
hasher - is the hash-algorithm to use for the word-part-hash.
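
How a raw pattern might be split into its word-part and its hyphenation-positions can be sketched as follows. This is an assumption about what the constructor has to do, not its actual code: the class PatternParseSketch and its fields indexes and rankings are hypothetical stand-ins (the real class stores HyphenationPatternPosition objects), and the hashing step is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and field layout are hypothetical. */
final class PatternParseSketch {

    /** The pattern without digits. */
    final String wordPart;

    /** For each digit: index into wordPart of the character the digit precedes. */
    final int[] indexes;

    /** For each digit: its value from 1 to 9. */
    final int[] rankings;

    PatternParseSketch(String pattern) {
        StringBuilder letters = new StringBuilder();
        List<int[]> positions = new ArrayList<>();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c >= '0' && c <= '9') {
                // the digit ranks the gap before the next non-digit character
                positions.add(new int[] {letters.length(), c - '0'});
            } else {
                letters.append(c);
            }
        }
        this.wordPart = letters.toString();
        this.indexes = new int[positions.size()];
        this.rankings = new int[positions.size()];
        for (int i = 0; i < positions.size(); i++) {
            this.indexes[i] = positions.get(i)[0];
            this.rankings[i] = positions.get(i)[1];
        }
    }
}
```

For the pattern ".af1t" this sketch yields the word-part ".aft" and a single position with index 3 and ranking 1.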
Method Detail

getHyphenationPositions

protected HyphenationPatternPosition[] getHyphenationPositions()
This method gets the hyphenation-positions of the pattern.

Returns:
the HyphenationPatternPositions.

getWordPart

public String getWordPart()
This method gets the word-part, that is the pattern without digits. If the word-part is a substring of the word to hyphenate (enclosed with '.'), the hyphenation-points are applied to the HyphenationState.

Returns:
the word-part.
See Also:
HyphenationState.apply(HyphenationPattern)

getWordPartHash

public int getWordPartHash()
This method gets the pre-calculated hash of word-part.
ATTENTION:
The result may differ from the hashCode() of the word-part. A specific hash algorithm is used that allows the hashes of shifting substrings to be calculated efficiently.

Returns:
the hash.
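
One common way to hash shifting substrings efficiently is a polynomial rolling hash, sketched below. The source does not say which algorithm StringHasher actually uses, so this is purely an illustrative assumption; the class RollingHashSketch, its methods, and the constant BASE are hypothetical.

```java
/** Illustrative sketch only; not the StringHasher used by this class. */
final class RollingHashSketch {

    /** Arbitrary base chosen for the example. */
    static final int BASE = 257;

    /** Hashes s[start, start + length) directly in O(length). */
    static int hash(CharSequence s, int start, int length) {
        int h = 0;
        for (int i = start; i < start + length; i++) {
            h = h * BASE + s.charAt(i);
        }
        return h;
    }

    /**
     * Hashes every substring of the given length. After the first window,
     * each hash is derived from the previous one in constant time.
     */
    static int[] windowHashes(CharSequence s, int length) {
        int count = s.length() - length + 1;
        if (length <= 0 || count <= 0) {
            return new int[0];
        }
        int highest = 1; // BASE^(length - 1), the weight of the leftmost character
        for (int i = 1; i < length; i++) {
            highest *= BASE;
        }
        int[] result = new int[count];
        result[0] = hash(s, 0, length);
        for (int i = 1; i < count; i++) {
            // drop the character leaving the window, then shift in the new one
            int withoutFirst = result[i - 1] - s.charAt(i - 1) * highest;
            result[i] = withoutFirst * BASE + s.charAt(i + length - 1);
        }
        return result;
    }
}
```

With such a scheme, the pre-calculated hash of a word-part can be compared against the window hash at each start-index, and the costlier substring comparison is only needed when the hashes are equal. Integer overflow is harmless here because both the direct and the rolling computation wrap around in the same way.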

getPattern

public String getPattern()
This method gets the original pattern (word-part with hyphenation-points).
ATTENTION:
This method is intended for debugging purposes. It rebuilds the pattern, which costs some performance.

Returns:
the pattern (e.g. ".af1t").
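
Rebuilding the original pattern from the word-part and the hyphenation-positions could look roughly like this. The class PatternRebuildSketch and the two parallel arrays are hypothetical; the real class keeps HyphenationPatternPosition objects whose API is not shown here. The indexes are assumed to be sorted in ascending order.

```java
/** Illustrative sketch only; parameter layout is hypothetical. */
final class PatternRebuildSketch {

    /**
     * Re-inserts each ranking digit in front of the word-part character
     * at the corresponding index (an index equal to the length means
     * "after the last character").
     */
    static String rebuild(String wordPart, int[] indexes, int[] rankings) {
        StringBuilder pattern = new StringBuilder(wordPart.length() + indexes.length);
        int next = 0; // next position that still has to be written
        for (int i = 0; i <= wordPart.length(); i++) {
            if (next < indexes.length && indexes[next] == i) {
                pattern.append(rankings[next]);
                next++;
            }
            if (i < wordPart.length()) {
                pattern.append(wordPart.charAt(i));
            }
        }
        return pattern.toString();
    }
}
```

For the word-part ".aft" with one position at index 3 and ranking 1, the sketch returns ".af1t". A new string is assembled on every call, which illustrates why such a method is better suited to debugging than to the hyphenation loop itself.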

toString

public String toString()
This method gets the original pattern.

Overrides:
toString in class Object


Copyright © 2001-2010 mmm-Team. All Rights Reserved.