| Interface | Description |
|---|---|
| ParagraphSplitter | A paragraph splitter segments text into paragraphs. |
| SentenceSplitter | A sentence splitter segments text into sentences (a sentence being a string of words that satisfies the grammatical rules of a language). |
| Tokenizer | A tokenizer segments text into tokens. A token is a string of characters that is categorized as a symbol according to a set of rules. |


| Class | Description |
|---|---|
| BreakIteratorSentenceSplitter | A sentence splitter based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting). |
| BreakIteratorTokenizer | A word tokenizer based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting). |
| PennTreebankTokenizer | A word tokenizer that tokenizes English sentences using the conventions of the Penn Treebank. |
| SimpleParagraphSplitter | A simple paragraph splitter. |
| SimpleSentenceSplitter | A simple sentence splitter for English. |
| SimpleTokenizer | A word tokenizer that tokenizes English sentences with some differences from PennTreebankTokenizer, notably in how it handles "not" contractions (such as "don't"). |
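
The two BreakIterator-based classes above rely on locale-aware boundary detection from the JDK. The following is a minimal sketch of that underlying technique using java.text.BreakIterator directly. It is not the source of BreakIteratorSentenceSplitter or BreakIteratorTokenizer; the class name `BreakIteratorSketch` and the methods `sentences` and `words` are hypothetical names chosen for illustration, and dropping whitespace-only segments is an assumption of this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the documented package.
public class BreakIteratorSketch {

    /** Splits text into sentences using the sentence boundary rules of the given locale. */
    static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text);
    }

    /** Splits text into word-level tokens using the word boundary rules of the given locale. */
    static List<String> words(String text, Locale locale) {
        return split(BreakIterator.getWordInstance(locale), text);
    }

    /** Walks consecutive boundaries and collects the non-blank segments between them. */
    private static List<String> split(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Assumption of this sketch: whitespace-only segments are discarded.
            String segment = text.substring(start, end).trim();
            if (!segment.isEmpty()) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello world. How are you?", Locale.US));
        System.out.println(words("Hello world", Locale.US));
    }
}
```

Passing a different `Locale` selects the boundary rules for that language, which is what "selected by the locale setting" refers to. Note that the word instance also reports punctuation marks as separate segments, so a real tokenizer may want to filter or keep them depending on the application.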