Package net.sf.eBus.text
Class TokenLexer
- java.lang.Object
  - net.sf.eBus.text.TokenLexer
public final class TokenLexer extends Object
Provides a generalized token lexer capability. This lexer goes beyond java.util.StringTokenizer in that it identifies the token type along with the token and converts the token string into the type's corresponding Java instance. There are nine (9) pre-defined token types and two special types: ERROR and EOF. ERROR is returned when a recoverable error occurs. EOF is returned when the input end is reached and no more tokens will be returned.

The pre-defined token types are:
- CHARACTER: A single character between single quotes (').
- COMMENT: Either a // or a slash-star comment. Supports nested comments.
- FLOAT: A decimal number.
- INTEGER: An integer number.
- NAME: An alphanumeric identifier.
- OPERATOR: Punctuation-only identifier.
- SOURCE: Raw, unanalyzed input.
- STRING: Zero or more characters between double quotes ("").
When a NAME token is found, the user keywords map is checked to see whether it contains the token as a keyword. If so, the associated token type is returned instead of NAME. When an OPERATOR token is found, both the user operators and delimiters maps are checked.

The user-defined token maps should meet the following criteria:
- The token type values must be >= NEXT_TOKEN.
- The token type values do not need to be unique either within or across maps.
- The token type values do not need to be consecutive.
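The criteria above, together with the keyword lookup described for NAME tokens, can be sketched as follows. This is a minimal illustration and not the eBus implementation: the class UserTokenMapsSketch, the token types IF_TOKEN, ELSE_TOKEN, and ARROW_TOKEN, and the helper methods are all hypothetical, and the NAME and NEXT_TOKEN constants are local stand-ins that copy the values documented for TokenLexer.

```java
import java.util.HashMap;
import java.util.Map;

final class UserTokenMapsSketch {
    // Local stand-ins for TokenLexer.NAME and TokenLexer.NEXT_TOKEN,
    // copied from the documented values (assumption: 6 and 11).
    static final int NAME = 6;
    static final int NEXT_TOKEN = 11;

    // Hypothetical user-defined token types. Gaps between values are allowed.
    static final int IF_TOKEN = NEXT_TOKEN;
    static final int ELSE_TOKEN = NEXT_TOKEN + 1;
    static final int ARROW_TOKEN = NEXT_TOKEN + 10;

    // Builds a keyword map whose values are all >= NEXT_TOKEN.
    static Map<String, Integer> keywords() {
        Map<String, Integer> map = new HashMap<>();
        map.put("if", IF_TOKEN);
        map.put("else", ELSE_TOKEN);
        return map;
    }

    // Builds an operator map; values need not be consecutive with the keywords.
    static Map<String, Integer> operators() {
        Map<String, Integer> map = new HashMap<>();
        map.put("->", ARROW_TOKEN);
        return map;
    }

    // Returns true only if every token type in the map is >= NEXT_TOKEN.
    static boolean meetsCriteria(Map<?, Integer> userMap) {
        for (Integer type : userMap.values()) {
            if (type == null || type < NEXT_TOKEN) {
                return false;
            }
        }
        return true;
    }

    // Returns the user's token type for a keyword, or NAME otherwise.
    static int resolveName(String token, Map<String, Integer> keywords) {
        return keywords.getOrDefault(token, NAME);
    }

    public static void main(String[] args) {
        Map<String, Integer> keywords = keywords();
        System.out.println(meetsCriteria(keywords));
        System.out.println(resolveName("if", keywords));
        System.out.println(resolveName("count", keywords));
    }
}
```

Deriving each user token type from NEXT_TOKEN rather than hard-coding a number keeps the maps valid even if the set of predefined types changes.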
An example use of TokenLexer is:

    import java.io.Reader;
    import net.sf.eBus.text.TokenLexer;
    import net.sf.eBus.text.Token;
    ...
    TokenLexer lexer = new TokenLexer(Keywords, Operators, Delimiters);
    Token token;
    Reader input = ...;

    // Set the input to be tokenized.
    lexer.input(input);

    // Continue retrieving until no more tokens.
    while ((token = lexer.nextToken()).type() != TokenLexer.EOF) {
        // Process the next token based on token type.
    }

    // Finish up the tokenization.

Raw Lexical Mode
Users may not want the lexer to analyze input between two well-defined delimiters. This data is collected and returned as a SOURCE token when the terminating delimiter is reached. Raw mode requires both an opening and a closing delimiter to be specified. This allows the lexer to track the appearance of nested delimiters within the input and return only when the top-level terminating delimiter is found.

Raw lexical mode is used when input contains sub-text to be handled by a different lexer.
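The nested-delimiter tracking described for raw mode can be sketched with a simple depth counter. This is an illustration only, not the eBus implementation: the class RawModeSketch and the method collectRaw are hypothetical, and the sketch assumes the opening delimiter has already been consumed, so collection starts at depth 1. The NO_OPEN_CHAR stand-in uses the documented value U+0000.

```java
final class RawModeSketch {
    // Local stand-in for TokenLexer.NO_OPEN_CHAR (documented as U+0000).
    static final char NO_OPEN_CHAR = '\u0000';

    /**
     * Collects raw text from the start of input up to, but not including,
     * the top-level closing delimiter. Returns null if the input ends
     * before that delimiter is found.
     */
    static String collectRaw(String input, char openChar, char closeChar) {
        StringBuilder raw = new StringBuilder();
        int depth = 1; // Assumption: the opening delimiter was already read.

        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == closeChar) {
                depth--;
                if (depth == 0) {
                    return raw.toString(); // Top-level close reached.
                }
            } else if (openChar != NO_OPEN_CHAR && ch == openChar) {
                depth++; // A nested opening delimiter.
            }
            raw.append(ch); // Nested delimiters stay in the raw text.
        }
        return null; // Input ended inside the raw section.
    }

    public static void main(String[] args) {
        System.out.println(collectRaw("a {b} c} rest", '{', '}'));
        System.out.println(collectRaw("a {b} c} rest", NO_OPEN_CHAR, '}'));
    }
}
```

With an opening delimiter, the inner `{b}` pair is kept as part of the collected text. With NO_OPEN_CHAR there is no nesting, so collection stops at the first closing delimiter.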
- Author:
- Charles Rapp
-
-
-
Nested Class Summary
Nested Classes

| Modifier and Type | Class | Description |
|---|---|---|
| static class | TokenLexer.LexMode | The lexer will either analyze the tokens identifying the type or collect raw input until a terminating delimiter is found. |
-
Field Summary
Fields

| Modifier and Type | Field | Description |
|---|---|---|
| static int | CHARACTER | A single-quoted character token (2). |
| static int | COMMENT | Either a // or a slash-star comment (3). |
| static int | EOF | The end of the input is reached (1). |
| static int | ERROR | An error occurred when seeking the next token (0). |
| static int | FLOAT | A floating point number (4). |
| static int | INTEGER | An integer number (5). |
| static int | NAME | An alphanumeric identifier (6). |
| static int | NEXT_TOKEN | User-defined tokens must be >= 11. |
| static char | NO_OPEN_CHAR | When the raw mode open character is set to U+0000, this means there is no open character, only a close character. |
| static int | OPERATOR | Token consists solely of punctuation characters (7). |
| static int | SOURCE | Raw, unanalyzed input (8). |
| static int | STRING | A double-quoted string (9). |
| static int | TOKEN_COUNT | There are eleven (11) predefined token types. |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | cookedMode() | Switch back to cooked tokenization. |
| void | input(Reader reader) | Extract tokens from this input stream. |
| int | lineNumber() | Returns the current line number being tokenized. |
| TokenLexer.LexMode | mode() | Returns the current lexer mode. |
| Token | nextToken() | Returns the next token found in the input stream. |
| int | offset() | Returns the current offset into the input. |
| void | rawMode(char openChar, char closeChar) | Switch to raw tokenization. |
-
-
-
Field Detail
-
NO_OPEN_CHAR
public static final char NO_OPEN_CHAR
When the raw mode open character is set to U+0000, this means there is no open character, only a close character.
- See Also:
- Constant Field Values
-
ERROR
public static final int ERROR
An error occurred when seeking the next token (0).
- See Also:
- Constant Field Values
-
EOF
public static final int EOF
The end of the input is reached (1).
- See Also:
- Constant Field Values
-
CHARACTER
public static final int CHARACTER
A single-quoted character token (2). Token value is a java.lang.Character instance.
- See Also:
- Constant Field Values
-
COMMENT
public static final int COMMENT
Either a // or a slash-star comment (3). Nested comments are supported.
- See Also:
- Constant Field Values
-
FLOAT
public static final int FLOAT
A floating point number (4). Token value is a java.lang.Double instance.
- See Also:
- Constant Field Values
-
INTEGER
public static final int INTEGER
An integer number (5). Token value is a java.lang.Long instance.
- See Also:
- Constant Field Values
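The token values described for CHARACTER, FLOAT, and INTEGER can be pictured as a conversion keyed on the token type. The sketch below is an assumption about how such a conversion might look, not the eBus implementation: the class TokenValueSketch and the method convert are hypothetical, the constants are local stand-ins using the documented values, and the text passed in is assumed to have its surrounding quotes already removed and to contain no escape sequences.

```java
final class TokenValueSketch {
    // Local stand-ins for the documented TokenLexer constants.
    static final int CHARACTER = 2;
    static final int FLOAT = 4;
    static final int INTEGER = 5;

    /**
     * Converts token text to the Java instance documented for its type.
     * Other token types are returned as the unchanged text, which is an
     * assumption made for this sketch.
     */
    static Object convert(int type, String text) {
        switch (type) {
            case CHARACTER:
                return Character.valueOf(text.charAt(0));
            case FLOAT:
                return Double.valueOf(text);
            case INTEGER:
                return Long.valueOf(text);
            default:
                return text;
        }
    }

    public static void main(String[] args) {
        Object value = convert(INTEGER, "42");
        System.out.println(value.getClass().getName());
    }
}
```

A caller can then branch on the token type and cast the value, for example to Long for an INTEGER token.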
-
NAME
public static final int NAME
An alphanumeric identifier (6). If the token appears in the user-defined keywords map, then the user's token type is returned instead.
- See Also:
- Constant Field Values
-
OPERATOR
public static final int OPERATOR
Token consists solely of punctuation characters (7). If the token is in the user-defined operator or delimiter map, then the user's token type is returned instead.
Punctuation characters are:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } | ~
- See Also:
- Constant Field Values
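The punctuation set listed for OPERATOR can be expressed as a membership test. This is a minimal sketch, not the eBus implementation: the class PunctuationSketch and its methods are hypothetical, and the sketch checks only set membership. It does not model how the lexer treats quote characters that also begin CHARACTER or STRING tokens.

```java
final class PunctuationSketch {
    // The punctuation characters listed for OPERATOR, in the same order.
    private static final String PUNCTUATION =
        "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{}|~";

    // Returns true if the character is in the listed punctuation set.
    static boolean isPunctuation(char c) {
        return PUNCTUATION.indexOf(c) >= 0;
    }

    // Returns true if the text is non-empty and consists solely of punctuation.
    static boolean isOperatorText(String text) {
        if (text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!isPunctuation(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isOperatorText("<="));
        System.out.println(isOperatorText("a+"));
    }
}
```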
-
SOURCE
public static final int SOURCE
Raw, unanalyzed input (8).
- See Also:
- TokenLexer.LexMode.RAW, Constant Field Values
-
STRING
public static final int STRING
A double-quoted string (9).
- See Also:
- Constant Field Values
-
TOKEN_COUNT
public static final int TOKEN_COUNT
There are eleven (11) predefined token types.
- See Also:
- Constant Field Values
-
NEXT_TOKEN
public static final int NEXT_TOKEN
User-defined tokens must be >= 11.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
TokenLexer
public TokenLexer(Map<String,Integer> keywords, Map<String,Integer> operators, Map<Character,Integer> delimiters)
Creates a message layout lexer using the specified keywords, operators, and delimiters. These maps may be empty but not null.
- Parameters:
- keywords - Keyword to integer identifier mapping.
- operators - Operator to integer identifier mapping.
- delimiters - Delimiter to integer identifier mapping.
- Throws:
- IllegalArgumentException - if any of the user maps contains a value < NEXT_TOKEN.
-
-
Method Detail
-
lineNumber
public int lineNumber()
Returns the current line number being tokenized.
- Returns:
- the current line number being tokenized.
-
offset
public int offset()
Returns the current offset into the input.
- Returns:
- the current offset into the input.
-
mode
public TokenLexer.LexMode mode()
Returns the current lexer mode.
- Returns:
- the current lexer mode.
-
input
public void input(Reader reader)
Extract tokens from this input stream.
- Parameters:
- reader - Tokenize this input.
-
rawMode
public void rawMode(char openChar, char closeChar)
Switch to raw tokenization.
- Parameters:
- openChar - The open clause delimiter.
- closeChar - The close clause delimiter.
- See Also:
- cookedMode()
-
cookedMode
public void cookedMode()
Switch back to cooked tokenization.
- See Also:
- rawMode(char, char)
-
nextToken
public Token nextToken()
Returns the next token found in the input stream. If there are no more tokens in the input stream, then EOF is returned.
- Returns:
- the next token found in the input stream.
- Throws:
- IllegalStateException - if input reader is not set.
-
-