Class TokenLexer


  • public final class TokenLexer
    extends Object
    Provides a generalized token lexer capability. This lexer goes beyond java.util.StringTokenizer in that it identifies each token's type along with the token text and converts the token string into the type's corresponding Java instance. There are nine (9) pre-defined token types and two special types: ERROR and EOF. ERROR is returned when an error occurs while seeking the next token. EOF is returned when the end of the input is reached and no more tokens will be returned.

    The pre-defined token types are:

    1. CHARACTER: a single character between single quotes (').
    2. COMMENT: Either a // or slash star comment. Supports nested comments.
    3. FLOAT: A decimal number.
    4. INTEGER: An integer number.
    5. NAME: An alpha-numeric identifier.
    6. OPERATOR: Punctuation only identifier.
    7. SOURCE: Raw, unanalyzed input.
    8. STRING: Zero or more characters between double quotes ("").
    There is support for user-defined keyword, operator and delimiter tokens. When a NAME token is found, the user keywords map is checked for that token. If the map contains it, the associated token type is returned instead of NAME. When an OPERATOR token is found, both the user operators and delimiters maps are checked in the same way.

    The user-defined token maps should meet the following criteria:

    • The token type values must be >= NEXT_TOKEN.
    • The token type values do not need to be unique either within or across maps.
    • The token type values do not need to be consecutive.
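The criteria above can be illustrated with a small, self-contained sketch. It does not use the eBus library: NEXT_TOKEN is a local stand-in for TokenLexer.NEXT_TOKEN (documented value 11), and the token type names, requireValid and resolveName are hypothetical helpers that only mirror the documented rules (values >= NEXT_TOKEN, duplicates and gaps allowed, keyword lookup replacing NAME).

```java
import java.util.HashMap;
import java.util.Map;

final class UserTokenMaps {
    // Local stand-in for TokenLexer.NEXT_TOKEN (documented value: 11).
    static final int NEXT_TOKEN = 11;

    // Hypothetical user-defined token types. Values may skip numbers
    // and may repeat, as long as each one is >= NEXT_TOKEN.
    static final int KW_IF = NEXT_TOKEN;
    static final int KW_ELSE = NEXT_TOKEN + 1;
    static final int OP_ASSIGN = NEXT_TOKEN + 10;
    static final int DELIM_BRACE = NEXT_TOKEN + 20;

    static Map<String, Integer> keywords() {
        Map<String, Integer> map = new HashMap<>();
        map.put("if", KW_IF);
        map.put("else", KW_ELSE);
        return map;
    }

    static Map<String, Integer> operators() {
        Map<String, Integer> map = new HashMap<>();
        map.put(":=", OP_ASSIGN);
        return map;
    }

    static Map<Character, Integer> delimiters() {
        Map<Character, Integer> map = new HashMap<>();
        map.put('{', DELIM_BRACE);
        map.put('}', DELIM_BRACE); // the same type value may be reused
        return map;
    }

    // Mirrors the documented constructor check: any value below
    // NEXT_TOKEN is rejected with IllegalArgumentException.
    static void requireValid(Map<?, Integer> map) {
        for (Integer type : map.values()) {
            if (type < NEXT_TOKEN) {
                throw new IllegalArgumentException(
                    "token type " + type + " < " + NEXT_TOKEN);
            }
        }
    }

    // Mirrors the documented NAME handling: return the user's token
    // type if the name is a keyword, otherwise the given NAME type.
    static int resolveName(String name,
                           Map<String, Integer> keywords,
                           int nameType) {
        return keywords.getOrDefault(name, nameType);
    }
}
```

Maps built this way would then be passed to the TokenLexer constructor; empty maps are acceptable there, null maps are not.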
    The basic algorithm using TokenLexer is:
       
     import java.io.Reader;
     import net.sf.eBus.text.TokenLexer;
     import net.sf.eBus.text.Token;
     ...
     TokenLexer lexer = new TokenLexer(Keywords, Operators, Delimiters);
     Token token;
     Reader input = ...;
    
     // Set the input to be tokenized.
     lexer.input(input);
    
     // Continue retrieving until no more tokens.
     while ((token = lexer.nextToken()).type() != TokenLexer.EOF)
     {
         // Process the next token based on token type.
     }
    
     // Finish up the tokenization.
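
The "process the next token based on token type" step usually comes down to a switch on the integer type. The sketch below is self-contained and does not use the eBus library: the constants are local stand-ins copied from the values listed in the field summary, and describe is a hypothetical helper.

```java
final class TokenTypes {
    // Local stand-ins for the TokenLexer constants, using the values
    // given in the field summary.
    static final int ERROR = 0;
    static final int EOF = 1;
    static final int CHARACTER = 2;
    static final int COMMENT = 3;
    static final int FLOAT = 4;
    static final int INTEGER = 5;
    static final int NAME = 6;
    static final int OPERATOR = 7;
    static final int SOURCE = 8;
    static final int STRING = 9;
    static final int NEXT_TOKEN = 11;

    // Maps a token type to a label; a real loop body would act on the
    // token value here instead of returning a string.
    static String describe(int type) {
        switch (type) {
            case ERROR:     return "ERROR";
            case EOF:       return "EOF";
            case CHARACTER: return "CHARACTER";
            case COMMENT:   return "COMMENT";
            case FLOAT:     return "FLOAT";
            case INTEGER:   return "INTEGER";
            case NAME:      return "NAME";
            case OPERATOR:  return "OPERATOR";
            case SOURCE:    return "SOURCE";
            case STRING:    return "STRING";
            default:
                // Anything at or above NEXT_TOKEN is a user-defined type.
                return type >= NEXT_TOKEN ? "user-defined" : "unknown";
        }
    }
}
```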
       
     

    Raw Lexical Mode

    Users may not want the lexer to analyze input between two well-defined delimiters. This data is collected and returned as a SOURCE token when the terminating delimiter is reached. Raw mode requires both an opening and a closing delimiter to be specified. This allows the lexer to track the appearance of nested delimiters within the input and return only when the top-level terminating delimiter is found.

    Raw lexical mode is used when input contains sub-text to be handled by a different lexer.
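
The nested-delimiter tracking described above can be sketched with a depth counter. This is an illustration of the idea only, not the library's implementation: RawCollector and collect are hypothetical names, the input is a plain String rather than a Reader, the opening delimiter is assumed to have been consumed already, and the open and close characters are assumed to differ.

```java
final class RawCollector {
    // Mirrors TokenLexer.NO_OPEN_CHAR: U+0000 means "no open
    // character, only a close character".
    static final char NO_OPEN_CHAR = '\u0000';

    // Collects raw text up to the top-level closing delimiter.
    // Depth starts at 1 because the caller has already consumed the
    // outermost opening delimiter.
    static String collect(String input, char open, char close) {
        StringBuilder raw = new StringBuilder();
        int depth = 1;

        for (int i = 0; i < input.length(); ++i) {
            char ch = input.charAt(i);

            if (open != NO_OPEN_CHAR && ch == open) {
                ++depth;                 // nested block begins
            } else if (ch == close && --depth == 0) {
                return raw.toString();   // top-level close reached
            }
            raw.append(ch);              // nested delimiters are kept
        }

        throw new IllegalStateException(
            "input ended before the closing delimiter");
    }
}
```

With an open character, inner open/close pairs are kept as part of the collected text and only the matching outer close ends collection. With NO_OPEN_CHAR, the first close character ends it.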

    Author:
    Charles Rapp
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  TokenLexer.LexMode
      The lexer will either analyze the tokens identifying the type or collect raw input until a terminating delimiter is found.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int CHARACTER
      A single-quoted character token (2).
      static int COMMENT
      Either a // or a slash star comment (3).
      static int EOF
      The end of the input is reached (1).
      static int ERROR
      An error occurred when seeking the next token (0).
      static int FLOAT
      A floating point number (4).
      static int INTEGER
      An integer number (5).
      static int NAME
      An alphanumeric identifier (6).
      static int NEXT_TOKEN
      User-defined tokens must be >= 11.
      static char NO_OPEN_CHAR
      When the raw mode open character is set to U+0000, this means there is no open character, only a close character.
      static int OPERATOR
      Token consists solely of punctuation characters (7).
      static int SOURCE
      Raw, unanalyzed input (8).
      static int STRING
      A double-quoted string (9).
      static int TOKEN_COUNT
      There are eleven (11) predefined token types.
    • Field Detail

      • NO_OPEN_CHAR

        public static final char NO_OPEN_CHAR
        When the raw mode open character is set to U+0000, this means there is no open character, only a close character.
        See Also:
        Constant Field Values
      • ERROR

        public static final int ERROR
        An error occurred when seeking the next token (0).
        See Also:
        Constant Field Values
      • CHARACTER

        public static final int CHARACTER
        A single-quoted character token (2). Token value is a java.lang.Character instance.
        See Also:
        Constant Field Values
      • COMMENT

        public static final int COMMENT
        Either a // or a slash star comment (3). Nested comments are supported.
        See Also:
        Constant Field Values
      • FLOAT

        public static final int FLOAT
        A floating point number (4). Token value is a java.lang.Double instance.
        See Also:
        Constant Field Values
      • INTEGER

        public static final int INTEGER
        An integer number (5). Token value is a java.lang.Long instance.
        See Also:
        Constant Field Values
      • NAME

        public static final int NAME
        An alphanumeric identifier (6). If the token appears in the user-defined keywords map, then the user's token type is returned instead.
        See Also:
        Constant Field Values
      • OPERATOR

        public static final int OPERATOR
        Token consists solely of punctuation characters (7). If the token is in the user-defined operator or delimiter map, then the user's token type is returned instead.

        Punctuation characters are:

           
         !  "  #  $  %  &  '  ( )  *
         +  ,  -  .  /  :  ;  <  =  >
         ?  @  [  \  ]  ^  _  `  {  }
         |  ~
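
A membership check over the characters listed above can be sketched as follows. Punctuation and isPunctuationOnly are hypothetical names, and the check only tests whether every character is in the list; it says nothing about how the lexer itself treats characters such as quotes or slashes that also begin other token types.

```java
final class Punctuation {
    // The 32 punctuation characters from the list above.
    static final String PUNCTUATION =
        "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{}|~";

    // True if the text is non-empty and every character is one of
    // the listed punctuation characters.
    static boolean isPunctuationOnly(String text) {
        if (text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); ++i) {
            if (PUNCTUATION.indexOf(text.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```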
           
         
        See Also:
        Constant Field Values
      • TOKEN_COUNT

        public static final int TOKEN_COUNT
        There are eleven (11) predefined token types.
        See Also:
        Constant Field Values
      • NEXT_TOKEN

        public static final int NEXT_TOKEN
        User-defined tokens must be >= 11.
        See Also:
        Constant Field Values
    • Constructor Detail

      • TokenLexer

        public TokenLexer(Map<String,Integer> keywords,
                          Map<String,Integer> operators,
                          Map<Character,Integer> delimiters)
        Creates a token lexer using the specified keywords, operators and delimiters. These maps may be empty but not null.
        Parameters:
        keywords - Keyword to integer identifier mapping.
        operators - Operator to integer identifier mapping.
        delimiters - Delimiter to integer identifier mapping.
        Throws:
        IllegalArgumentException - if any of the user maps contains a value < NEXT_TOKEN.
    • Method Detail

      • lineNumber

        public int lineNumber()
        Returns the current line number being tokenized.
        Returns:
        the current line number being tokenized.
      • offset

        public int offset()
        Returns the current offset into the input.
        Returns:
        the current offset into the input.
      • mode

        public TokenLexer.LexMode mode()
        Returns the current lexer mode.
        Returns:
        the current lexer mode.
      • input

        public void input(Reader reader)
        Extract tokens from this input stream.
        Parameters:
        reader - Tokenize this input.
      • rawMode

        public void rawMode(char openChar,
                            char closeChar)
        Switch to raw tokenization.
        Parameters:
        openChar - The open clause delimiter.
        closeChar - The close clause delimiter.
        See Also:
        cookedMode()
      • cookedMode

        public void cookedMode()
        Switch back to cooked tokenization.
        See Also:
        rawMode(char, char)
      • nextToken

        public Token nextToken()
        Returns the next token found in the input stream. If there are no more tokens in the input stream, then EOF is returned.
        Returns:
        the next token found in the input stream.
        Throws:
        IllegalStateException - if input reader is not set.