net.sf.mmm.util.io.base
Class EncodingUtilImpl.UtfDetectionProcessor

java.lang.Object
  extended by net.sf.mmm.util.io.base.EncodingUtilImpl.UtfDetectionProcessor
All Implemented Interfaces:
ByteProcessor
Enclosing class:
EncodingUtilImpl

protected static class EncodingUtilImpl.UtfDetectionProcessor
extends Object
implements ByteProcessor

This nested class performs the actual UTF detection. It processes the bytes of the underlying InputStream through a lookahead buffer. It takes into account a ByteOrderMark, UTF-8 multi-byte sequences, UTF-16 surrogates, and the zero bytes that ASCII characters produce in UTF-16 and UTF-32, among other indicators.
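As an illustration of the first indicator listed above, the following is a minimal sketch of byte order mark recognition. It is not the code of this class: the name BomSketch and the returned encoding names are assumptions made for the example, and only the well-known BOM byte patterns are taken as given.

```java
/** Hypothetical helper, not part of net.sf.mmm: recognizes a leading byte order mark. */
final class BomSketch {

  /** Returns the encoding implied by a leading BOM, or null if no BOM is present. */
  static String detectBom(byte[] data) {
    // The 4-byte patterns are checked first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
    // A UTF-16LE stream that begins with U+0000 after its BOM would be misread here.
    if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
      return "UTF-32BE";
    }
    if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) {
      return "UTF-32LE";
    }
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
      return "UTF-8";
    }
    if (startsWith(data, 0xFE, 0xFF)) {
      return "UTF-16BE";
    }
    if (startsWith(data, 0xFF, 0xFE)) {
      return "UTF-16LE";
    }
    return null;
  }

  /** Compares the first bytes of data (as unsigned values) with the given prefix. */
  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }
}
```

A result of null only means that no BOM was found, so the remaining indicators still have to decide.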


Field Summary
private  ByteOrderMark bom
          The ByteOrderMark or null if NOT present (or detection NOT started).
private  long bytePosition
          The byte-position in the stream relative to the head.
private  RankMap<String> encodingRankMap
          The RankMap for encoding detection.
private  long firstNonAsciiPosition
          The bytePosition where the first non-ASCII byte was detected.
private  boolean maybeAscii
          false if the data can NOT be ASCII, true otherwise.
private  boolean maybeUtf16
          false if the data can NOT be UTF-16, true otherwise.
private  boolean maybeUtf8
          false if the data can NOT be UTF-8, true otherwise.
private  String nonUtfEncoding
          The encoding to use if encoding is neither UTF nor ASCII.
private  EncodingUtilImpl.Surrogate[] surrogates
          The last EncodingUtilImpl.Surrogate seen at each of the positions modulo 4.
private  int utf8ContinuationByteCount
          The expected number of UTF-8 continuation bytes to come or 0 if no UTF-8 multi-byte-sequence is currently processed.
private  int[] zeroByteCounts
          The number of bytes that have been 0 for each of the positions modulo 4.
 
Constructor Summary
EncodingUtilImpl.UtfDetectionProcessor(String nonUtfEncoding)
          Creates a new processor with the given fallback encoding.
 
Method Summary
 String getEncoding()
          This method gets the detected encoding from the currently processed data.
 String getLowByteEncoding()
          This method gets the encoding without taking high-bytes (non-ASCII) into account.
 int process(byte[] buffer, int offset, int length)
          This method is called to process length bytes from the given buffer, starting at the given offset.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

encodingRankMap

private RankMap<String> encodingRankMap
The RankMap for encoding detection.


bom

private ByteOrderMark bom
The ByteOrderMark or null if NOT present (or detection NOT started).


nonUtfEncoding

private final String nonUtfEncoding
The encoding to use if encoding is neither UTF nor ASCII.


maybeAscii

private boolean maybeAscii
false if the data can NOT be ASCII, true otherwise.


maybeUtf8

private boolean maybeUtf8
false if the data can NOT be UTF-8, true otherwise.


maybeUtf16

private boolean maybeUtf16
false if the data can NOT be UTF-16, true otherwise.


bytePosition

private long bytePosition
The byte-position in the stream relative to the head.


firstNonAsciiPosition

private long firstNonAsciiPosition
The bytePosition where the first non-ASCII byte was detected.


zeroByteCounts

private int[] zeroByteCounts
The number of bytes that have been 0 for each of the positions modulo 4.
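
To show why zero bytes are counted per position modulo 4, here is a minimal sketch. It assumes mostly ASCII text, where UTF-16 and UTF-32 pad each character with zero bytes at fixed positions. The class name ZeroByteSketch and the simple decision rule in guess are assumptions for illustration only and are not the heuristic actually used by this class.

```java
/** Hypothetical helper, not part of net.sf.mmm: zero-byte statistics per position modulo 4. */
final class ZeroByteSketch {

  /** Counts the zero bytes found at each stream position modulo 4. */
  static int[] countZeroBytes(byte[] data) {
    int[] counts = new int[4];
    for (int i = 0; i < data.length; i++) {
      if (data[i] == 0) {
        counts[i % 4]++;
      }
    }
    return counts;
  }

  /** Guesses an encoding from the zero-byte pattern, or returns null if the pattern is inconclusive. */
  static String guess(int[] counts) {
    boolean z0 = counts[0] > 0;
    boolean z1 = counts[1] > 0;
    boolean z2 = counts[2] > 0;
    boolean z3 = counts[3] > 0;
    if (z0 && z1 && z2 && !z3) {
      return "UTF-32BE"; // ASCII character as 00 00 00 xx
    }
    if (!z0 && z1 && z2 && z3) {
      return "UTF-32LE"; // ASCII character as xx 00 00 00
    }
    if (z0 && !z1 && z2 && !z3) {
      return "UTF-16BE"; // ASCII character as 00 xx
    }
    if (!z0 && z1 && !z2 && z3) {
      return "UTF-16LE"; // ASCII character as xx 00
    }
    return null;
  }
}
```

A real detector would likely weigh the counts against the number of bytes seen rather than test them for being non-zero, since a single stray zero byte changes the outcome of this sketch.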


surrogates

private EncodingUtilImpl.Surrogate[] surrogates
The last EncodingUtilImpl.Surrogate seen at each of the positions modulo 4.
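
The role of surrogates as a UTF-16 indicator can be sketched as follows: under the correct byte order, every high surrogate is immediately followed by a low surrogate. The example checks this for one assumed byte order. The class SurrogateSketch is hypothetical and does not reproduce how EncodingUtilImpl.Surrogate is tracked per position.

```java
/** Hypothetical helper, not part of net.sf.mmm: checks surrogate pairing for one byte order. */
final class SurrogateSketch {

  /**
   * Returns true if the data, read as 16-bit code units in the given byte order,
   * contains only correctly paired surrogates (or none at all).
   */
  static boolean hasValidSurrogates(byte[] data, boolean bigEndian) {
    boolean expectLow = false;
    for (int i = 0; i + 1 < data.length; i += 2) {
      int high = data[bigEndian ? i : i + 1] & 0xFF;
      int low = data[bigEndian ? i + 1 : i] & 0xFF;
      char unit = (char) ((high << 8) | low);
      if (expectLow) {
        if (!Character.isLowSurrogate(unit)) {
          return false; // high surrogate without its low partner
        }
        expectLow = false;
      } else if (Character.isHighSurrogate(unit)) {
        expectLow = true;
      } else if (Character.isLowSurrogate(unit)) {
        return false; // low surrogate without a preceding high surrogate
      }
    }
    return !expectLow; // a trailing high surrogate is incomplete
  }
}
```

Data without any surrogates passes the check for both byte orders, so this test can only rule UTF-16 out, never confirm it.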


utf8ContinuationByteCount

private int utf8ContinuationByteCount
The expected number of UTF-8 continuation bytes to come or 0 if no UTF-8 multi-byte-sequence is currently processed.
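
A counter of this kind can be maintained as in the sketch below, which derives the number of expected continuation bytes from the lead byte and clears a maybeUtf8 flag when the sequence is malformed. The class Utf8ContinuationSketch and its method names are assumptions for illustration. The sketch is deliberately simplified and does not reject overlong forms or code points beyond U+10FFFF.

```java
/** Hypothetical helper, not part of net.sf.mmm: tracks UTF-8 multi-byte sequences byte by byte. */
final class Utf8ContinuationSketch {

  /** Continuation bytes still expected, or 0 if no multi-byte sequence is open. */
  private int continuationByteCount;

  /** false as soon as a byte violates the UTF-8 structure. */
  private boolean maybeUtf8 = true;

  void accept(byte b) {
    int value = b & 0xFF;
    if (this.continuationByteCount > 0) {
      if ((value & 0xC0) == 0x80) {
        this.continuationByteCount--; // 10xxxxxx: valid continuation byte
      } else {
        this.maybeUtf8 = false; // sequence was cut short
        this.continuationByteCount = 0;
      }
    } else if (value < 0x80) {
      // 0xxxxxxx: plain ASCII, nothing to track
    } else if ((value & 0xE0) == 0xC0) {
      this.continuationByteCount = 1; // 110xxxxx: 2-byte sequence
    } else if ((value & 0xF0) == 0xE0) {
      this.continuationByteCount = 2; // 1110xxxx: 3-byte sequence
    } else if ((value & 0xF8) == 0xF0) {
      this.continuationByteCount = 3; // 11110xxx: 4-byte sequence
    } else {
      this.maybeUtf8 = false; // stray continuation byte or invalid lead byte
    }
  }

  boolean isMaybeUtf8() {
    return this.maybeUtf8;
  }

  int getContinuationByteCount() {
    return this.continuationByteCount;
  }
}
```

Because the state is kept between calls, a multi-byte sequence that is split across two buffers is still recognized.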

Constructor Detail

EncodingUtilImpl.UtfDetectionProcessor

public EncodingUtilImpl.UtfDetectionProcessor(String nonUtfEncoding)
Creates a new processor with the given fallback encoding.

Parameters:
nonUtfEncoding - is the encoding to use if encoding is neither UTF nor ASCII.
Method Detail

process

public int process(byte[] buffer,
                   int offset,
                   int length)
This method is called to process length bytes from the given buffer, starting at the given offset.
ATTENTION:
An implementation of this interface should only read bytes from the given buffer. It is NOT permitted to modify the given buffer unless this is explicitly specified by the calling object (typically an implementation of ByteProcessable).

Specified by:
process in interface ByteProcessor
Parameters:
buffer - contains the bytes to process.
offset - is the index where to start in the buffer.
length - is the number of bytes to process.
Returns:
the number of bytes that should be consumed. Typically you will simply return length. However, you can also return a value less than length and greater than or equal to zero in order to stop processing at a specific position.
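
The contract of the return value can be illustrated with a caller that feeds a stream to a processor in chunks and stops as soon as fewer bytes are consumed than were offered. This is a sketch under assumptions: the nested Processor interface only mirrors the process signature documented here so that the example is self-contained, and ProcessLoopSketch and feed are hypothetical names, not the library's ByteProcessable implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical caller, not part of net.sf.mmm: drives a processor over an InputStream. */
final class ProcessLoopSketch {

  /** Stand-in with the same method signature as ByteProcessor.process. */
  interface Processor {
    int process(byte[] buffer, int offset, int length);
  }

  /** Feeds the stream to the processor and returns the total number of bytes consumed. */
  static long feed(InputStream in, Processor processor, int bufferSize) {
    byte[] buffer = new byte[bufferSize];
    long total = 0;
    try {
      int read;
      while ((read = in.read(buffer)) > 0) {
        int consumed = processor.process(buffer, 0, read);
        total += consumed;
        if (consumed < read) {
          break; // the processor asked to stop at this position
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] data = {1, 2, 3, 4, 5, 0, 7, 8};
    // This processor consumes everything it is offered.
    long consumed = feed(new ByteArrayInputStream(data), (buffer, offset, length) -> length, 4);
    System.out.println(consumed);
  }
}
```

The processor only reads from the buffer, in line with the ATTENTION note above.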

getLowByteEncoding

public String getLowByteEncoding()
This method gets the encoding without taking high-bytes (non-ASCII) into account.

Returns:
the low-byte encoding or null if it looks like ASCII so far.

getEncoding

public String getEncoding()
This method gets the detected encoding from the currently processed data.

Returns:
the detected encoding or null if the encoding has NOT yet been detected and it looks like ASCII so far.


Copyright © 2001-2010 mmm-Team. All Rights Reserved.