public final class StandardHtmlEncodingDetector extends Object implements org.apache.tika.detect.EncodingDetector
https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream
If a resource was fetched over HTTP, its HTTP headers should be added to the Tika metadata
passed to detect(java.io.InputStream, org.apache.tika.metadata.Metadata), especially HttpHeaders.CONTENT_TYPE, as it may contain charset information.
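To illustrate why the Content-Type header matters, here is a minimal sketch of pulling a charset out of a header value using only the JDK. The class `ContentTypeCharset` and its method `fromHeader` are hypothetical names for this example; this is not how Tika parses the header, and it does not apply the encoding label mapping defined by the HTML standard.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

/** Hypothetical helper, not part of Tika: reads the charset parameter of a Content-Type value. */
final class ContentTypeCharset {

    /** Returns the charset named in the header value, or null if none is usable. */
    static Charset fromHeader(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    // Unknown or malformed label: report "no answer".
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromHeader("text/html; charset=ISO-8859-1"));
        System.out.println(fromHeader("text/html"));
    }
}
```

A header without a charset parameter yields null here, which mirrors the "no encoding detected" case described next.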
This encoding detector may return null if no encoding is detected.
It is meant to be used inside a CompositeEncodingDetector.
For instance:

    EncodingDetector detector = new CompositeEncodingDetector(
        Arrays.asList(
            new StandardHtmlEncodingDetector(),
            new Icu4jEncodingDetector()));
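The reason a null result is useful inside a composite can be sketched without Tika: assuming the composite asks each detector in order and keeps the first non-null answer, a detector that returns null simply hands the decision to the next one. The class `FirstNonNullChain` and both stand-in detectors below are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of a first-non-null fallback chain; not Tika code. */
final class FirstNonNullChain {

    /** Stand-in detector: answers only when the data starts with a UTF-8 byte order mark. */
    static final Function<byte[], Charset> UTF8_BOM_ONLY = bytes ->
            bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8
                    : null;

    /** Stand-in fallback detector: always gives an answer. */
    static final Function<byte[], Charset> ALWAYS_LATIN1 = bytes -> StandardCharsets.ISO_8859_1;

    /** Tries each detector in order and returns the first non-null answer, or null. */
    static Charset detect(byte[] data, List<Function<byte[], Charset>> detectors) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset result = detector.apply(data);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Function<byte[], Charset>> chain = Arrays.asList(UTF8_BOM_ONLY, ALWAYS_LATIN1);
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] withoutBom = {'a', 'b', 'c'};
        System.out.println(detect(withBom, chain));
        System.out.println(detect(withoutBom, chain));
    }
}
```

Ordering matters in such a chain: the more specific detector goes first, and the one that always answers goes last.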
| Constructor and Description |
|---|
| StandardHtmlEncodingDetector() |
| Modifier and Type | Method and Description |
|---|---|
| Charset | detect(InputStream input, org.apache.tika.metadata.Metadata metadata) |
| int | getMarkLimit() |
| void | setMarkLimit(int markLimit) How far into the stream to read for charset detection. |
public Charset detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
Specified by: detect in interface org.apache.tika.detect.EncodingDetector
Throws: IOException
public int getMarkLimit()
@Field public void setMarkLimit(int markLimit)
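As a rough picture of what a mark limit controls, the following JDK-only sketch reads at most a fixed number of bytes from a stream and then rewinds it, so that a later consumer still sees the stream from the start. It assumes the detector relies on InputStream mark/reset in a similar way; `MarkLimitPeek` and `peek` are hypothetical names, not Tika API.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of bounded look-ahead with mark/reset; not Tika code. */
final class MarkLimitPeek {

    /** Reads up to markLimit bytes for inspection, then rewinds the stream. */
    static byte[] peek(InputStream in, int markLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        try {
            byte[] buffer = new byte[markLimit];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < markLimit) {
                int n = in.read(buffer, total, markLimit - total);
                if (n < 0) {
                    break; // end of stream reached before the limit
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            // Rewind so the caller can read the stream from the beginning.
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<meta charset=\"utf-8\"><p>body</p>".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        byte[] head = peek(in, 22);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        System.out.println((char) in.read());
    }
}
```

A larger limit lets a detector see declarations that appear later in the document, at the cost of buffering more bytes before the stream can be rewound.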
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.