Class TikaLuceneContentExtractor


  • public class TikaLuceneContentExtractor
    extends Object
    • Constructor Detail

      • TikaLuceneContentExtractor

        public TikaLuceneContentExtractor​(org.apache.tika.parser.Parser parser)
        Create new Tika-based content extractor using the provided parser instance.
        Parameters:
        parser - parser instance
      • TikaLuceneContentExtractor

        public TikaLuceneContentExtractor​(org.apache.tika.parser.Parser parser,
                                          boolean validateMediaType)
        Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against media types supported by the parser.
        Parameters:
        parser - parser instance
        validateMediaType - enable or disable media type validation
      • TikaLuceneContentExtractor

        public TikaLuceneContentExtractor​(org.apache.tika.parser.Parser parser,
                                          LuceneDocumentMetadata documentMetadata)
        Create new Tika-based content extractor using the provided parser instance and document metadata. If validation is enabled, the implementation will try to detect the media type of the input and validate it against media types supported by the parser.
        Parameters:
        parser - parser instance
        documentMetadata - documentMetadata
      • TikaLuceneContentExtractor

        public TikaLuceneContentExtractor​(org.apache.tika.parser.Parser parser,
                                          boolean validateMediaType,
                                          LuceneDocumentMetadata documentMetadata)
        Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against media types supported by the parser.
        Parameters:
        parser - parser instance
        validateMediaType - enable or disable media type validation
        documentMetadata - documentMetadata
      • TikaLuceneContentExtractor

        public TikaLuceneContentExtractor​(List<org.apache.tika.parser.Parser> parsers,
                                          LuceneDocumentMetadata documentMetadata)
        Create new Tika-based content extractor using the provided parser instances and document metadata. If validation is enabled, the implementation will try to detect the media type of the input and validate it against media types supported by the parsers.
        Parameters:
        parsers - parser instances
        documentMetadata - documentMetadata
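
The media type validation described for the constructors above can be pictured with a small, self-contained sketch. It is not the class's actual implementation: `MediaTypeGate`, `detect` and `accepts` are hypothetical names, and the stand-in detector only recognises the PDF magic bytes, where a real setup would presumably delegate to a Tika detector. The sketch shows two points: the stream is rewound with mark/reset after detection so the parser still sees the full input, and the check is skipped entirely when validation is disabled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Set;

// Hypothetical illustration only; not part of the documented class.
final class MediaTypeGate {

    // Stand-in detector (assumption): peeks at the first five bytes and
    // recognises only the PDF signature. The stream is reset afterwards.
    static String detect(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(5);
            byte[] head = in.readNBytes(5);
            in.reset();
            String magic = new String(head, StandardCharsets.US_ASCII);
            return magic.equals("%PDF-") ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Accepts the input when validation is off, or when the detected
    // media type is one of the types the parser claims to support.
    static boolean accepts(InputStream in, Set<String> supported, boolean validateMediaType) {
        return !validateMediaType || supported.contains(detect(in));
    }
}
```

With validation enabled, a stream starting with `%PDF-` passes against a supported set containing `application/pdf`, while any other input is rejected; with validation disabled, every input is accepted without being inspected.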
    • Method Detail

      • extract

        public org.apache.lucene.document.Document extract​(InputStream in)
        Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content and metadata from
        Returns:
        the extracted document or null if extraction is not possible or was unsuccessful
      • extract

        public org.apache.lucene.document.Document extract​(InputStream in,
                                                           LuceneDocumentMetadata documentMetadata)
        Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content and metadata from
        documentMetadata - documentMetadata
        Returns:
        the extracted document or null if extraction is not possible or was unsuccessful
      • extractContent

        public org.apache.lucene.document.Document extractContent​(InputStream in)
        Extract only the content from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content from
        Returns:
        the extracted document or null if extraction is not possible or was unsuccessful
      • extractMetadata

        public org.apache.lucene.document.Document extractMetadata​(InputStream in)
        Extract only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the metadata from
        Returns:
        the extracted document or null if extraction is not possible or was unsuccessful
      • extractMetadata

        public org.apache.lucene.document.Document extractMetadata​(InputStream in,
                                                                   LuceneDocumentMetadata documentMetadata)
        Extract only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the metadata from
        documentMetadata - documentMetadata
        Returns:
        the extracted document or null if extraction is not possible or was unsuccessful
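
Because every extract method above returns null when extraction is not possible or fails, callers need a null check before using the result. The following is a minimal sketch of one way to make that explicit. `NullSafeExtraction` and `tryExtract` are hypothetical names, and the extractor is modelled as a plain `Function<InputStream, D>` so the snippet compiles without Tika or Lucene on the classpath; with the real class, a method reference such as `extractor::extract` would presumably fill that role.

```java
import java.io.InputStream;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper; not part of the documented class.
final class NullSafeExtraction {

    // Wraps a possibly-null extraction result in an Optional so the
    // "extraction failed" case cannot be silently dereferenced.
    static <D> Optional<D> tryExtract(Function<InputStream, D> extractor, InputStream in) {
        return Optional.ofNullable(extractor.apply(in));
    }
}
```

A caller can then write `tryExtract(...).ifPresent(doc -> ...)` to index the document only when one was actually produced.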