Class TikaContentExtractor


  • public class TikaContentExtractor
    extends Object
    • Constructor Detail

      • TikaContentExtractor

        public TikaContentExtractor()
        Create new Tika-based content extractor using AutoDetectParser.
      • TikaContentExtractor

        public TikaContentExtractor(org.apache.tika.parser.Parser parser)
        Create new Tika-based content extractor using the provided parser instance.
        Parameters:
        parser - parser instance
      • TikaContentExtractor

        public TikaContentExtractor(List<org.apache.tika.parser.Parser> parsers)
        Create new Tika-based content extractor using the provided parser instances.
        Parameters:
        parsers - parser instances
      • TikaContentExtractor

        public TikaContentExtractor(List<org.apache.tika.parser.Parser> parsers,
                                    org.apache.tika.detect.Detector detector)
        Create new Tika-based content extractor using the provided parser instances and detector.
        Parameters:
        parsers - parser instances
        detector - detector instance
      • TikaContentExtractor

        public TikaContentExtractor(org.apache.tika.parser.Parser parser,
                                    boolean validateMediaType)
        Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
        Parameters:
        parser - parser instance
        validateMediaType - enable or disable media type validation
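        The validation step described above amounts to a membership test: is the detected media type among the types the parser supports? The sketch below illustrates only that idea with plain strings. The class MediaTypeValidationSketch and the method isSupported are hypothetical names invented for this example; they are not part of CXF or Tika, and the real implementation may differ.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in used for illustration only; not a CXF or Tika class.
class MediaTypeValidationSketch {

    /** Returns true when the detected type is one the parser claims to support. */
    static boolean isSupported(String detectedType, Set<String> supportedTypes) {
        if (detectedType == null) {
            return false; // nothing was detected, so there is nothing to validate
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String base = detectedType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return supportedTypes.contains(base);
    }

    public static void main(String[] args) {
        Set<String> pdfOnly = Set.of("application/pdf");
        System.out.println(isSupported("application/pdf", pdfOnly));
        System.out.println(isSupported("text/plain; charset=UTF-8", pdfOnly));
    }
}
```

        With validation disabled, a check of this kind would simply be skipped and the input handed to the parser directly.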
    • Method Detail

      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in)
        Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content and metadata from
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
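        Because extract signals failure by returning null rather than throwing, callers need a null check before using the result. One way to make that explicit is to adapt the call to Optional. The sketch below uses only the JDK; NullSafeExtractionSketch and tryExtract are hypothetical helper names, and the lambda stands in for a real extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper used for illustration only; not a CXF class.
class NullSafeExtractionSketch {

    /** Adapts a "returns null on failure" extraction call to Optional. */
    static <T> Optional<T> tryExtract(Function<InputStream, T> extractor, InputStream in) {
        return Optional.ofNullable(extractor.apply(in));
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[0]);
        // Stand-in for an extraction that fails and therefore yields null.
        Function<InputStream, String> alwaysFails = stream -> null;
        String text = tryExtract(alwaysFails, in).orElse("(nothing extracted)");
        System.out.println(text);
    }
}
```

        With a real instance, the same helper could be called as tryExtract(extractor::extract, in), which would yield an Optional of TikaContentExtractor.TikaContent.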
      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in,
                                                        javax.ws.rs.core.MediaType mt)
        Extract the content and metadata from the input stream with a media type hint.
        Parameters:
        in - input stream to extract the content and metadata from
        mt - JAX-RS MediaType of the stream content
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
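        This page does not say how the hint interacts with detection. One plausible reading is that a caller-supplied media type is used in place of running the detector, with detection as the fallback when no hint is given. The sketch below models only that assumed precedence; MediaTypeHintSketch and resolveMediaType are hypothetical names, and the behaviour of the actual implementation should be checked against its source.

```java
import java.util.function.Supplier;

// Hypothetical stand-in used for illustration only; not a CXF or Tika class.
class MediaTypeHintSketch {

    /**
     * Assumed precedence: a non-empty hint wins, otherwise fall back to detection.
     * The detector is only invoked when no usable hint is supplied.
     */
    static String resolveMediaType(String hint, Supplier<String> detector) {
        if (hint != null && !hint.trim().isEmpty()) {
            return hint.trim();
        }
        return detector.get();
    }

    public static void main(String[] args) {
        // The supplier stands in for a detector that inspects the stream.
        Supplier<String> detector = () -> "text/plain";
        System.out.println(resolveMediaType("application/pdf", detector));
        System.out.println(resolveMediaType(null, detector));
    }
}
```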
      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in,
                                                        ContentHandler handler)
        Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content and metadata from
        handler - custom ContentHandler
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in,
                                                        ContentHandler handler,
                                                        javax.ws.rs.core.MediaType mt)
        Extract the content and metadata from the input stream with a media type hint.
        Parameters:
        in - input stream to extract the content and metadata from
        handler - custom ContentHandler
        mt - JAX-RS MediaType of the stream content
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in,
                                                        ContentHandler handler,
                                                        org.apache.tika.parser.ParseContext context)
        Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the content and metadata from
        handler - custom ContentHandler
        context - custom context
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
      • extract

        public TikaContentExtractor.TikaContent extract(InputStream in,
                                                        ContentHandler handler,
                                                        javax.ws.rs.core.MediaType mtHint,
                                                        org.apache.tika.parser.ParseContext context)
        Extract the content and metadata from the input stream with a media type hint.
        Parameters:
        in - input stream to extract the content and metadata from
        handler - custom ContentHandler
        mtHint - JAX-RS MediaType of the stream content
        context - custom context
        Returns:
        the extracted content and metadata or null if extraction is not possible or was unsuccessful
      • extractMetadata

        public TikaContentExtractor.TikaContent extractMetadata(InputStream in)
        Extract only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the metadata from
        Returns:
        the extracted metadata or null if extraction is not possible or was unsuccessful
      • extractMetadataToSearchBean

        public SearchBean extractMetadataToSearchBean(InputStream in)
        Extract only the metadata from the input stream and convert it to a SearchBean. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
        Parameters:
        in - input stream to extract the metadata from
        Returns:
        the extracted metadata converted to SearchBean or null if extraction is not possible or was unsuccessful