Class OutlinkExtractor

  • All Implemented Interfaces:
    Extractor<FetchResult>

    public final class OutlinkExtractor
    extends java.lang.Object
    implements Extractor<FetchResult>
    Outlink extractor that parses a page solely for its outlinks.
    Author:
    thomas.jungblut
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String consumeStream(java.io.InputStream stream)
      Consumes the given InputStream and returns the HTML of the page as a string.
      FetchResult extract(java.lang.String realUrl)
      Extracts all the needed content from the given URL and returns it.
      static java.lang.String extractBaseUrl(java.lang.String url)
      Extracts the base URL from the given URL (used to turn relative outlinks into absolute ones).
      static java.util.HashSet<java.lang.String> extractOutlinks(java.lang.String html, java.lang.String url)
      Extracts the outlinks from the given HTML document, passed as a string.
      static java.util.HashSet<java.lang.String> filter(java.util.HashSet<java.lang.String> set, java.util.regex.Pattern matcher)
      Filters out the outlinks of a parsed page that do not match the given matcher.
      static java.io.InputStream getConnection(java.lang.String realUrl)
      static boolean isValid(java.lang.String s)
      Checks that the URL does not end with an unparsable suffix such as PDF and that it is a valid URL, i.e. that a base URL can be extracted at index 0.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • OutlinkExtractor

        public OutlinkExtractor()
    • Method Detail

      • extract

        public FetchResult extract(java.lang.String realUrl)
        Description copied from interface: Extractor
        Extracts all the needed content from the given URL and returns it. Returns null if nothing should be returned or nothing could be parsed.
        Specified by:
        extract in interface Extractor<FetchResult>
      • getConnection

        public static java.io.InputStream getConnection(java.lang.String realUrl)
                                                 throws java.io.IOException
        Returns:
        an open input stream for the given URL.
        Throws:
        java.io.IOException
      • filter

        public static java.util.HashSet<java.lang.String> filter(java.util.HashSet<java.lang.String> set,
                                                                 java.util.regex.Pattern matcher)
        Filters out the outlinks of a parsed page that do not match the given matcher.
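        The filtering described above can be sketched with the standard library alone. This is a minimal sketch, not the source of this class: the class name OutlinkFilterSketch is hypothetical, and it assumes that a link "matches" only when the whole string matches the pattern and that the result is returned as a new set.

```java
import java.util.HashSet;
import java.util.regex.Pattern;

final class OutlinkFilterSketch {

  // Assumption: keep only links that fully match the pattern; return a new set.
  static HashSet<String> filter(HashSet<String> set, Pattern matcher) {
    HashSet<String> kept = new HashSet<>();
    for (String link : set) {
      // matches() requires the entire link to match, not just a substring.
      if (matcher.matcher(link).matches()) {
        kept.add(link);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    HashSet<String> links = new HashSet<>();
    links.add("http://example.com/a.html");
    links.add("ftp://example.com/file");
    Pattern httpOnly = Pattern.compile("https?://.*");
    System.out.println(filter(links, httpOnly)); // prints [http://example.com/a.html]
  }
}
```

        Whether the real method mutates the given set or returns a copy is not stated here, so callers should not rely on either behaviour.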
      • extractOutlinks

        public static java.util.HashSet<java.lang.String> extractOutlinks(java.lang.String html,
                                                                          java.lang.String url)
                                                                   throws org.htmlparser.util.ParserException
        Extracts the outlinks from the given HTML document, passed as a string.
        Parameters:
        html - the HTML to extract the outlinks from.
        url - the URL at which the current document was found.
        Returns:
        a set of outlinks.
        Throws:
        org.htmlparser.util.ParserException
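        To illustrate the idea of collecting outlinks and resolving them against the document URL, here is a simplified sketch. It does not use org.htmlparser as the real method does; instead it assumes a plain regular expression over anchor tags and java.net.URI for resolving relative links. The class name OutlinkSketch and the HREF pattern are hypothetical.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OutlinkSketch {

  // Simplification: matches href="..." or href='...' inside <a ...> tags.
  // A regular expression is not a full HTML parser; it only serves the illustration.
  private static final Pattern HREF =
      Pattern.compile("<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

  static HashSet<String> extractOutlinks(String html, String url) {
    URI base = URI.create(url);
    HashSet<String> outlinks = new HashSet<>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      try {
        // resolve() turns relative links into absolute ones and keeps absolute links as they are.
        outlinks.add(base.resolve(m.group(1).trim()).toString());
      } catch (IllegalArgumentException e) {
        // Assumption: links that are not syntactically valid URIs are skipped.
      }
    }
    return outlinks;
  }
}
```

        The set semantics mean that a link appearing several times in the page is reported once.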
      • consumeStream

        public static java.lang.String consumeStream(java.io.InputStream stream)
                                              throws java.io.IOException
        Consumes the given InputStream and returns the HTML of the page as a string.
        Throws:
        java.io.IOException
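        Reading a stream fully into a string could look like the sketch below. It is an illustration only: the class name StreamSketch is hypothetical, and both the UTF-8 decoding and the closing of the stream after reading are assumptions, since neither is documented here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamSketch {

  static String consumeStream(InputStream stream) throws IOException {
    // Assumption: the stream is closed once it has been read to the end.
    try (InputStream in = stream) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int read;
      // read() returns -1 at end of stream.
      while ((read = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
      // Assumption: the page is encoded as UTF-8.
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
  }
}
```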
      • extractBaseUrl

        public static java.lang.String extractBaseUrl(java.lang.String url)
        Extracts the base URL from the given URL (used to turn relative outlinks into absolute ones).
        Returns:
        the base URL, or null if none was found.
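        One possible reading of "base URL" is the scheme plus the authority (host and optional port). The sketch below follows that reading; it is an assumption, not the documented behaviour of this method, and the class name BaseUrlSketch is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class BaseUrlSketch {

  // Assumption: the base URL is "scheme://authority"; null signals that none was found.
  static String extractBaseUrl(String url) {
    if (url == null) {
      return null;
    }
    try {
      URI uri = new URI(url);
      if (uri.getScheme() == null || uri.getHost() == null) {
        return null; // relative or otherwise incomplete URL
      }
      return uri.getScheme() + "://" + uri.getAuthority();
    } catch (URISyntaxException e) {
      return null; // not parsable as a URI at all
    }
  }

  public static void main(String[] args) {
    System.out.println(extractBaseUrl("http://example.com/docs/index.html?lang=en")); // prints http://example.com
    System.out.println(extractBaseUrl("/docs/index.html")); // prints null
  }
}
```

        With a base URL in hand, a root-relative outlink such as /about can be made absolute by prefixing it with the base.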
      • isValid

        public static boolean isValid(java.lang.String s)
        Checks that the URL does not end with an unparsable suffix such as PDF and that it is a valid URL, i.e. that a base URL can be extracted at index 0.
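        The two checks can be sketched as follows. The suffix list and the base URL pattern are hypothetical, because the actual values are not given in this documentation; the class name ValiditySketch is likewise an assumption. The "index 0" condition is modelled with Matcher.lookingAt(), which only succeeds when the match starts at the beginning of the string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class ValiditySketch {

  // Hypothetical list of suffixes that cannot be parsed as HTML.
  private static final List<String> UNPARSABLE = Arrays.asList(".pdf", ".jpg", ".png", ".zip");

  // Hypothetical base URL pattern: scheme followed by a non-empty authority.
  private static final Pattern BASE_URL =
      Pattern.compile("https?://[^/?#\\s]+", Pattern.CASE_INSENSITIVE);

  static boolean isValid(String s) {
    if (s == null) {
      return false;
    }
    String lower = s.toLowerCase(Locale.ROOT);
    for (String suffix : UNPARSABLE) {
      if (lower.endsWith(suffix)) {
        return false; // e.g. a link to a PDF
      }
    }
    // lookingAt() anchors the match at index 0 without requiring a full match.
    return BASE_URL.matcher(s).lookingAt();
  }
}
```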