Package de.jungblut.crawl.extraction
Class OutlinkExtractor
- java.lang.Object
  - de.jungblut.crawl.extraction.OutlinkExtractor
- All Implemented Interfaces:
Extractor<FetchResult>
public final class OutlinkExtractor extends java.lang.Object implements Extractor<FetchResult>
Outlink extractor, parses a page just for its outlinks.
- Author:
- thomas.jungblut
-
Constructor Summary
Constructor, Description
- OutlinkExtractor()
-
Method Summary
Modifier and Type, Method, Description
- static java.lang.String consumeStream(java.io.InputStream stream)
  Consumes a given InputStream and returns a string consisting of the HTML code of the site.
- FetchResult extract(java.lang.String realUrl)
  Extracts all the content needed from a given URL and returns it.
- static java.lang.String extractBaseUrl(java.lang.String url)
  Extracts a base URL from the given URL (to turn relative outlinks into absolute ones).
- static java.util.HashSet<java.lang.String> extractOutlinks(java.lang.String html, java.lang.String url)
  Extracts the outlinks of the given HTML document, passed as a string.
- static java.util.HashSet<java.lang.String> filter(java.util.HashSet<java.lang.String> set, java.util.regex.Pattern matcher)
  Filters outlinks from a parsed page that do NOT match the given matcher.
- static java.io.InputStream getConnection(java.lang.String realUrl)
- static boolean isValid(java.lang.String s)
  Checks that the site does not end with an unparsable suffix such as PDF and that it is a valid URL, determined by extracting a base URL at index 0.
-
Method Detail
-
extract
public FetchResult extract(java.lang.String realUrl)
Description copied from interface: Extractor
Extracts all the content needed from a given URL and returns it. Returns null if nothing should be returned or nothing could be parsed.
- Specified by:
extract in interface Extractor<FetchResult>
-
getConnection
public static java.io.InputStream getConnection(java.lang.String realUrl) throws java.io.IOException
- Returns:
- an opened stream.
- Throws:
java.io.IOException
-
filter
public static java.util.HashSet<java.lang.String> filter(java.util.HashSet<java.lang.String> set, java.util.regex.Pattern matcher)
Filters outlinks from a parsed page that do NOT match the given matcher.
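The filtering step can be illustrated with a minimal sketch. It assumes that links which do not match the pattern are removed from the set in place and that a null pattern leaves the set untouched. Neither assumption is confirmed by this page, and `FilterSketch` is a hypothetical stand-in, not the library class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.regex.Pattern;

class FilterSketch {

  // Hypothetical stand-in for OutlinkExtractor.filter.
  // Assumption: links that do not fully match the pattern are dropped in place.
  static HashSet<String> filter(HashSet<String> set, Pattern matcher) {
    if (matcher == null) {
      return set; // assumption: no pattern means nothing is filtered
    }
    Iterator<String> it = set.iterator();
    while (it.hasNext()) {
      if (!matcher.matcher(it.next()).matches()) {
        it.remove();
      }
    }
    return set;
  }

  public static void main(String[] args) {
    HashSet<String> links = new HashSet<>(Arrays.asList(
        "http://example.com/a", "http://example.org/b"));
    // Keep only links on example.com.
    System.out.println(filter(links, Pattern.compile("http://example\\.com/.*")));
  }
}
```

Using `matches()` requires the whole link to match the pattern; a variant based on `find()` would accept partial matches instead.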
-
extractOutlinks
public static java.util.HashSet<java.lang.String> extractOutlinks(java.lang.String html, java.lang.String url) throws org.htmlparser.util.ParserException
Extracts the outlinks of the given HTML document, passed as a string.
- Parameters:
html - the HTML to extract the outlinks from.
url - the URL where we found the current document.
- Returns:
- a set of outlinks.
- Throws:
org.htmlparser.util.ParserException
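To show the general idea of outlink extraction, here is a rough sketch that swaps in a regular expression and `java.net.URI.resolve` in place of the org.htmlparser library the documented method relies on. `OutlinkSketch` and its `HREF` pattern are hypothetical, and a regex will miss many cases a real HTML parser handles.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutlinkSketch {

  // Deliberately simple: only matches quoted href attributes.
  private static final Pattern HREF =
      Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

  // Hypothetical regex-based stand-in, not the library implementation.
  static HashSet<String> extractOutlinks(String html, String url) {
    HashSet<String> outlinks = new HashSet<>();
    URI base = URI.create(url);
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      try {
        // Relative links are resolved against the URL of the current document.
        outlinks.add(base.resolve(m.group(1).trim()).toString());
      } catch (IllegalArgumentException e) {
        // Skip links that are not syntactically valid URIs.
      }
    }
    return outlinks;
  }

  public static void main(String[] args) {
    String html = "<a href=\"/about\">About</a> <a href='other.html'>Other</a>";
    System.out.println(extractOutlinks(html, "http://example.com/dir/page.html"));
  }
}
```

Returning a set means that a link appearing several times on a page is reported once.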
-
consumeStream
public static java.lang.String consumeStream(java.io.InputStream stream) throws java.io.IOException
Consumes a given InputStream and returns a string consisting of the HTML code of the site.
- Throws:
java.io.IOException
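Reading a stream into a string might look like the following sketch. The UTF-8 charset and the decision to close the stream when done are assumptions; this page does not say how the documented method handles either. `ConsumeStreamSketch` is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ConsumeStreamSketch {

  // Hypothetical stand-in: reads the whole stream as UTF-8 text (assumed charset).
  static String consumeStream(InputStream stream) throws IOException {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(stream, StandardCharsets.UTF_8))) {
      char[] buffer = new char[4096];
      int read;
      while ((read = reader.read(buffer)) != -1) {
        sb.append(buffer, 0, read);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(
        "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8));
    System.out.println(consumeStream(in));
  }
}
```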
-
extractBaseUrl
public static java.lang.String extractBaseUrl(java.lang.String url)
Extracts a base URL from the given URL (to turn relative outlinks into absolute ones).
- Returns:
- a base url or null if none was found.
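As an illustration of what a base URL could mean here, the sketch below takes it to be scheme, host, and optional port, and returns null when the input has no such part. This definition is an assumption, and the documented method may compute the base URL differently. `BaseUrlSketch` is a hypothetical name.

```java
import java.net.URI;
import java.net.URISyntaxException;

class BaseUrlSketch {

  // Hypothetical stand-in. Assumption: base URL = scheme + "://" + host [+ ":" + port].
  static String extractBaseUrl(String url) {
    if (url == null) {
      return null;
    }
    try {
      URI uri = new URI(url);
      if (uri.getScheme() == null || uri.getHost() == null) {
        return null; // relative or otherwise hostless input has no base URL
      }
      String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
      return uri.getScheme() + "://" + uri.getHost() + port;
    } catch (URISyntaxException e) {
      return null; // not parsable as a URI at all
    }
  }

  public static void main(String[] args) {
    System.out.println(extractBaseUrl("http://example.com/a/b.html?x=1"));
    System.out.println(extractBaseUrl("/relative/path"));
  }
}
```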
-
isValid
public static boolean isValid(java.lang.String s)
Checks that the site does not end with an unparsable suffix such as PDF and that it is a valid URL, determined by extracting a base URL at index 0.
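The two checks described above can be sketched as follows. The suffix list is hypothetical (only PDF is named on this page), and the "base URL at index 0" requirement is approximated by demanding that the whole string parse as an absolute URI with a host. `IsValidSketch` is not the library class.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class IsValidSketch {

  // Hypothetical list; the page only mentions PDF as an example.
  private static final String[] UNPARSABLE_SUFFIXES = {".pdf", ".jpg", ".png", ".zip"};

  // Hypothetical stand-in for the documented validity check.
  static boolean isValid(String s) {
    if (s == null) {
      return false;
    }
    String lower = s.toLowerCase(Locale.ROOT);
    for (String suffix : UNPARSABLE_SUFFIXES) {
      if (lower.endsWith(suffix)) {
        return false; // not something an HTML parser can handle
      }
    }
    try {
      // Approximation: the string must start with a scheme and contain a host.
      URI uri = new URI(s);
      return uri.getScheme() != null && uri.getHost() != null;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid("http://example.com/index.html"));
    System.out.println(isValid("http://example.com/paper.PDF"));
  }
}
```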