public final class OutlinkExtractor extends Object implements Extractor<FetchResult>
| Constructor and Description |
|---|
| OutlinkExtractor() |
| Modifier and Type | Method and Description |
|---|---|
| static String | consumeStream(InputStream stream) Consumes a given InputStream and returns a string consisting of the HTML code of the site. |
| FetchResult | extract(String realUrl) Extracts all the content needed from a given URL and returns it. |
| static String | extractBaseUrl(String url) Extracts the base URL from the given URL (used to turn relative outlinks into absolute ones). |
| static HashSet<String> | extractOutlinks(String html, String url) Extracts the outlinks of the given HTML document, passed as a string. |
| static HashSet<String> | filter(HashSet<String> set, Pattern matcher) Filters outlinks from a parsed page that do NOT match the given matcher. |
| static InputStream | getConnection(String realUrl) |
| static boolean | isValid(String s) Checks that the site does not end with an unparsable suffix such as PDF and that it is a valid URL by extracting a base URL at index 0. |
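As an illustration of what consumeStream(InputStream) is described as doing, the following is a minimal sketch that reads a stream to the end and returns its content as a string. It is not the library's implementation: the class name `ConsumeStreamSketch`, the buffer size, and the UTF-8 decoding are all assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for OutlinkExtractor.consumeStream; not the real class.
final class ConsumeStreamSketch {

    // Reads the whole stream and decodes it as UTF-8 (the charset is an assumption).
    static String consumeStream(InputStream stream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        // read() returns -1 once the end of the stream is reached.
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the stream returned by getConnection.
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            System.out.println(consumeStream(in));
        }
    }
}
```

In a real crawl the stream would presumably come from getConnection(String realUrl); the in-memory stream above only keeps the sketch runnable without network access.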
public FetchResult extract(String realUrl)
Specified by: extract in interface Extractor<FetchResult>

public static InputStream getConnection(String realUrl) throws IOException
Throws: IOException

public static HashSet<String> filter(HashSet<String> set, Pattern matcher)
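The behaviour of filter(HashSet<String>, Pattern) can be sketched as follows. This is only one plausible reading of the description: it assumes that links which do not match the pattern are dropped, that a link must match the pattern in full, and that the input set is left unmodified. The class name `FilterSketch` and the example URLs are hypothetical.

```java
import java.util.HashSet;
import java.util.regex.Pattern;

// Hypothetical stand-in for OutlinkExtractor.filter; not the real class.
final class FilterSketch {

    // Keeps only the links that fully match the pattern (assumed semantics).
    static HashSet<String> filter(HashSet<String> set, Pattern matcher) {
        HashSet<String> kept = new HashSet<>();
        for (String link : set) {
            if (matcher.matcher(link).matches()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        HashSet<String> links = new HashSet<>();
        links.add("http://example.org/a.html");
        links.add("http://other.example.com/b.html");

        // Restrict the crawl to a single host.
        Pattern sameHost = Pattern.compile("^http://example\\.org/.*");
        System.out.println(filter(links, sameHost));
    }
}
```

Returning a new set rather than mutating the argument is a design choice of this sketch; the original method may well modify and return the set it was given.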
public static HashSet<String> extractOutlinks(String html, String url) throws org.htmlparser.util.ParserException
Parameters:
html - the HTML to extract the outlinks from.
url - the URL where the current document was found.
Throws: org.htmlparser.util.ParserException

public static String consumeStream(InputStream stream) throws IOException
Consumes a given InputStream and returns a string consisting of the HTML code of the site.
Throws: IOException

public static String extractBaseUrl(String url)
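To illustrate the idea behind extractBaseUrl(String), the sketch below derives a base URL as scheme plus authority and shows how a relative outlink could then be made absolute. It relies on `java.net.URI` rather than on whatever the library uses internally. The class name `BaseUrlSketch`, the helper `toAbsolute`, and the choice to return `null` for input without a scheme and host are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for OutlinkExtractor.extractBaseUrl; not the real class.
final class BaseUrlSketch {

    // Returns "scheme://authority", or null if the input has no such prefix (assumed contract).
    static String extractBaseUrl(String url) {
        try {
            URI uri = new URI(url);
            if (uri.getScheme() == null || uri.getAuthority() == null) {
                return null;
            }
            return uri.getScheme() + "://" + uri.getAuthority();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Hypothetical helper: resolves a possibly relative link against the page it was found on.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.org/docs/index.html";
        System.out.println(extractBaseUrl(page));
        System.out.println(toAbsolute(page, "/about"));
        System.out.println(toAbsolute(page, "page2.html"));
    }
}
```

Resolving against the full page URL, as `toAbsolute` does, also handles links relative to the current directory; prepending only the base URL would be correct for root-relative links such as `/about` alone.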
public static boolean isValid(String s)
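The check described for isValid(String) could look roughly like the sketch below: reject URLs ending in a suffix that cannot be parsed as HTML, then require a base URL to match at index 0 of the string. The suffix list (the documentation names only PDF), the base URL pattern, and the class name `IsValidSketch` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-in for OutlinkExtractor.isValid; not the real class.
final class IsValidSketch {

    // Assumed list of suffixes; the source only mentions PDF.
    private static final List<String> UNPARSABLE = Arrays.asList(".pdf", ".zip", ".jpg", ".png");

    // Assumed shape of a base URL: http(s) scheme followed by a host.
    private static final Pattern BASE_URL =
            Pattern.compile("https?://[^/?#\\s]+", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        if (s == null) {
            return false;
        }
        String lower = s.toLowerCase(Locale.ROOT);
        for (String suffix : UNPARSABLE) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        // lookingAt() succeeds only if the pattern matches starting at index 0.
        return BASE_URL.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://example.org/index.html"));
        System.out.println(isValid("http://example.org/paper.pdf"));
        System.out.println(isValid("see http://example.org/"));
    }
}
```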
Copyright © 2016. All rights reserved.