Package de.jungblut.crawl
Class MultithreadedCrawler<T extends FetchResult>
java.lang.Object
    de.jungblut.crawl.MultithreadedCrawler<T>

All Implemented Interfaces:
    Crawler<T>

public final class MultithreadedCrawler<T extends FetchResult>
extends java.lang.Object
implements Crawler<T>
Fast multithreaded crawler. By default it starts a fixed thread pool of 32 threads, each fed batches of 10 URLs at a time. It is designed primarily for speed and to saturate the available bandwidth; depending on your connection, you may want to retune the thread pool size and the batch size. For the author's 6k ADSL line, 32 threads with batches of 10 URLs work well. These parameters scale roughly linearly, since the class has almost no contention and only small sequential sections. Visited URLs are tracked with a bloom filter, so the memory footprint stays low.

Author:
    thomas.jungblut
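The threading pattern described above (a fixed pool where each worker task receives a whole batch of URLs, not a single one) can be sketched with plain `java.util.concurrent` primitives. This is a minimal standalone illustration, not the library's implementation; `fetchOne` is a hypothetical stand-in for the real fetch-and-extract step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchPoolSketch {

    static final int THREAD_POOL_SIZE = 32; // pool size from the class description
    static final int BATCH_SIZE = 10;       // URLs handed to a worker at once

    // Hypothetical stand-in for the real HTTP fetch + extraction logic.
    static String fetchOne(String url) {
        return "fetched:" + url;
    }

    public static List<String> crawl(List<String> urls) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_POOL_SIZE);
        List<Future<List<String>>> futures = new ArrayList<>();
        // Submit URLs in batches of BATCH_SIZE so per-task overhead stays low
        // and the workers keep the bandwidth saturated.
        for (int i = 0; i < urls.size(); i += BATCH_SIZE) {
            List<String> batch = urls.subList(i, Math.min(i + BATCH_SIZE, urls.size()));
            futures.add(pool.submit(() -> {
                List<String> results = new ArrayList<>();
                for (String url : batch) {
                    results.add(fetchOne(url));
                }
                return results;
            }));
        }
        List<String> all = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            all.addAll(f.get()); // propagates ExecutionException, as process(...) declares
        }
        pool.shutdown();
        return all;
    }
}
```

Because there is no shared state between batches, scaling the pool size up has no contention cost, which matches the "almost no contention and small sequential code" claim above.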
Constructor Summary

Constructors:
    MultithreadedCrawler(int threadPoolSize, int batchSize, int fetches, Extractor<T> extractor, ResultWriter<T> writer)
        Constructs a new multithreaded crawler.
    MultithreadedCrawler(int fetches, Extractor<T> extractor, ResultWriter<T> writer)
        Constructs a new multithreaded crawler with 32 threads working on batches of 10 URLs at a time.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description static voidmain(java.lang.String[] args)voidprocess(java.lang.String... seedUrls)Starts the crawler, starting by the seedURL.voidsetup(int fetches, Extractor<T> extractor, ResultWriter<T> writer)Setups this crawler.
-
Constructor Detail
-
MultithreadedCrawler

public MultithreadedCrawler(int threadPoolSize, int batchSize, int fetches, Extractor<T> extractor, ResultWriter<T> writer)
                     throws java.io.IOException

Constructs a new multithreaded crawler.

Parameters:
    threadPoolSize - the number of threads to use.
    batchSize - the number of URLs a batch for a single thread should contain.
    fetches - the number of URLs to fetch.
    extractor - the extraction logic.
    writer - the writer.
Throws:
    java.io.IOException
-
MultithreadedCrawler

public MultithreadedCrawler(int fetches, Extractor<T> extractor, ResultWriter<T> writer)
                     throws java.io.IOException

Constructs a new multithreaded crawler with 32 threads working on batches of 10 URLs at a time.

Parameters:
    fetches - the number of URLs to fetch.
    extractor - the extraction logic.
    writer - the writer.
Throws:
    java.io.IOException
-
Method Detail
-
setup

public final void setup(int fetches, Extractor<T> extractor, ResultWriter<T> writer)
                 throws java.io.IOException

Description copied from interface: Crawler
Sets up this crawler.

Specified by:
    setup in interface Crawler<T extends FetchResult>
Parameters:
    fetches - the maximum number of fetches it should perform.
    extractor - the given Extractor to extract a FetchResult.
    writer - the ResultWriter to write the result to a sink.
Throws:
    java.io.IOException
-
process

public final void process(java.lang.String... seedUrls)
                   throws java.lang.InterruptedException,
                          java.util.concurrent.ExecutionException

Description copied from interface: Crawler
Starts the crawler from the given seed URLs. The real logic is implemented by the crawler itself.

Specified by:
    process in interface Crawler<T extends FetchResult>
Throws:
    java.lang.InterruptedException
    java.util.concurrent.ExecutionException
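While processing, the crawler must avoid revisiting URLs; the class description says this check is backed by a bloom filter so the memory footprint stays low. A minimal standalone sketch of such a visited-set filter is below: it uses two cheap hash functions over a fixed `BitSet`, and it is an illustration of the idea only, not the library's actual filter (which may use different hashes and sizing).

```java
import java.util.BitSet;

public class VisitedFilterSketch {

    private final BitSet bits;
    private final int size;

    public VisitedFilterSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash functions derived from hashCode; production bloom filters
    // typically use stronger hashes (e.g. murmur) and tune the hash count.
    private int h1(String url) {
        return Math.floorMod(url.hashCode(), size);
    }

    private int h2(String url) {
        return Math.floorMod(url.hashCode() * 31 + url.length(), size);
    }

    /** Returns true if the URL was possibly seen before (false positives are possible). */
    public boolean contains(String url) {
        return bits.get(h1(url)) && bits.get(h2(url));
    }

    /** Marks a URL as visited by setting both of its hash bits. */
    public void add(String url) {
        bits.set(h1(url));
        bits.set(h2(url));
    }
}
```

A crawl loop would call `contains` before enqueueing a URL and `add` once it is fetched; a false positive only causes a page to be skipped, never fetched twice, which is an acceptable trade-off for a crawler.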
-
main

public static void main(java.lang.String[] args)
                 throws java.lang.InterruptedException,
                        java.util.concurrent.ExecutionException,
                        java.io.IOException

Throws:
    java.lang.InterruptedException
    java.util.concurrent.ExecutionException
    java.io.IOException