Package de.jungblut.crawl
Interface Summary

Interface: Crawler<T extends FetchResult>
Description: Basic crawler interface. All implementations should implicitly provide a constructor with the same arguments as setup and redirect the call to it.

Interface: ResultWriter<T extends FetchResult>
Description: Result-writing interface.
Class Summary

Class: ConsoleResultWriter<T extends FetchResult>
Description: Simple class that outputs to the console.

Class: FetchResult
Description: Fetch result class; contains the origin URL and its outlinks for further crawling.

Class: FetchResultPersister<T extends FetchResult>
Description: Asynchronous persister thread that takes a ResultWriter and handles the logic behind asynchronous writing to disk or to an arbitrary sink implemented by the ResultWriter.

Class: FetchThread<T extends FetchResult>
Description: Callable fetcher that extracts, with a given Extractor, the content from a given list of URLs.

Class: MultithreadedCrawler<T extends FetchResult>
Description: Fast multithreaded crawler; starts a fixed thread pool of 32 threads, each of which is fed 10 URLs at once.

Class: ResultWriterAdapter<T extends FetchResult>
Description: Empty adapter class for a ResultWriter.

Class: SequenceFileResultWriter<T extends FetchResult>
Description: Writes the result into the sequence file "files/crawl/result.seq".

Class: SequentialCrawler<T extends FetchResult>
Description: Sequential crawler, mainly for debugging or development.
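
The MultithreadedCrawler description (a fixed thread pool of 32 threads, each fed 10 URLs at once) can be illustrated with a minimal sketch built only on java.util.concurrent. This is not the package's actual code: the class name BatchedFetchSketch, the methods partition and fetchAll, and the Function-based stand-in for an Extractor are all hypothetical, and no real network access is performed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; names here are NOT part of de.jungblut.crawl.
public class BatchedFetchSketch {

    // Values taken from the MultithreadedCrawler description.
    static final int THREADS = 32;
    static final int BATCH_SIZE = 10;

    /** Splits the list into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Submits one Callable per batch of URLs to a fixed thread pool and
     * collects the results in submission order. The extractor function is an
     * assumed stand-in for whatever turns a URL into a result.
     */
    static <R> List<R> fetchAll(List<String> urls, Function<String, R> extractor)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<String> batch : partition(urls, BATCH_SIZE)) {
                Callable<List<R>> task = () -> {
                    List<R> out = new ArrayList<>();
                    for (String url : batch) {
                        out.add(extractor.apply(url));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                results.addAll(future.get()); // blocks until that batch is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            urls.add("http://example.invalid/page" + i); // placeholder URLs
        }
        // Stub extractor: tags the URL instead of downloading anything.
        List<String> results = fetchAll(urls, url -> "fetched:" + url);
        System.out.println(results.size() + " results");
    }
}
```

Handing each worker a batch rather than a single URL keeps the number of submitted tasks small; a real crawler would also need to feed newly discovered outlinks back into the queue, which this sketch leaves out.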