Class MultithreadedCrawler<T extends FetchResult>

  • All Implemented Interfaces:
    Crawler<T>

    public final class MultithreadedCrawler<T extends FetchResult>
    extends java.lang.Object
    implements Crawler<T>
    A fast multithreaded crawler. It starts a fixed thread pool of 32 threads, each of which is fed batches of 10 URLs at a time. It is designed primarily for speed and to use all of the available bandwidth. Depending on your connection's bandwidth, you may want to retune the thread pool size and the number of URLs per batch. On my 6k ADSL line it works fine with 32 threads and batches of 10 URLs. You may scale these values up linearly, since this class has almost no contention and very little sequential code. A bloom filter is used to check whether a URL has already been visited, so the memory footprint stays low.
    Author:
    thomas.jungblut
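The class description mentions a bloom filter for the visited-URL check. The following is a minimal sketch of that idea using only the JDK. The class name VisitedUrlFilter, its methods, the hash function and the sizing values are all hypothetical and chosen for illustration; the filter actually used by MultithreadedCrawler may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only; not the filter used by MultithreadedCrawler.
final class VisitedUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    VisitedUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded FNV-1a style hash over the UTF-8 bytes of the URL.
    private static int hash(String url, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(String url, int i) {
        int h1 = hash(url, 0);
        int h2 = hash(url, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Marks the URL as visited. Returns true if it was definitely not seen before. */
    synchronized boolean markVisited(String url) {
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = index(url, i);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    /** Returns false if the URL was never added; true means "probably visited". */
    synchronized boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A filter like this stores only bits, not the URLs themselves, which is why memory use stays low. The trade-off is a small rate of false positives: a URL that was never fetched may occasionally be reported as visited and therefore skipped.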
    • Method Summary

      Modifier and Type Method Description
      static void main​(java.lang.String[] args)  
      void process​(java.lang.String... seedUrls)
      Starts the crawler from the given seed URLs.
      void setup​(int fetches, Extractor<T> extractor, ResultWriter<T> writer)
      Sets up this crawler.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • MultithreadedCrawler

        public MultithreadedCrawler​(int threadPoolSize,
                                    int batchSize,
                                    int fetches,
                                    Extractor<T> extractor,
                                    ResultWriter<T> writer)
                             throws java.io.IOException
        Constructs a new Multithreaded Crawler.
        Parameters:
        threadPoolSize - the number of threads to use.
        batchSize - the number of URLs a batch for a thread should contain.
        fetches - the number of URLs to fetch.
        extractor - the extraction logic.
        writer - the writer.
        Throws:
        java.io.IOException
      • MultithreadedCrawler

        public MultithreadedCrawler​(int fetches,
                                    Extractor<T> extractor,
                                    ResultWriter<T> writer)
                             throws java.io.IOException
        Constructs a new Multithreaded Crawler with 32 threads, each working on batches of 10 URLs at a time.
        Parameters:
        fetches - the number of URLs to fetch.
        extractor - the extraction logic.
        writer - the writer.
        Throws:
        java.io.IOException
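To illustrate how the threadPoolSize and batchSize parameters interact, here is a rough sketch of feeding URL batches to a fixed thread pool with the standard java.util.concurrent API. The class BatchedFetchSketch, its methods and the fetcher function are hypothetical stand-ins for illustration, not part of this library, and the crawler's internal scheduling may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; not the crawler's actual implementation.
final class BatchedFetchSketch {

    // Splits the items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Submits one task per batch; each task processes its URLs sequentially.
    static <R> List<R> fetchAll(List<String> urls, int threadPoolSize, int batchSize,
                                Function<String, R> fetcher)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<String> batch : partition(urls, batchSize)) {
                Callable<List<R>> task = () -> {
                    List<R> results = new ArrayList<>(batch.size());
                    for (String url : batch) {
                        results.add(fetcher.apply(url));
                    }
                    return results;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<R> all = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 32 threads and batches of 10, up to 320 URLs are in flight or queued per round of work. Larger batches reduce scheduling overhead per URL, while more threads help when individual fetches spend most of their time waiting on the network.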
    • Method Detail

      • setup

        public final void setup​(int fetches,
                                Extractor<T> extractor,
                                ResultWriter<T> writer)
                         throws java.io.IOException
        Description copied from interface: Crawler
        Sets up this crawler.
        Specified by:
        setup in interface Crawler<T extends FetchResult>
        Parameters:
        fetches - the maximum number of fetches to perform.
        extractor - the given Extractor to extract a FetchResult.
        writer - the ResultWriter to write the result to a sink.
        Throws:
        java.io.IOException
      • process

        public final void process​(java.lang.String... seedUrls)
                           throws java.lang.InterruptedException,
                                  java.util.concurrent.ExecutionException
        Description copied from interface: Crawler
        Starts the crawler from the given seed URLs. The actual crawling logic is implemented by the crawler itself.
        Specified by:
        process in interface Crawler<T extends FetchResult>
        Throws:
        java.lang.InterruptedException
        java.util.concurrent.ExecutionException
      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.InterruptedException,
                                java.util.concurrent.ExecutionException,
                                java.io.IOException
        Throws:
        java.lang.InterruptedException
        java.util.concurrent.ExecutionException
        java.io.IOException