Distributed Web crawling using Apache Spark - Is it Possible?

How about this way:

Your application would get a set of website URLs as input for your crawler. If you were implementing it as a normal (non-Spark) app, you might do it as follows:

  1. Split all the web pages to be crawled into a list of separate sites, each small enough to be handled well by a single thread. For example: if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
  2. Assign each base URL (www.example.com/news/20150401) to a single thread; this is where the actual data fetching happens
  3. Save the result of each thread to the file system (see the thread-based sketch after this list)
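
Here is a minimal non-Spark sketch of those three steps. It assumes one base URL per day from the example range above, uses scala.io.Source as a stand-in for a real HTTP client, and writes to a placeholder output directory (/tmp/crawl); error handling and politeness are omitted:

import java.nio.file.{Files, Paths}
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

object ThreadedCrawl {
  def main(args: Array[String]): Unit = {
    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

    // Step 1: split the crawl into one base URL per day.
    val days = Iterator.iterate(LocalDate.of(2015, 3, 1))(_.plusDays(1))
      .takeWhile(!_.isAfter(LocalDate.of(2015, 4, 1)))
      .toList
    val urls = days.map(d => s"http://www.example.com/news/${d.format(fmt)}")

    // Step 2: fetch each base URL in its own task on a fixed-size thread pool.
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
    Files.createDirectories(Paths.get("/tmp/crawl"))
    val jobs = urls.map { url =>
      Future {
        val content = scala.io.Source.fromURL(url, "UTF-8").mkString
        // Step 3: save the result of each thread to the file system.
        val name = url.substring(url.lastIndexOf('/') + 1)
        Files.write(Paths.get(s"/tmp/crawl/$name.html"), content.getBytes("UTF-8"))
      }
    }
    Await.result(Future.sequence(jobs), Duration.Inf)
    ec.shutdown()
  }
}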

When the application becomes a Spark one, the same procedure applies, but encapsulated in Spark notions: we can write a custom CrawlRDD that does the same job:

  1. Split sites: def getPartitions: Array[Partition] is a good place to do the split task.
  2. Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] will be spread across all the executors of your application and run in parallel.
  3. Save the RDD into HDFS.

The final program would look roughly like this (X is a placeholder for whatever type holds a crawled page, and the commented-out bodies still need real crawl logic):

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition wraps one base URL; the Partition trait requires an index.
class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

// X is a placeholder for whatever type holds a crawled page, e.g. a (url, content) pair.
class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[CrawlPartition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray[Partition]
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL; if there is one, fill in nextURL and return true,
        // otherwise return false
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as an X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
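
If you do not want to implement a custom RDD, roughly the same effect can be had with Spark's built-in primitives. The following is a minimal sketch under the same assumptions as above (one URL per day, scala.io.Source as a stand-in for a real HTTP client, and the same placeholder HDFS path); a real crawler would add politeness delays, retries and robots.txt handling:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleCrawl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleCrawler"))

    // One base URL per day, as in the split step above (placeholder date range).
    val urls = (1 to 31).map(d => f"http://www.example.com/news/201503$d%02d")

    // One partition per URL, so each fetch runs as its own task on the executors.
    sc.parallelize(urls, urls.size)
      .map { url =>
        // Fetch the page; errors are swallowed only to keep the sketch short.
        val html =
          try scala.io.Source.fromURL(url, "UTF-8").mkString
          catch { case _: Exception => "" }
        (url, html.replaceAll("\\s+", " "))
      }
      .saveAsTextFile("hdfs://path_here") // same placeholder output path as above

    sc.stop()
  }
}

Using one partition per URL keeps each fetch isolated in its own task, which mirrors the one-thread-per-base-URL idea above.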

YES.

Check out the open source project Sparkler (spark-crawler): https://github.com/USCDataScience/sparkler

Check out Sparkler Internals for a flow/pipeline diagram. (Apologies: it is an SVG image, so I couldn't post it here.)

This project wasn't available when the question was posted; however, as of December 2016 it is one of the more active projects!

Is it possible to crawl websites using Apache Spark?

The following points may help you understand why someone would ask such a question, and also help you answer it.

  • The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
  • RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs).
  • There is a widely popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop MapReduce (in fact, Hadoop MapReduce was extracted from the Nutch codebase).
  • If you can do some task in Hadoop MapReduce, you can also do it with Apache Spark (see the sketch after the references below).

[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
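
To make the last point concrete, here is a sketch of one Nutch-style batch round (generate → fetch → parse) expressed as ordinary Spark transformations. It is not Nutch's code: the HDFS paths, the scala.io.Source fetch and the crude link regex are all placeholder assumptions.

import org.apache.spark.{SparkConf, SparkContext}

// A Nutch-style crawl is a sequence of batch phases; each phase is a map/reduce pass,
// so it translates directly into Spark transformations. Illustration only.
object BatchFetchRound {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BatchFetchRound"))

    // "Generate": read the current frontier (one URL per line) from the previous round.
    val frontier = sc.textFile("hdfs:///crawl/frontier")

    // "Fetch" (map phase): download each page; no politeness or retries in this sketch.
    val fetched = frontier.map { url =>
      val html =
        try scala.io.Source.fromURL(url).mkString
        catch { case _: Exception => "" }
      (url, html)
    }.cache() // reused twice below, so avoid fetching pages twice
    fetched.saveAsObjectFile("hdfs:///crawl/segment-001")

    // "Parse": extract outlinks (crude regex, good enough for the sketch) for the next frontier.
    val outlinks = fetched.flatMap { case (_, html) =>
      "href=\"(http[^\"]+)\"".r.findAllMatchIn(html).map(_.group(1))
    }.distinct()
    outlinks.saveAsTextFile("hdfs:///crawl/frontier-next")

    sc.stop()
  }
}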


PS: I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.


When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene-based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not possible natively.
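
The following is only a conceptual sketch of that idea, not Sparkler's actual code: the crawl state lives in an external Solr collection (the collection name crawldb and the url/status fields are assumptions), and each task reads from and writes back to it directly via the SolrJ client, side-stepping RDD immutability for that shared state.

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConverters._

object SolrBackedCrawl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SolrBackedCrawl"))
    val solrUrl = "http://localhost:8983/solr/crawldb" // assumed collection

    // Read the unfetched URLs from the shared store on the driver.
    val client = new HttpSolrClient.Builder(solrUrl).build()
    val todo = client.query(new SolrQuery("status:UNFETCHED").setRows(1000))
      .getResults.asScala.map(_.getFieldValue("url").toString).toSeq
    client.close()

    sc.parallelize(todo, math.max(todo.size, 1)).foreachPartition { urls =>
      // Each task talks to Solr directly: fetch a page, then write its new status back.
      val c = new HttpSolrClient.Builder(solrUrl).build()
      urls.foreach { url =>
        val ok =
          try { scala.io.Source.fromURL(url).mkString; true }
          catch { case _: Exception => false }
        val doc = new SolrInputDocument()
        doc.addField("url", url)
        doc.addField("status", if (ok) "FETCHED" else "FAILED")
        c.add(doc)
      }
      c.commit()
      c.close()
    }
    sc.stop()
  }
}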


Spark adds essentially no value to this task.

Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly at less overhead.

Sure, you could do this on Spark. Just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.


There is a project called SpookyStuff, which is a

Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark

Hope it helps!