cloud-crawler: an open source ruby dsl and distributed processing framework for crawling the web using aws

Posted on April 22, 2013




For the past few weeks, I have taken some time off from pure math to work on an open source platform for crawling the web.  I am happy to announce version 0.1 of the cloud-crawler open source project.

The cloud-crawler is a distributed Ruby DSL for crawling the web using Amazon EC2 micro instances. The goal is to create an end-to-end framework for crawling the web, eventually including the ability to crawl even dynamic JavaScript, and to do so from a pool of spot instances.

This initial version is built using Qless, a Redis-based queue; a Redis-based Bloom filter; and a re-implementation and extension of the Anemone DSL.  It also includes Chef recipes for spooling up nodes on the Amazon cloud, and a Sinatra app, cloud-monitor, to monitor the queue.  The basic layout is shown below, taken from the Slideshare presentation:

[Figure: cloud-crawler architecture]
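To give a flavor of how the pieces fit together, here is a minimal sketch of the Bloom-filter idea used to de-duplicate urls, written against the redis-rb gem. The class, key name, and sizes are illustrative, not the actual cloud-crawler internals.

require 'redis'
require 'digest'

# A Bloom filter over a Redis bitstring: SETBIT marks a url as seen,
# GETBIT checks it.  False positives are possible; false negatives are not.
class RedisBloomFilter
  def initialize(redis, key = 'crawl:seen', bits = 1_000_000, hashes = 4)
    @redis, @key, @bits, @hashes = redis, key, bits, hashes
  end

  # true if every bit for this url is set (already seen, or a rare collision)
  def seen?(url)
    positions(url).all? { |pos| @redis.getbit(@key, pos) == 1 }
  end

  def add(url)
    positions(url).each { |pos| @redis.setbit(@key, pos, 1) }
  end

  private

  # derive the bit positions from salted hashes of the url
  def positions(url)
    (1..@hashes).map { |i| Digest::MD5.hexdigest("#{i}:#{url}").to_i(16) % @bits }
  end
end

bf = RedisBloomFilter.new(Redis.new)
bf.add("http://www.crossfit.com")
puts bf.seen?("http://www.crossfit.com")   # => true

Because the filter lives in Redis rather than in process memory, every worker in the cluster shares the same notion of which urls have already been visited.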

Here is an example crawl, which finds links to all Level 1 Certs on the CrossFit main page:

urls = ["http://www.crossfit.com"]
CloudCrawler::crawl(urls, opts)  do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk| 
       text_for(lnk) =~ /Level 1/i
    end
  end
   cc.on_every_page do |page|
     puts page.url.to_s
   end
end
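Here, focus_crawl selects which of a page's links the crawler will follow, and on_every_page runs on each page as it is fetched; both hooks carry over from the Anemone-style DSL.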

This is a very early pre-release, and we are actively looking for contributors interested in getting involved.  (Also, the web documentation is still in progress.)

Rather than go into details, here we show how to install the crawler and get a test crawl up and running.  

To install on a local machine

(i.e., Mac or Linux; Ruby does not play well with Windows)

I. Dependencies

Ruby 1.9.3 with Bundler   http://gembundler.com

Redis 2.6.x  (stable)     http://redis.io/download

It is suggested to use RVM to install Ruby  https://rvm.io

and to use Git to obtain the source  http://git-scm.com

II.  Installation Steps

II.0  install Ruby 1.9.3 and Redis 2.6.x
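For example, on a Mac, something like the following should work (these commands are illustrative; adjust for your platform):

  curl -L https://get.rvm.io | bash -s stable
  rvm install 1.9.3
  brew install redis      # or build Redis 2.6.x from the tarball at redis.io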

II.1  install bundler

  gem install bundler

II.2 clone the git source

git clone git://github.com/CalculatedContent/cloud-crawler.git

II.3  install the required gems and sources

change directories to where the Gemfile.lock file is located

  cd cloud-crawler/cloud-crawler

install the required gems and sources, and build the gem

   bundle install

to create a complete sandbox, you can say

  bundle install --path vendor/bundle

this will install the cloud-crawler gems in a local bundle repository under vendor/bundle

we use Bundler locally because we use the same setup on the Amazon AWS / EC2 machines

III. Testing the Install

III.1  start the redis server

  redis-server &

III.2  run rake

  bundle exec rake

III.3  run a test crawl

  bundle exec ./test/test_crawl.rb

IV.  Try a real crawl using the DSL

flush the redis database

  redis-cli flushdb

load the first job into redis

  bundle exec ./examples/crossfit_crawl.rb

run the worker job

  bundle exec ./bin/run_worker.rb -n crossfit-crawl
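Because the queue lives in Redis, you should be able to start additional workers against the same queue, on one machine or several, to crawl in parallel.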

V.  To view the queue monitor in a browser

   bundle exec qless-web

this should launch a tab in the web browser.  If this fails, the monitor may still work, and may be visible in your browser at

   localhost:5678

and that’s it: you have a DSL for crawling running locally.

VI.   To run the crawler on AWS and EC2, you will need to set up an Amazon account,  install chef-solo, and create some security groups and S3 buckets.
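As a rough sketch of the one-time AWS setup, something like this should work with the v1 aws-sdk gem; the bucket and security group names below are placeholders, and the names the Chef recipes expect may differ:

require 'aws-sdk'   # v1 api

# read credentials from the environment
AWS.config(
  :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
)

# an S3 bucket for crawl data (hypothetical name)
AWS::S3.new.buckets.create('my-cloud-crawler-bucket')

# a security group for the crawler nodes (hypothetical name)
AWS::EC2.new.security_groups.create('cloud-crawler')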

Stay tuned for extended documentation and  examples, including  seeing the crawler in action on EC2.  Feel free to email to ask questions or to express interest in getting involved.
