A Ruby DSL Design Pattern for Distributed Computing

Posted on August 10, 2013

Cloud Crawler

Frequently in my work in big data and machine learning, I need to run large calculations in parallel.  There are several great tools for this, including Hadoop, StarCluster, GNU Parallel, etc.  The Ruby world lacks comparable tools, although Ruby has had distributed computing for a long time:

http://blog.new-bamboo.co.uk/2012/04/11/the-druby-book-distributed-and-parallel-computing-with-ruby-is-finally-out

Having learned Ruby while at Aardvark, but not being able to read Japanese, I decided to write my own simple tool: a Ruby DSL for distributed computing.  Here I discuss the Ruby DSL design pattern and its first implementation, a distributed web crawler and scraper.

As usual, this post is motivated by a question on Quora: “How can we define our own query language?”

DSL Example

Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:

[Diagram: cloud-crawler architecture]

We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.

In other languages, like Java or Python, we would need a compiler-compiler like ANTLR.  In Ruby, we can use its magical meta-programming, plus the sourcify gem, to quickly build a DSL and execute it remotely.  Let’s take a look at a simple example of what our DSL should look like:

#!/usr/bin/env ruby
$LOAD_PATH << "."
require 'dsl'

include Dsl

Dsl::load do |worker| 
  worker.doit do
     p "I am doing some work in parallel on several workers" 
  end
end

The doit &block is the basic method of our DSL.  We want to invoke doit on the master node, but have it executed remotely on the worker node(s).  We do this using the sourcify gem, which lets us convert the block to source (i.e. Proc#to_source) and ship the source to the worker (e.g. via a queue in Redis), where it is executed by instance_eval(source), or bound as a singleton method. Note: sourcify is a workaround until ruby-core officially supports Proc#to_source. It is a little buggy, so be careful.

Here is a complete example.  The DSL is broken into 3 parts: Driver, DSL::FrontEnd, and DSL::Core.  DSL::FrontEnd plays the role of a traditional parser, and DSL::Core is the remote interpreter. Driver and DSL::FrontEnd are executed on the master, and DSL::Core on the workers.

require 'sourcify'

module Dsl

  module FrontEnd
    def self.included(base)
      base.send :include, InstanceMethods
    end

    module InstanceMethods
      def init(opts={}, &block)
        @doit_block = nil
        yield self if block_given?
      end

      def doit(&block)
        @doit_block = block
        self
      end

      def block_sources
        blocks = {}
        blocks[:doit_block] = @doit_block.to_source
        return blocks
      end

    end
  end

end

DSL::FrontEnd uses the Ruby module-mixin meta-programming pattern; when Driver includes it, the doit method is added to Driver instances.

DSL::load creates an instance of a worker in the do block, which is an instance of Driver and therefore now implements the doit method.

Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.

#!/usr/bin/env ruby
#
# Copyright (c) 2013 Charles H Martin, PhD
#  
#  Calculated Content 
#  http://calculatedcontent.com
#  charles@calculatedcontent.com
#
require 'dsl/front_end'
require 'json'

module Dsl

  def Dsl.load(opts={},&block)
    Driver.load(opts,&block)
  end

  class Driver
    include FrontEnd

    def initialize(opts = {}, &block)
      init(opts)  # init comes from FrontEnd
      yield self if block_given?
    end

    def load
      # store block_sources.to_json in redis
    end

    def self.load(opts={}, &block)
      self.new(opts) do |core|
        yield core if block_given?
        core.load
      end
    end

  end

end

When doit is invoked, the &block is converted to source and then stored in the cache (Redis). When the worker runs, it grabs the block source (out of Redis) and calls perform(source_text), which evaluates the code block in the context of the worker instance using instance_eval.  (Note that in a production system, we may want to compress the source to save memory.)

require 'json'
module Dsl

  module Core
    def self.included(base)
      base.send :extend, ClassMethods
    end

    module ClassMethods
      def perform(source_text)
        @doit_block = JSON.parse(source_text)['doit_block']
        doit
      end

      def doit
        # eval the source to rebuild the Proc, then call it on the worker
        instance_eval(@doit_block).call
      end

    end
  end

end

Also, to be a bit more efficient, we could run the worker blocks in batch mode and evaluate the DSL as a singleton method:

require 'json'

module Dsl
  module FastCore
    def self.included(base)
      base.send :extend, ClassMethods
    end

    module ClassMethods
      def init(source_text)
        source = JSON.parse(source_text)['doit_block']
        # eval once to get a Proc, then bind it as a singleton method
        define_singleton_method(:doit, eval(source))
      end

      def perform(source_text)
        doit
      end
    end
  end
end

Also, notice that because the DSL block is executed on the worker, it does not carry its local context from the master. Variables defined in the do block are not visible inside the doit worker method:

Dsl::load do |worker| 
  mvar = "I am a variable on the master"
  worker.doit do
     p "the worker can not see mvar" 
  end
end

And there we have it: a simple and lightweight way to create a Ruby DSL for distributed computing.

If you would like to see the design pattern in action, you can clone and run the open-source Cloud-Crawler.

Related:

http://charlesmartin14.wordpress.com/2013/04/22/cloud-crawler-an-open-source-ruby-dsl-for-crawling-the-web/

 
