Frequently in my work in big data and machine learning, I need to run large calculations in parallel. There are several great tools for this, including Hadoop, StarCluster, and gnu-parallel. The ruby world lacks comparable tools, although ruby has had distributed computing for a long time:
Having learned ruby while at Aardvark, but not being able to read Japanese, I decided to write my own simple tool: a ruby DSL for distributed computing. Here I discuss the ruby DSL design pattern and its first implementation, for a distributed web crawler and scraper.
As usual, this post is motivated by a question on Quora: “How can we define our own query language?”
Here we show an example of the DSL pattern in action. Imagine we have a master node and several worker nodes, communicating by a queue:
We want to load a block of work onto the master, have it distribute the block to the workers, execute the block in parallel, and cache the results back to the master.
In other languages, like Java or Python, we would need a compiler-compiler like ANTLR. In ruby, we can use magical meta-programming, and the sourcify gem, to quickly build a DSL and execute it remotely. Let's take a look at a simple example of what our DSL should look like:
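Since the original snippet is not reproduced above, here is a minimal, self-contained sketch of the surface syntax we are after. The `DSL.load`/`FrontEnd` stand-in just runs the block in-process so the example executes on its own; in the real tool, the inner block would be shipped to a worker.

```ruby
# Throwaway local stand-in so the snippet runs on its own;
# names follow this post's running example, not a published API.
module DSL
  module FrontEnd
    def doit(&block)
      # In the real tool, this block would be serialized and sent
      # to a worker node; here we just run it locally.
      @result = block.call
    end
    attr_reader :result
  end

  def self.load(&blk)
    worker = Object.new
    worker.extend(FrontEnd)
    worker.instance_eval(&blk)   # the do-block runs against the worker
    worker
  end
end

w = DSL.load do
  doit do
    (1..5).map { |x| x * x }   # the "work" we want distributed
  end
end

w.result  # => [1, 4, 9, 16, 25]
```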
doit &block is the basic method of our DSL. We want to invoke doit on the master node, but have it executed remotely on the worker node(s). We do this using the sourcify gem, which lets us convert the block to source (i.e. Proc#to_source), and ship the source to the worker (i.e. via a queue in redis) to be executed by instance_eval(source), or as a singleton method. Note: sourcify is a workaround until ruby-core officially supports Proc#to_source. It is a little buggy…so be careful.
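To make the round trip concrete without requiring the sourcify gem, here is a sketch where the source string is written out by hand (standing in for what Proc#to_source would return) and a core Queue stands in for redis:

```ruby
queue = Queue.new   # in-memory stand-in for the redis queue

# Master side: sourcify would give us the block's source as a string,
# e.g. some_block.to_source; we hand-write it here to stay dependency-free.
source = "proc { (1..4).reduce(:+) }"
queue << source

# Worker side: rebuild the Proc in the worker's context and run it.
class Worker
  def perform(source_text)
    instance_eval(source_text).call
  end
end

Worker.new.perform(queue.pop)  # => 10
```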
Here is a complete example. The DSL is broken into 3 parts: DSL::FrontEnd plays the role of a traditional parser, and DSL::Core is the remote interpreter. DSL::FrontEnd is executed on the master, and DSL::Core on the workers.
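Since the full listing is not reproduced here, the following is a hedged sketch of how those pieces could fit together, with a plain Hash standing in for redis and a fixed source string in place of a real Proc#to_source call (CACHE, Driver, and Worker are illustrative names):

```ruby
CACHE = {}  # pretend this is redis

module DSL
  # Runs on the master: collects blocks and caches their source.
  module FrontEnd
    def doit(&block)
      # A real implementation would call block.to_source (sourcify);
      # we cheat with a fixed string so the sketch is self-contained.
      CACHE[:job] = "proc { 6 * 7 }"
    end
  end

  # Runs on a worker: pulls source out of the cache and evaluates it.
  module Core
    def perform(source_text)
      instance_eval(source_text).call
    end
  end

  def self.load(&blk)
    driver = Driver.new
    driver.instance_eval(&blk)   # the do-block runs against the driver
    driver
  end
end

class Driver
  include DSL::FrontEnd
end

class Worker
  include DSL::Core
end

DSL.load { doit { 6 * 7 } }      # master caches the job source
Worker.new.perform(CACHE[:job])  # => 42
```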
DSL::FrontEnd uses the ruby module mixin meta-programming pattern; when Driver includes it, the doit method is added to the Driver class. DSL::load creates an instance of a worker in the do block, which is an instance of Driver and therefore now implements the doit method.
Using this pattern, we can create a super simple, extensible DSL that can support a wide range of methods to run on the worker nodes.
When doit is invoked, the &block is converted to source and then stored in cache (redis). When the worker is run, it grabs the block source (out of redis) and calls perform(source_text), which then evaluates the code block in the context of the worker instance using instance_eval. (Note that in a production system, we may want to compress the source to save memory.)
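As a sketch of that aside, the source string compresses well with ruby's built-in Zlib before it goes into the cache (the variable names are illustrative):

```ruby
require "zlib"

# Master side: deflate the block source before caching it.
source = "proc { Array.new(3) { |i| i + 1 } }"
compressed = Zlib::Deflate.deflate(source)

# Worker side: inflate the source, then evaluate it as before.
worker = Object.new
worker.instance_eval(Zlib::Inflate.inflate(compressed)).call  # => [1, 2, 3]
```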
Also, to be a bit more efficient, we could run the worker blocks in a batch mode and evaluate the DSL as a singleton method:
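A sketch of that variant: evaluating the shipped source as a method definition turns it into a singleton method on the worker, which can then be called once per item in the batch (the doit signature and batch contents are illustrative):

```ruby
worker = Object.new

# Shipped source defines the work as a method rather than a bare proc;
# instance_eval'ing a def adds it to the worker's singleton class.
source = <<~RUBY
  def doit(x)
    x * x
  end
RUBY
worker.instance_eval(source)

batch = [1, 2, 3]
batch.map { |x| worker.doit(x) }  # => [1, 4, 9]
```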
Also, notice that because the DSL block is being executed on the worker, it does not carry its local context from the master. Variables defined in the do block are not visible inside the doit worker method:
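A minimal sketch of the gotcha: a local from the master's scope does not survive the trip as source, so referencing it on the worker raises a NameError (the variable n is illustrative):

```ruby
worker = Object.new

# On the master, the author might have written:
#   n = 10
#   doit { n * 2 }
# but the worker only ever receives the block's source text:
source = "proc { n * 2 }"

begin
  worker.instance_eval(source).call
rescue NameError => e
  e.message  # n is undefined in the worker's context
end
```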
And there we have it: a simple, lightweight way to create a ruby DSL for distributed computing.
If you would like to see the design pattern in action, you can clone and run the open source Cloud-Crawler.