Introducing Drake, a kind of ‘make for data’ | Better HN

108 comments

56 comments · 18 top-level

jonathanjaeger13y ago· 7 in thread

Am I the only one who immediately thought of Drake the rapper? He's pretty famous, not sure if this was considered during the naming process. Even if it's not a legal problem, it's an SEO/social media problem.

wickeand00013y ago

Although I don't agree that the name "Drake" is an issue, I do find it interesting that an even more apt name for an application of this type might be "Usher"!

jonathanjaeger13y ago

Ha touché

sehugg13y ago

A drake is a male duck. They were pretty famous back in the day.

prospero13y ago

Also a privateer and a mythical beast. I think there's sufficient prior art on this one.

jonathanjaeger13y ago

True, but I wouldn't call my product 'Queen', 'Cream', 'Journey', or another noun that could be confused with someone or something famous. This distracts from the conversation of the product, so perhaps I shouldn't have brought it up.

logn13y ago

It's "Drake" but if you watch the video the "rake" is silent making it just "D". :P

aboytsov13y ago

Haha, this is so funny. :)

Sorry, guys, D was our working codename, and it slipped off my tongue, I guess, more than several times. :)

ori_b13y ago· 6 in thread

It looks like all of the drakefiles could be replaced pretty trivially with Makefiles. Replacing '<-' with ':', ';' with '#', and '$INPUT', '$OUTPUT' with '$<' and '$@', and inserting shell invocations of the Python interpreter looks like it would do the job.

The major differences I see are:

    - Inline support for Python et al.
    - Confirming the steps that will be taken.
    - HDFS support.

Are there any other big differences?

aboytsov13y ago

The example in the blogpost is understandably trivial, and it can be implemented in almost any Make-like system.

The concept of Make is not unique. Everything that has dependencies and executes steps is similar to Make in concept. Drake is no exception, and it can be replaced with Make, but no more so than Rake, Ant or Maven can be replaced by Make. That is, if it's trivial - yes. Just a bit more complicated - no.

Some things are merely painful to implement with Make, some are just impossible:

  - multiple outputs
  - no-input and no-output steps
  - HDFS support
  - Hadoop's partial files support (part-?????)
  - forced execution of any subbranch, up or down the tree or any individual targets (crucial for debugging and development)
  - target exclusions
  - protocol abstraction - inline Python is just one example
  - tags
  - branching
  - methods

These are just what's implemented already. Other things are planned such as:

  - automated data versioning (backup and revert)
  - parallelization
  - real-time status console
  - retries, email notifications
  - etc.

Requirements for building executables and working with large, complicated and expensive data workflows are quite visibly different, and the most important thing about Drake is that it provides the platform for convenient features (such as versioning or email notifications) to be implemented. And once they are, every data workflow can take advantage of them.

I guess, if Make was really, really extendable, we could have considered it as a platform for all this. But it's not, and hacking all of that into Make's source code in C would be, I'm sure, a much greater pain than writing Drake.

Artem.

blablabla12313y ago

retries and email notifications is a good one. Currently I do something similar with cronjobs, rsync, shell scripts and some custom tools -- on multiple boxes. (Email notification with mailx) Works in theory pretty well, in practice race conditions become a problem, making it sometimes annoying because I need to run things manually when I need up to date processed data. If I had retries, this would be an improvement.
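For illustration, here is a minimal Python sketch of what a retry wrapper around a single step could look like. This is not Drake code (retries are only listed as planned above); `run_with_retries`, its parameters, and the `on_failure` hook are hypothetical names, and the hook is simply where something like a mailx notification could be wired in.

```python
import time


def run_with_retries(step, attempts=3, delay_seconds=0.0, on_failure=None):
    """Call `step()` until it succeeds or `attempts` tries are used up.

    `step` is any zero-argument callable (hypothetical stand-in for a
    workflow step). `on_failure(attempt, error)` is an optional hook that
    is called after every failed try, e.g. to send a notification.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as error:  # a sketch: real code might be pickier
            last_error = error
            if on_failure is not None:
                on_failure(attempt, error)
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every try failed: re-raise the last error so the caller sees it.
    raise last_error
```

A step that fails on a transient race would then be retried automatically instead of needing a manual rerun.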

JoshTriplett13y ago

Make can support Python, or any other language you'd like. Just set ONESHELL to avoid splitting commands by line, and then set SHELL to your preferred language interpreter. Make will then hand that interpreter the entire body of commands to rebuild a target.

aboytsov13y ago

Drake supports "protocol" abstraction, which is much more than just specifying an interpreter. Python is a trivial protocol, not much more complicated than shell. There are slightly more complicated protocols, for example, "eval", which runs the first line as a shell command before putting everything else in the $CMDS environment variable. There could be protocols for running an HBase query, a Pig query, a Cascalog query, or an SQL query. Some of these things could involve building a JAR file and giving it to the Hadoop binary. Currently only a handful of protocols are implemented, but more are described in the spec.
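As a rough illustration of the "eval" behavior described above, here is a Python sketch (Drake itself is written in Clojure, and this is not its implementation). It follows one reading of that description: the first line of the step body is the shell command, and the remaining lines are passed to it in `$CMDS`. The function names are hypothetical.

```python
import os
import subprocess


def prepare_eval_step(body, base_env=None):
    """Split a step body into (shell command, environment).

    The first line becomes the command; the remaining lines are stored
    in the CMDS environment variable, as the comment above describes.
    """
    first_line, _, rest = body.strip("\n").partition("\n")
    env = dict(os.environ if base_env is None else base_env)
    env["CMDS"] = rest
    return first_line, env


def run_eval_step(body):
    """Run the prepared command through the shell and return its exit code."""
    command, env = prepare_eval_step(body)
    return subprocess.run(command, shell=True, env=env).returncode
```

Under this sketch, a body whose first line is `python -c "$CMDS"` would hand the rest of the body to Python as the program text.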

dirtyvagabondOP13y ago

Make was a major inspiration for us, and so Drake definitely has similarities to Make. The differences you list were non-trivial to us in usefulness, but of course YMMV. Also, there are a lot of (possibly) interesting future features described in the spec.

https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...

andrewflnr13y ago

Does it have to have big differences? It's a slightly nicer system with a fairly shallow initial learning curve. If you're on a new project, what's the problem? I'm wondering how well it would work as an actual make replacement.

madhadron13y ago· 5 in thread

I wrote a workflow processing system (http://github.com/madhadron/bein) that's still running around the bioinformatics community in southern Switzerland, and came to the conclusion that something like make isn't actually what you want. Unfortunately, what you want varies with the task at hand. The relevant parameters are:

  - The complexity of your analysis.
  - How fixed your pipeline is over time.
  - The size of a data set.
  - How many data sets you are running the analysis on.
  - How long the analysis takes to run.

If you are only doing one or two tasks, then you barely need a management tool, though if your data is huge, you probably want memoization of those steps. If your pipeline changes continuously, as it does for a scientist mucking around with new data, then you need executions of code to be objects in their own right, just like code.

Make-like systems are ideal when:

  - Your analysis consists of tens of steps.
  - You have only a couple of data sets that you're running a given analysis on.
  - The analysis takes minutes to hours, so you need memoization.

Another Swiss project, openBIS, is ideal for big analyses that are very fixed, but will be run on large numbers of data sets. It's very regimented and provides lots of tools for curating data inputs and outputs. The system I wrote was meant for day-to-day analysis where the analysis would change with every run, was only being run on a few data sets, and the analysis took minutes to hours to run. Having written it and had a few years to think about it, there are things I would do very differently today (notably, make executions much more first class than they are, starting with an omniscient debugger integrated with memoization, which is effectively an execution browser).

So bravo for this project for making a tool that fits their needs beautifully. More people need to do this. Tools to handle the logistics of data analysis are not one size fits all, and the habits we have inherited are often not what we really want.

zmmmmm13y ago

Heh, all the bioinformaticians come out of the woodwork :-)

Here's yet-another-project for bioinformatics workflows that I've been involved in. This one based on Groovy:

http://bpipe.org

I agree with your sentiments about the nature of pipelines vs build system a la make. Many many people start down the path of putting the classic DAG dependency analysis as the foundation of their needs when in fact, this isn't so much of a problem in real situations, and is even somewhat counterproductive because it forces you to declare a lot of things in a static way that actually aren't static at all. I've found tools like this completely break down when your data starts determining your workflow (eg: if the file is bigger than X I will break it in n parts and run them in parallel, otherwise I will continue on and do it using a different command entirely in memory).
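To make the "data determines the workflow" example concrete, here is a small Python sketch. It is not taken from Bpipe or Drake; `plan_steps`, the step labels, and the byte-range representation are all hypothetical. The point is only that the list of steps is computed from the input size at run time, which a statically declared DAG cannot express directly.

```python
import math


def plan_steps(size_bytes, threshold_bytes, n_parts):
    """Return the steps to run for one input, chosen by its size.

    Inputs at or below the threshold get a single in-memory step.
    Larger inputs are split into at most `n_parts` byte ranges that
    could be processed in parallel.
    """
    if size_bytes <= threshold_bytes:
        return [("in_memory", 0, size_bytes)]
    chunk = math.ceil(size_bytes / n_parts)
    return [
        ("chunk", start, min(start + chunk, size_bytes))
        for start in range(0, size_bytes, chunk)
    ]
```

In a real pipeline the size would come from something like `os.path.getsize(path)`, and each tuple would map to an actual command.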

In my experience the problems in big data analysis are more about the complexity of managing the process, achieving as much parallelization with as little effort and craziness as possible (don't see any mention of that in Drake), documenting what actually happened when something ran so you can figure it out later, and most of all, flexibility in modifying it since it changes every day of the week.

One mistake that Drake appears to make (again, from my quick skim), is interweaving the declaration of the "stages" of the pipeline (what they do) and the dependencies between them (the order they run in). This makes your pipeline stages less reusable and the pipeline harder to maintain. Bpipe completely separates these things out, which is something I like about it.

aboytsov13y ago

Thanks for your feedback. We do mention parallelization in the designdoc, it's just not implemented yet. It's quite easy to add though. We have a lot of features spec'ed out, but not implemented.

I would appreciate it if you elaborated on separating step definitions from dependency definitions. In my mind, they are the same thing. If you mean that steps might not be connected by an input-output relationship, but still have dependencies, Drake fully supports that via tags. If you mean that steps might be connected through input-output files, but not depend upon each other, I frankly don't see how that's possible. And if you mean some other syntax which more clearly separates the two, Drake supports methods, which achieve exactly that. If you mean something else, I would love to see an example.

Thanks!

aboytsov13y ago

Thank you very much. We're really looking forward to other people using this tool.

You raise some interesting points (for example, frequently changing code), which we ran into as well. Our current approach to it is not as fundamental, and basically includes the ability to force a rebuild of any target and everything down the tree, as well as methods, and you can also add your binaries as a step's dependency.

I'm sure as we and other people use the tool, we'll have better ideas. For example, Drake could automatically sense that the step's definition has changed and offer to rebuild or dismiss.

Other points you raised are also definitely worth thinking about.

kisielk13y ago

I'm also a developer of a workflow processing system, though not open-source, and fairly specific to our company. A few more things that are desirable if you have a lot of data or need to do processing that takes a lot of time is the ability to run stages in parallel, and also to distribute the computation over a cluster of machines.

aboytsov13y ago

Great points.

Drake supports the ability to run stages in parallel (at least in theory) - it's been speced out (https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...), just not implemented yet. But of course, once you have the entire dependency graph, it's easy to know what can be run in parallel and what cannot.
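As an illustration of why the dependency graph makes this easy, here is a Python sketch that groups steps into batches whose members do not depend on each other. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) and is not Drake's design; `parallel_batches` and the file names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group steps into batches that could run in parallel.

    `deps` maps each step to the set of steps it depends on. Every step
    in a batch has all of its dependencies in earlier batches.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For example, if `evergreens.csv` and `stats.txt` both depend only on `contracts.csv`, they end up in the same batch and could be run at the same time.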

As for distributing computations, our approach is that it lies outside of Drake's scope. Drake doesn't know what's going on in steps. But you can always implement a step that would use distributed computation, for example, by submitting a Hadoop job, or in any other way. The only requirement Drake has is for the step to be synchronous, i.e. it must not return before all the computation is complete. But even that can be changed for some cases.

abraininavat13y ago· 4 in thread

Why Clojure?

hvs13y ago

Probably because that's one of the languages that they use internally.

http://www.factual.com/jobs/oTR1Vfwq/Software-Engineer---Pal...

Another answer would be, "Why not?"

aboytsov13y ago

We love Clojure. Lisp is an extremely powerful language, and Clojure brings all this to the practical JVM world. And Lisp is quite good in operating on lists and graphs, which is a big part of Drake.

pencilcode13y ago

Out of curiosity, why did you go the Clojure route instead of the Scala route? From what I understand, Scala has more libraries available, including AI and NLP libraries, but maybe my impression is not correct?

SilasX13y ago

Some see a Lisp and ask "why"? I see a non-Lisp that could be replaced with a Lisp and ask "why not?"

moonboots13y ago· 3 in thread

Djb redo[1], a make alternative, feels like a good fit for this type of data manipulation and dependency representation. Below is a port of the first example. The build script is just shell, so you can do stuff like embed python with a heredoc. One bit of syntactic sugar is that redo assumes stdout is the desired contents of the generated file, so you don't need to explicitly pipe to an OUTPUT variable.

  #!/bin/sh
  case $1 in
  contracts.csv)
    curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt
    ;;
  evergreens.csv)
    redo-ifchange contracts.csv
    grep Evergreen contracts.csv
    ;;
  report.txt)
    input=evergreens.csv
    redo-ifchange $input
    python2 <<-EOF
  linecount = len(file("$input").readlines())
  print("File $input has {0} lines.\n".format(linecount))
  EOF
    ;;
  esac

[1] https://github.com/apenwarr/redo

groby_b13y ago

Taking off my programmer hat, and putting on my "I just want to get data moved along" hat - drake is much more readable than this.

The ideas behind redo are brilliant, but the way to express them in this particular implementation is not so fun.

aboytsov13y ago

Please see my response to Make comparison:

http://news.ycombinator.com/item?id=5111527

I suspect most of the points I made would be applicable to redo as well, if not more so. Trivial things don't require Drake. Heck, they often times don't require Make as well - just put it in a linear shell script if the steps are not too expensive. It's when things are getting complicated you need something like Drake.

moonboots13y ago

Redo lacks features baked into Drake, especially the Hadoop integration, but I believe it would be easier to incorporate custom functionality into redo versus hacking Make or writing a custom build system. I haven't used Drake, so I would be interested in a small but complicated Drake script which tackles an intractable problem in Make. I don't claim redo can provide a cleaner solution than a purpose-built system, but I think it will be unexpectedly simple.

jboggan13y ago· 2 in thread

I really wish that I had a tool like this back in grad school. I was doing bioinformatics work and merging, chopping, and processing various datasets over many months. When a new version of the underlying data came out it was not an easy task to go back and re-process it through dozens of steps in Perl and R. Having a tool like this would have made it a single command to do so and also ensured repeatability and transparency in my data, something which is often sorely lacking in an academic setting.

I am one of the data engineers at Factual and though I didn't have a role in creating it I definitely enjoy using it on a day to day basis. You begin to see the utility of it when you have a dozen people working up and down a data pipeline and need to coordinate as product specs evolve or schemas change.

I also really like the tagging features - you can add specific tags to different steps in the build and run different "flavors" of your workflow depending upon what is needed. For example, you might build a workflow that collects, cleans, filters, and performs calculations on data from all over the world - but you might also want alternative versions of the build that only work on specific regions or smaller debug datasets. Tags make that really simple to do, even when many steps are shared by the different versions or the dependencies are complicated.

xaa13y ago

As a fellow bioinformatician I can agree that this looks quite useful.

Although (since you mention R), I wonder why there's no love for R in Drake, given that R is perhaps the quintessential data processing language.

dirtyvagabondOP13y ago

There is love for R in Drake! As of about an hour ago: https://github.com/Factual/drake/commit/f63dd2630ca3e5e4a6a6...

danpalmer13y ago· 2 in thread

With an empty workflow, this is the result of `drake --version`.

  $ time drake --version
    Drake Version 0.1.0
    Target not found: ...
    drake --version  5.42s user 0.18s system 188% cpu 2.969 total

For short scripts that you should be running in the shell, this is really bad. I expect basic make commands on small projects to be effectively instant. Compilation might take a bit longer, but 5.4s to print the version points to a 5s overhead on all executions.

I'm guessing this is due to the JVM overhead, so that pretty much says this project isn't suited to the JVM. The JVM is great for long running processes, and applications where the overhead is a very small percentage of the total running time, but if it takes 5s longer than `make` to print its version, that's really not a good sign.

This is a fantastic idea, and I will definitely be using it. But this overhead needs fixing.

aboytsov13y ago

Hey, thanks for trying out our tool!

First of all, --version shouldn't try to run any targets. This seems like a bug. Thanks.

Yes, you guessed correctly - this is the JVM startup time. I just hate JVM for that. We experimented with Nailgun and Drip to eliminate it - Nailgun is problematic because it uses a shared JVM for all runs, and it can get quite hairy sometimes. In the long run, Nailgun is almost certainly not an answer, since it assumes things we have no control over (i.e. Clojure runtime) don't do destructive tear down. Drip is a bit more promising, but we didn't succeed running Drake under it (simpler things worked fine though).

So, we're still looking into it, and we're looking for other ideas, too.

In the meantime, you could run Drake under REPL:

(-main "...")

The only problem is that Drake calls System/exit but we can add a flag ("--repl") that would prevent it from doing so, and you'll stay in REPL.

Thoughts?

P.S. JVM is unfortunate but Clojure is a fantastic language for something like Drake.

danpalmer13y ago

Thanks for the detailed and well explained reply.

I have limited experience with Clojure, but it does seem to be a good match to this sort of task due to its structure. However the JVM seems to be a real drawback to me. Perhaps with something like Scheme or Lisp you might get a similar program structure, and be able to compile to faster binaries?

The REPL is a solution, but as many developers are using tools like make with many other tools in the shell, running a REPL like that would prevent them from using other things efficiently. Ultimately I think the overhead time needs to be removed.

If it takes far longer than something like make, that's not necessarily an issue. The key point is making it fast from the user's perspective. As long as it runs in a fraction of a second, I can't see much of a difference between 0.1s and 0.0001s, so I don't think that sort of difference really matters, it's when it gets over 1s that it becomes an issue.

Running something like Nailgun in the background may be a good solution, I don't have any experience with it. But if it requires starting a daemon in the background, that could get in the way of using the tool in a normal way.

I don't really know what the best solution to this problem is. I'm not sure Clojure is the best tool for the job.

jcromartie13y ago· 2 in thread

I like the idea that the tasks can be implemented in any language, but I feel like this has limitations compared to something like Rake, where the step definition is code, too. What this means is that in Rake I am not just limited to defining new task bodies, but new ways of defining tasks themselves.

I see that Drake is implemented in Clojure, so I'd imagine you understand the value of homoiconicity and extensible languages. So I wonder why you didn't just use Clojure all the way through?

aboytsov13y ago

This is a great question. Our approach to this is described here:

http://www.youtube.com/watch?feature=player_detailpage&v...

In short, we don't feel like it's an either or question. We want to have Drake as a command-line frontend to the core functionality, but we would love to see/have other frontends developed as well. Currently, there's no Clojure DSL for Drake, but I think it'd be totally awesome.

The reason we started from the command line is because our workflows are heterogeneous, and we also didn't want to limit Drake to developers and associate it with coding. Clojure can be quite a big learning curve if you only need it to specify steps and link them together through file dependencies.

We had an important design goal in mind: Drake should be as simple as writing a shell script. If it's not, our experience shows that most workflows start as trivial shell scripts with one or two steps, and by the time they grow into something unmanageable, it's kinda too late. :)

On a related note, Drake supports Clojure code inlining for manipulation of the parse tree. It's not an equivalent, just a somewhat related feature. It allows you to modify the steps, dependencies, and anything else in the parse tree directly from Clojure.

jboggan13y ago

I'm glad the step definitions are not in Clojure or a unified programming language. It makes it much easier to pull in data specialists, product managers, and other non-engineers to help build and maintain a data workflow while leaving them the autonomy to run and troubleshoot the steps of the build specific to their skillsets.

fnbr13y ago· 2 in thread

Perhaps I am the only one having issues here, but I cannot seem to get drake to run. Is there anything that is supposed to be done after building the uberjar?

Further, I don't understand how I'm supposed to alter my path to be able to run drake by simply entering 'drake'- would it be possible to get some help?

(I'm sorry if this is really obvious)

aboytsov13y ago

The project's README file (https://github.com/Factual/drake - scroll down) contains building and running instructions, as well as how to create a simple script to run Drake which you can put on your PATH.

fnbr13y ago

Ah, sorry, I should have been more clear. I've actually gone through the readme a few times, to no avail. I'll triple-check it though.

madMilo13y ago· 1 in thread

Reminds me of Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids, Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD, May, 2012.

https://www3.nd.edu/~ccl/software/makeflow/

aboytsov13y ago

Nice. Surprisingly, we weren't aware of Makeflow and kinda missed it completely. On the first look, it seems like Drake is quite a bit more feature-rich than Makeflow. Please see the designdoc and/or the tutorial video for details.

jeffdavis13y ago· 1 in thread

Cool project. I expected to be underwhelmed, but when I saw the dependency stuff, I was impressed. Maybe it should include a hook so that it can detect dataset changes automatically by running a separate command (or did I miss it?).

With a bit of creativity, I think there may be a lot of applications here.

aboytsov13y ago

This is an awesome idea. Currently Drake only supports timestamped and forced evaluations, but it would be great to have an evaluation abstraction where you could provide your own implementation of whether a target's changed and/or whether a target is to be considered fresher/younger than another target. Timestamped would compare modification times, forced would return true, and it could be extended indefinitely.
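A minimal Python sketch of what such an evaluation abstraction might look like follows. It is not Drake's API; the `EVALUATORS` registry and the function names are hypothetical. Each evaluator answers one question: should this target be rebuilt, given its inputs?

```python
import os


def timestamped(target, inputs):
    """Rebuild if the target is missing or any input is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in inputs)


def forced(target, inputs):
    """Always rebuild, regardless of timestamps."""
    return True


# Hypothetical registry: a custom evaluator (for example one that runs an
# external command to detect dataset changes) could be added under a new key.
EVALUATORS = {"timestamped": timestamped, "forced": forced}
```

A user-supplied check would then just be another function with the same two arguments.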

If you're serious about it, please submit a feature request (https://github.com/Factual/drake/issues), and describe more specifically what you would like to be able to do in your case.

Thank you for a great thought.

Artem.

daemon1313y ago· 1 in thread

Artem, the approach you guys are using is really EXCELLENT!

I think that a bit of a disconnect here may be because some OPs might be used to 'compiling' code versus the 'compiling' data angle that you are using.

This is especially evident from the Make dependencies discussion with lars512.

To give a simple specific example: I have a dataset of say 5000-50000 SKUs that are aggregated across 9-12 dimensions. My final report/analysis uses 3 scenarios. Now one sub-set of one scenario has changed [that's the raw input] - of course running 'data compilation' by using data that changed and ONLY what depends on it is the most effective&efficient approach.
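The "recompute only what depends on the change" idea can be sketched in a few lines of Python. This is an illustration, not how Drake computes it; `stale_targets` and the target names in the usage note are hypothetical.

```python
from collections import defaultdict, deque


def stale_targets(deps, changed):
    """Return every target that transitively depends on a changed input.

    `deps` maps each target to the set of things it is built from.
    `changed` is the set of raw inputs that were modified.
    """
    # Invert the graph: for each input, which targets consume it?
    dependents = defaultdict(set)
    for target, inputs in deps.items():
        for item in inputs:
            dependents[item].add(target)

    stale = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for target in dependents[node]:
            if target not in stale:
                stale.add(target)
                queue.append(target)
    return stale
```

With three scenarios feeding one report, changing a single scenario marks only that scenario's aggregates and the report as stale; the other scenarios' aggregates are left alone.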

Just my 2 financial cents...

aboytsov13y ago

Thank you very much for your kind words and support, and we certainly are looking forward to your feedback, feature requests and bug reports, as well as your code contributions, should you so desire.

We built this based on our own pain points with a larger audience in mind. We hope we got some things right, because the success of any tool is defined by its users. So, if you like it, let's build a thriving community together!

Artem.

Xion13y ago· 1 in thread

There seem to be few differences between Drake and just rolling out Makefiles for data processing, but I definitely see this project has potential. Distributed processing over AWS/Compute Engine/etc. clusters would be one nice thing to have, as a kind of simpler alternative to Hadoop.

I really like the inline, multi-language scripting though.

aboytsov13y ago

Thanks! We feel that in practice, there's quite a lot of differences between Drake and most Make-like systems. See this response for details: http://news.ycombinator.com/item?id=5111527

roolio_13y ago· 1 in thread

Kudos for your work! Do you plan to integrate Amazon S3 the same way you did for hdfs?

aboytsov13y ago

Thank you. Why not? We would love to see it, but we're also not actively using Amazon S3 at the moment. But we would be more than happy to review code contributions.

First of all, you can file a feature request: https://github.com/Factual/drake/issues

Adding a new filesystem to Drake's source is very easy. You just create a filesystem object that implements a bunch of methods for: listing a directory, removing a file, renaming a file and getting a file's timestamps, and then put it along with the corresponding prefix in the filesystem map. That's pretty much it. Assuming there's a client JAR for Amazon S3, written either in Clojure or in Java, it should be quite simple to do.
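To show the shape of that design, here is a Python sketch of a filesystem interface plus a prefix map. Drake's real code is Clojure and its actual method names are not given here, so `FileSystem`, `LocalFileSystem`, `FILESYSTEMS`, and `resolve` are all hypothetical. Only a local implementation is shown; an S3 one would implement the same four methods with an S3 client and be registered under its own prefix.

```python
import os
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """The four operations mentioned above, as an abstract interface."""

    @abstractmethod
    def list_dir(self, path): ...

    @abstractmethod
    def remove(self, path): ...

    @abstractmethod
    def rename(self, src, dst): ...

    @abstractmethod
    def mtime(self, path): ...


class LocalFileSystem(FileSystem):
    def list_dir(self, path):
        return sorted(os.listdir(path))

    def remove(self, path):
        os.remove(path)

    def rename(self, src, dst):
        os.rename(src, dst)

    def mtime(self, path):
        return os.path.getmtime(path)


# Hypothetical prefix map; a new backend is added with one more entry.
FILESYSTEMS = {"file": LocalFileSystem()}


def resolve(uri):
    """Split 'prefix:path' and look up the filesystem for the prefix.

    A string without a prefix is treated as a local path (an assumption
    of this sketch).
    """
    prefix, sep, path = uri.partition(":")
    if not sep:
        return FILESYSTEMS["file"], uri
    return FILESYSTEMS[prefix], path
```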

Artem.

aaronjg13y ago

I've spent a lot of time working with pipelining software, first for my last job doing bioinformatics research, and now for handling analytics workflows at Custora. We ultimately decided to write our own (which we are considering open sourcing, email me if you are interested in learning more).

The initial system that I used was pretty similar to Paul Butler's technique, with a whole bunch of hacks to inform Make as to the status of various MySQL tables, and to allow jobs to be parallelized across the cluster.

At Custora, we needed a system specifically designed for running our various machine learning algorithms. We are always making improvements to our models, and we need to be able to do versioning to see how the improvements change our final predictions about customer behavior, and how these stack up to reality. So in addition to versioning code, and rerunning analysis when the code is out of date we also need to keep track of different major versions of the code, and figure out exactly what needs to be recomputed.

We did a survey of a number of different workflow management systems such as JUG, Taverna, and Kepler. We ended up finding a reasonable model in an old configuration management program called VESTA. We took the concepts from VESTA and wrote a system in Ruby and R to handle all of our workflow needs. The general concepts are pretty similar to Drake, but it is specialized for our Ruby and R modeling.

Some more useful links for those interested:

JUG https://github.com/luispedro/jug

Taverna http://www.taverna.org.uk/

Kepler https://kepler-project.org/

VESTA http://vesta.sourceforge.net/

gojomo13y ago

I could imagine a bash shell that helps create drake files, by remembering in a richer history structure all files read/modified by subprocesses.

(A degenerate drake file, one line per 'step', would almost be a 1:1 representation of this richer history... though you then might want to coalesce and reorder atomic steps to represent the real shape of your workflow and dependencies.)

swalsh13y ago

Whoa, this is the first time I'm hearing of "Factual" but playing around I'm impressed! There was a side project I had a while ago, which I eventually gave up because I couldn't source some data. These guys found it!

circa13y ago

When you run it. It tells you, "you're the fuckin' best, you da fuckin' best."
