The major differences I see are:
- Inline support for Python et al.
- Confirming the steps that will be taken.
- HDFS support.
Are there any other big differences?
The concept of Make is not unique. Everything that has dependencies and executes steps is similar to Make in concept. Drake is no exception, and it can be replaced with Make, but no more so than Rake, Ant or Maven can be replaced by Make. That is, if it's trivial - yes. Just a bit more complicated - no.
Some things are merely painful to implement with Make, some are just impossible:
- multiple outputs
- no-input and no-output steps
- HDFS support
- Hadoop's partial files support (part-?????)
- forced execution of any subbranch, up or down the tree or any individual targets (crucial for debugging and development)
- target exclusions
- protocol abstraction - inline Python is just one example
- tags
- branching
- methods
These are just what's implemented already. Other things are planned, such as:
- automated data versioning (backup and revert)
- parallelization
- real-time status console
- retries, email notifications
- etc.
Requirements for building executables and for working with large, complicated and expensive data workflows are quite visibly different, and the most important thing about Drake is that it provides the platform for convenient features (such as versioning or email notifications) to be implemented. And once they are, every data workflow can take advantage of them.
I guess, if Make was really, really extendable, we could have considered it as a platform for all this. But it's not, and hacking all of that into Make's source code in C would be, I'm sure, a much greater pain than writing Drake.
Artem.
https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...
- The complexity of your analysis.
- How fixed your pipeline is over time.
- The size of a data set.
- How many data sets you are running the analysis on.
- How long the analysis takes to run.
If you are only doing one or two tasks, then you barely need a management tool, though if your data is huge, you probably want memoization of those steps. If your pipeline changes continuously, as it does for a scientist mucking around with new data, then you need executions of code to be objects in their own right, just like code.
Make-like systems are ideal when:
- Your analysis consists of tens of steps.
- You have only a couple of data sets that you're running a given analysis on.
- The analysis takes minutes to hours, so you need memoization.
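To make "memoization" concrete, here is a minimal sketch of the timestamp check that Make-like tools rely on: a step is rerun only when an output is missing or older than its newest input. The function names are made up for illustration and are not taken from Make or Drake.

```python
import os

def is_stale(inputs, outputs):
    """Return True if any output is missing or older than the newest input."""
    if not all(os.path.exists(p) for p in outputs):
        return True
    if not inputs:
        # Nothing to compare against: existing outputs count as fresh.
        return False
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output < newest_input

def run_if_stale(inputs, outputs, action):
    """Run action() only when the outputs are stale; report whether it ran."""
    if is_stale(inputs, outputs):
        action()
        return True
    return False
```

The expensive step is passed in as `action`, so a second call with unchanged inputs costs only a few stat calls.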
Another Swiss project, openBIS, is ideal for big analyses that are very fixed, but will be run on large numbers of data sets. It's very regimented and provides lots of tools for curating data inputs and outputs. The system I wrote was meant for day to day analysis where the analysis would change with every run, was only being run on a few data sets, and the analysis took minutes to hours to run. Having written it and had a few years to think about it, there are things I would do very differently today (notably, make executions much more first class than they are, starting with an omniscient debugger integrated with memoization, which is effectively an execution browser).
So bravo for this project for making a tool that fits their needs beautifully. More people need to do this. Tools to handle the logistics of data analysis are not one size fits all, and the habits we have inherited are often not what we really want.
Here's yet-another-project for bioinformatics workflows that I've been involved in. This one is based on Groovy:
I agree with your sentiments about the nature of pipelines vs build system a la make. Many many people start down the path of putting the classic DAG dependency analysis as the foundation of their needs when in fact, this isn't so much of a problem in real situations, and is even somewhat counterproductive because it forces you to declare a lot of things in a static way that actually aren't static at all. I've found tools like this completely break down when your data starts determining your workflow (eg: if the file is bigger than X I will break it in n parts and run them in parallel, otherwise I will continue on and do it using a different command entirely in memory).
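The "data determines the workflow" case can be sketched roughly like this: the list of steps is computed at run time from the input size instead of being declared statically. The step names, the threshold and the chunk size are all hypothetical.

```python
import math

def plan_for_file(size_bytes, threshold_bytes, chunk_bytes):
    """Choose the steps to run based on the size of the input.

    Returns a list of (step_name, start_offset, end_offset) tuples.
    The two strategies and their names are illustrative only.
    """
    if size_bytes <= threshold_bytes:
        # Small input: one step that handles everything in memory.
        return [("process_in_memory", 0, size_bytes)]
    # Large input: split into chunks that could be run in parallel.
    n_parts = math.ceil(size_bytes / chunk_bytes)
    return [
        ("process_chunk", i * chunk_bytes, min((i + 1) * chunk_bytes, size_bytes))
        for i in range(n_parts)
    ]
```

A statically declared dependency graph has no natural place for that `if`, which is the difficulty being described.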
In my experience the problems in big data analysis are more about the complexity of managing the process, achieving as much parallelization with as little effort and craziness as possible (don't see any mention of that in Drake), documenting what actually happened when something ran so you can figure it out later, and most of all, flexibility in modifying it since it changes every day of the week.
One mistake that Drake appears to make (again, from my quick skim), is interweaving the declaration of the "stages" of the pipeline (what they do) and the dependencies between them (the order they run in). This makes your pipeline stages less reusable and the pipeline harder to maintain. Bpipe completely separates these things out, which is something I like about it.
I would appreciate it if you elaborated on separating step definitions from dependency definitions. In my mind, they are the same thing. If you mean that steps might not be connected by an input-output relationship, but still have dependencies, Drake fully supports that via tags. If you mean that steps might be connected through input-output files, but not depend upon each other, I frankly don't see how that's possible. And if you mean some other syntax which more clearly separates the two, Drake supports methods, which achieve exactly that. If you mean something else, I would love to see an example.
Thanks!
You raise some interesting points (for example, frequently changing code), which we ran into as well. Our current approach to it is not as fundamental: it basically consists of the ability to force a rebuild of any target and everything down the tree, plus methods, and you can also add your binaries as a step's dependency.
I'm sure as we and other people use the tool, we'll have better ideas. For example, Drake could automatically sense that the step's definition has changed and offer to rebuild or dismiss.
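One way the "sense that the step's definition has changed" idea could work is to hash the step's command text and compare it with the hash recorded on the previous run. This is only a sketch of the idea, not something Drake does; the function names are hypothetical.

```python
import hashlib

def step_fingerprint(commands):
    """Hash a step's command lines so that edits to the step can be detected."""
    digest = hashlib.sha256()
    for line in commands:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def definition_changed(commands, recorded_fingerprint):
    """Return True if the step's commands differ from the recorded run."""
    return step_fingerprint(commands) != recorded_fingerprint
```

The tool would store the fingerprint next to the step's outputs and, on a mismatch, offer to rebuild or dismiss.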
Other points you raised are also definitely worth thinking about.
Drake supports the ability to run stages in parallel (at least in theory) - it's been specced out (https://docs.google.com/document/d/1bF-OKNLIG10v_lMes_m4yyaJ...), just not implemented yet. But of course, once you have the entire dependency graph, it's easy to know what can be run in parallel and what cannot.
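As a rough illustration of why the dependency graph is enough: a topological sort done level by level groups the steps into batches, and everything inside one batch can run at the same time. This is a generic sketch, not the design from the spec linked above; the function name and the input format are assumptions.

```python
def parallel_batches(deps):
    """Group steps into batches whose members do not depend on each other.

    deps maps each step name to the set of step names it depends on.
    """
    remaining = {step: set(needs) for step, needs in deps.items()}
    # Steps that appear only as dependencies have no dependencies of their own.
    for needs in deps.values():
        for need in needs:
            remaining.setdefault(need, set())
    batches = []
    done = set()
    while remaining:
        # A step is ready once everything it needs has already been scheduled.
        ready = sorted(s for s, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return batches
```

Each batch must finish before the next one starts, which fits the requirement that steps be synchronous.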
As for distributing computations, our approach is that it lies outside of Drake's scope. Drake doesn't know what's going on in steps. But you can always implement a step that would use distributed computation, for example, by submitting a Hadoop job, or in any other way. The only requirement Drake has is for the step to be synchronous, i.e. do not return before all the computation is complete. But even that can be changed for some cases.
http://www.factual.com/jobs/oTR1Vfwq/Software-Engineer---Pal...
Another answer would be, "Why not?"
#!/bin/sh
case $1 in
    contracts.csv)
        curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt
        ;;
    evergreens.csv)
        redo-ifchange contracts.csv
        grep Evergreen contracts.csv
        ;;
    report.txt)
        input=evergreens.csv
        redo-ifchange $input
        python2 <<-EOF
linecount = len(file("$input").readlines())
print("File $input has {0} lines.\n".format(linecount))
EOF
        ;;
esac
[1] https://github.com/apenwarr/redo
The ideas behind redo are brilliant, but the way to express them in this particular implementation is not so fun.
http://news.ycombinator.com/item?id=5111527
I suspect most of the points I made would be applicable to redo as well, if not more so. Trivial things don't require Drake. Heck, they often don't require Make either - just put it in a linear shell script if the steps are not too expensive. It's when things get complicated that you need something like Drake.
I am one of the data engineers at Factual and though I didn't have a role in creating it I definitely enjoy using it on a day to day basis. You begin to see the utility of it when you have a dozen people working up and down a data pipeline and need to coordinate as product specs evolve or schemas change.
I also really like the tagging features - you can add specific tags to different steps in the build and run different "flavors" of your workflow depending upon what is needed. For example, you might build a workflow that collects, cleans, filters, and performs calculations on data from all over the world - but you might also want alternative versions of the build that only work on specific regions or smaller debug datasets. Tags make that really simple to do, even when many steps are shared by the different versions or the dependencies are complicated.
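The "flavors" idea can be pictured as filtering the step list by tag. The sketch below only illustrates that idea; it is not Drake's tag syntax or its actual selection rules, and the step and tag names are made up.

```python
def select_by_tags(steps, wanted_tags):
    """Return the names of steps carrying at least one wanted tag.

    steps maps step name to a list of tags; definition order is preserved.
    """
    wanted = set(wanted_tags)
    return [name for name, tags in steps.items() if wanted & set(tags)]
```

Shared steps simply carry several tags, so the same definition serves both the worldwide build and the smaller regional or debug builds.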
Although (since you mention R), I wonder why there's no love for R in Drake, given that R is perhaps the quintessential data processing language.
$ time drake --version
Drake Version 0.1.0
Target not found: ...
drake --version 5.42s user 0.18s system 188% cpu 2.969 total
For short scripts that you should be running in the shell, this is really bad. I expect basic make commands on small projects to be effectively instant. Compilation might take a bit longer, but 5.4s to print the version points to a 5s overhead on all executions.
I'm guessing this is due to the JVM overhead, so that pretty much says this project isn't suited to the JVM. The JVM is great for long running processes, and applications where the overhead is a very small percentage of the total running time, but if it takes 5s longer than `make` to print its version, that's really not a good sign.
This is a fantastic idea, and I will definitely be using it. But this overhead needs fixing.
First of all, --version shouldn't try to run any targets. This seems like a bug. Thanks.
Yes, you guessed correctly - this is the JVM startup time. I just hate JVM for that. We experimented with Nailgun and Drip to eliminate it - Nailgun is problematic because it uses a shared JVM for all runs, and it can get quite hairy sometimes. In the long run, Nailgun is almost certainly not an answer, since it assumes things we have no control over (i.e. Clojure runtime) don't do destructive tear down. Drip is a bit more promising, but we didn't succeed running Drake under it (simpler things worked fine though).
So, we're still looking into it, and we're looking for other ideas, too.
In the meantime, you could run Drake under REPL:
(-main "...")
The only problem is that Drake calls System/exit but we can add a flag ("--repl") that would prevent it from doing so, and you'll stay in REPL.
Thoughts?
P.S. JVM is unfortunate but Clojure is a fantastic language for something like Drake.
I have limited experience with Clojure, but it does seem to be a good match for this sort of task due to its structure. However, the JVM seems to be a real drawback to me. Perhaps with something like Scheme or Lisp you might get a similar program structure, and be able to compile to faster binaries?
The REPL is a solution, but as many developers are using tools like make with many other tools in the shell, running a REPL like that would prevent them from using other things efficiently. Ultimately I think the overhead time needs to be removed.
If it takes far longer than something like make, that's not necessarily an issue. The key point is making it fast from the user's perspective. As long as it runs in a fraction of a second, I can't see much of a difference between 0.1s and 0.0001s, so I don't think that sort of difference really matters. It's when it gets over 1s that it becomes an issue.
Running something like Nailgun in the background may be a good solution, I don't have any experience with it. But if it requires starting a daemon in the background, that could get in the way of using the tool in a normal way.
I don't really know what the best solution to this problem is. I'm not sure Clojure is the best tool for the job.
I see that Drake is implemented in Clojure, so I'd imagine you understand the value of homoiconicity and extensible languages. So I wonder why you didn't just use Clojure all the way through?
http://www.youtube.com/watch?feature=player_detailpage&v...
In short, we don't feel like it's an either or question. We want to have Drake as a command-line frontend to the core functionality, but we would love to see/have other frontends developed as well. Currently, there's no Clojure DSL for Drake, but I think it'd be totally awesome.
The reason we started from the command line is that our workflows are heterogeneous, and we also didn't want to limit Drake to developers and associate it with coding. Clojure can be quite a big learning curve if you only need it to specify steps and link them together through file dependencies.
We had an important design goal in mind: Drake should be as simple as writing a shell script. If it's not, our experience shows that most workflows start as trivial shell scripts with one or two steps, and by the time one grows into something unmanageable, it's kinda too late. :)
On a related note, Drake supports Clojure code inlining for manipulation of the parse tree. It's not an equivalent, just a somewhat related feature. It allows you to modify the steps, dependencies, and anything else in the parse tree directly from Clojure.
Further, I don't understand how I'm supposed to alter my path to be able to run drake by simply entering 'drake'. Would it be possible to get some help?
(I'm sorry if this is really obvious)
With a bit of creativity, I think there may be a lot of applications here.
If you're serious about it, please submit a feature request (https://github.com/Factual/drake/issues), and describe more specifically what you would like to be able to do in your case.
Thank you for a great thought.
Artem.
I think a bit of the disconnect here may be because some posters are used to 'compiling' code, whereas you are coming from the angle of 'compiling' data.
This is especially evident in the Make dependencies discussion with lars512.
To give a simple, specific example: I have a dataset of, say, 5000-50000 SKUs that are aggregated across 9-12 dimensions. My final report/analysis uses 3 scenarios. Now one subset of one scenario has changed [that's the raw input]. Of course, running 'data compilation' on the data that changed and ONLY on what depends on it is the most effective and efficient approach.
Just my 2 financial cents...
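The "only what depends on the changed data" part of the SKU example can be sketched as a walk down the dependency graph from the changed inputs. The target names below are made up, and this is a generic illustration rather than Drake's implementation.

```python
from collections import deque

def affected_by(changed, deps):
    """Return every target that depends, directly or transitively, on a changed input.

    deps maps each target to the set of inputs it is built from.
    """
    # Invert the mapping: for each input, which targets consume it?
    consumers = {}
    for target, needs in deps.items():
        for need in needs:
            consumers.setdefault(need, set()).add(target)
    affected = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for target in consumers.get(node, ()):
            if target not in affected:
                affected.add(target)
                queue.append(target)
    return affected
```

With three scenarios feeding one report, a change to one scenario's raw input touches that scenario and the report, and leaves the other two scenarios alone.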
We built this based on our own pain points with a larger audience in mind. We hope we got some things right, because the success of any tool is defined by its users. So, if you like it, let's build a thriving community together!
Artem.
I really like the inline, multi-language scripting though.
First of all, you can file a feature request: https://github.com/Factual/drake/issues
Adding a new filesystem to Drake's source is very easy. You just create a filesystem object that implements a bunch of methods for: listing directory, removing file, renaming file and getting file's timestamps, and then put it along with the corresponding prefix in the filesystem map. That's pretty much it. Assuming there's client JAR for Amazon S3, written either in Clojure or in Java, it should be quite simple to do.
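To show the shape of what is described above, here is a sketch of a filesystem object with those four operations plus a prefix map. Drake itself is written in Clojure, so this Python version is purely illustrative; the class, method and map names are all hypothetical and do not mirror Drake's source.

```python
import os

class LocalFileSystem:
    """Local-disk implementation of the four operations listed above."""

    def list_dir(self, path):
        return sorted(os.listdir(path))

    def remove(self, path):
        os.remove(path)

    def rename(self, src, dst):
        os.rename(src, dst)

    def mtime(self, path):
        return os.path.getmtime(path)

# Hypothetical prefix-to-filesystem map; an S3 implementation would be
# another class with the same methods, registered under its own prefix.
FILESYSTEMS = {"file": LocalFileSystem()}

def resolve(uri, default_prefix="file"):
    """Split 'prefix:path' and return (filesystem, path)."""
    prefix, sep, rest = uri.partition(":")
    if sep and prefix in FILESYSTEMS:
        return FILESYSTEMS[prefix], rest
    # No known prefix: treat the whole string as a path on the default filesystem.
    return FILESYSTEMS[default_prefix], uri
```

The rest of the tool talks only to the object returned by `resolve`, so adding a backend does not touch the dependency logic.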
Artem.
The initial system that I used was pretty similar to Paul Butler's technique, with a whole bunch of hacks to inform Make as to the status of various MySQL tables, and to allow jobs to be parallelized across the cluster.
At Custora, we needed a system specifically designed for running our various machine learning algorithms. We are always making improvements to our models, and we need to be able to do versioning to see how the improvements change our final predictions about customer behavior, and how these stack up to reality. So in addition to versioning code and rerunning analysis when the code is out of date, we also need to keep track of different major versions of the code, and figure out exactly what needs to be recomputed.
We did a survey of a number of different workflow management systems such as JUG, Taverna, and Kepler. We ended up finding a reasonable model in an old configuration management program called VESTA. We took the concepts from VESTA and wrote a system in Ruby and R to handle all of our workflow needs. The general concepts are pretty similar to Drake, but it is specialized for our Ruby and R modeling.
Some more useful links for those interested:
JUG https://github.com/luispedro/jug
Taverna http://www.taverna.org.uk/
Kepler https://kepler-project.org/
(A degenerate drake file, one line per 'step', would almost be a 1:1 representation of this richer history... though you then might want to coalesce and reorder atomic steps to represent the real shape of your workflow and dependencies.)