I would appreciate it if you elaborated on separating step definitions from dependency definitions. In my mind, they are the same thing. If you mean that steps might not be connected by an input-output relationship, but still have dependencies, Drake fully supports that via tags. If you mean that steps might be connected through input-output files, but not depend upon each other, I frankly don't see how that's possible. And if you mean some other syntax which more clearly separates the two, Drake supports methods, which achieve exactly that. If you mean something else, I would love to see an example.
Thanks!
As I said, I only skimmed very quickly since I'm busy, so I might have overlooked information; apologies in that case. But take the example from the front page:
evergreens.csv <- contracts.csv
grep Evergreen $INPUT > $OUTPUT
So now suppose a new requirement comes along - Evergreen is also called "Neverbrown" sometimes. It's decided the best way is to convert all references at input so nothing else gets confused downstream. So I need an extra step, now:
renamed.csv <- contracts.csv:
sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
evergreens.csv <- renamed.csv:
grep Evergreen $INPUT > $OUTPUT
Adding this step forced me to modify the declaration of the original command, even though what I added had nothing to do with that command. With Bpipe, for example, you say:
extract_evergreens = {
exec "grep Evergreen $input > $output"
}
fix_names = {
exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
}
Then you define your pipeline order separately:
run { fix_names + extract_evergreens }
If I get contracts from a different source that don't need the renaming, I can still run my old version and I'm not changing the definition of anything:
run { extract_evergreens }
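As a rough illustration of that separation outside either tool, here is a minimal Python sketch. It is not how Bpipe or Drake work internally: the `run` helper, the list-of-lines data model, and the sample rows are assumptions made for the example. Each stage is defined once, and the pipeline order is chosen separately.

```python
from typing import Callable, List

# A stage takes lines of text and returns lines of text.
Stage = Callable[[List[str]], List[str]]

def fix_names(lines: List[str]) -> List[str]:
    # Same effect as: sed 's/Neverbrown/Evergreen/g'
    return [line.replace("Neverbrown", "Evergreen") for line in lines]

def extract_evergreens(lines: List[str]) -> List[str]:
    # Same effect as: grep Evergreen
    return [line for line in lines if "Evergreen" in line]

def run(stages: List[Stage], lines: List[str]) -> List[str]:
    # Hypothetical helper: feed each stage the output of the previous one.
    for stage in stages:
        lines = stage(lines)
    return lines

# Made-up sample data standing in for contracts.csv.
contracts = ["1,Evergreen,100", "2,Neverbrown,200", "3,Oak,300"]

# Two pipeline orders built from the same, unmodified stage definitions.
with_renaming = run([fix_names, extract_evergreens], contracts)
without_renaming = run([extract_evergreens], contracts)
```

Adding or dropping `fix_names` changes only the list passed to `run`, not the definition of `extract_evergreens`.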
Hope this explains what I mean, and again apologies if this is all clearly explained in your docs and I just jumped to conclusions from the simple examples!
The fundamental issue is why you have to repeat the filename, and I did give it some thought.
1. What your example does is assign dependencies based on position. It's pretty cool. This seems to be easily reproducible in Drake, if we add a special symbol that would just mean "a temporary file" for the output, and "last temporary output" for the input (by the way, you don't need colons):
_ <- contracts.csv
sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
evergreens.csv <- _
grep Evergreen $INPUT > $OUTPUT
or even:
<<- contracts.csv
sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
evergreens.csv <<-
grep Evergreen $INPUT > $OUTPUT
2. One of the problems, as you can see, is that it only works if you don't care about the filenames, i.e. you use a temporary file. Similarly, your Bpipe expression:
run { fix_names + extract_evergreens }
doesn't care about filenames either. How do you add a filename there? What if you need this file for debugging purposes, or if it's an input to some further step down the road? In this case, you'd have to do what you want to avoid doing (i.e. modify the original step).
3. I'm even more concerned with multiple inputs and multiple outputs. As long as your workflow is simple, you can get away with a + b. But when it's more complicated, you would have to do something like:
run { (((fix_names + extract_evergreens) * and_some_otheroutput) + some_other_step) * some_other_output }
(I used * as an operator that puts two outputs together to create an input with two files for the next command. Mathematically, + is better for that and * is for what + is used in your examples. :))
As you can see, it gets unreadable so fast that you'd want to use some sort of identifiers to specify dependencies, and would end up with a scheme pretty much equivalent to filenames. The fact that some file might be temporary is a related, but parallel problem.
4. Now even worse, I'm not quite sure how this syntax could accommodate multiple outputs. If fix_names creates several outputs, and extract_evergreens uses only one, you can't get around it without some weird syntax and specifying a numeric position. It also gets out of hand pretty quickly and you're back to using some sort of identifiers, be it filenames or not.
5. Speaking of identifiers, you can use variables in Drake instead of filenames, so you can abstract filenames away. But it seems to me there's a more fundamental problem in play.
6. If you're concerned with coupling implementation and input and output names, Drake has methods for this:
fix_names()
sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
extract_evergreens()
grep Evergreen $INPUT > $OUTPUT
renamed.csv <- contracts.csv [method:fix_names]
evergreens.csv <- renamed.csv [method:extract_evergreens]
or even, as discussed above:
<<- contracts.csv [method:fix_names]
evergreens.csv <<- [method:extract_evergreens]
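To illustrate points 3 and 4 (once a workflow has several inputs and outputs, dependencies end up keyed by some identifier), here is a small Python sketch using the standard library's `graphlib` (Python 3.9+). It is not Drake's implementation; `report.txt` and `regions.csv` are hypothetical files added only to show a step with two inputs.

```python
from graphlib import TopologicalSorter

# Each output maps to the set of inputs it is built from.
# renamed.csv and evergreens.csv come from the example above;
# report.txt and regions.csv are hypothetical.
deps = {
    "renamed.csv": {"contracts.csv"},
    "evergreens.csv": {"renamed.csv"},
    "report.txt": {"evergreens.csv", "regions.csv"},
}

# A valid build order: every file appears after all of its inputs.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

With identifiers as keys, a step with two inputs is just a two-element set, with no need for nested operators or numeric positions.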
To summarize, I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar. For more complicated dependencies though, I don't really see a better approach.
I would love to hear your further thoughts on the matter, and whether you'd like to see something similar to what I proposed in Drake. Or something else.
Artem.
> One of the problems, as you can see, is that it only works if you don't care about the filenames
This is a really insightful point - it touches on one of the ways Bpipe differs philosophically from other tools. Bpipe absolutely says you don't want to manage the file names. Not that you don't care about them, but it takes the position that naming the files is a problem it should help you with, not a problem you should be helping it with. It enforces a systematic naming convention for files, so that every file is named automatically according to the pipeline stages it passed through. So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'. It does sometimes give you names that aren't correct by default, but it gives you easy ways to "hint" at how to produce the right name. E.g. if we want the output to end with ".txt" we write:
fix_names = {
exec "sed 's/Neverbrown/Evergreen/g' $input > $output.txt"
}
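The naming convention can be sketched in a few lines of Python. This is only an approximation built from the 'input.fix_names.csv' example above, not Bpipe's actual code; the function name `stage_output_name` and its `ext` parameter are hypothetical, and it handles bare file names only.

```python
from pathlib import PurePath
from typing import Optional

def stage_output_name(input_name: str, stage: str, ext: Optional[str] = None) -> str:
    # Insert the stage name before the extension; optionally swap the
    # extension, mimicking the "$output.txt" hint (assumed behaviour).
    path = PurePath(input_name)
    suffix = ext if ext is not None else path.suffix
    return f"{path.stem}.{stage}{suffix}"

print(stage_output_name("input.csv", "fix_names"))          # input.fix_names.csv
print(stage_output_name("input.csv", "fix_names", ".txt"))  # input.fix_names.txt
print(stage_output_name("input.fix_names.csv", "extract_evergreens"))
# input.fix_names.extract_evergreens.csv
```

Chaining the function shows how a name accumulates the stages a file has passed through.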
Similarly, if there are a lot of inputs and you need the one ending with ".txt" you write "$input.txt"; if you want the second input ending with ".txt" you write "$input2.txt", and so on. Part of this stems from the huge number of files that you can end up dealing with. When you start having hundreds or thousands of outputs, naming them quickly goes from being something you want to do to a chore that drives you completely crazy, and you want a tool to help you with it. Bpipe's names definitively tell you all the processing that was done on a file, which is extremely helpful for auditability as well.
> I'm even more concerned with multiple inputs and multiple outputs
As I touched on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want. The commands you write imply what files you need, and Bpipe searches backwards through the pipeline to find the most recent output files that satisfy those needs. Multiple outputs are similar ...
fix_names = {
exec "sed 's/Neverbrown/Evergreen/g' > $output1.txt 2> $output2.txt"
}
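The backwards search can be pictured with a short Python sketch. How Bpipe really resolves inputs is not spelled out here, so treat this as an assumption: `find_input`, its `index` parameter (standing in for the "2" in "$input2.txt"), and the file names in `history` are all hypothetical.

```python
from typing import List, Optional

def find_input(stage_outputs: List[List[str]], ext: str, index: int = 1) -> Optional[str]:
    # Walk the stages from most recent to oldest and return the index-th
    # (1-based) file with the requested extension from the first stage
    # that produced enough matching files.
    for outputs in reversed(stage_outputs):
        matches = [name for name in outputs if name.endswith(ext)]
        if len(matches) >= index:
            return matches[index - 1]
    return None

# Outputs of each stage so far, oldest first (made-up names).
history = [
    ["contracts.csv"],
    ["contracts.fix_names.txt", "contracts.fix_names.log"],
    ["contracts.fix_names.extract_evergreens.csv"],
]

print(find_input(history, ".txt"))  # contracts.fix_names.txt
print(find_input(history, ".csv"))  # contracts.fix_names.extract_evergreens.csv
print(find_input(history, ".bam"))  # None
```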
If you need to reach further back in the pipeline to find inputs there are more advanced ways to do it, but this works for 80% of your cases (the whole idea of a pipeline is that each stage usually processes the outputs from the previous one - so this is what Bpipe is optimized to give you by default).
> I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar.
It depends what you mean by "simple". I use it for fairly complicated things - 20-30 stages joined together with 3 or 4 levels of nested parallelism. It seems to work OK. I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.
Thanks for the great discussion!