> One of the problems, as you can see, that it only works if you don't care about the filenames
This is a really insightful point - it touches on one of the ways Bpipe differs philosophically from other tools. Bpipe absolutely says you don't want to manage the file names. Not that you don't care about them, but it takes the position that naming the files is a problem it should help you with, not a problem you should be helping it with. It enforces a systematic naming convention for files, so that every file is named automatically according to the pipeline stages it passed through. So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'. It does sometimes give you names that aren't correct by default, but it gives you easy ways to "hint" at how to produce the right name. E.g., if we want the output to end with ".txt" we write:
fix_names = {
exec "sed 's/Neverbrown/Evergreen/g' $input > $output.txt"
}
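To make the naming convention concrete, here is a minimal Python sketch of the rule as described above. It is only an illustration, not Bpipe's actual code: the function name `stage_output_name` is made up, and the behaviour of the ".txt" hint (swapping the extension while keeping the stage name) is an assumption.

```python
from pathlib import PurePath

def stage_output_name(input_name, stage, ext=None):
    """Name a stage's output by inserting the stage name before the extension.

    Hypothetical helper: models the convention described in this thread,
    not Bpipe's real implementation.
    """
    path = PurePath(input_name)
    # Keep the input's extension unless the stage hints at a different one
    # (assumption: a hint such as ".txt" simply replaces the extension).
    suffix = ext if ext is not None else path.suffix
    return f"{path.stem}.{stage}{suffix}"

print(stage_output_name("input.csv", "fix_names"))          # input.fix_names.csv
print(stage_output_name("input.csv", "fix_names", ".txt"))  # input.fix_names.txt
```

Applying the function to its own output appends the stage name again, which is why a file's name ends up recording every stage it went through.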
Similarly if there are a lot of inputs and you need the one ending with ".txt" you will write "$input.txt", if you want the second input ending with ".txt" you will write "$input2.txt", and so on. Part of this stems from the huge number of files that you can end up dealing with. When you start having hundreds or thousands of outputs naming them quickly goes from being something you want to do to a chore that drives you completely crazy and you want a tool to help you with. Bpipe's names definitively tell you all the processing that was done on a file which is extremely helpful for auditability as well.
> I'm even more concerned with multiple inputs and multiple outputs
As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want. The commands you write imply what files you need, and Bpipe searches backwards through the pipeline to find the most recent files output that satisfy those needs. Multiple outputs are similar ...
fix_names = {
exec "sed 's/Neverbrown/Evergreen/g' $input > $output1.txt 2> $output2.txt"
}
If you need to reach further back in the pipeline to find inputs there are more advanced ways to do it, but this works for 80% of your cases (the whole idea of a pipeline is that each stage usually processes the outputs from the previous one - so this is what Bpipe is optimized to give you by default).
> I think your example is cool, but it seems to only be practical for rather simple workflows. And I can also see how Drake can easily be extended to support such syntactic sugar.
It depends what you mean by "simple". I use it for fairly complicated things - 20 - 30 stages joined together with 3 or 4 levels of nested parallelism. It seems to work OK. I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.
Thanks for the great discussion!
Actually, I don't think there are any philosophical differences, and I'll try to make my case.
> Bpipe absolutely says you don't want to manage the file names.
I think this is too strong a statement as I try to show below.
> So, for example, after coming through the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'.
fix_names is the identifier in this case. There's really not much of a difference whether you use identifiers to come up with filenames, or you use filenames to come up with identifiers. If anything, I think filenames are preferable, because the user doesn't have to be aware of the scheme the tool uses to convert identifiers to filenames. The fact that identifiers are just a little bit shorter (e.g. don't have a .txt extension or something) does not outweigh the inconvenience of not knowing where the files are. The problem with this approach is because figuring out where the files are requires knowledge of the tool inner workings, that can only be acquired from reading the code or documentation.
Another problem with these naming conventions is that if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them? Or is the only way to handle it to copy-and-paste the code and create another rule?
It seems that an insufficiently clear separation between the code and the filenames can be a source of problems... Please correct me if I'm wrong.
When I compare:
_ <- contracts.csv
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
evergreens.csv <- _
  grep Evergreen $INPUT > $OUTPUT
with
contracts:
  sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT.csv
evergreens:
  grep Evergreen $INPUT.csv > $OUTPUT.csv
contracts + evergreens
I strongly prefer the first option, because there are fewer implicit things going on, and the code is more clearly separated from the file naming. Besides, it's even shorter.
> Similarly if there are a lot of inputs and you need the one ending with ".txt" you will write "$input.txt", if you want the second input ending with ".txt" you will write "$input2.txt", and so on.
This can work for very simple workflows with maybe several cases of multiple inputs and outputs, but it's unmanageable when complexity grows.
Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step. You can't use numbers to resolve that. You will end up coming up with some sort of semantic identifiers, which will almost completely replace BPipe's naming convention. And what's worse, they will be hard-coded in your step's commands, which means you'll have to edit the code if you want to change the filenames, or re-use this step's implementation somewhere else.
> When you start having hundreds or thousands of outputs naming them quickly goes from being something you want to do to a chore that drives you completely crazy and you want a tool to help you with.
I'm not sure I agree here. Here's how I see it:
Instead of naming hundreds of files, you have to name hundreds of methods (commands). Yes, you don't have to repeat the filenames to create dependencies, but you have to repeat the method names (in "contracts + evergreens"), and in a way which quickly breaches the boundaries of readability.
This doesn't work for complicated workflows, and for simple ones, I would prefer positional linking rather than coming up with names, like in the example I provided above.
There's nothing that prevents Drake from coming up with filenames from more abstract identifiers. We could come up with some syntax where you'd just give an identifier (say, "~contracts"), and we'll take care of the file location and name, just like BPipe does. The major difference is not this. The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph.
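As a rough illustration of the "identify inputs and outputs to build the graph" position, here is a small Python sketch. It is not Drake's implementation; the step names, the `fixed.csv` filename, and the `build_step_graph` helper are all made up for the example. The point is only that once each step declares what it reads and writes, the execution order falls out of the declarations and the step names play no role in it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each declares the files it reads and the files it writes.
steps = {
    "fix_names": {"inputs": ["contracts.csv"], "outputs": ["fixed.csv"]},
    "extract_evergreens": {"inputs": ["fixed.csv"], "outputs": ["evergreens.csv"]},
}

def build_step_graph(steps):
    """Map each step to the set of steps that produce its inputs."""
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    return {
        name: {producer[f] for f in s["inputs"] if f in producer}
        for name, s in steps.items()
    }

graph = build_step_graph(steps)
# TopologicalSorter expects {node: predecessors} and yields predecessors first.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['fix_names', 'extract_evergreens']
```

Files that no step produces (here `contracts.csv`) are treated as plain source files and add no edges.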
I think I provided at least a few strong arguments why BPipe is wrong on this one. I would really love to hear your further thoughts.
> As I touch on above, it's really not too hard. Bpipe gives you ways to query for inputs in a flexible manner to get the ones you want.
I'm sorry, I understood neither this nor the example you provided. Could you please elaborate? In the example you provided you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that used to generate them?
> I'd argue that it's more than syntactic sugar, though - it's a different philosophy about what problems are important and what the tool should be helping you with.
I appreciate your opinion. But the way I see it is this:
1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.
2) And for simple cases, it all comes down to syntactic sugar.
I understand it's hard to argue in the abstract, so I'll tell you what. Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake. I might need to invent some Drake features on the fly, but that's a good thing. This is what these discussions are for. I'll try to show you that there's no philosophical difference, and Drake has a more flexible approach overall. I am looking forward to this challenge, because your opinion is important to me.
Thank you!
Artem.
> The problem with this approach is because figuring out where the files are requires knowledge of the tool inner workings, that can only be acquired from reading the code or documentation
I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result. I execute
ls -lt *.csv
And I see my result at the top. There's really not a huge inconvenience in trying to find the output. Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case. I suspect we're using these tools in very different contexts and that's why we feel differently about this. It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?) You can specify the output file exactly with Bpipe, it's just not something you generally want to do. There's nothing wrong with either one - right tool for the job always wins!
> if you use the same code in multiple steps, things can become quite confusing. How will BPipe name them
It just keeps appending the identifiers:
run { fix_names + fix_names + fix_names }
will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed 3 times. One problem is you do end up with huge file names - by the time it gets through 10 stages it's not uncommon to have gigantic 200 character file names. But after getting used to that I actually like the explicitness of it.
> Imagine a step which takes 3 inputs - one separate, one which is output #2 of a previous step, and one which is output #6 of yet another step
Absolutely - you can get situations like this. We're sort of into the 20% of cases that need more advanced syntax (eventually we'll explore all of Bpipe's functions this way :-) ). But basically Bpipe gives you a query language that lets you "glob" the results of the pipeline output tree (not the files in the directory) to find input files. So to get files from specific stages you could write:
from(".xls", ".fix_names.csv", ".extract_evergreens.csv") {
exec "combine_stuff.py $input.xls $input1.csv $input2.csv"
}
It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible"). And when you really get in trouble it's actually Groovy code so you can write any programmatic logic you like to find and figure out the inputs if you really need to.
> Instead of naming hundreds of files, you have to name hundreds of methods (commands)
Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.
> The major difference is that we think you need to identify inputs and outputs to build the graph, and the method name is insignificant until you want code re-use, and BPipe seems to take the opposite position - that you need to give method names, and then use a separate expression to build the graph
Again, a really insightful comment, but I'd take it further (and this goes back to my very first comment). Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution. We don't actually know the graph until the pipeline has finished. An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph. You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it. By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility. So that's a tradeoff Bpipe makes (and there are downsides, it's just in the context where Bpipe shines the tradeoff is worth it).
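A toy Python sketch of what "the graph is a runtime product" could mean in practice. This is not how Bpipe is implemented; `run_pipeline`, the stage signature, and the file names are invented for illustration. Each stage decides at runtime which file to read, and the dependency edges are only recorded as a side effect of running.

```python
def run_pipeline(stages, initial_files):
    """Run stages in order; each stage picks its own inputs at runtime."""
    available = list(initial_files)  # every file seen so far, oldest first
    edges = []                       # (input_file, stage_name, output_file)
    for name, stage in stages:
        used, produced = stage(available)
        # The dependency edges are only known once the stage has chosen its inputs.
        edges.extend((src, name, out) for src in used for out in produced)
        available.extend(produced)
    return edges

def fix_names(available):
    # Runtime decision: prefer a .csv input if one exists, else fall back to .xls.
    csvs = [f for f in available if f.endswith(".csv")]
    src = csvs[-1] if csvs else [f for f in available if f.endswith(".xls")][-1]
    return [src], [src.rsplit(".", 1)[0] + ".fix_names.csv"]

edges = run_pipeline([("fix_names", fix_names)], ["input.xls"])
print(edges)  # [('input.xls', 'fix_names', 'input.fix_names.csv')]
```

With `["input.xls", "input.csv"]` as the starting files, the same stage records an edge from `input.csv` instead, so the graph differs between runs without the pipeline definition changing.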
> In the example you provided you identify different outputs by adding a number to their names. Is that how subsequent steps are supposed to refer to them as inputs - by the positional output number from the step that used to generate them
I think the "from" example above probably illustrates it. The simplest method is positional, but it doesn't have to be, you can filter with glob style matching to get inputs as well so if you need to pick out one then you just do so.
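For readers trying to picture the matching, here is a minimal Python sketch of resolving inputs by suffix against the list of files the pipeline has produced so far, newest match first. It is an assumption-laden model, not Bpipe's query language: `resolve_inputs` and the file names in `history` are hypothetical.

```python
def resolve_inputs(outputs_so_far, suffixes):
    """For each suffix, return the most recently produced file that ends with it."""
    resolved = []
    for suffix in suffixes:
        matches = [f for f in outputs_so_far if f.endswith(suffix)]
        if not matches:
            raise LookupError(f"no pipeline output ends with {suffix!r}")
        # The list is oldest-first, so the last match is the most recent one.
        resolved.append(matches[-1])
    return resolved

# Hypothetical record of files the pipeline has seen, in the order they appeared.
history = [
    "data.xls",
    "input.csv",
    "input.fix_names.csv",
    "input.fix_names.extract_evergreens.csv",
]
print(resolve_inputs(history, [".xls", ".fix_names.csv", ".extract_evergreens.csv"]))
# ['data.xls', 'input.fix_names.csv', 'input.fix_names.extract_evergreens.csv']
```

Note that the search runs over the pipeline's own record of outputs, not over whatever happens to be in the directory.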
> 1) As far as different philosophies go, I find BPipe's one to be a bit problematic for complicated cases.
I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool. I guess I would say that pipeline tools live at a level of abstraction where they aren't meant to get that complicated.
> 2) And for simple cases, it all comes down to syntactic sugar.
I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.
> Give me an example of a BPipe workflow that you particularly like, and I'll put it in Drake
I wouldn't mind doing that - I'll need to look around and find an example I can share that would make sense (what I do is very domain specific - unless you have familiarity with bioinformatics it will probably be very hard to understand). I'll pm you when I manage to do this, but it may take me a little while (apologies).
Thanks as always for the interesting discussion. I think this is a fascinating space, not least because there have been so many attempts at it - I would say there are probably dozens of tools like this going back over 20 years or so - and it seems like nobody has ever nailed it. Bpipe has problems, but so does every tool I've ever tried (I'm probably up to my 8th one or so now!).
> It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible").
My contention is that while BPipe makes simple things easy, hard things possible, Drake makes both easy and possible. I think I've made some points in that regard, and given you examples of Drake code which is just as easy to write as the corresponding BPipe code without compromising on functionality. But to really conclusively prove this, I'm looking forward to more BPipe examples. So far, I haven't seen anything that is simpler (or even shorter) in Bpipe.
> Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.
When I first read it I thought this was a great point and you were onto something. But as I thought about it more, I realized that it only seems this way.
Here's the thing: if you have 15 stages but hundreds of files, it can mean only two things:
1) The vast majority of those files are leaf files, that is - they are either inputs (with pre-determined names) or outputs, whose names you don't really care about (surprisingly). Drake can generate filenames for leaf output files with ease, as they don't affect the dependency graph.
2) The vast majority of those files are not leaves, but it means that the steps either:
2a) pass dozens of inputs and outputs to each other, and you have to either give them identifiers (as described above, Drake can do it too) or use positions (unmanageable).
2b) even worse, have a big and complicated dependency graph with many more than 14 edges, in which case your syntax of { a + b + c } will almost certainly be inadequate to describe such a complex thing (15 vertices and several dozen edges).
So, any way you look at it, Drake can do the same thing in the same way or better. Am I missing something?
> Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution.
I don't understand it. I'm afraid it doesn't work this way. You can't have the graph as a runtime product of the execution (i.e. after the execution), because it cripples your ability to do partial evaluation of targets. That is, you have to have the dependency graph before you can even answer the question - "is target A up-to-date?". If you need to run the workflow to arrive at a conclusion, there's no guarantee how much time it will take. I also believe it unnecessarily blurs the distinction between the commands and the workflow. If your code needs to care about its dependencies, it can't be used out of context. So, maybe an example?
But if all you need to do is re-run everything every time, then it means you're really doing something trivial, and it also raises the question of why we need a tool like BPipe in the first place.
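To spell out the "is target A up-to-date?" question, here is a make-style Python sketch that answers it from a dependency graph known before anything runs. It is only an illustration under assumptions: `is_up_to_date`, the `fixed.csv` filename, and the integer timestamps are invented, and this is not a claim about how Drake or Bpipe actually decide.

```python
def is_up_to_date(target, deps, mtime):
    """A target is current if it exists, is no older than its inputs,
    and every input is itself current."""
    if target not in mtime:
        return False  # never built
    for dep in deps.get(target, []):
        if dep not in mtime or mtime[dep] > mtime[target]:
            return False  # input missing or newer than the target
        if not is_up_to_date(dep, deps, mtime):
            return False  # input is itself stale
    return True

# Hypothetical graph (target -> inputs) and modification times.
deps = {"evergreens.csv": ["fixed.csv"], "fixed.csv": ["contracts.csv"]}
mtime = {"contracts.csv": 300, "fixed.csv": 200, "evergreens.csv": 250}
print(is_up_to_date("evergreens.csv", deps, mtime))  # False
```

Here `evergreens.csv` is newer than `fixed.csv`, but `fixed.csv` is older than `contracts.csv`, so the staleness is found by walking the graph without executing a single step.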
> An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph.
I don't see how it could work this way. Could you please give me an example along with the explanation of how BPipe will handle it on the control level?
> You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it.
I'm confused, I think nothing could be further from the truth. The dependency graph specifies what steps depend on what steps. If you don't know it, you don't even know how to start evaluating the workflow, because you don't know which step to build first. I don't understand this statement at all. Could you please elaborate or give me an example?
> By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility.
I need to see an example of this.
> I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool.
I don't think having 3 inputs is a very complicated case. And neither is having any dependency graph which is not a linear step1, step2, step3. My point is as soon as you get any of those, BPipe starts to slowly evolve into Drake, with some very weird syntax and inconsistencies (like having "implicit" dependencies in steps' implementations but having to also specify some or all of the dependencies in the "run" statement).
It's possible that I'm misunderstanding BPipe. Maybe some more examples would fix this.
> I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.
I don't really see them. And you can't just disagree, you have to provide arguments. :) I understand you can see it differently, but it seems like so far, there could be a Drake workflow for every BPipe example, which uses the same ideas and is equally easy to write (but not necessarily the reverse). This means it all comes down to syntax, no?
Again, I might be misunderstanding BPipe.
I think it's really, really hard to argue abstract concepts. I would very much appreciate some examples. It doesn't even have to be your favorite workflow. Just give me anything. Write something and ask - "how would you put it in Drake?". I think my response would make it clear whether there are syntactic or philosophical differences. We've already established that there are some things BPipe cannot do as well as Drake can. I'd like to see the reverse be true. Because in this case we can really identify philosophical differences, but if it's the opposite - i.e. Drake can do everything BPipe can with the same ease - then it's not a question of philosophy any more but design.
I'm not trying to attack BPipe. I just want to make the best tool possible, and if we make compromises, I want to make sure they are informed. We must consciously choose some things not to be as easy or possible in Drake for some other greater good. So far, I can't identify any of those things.
Show me. :)
Artem.
P.S. You don't have to give a real world example. I think that would actually unnecessarily restrain you and slow you down. Just demonstrate a basic concept, a feature, name your steps A, B, C - I don't care what they do. Only if it's something extremely exotic I might ask if there's a real world use-case for this, but I think I can come up with use-cases for pretty much anything. :)
P.P.S. Please include what you do to run the workflow in your examples. I suspect I might have misconceptions about what the "run" statement does and how Bpipe resolves dependencies.
P.P.P.S. I appreciate the dialog as well. Especially since BPipe is your 8th tool. I would like Drake to be your 9th, and better than anything you used before, including Bpipe.
>> The problem with this approach is because figuring out where the files are requires knowledge of the tool inner workings, that can only be acquired from reading the code or documentation
> I suppose this is true but it's really not an issue I have in practice. I run the pipeline and it produces (let's say) a .csv file as a result.
It's a good point, and I guess I didn't mean it's a major issue. Just something which is, I believe, a less than ideal design, because it spreads (de-centralizes) information. For example, if you want to do something with the files outside of your workflow in some shell script, this shell script would contain a filename which a reader would have no idea how you came up with. Again, it's not something to obsess over, just an observation.
> There's really not a huge inconvenience in trying to find the output.
Even if so (highly doubtful in case of, as you say, 200 character filenames), this reasoning only applies to interactive sessions.
> Having the pipeline tool automatically name everything instead of me having to specify it is definitely a win in my case.
It's only true if you have to type less. I'm trying to make a case that you don't have to sacrifice clarity to achieve the same result. I'm trying to show you can win without losing.
> I suspect we're using these tools in very different contexts and that's why we feel differently about this.
That might be true, but we were also trying to come up with a universal tool. That is, we are willing to make sacrifices if not making them means severely limiting the scope of usage. But again, I am trying to show you don't even have to make sacrifices.
> It sounds like you need the output to be well defined (probably because there's some other automated process that then takes the files?)
Sometimes, yes; sometimes only for debugging; sometimes only for convenience. But more importantly, I'm arguing that using filenames is just a better way to build the dependency graph, regardless of whether you write them yourself or use some identifiers that result in automatic filename generation. Remember I said Drake could easily do that? The core issue here is not filenames. It's what is the better (easier to read, less to type, easier to understand) way to define the dependency graph.
> You can specify the output file exactly with Bpipe, it's just not something you generally want to do.
Again, it's not the point. If you start specifying filenames exactly with Bpipe (I'm assuming you mean in commands themselves), you would just end up with a very strange beast: you'd essentially have to define the dependency graph twice, once indirectly, and once directly. Or at least different dependencies in different ways. It seems like this would just be a total mess. But I'm trying to show that even if you don't want to care about filenames, Drake's approach is better.
> There's nothing wrong with either one - right tool for the job always wins!
My feeling so far was that it's not like a comparison of C and Python, but rather like a comparison of C and C++. There's absolutely nothing that you can't do in C++ better or at least as well as in C. Of course, I might be wrong, and that's why I would love to see an example workflow which I would then put in Drake and we'll be able to objectively compare.
> It just keeps appending the identifiers: will produce input.fix_names.fix_names.fix_names.csv. So there's no problem with file names stepping on each other, and it'll even be clear from the name that the file got processed 3 times.
First, I don't want to process the file 3 times - I didn't mean call the same method 3 times, I meant use the same code in different parts of the workflow. For example, you have a method to convert data from CSV to JSON, and you use it a dozen times all over the workflow.
Secondly, I think this is pretty bad. The way you described it, it makes filenames situational - i.e. depending on what part of the workflow they're in. Removing one fix_names from the chain could invalidate other fix_names's inputs and outputs, or worse - not invalidate the timestamps, but make such a huge mess that the user won't even know what hit them. Editing the workflow should not require such careful consideration for the tool's inner workings. And if you can afford to re-run the whole thing every time you add or delete a step, you're working on something very, very simple.
> One problem is you do end up with huge file names - by the time it gets through 10 stages it's not uncommon to have gigantic 200 character file names.
I apologize, I didn't even realize the filenames carry all their creation history - I thought it was only the case with repeating names. I don't want to be harsh, but I think it's beyond bad. It means any change to the workflow can invalidate everything. This makes BPipe unusable for anything even remotely expensive. Please correct me if I'm wrong.
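The concern can be made concrete with a short Python sketch that only models the naming convention as described in this thread (stage names appended before the extension). The helper `chain_names` and the stage names a, b, c are hypothetical, and the sketch says nothing about what Bpipe actually does when names change; it just shows which names differ after a stage is removed.

```python
def chain_names(input_name, stages):
    """Return the output filename after each stage under the
    append-the-stage-name convention described in this thread."""
    stem, ext = input_name.rsplit(".", 1)
    names = []
    for stage in stages:
        stem = f"{stem}.{stage}"
        names.append(f"{stem}.{ext}")
    return names

before = chain_names("input.csv", ["a", "b", "c"])
after = chain_names("input.csv", ["a", "c"])
print(before)  # ['input.a.csv', 'input.a.b.csv', 'input.a.b.c.csv']
print(after)   # ['input.a.csv', 'input.a.c.csv']

# Files expected after the edit that did not exist under the old workflow.
print(sorted(set(after) - set(before)))  # ['input.a.c.csv']
```

Everything downstream of the removed stage gets a new name, while the outputs upstream of it keep theirs.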
> Absolutely - you can get situations like this.
This is actually pretty common.
from(".xls", ".fix_names.csv", ".extract_evergreens.csv") { exec "combine_stuff.py $input.xls $input1.csv $input2.csv" }
I'm sorry, I tried but I didn't understand this code. Could you please elaborate? What do you mean by "glob"? The way I see it, you may glob all you want, but there are just two ways to resolve this: use positional numbers or use some sort of identifiers. If you use positional numbers, it becomes unmanageable. And if you use identifiers, we're back where we started. It doesn't matter if they're filenames or not, what matters is that once you've started using identifiers, you can generate the dependency graph yourself, from identifiers. In other words, you've arrived at Drake's model.
...continued in part2...