
aboytsov · 13y ago

...continued from part 1. Read part 1 first!...

> It doesn't solve everything, but I guess the idea is, make it work right for the majority of cases ("sensible defaults") and then offer ways to deal with harder cases ("make simple things easy, hard things possible").

My contention is that while Bpipe makes simple things easy and hard things possible, Drake makes both easy and possible. I think I've made some points in that regard, and given you examples of Drake code that is just as easy to write as the corresponding Bpipe code without compromising on functionality. But to prove this conclusively, I'm looking forward to more Bpipe examples. So far, I haven't seen anything that is simpler (or even shorter) in Bpipe.

> Not at all - if my pipeline has 15 stages then I have 15 commands to name. Those 15 stages might easily create hundreds of outputs though.

When I first read it, I thought this was a great point and that you were onto something. But as I thought about it more, I realized that it only seems this way.

Here's the thing: if you have 15 stages but hundreds of files, it can mean only two things:

1) The vast majority of those files are leaf files, that is, they are either inputs (with pre-determined names) or outputs whose names you don't really care about (surprisingly). Drake can generate filenames for leaf output files with ease, as they don't affect the dependency graph.

2) The vast majority of those files are not leaves, but it means that the steps either:

2a) pass dozens of inputs and outputs to each other, and you have to either give them identifiers (as described above, Drake can do it too) or use positions (unmanageable).

2b) even worse, have a big and complicated dependency graph with many more than 14 edges, in which case your syntax of { a + b + c } will almost certainly be inadequate to describe such a complex thing (15 vertices and several dozen edges).

So, any way you look at it, Drake can do the same thing in the same way or better. Am I missing something?

> Bpipe isn't just not trying to build a graph up front, it really doesn't think there is a graph at all! At least, not an interesting one. The "graph" is a runtime product of the pipeline's execution.

I don't understand this. I'm afraid it doesn't work this way. You can't have the graph as a runtime product of the execution (i.e. after the execution), because it cripples your ability to do partial evaluation of targets. That is, you have to have the dependency graph before you can even answer the question "is target A up-to-date?". If you need to run the workflow to arrive at a conclusion, there's no guarantee how much time it will take. I also believe it unnecessarily blurs the distinction between the commands and the workflow. If your code needs to care about its dependencies, it can't be used out of context. So, maybe an example?

But if all you need to do is re-run everything every time, then it means you're really doing something trivial, and it also raises the question of why we need a tool like BPipe in the first place.

> An individual pipeline stage can use if / then logic at runtime to decide whether to use a certain input or a different input and that will change the dependency graph.

I don't see how it could work this way. Could you please give me an example along with the explanation of how BPipe will handle it on the control level?

> You have to go back and ask why you care about having the graph up front in the first place, and in fact it turns out you can get nearly everything you want without it.

I'm confused, I think nothing could be further from the truth. The dependency graph specifies what steps depend on what steps. If you don't know it, you don't even know how to start evaluating the workflow, because you don't know which step to build first. I don't understand this statement at all. Could you please elaborate or give me an example?
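To make the point about "which step to build first" concrete, here is a minimal sketch in Python. It is neither Drake's nor Bpipe's implementation; the step names A, B, C and the `build_order` helper are hypothetical. It only shows that once the "depends on" relation is known, a build order follows mechanically:

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
}

def build_order(deps):
    """Return one valid order in which the steps can be built."""
    return list(TopologicalSorter(deps).static_order())

print(build_order(deps))
```

Without the `deps` mapping (or something equivalent to it), there is nothing for the sorter to work with, which is the objection being raised here.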

> By not having the graph you lose some ability to do static analysis on the pipeline, but to have it you are giving up dynamic flexibility.

I need to see an example of this.

> I can't argue with that - but that's sort of the idea: simple things easy, hard things possible. Complicated cases are complicated with every tool.

I don't think having 3 inputs is a very complicated case. And neither is having any dependency graph which is not a linear step1, step2, step3. My point is as soon as you get any of those, BPipe starts to slowly evolve into Drake, with some very weird syntax and inconsistencies (like having "implicit" dependencies in steps' implementations but having to also specify some or all of the dependencies in the "run" statement).

It's possible that I'm misunderstanding BPipe. Maybe some more examples would fix this.

> I guess I'd have to disagree with this, as I really think there are some fundamental differences in approach that go well beyond syntactic sugar.

I don't really see them. And you can't just disagree, you have to provide arguments. :) I understand you can see it differently, but it seems like so far, there could be a Drake workflow for every BPipe example, which uses the same ideas and is equally easy to write (but not necessarily the reverse). This means it all comes down to syntax, no?

Again, I might be misunderstanding BPipe.

I think it's really, really hard to argue abstract concepts. I would very much appreciate some examples. It doesn't even have to be your favorite workflow. Just give me anything. Write something and ask - "how would you put it in Drake?". I think my response would make it clear whether there are syntactic or philosophical differences. We've already established that there are some things Bpipe cannot do as well as Drake can. I'd like to see the reverse be true as well, because in that case we can really identify philosophical differences. But if it's the opposite - i.e. Drake can do everything Bpipe can with the same ease - then it's not a question of philosophy any more but of design.

I'm not trying to attack BPipe. I just want to make the best tool possible, and if we make compromises, I want to make sure they are informed. We must consciously choose some things not to be as easy or possible in Drake for some other greater good. So far, I can't identify any of those things.

Show me. :)

Artem.

P.S. You don't have to give a real world example. I think that would actually unnecessarily restrain you and slow you down. Just demonstrate a basic concept, a feature, name your steps A, B, C - I don't care what they do. Only if it's something extremely exotic might I ask if there's a real world use-case for it, but I think I can come up with use-cases for pretty much anything. :)

P.P.S. Please include what you do to run the workflow in your examples. I suspect I might have misconceptions about what "run" statement does and how Bpipe resolves dependencies.

P.P.P.S. I appreciate the dialog as well. Especially since Bpipe is your 8th tool. I would like Drake to be your 9th, and better than anything you've used before, including Bpipe.


zmmmmm · 13y ago

I'm sorry I don't have time to answer in full. I'm just going to respond to this one point because I think it's pretty fundamental and perhaps explaining it will clear up other things!

    > The dependency graph specifies what
    > steps depend on what steps. If you don't
    > know it, you don't even know how to
    > start evaluating the workflow, because
    > you don't know which step to build
    > first. I don't understand this statement
    > at all. Could you please elaborate or
    > give me an example?

I can see this is really really hard to grok if you're basing everything on the idea of a DAG, and so many tools are that it's very natural to think you couldn't do it any other way. Think of it as imperative vs declarative if you like. In Bpipe the user declares the pipeline order explicitly (as you've seen) - so that's the first part of the answer to your question. Bpipe knows which part to execute first because the user said to explicitly. But this isn't used for figuring out dependencies - dependencies arise as actual commands are executed. Back to our famous example:

    fix_names = {
      exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
    }

    extract_evergreen = ...

    run { fix_names + extract_evergreen }

We run it like this:

    bpipe run pipeline.groovy input.csv

If you run it once, Bpipe builds input.fix_names.csv. If you run it twice, Bpipe is clever enough not to build input.fix_names.csv again! How is that if it doesn't know about the dependency graph?! Well, it does it "just in time". It executes the "fix_names" pipeline stage (or "method") and that calls the "exec" command. The "exec" command sees that all the inputs referenced ($input variables) are older than the outputs referenced ($output variables). So it knows it doesn't have to rebuild those outputs, and skips executing the command.

So what about transitive dependencies? If C depends on B which depends on A (so dependencies are A => B => C), what happens if you delete file B? Technically you don't need to build C because it's still newer than A, but Bpipe can't see it any more. Well, Bpipe knows this too because it keeps a detailed manifest on all the files created. So when the call to create B is executed it can see that although B was deleted, it did exist and in its last known state was newer than input files, so there's no need to rebuild it, as long as downstream dependencies are OK.
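The "just in time" check described above can be sketched roughly as follows. This is an illustrative Python sketch, not Bpipe's actual code: `known_mtime`, `needs_rebuild`, and the `manifest` dictionary (mapping a path to its last known modification time) are all hypothetical names, and the sketch ignores the "as long as downstream dependencies are OK" condition.

```python
import os

def known_mtime(path, manifest):
    """Modification time from disk, else from the manifest, else None."""
    if os.path.exists(path):
        return os.path.getmtime(path)
    return manifest.get(path)

def needs_rebuild(inputs, outputs, manifest):
    """Decide, at the moment a command is about to run, whether to run it.

    manifest is an assumed stand-in for a record of every file the
    pipeline has created, keyed by path.
    """
    input_times = [known_mtime(p, manifest) for p in inputs]
    if any(t is None for t in input_times):
        return True  # an input we know nothing about: be conservative
    newest_input = max(input_times)
    for out in outputs:
        out_time = known_mtime(out, manifest)
        if out_time is None:
            return True  # output was never built
        if out_time < newest_input:
            return True  # output is older than some input
    return False
```

In the A => B => C case, deleting B leaves its timestamp in `manifest`, so the check for B's step and the check for C's step can both come back `False` without B being on disk.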

So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it. This is one of those things that conventional tools solve which isn't actually that important (in my uses) but which occasionally is very annoying - I actually want to control the order of things sometimes. I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies. Usually it's pretty obvious what the right order things should be in and there are other externalities that influence how I like to do it ("I know this part uses a lot of i/o so try to do it in parallel with another bit that's mainly using CPU", or "Let's run this part last because it will be after hours and the other jobs will have finished"). Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.

aboytsov (OP) · 13y ago

> I'm sorry I don't have time to answer in full.

We're not getting anywhere. Just give me goddamn examples! :) Please! Examples!

> I can see this is really really hard to grok if you're basing everything on the idea of a DAG, and so many tools are that it's very natural to think you couldn't do it any other way.

There is no other way. BPipe is based on the idea of a DAG. You just don't see it.

> In Bpipe the user declares the pipeline order explicitly.

And this is a big mistake. The reason is simple - explicit order is very hard to manage once you have multiple inputs and outputs, and as a consequence, complicated (instead of linear) dependency relationships.

What you don't seem to realize, is that by "declaring the pipeline order explicitly" you create a dependency graph. It's a part of your workflow definition. Your workflow contains the full definition of the dependency graph. Even if it didn't, you would still use it. There is no other way.

This is what I meant when I said - you create your dependency graph in "run". And this is a bad idea.

> dependencies arise as actual commands are executed.

What does it mean exactly? That the first command will somehow tell Bpipe what to run next? If not, then I don't understand this statement at all.

> How is that if it doesn't know about the dependency graph?! Well, it does it "just in time".

It does not matter if you calculate the dependency graph before you run the first command, or as you run the commands. It makes absolutely no difference. The only difference is whether it is computable or not. If you say it's not computable until run-time, please elaborate on that.

> So in this way Bpipe handles dependencies for you.

So far I see that this is very standard and doesn't differ in any way from what Drake or any other tool does. The only thing that differs, and I am repeating myself, is how you define your dependency graph - through input and outputs, or in "run". So far it seems that "run" is quite unfortunate. But please give me examples.

> So in this way Bpipe handles dependencies for you. What it does not do is figure out which order to execute things in. It does them in exactly the order you tell it.

This is a meaningless statement. Drake also executes steps in the order you tell it. The only difference is how you tell it. In Drake, you tell it through specifying a list of steps each step depends on individually (once again, it doesn't matter that filenames are used for that - Drake also supports tags, or it could be some other identifiers). In Bpipe, you tell it in "run", collectively and sequentially. Drake's way supports the whole variety of graphs, while Bpipe's way - only a very limited subset. And for this limited subset, Drake can give you (I think) a syntax just as good if not better than Bpipe's. If you don't quite understand what I'm talking about, give me an example, and I will demonstrate.
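A small Python sketch of the two ways of "telling it the order" (hypothetical helper names, not code from either tool): a linear `a + b + c` sequence and a per-step list of dependencies both reduce to a set of edges, but the per-step form can also describe shapes that a purely linear sequence cannot.

```python
def edges_from_sequence(stages):
    """Edges implied by a purely linear 'a + b + c' ordering."""
    return list(zip(stages, stages[1:]))

def edges_from_step_deps(deps):
    """Edges from a per-step declaration {step: [steps it depends on]}."""
    return [(d, step) for step, ds in deps.items() for d in ds]

# The same linear pipeline expressed both ways.
linear = edges_from_sequence(["a", "b", "c"])
per_step = edges_from_step_deps({"a": [], "b": ["a"], "c": ["b"]})

# A diamond (b and c both need a; d needs both) in the per-step form only.
diamond = edges_from_step_deps({"b": ["a"], "c": ["a"], "d": ["b", "c"]})
```

Here `linear` and `per_step` contain the same two edges, while `diamond` has four edges over four steps, more than any single linear ordering of those steps would imply.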

> I actually want to control the order of things sometimes.

This is fine, the only question is how. You say Bpipe's way is convenient. I say give me an example and I'll show you that Drake's way is not any less convenient. I'm sorry to keep repeating myself, I thought I stressed the importance of examples quite a bit in my previous email and I want to stress it again. Examples, please!

> I want to be able to tell it "do this first, then that, then the next thing" regardless of dependencies.

This statement is self-contradictory. You don't seem to realize that by telling it "do this first, then that" you are defining dependencies. It's fine, and it's OK, and it can be convenient, but you can't say regardless of them.

Again - give me examples! Our conversation is becoming useless without examples.

You did not, but I'll just grab whatever you threw my way:

    fix_names = {
      exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
    }

    extract_evergreen = ...

    run { fix_names + extract_evergreen }

    $ bpipe run pipeline.groovy input.csv

Drake can support this perfectly:

    _ <- $[in]
      exec "sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT"

    $[out] <- _
      ........

    $ drake -v out=pipeline.groovy,in=input.csv

Isn't that much nicer? What disadvantages can you see?

Tell me what it is that you would like to do with this script, and I'll tell you a better way to do it in Drake. Is it multiple versions of run that you want to have? Easy. Are you concerned about inserting a step in the middle? Trivial. Tell me why Drake's code is worse, and I'll listen. So far it seems like it's better because it's shorter and more flexible at the same time.

> Having the tool think this stuff up by itself can save you a bit of time but it can lose you a lot because you don't have the ability to really control what's going on.

What exactly are you losing?

I am sorry if I sound irritated. I am. I've just been begging for examples, and you keep talking in the abstract, and that would be fine, but you're making a lot of mistakes. So, instead of looking at concrete things that would make my point apparent to you (or the opposite, prove that I'm wrong), I keep pointing to flaws in your reasoning, which, frankly, are irrelevant. A picture is worth a thousand words.

I really want your feedback. But please give me examples.

zmmmmm · 13y ago

> There is no other way. BPipe is based on the idea of a DAG. You just don't see it.

So if you think Bpipe uses a DAG, then I wonder how you would think it deals with:

    run { fix_names + fix_names + fix_names }

In terms of the pipeline stages that run this is cyclic, so it cannot be a DAG. On the other hand the files created do usually form a DAG dependency relationship, but even there, in the most general case, it's not at all impossible in an imperative pipeline to read a file in and write the same file out again in modified form (or more likely, to modify it in place), so the file depends on itself - another non-DAG relationship. I'm sure you'll object to this in a purist sense, and tell me it is a horribly broken idea, but as a practising bioinformatician, when I have a 10TB file and modifying it in place will save me hours and huge amounts of space, I'm much more interested in getting my job done than being pure about things.
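The stage-level versus file-level distinction can be sketched in Python. This is only an illustration: the naming rule (insert the stage name before the extension) is an assumption extrapolated from the input.fix_names.csv example earlier in the thread, and `run_linear` is a hypothetical helper, not how Bpipe is implemented.

```python
import os

def run_linear(stages, input_file):
    """Walk stages in the given order; return stage edges and file edges."""
    stage_edges, file_edges = [], []
    current = input_file
    for i, stage in enumerate(stages):
        root, ext = os.path.splitext(current)
        produced = f"{root}.{stage}{ext}"  # assumed naming rule
        file_edges.append((current, produced))
        if i > 0:
            stage_edges.append((stages[i - 1], stage))
        current = produced
    return stage_edges, file_edges

stage_edges, file_edges = run_linear(["fix_names"] * 3, "input.csv")
```

Under that assumption, every stage edge is `("fix_names", "fix_names")`, a self-loop, while the file edges form a plain chain of four distinct files. The in-place modification case mentioned above would instead give a file edge from a file to itself.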

I think you're right that we're at diminishing returns here, and I'm sorry I've frustrated you. We're trying to bite off more than we can chew in a forum like this.

I wish you all the best with Drake and I'll definitely check it out down the track (when it supports parallelism, since that's too important to me right now). For now, though, I don't intend to read / respond to any more replies in this thread.

