Basically, I need to first fetch the metadata for all the samples, and only then group the samples by treatment based on that metadata. In other words, the structure of the later parts of the DAG depends on the results of executing the earlier parts, so the full structure of the DAG is not known up front. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata, and the main workflow, which reads that metadata and builds its DAG accordingly. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...
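Roughly, the split looks like this (a minimal sketch, not the actual code from that repo; the file names, column names, and fetch script are all made up):

    # Snakefile.pre -- run first: snakemake -s Snakefile.pre
    rule all:
        input: "metadata/samples.tsv"

    rule fetch_metadata:
        output: "metadata/samples.tsv"
        shell: "fetch_sample_metadata.py > {output}"

    # Snakefile -- run second; reads the metadata at parse time,
    # so the grouping is known before the DAG is built
    import pandas as pd
    samples = pd.read_table("metadata/samples.tsv")
    by_treatment = samples.groupby("treatment")["sample_id"].apply(list).to_dict()

    rule all:
        input: expand("results/{treatment}/merged.bam", treatment=by_treatment.keys())

    rule merge_by_treatment:
        input: lambda wc: expand("aligned/{s}.bam", s=by_treatment[wc.treatment])
        output: "results/{treatment}/merged.bam"
        shell: "samtools merge {output} {input}"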
This is a common pattern I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until partway through executing it. Most workflow tools can't handle this gracefully. Another Python DAG executor, doit, can handle this case by letting you specify that some task definitions should not be evaluated until after other tasks have finished running. But it lacks some features I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.
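For comparison, doit's delayed task creation looks roughly like this. The create_after decorator is doit's real mechanism for this; the file names and shell commands are again made up for illustration:

    from doit import create_after

    def task_fetch_metadata():
        return {
            'targets': ['metadata/samples.tsv'],
            'actions': ['fetch_sample_metadata.py > metadata/samples.tsv'],
        }

    @create_after(executed='fetch_metadata')
    def task_merge_by_treatment():
        # By the time doit evaluates this function, fetch_metadata has
        # already run, so the metadata file exists and can be read here.
        import csv
        groups = {}
        with open('metadata/samples.tsv') as f:
            for row in csv.DictReader(f, delimiter='\t'):
                groups.setdefault(row['treatment'], []).append(row['sample_id'])
        for treatment, samples in groups.items():
            bams = ['aligned/%s.bam' % s for s in samples]
            yield {
                'name': treatment,
                'file_dep': bams,
                'targets': ['results/%s/merged.bam' % treatment],
                'actions': ['samtools merge results/%s/merged.bam %s'
                            % (treatment, ' '.join(bams))],
            }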