Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which also describes a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.)
It has a completely different model, though, and one that I think is more suitable for "big data".
https://scholar.google.com/scholar?cluster=98697598478714306...
http://dl.acm.org/citation.cfm?id=1645175
In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation.
I have a printout of this paper, but unfortunately it doesn't appear to be online :-(
Also, to nitpick, this is more accurately called a directed acyclic graph shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh reads nicer than dgsh too.
What I have found so far: most tools that invent a new language, or try to cram complex processes into ill-suited syntactic environments, are not loved much.
A few people like XSLT; most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big-data orchestration frameworks that use XML as a configuration language, which can be OK if you have clear processing steps.
Every time a tool invents a DSL for data processing, I grab my list of ugly real-world use cases, and most of the tools fail quickly, if not immediately. That's a pity.
Programming languages can be effective as they are, and given the exceptions that unclean data brings, you want a programming language at your disposal anyway.
I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.
Two libraries that have a good grip on the plumbing aspect of data processing systems are Airflow and Luigi. They are Python libraries, and with them you have a concise syntax and basically all Python libraries, plus non-Python tools with a command-line interface, at your fingertips.
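To illustrate the concise syntax, here is a minimal Luigi sketch; the task names, file paths, and contents are made up for illustration:

    import luigi

    # Hypothetical upstream task that produces a raw data file.
    class FetchData(luigi.Task):
        def output(self):
            return luigi.LocalTarget('raw.txt')

        def run(self):
            with self.output().open('w') as out:
                out.write('one line of raw data\n')

    # Downstream task: Luigi derives the dependency graph from requires().
    class CountLines(luigi.Task):
        def requires(self):
            return FetchData()

        def output(self):
            return luigi.LocalTarget('counts.txt')

        def run(self):
            with self.input().open() as infile, self.output().open('w') as outfile:
                outfile.write(str(sum(1 for _ in infile)) + '\n')

Running something like "luigi --module mytasks CountLines --local-scheduler" then executes both tasks in dependency order, skipping anything whose output already exists.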
I am curious: what kind of process orchestration tools do people use and recommend?
We basically did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, bpipe, etc.) to fit the bill. Nextflow came close, but it in fact allows quite a bit of custom code too.
What worked for us was to use Spotify's Luigi, which is a Python library rather than a DSL.
The only thing was that we had to develop an API inspired by flow-based programming on top of Luigi's more functional-programming-based one, in order to make defining dependencies fluent and easy enough to specify for our complex workflows.
Our flow-based-inspired Luigi API for complex workflows, SciLuigi, is available at:
https://github.com/pharmbio/sciluigi
We wrote up a paper on it as well, detailing a lot of the design decisions behind it:
http://dx.doi.org/10.1186/s13321-016-0179-6
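For a feel of the port-style wiring, here is a rough sketch from memory; the class names and file names are invented, and the exact SciLuigi API may differ, so the repo and paper above are the authoritative reference:

    import sciluigi as sl

    # Illustrative tasks; 'Producer' and 'Consumer' are made-up names.
    class Producer(sl.Task):
        def out_data(self):
            return sl.TargetInfo(self, 'data.txt')

        def run(self):
            with open(self.out_data().path, 'w') as f:
                f.write('some data\n')

    class Consumer(sl.Task):
        in_data = None  # in-port, connected below in the workflow definition

        def out_result(self):
            return sl.TargetInfo(self, 'result.txt')

        def run(self):
            with open(self.in_data().path) as infile, open(self.out_result().path, 'w') as out:
                out.write(infile.read().upper())

    class MyWorkflow(sl.WorkflowTask):
        def workflow(self):
            producer = self.new_task('producer', Producer)
            consumer = self.new_task('consumer', Consumer)
            consumer.in_data = producer.out_data  # wire out-port to in-port
            return consumer

The point is that dependencies are declared by connecting named out-ports to in-ports in one place, rather than being hard-coded inside each task.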
Then, lately we have been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could just as well rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has about a third of the LOC of a recent bioinformatics pipeline tool written in Python and put into production). Code available at:
https://github.com/scipipe/scipipe
It is also very much a programming library rather than a DSL.
In fact, it even implements streaming via named pipes, which seems to allow operations somewhat similar to dgsh's, probably with a bit more code, but with the (apparent) benefit of somewhat easier handling of multiple inputs and outputs (via the flow-based programming ports concept).
dgsh looks really interesting for simpler operations where there is one main input and output, though - which occur a lot in ad-hoc shell work, in our experience. Will have to test it out for sure!
I only hope to get time to test it out in some more depth sooner rather than later (it is one of my top goals for 2017).
Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here, but properly identifies separate inputs and outputs declaratively.
Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much interested to see if it could natively fit the bill for our complex workflows. But if not, I think we can always use it in a lightweight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.
We would still like to benefit from the seemingly groundbreaking "git for big data" paradigm, and from auto-executed workflows on updated data, which should enable something as impactful as online data analyses (auto-updated upon new data) in a manageable way.
In this case the task resource http://converge.aster.is/0.5.0/resources/task/ might help, as it allows you to create a directed graph using any kind of interpreter (for example, Python or Ruby) instead of having to use the DSL.
Unfortunately I haven't found anything, so for our use cases in bioinformatics, I basically took an example workflow that was used in a course on next-gen sequencing analysis as a starting point:
https://github.com/NBISweden/workflow-tools-evaluation/tree/...
I have only partly implemented it in Common Workflow Language [1] and SciPipe [2] so far ... the implementation turned out to take a tremendous amount of work :P
I'd be much interested if anyone has found or created a more general set of such example workflows.
Me too, for better or for worse.
As for the issues, there are many. Just quickly a few:
* Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?
* Data provider has a good, strict XML schema, but the relevant information for a single item is spread across three files inside a tar archive. Since there are 500k files inside the archive, you'd rather not extract it, but process it on the fly (see the streaming sketch after this list).
* Data provider chooses a layout that saves every item as a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?
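For the tar case, Python's tarfile module can at least stream the members without extracting anything to disk. A minimal sketch, where the archive name, the member filter, and handle_item are all placeholders:

    import tarfile

    def handle_item(name, xml_bytes):
        # Placeholder for the real per-item processing (e.g. grouping the
        # three related files per item by a shared key).
        print(name, len(xml_bytes))

    # 'r|gz' opens the archive in streaming mode: members are read strictly
    # sequentially, so nothing is extracted to disk and memory use stays per-item.
    with tarfile.open('items.tar.gz', mode='r|gz') as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith('.xml'):
                continue
            f = archive.extractfile(member)
            handle_item(member.name, f.read())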
There are more subtle issues as well:
* U+FFFD, the Unicode replacement character, regularly occurs in natural-language strings. Can you correct these strings?
* File has a .csv ending and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it.
* An XML file whose elements have RTF markup embedded in them. You need to parse the RTF inside the elements, because there is relevant information in there that you need to add to the transformed version.
* Date issues. Inconsistent formats and almost-valid dates.
* Combining data coming from an API with data fetched from ten different servers, to produce a transformed version with a legacy command-line application (which might be slow, so you have to split your data first, parallelize the work, combine it, and make sure it's complete; a rough sketch follows this list).
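For the last item, a rough sketch of the split/parallelize/recombine step around a slow legacy tool; "legacy-transform", its flags, and the chunk file names are all invented:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def transform_chunk(chunk_path):
        out_path = chunk_path + '.out'
        # Stand-in for whatever slow legacy CLI tool you are stuck with.
        subprocess.run(['legacy-transform', chunk_path, '-o', out_path], check=True)
        return out_path

    chunks = ['part-00', 'part-01', 'part-02']  # input split beforehand, e.g. with split(1)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # check=True makes a failed chunk raise here, so incompleteness is noticed.
        outputs = list(pool.map(transform_chunk, chunks))

    # Recombine the per-chunk results into one transformed output.
    with open('combined.out', 'wb') as combined:
        for path in outputs:
            with open(path, 'rb') as part:
                combined.write(part.read())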
I am thinking about a longer article or even a short book about these kinds of data-handling and quality questions and the ways there are to address them. Would you read a book like this, and which topic would be the most pressing or relevant?
Basically, I need to first fetch the metadata on all the samples, and then later group them by treatment based on that metadata. In other words, the structure of later parts of the DAG depends on the results of executing earlier parts of the DAG, so the full structure of the DAG is not known initially. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata and then the main workflow which reads the metadata and builds the DAG based on it. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...
This is a common pattern that I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until part of the way through executing that DAG. Most workflow tools can't handle this gracefully. Another Python DAG executor, called doit, can handle this case, by specifying that some rules should not be evaluated until after others have finished running. But it doesn't have some features that I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.
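If I remember doit's delayed task creation correctly, the pattern looks roughly like this; the metadata URL, the commands, and the file names are invented:

    import json
    from doit import create_after

    def task_fetch_metadata():
        # The early part of the DAG: fetch the sample metadata first.
        return {
            'targets': ['metadata.json'],
            'actions': ['curl -o metadata.json https://example.org/sample-metadata'],
        }

    @create_after(executed='fetch_metadata')
    def task_process_samples():
        # Not evaluated until fetch_metadata has run, so the metadata can be
        # read here to decide what the rest of the DAG looks like.
        with open('metadata.json') as f:
            samples = json.load(f)
        for sample in samples:
            yield {
                'name': sample['id'],
                'file_dep': ['metadata.json'],
                'targets': [sample['id'] + '.counts'],
                'actions': ['process-sample %s -o %s.counts' % (sample['id'], sample['id'])],
            }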
What's often needed for robust systems, instead, is solid support for error handling such that "if this bit doesn't make it in, then neither does that bit." Data is always messy and dirty, and too many ETL systems don't seem architected to cope with that reality.
Of course, maybe I just haven't found the right tools. Anyone know of tools that handle this particularly well?
A poor man's version of multiple pipes is to write intermediate results into files, then "cat" the files as many times as needed for the following processes. I use short file names like "o1" and "o2", standing for output-1 and output-2, and treat them as temp variables.
When I had to do a lot of data processing at my last job, I started building up tools in Ruby. If I had time, I'd hack the workflow so that the next time I needed it, I could just run the tool from the command line.
Eventually I had a pluggable architecture that I could use to pull data from any number of sources and mix it with any other data. Do that with a shell? Why?