Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which also describes a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.)
It has a completely different model, though, and one that I think is more suitable for "big data".
https://scholar.google.com/scholar?cluster=98697598478714306...
http://dl.acm.org/citation.cfm?id=1645175
In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation.
I have a printout of this paper, but unfortunately it doesn't appear to be online :-(
Also, to nitpick, this is more accurately called a directed acyclic graph shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh reads nicer than dgsh too.
What I have found so far: most tools that invent a new language, or try to cram complex processes into ill-suited syntactic environments, are not loved much.
A few people like XSLT; most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big-data orchestration frameworks that use XML as a configuration language, which can be OK if you have clear processing steps.
Every time a tool invents a DSL for data processing, I grab my list of ugly real-world use cases, and most of the tools fail quickly, if not immediately. That's a pity.
Programming languages can be effective as they are, and given the exceptions that unclean data brings, you want a programming language at your disposal anyway.
I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.
Two libraries that have a good grip on the plumbing aspect of data processing systems are Airflow and Luigi. They are Python libraries, and with them you have a concise syntax and basically all Python libraries, plus non-Python tools with a command-line interface, at your fingertips.
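To illustrate the concise syntax, here is a minimal Luigi sketch; the task names, file paths, and contents are made up for illustration:

    import luigi

    # Hypothetical upstream task that produces a raw data file.
    class FetchData(luigi.Task):
        def output(self):
            return luigi.LocalTarget('raw.txt')

        def run(self):
            with self.output().open('w') as out:
                out.write('one line of raw data\n')

    # Downstream task: Luigi derives the dependency graph from requires().
    class CountLines(luigi.Task):
        def requires(self):
            return FetchData()

        def output(self):
            return luigi.LocalTarget('counts.txt')

        def run(self):
            with self.input().open() as infile, self.output().open('w') as outfile:
                outfile.write(str(sum(1 for _ in infile)) + '\n')

Running something like "luigi --module mytasks CountLines --local-scheduler" then executes both tasks in dependency order, skipping anything whose output already exists.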
I am curious: what kind of process orchestration tools do people use and recommend?
We basically did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, bpipe, etc.) to fit the bill. Nextflow came close, but it in fact allows quite a bit of custom code too.
What worked for us was to use Spotify's Luigi, which is a Python library rather than a DSL.
The only thing was that we had to develop an API inspired by flow-based programming on top of Luigi's more functional-programming-based one, in order to make defining dependencies fluent and easy enough to specify for our complex workflows.
Our flow-based-inspired Luigi API for complex workflows, SciLuigi, is available at:
https://github.com/pharmbio/sciluigi
We wrote up a paper on it as well, detailing a lot of the design decisions behind it:
http://dx.doi.org/10.1186/s13321-016-0179-6
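For a feel of the port-style wiring, here is a rough sketch from memory; the class names and file names are invented, and the exact SciLuigi API may differ, so the repo and paper above are the authoritative reference:

    import sciluigi as sl

    # Illustrative tasks; 'Producer' and 'Consumer' are made-up names.
    class Producer(sl.Task):
        def out_data(self):
            return sl.TargetInfo(self, 'data.txt')

        def run(self):
            with open(self.out_data().path, 'w') as f:
                f.write('some data\n')

    class Consumer(sl.Task):
        in_data = None  # in-port, connected below in the workflow definition

        def out_result(self):
            return sl.TargetInfo(self, 'result.txt')

        def run(self):
            with open(self.in_data().path) as infile, open(self.out_result().path, 'w') as out:
                out.write(infile.read().upper())

    class MyWorkflow(sl.WorkflowTask):
        def workflow(self):
            producer = self.new_task('producer', Producer)
            consumer = self.new_task('consumer', Consumer)
            consumer.in_data = producer.out_data  # wire out-port to in-port
            return consumer

The point is that dependencies are declared by connecting named out-ports to in-ports in one place, rather than being hard-coded inside each task.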
Then, lately we have been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could just as well rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has about a third of the LOC of a recent bioinformatics pipeline tool written in Python and put into production). Code available at:
https://github.com/scipipe/scipipe
It is also very much a programming library rather than a DSL.
In fact, it even implements streaming via named pipes, which seems to allow operations somewhat similar to dgsh's, probably with a bit more code, but with the (apparent) benefit of somewhat easier handling of multiple inputs and outputs (via the flow-based programming ports concept).
dgsh looks really interesting for simpler operations where there is one main input and output, though - which occur a lot in ad-hoc shell work, in our experience. Will have to test it out for sure!
I only hope to get time to test it out in some more depth sooner rather than later (it is one of my top goals for 2017).
Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here, but properly identifies separate inputs and outputs declaratively.
Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much interested to see if it could natively fit the bill for our complex workflows. But if not, I think we can always use it in a lightweight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.
We would still like to benefit from the seemingly groundbreaking "git for big data" paradigm, and from auto-executed workflows on updated data, which should enable something as impactful as online data analyses (auto-updated upon new data) in a manageable way.
In this case the task resource http://converge.aster.is/0.5.0/resources/task/ might help, as it allows you to create a directed graph using any kind of interpreter (for example, Python or Ruby) instead of having to use the DSL.
Unfortunately I haven't found anything, so for our use cases in bioinformatics, I basically took an example workflow that was used in a course on next-gen sequencing analysis as a starting point:
https://github.com/NBISweden/workflow-tools-evaluation/tree/...
I have only partly implemented it in Common Workflow Language [1] and SciPipe [2] so far ... the implementation turned out to take a tremendous amount of work :P
I'd be much interested if anyone has found or created a more general set of such example workflows.
Me too, for better or for worse.
As for the issues, there are many. Just quickly a few:
* Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?
* Data provider has a good, strict XML schema, but the relevant information for a single item is spread across three files inside a tar archive. Since there are 500k files inside the archive, you'd rather not extract it, but process it on the fly (see the streaming sketch after this list).
* Data provider chooses a layout that saves every item as a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?
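For the tar case, Python's tarfile module can at least stream the members without extracting anything to disk. A minimal sketch, where the archive name, the member filter, and handle_item are all placeholders:

    import tarfile

    def handle_item(name, xml_bytes):
        # Placeholder for the real per-item processing (e.g. grouping the
        # three related files per item by a shared key).
        print(name, len(xml_bytes))

    # 'r|gz' opens the archive in streaming mode: members are read strictly
    # sequentially, so nothing is extracted to disk and memory use stays per-item.
    with tarfile.open('items.tar.gz', mode='r|gz') as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith('.xml'):
                continue
            f = archive.extractfile(member)
            handle_item(member.name, f.read())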
There are more subtle issues as well:
* U+FFFD, the Unicode replacement character, regularly occurs in natural-language strings. Can you correct these strings?
* File has a .csv ending and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it.
* An XML file whose elements have RTF markup embedded in them. You need to parse the RTF inside the elements, because there is relevant information in there that you need to add to the transformed version.
* Date issues. Inconsistent formats and almost-valid dates.
* Combining data coming from an API with data fetched from ten different servers, to produce a transformed version with a legacy command-line application (which might be slow, so you have to split your data first, parallelize the work, combine it, and make sure it's complete; a rough sketch follows this list).
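For the last item, a rough sketch of the split/parallelize/recombine step around a slow legacy tool; "legacy-transform", its flags, and the chunk file names are all invented:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def transform_chunk(chunk_path):
        out_path = chunk_path + '.out'
        # Stand-in for whatever slow legacy CLI tool you are stuck with.
        subprocess.run(['legacy-transform', chunk_path, '-o', out_path], check=True)
        return out_path

    chunks = ['part-00', 'part-01', 'part-02']  # input split beforehand, e.g. with split(1)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # check=True makes a failed chunk raise here, so incompleteness is noticed.
        outputs = list(pool.map(transform_chunk, chunks))

    # Recombine the per-chunk results into one transformed output.
    with open('combined.out', 'wb') as combined:
        for path in outputs:
            with open(path, 'rb') as part:
                combined.write(part.read())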
I am thinking about a longer article or even a short book about these kinds of data-handling and quality questions and the ways there are to address them. Would you read a book like this, and which topic would be the most pressing or relevant?
Basically, I need to first fetch the metadata on all the samples, and then later group them by treatment based on that metadata. In other words, the structure of later parts of the DAG depends on the results of executing earlier parts of the DAG, so the full structure of the DAG is not known initially. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata and then the main workflow which reads the metadata and builds the DAG based on it. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...
This is a common pattern that I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until part of the way through executing that DAG. Most workflow tools can't handle this gracefully. Another Python DAG executor, called doit, can handle this case, by specifying that some rules should not be evaluated until after others have finished running. But it doesn't have some features that I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.
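If I remember doit's delayed task creation correctly, the pattern looks roughly like this; the metadata URL, the commands, and the file names are invented:

    import json
    from doit import create_after

    def task_fetch_metadata():
        # The early part of the DAG: fetch the sample metadata first.
        return {
            'targets': ['metadata.json'],
            'actions': ['curl -o metadata.json https://example.org/sample-metadata'],
        }

    @create_after(executed='fetch_metadata')
    def task_process_samples():
        # Not evaluated until fetch_metadata has run, so the metadata can be
        # read here to decide what the rest of the DAG looks like.
        with open('metadata.json') as f:
            samples = json.load(f)
        for sample in samples:
            yield {
                'name': sample['id'],
                'file_dep': ['metadata.json'],
                'targets': [sample['id'] + '.counts'],
                'actions': ['process-sample %s -o %s.counts' % (sample['id'], sample['id'])],
            }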
What's often needed for robust systems, instead, is solid support for error handling such that "if this bit doesn't make it in, then neither does that bit." Data is always messy and dirty, and too many ETL systems don't seem architected to cope with that reality.
Of course, maybe I just haven't found the right tools. Anyone know of tools that handle this particularly well?
A poor man's version of multiple pipes is to write intermediate results into files, then "cat" the files as many times as needed for the following processes. I use short file names like "o1" and "o2", standing for output-1 and output-2, and treat them as temp variables.
When I had to do a lot of data processing at my last job, I started building up tools in Ruby. If I had time, I'd hack the workflow so that the next time I needed it, I could just run the tool from the command line.
Eventually I had a pluggable architecture that I could use to pull data from any number of sources and mix it with any other data. Do that with a shell? Why?