Pypeline: A Python library for creating concurrent data pipelines | Better HN

46 comments


anentropic7y ago· 18 in thread

Too much abbreviation!

pypeline --> pypeln

multiprocessing pipeline --> pr

threads pipeline --> th

asyncio pipeline --> io

this is totally unnecessary

If I want to use short abbreviated names in my code I can always `from pypeline import multiprocess_pipeline as pr`

Your library shouldn't export them like this as the default.

`io` is especially bad since this overshadows the `io` module in the Python stdlib
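A small sketch of both points, using only the standard library. The replacement module below is a hypothetical stand-in for a third-party submodule named `io`, not pypeln's actual module.

```python
import io
import multiprocessing as mp  # the user picks the short alias, not the library
import types

stdlib_io = io  # keep a handle on the real standard-library module

# Hypothetical stand-in for a third-party submodule that is also called "io".
# Doing `from some_library import io` rebinds the name in exactly this way.
io = types.ModuleType("io")

print(mp.__name__)                     # multiprocessing
print(hasattr(stdlib_io, "StringIO"))  # True
print(hasattr(io, "StringIO"))         # False: the stdlib module is now hidden
```

After the rebinding, any later `io.StringIO(...)` call in the same file fails with an `AttributeError`, which is the kind of surprise a descriptive submodule name avoids.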

ricw7y ago

Not a fan of this. You save a couple of chars and massively decrease legibility. Anyone new to the project will find it way harder to understand. Instead of abbreviation hell, just spell it out and move on. Not worth saving a couple of characters.

True masters of naming make sure to use variable names that are long but unambiguous after the first several characters, so that with tab-completion you get the best of both worlds.

cgarciaeOP7y ago

Point taken! Thanks a lot for your feedback. Just a few points:

* pypeline is already taken :(

* My main reason for this was because initially I was thinking that you did an `import pypeln as pl` and then called things like e.g. `pl.pr.map`; since you can't abbreviate the module inside `pl`, I picked short names, but then I decided to go for an import-the-module kind of strategy.

I am thinking about expanding the module names to their worker names:

* pr --> process

* th --> thread

* io --> task

And then have the conventions:

* from pypeln import process as pr

* from pypeln import thread as th

* from pypeln import task as io # as ta?

This conversation is very valuable, thank you all for the feedback.

anentropic7y ago

Hi, just coming back to say congrats on your library and I wish you all the best with it :)

I see your reasoning here (`import pypeln as pl`) but I still think where you have submodules you should use unabbreviated words for their names.

For me I'd be happy with `pl.process.map` in my code, but `pl.pr.map` feels a bit too obscure to have as the default.

These things are quite subjective of course, but part of that subjective judgement comes from the experience of what is commonly done in other Python libraries (the stdlib is a bit of a mixed bag in this regard unfortunately, riddled with CamelCase and other abominations).

bb887y ago

First of all, it's super cool. :)

There are a lot of "pipe" projects on PyPI, but your project is also about process management. Maybe you should avoid "pipe" in your name? FlowProcessor? nFlow? xFlow?

I do agree that you should avoid io for asyncio. You should probably at least use aio, but there's no reason you can't have asyncio_task, thread_task, multiprocessing_task.

Lastly, in my mind the killer app for this would be to allow something that works on top of Celery in production, but then be able to fall back to say multiprocessing or threading when running locally. That would allow me to prototype something, and then when I want to scale, I can just change a config setting.

cgarciaeOP7y ago

Hey, thanks for all the feedback. I will change the naming since it's something most of you have agreed is a good change.

The goal I have for Pypeline is much simpler: let you easily set up data pipelines where you leverage processes, threads and asyncio for what they are good at. So in my mind a killer app would be a pipeline that maybe starts with an asyncio stage for e.g. downloading images, maybe then a multiprocess stage for e.g. doing image processing, and finally a threading stage for e.g. interacting with the OS.

Right now I see Pypeline more as an easy to use single machine tool instead of a higher level distributed abstraction like Celery. Maybe other frameworks could leverage Pypeline to ease their work.
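For comparison, here is roughly what the download / process / save pipeline described above looks like with only the standard library. This is a sketch and not Pypeline's API: the functions `fetch`, `process` and `save` are made-up stand-ins, the stages run one after another rather than streaming, and a thread pool stands in for the CPU-bound stage (`concurrent.futures.ProcessPoolExecutor` offers the same `map` interface for top-level, picklable functions).

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(item_id):
    """Stand-in for an I/O-bound download."""
    await asyncio.sleep(0.01)
    return item_id


async def fetch_all(ids):
    # gather() returns results in the same order as its arguments
    return await asyncio.gather(*(fetch(i) for i in ids))


def process(x):
    """Stand-in for CPU-bound work such as image processing."""
    return x * x


def save(x):
    """Stand-in for interacting with the OS, e.g. writing a file."""
    return f"saved {x}"


def run(ids):
    downloaded = asyncio.run(fetch_all(ids))              # asyncio stage
    with ThreadPoolExecutor(max_workers=2) as pool:
        processed = list(pool.map(process, downloaded))   # compute stage
        return list(pool.map(save, processed))            # OS stage


print(run([1, 2, 3]))
```

Each stage needs its own setup and hand-off code here, which is the boilerplate a pipeline library is meant to remove.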

PurpleRamen7y ago

> My main reason for this was because initially I was thinking that you did an `import pypeln as pl` and then called things like e.g. `pl.pr.map`; since you can't abbreviate the module inside `pl`, I picked short names

"Assumptions are the root of all evil."

With autocomplete a coder has no reason to use shortnames anyway.

neuromantik80867y ago

"import pypeln as pl" could cause quite a bit of confusion in Poland I would imagine.

bb887y ago

And maybe perl...

cgarciaeOP7y ago

jajajajaja

dajohnson897y ago

At risk of speaking from ignorance (I don't really know Python), isn't this succinctness an aim of Python? A lot of Python enthusiasts I speak with decry the verbosity of Java and point to how much less code it takes to do the same thing in Python.

trymas7y ago

Python has a famous adage: `Explicit is better than implicit`.

You want to be explicit but still concise while writing python.

I do not have any experience with Java, but I guess when writing Java you can feel that you use more words than needed (i.e. the definition of verbose [0]). Wikipedia has a hello world example [1] and it feels just heavy.

IMHO if you write pythonic code, very often it feels like writing/reading prose, seems like talking to computer. You are not too implicit, neither too verbose.

[0] https://www.merriam-webster.com/dictionary/verbose

[1] https://en.wikipedia.org/wiki/Java_(programming_language)#"H...

agumonkey7y ago

I'm guilty of this, because words are so meaningless to me. I'd rather have a single letter or glyph than ~clear yet ambiguous words.

dec0dedab0de7y ago

It's really more about readability. In this case threads_pipeline is much easier to understand than th. Descriptive names also mean needing fewer comments.

anentropic7y ago

The succinctness of Python (such as there is) comes from the syntax itself. On the other hand, obscure variable names are discouraged. It's not a 'code golf' language. The aim is clarity and readability - unobscured by either verbose boilerplate or excessive brevity.

A library author should provide names which are descriptive and clear. Users can then abbreviate them however much or little as we choose (by import aliasing). For example it is very common to see `import numpy as np`... but you wouldn't want them to publish the library as `np`. It should have its proper name.

One reason for this is if I'm exploring code in a REPL or IDE with tab-completion. You want to have some idea what a module is for, without having to play 'guess the abbreviation'.

rcthompson7y ago

And it's especially unnecessary since Python already supports abbreviation during import. It could just as easily be "from pypeline import process as pr" or something similar.

sebcat7y ago

Agreed. When it comes to naming things, I really like the way Russ Cox puts it: https://research.swtch.com/names

thenaturalist7y ago

Absolutely agree with this.

adamcharnock7y ago· 3 in thread

    Pypeline was designed to solve simple medium 
    data tasks that require concurrency 
    and parallelism but where using frameworks 
    like Spark or Dask feel exaggerated or unnatural.

This is exactly what I was looking for very recently. Thank you for writing this, I'll certainly look into it.

marcyb5st7y ago

What about Apache Beam? Getting started with the Python SDK has been very easy IMHO. Also, you are future proof as you can easily switch runner from Local to Dataflow/Flink/...

cgarciaeOP7y ago

I use BEAM for my Dataflow jobs. But their local "DirectRunner" is just for testing purposes. As with Spark, BEAM is a huge beast. Pypeline was created with simplicity in mind; it's a pure Python library with no dependencies.

But not future proof in having to use python 2.7 unfortunately.

roel_v7y ago· 3 in thread

None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with all possible permutations, depending on which type of dimension is actually relevant for that target. Does any ETL framework support that?

(I was actually just writing a spec for a new tool that does just this this afternoon because I can't find anything suitable)
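The expansion being asked for can be sketched in plain Python with `itertools.product`. The dimension names, their values and the `expand` helper are hypothetical and only illustrate the idea of running a target once per combination of the dimensions it cares about.

```python
from itertools import product

# Hypothetical dimensions of the data set.
DIMENSIONS = {
    "country": ["usa", "canada"],
    "year": [2017, 2018],
}


def expand(relevant):
    """Yield one keyword-argument dict per combination of the relevant dimensions."""
    names = list(relevant)
    for values in product(*(DIMENSIONS[name] for name in names)):
        yield dict(zip(names, values))


# A target that only depends on the country runs twice ...
print(list(expand(["country"])))
# ... while one that depends on country and year runs four times.
print(list(expand(["country", "year"])))
```

Each yielded dict can then be passed to the target function as `target(**kwargs)`.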

memoir2comment7y ago

snakemake does this trivially:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country=country'

    rule analyze_target_countries:
        input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt']

elsherbini7y ago

Small change: you have to use wildcards.country inside the shell call:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country={wildcards.country}'

cgarciaeOP7y ago

Sounds like a group_by and then applying a function per group? Pypeline doesn't have grouping, but it sounds like Spark or Dask should do the job.

bayesian_horse7y ago· 3 in thread

Dask is relatively lightweight actually, because it is pure Python.

Also, there is "Streamz" which solves a similar problem, seems more mature and can work with or without Dask or Dask-Distributed.

cgarciaeOP7y ago

Dask might be lightweight internally but resorting to it just to solve a simple task that requires concurrency is not "simple".

Streamz looks nice! However:

"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."

Apparently you can set it up to use Dask to escape the single thread, but that is kind of a global config. With Pypeline you can mix and match between using Processes, Threads, and asyncio.Tasks where it makes sense; resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading and asyncio modules, Pypeline will save you tons of time.

Still, I will keep an eye on Streamz. It's very nice work with lots of features, and it should get more visibility.

bayesian_horse7y ago

Pypeline doesn't seem that simple itself.

unethical_ban7y ago

Why not? It seems to be a boilerplate remover for simple parallel processing tasks.

TBastiani7y ago· 2 in thread

mpipe might also be of interest.

http://vmlaker.github.io/mpipe/

cgarciaeOP7y ago

Thanks! I did take a look at mpipe (it's actually referenced in the readme). But mpipe has its flaws:

1. It uses None as the stage terminator, which is VERY error prone: what if you actually want to send None? Pypeline uses a special private terminator.

2. You have to first manually put all the data into the pipe in a for-loop and then manually get it out. In Pypeline all this is simplified: it consumes iterables and all stages are iterables, so it's 100% compatible with any function/framework that accepts iterables.

cgarciaeOP7y ago

3. If you have to put in all the data first, you can't handle streams. Pypeline accepts non-terminating iterables and also gives you back possibly non-terminating iterables.
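The three points above can be illustrated with a minimal threaded stage built on the standard library. This is a sketch of the general technique, not Pypeline's implementation; `threaded_map` and `_DONE` are made-up names, and error handling inside the worker is left out.

```python
import itertools
import queue
import threading

_DONE = object()  # private sentinel: cannot collide with user data such as None


def threaded_map(fn, iterable, maxsize=4):
    """Apply fn to each item in a background thread and yield results lazily."""
    out = queue.Queue(maxsize=maxsize)

    def worker():
        for item in iterable:
            out.put(fn(item))  # blocks while the consumer is behind
        out.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        result = out.get()
        if result is _DONE:
            break
        yield result


# None passes through as ordinary data.
print(list(threaded_map(lambda x: x, [1, None, 3])))  # [1, None, 3]

# A non-terminating input still works, because results are pulled on demand.
doubled = threaded_map(lambda x: x * 2, itertools.count())
print(list(itertools.islice(doubled, 3)))  # [0, 2, 4]
```

Because the stage both consumes and returns a plain iterable, stages like this can be chained or handed to anything that accepts an iterable.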

somewhatoff7y ago· 1 in thread

I wonder if you might compare this to Bonobo [https://www.bonobo-project.org/] which I think has similar design goals?

Pypeline is a library you use in your code, while Bonobo seems to be a framework that uses your code. I tend to think that you lose flexibility with the latter.

chrisjc7y ago· 1 in thread

Seems like a good time to link to this curated list of pipeline toolkits (not all python).

https://github.com/pditommaso/awesome-pipeline/blob/master/R...

neuromantik80867y ago

This too:

https://github.com/common-workflow-language/common-workflow-...

Also, whenever these conversations about flow-based / pipelining tools come up, I always like to point people to Common Workflow Language, to remind them that there is an attempt at standardizing workflow descriptions so that they can be used with different packages:

https://www.commonwl.org/

make37y ago· 1 in thread

Looks similar to what tf.data does for Tensorflow

cgarciaeOP7y ago

Nice to hear that. When I wrote this:

"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."

it was actually because I've resorted / hacked into tf.data and Dask in the past just to get concurrency and parallelism. Pypeline is way more natural for pure python stuff.

elsherbini7y ago

Snakemake [0] is a tool worth checking out. You can use it to create declarative workflows, and similar to make, it creates a DAG of dependencies when you give it your desired output. Each rule can specify how many threads it needs and other arbitrary resources and the scheduler uses that to constrain execution. Workflows are architecture independent - you should be able to execute a snakemake workflow on a laptop, in the cloud, or on an HPC cluster.

It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].

[0] https://snakemake.readthedocs.io/en/stable/index.html

[1] https://snakemake.readthedocs.io/en/stable/snakefiles/rules....

Speaking from my experience building similar pipelining and reverse polish function application tooling in Python:

Piping using the | operator can make tracebacks pretty ugly with some operators.

If you want to keep the code somewhat 'pythonic' without introducing the syntax magic of |, then instead of this:

  range(10)
  | pp.flatmap(lambda x: [x + 1, x + 2])
  | pp.map(lambda x: x * x)
  ...

You can do this instead:

  xs = range(10)
  xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
  xs = pp.map(xs, lambda x: x * x)
  ...

It helps to keep the operand as first argument, instead of last, because those lambdas are best kept at the end.

So instead of

  map(fn, xs)

do

  map(xs, fn)
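A runnable sketch of the operand-first style, assuming two small helpers written for illustration (`flatmap` and `map_` are not from any particular library):

```python
from itertools import chain


def flatmap(xs, fn):
    """Apply fn to each item and flatten the resulting iterables lazily."""
    return chain.from_iterable(fn(x) for x in xs)


def map_(xs, fn):
    """Operand-first map: the data comes first, the function last."""
    return (fn(x) for x in xs)


xs = range(3)
xs = flatmap(xs, lambda x: [x + 1, x + 2])
xs = map_(xs, lambda x: x * x)
print(list(xs))  # [1, 4, 4, 9, 9, 16]
```

Every step is an ordinary function call, so a failure inside a lambda shows up in the traceback without operator-overloading frames in between.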

timkpaine7y ago

Similar to a library I've been working on as well: https://github.com/timkpaine/tributary

Wow. It seems to save a lot of boilerplate code for ETL.
