pypeline --> pypeln
multiprocessing pipeline --> pr
threads pipeline --> th
asyncio pipeline --> io
this is totally unnecessary
If I want to use short abbreviated names in my code I can always `from pypeline import multiprocess_pipeline as pr`
Your library shouldn't export them like this as the default.
`io` is especially bad since this overshadows the `io` module in the Python stdlib
I am thinking about expanding the module names to their worker names: * pr --> process * th --> thread * io --> task
And then have the conventions * from pypeln import process as pr * from pypeln import thread as th * from pypeln import task as io # as ta?
This conversation is very valuable, thank you all for the feedback.
I see your reasoning here (`import pypeln as pl`) but I still think where you have submodules you should use unabbreviated words for their names.
For me I'd be happy with `pl.process.map` in my code, but `pl.pr.map` feels a bit too obscure to have as the default.
These things are quite subjective of course, but part of that subjective judgement comes from the experience of what is commonly done in other Python libraries (the stdlib is a bit of a mixed bag in this regard unfortunately, riddled with CamelCase and other abominations).
There are a lot of "pipe" projects in PyPi, but your project is also about process management. Maybe you should avoid "pipe" in your name perhaps? FlowProcessor? nFlow? xFlow?
I do agree that you should avoid io for asyncio. You should probably at least use aio, but there's no reason you can't have asyncio_task, thread_task, multiprocessing_task.
Lastly, in my mind the killer app for this would be to allow something that works on top of Celery in production, but then be able to fall back to say multiprocessing or threading when running locally. That would allow me to prototype something, and then when I want to scale, I can just change a config setting.
The goal I have for Pypeline is much simpler: let you easily setup data pipelines where you leverage processes, threads and asyncio where they are good at. So in my mind a killer app would be a pipeline that maybe starts with an asyncio stage for e.g. downloading images, maybe then a multiprocess stage for e.g. doing image processing, and finally a threading stage for e.g. interacting with the OS.
Right now I see Pypeline more as an easy to use single machine tool instead of a higher level distributed abstraction like Celery. Maybe other framework could leverage Pypeline to ease their work.
"Assumptions are the root of all evil."
With autocomplete a coder has no reason to use shortnames anyway.
You want to be explicit but still concise while writing python.
I do not have any experience with Java, but I guess when writing Java you can feel that you use too many words than needed (i.e. definition of verbose [0]), wikipedia has a hello world example[1] and it feels just heavy.
IMHO if you write pythonic code, very often it feels like writing/reading prose, seems like talking to computer. You are not too implicit, neither too verbose.
[0] https://www.merriam-webster.com/dictionary/verbose
[1] https://en.wikipedia.org/wiki/Java_(programming_language)#"H...
A library author should provide names which are descriptive and clear. Users can then abbreviate them however much or little as we choose (by import aliasing). For example it is very common to see `import numpy as np`... but you wouldn't want them to publish the library as `np`. It should have its proper name.
One reason for this is if I'm exploring code in a REPL or IDE with tab-completion. You want to have some idea what a module is for, without having to play 'guess the abbreviation'.
Pypeline was designed to solve simple medium
data tasks that require concurrency
and parallelism but where using frameworks
like Spark or Dask feel exaggerated or unnatural.
This is exactly what I was looking for very recently. Thank you for writing this, I'll certainly look into it.(I was actually just writing a spec for a new tool that does just this this afternoon because I can't find anything suitable)
rule analyze_country:
input: 'whatever.{country}.txt'
output: 'analysis.{country}.txt'
shell:
'run-analysis-on-country {input} {output} --country=country'
rule analyze_target_countries:
input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt'] rule analyze_country:
input: 'whatever.{country}.txt'
output: 'analysis.{country}.txt'
shell:
'run-analysis-on-country {input} {output} --country={wildcards.country}'Also, there is "Streamz" which solves a similar problem, seems more mature and can work with or without Dask or Dask-Distributed.
Streamz looks nice! However:
"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."
Apparently you can set it up to use Dask to escape the single threads but that is kind of a global config. With Pypeline you can mix and match between using Processes, Threads, and asyncio.Tasks where it makes sense, resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading and asyncio modules, Pypeline will save you tons of time.
Still, will keep an eye on Streamz, its a very nice work, lots of features, it should get more visibility.
1. It uses None as the stage terminator, this is VERY error prone, what if you actually want to send None? Pypeline uses a special private terminator.
2. You have to first manually put all the data into the pipe in a for-loop and then manually get it out. In Pypeline all this is simplified: it consumes iterables and all stages are iterables, so its 100% compatible with any function/framework that accepts iterables.
https://github.com/pditommaso/awesome-pipeline/blob/master/R...
https://github.com/common-workflow-language/common-workflow-...
Also, whenever these conversation of flow-based / piplining tools come up, I always like to point people to Common Workflow Language to remind people that there is an attempt at standardizing workflow descriptions so that they can be used with different packages:
"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."
it was actually because I've resorted / hacked into tf.data and Dask in the past just to get concurrency and parallelism. Pypeline is way more natural for pure python stuff.
It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].
[0] https://snakemake.readthedocs.io/en/stable/index.html
[1] https://snakemake.readthedocs.io/en/stable/snakefiles/rules....
Piping using the | operator can make tracebacks pretty ugly with some operators.
If you want to keep the code still somewhat 'pythonic' without introducing the syntax magic using |, you can do it similarly:
range(10)
| pp.flatmap(lambda x: [x + 1, x + 2])
| pp.map(lambda x: x * x)
...
You can do this instead: xs = range(10)
xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
xs = pp.map(xs, lambda x: x * x)
...
It helps to keep the operand as first argument, instead of last, because those lambdas are best kept at the end.So instead of
map(fn, xs)
do map(xs, fn)