In particular:
* Notebooks store code and data together, which is very messy if you want to look at [only] code history in git.
* It's hard to turn a notebook into an assertive test.
* Converting a notebook function into a python module basically involves cutting and pasting from the notebook into a .py file.
These must be common issues for anyone working in this area. Are there any guides on best practices for bridging from notebooks to applications?
Ideally I'd want to build a python application that's managed via git, but some modules / functions are lifted exactly from notebooks.
The main point of friction is that the "default" format for storing notebooks is not valid, human readable python code, but an unreadable json mess. The situation would be much better if a notebook was stored as a python file, with code cells verbatim, and markdown cells inside python comments with appropriate line breaking. That way, you could run and edit notebooks from outside the browser, and let git track them easily. Ah, what a nice world would that be.
But this is exactly the world we already live in, thanks to jupytext!
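To make the idea concrete, here is a minimal sketch of rendering a notebook as a plain Python file, with code cells verbatim and markdown cells as comments. It is not jupytext's implementation; it only assumes the usual nbformat 4 layout of an .ipynb file (a "cells" list with "cell_type" and "source"), and the `# %%` markers are modelled on the percent format that jupytext supports. The function name and the example notebook are made up for illustration.

```python
def notebook_to_script(nb: dict) -> str:
    """Render a notebook dict (nbformat 4 layout assumed) as a Python script."""
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # nbformat stores the source either as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if cell.get("cell_type") == "code":
            # Code cells are kept verbatim under a cell marker.
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            # Markdown cells become comment lines so the file stays valid Python.
            commented = "\n".join(("# " + line).rstrip() for line in text.splitlines())
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook, as json.load() would return it from an .ipynb file.
example_nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some notes."]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1, "outputs": [],
         "source": ["x = 1 + 1\n", "print(x)"]},
    ],
}

script = notebook_to_script(example_nb)
print(script)
```

The result is an ordinary .py file that git can diff line by line and that any editor can open.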
https://github.com/nnicandro/emacs-jupyter
I'm not a great fan of notebooks though, I keep using the REPL with X forwarding for matplotlib, sided with a code editor.
Pipe into pandoc, prepend some css, optionally a mathjax header, done. Beautiful reports.
Honestly I've yet to be convinced there's good reason for anything more than this.
1) This is painful. There are tools to help, but the most effective means I've found are having a policy to only commit notebooks in a reset, clean state (enforced with githook).
2) I don't understand. I've written full testing frameworks for applications as notebooks as a means of having code documentation that enforced/tested the non-programmatic statements in the document. Using tools like papermill (https://papermill.readthedocs.io/en/latest/), you can easily write a unit test as a notebook with a whole host of documentation around what it's doing, execute, and inspect the result (failed execution vs. final state of the notebook vs. whatever you want)
3) Projects like ipynb (https://ipynb.readthedocs.io/en/stable/) allow you to import notebooks as if they were python modules. Some projects have different opinions of what that means to match different use cases. Papermill allows you to have an interface with a notebook that is more like a system call than importing a module. I've personally used papermill and ipynb and found both enjoyable for different flavors of blending applications and notebooks.
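For point 1, a rough sketch of what "reset, clean state" could mean in practice: blank out the outputs and execution counts of every code cell before committing. This assumes the nbformat 4 JSON layout; the function names are made up for illustration, and a real git hook would load the .ipynb with `json.load` and reject the commit when the check fails.

```python
import copy


def strip_outputs(nb: dict) -> dict:
    """Return a copy of the notebook with all code-cell outputs removed."""
    clean = copy.deepcopy(nb)
    for cell in clean.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


def is_clean(nb: dict) -> bool:
    """True if no code cell carries outputs or an execution count."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            return False
    return True


# Hypothetical notebook that has been executed once.
dirty = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": ["print('hi')"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}]},
        {"cell_type": "markdown", "metadata": {}, "source": ["Some notes."]},
    ],
}

print(is_clean(dirty))                 # the executed notebook is not clean
print(is_clean(strip_outputs(dirty)))  # the stripped copy is
```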
(Caveat that Jupyter is way better with e.g. Julia, in my (limited) experience)
Especially if it deals with multimedia, can just blit images or audio or HTML applications inline.
And it’s fairly trivial to go from Jupyter Notebook -> Python file once you’re done.
As the joke goes: The best thing about R is that it's designed by statisticians. The worst thing about R is that it's designed by statisticians.
To summarize: I think notebooks are great for newcomers. It requires more maturity to appreciate more principled programming.
Failing that, I think fast.ai's nbdev[0] is probably the most persuasive attempt at making notebooks a useable platform for library/application development. Netflix also has reported[1] substantial investment in notebooks as a development platform, and open-sourced many/most of their tools.
[0]: https://nbdev.fast.ai
[1]: https://netflixtechblog.com/notebook-innovation-591ee3221233
Notebooks are essential for the EDA and early prototyping stages but all data scientists should be enough "software engineer" to get their code out of their notebook and into a reusable library/package of tools shared with engineering.
On the best teams I've worked on, the hand-off between DS and engineering is not a notebook; it's a pull request, with code review from engineers. Data scientists must put their models in a standard format in a library used by engineering, they must create their own unit tests, and they must be subject to the same code review that an engineer would be. This last step is important: my experience is that many data scientists, especially those coming from academic research, are scared of writing real code. However, after a few rounds of getting helpful feedback from engineers, they quickly learn how to write much better code.
This process is also essential because if you are shipping models to production, you will encounter bugs that require a data scientist to fix and that an engineer cannot solve alone. If the data scientists aren't familiar with the model part of the code base, this process is a nightmare, as you have to ask them to dust off questionable notebooks from months or years ago.
There are lots of parts of the process of shipping a model to production that data scientists don't need to worry about, but they absolutely should be working as engineers at the final stage of the hand-off.
Browsing code, underlying library imports and associated code, type hinting, error checking, etc., are so vastly superior in something like PyCharm that it is really hard to see why one would give it all up to work in a notebook, unless they never matured their skillsets enough to see the benefits afforded by a more powerful IDE. I think notebooks can have their place and are certainly great for documenting things with a mix of Markdown, LaTeX and code, as well as for tutorials that someone else can directly execute. And some of the interactive widgets can also make for nice demos when needed.
Notebooks also make for poor habits often times and as you mentioned, having data scientists and ML engineers write code as modules or commit them via pull-requests helps them grow into being better software engineers which in my experience is almost a necessity to be a good and effective data scientist and ML engineer.
And lastly, version controlling notebooks is such a nightmare. Nor is it conducive to code reviews.
Learned this the hard way after working for a group for awhile with a single shared notebook I had nicknamed "The wall of madness".
Atom (editor) + Hydrogen (Atom plugin). I like Hydrogen over more notebook-like plugins that exist for VSCode because it's nothing extra (no 'cells') beyond executing the line under your cursor/selection.
Then i just start coding, executing/testing, refactoring, moving functions to separate files, importing, call my own APIs.. rinse repeat.
I tend to maintain 3 'types' of .py files.
1. first class python modules - the refactored and nicely packaged re-usable code from all my tinkering
2. workspace files - these are my working files. I solve problems here. it gets messy, and doesn't necessarily execute top to bottom properly (i'm often highlighting a line and running just it, in the middle of the file)
3. polished workspaces - once i've solved a problem ("pull all the logs from this service and compute average latency, print a table"), i take the workspace file and turn it into a script that executes top to bottom so i can run it in any context.
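As an illustration of the third type, here is a sketch of what a "polished workspace" for the latency example might look like: small functions plus a `main()` so the file runs top to bottom in any context. The log format and the sample lines are invented for the example; a real script would pull the logs from the service instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log lines in a made-up "METHOD PATH LATENCY_MS" format.
SAMPLE_LOGS = [
    "GET /users 120",
    "GET /users 80",
    "POST /orders 300",
]


def parse_line(line):
    """Split one log line into (endpoint, latency in ms)."""
    method, path, latency_ms = line.split()
    return f"{method} {path}", float(latency_ms)


def average_latency(lines):
    """Group latencies by endpoint and average them."""
    grouped = defaultdict(list)
    for line in lines:
        endpoint, latency = parse_line(line)
        grouped[endpoint].append(latency)
    return {endpoint: mean(values) for endpoint, values in grouped.items()}


def format_table(averages):
    """Render the averages as a fixed-width text table."""
    header = f"{'endpoint':<15}{'avg ms':>10}"
    rows = [f"{endpoint:<15}{avg:>10.1f}" for endpoint, avg in sorted(averages.items())]
    return "\n".join([header] + rows)


def main():
    print(format_table(average_latency(SAMPLE_LOGS)))


if __name__ == "__main__":
    main()
```

Once the functions settle down, they are the natural candidates to move into a first-class module (type 1).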
There are no guides that I'm aware of. Part of the reason may be a mild "culture" divide between casual and professional programmers, for lack of better terms. Any HN thread about "scientific" programming will include some comments to the effect that we should just leave programming to the pros.
My advice is to immerse yourself in the actual work environment of the casual programmers: Observe how we work, what pressures and obstacles we face, what makes our domain unique, and so forth. Figure out what solutions work for the people in the trenches. My team hired an experienced dev, and I asked him specifically to help me with this. One thing I can say for sure is that practical measures will be incremental -- ways that we can improve our code on the fly. They will also have to recognize a vast range of skills, ranging from raw beginners to coders with decades of experience (and habits).
Jot down what you learn, and share it. I think our side of the cultural divide needs help, and would welcome some guidance.
Are you aware of https://software-carpentry.org/? It started after I graduated and I knew people who were involved with it at the time. It seemed like a good idea.
Excel has gotten more people to write code than all other programming environments together. And they’ve often enjoyed doing it. It’s a fantastic success story.
- Papermill is a great tool when setting up a scheduled notebook and then shipping the output to S3: https://papermill.readthedocs.io/en/latest/
- When turning notebooks into more user-facing prototypes, I've found Streamlit is excellent and easy-to-use. Some of these prototypes have stuck around as Streamlit apps when there's 1-3 users who need to use them regularly.
- Moving to full-blown apps is much tougher and time-consuming.
Notebooks do not have to be stored in ipynb form; I would suggest looking at https://github.com/mwouts/jupytext. The notebook UI is also inherently not designed for multi-file and application development, so training humans will always be necessary.
Technically, Jupyter Notebook does not even care that notebooks are files: you could save them using, say, Postgres (https://github.com/quantopian/pgcontents), and even sync content between notebooks.
I'm not too well informed anymore on this particular topic, but there are other folks at https://www.quansight.com/ that might be more aware, you can also ask on discourse.jupyter.org, I'm pretty sure you can find threads on those issues.
I think on the Jupyter side we could do a better job curating and exposing many tools to help with that, but there are just so many hours in the day...
I also recommend "I don't like notebooks" by Joel Grus, https://www.youtube.com/watch?v=7jiPeIFXb6U. It's a really funny talk. A lot of the points are IMHO invalid, as Joel is misinformed on how things can be configured, but it's still a great watch.
I'd have thought there would be some things you could strongly encourage:
1. Come up with some standard format where the code and the data live in separate files.
2. Come up with some standard format where you can load a regular .py script as a cell-based notebook using metadata comments (and save it again).
If these came out of the box it would solve most of the issues.
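The second point can be sketched in a few lines: treat comment lines starting with `# %%` as cell boundaries, load the script into a list of cells, and write it back unchanged. The `# %%` marker is an assumption modelled on the convention several editors and jupytext use; the function names and the cell dict layout are invented for this example and are not any tool's actual API.

```python
def split_cells(script: str) -> list:
    """Split a script into cells at lines that start with '# %%'."""
    cells = []
    current = None
    for line in script.splitlines():
        if line.startswith("# %%"):
            kind = "markdown" if "[markdown]" in line else "code"
            current = {"cell_type": kind, "lines": []}
            cells.append(current)
            continue
        if current is None:
            # Text before the first marker becomes an implicit code cell.
            current = {"cell_type": "code", "lines": []}
            cells.append(current)
        current["lines"].append(line)
    return cells


def join_cells(cells: list) -> str:
    """Write the cells back out as a script with '# %%' markers."""
    parts = []
    for cell in cells:
        marker = "# %% [markdown]" if cell["cell_type"] == "markdown" else "# %%"
        parts.append("\n".join([marker] + cell["lines"]))
    return "\n".join(parts) + "\n"


example_script = (
    "# %% [markdown]\n"
    "# Load the data\n"
    "# %%\n"
    "values = [1, 2, 3]\n"
    "# %%\n"
    "print(sum(values))\n"
)

cells = split_cells(example_script)
print([cell["cell_type"] for cell in cells])
print(join_cells(cells) == example_script)
```

Because the file on disk is still a regular script, the data (outputs) never enters it, which also covers most of the first point.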
People tend to have strong feelings when they see my pandas code, as it is different from much of the (bad) advice in the Medium echo chamber. Generally, most who try it out are very happy.
The basics are embrace chaining, avoid .apply, and organize notebooks with functions (using the chain).
Oh, and Jupytext is a life saver if you are someone who uses source control.
0 - https://store.metasnake.com/effective-pandas-book
1 - https://www.youtube.com/watch?v=zgbUk90aQ6A
A decent data scientist who also understands software engineering will sooner or later take the prototype code from the notebook and refactor it into proper modules. Either this or the notebook will become an unrunnable mess as it is developed further. Reusing code and functions in a grown notebook is just too fragile.
https://github.com/fastai/nbdev
I have been using it for more than a year and it has been a great experience
The framework is called Mercury and is open-source https://github.com/mljar/mercury
- Look at nbdime & ReviewNB for git diffs
- Checkout treon & nbdev for testing
- See jupytext for keeping .py & .ipynb in sync
I agree it's a bit of a pain to install & configure a bunch of auxiliary tools but once set up properly they do solve most of the issues in the Jupyter notebook workflow.
Disclaimer: I built ReviewNB & Treon
This approach, while much slower, limits errors and ensures sustainability because both the notebook creator and the app creator will know what's going on.
I think solutions like papermill and others only work when you have infinite money and time.
They are still kind of a mess because I use them as scratch space. Anything worthwhile gets polished and put into a package manually.
Write libraries, track them in git and call them in notebooks?
I think there's ways to feed it a template that basically metaprograms what you want the output .py file to look like (e.g. render markdown cells as comments, vs. just removing them), but I've never quite figured that out.
See my first PR https://github.com/ipython/ipython/pull/776.
GitHub lost some of the original (non-rebased) commits, but I had semicolon at the ends of the lines.
And yes, I stayed because it was "Fun". Hope to see more contributions!
A small typo here (in the companion blog post https://labs.quansight.org/blog/2022/01/ipython-8.0-lessons-...) I think:
> Python has multiline strings with triple backticks
I think this should say "quote marks" instead of "backticks" since backticks are a different char, Python strings use single- or double-quote char, and three of them delimits a multiline string.
As an aside, I really wish the VSCode team did more to integrate iPython REPL more seamlessly into VSCode as that is one of the big blockers for me to using VSCode for anything Python related.
I don't use VS Code myself, but I think the team is doing an increasingly better job; Microsoft is just a huge beast. I would also love for some IPython features to get into core Python. But that might just take time, as I don't think many core Python devs do that much interactive coding, and thus they don't see much interest in doing so.
BTW it's uppercase I and P, we don't want to be in trouble with a billion dollar fruit company, even if we predate their use of iPxxxx
Thanks for your work on it, it really is much appreciated.
VS Code recently made big changes to notebook support [1], and notebooks are now fully integrated into VS Code with their own Notebooks API. I've been following the changes for the past year on VS Code Insiders and the latest integration is really impressive from a UI and developer point of view. What's more, VS Code lets you easily use notebooks with any language (not just Python). I've had a really good experience so far using Julia kernels.
[1] https://code.visualstudio.com/blogs/2021/11/08/custom-notebo...
from IPython import embed; embed()
This will open iPython in the terminal window with the state of your program at the debug point loaded in. You do need to "quit()" it before moving on in the debugger though.
Generating grammar tables from /usr/local/lib/python3.10/site-packages/blib2to3/Grammar.txt
Writing grammar tables to /root/.cache/black/21.12b0/Grammar3.10.1.final.0.pickle
Writing failed: [Errno 2] No such file or directory: '/root/.cache/black/21.12b0/tmpx51kjom5'
Generating grammar tables from /usr/local/lib/python3.10/site-packages/blib2to3/PatternGrammar.txt
Writing grammar tables to /root/.cache/black/21.12b0/PatternGrammar3.10.1.final.0.pickle
Writing failed: [Errno 2] No such file or directory: '/root/.cache/black/21.12b0/tmp80hsbuff
I believe this is the issue: https://github.com/psf/black/issues/1143
Not entirely clear what the reasons are for adding the Black dependency to IPython....
It should though fail gracefully if it can't import black.
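A minimal sketch of what "fail gracefully" could look like for an optional formatter dependency. This is not IPython's actual code: the helper names are invented, and the only assumption about Black is that `black.format_str` and `black.Mode` are available when the package is installed.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def reformat(source, formatter_module="black"):
    """Format source with the optional formatter; fall back to the input."""
    module = optional_import(formatter_module)
    if module is None:
        # Formatter not installed: leave the code untouched instead of failing.
        return source
    try:
        return module.format_str(source, mode=module.Mode())
    except Exception:
        # Any formatter error (invalid input, cache problems, missing API)
        # should not break the interactive session either.
        return source


# With a module name that does not exist, the source comes back unchanged.
print(reformat("x=1\n", formatter_module="module_that_does_not_exist_xyz"))
```

The broad `except Exception` is a deliberate trade-off for an interactive shell: formatting is a convenience, so any failure degrades to "no formatting" rather than an error.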
In that vein I have a probably somewhat obscure question, but since OP is here I thought I'd give it a shot. I'd like to use the Unix shell in concert with IPython. I'd send data to the IPython kernel from zsh terminal sessions, call functions, and get data back. This data I could then send to VisiData or a browser for bespoke visualization, or to whatever else is available in the shell. I think Jupyter's messaging protocol kind of allows this, but I haven't managed to grasp the fine details enough to get anywhere. I can get to the shell from IPython, but from the outside this REPL isn't accessible from the Unix "REPL".
[1] https://ipython.readthedocs.io/en/latest/whatsnew/version8.h...
[0] https://datascienceatthecommandline.com/2e/chapter-10-polygl...
But honestly at that point I would just look into https://xon.sh/ that blends Python and Shell together. IPython and Xonsh devs are friends, so if you need anything from one into the other it's likely doable.
I am confused on what having a persistent Python process means in this context. Isn’t IPython already that? Jupyter console states it’s a single process IPython terminal. That does leave me wondering what is different when I start IPython vs Jupyter console. I might have assumed years ago that they are mutually exclusive…
OTOH for quick experiments notebooks are great, although I feel like the more modern the GUI, the farther back we go in terms of experience. The latest updates to Visual Studio Code's Jupyter extension, for example, have turned this into a thoroughly frustrating experience for the visually impaired: gray-on-gray-on-gray text, and even more gray and transparent thin lines that are supposed to clearly mark where a cell ends and where the output begins. Unfortunately no amount of fiddling with the color scheme could fix these 'design' choices...
Known issue (it's a six year old issue IIRC). They're working on it if I'm not mistaken. They're also working on real-time collaboration.
Plug: We have long-running notebook scheduling in the background and the output is streamed and saved whether you close your browser or visit from another device.
We run the notebooks on your own Kubernetes cluster on GCP's GKE, AWS' EKS, Azure's AKS, DigitalOcean, and pretty much anything.
https://iko.ai/static/assets/img/landing/async-notebook-on-c...
The run saves everything as an experiment: it automatically detects model parameters without tagging cells, tinkering with metadata, or you calling a tracking library. We also automatically detect the model that is produced, and the model's metrics (again, without you doing anything).
PSA: on all process control equipment running Windows 10, install O&O shutup10 and enable the default set of disablements. Finding out that an incubator has been sitting there baking $300,000 of Andor cameras for 61 hours while the organism library died because the Windows 10 box running the Python control stack decided to update: it’s a bad time. https://www.oo-software.com/en/shutup10
https://bugs.python.org/issue38530 https://docs.python.org/3/whatsnew/3.10.html#attributeerrors
An alternative is to use https://friendly-traceback.github.io/docs/index.html which gets even more information than Python 3.10 does and is compatible with IPython/Jupyter.
I'm also hoping to integrate with https://pypi.org/project/friendly-traceback/ at some point.
IPython is more robust in various ways than ptpython so I’d prefer to switch back but maybe it still needs a bit of improvement. Open to suggestions if there is configuration I’m missing.
Seems like a great release though with tons of code cleanup.
I should also look into Rich and Textual
https://bpython-interpreter.org/ is also another alternative python shell, and of course https://xon.sh
With vim and the qtconsole side by side you can send lines and selections (or entire cells delimited with #%%) to execute in the qtconsole. Plots appear in the qtconsole.