The environment/dependency management story for Python is such a tire fire, and I wish someone had introduced me to it from a reasonably high level at the very beginning of my time with Python.
It doesn't have to dig deep. Just needs to talk about what these things are, why they exist and why it's such a challenge.
Too often I think people skip over these topics, considering them nothing more than a means to an end. But by far my largest challenges with Python have been environment and dependency management.
My personal choice is to use pyenv together with poetry on a Linux machine.
Yet, please understand that I teach this course to absolute (!) beginners. That is why I chose to use the Anaconda Distribution in class, which comes with every package installed that I want my students to study.
Taking your first steps in programming is hard enough, so I want to keep the "boilerplate" as small as possible.
But dependency management will be part of the more "advanced" lectures I will do in the future.
What's on my mind is that all of the "meta" parts of programming are usually seen as "boilerplate" when they make up such a critical part of "authoring" meaningful software.
I'm approaching this concern from my personal experience in university. I took a 100-level "CS for Non-CS majors." We did all the formal CS stuff like talking about variables, functions, loops, recursion, etc., but by the end of the course I still had never been taught how to take my Java applications and package them in a way that I could share with my friends. It felt very much like, "you can do so much with a computer... as long as you stay within the environment we rushed you through setting up and use the few libraries we gave you." (I wrote a Reversi game for my girlfriend and shipped it by installing the entire development environment on her laptop! Hah.)
I guess my $0.02 is, maybe near the start, I would have loved a little detour block, "If you want to learn how to take these Python scripts/programs we're writing and share them with the world, check out this advanced chapter at the end of the book!" The same is probably true for, "if you want to play with other Python libraries, take a peek at this chapter that talks about PyPI and environments!"
One of the highest-impact things you can do is orient them in a way to understand outside material more effectively. It doesn't have to be comprehensive, but they're going to see things like "pipenv" and "virtualenv" in outside material sooner rather than later.
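To make "environment" a bit less abstract, here is a minimal sketch of how a script can tell whether it is running inside a virtual environment. It assumes an environment created with the standard library's `venv` module (recent `virtualenv` versions behave the same way); a conda environment is a separate installation, so this check would typically report False there. The helper name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if this interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment's
    # directory, while sys.base_prefix still points at the base installation.
    # Outside of one, both values are the same.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("environment:", sys.prefix)
print("inside a virtual environment:", in_virtual_environment())
```

Running this once with the system Python and once after activating an environment shows the core idea: each environment is its own directory with its own interpreter and its own installed packages.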
Data scientists who use Python but not Jupyter notebooks are the exception.
I always tell people to please make sure they know all the contents from here https://www.youtube.com/watch?v=ZK3O402wf1c&list=PL49CF3715C... (Gilbert Strang's Linear Algebra course) before they make any claims about being a data scientist. I bet I can teach a monkey to open a CSV in pandas and call .fit() in sklearn. But do these people really understand the underlying assumptions? Most self-proclaimed data scientists don't, I am sure.
I still have a hard time calling myself a data scientist. And I am three years into a relevant PhD. The more I study, the less I feel I truly know.
The applicant pool seems to be full of those who took an online course and whose "personal" projects consist entirely of projects ripped from fast.ai or similar. These days, I seem to get more spam on LinkedIn from people looking to be hired as a data scientist at my company than I do from recruiters looking to hire me. And looking over their resumes, I can see why they need to hustle so much. Successful candidates need to know how to do more than classify pet images from the Oxford dataset.
So to answer your question...Maybe? I mean, these candidates are certainly trying much harder than I ever needed to for a job. But presumably they eventually get hired into roles with less discerning companies.
I also agree about the algorithms part. For my research, I look into vehicle routing problems a lot, and there is no sklearn or anything like it for that. Maybe an idea for a future project.
An experienced data scientist with deep knowledge in the topics you mention easily replaces 10 of those "candidates". I feel the best way to train more serious programmers and data scientists is to teach kids to program earlier in high school. Maybe make it mandatory just as math is today. Then, a lot more students may choose CS or math as a major in college.
I wish more effort were spent on creating intermediate courses. It would be great if more people tried to write books like the one Nicolas Rougier is writing (https://github.com/rougier/scientific-visualization-book): take a specific library and help people become proficient in it. The core contributors to a library are not always the best people to teach others how to use it.
- how to write a python library that you can host for public/private use
- adding test coverage to data science python projects
- learning libraries like matplotlib, seaborn beyond what you see in tutorials
I think material for all of this exists in different sources like documentation or Stack Overflow, but it's either too detailed or too superficial. The middle (intermediate) layer is often missing.
I can't wait to test this on my non-programmer friends who want to learn how to code. This seems to have a somewhat different approach to the usual Python tutorial, it might do the trick!
Thanks
That is why I spend so much time on the memory diagrams in the videos in particular. It literally took me years to figure that out. I have watched a lot of online lectures (e.g., from OCW), but only rarely do instructors talk about that. Maybe formally studying CS would have done the trick for me as well :)
This is a good resource.
Regarding the videos: they're hard to watch because the text is very small and looks blurry, apparently due to compression/encoding issues. In some of the videos, the audio track also has a high-pitched hum in the background that makes it very hard to listen to.
If you still have access to the source materials, you should seriously consider cleaning up the sound and re-encoding the videos for better quality (as well as making the text bigger where possible).
@webartifex it could help the spread of your course if you could put the Notebooks in a self-hosted Jupyter Notebook environment, so students don't have to install anaconda and all that.
Shameless plug here, I'm the creator of Notebooks.ai, which is a hosted Jupyter Lab environment for students, 100% free (we only charge big schools, so teachers and students are free to use). Here's a quick demo of your first lecture: https://notebooks.ai/santiagobasulto/intro-to-python-demo
And aside from ours, there are other options like mybinder or Google Colab.
I am planning to use Google Colab for a totally remote class next weekend to set up group work.
Does the material seem to have major gaps for DS to anyone else? There's no pandas, no matplotlib, no ML. It seems more like a tutorial with a computer science focus, with recursion and bit manipulation. Those are great programming topics but rarely used by data scientists. For that I would probably use .py files, not Jupyter notebooks. It just doesn't align with my experience as a Python-based data scientist.
Furthermore, it is a programming (!) course, not (!) a data science course.
Please define data science to me. Is it "only" curve fitting to you? Or also optimization (e.g., in logistics)? Then, dynamic programming (and because of that recursion) is super important.
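To illustrate why recursion matters for dynamic programming, here is a minimal sketch using the classic 0/1 knapsack problem (say, choosing which parcels to load into a vehicle with limited capacity). The problem choice, the function name `max_value`, and the numbers are assumptions made for illustration; they are not taken from the course.

```python
from functools import lru_cache


def max_value(weights, values, capacity):
    """Best total value of the items that fit into `capacity` (0/1 knapsack)."""
    n = len(weights)

    @lru_cache(maxsize=None)  # memoization: each (i, remaining) is solved once
    def best(i, remaining):
        # Base case: no items left to decide on.
        if i == n:
            return 0
        # Option 1: skip item i.
        result = best(i + 1, remaining)
        # Option 2: take item i, if it still fits.
        if weights[i] <= remaining:
            result = max(result, values[i] + best(i + 1, remaining - weights[i]))
        return result

    return best(0, capacity)


print(max_value([1, 3, 4, 5], [1, 4, 5, 7], 7))  # prints 9
```

The recursive function states the problem in terms of smaller versions of itself, and the cache turns that plain recursion into dynamic programming by never solving the same subproblem twice.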
I rely on Jupyter notebooks mainly because it is a course for people with absolutely no prior experience who also do not major in CS.
.py files are actually explained in Chapter 2 and will be used in a follow-up course.
I was helping a friend who decided to start a CompSci degree recently (the 101 course was taught in Python), and she was massively struggling with type coercion and understanding type methods. Looking at the course forum, so was everybody else. I helped her understand what types were, why they were important and what the built-in type methods were (and how to read the Python documentation), and it was a major breakthrough for her. She’s now getting close to 100% on her tests.
I was also blown away by how archaic some of the content was, like using .format() instead of f-strings, but that’s a separate topic.
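For anyone who hasn't seen the two styles side by side, here is a minimal sketch. The name and score are made-up values for illustration; both lines use the same format specification and produce the same string.

```python
name, score = "Ada", 0.875

# Older style: placeholders in the template, values passed to .format().
with_format = "{} scored {:.1%}".format(name, score)

# f-string (Python 3.6+): expressions are written directly inside the braces.
with_fstring = f"{name} scored {score:.1%}"

print(with_fstring)                 # prints: Ada scored 87.5%
print(with_format == with_fstring)  # prints: True
```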
As Python is really more about the behavior of objects and not so much their type, I introduce these already from chapter 4 onward, for example, iterable vs. container, and many more. I actually would say that this is the essence of any dynamic language (duck typing).
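A minimal sketch of what "behavior, not type" looks like in practice, using the abstract base classes from the standard library's `collections.abc`. The helper `total` is a hypothetical example, not code from the course.

```python
from collections.abc import Container, Iterable


def total(numbers):
    """Add up the elements of any iterable, not just a list."""
    result = 0
    for number in numbers:  # only requires that `numbers` can be looped over
        result += number
    return result


# A generator can be looped over (iterable) but does not implement
# the `in` protocol via __contains__ (so it is not a container).
squares = (n * n for n in range(4))

print(isinstance([1, 2, 3], Iterable), isinstance([1, 2, 3], Container))  # True True
print(isinstance(squares, Iterable), isinstance(squares, Container))      # True False
print(total([1, 2, 3]), total({1, 2, 3}), total(squares))                 # 6 6 14
```

The same function works for a list, a set, and a generator because it only relies on iteration, which is what the docs mean when they say a parameter accepts an "iterable".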
"and how to read the python documentation" -> that is an important point you raise!!! I found that beginners have real trouble reading the docs because they are screening for words like "list" instead of "iterable". However, as I teach abstract behaviors early, they actually understand the docs.
"like using .format() instead of fStrings" -> I mention .format() but mainly use f-strings and tell the students right away that they are both faster and easier to read and that they should default to them.
Code in repo is MIT licensed and videos are tagged as Creative Commons (CC BY). Finally a good OER course for people to learn Python!
I have a second half for the "book" planned but "unfortunately" also need to do some research for my PhD.
Maybe someone wants to provide some more advanced chapters via a pull request?
As a side note, I think sharing lessons as Jupyter notebooks is a really great idea for programming education. I would like to see more of this style of course for other programming languages as well.
My observation about myself when studying is this: Nothing, no video course, beats a well written book. And, such books are often "slow" reads as you have to think a lot when reading. I found Jupyter notebooks a good instrument to put a lot of info in. My guess is that there is three times the info in my course's notebooks as there is in the 25 hour video recordings.
Didn't think anyone would comment here as there are so many Python intros out there :)
I am a PhD student in Operations Research and developed an Introduction to Python & Programming course for the Bachelor program at a business school.
Due to Corona, I got a chance to videotape it thoroughly (should have done that earlier).
Maybe someone here finds my lectures (playlist: https://www.youtube.com/playlist?list=PL-2JV1G3J10lQ2xokyQow...) useful or knows someone who does. The recordings are 25 hours in total. If you do the readings and exercises, you should allocate 90 - 120 hours roughly.
The GitHub repo is really an interactive book in Jupyter notebooks: https://github.com/webartifex/intro-to-python
As I didn't study CS myself, I think I provide another angle for newbies.
Pull requests with improvements are highly welcome. The materials contain lots of references to the Python community as well. I love this community: I started without any knowledge of Python in 2014 and the many conference talks and repos really help when learning.
Stay safe everybody, Alex