For the places where bash was used, I would just use Python, and for any CLI tools you want to call I just use subprocess. It’s much simpler, and I can run the scripts in a REPL, execute cells in Jupyter, or just use plain PyCharm, so it’s quick and interactive.
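A minimal sketch of the subprocess approach, in case it helps. The `run` helper is just a name made up for this example, and the command is a stand-in (it calls the current Python interpreter so the snippet runs anywhere); swap in whatever CLI tool the bash script was calling.

```python
import subprocess
import sys


def run(cmd):
    """Run a command given as a list of arguments and return its stdout as text."""
    # check=True raises CalledProcessError if the tool exits with a non-zero status.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for a real CLI tool (assumption for illustration only).
out = run([sys.executable, "-c", "print('hello from a child process')"])
print(out.strip())
```

Passing the command as a list avoids shell quoting issues, and the captured output is an ordinary string you can poke at in a REPL or notebook cell.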
Love that you included something on building a data dictionary. I am honestly guilty of not including a good data dictionary for the source data in the past. I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where you restructure the source data before processing it. I now think you should build a data dictionary for both the source data and the final data and save it as a CSV, since that’s more maintainable, or at least leave a comment in your script.
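Something like this is what I have in mind, as a rough sketch with pandas. The `build_data_dictionary` helper, the column layout, the toy `tracks` data, and the output file name are all made up for illustration; the `description` column is left empty on purpose so it can be filled in by hand.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_data_dictionary(df):
    """Return one row per column: name, dtype, null counts, unique count, example value."""
    rows = []
    for col in df.columns:
        series = df[col]
        non_null = series.dropna()
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "non_null": int(series.notna().sum()),
            "nulls": int(series.isna().sum()),
            "unique": int(series.nunique()),
            "example": non_null.iloc[0] if len(non_null) else None,
            "description": "",  # to be written by a human
        })
    return pd.DataFrame(rows)


# Toy source data (hypothetical, not from the article).
tracks = pd.DataFrame({
    "title": ["Song A", "Song B", None],
    "play_count": [3, 0, 7],
})

data_dict = build_data_dictionary(tracks)

# Written to a temp directory here; in a real project it would sit next to the data.
out_path = Path(tempfile.mkdtemp()) / "source_data_dictionary.csv"
data_dict.to_csv(out_path, index=False)
print(data_dict)
```

Running the same function on the final, cleaned DataFrame gives a second CSV, so the two can be diffed when the pipeline changes.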
Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout and apparently all my Google Play data and songs are gone, so I guess I can’t try this myself…
I guess it is the same for make vs airflow. I had no idea they could be used interchangeably for single machine workloads.
While I've seen datasette mentioned in a lot of places, I still don't really know what it is, but if it makes exploring SQLite databases easy, I should give it a try!
- https://news.ycombinator.com/item?id=22283368
- https://news.ycombinator.com/item?id=18896204
I personally learned it from bioinformaticians; there's great coverage of this and other command line data skills in this book: https://www.oreilly.com/library/view/bioinformatics-data-ski...
The SQLite, pandas, bash, make stack for quick data science projects is a great and maintainable one that doesn’t require too much specialized knowledge.
And you're right - use the tools you know and have running. I have all sorts of schemas and tables on that old instance, since I tend to use it if I need "anything SQL" - when I'm at home, it is as easy as using SQLite. My latest article used Trino and Hadoop-adjacent stuff... while fascinating in its own right, sometimes it's nice to just say "jdbc:// ..." :)
The article never mentioned how this showed up in the GPM app itself which feels lacking.
Otherwise a nice article but it reminds me why I long ago gave up on media metadata organization. So much work, so much mess...
Separate from that, GPM matched your uploaded MP3 file against the service music corpus, and if there was a match, the service streamed the canonical version. Originally the streaming service used 320 kbps MP3, but later the service switched to 256 kbps AAC. GPM takeout does not provide the canonical version.
If you do a Takeout export from YTM, it says your music files are "Your originally uploaded audio file", which is nice. Since music in YTM may have been migrated from GPM, that seems to imply that GPM retained the originals.
When they shut down GPM I migrated to YTM, which doesn't seem to have these specific catalog problems. I also just re-organized my local copy of my FLACs using MusicBrainz Picard. Unlike this author I no longer have the giant wall of CDs!
How do you parallelize a loop in bash without getting all the echo output intertwined and jumbled together?
If you don't use nohup, you can just redirect stdout and stderr to a different file descriptor for logs within said loop.
Or, of course, script something in a language of your choice.
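For the "language of your choice" route, here is a rough Python sketch. The idea is to capture each job's output separately and only print it once the job has finished, so lines from different jobs never interleave. The `run_job` helper, the job names, and the command it runs are placeholders invented for this example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(name):
    """Run one job and return (name, combined stdout and stderr) once it finishes."""
    # Stand-in command (assumption); replace with the tool called inside the loop.
    cmd = [sys.executable, "-c", f"print('start {name}'); print('done {name}')"]
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return name, result.stdout


jobs = ["a", "b", "c", "d"]
outputs = {}

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_job, name) for name in jobs]
    for future in as_completed(futures):
        name, text = future.result()
        outputs[name] = text
        # Printing happens only here, in the main thread, one whole job at a time.
        print(f"--- {name} ---\n{text}", end="")
```

Jobs still finish in whatever order they finish, but each block of output stays together. Threads are enough here because the real work happens in the child processes.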