Currently I use yt-dlp to manually download individual videos that I want to keep, and at the moment I only save the video itself. Most of the time I then also paste the video's URL into the archive.is save page and web.archive.org/save, so that there is a snapshot of what the video page looked like at the time. But this is still incomplete, and it relies on those services continuing to exist. Saving a snapshot of the page locally, along with the thumbnail and perhaps more of the comments, would be nice.
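For the thumbnail and comments part, yt-dlp can already write those next to the video. A minimal sketch, assuming yt-dlp is on the PATH; the helper name build_archive_command and the output folder layout are made up for illustration. It does not capture a snapshot of the page itself.

```python
from typing import List


def build_archive_command(url: str, out_dir: str = "archive") -> List[str]:
    """Build a yt-dlp command line that saves metadata alongside the video.

    The helper and the folder layout are hypothetical; the flags are
    standard yt-dlp options.
    """
    return [
        "yt-dlp",
        "--write-thumbnail",    # save the thumbnail image next to the video
        "--write-info-json",    # save video metadata as a .info.json file
        "--write-comments",     # fetch comments into the .info.json file
        "--write-description",  # save the description as a separate file
        "-o", f"{out_dir}/%(channel)s/%(title)s [%(id)s].%(ext)s",
        url,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True). Fetching comments can be slow on popular videos, so it may be worth leaving that flag off for bulk runs.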
youtube_details = {
    "youtube_id": vid_id,
    "channel_name": vid["channel"],
    "vid_thumb_url": vid["thumbnail"],
    "title": vid["title"],
    "channel_id": vid["channel_id"],
    "duration": duration_str,
    "published": published,
    "timestamp": int(datetime.now().timestamp()),
    # Pulling enum value out so it is serializable
    "vid_type": vid_type.value,
}
https://www.reddit.com/r/youtubedl/comments/mcvgmr/download_...
For people looking for a more lightweight option of that kind, I run the following script hourly [1]. The script uses yt-dlp to go through a text file full of YouTube RSS URLs [2] and downloads the latest 5 videos from each, organized in folders based on channel name. Either a channel RSS feed or a playlist RSS feed works; the latter is useful for channels where you're only interested in a subset of videos. I watch these files by adding the output folder to a Jellyfin "Movies" type library sorted by most recent. The script contains a bunch of flags to make sure Jellyfin can display video metadata and thumbnails without any further plugins, and it repackages videos in a format that is 1080p yet plays efficiently even in web browsers on devices released in at least the last 10 years.
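To illustrate the "latest 5 videos from an RSS feed" step, here is a rough Python sketch. It is not the linked script, which uses xq and yt-dlp directly. It assumes the feed is an Atom document whose entries carry a yt:videoId element; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespaces assumed for YouTube's Atom feeds.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}


def latest_video_urls(feed_xml: str, limit: int = 5) -> List[str]:
    """Return watch URLs for the first `limit` entries of a feed.

    Assumes the feed lists newest entries first.
    """
    root = ET.fromstring(feed_xml)
    urls = []
    for entry in root.findall("atom:entry", NS)[:limit]:
        video_id = entry.findtext("yt:videoId", namespaces=NS)
        if video_id:
            urls.append(f"https://www.youtube.com/watch?v={video_id}")
    return urls
```

The returned URLs could then be handed to yt-dlp one by one; fetching the feed itself is left out so the sketch stays offline.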
It uses yt-dlp's "archive" functionality to keep track of videos it has already downloaded, so that it only downloads each video once, and I use a separate script to clean out files older than two weeks once in a while. Running the script depends on ffmpeg (just used for repackaging videos, not transcoding!), xq (usually comes packaged with jq or yq) and yt-dlp being installed. You will sometimes need to update yt-dlp if a YouTube-side change breaks it.
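The cleanup part could look roughly like the sketch below. This is a hypothetical stand-in, not the separate script mentioned above: the function name and the mtime-based rule are assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_old_files(root: str, max_age_days: float = 14,
                     now: Optional[float] = None) -> List[Path]:
    """Delete files under `root` last modified more than `max_age_days` ago.

    Returns the paths that were removed. Directories are left in place.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for path in list(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

If the yt-dlp archive file lives inside the same folder it would eventually be deleted too, so it is probably safer to keep it outside the folder being cleaned.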
For my personal usage it's been honed for a little while and now runs reliably for my purposes at least. Hope it's useful to more people.
[1]: https://pastebin.com/s6kSzXrL
[2]: E.g. https://danielmiessler.com/p/rss-feed-youtube-channel/
inb4 dropbox/rsync reference. Yeah yeah, I'm not saying everybody should do it like this, I'm just saying that archiving and indexing/searching needn't be heavyweight. I'm sure there's plenty of utility in a nice GUI for it, but it could easily be a lightweight GUI.
(edit: removed my "plug" since I mentioned it elsewhere.)
I haven’t yet wired up the bits to use whispercpp to automatically generate subtitles for downloads, but I have done so on an ad-hoc basis in the past and gotten (much) better results than the YouTube auto-generated subtitles.
Also, "Tube Archivist depends on Elasticsearch 8." Wow, why?
Well, if you're self-hosting for yourself, friends, and family, it isn't likely to be a thing you'll want to care about fixing when it eventually breaks in mysterious ways.
Better to use sqlite or just a blob of yaml in a file if self hosting might be involved.
I always dream of writing a proxy server where all videos, irrespective of device, get stored in a local cache and served without going outside on subsequent requests.
Gonna try this one, and gonna take that direction.
Look at the upside-down-ternet: http://www.ex-parrot.com/pete/upside-down-ternet.html
But here are the problems you'll run into today:
Cache hit rate. How big is your cache? Large enough to get a hit rate that economically saves you money vs the cost of the SAN?
Can you cache YouTube videos? Can you intercept YouTube videos? You'll need a root cert installed on your client devices. And, here's the worst: many applications do cert pinning so they'll refuse to load, even if the signer is in the root store. They require a specific signer.
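On the hit-rate question, one way to get a feel for it is to replay a log of requests through a simulated cache of a given size. A toy sketch, assuming an LRU eviction policy and a request log of (video_id, size_in_bytes) pairs; both are assumptions for illustration, not a claim about how any real proxy behaves.

```python
from collections import OrderedDict
from typing import Iterable, Tuple


def lru_hit_rate(requests: Iterable[Tuple[str, int]],
                 capacity_bytes: int) -> float:
    """Replay requests through an LRU cache and return the hit fraction."""
    cache: "OrderedDict[str, int]" = OrderedDict()  # video_id -> size
    used = hits = total = 0
    for video_id, size in requests:
        total += 1
        if video_id in cache:
            hits += 1
            cache.move_to_end(video_id)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # larger than the whole cache, never stored
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # drop least recent
            used -= evicted_size
        cache[video_id] = size
        used += size
    return hits / total if total else 0.0
```

Running the same log with a few different capacity_bytes values shows how much extra storage buys in hit rate, which is the number to weigh against the storage cost.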
They are actively updated every time a new blocking technique comes along.
It looks like Django + SQLite is used for user accounts, but all other data storage happens in Elasticsearch.
It's an interesting design decision. I would have gone all-in on the database, and used SQLite FTS in place of Elasticsearch for simplicity, but that's my own personal favourite stack. Not saying their design is bad, just different.
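As a rough idea of what the SQLite FTS route could look like, here is a small sketch using Python's sqlite3 module. It assumes the bundled SQLite was compiled with FTS5 (common, but not guaranteed), and the table and column names are invented for illustration; this is not how Tube Archivist stores anything.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(rows: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 index of (youtube_id, title, description)."""
    conn = sqlite3.connect(":memory:")
    # youtube_id is stored but not tokenized for searching.
    conn.execute(
        "CREATE VIRTUAL TABLE videos "
        "USING fts5(youtube_id UNINDEXED, title, description)"
    )
    conn.executemany("INSERT INTO videos VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str,
           limit: int = 10) -> List[Tuple[str, str]]:
    """Return (youtube_id, title) pairs matching an FTS5 query, best first."""
    cur = conn.execute(
        "SELECT youtube_id, title FROM videos "
        "WHERE videos MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()
```

Swapping ":memory:" for a file path keeps the whole index in a single file that can sit next to the downloaded videos.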
It would be great to add embeddings to the index, possibly using one of your Python tools.
The way I was using them was to create a playlist named "save" and pull from it once a day. It worked for a while, but YT somehow started to ban my script. Tube Archivist looks like it would be ideal for that.
Thanks for sharing this!
I use YT's RSS feature to follow channels and playlists I'm interested in, and discovered that (somewhat ironically) if I have it query the RSS periodically, Google will decide that I am a bot, return errors for all reads, and force me to pass a captcha the next time I try to use any Google product (presumably connecting the two activities via IP).
So now my RSS reader does not periodically query YT and instead I manually click the update button when I'm interested...
It will crash and then restoration will fail internally with corruption errors, requiring reading through docker logs or just starting over from scratch completely.
https://github.com/rumca-js/Django-link-archive
I support not only YouTube, but also any RSS source.
It functions as link aggregation software. I can also fetch metadata for all videos in a channel, and download video and audio.
I am using the standard Django auth module.
It still lacks polish, and it is under development. I am not a webdev, so I am still struggling with the overall architecture.