This is not to say that this is a good idea or a bad one, but I think you will, long-term, have better luck if people don’t feel their content is being siphoned.
A great case-in-point is what my friends at 404 Media did: https://www.404media.co/why-404-media-needs-your-email-addre...
They saw that a lot of their content was just getting scraped by random AI sites, so they put up a regwall to try to limit that as much as possible. But readers wanted access to full-text RSS feeds, so they went out of their way to create a full-text RSS offering for subscribers with a degree of security so it couldn’t be siphoned.
I do not think this tool was created in bad faith, and I hope that my comment is not seen as being in bad faith, but: you will build better relationships with the writers whose work you share if you ask rather than just take. They may have reasons you aren’t aware of for not offering RSS feeds. For example, I don’t want my content distributed in audio format, because I want to leave that option open for myself.
People should have a say in how their content is distributed. I worry what happens when you take those choices away from publishers.
I love these projects, but they can often have negative side effects.
Getting consumed by AI scrapers will be inevitable in the long run, I think.
You are describing the “give an inch, take a mile” concept neatly.
I think your mindset will just lead to a lot of people who otherwise would not want to regwall their content to do so. And if I ever do so, I will include a link to your post so they know who to blame.
1. Downloading and polling that doesn't resemble a cyberattack.
2. Not reproducing their content in a way that could compete with theirs or tarnishes their identity... and there's a lot of open ongoing debate about how that principle relates to different ways of using LLMs.
> 1. [limited history of posts]
> 2. [partial content]
To fix limitation 1 in some cases, maybe the author can rely on sitemaps [1], a feature present on many sites (like RSS feeds) that lists all the pages published.
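As a sketch of the sitemap idea: the sitemap protocol is plain XML in a fixed namespace, so pulling every published URL out of one is a few lines. The example URL is hypothetical, and a real crawler would also need to follow sitemap-index files recursively.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return every <loc> entry from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Typical use (URL hypothetical):
# import urllib.request
# with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
#     print(sitemap_urls(resp.read()))
```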
Written in Django.
I can always go back and parse the saved data. If a web page is no longer available, I fall back to the Internet Archive.
- https://github.com/rumca-js/Django-link-archive - RSS reader / web scraper
- https://github.com/rumca-js/RSS-Link-Database - bookmarks I found interesting
- https://github.com/rumca-js/RSS-Link-Database-2024 - every day storage
- https://github.com/rumca-js/Internet-Places-Database - internet domains found on the internet
After creating a Python package for web communication that replaces requests for me (and sometimes uses Selenium), I also wrote a CLI interface to read RSS sources from the command line: https://github.com/rumca-js/yafr
One of my favorite tricks when coming across a blog with a long tail of past posts is to verify that it's hosted on WordPress and then to ingest the archives into my feedreader.
Once you have the WordPress feed URL, you can slurp it all in by appending `?paged=n` (or `&paged=n`) for the nth page of the feed. (This is a little tedious in Thunderbird; up till now I've generated a list of URLs and dragged and dropped each one into the subscribe-to-feed dialog. The whole process is amenable to scripting by bookmarklet, though—gesture at a blog with the appropriate metadata, and then get a file that's one big RSS/Atom container with every blog post.)
[1] https://arstechnica.com/gadgets/2024/08/tumblr-migrates-more...
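The `?paged=n` trick above can be sketched in a few lines: build successive page URLs and stop when WordPress answers with an HTTP error (it returns 404 past the last page). The helper names are mine and the feed URLs below are hypothetical.

```python
import urllib.error
import urllib.request

def paged_url(feed_url, n):
    """Append ?paged=n, or &paged=n if the URL already has a query string."""
    sep = "&" if "?" in feed_url else "?"
    return f"{feed_url}{sep}paged={n}"

def fetch_all_pages(feed_url, max_pages=500):
    """Download successive pages of a WordPress feed until the site runs out."""
    pages = []
    for n in range(1, max_pages + 1):
        try:
            with urllib.request.urlopen(paged_url(feed_url, n)) as resp:
                pages.append(resp.read())
        except urllib.error.HTTPError:
            break  # past the last page
    return pages
```

Each element of `pages` is one RSS/Atom document; concatenating their items gives the "one big container with every blog post" described above.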
When the book was done, the blog was replaced by a link where one could buy the printed version.
None of those are problems with RSS or Atom¹ feeds. There’s no technical limitation to having the full history and full post content in the feeds. Many feeds behave that way due to a choice by the author or as the default behaviour of the blogging platform. Both have reasons to be: saving bandwidth² and driving traffic to the site³.
Which is not to say what you just made doesn’t have value. It does, and kudos for making it. But twice at the top of your post you make it sound as if those are problems inherent to the format, when they’re not. They’re not even problems for most people in most situations; you just bumped into a very specific use case.
¹ It’s not an acronym, it shouldn’t be all uppercase.
² Many feed readers misbehave and download the whole thing instead of checking ETags.
³ To show ads or something else.
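Footnote ² can be made concrete: a well-behaved reader remembers the `ETag` from its last fetch and sends `If-None-Match`, so an unchanged feed costs a 304 response instead of a full download. A minimal sketch, with hypothetical URLs and function names:

```python
import urllib.error
import urllib.request

def conditional_request(url, etag=None):
    """Build a GET that lets the server answer 304 when nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req

def fetch_if_changed(url, etag=None):
    """Return (body, new_etag); body is None when the feed is unchanged."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None, etag  # unchanged: nothing was downloaded
        raise
```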
RSS was invented in 1999, 6 years before git!
Now we have git and should just be "git cloning" blogs you like, rather than subscribing to RSS feeds.
I still have RSS feeds on all my blogs for back-compat, but git clone is way better.
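The "git clone a blog" workflow sketched above, using a throwaway local repo in place of a real blog (all paths are hypothetical; a real blog would be cloned over HTTPS):

```shell
# The author's side: posts are just committed files.
SRC=$(mktemp -d)
READER=$(mktemp -d)
cd "$SRC"
git init -q
echo "first post" > 2024-01-01-hello.md
git add . && git -c user.email=me@example.com -c user.name=me commit -qm "post: hello"

# The reader's side: clone once...
git clone -q "$SRC" "$READER/blog"

# ...then a new post appears on the author's side...
cd "$SRC"
echo "second post" > 2024-02-01-again.md
git add . && git -c user.email=me@example.com -c user.name=me commit -qm "post: again"

# ...and catching up is just a pull.
cd "$READER/blog"
git pull -q
```

Compared with polling a feed, the reader gets the full history for free and `git log` doubles as a "what's new" view.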
If anything were to replace RSS (and Atom) I'd personally hope for h-feed [1] since it's DRYer. But realistically it's going to be hard to eclipse RSS, there's far too much adoption and it is mostly sufficient.
A million?
Having your own local copy of your favorite authors' collections is the absolute way to go. So much faster, searchable, transformable, resistant to censorship, et cetera.
Can’t say anything about blogs, but the kernel folks actively use mailing list archives over Git[1,2] (also over NNTP and of course mail is also delivered as mail).
<link rel="alternate" type="application/x-git" title="my blog as a git repo" href="..." />
...and tooling could take care of all the things you like in an RSS reader. I could see this working really well for static site generators like VitePress or Jekyll or what have you. Going beyond what's in the source is kind of project-specific, though; maybe I'm interested in just a summary of commits/PRs. Anyway, there isn't an official IANA-defined type for a git repo (application/x-git is my closest guess until one becomes official), but my point is that it isn't too far beyond what RSS auto-discovery already does.
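Auto-discovery for such a link would work exactly like feed auto-discovery does today: scan the page's `<link rel="alternate">` tags for the wanted type. A sketch (the `application/x-git` type is, as noted above, only a guess at what an official type might look like):

```python
from html.parser import HTMLParser

class AltLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.links.append((a.get("type"), a.get("href")))

def discover(html, want="application/x-git"):
    """Return hrefs of alternate links matching the wanted media type."""
    finder = AltLinkFinder()
    finder.feed(html)
    return [href for t, href in finder.links if t == want]
```

Swapping `want` for `application/rss+xml` or `application/atom+xml` gives today's feed discovery, which is the point: the mechanism already exists.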
I think the GP's comment is from the point of view of making it easy to retrieve the contents of the blog archive, easier than the hoops mentioned (bulk archive retrieval and generating WordPress page sequences, etc.) as well as solving the problem in TFA (partial feeds, partial blog contents in the feed).
Horrible, simple hack: use `wget` with the `--mirror` option and commit the result to a git repository. Repeat with a `cron` job to keep an archive with change history.
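Spelled out, the hack is a short snapshot script plus one cron line. The site URL and paths here are hypothetical stand-ins:

```shell
# Write the snapshot script (site URL and archive path are hypothetical).
mkdir -p "$HOME/bin"
cat > "$HOME/bin/mirror.sh" <<'EOF'
#!/bin/sh
set -e
SITE="https://example.com"
ARCHIVE="$HOME/site-archive"
mkdir -p "$ARCHIVE" && cd "$ARCHIVE"
[ -d .git ] || git init -q
wget --mirror --no-parent --directory-prefix=. "$SITE"
git add -A
git commit -qm "snapshot $(date -u +%F)" || true  # no-op when nothing changed
EOF
chmod +x "$HOME/bin/mirror.sh"

# Run it daily at 04:00 via cron:
# 0 4 * * * $HOME/bin/mirror.sh
```

The `|| true` keeps cron quiet on days when the site hasn't changed, and `git log -p` then shows exactly what changed between snapshots.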
You clone static-site-generated websites.
Scroll is designed for this, but there's no reason other SSGs can't copy our patterns.
Here's a free, working command-line client you can try [beta]: https://wws.scroll.pub/readme.html
Instead of favoriting feeds, you favorite repos. Then you type "wws fetch" to update all your local repos.
It fetches the branch that contains the built artifacts along with the source, so you have ready-to-read HTML and clean source code for any transformations or analysis you want to do.
---
I love WordPress, but the WordPress/PHP/MySQL stack is a drag. At some point I expect they will move the WordPress brand, community, and frontend to be powered by a static site generator.
To be quite honest, I suspect they'll probably want to use Scroll as their new backend.