This is not to say that this is a good idea or a bad one, but I think you will, long-term, have better luck if people don’t feel their content is being siphoned.
A great case-in-point is what my friends at 404 Media did: https://www.404media.co/why-404-media-needs-your-email-addre...
They saw that a lot of their content was just getting scraped by random AI sites, so they put up a regwall to try to limit that as much as possible. But readers wanted access to full-text RSS feeds, so they went out of their way to create a full-text RSS offering for subscribers with a degree of security so it couldn’t be siphoned.
I do not think this tool was created in bad faith, and I hope that my comment is not seen as being in bad faith, but: you will build better relationships with the writers whose work you share if you ask rather than just take. They may have reasons you aren’t aware of for not offering RSS feeds. For example, I don’t want my content distributed in audio format, because I want to leave that option open for myself.
People should have a say in how their content is distributed. I worry what happens when you take those choices away from publishers.
I love these projects, but they can often have negative side effects.
Getting consumed by AI scrapers will be inevitable in the long run, I think.
You are describing the “give an inch, take a mile” concept neatly.
I think your mindset will just lead to a lot of people who otherwise would not want to regwall their content to do so. And if I ever do so, I will include a link to your post so they know who to blame.
1. Downloading and polling that doesn't resemble a cyberattack.
2. Not reproducing their content in a way that could compete with theirs or tarnishes their identity... and there's a lot of open ongoing debate about how that principle relates to different ways of using LLMs.
> 1. [limited history of posts]
> 2. [partial content]
To fix limitation 1 in some cases, maybe the author can rely on sitemaps [1], a feature present on many sites (like RSS feeds) that lists all the pages published.
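As a sketch of the sitemap idea: the sitemap protocol is plain XML in a fixed namespace, so pulling every published URL out of one is a few lines. The example URL is hypothetical, and a real crawler would also need to follow sitemap-index files recursively.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return every <loc> entry from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Typical use (URL hypothetical):
# import urllib.request
# with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
#     print(sitemap_urls(resp.read()))
```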
Written in Django.
I can always go back and parse the saved data. If a web page is no longer available, I fall back to the Internet Archive.
- https://github.com/rumca-js/Django-link-archive - RSS reader / web scraper
- https://github.com/rumca-js/RSS-Link-Database - bookmarks I found interesting
- https://github.com/rumca-js/RSS-Link-Database-2024 - every day storage
- https://github.com/rumca-js/Internet-Places-Database - internet domains found on the internet
After creating a Python package for web communication that replaces requests for me (and sometimes uses Selenium), I also wrote a CLI interface to read RSS sources from the command line: https://github.com/rumca-js/yafr
One of my favorite tricks when coming across a blog with a long tail of past posts is to verify that it's hosted on WordPress and then to ingest the archives into my feedreader.
Once you have the WordPress feed URL, you can slurp it all in by appending `?paged=n` (or `&paged=n`) for the nth page of the feed. (This is a little tedious in Thunderbird; up till now I've generated a list of URLs and dragged and dropped each one into the subscribe-to-feed dialog. The whole process is amenable to scripting by bookmarklet, though—gesture at a blog with the appropriate metadata, and then get a file that's one big RSS/Atom container with every blog post.)
[1] https://arstechnica.com/gadgets/2024/08/tumblr-migrates-more...
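The `?paged=n` trick above can be sketched in a few lines: build successive page URLs and stop when WordPress answers with an HTTP error (it returns 404 past the last page). The helper names are mine and the feed URLs below are hypothetical.

```python
import urllib.error
import urllib.request

def paged_url(feed_url, n):
    """Append ?paged=n, or &paged=n if the URL already has a query string."""
    sep = "&" if "?" in feed_url else "?"
    return f"{feed_url}{sep}paged={n}"

def fetch_all_pages(feed_url, max_pages=500):
    """Download successive pages of a WordPress feed until the site runs out."""
    pages = []
    for n in range(1, max_pages + 1):
        try:
            with urllib.request.urlopen(paged_url(feed_url, n)) as resp:
                pages.append(resp.read())
        except urllib.error.HTTPError:
            break  # past the last page
    return pages
```

Each element of `pages` is one RSS/Atom document; concatenating their items gives the "one big container with every blog post" described above.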
When the book was done, the blog was replaced by a link where one could buy the printed version.
None of those are problems with RSS or Atom¹ feeds. There’s no technical limitation to having the full history and full post content in the feeds. Many feeds behave that way due to a choice by the author or as the default behaviour of the blogging platform. Both have reasons to be: saving bandwidth² and driving traffic to the site³.
Which is not to say what you just made doesn’t have value. It does, and kudos for making it. But twice at the top of your post you make it sound as if those are problems inherent to the format, when they’re not. They’re not even problems for most people in most situations; you just bumped into a very specific use case.
¹ It’s not an acronym, it shouldn’t be all uppercase.
² Many feed readers misbehave and download the whole thing instead of checking ETags.
³ To show ads or something else.
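Footnote ² can be made concrete: a well-behaved reader remembers the `ETag` from its last fetch and sends `If-None-Match`, so an unchanged feed costs a 304 response instead of a full download. A minimal sketch, with hypothetical URLs and function names:

```python
import urllib.error
import urllib.request

def conditional_request(url, etag=None):
    """Build a GET that lets the server answer 304 when nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req

def fetch_if_changed(url, etag=None):
    """Return (body, new_etag); body is None when the feed is unchanged."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None, etag  # unchanged: nothing was downloaded
        raise
```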
RSS was invented in 1999, 6 years before git!
Now we have git and should just be "git cloning" blogs you like, rather than subscribing to RSS feeds.
I still have RSS feeds on all my blogs for back-compat, but git clone is way better.
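The "git clone a blog" workflow sketched above, using a throwaway local repo in place of a real blog (all paths are hypothetical; a real blog would be cloned over HTTPS):

```shell
# The author's side: posts are just committed files.
SRC=$(mktemp -d)
READER=$(mktemp -d)
cd "$SRC"
git init -q
echo "first post" > 2024-01-01-hello.md
git add . && git -c user.email=me@example.com -c user.name=me commit -qm "post: hello"

# The reader's side: clone once...
git clone -q "$SRC" "$READER/blog"

# ...then a new post appears on the author's side...
cd "$SRC"
echo "second post" > 2024-02-01-again.md
git add . && git -c user.email=me@example.com -c user.name=me commit -qm "post: again"

# ...and catching up is just a pull.
cd "$READER/blog"
git pull -q
```

Compared with polling a feed, the reader gets the full history for free and `git log` doubles as a "what's new" view.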
If anything were to replace RSS (and Atom) I'd personally hope for h-feed [1] since it's DRYer. But realistically it's going to be hard to eclipse RSS, there's far too much adoption and it is mostly sufficient.
A million?
Having your own local copy of your favorite authors' collections is the absolute way to go. So much faster, searchable, transformable, resistant to censorship, et cetera.
Can’t say anything about blogs, but the kernel folks actively use mailing list archives over Git[1,2] (also over NNTP and of course mail is also delivered as mail).
<link rel="alternate" type="application/x-git" title="my blog as a git repo" href="..." />
...and tooling could take care of all the things you like in an RSS reader. I could see this working really well for static site generators like VitePress or Jekyll or what have you. Going beyond what's in the source is kind of project-specific, though; maybe I'm interested in just a summary of commits/PRs. Anyway, there isn't an official IANA-defined type for a git repo (application/x-git is my closest guess until one becomes official), but my point is that it isn't too far beyond what RSS auto-discovery already does.
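Auto-discovery for such a link would work exactly like feed auto-discovery does today: scan the page's `<link rel="alternate">` tags for the wanted type. A sketch (the `application/x-git` type is, as noted above, only a guess at what an official type might look like):

```python
from html.parser import HTMLParser

class AltLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.links.append((a.get("type"), a.get("href")))

def discover(html, want="application/x-git"):
    """Return hrefs of alternate links matching the wanted media type."""
    finder = AltLinkFinder()
    finder.feed(html)
    return [href for t, href in finder.links if t == want]
```

Swapping `want` for `application/rss+xml` or `application/atom+xml` gives today's feed discovery, which is the point: the mechanism already exists.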
I think the GP's comment is from the point of view of making it easy to retrieve the contents of the blog archive, easier than the hoops mentioned (bulk archive retrieval and generating WordPress page sequences, etc.) as well as solving the problem in TFA (partial feeds, partial blog contents in the feed).
Horrible, simple hack: use `wget` with the `--mirror` option and commit the result to a git repository. Repeat with a `cron` job to keep an archive with change history.
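Spelled out, the hack is a short snapshot script plus one cron line. The site URL and paths here are hypothetical stand-ins:

```shell
# Write the snapshot script (site URL and archive path are hypothetical).
mkdir -p "$HOME/bin"
cat > "$HOME/bin/mirror.sh" <<'EOF'
#!/bin/sh
set -e
SITE="https://example.com"
ARCHIVE="$HOME/site-archive"
mkdir -p "$ARCHIVE" && cd "$ARCHIVE"
[ -d .git ] || git init -q
wget --mirror --no-parent --directory-prefix=. "$SITE"
git add -A
git commit -qm "snapshot $(date -u +%F)" || true  # no-op when nothing changed
EOF
chmod +x "$HOME/bin/mirror.sh"

# Run it daily at 04:00 via cron:
# 0 4 * * * $HOME/bin/mirror.sh
```

The `|| true` keeps cron quiet on days when the site hasn't changed, and `git log -p` then shows exactly what changed between snapshots.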
You clone static-site-generated websites.
Scroll is designed for this, but there's no reason other SSGs can't copy our patterns.
Here's a free, working command-line client you can try [beta]: https://wws.scroll.pub/readme.html
Instead of favoriting feeds, you favorite repos. Then you type "wws fetch" to update all your local repos.
It fetches the branch that contains the built artifacts along with the source, so you have ready-to-read HTML and clean source code for any transformations or analysis you want to do.
---
I love WordPress, but the WordPress/PHP/MySQL stack is a drag. At some point I expect they will move the WordPress brand, community, and frontend to be powered by a static site generator.
To be quite honest, I suspect they'll probably want to use Scroll as their new backend.