That's not right, is it? The headers are defined at the http level and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries. Are there many servers optimising that to a shorter feed?
At the very least, static blogs will not filter the entries - they're serving / not serving the same file, regardless of etag.
Here's an un-duplicator in a feed reader I wrote back in 2009.[1] This is used for printing RSS feeds on antique teletype machines. Reliable duplicate removal is essential when printing at 5 characters per second.
[1] https://github.com/John-Nagle/baudotrss/blob/master/messager...
https://play.google.com/store/apps/details?id=thorio.solutio...
Beyond that you could probably cover the last 2 percent or so using string comparison against title and description, or by puppeteering the website.
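A minimal sketch of the title-and-description comparison, assuming entries are plain dicts with `title` and `description` keys (that shape, and the function names, are made up for illustration). It ignores case and whitespace differences and hashes the result, so the reader only has to remember fingerprints rather than full text:

```python
import hashlib
import re


def fingerprint(title: str, description: str) -> str:
    """Hash of the normalized title and description (case and whitespace ignored)."""
    text = f"{title}\n{description}".lower()
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_duplicates(entries):
    """Yield only entries whose fingerprint has not been seen before."""
    seen = set()
    for entry in entries:
        key = fingerprint(entry.get("title", ""), entry.get("description", ""))
        if key not in seen:
            seen.add(key)
            yield entry
```

In a real reader the `seen` set would need to be persisted between polls. Exact matching after normalization will still miss entries that were reworded, so this only covers verbatim recirculation.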
Zebra as a social feed reader was a great learning experience: for example, a lot of sites circulate their content multiple times in different packages (/tiles), and very few flag paywalled content. I'm still working on recognizing that. Any hints for a good way to detect it when investigating the URLs?
That's certainly not the official semantics for these headers. They're validators, so that the server can tell the client nothing changed (304). I would assume overriding them for some sort of pagination would also hinder intermediate caches. I guess that has become less of an issue now that HTTPS is everywhere, but edge caches performing HTTPS termination might still take this information into account?
As far as HTTP caching semantics are concerned, the new version of the resource would replace the old one, and new clients would be served the latest cached version, truncated.
In fact the request headers make that very clear, as they're called If-None-Match and If-Modified-Since respectively.
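To make the validator semantics concrete, here is a rough sketch of the client side. The function names are made up for illustration and no real HTTP library is involved; the caller is assumed to have stored the `ETag` and `Last-Modified` values from the previous response:

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved from the previous response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def resolve_body(status, body, cached_body):
    """A 304 means the validators still match, so the cached copy is reused.

    Any other status carries a full representation of the feed, which may
    contain both new and already-seen entries.
    """
    if status == 304:
        return cached_body
    return body
```

Under these semantics a 200 response is the whole current feed, not a delta, so the reader still has to de-duplicate entries itself.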
Incidentally there are also If-Match and If-Unmodified-Since headers (POST, PUT, DELETE), but I don't know if anyone actually uses them in the wild. IIRC they were intended for "transactional" update guarantees: you'd fetch a resource, then PUT to it with If-Match and / or If-Unmodified-Since, and you'd get a 412 (Precondition Failed) if the resource had been modified in the meantime.
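The "transactional" update can be sketched with a toy in-memory store. Everything here (the `Store` class, deriving the ETag from a content hash, the exact success codes) is an assumption for illustration, and the `If-Match: *` form is not handled:

```python
import hashlib


class Store:
    """Toy in-memory resource store keyed by path."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def etag_of(body: bytes) -> str:
        # Assumption: the ETag is derived from a hash of the content.
        return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

    def get(self, path):
        body = self._data[path]
        return 200, self.etag_of(body), body

    def put(self, path, body, if_match=None):
        current = self._data.get(path)
        if if_match is not None:
            # Precondition: the stored representation must still carry this ETag.
            if current is None or self.etag_of(current) != if_match:
                return 412  # Precondition Failed
        self._data[path] = body
        return 204 if current is not None else 201


store = Store()
store.put("/note", b"v1")
_, tag, _ = store.get("/note")
print(store.put("/note", b"v2", if_match=tag))  # succeeds, tag still matches
print(store.put("/note", b"v3", if_match=tag))  # stale tag, rejected with 412
```

The second write fails because the first one changed the content and therefore the ETag, which is the lost-update protection described above.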
> The headers are defined at the http level and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries.
A lot of feeds behave this way, especially if it's just a static blog.
> Are there many servers optimising that to a shorter feed?
I've come across a fair few that behave with the semantics I described.
All jokes aside, you just described literally all the points I encountered while developing the built-in feed reader for HeyHomepage.com. Good summary!
One thing I notice a lot of people say - like you - is "forcing users to link through to read the article on the original site (semi defeating the point of subscribing via feedreader)".
I don't really agree and my own approach focuses explicitly on sending visitors to the original site. I only show the snippet, even when the full content is in the feed. Imagine you did your best for your website, made it nice and shiny, you want people to see the site as well. The original site usually contains more content, like a photo or image, which might also be useful for visitors. Besides, I want the webmasters to know I was there by showing up in the visitor statistics (I attach a '&rss_ref=heyhomepage.com' to the end of the link to the original site).
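For what it's worth, blindly appending `&rss_ref=...` produces a malformed link when the original URL has no query string or ends in a fragment. A small sketch of adding the parameter through the standard library instead (the `add_ref` helper is hypothetical, not how HeyHomepage actually does it):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_ref(url: str, ref: str, param: str = "rss_ref") -> str:
    """Append a referral parameter while keeping existing query and fragment."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, ref))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_ref("https://example.com/post", "heyhomepage.com"))
print(add_ref("https://example.com/post?id=4#top", "heyhomepage.com"))
```

Note that re-encoding the query this way can change how unusual existing parameters are escaped, so it is not guaranteed to be byte-for-byte identical to the source link.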
I'm not saying one way is good and the other bad - there are valid reasons for seeing a feed reader more as an aggregator - but I wanted to point out there are valid reasons for doing the opposite as well.
The original link should be preserved, but I disagree with you on the morality of appending '&rss_ref=heyhomepage.com' to the links. It is still tracking. Besides, server-side feed aggregators have a valid reason to cache the feed and canonicalize the item URL to avoid dupes.
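As a sketch of what canonicalizing an item URL could look like on the aggregator side: the list of parameters to strip and the trailing-slash rule are assumptions, and real aggregators will have their own rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters that do not identify the article itself.
TRACKING_PARAMS = {"rss_ref"}
TRACKING_PREFIXES = ("utm_",)


def canonicalize(url: str) -> str:
    """Normalize a link so the same article maps to one dedup key."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    query.sort()  # parameter order should not create a "new" item
    path = parts.path.rstrip("/") or "/"  # assumption: trailing slash is insignificant
    # Lowercase scheme and host, drop the fragment entirely.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

The canonical form is best used only as a comparison key. The link shown to the user should stay the original one, since some sites do treat a trailing slash or parameter order as significant.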
The original link to the feed is preserved for my users to click on. My system - in use with the user - pointing the user to someone else's website accompanied by a GET var containing the user's own URL is not tracking. The end website can also know that info from the HTTP referrer. It's a very crude implementation of a webmention, in a sense. Because it's not necessarily about the linking website, but about telling someone their RSS feed is in use!
That is incredibly unsound (on a technical level, not merely in matters of taste).
It's frustrating when you're forced to change the behavior of your "agnostic" application for the sake of a large, commonly-used third party tool in the ecosystem.
If an HTTP server is ignoring the Accept-Encoding header and choosing to serve a Content-Encoding that the client can't accept, that is the problem here. If the server and client can't come to an agreement, isn't that the purpose of HTTP 406? But, being able to serve both gzip'd and plain text versions of an XML file doesn't seem that crazy.
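A simplified sketch of that negotiation for a server that can only produce gzip or the uncompressed (identity) form. The q-value parsing is deliberately minimal, the function names are invented, and returning 406 when nothing is acceptable follows the suggestion above rather than any particular server's behavior:

```python
import gzip


def parse_accept_encoding(header: str) -> dict:
    """Map each listed coding to its q-value (1.0 when no q is given)."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs[name.strip().lower()] = quality
    return prefs


def negotiate(body: bytes, accept_encoding):
    """Return (status, headers, payload) honoring the client's Accept-Encoding."""
    prefs = parse_accept_encoding(accept_encoding or "")

    def quality(name, default):
        return prefs.get(name, prefs.get("*", default))

    # gzip must be asked for; identity is acceptable unless explicitly excluded.
    choices = [(quality("gzip", 0.0), "gzip"), (quality("identity", 1.0), "identity")]
    best_q, best = max(choices, key=lambda choice: choice[0])  # ties favor gzip
    if best_q <= 0:
        return 406, {}, b""
    headers = {"Vary": "Accept-Encoding"}
    if best == "gzip":
        headers["Content-Encoding"] = "gzip"
        return 200, headers, gzip.compress(body)
    return 200, headers, body
```

A client that sends no Accept-Encoding header, or only codings the server does not know, gets the plain XML, so it is never handed a Content-Encoding it did not list.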
https://discovery.thirdplace.no/?q=jackevansevo.github.io
It's not perfect but it's better than a simple parsing of <link> tags in the html.
Edit: figured it out from https://discovery.thirdplace.no/about - it looks like it's using link tags but also has a big list of baked in known-patterns, e.g. these:
https://git.sr.ht/~thirdplace/feed-finder/tree/main/item/src...
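For comparison, a rough sketch of the two approaches combined: parse `<link rel="alternate">` tags first and fall back to guessing common paths. This is not the feed-finder code, and the `GUESSED_PATHS` list is a made-up stand-in for its known-patterns list:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
# Hypothetical fallback candidates; a real tool would carry a much longer list.
GUESSED_PATHS = ("/feed", "/rss.xml", "/atom.xml", "/index.xml")


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {key: (value or "") for key, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        is_feed = attributes.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attributes.get("href"):
            self.feeds.append(urljoin(self.base_url, attributes["href"]))


def discover(base_url, html):
    """Return advertised feed URLs, or guessed candidates if none are advertised."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    if parser.feeds:
        return parser.feeds
    return [urljoin(base_url, path) for path in GUESSED_PATHS]
```

The guessed candidates are only candidates. Each one would still need to be fetched and checked for a feed content type before being offered to the user.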
Always wished RSS/ATOM had a dedicated field for images. Why didn’t they? Currently it always seems to involve some inline HTML in a CDATA element. Pretty gross.
How I parse enclosures in my own timeline you can see here: https://www.heyhomepage.com/?module=timeline&post=4 (also with a nice link to the original source)
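For readers unfamiliar with enclosures: RSS 2.0 has an `<enclosure>` element with `url`, `length` and `type` attributes, which some feeds use for images. A minimal sketch of pulling image enclosures out of a feed with the standard library (the function name is made up, and this is not the HeyHomepage implementation):

```python
import xml.etree.ElementTree as ET


def image_enclosures(rss_xml: str):
    """Return (item title, enclosure URL) pairs for image enclosures."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        for enclosure in item.findall("enclosure"):
            mime_type = enclosure.get("type", "")
            url = enclosure.get("url")
            if mime_type.startswith("image/") and url:
                results.append((item.findtext("title", default=""), url))
    return results
```

Feeds that only embed images as inline HTML inside the description will return nothing here, which is the gap the parent comment is complaining about.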
The o-umlaut doesn't occur in English, and "Nate Hopper" sounds like an English name.
Helpful hint if you need favicons for your reader you can use Google.
https://www.google.com/s2/favicons?domain=techmeme.com
The above is a load balancer for this url where the t1 subdomain may change to t[1-9] but this URL allows you to change the image size.
https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&...
I use it to grab and store sizes 16,32,48,64 of the icons with a monthly update ping.
My current iteration is built in Python with a MySQL backend. It's set up in a river-of-news style with an everything river and one for each feed, and I generate topic bundles also. The feed engine is running every 15 minutes grabbing 40 feeds at a time, but the static site generator is only running every 6 hours to keep me from spending all my time reading news. Since I pull in Reddit feeds I found that it's great for feed discovery.
https://github.com/RSS-Bridge/rss-bridge/blob/5e664d9b2b0cb0...
My favorite thing he mentioned is that various tags can have different meanings. Published, updated, description, content, subtitle. To do this at scale you need some configurations for each feed to specify where you can get information. Does <published> mean published, or does it actually mean updated? Everyone does it differently.
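One way to picture that per-feed configuration, as a sketch only: a default mapping from normalized field names to candidate source tags, plus overrides for feeds known to misuse a tag. The feed URL, field names and the flat dict shape of a parsed entry are all invented for the example:

```python
# Normalized field -> source tags to try, in order.
DEFAULT_FIELDS = {
    "published": ["published", "pubDate"],
    "updated": ["updated"],
    "summary": ["summary", "description", "subtitle"],
    "body": ["content"],
}

# Hypothetical feed whose <published> tag actually holds the last-updated time.
FEED_OVERRIDES = {
    "https://example.com/feed.xml": {"updated": ["published"], "published": []},
}


def normalize(feed_url, raw):
    """Map one raw entry (tag name -> text) to the normalized fields."""
    fields = {**DEFAULT_FIELDS, **FEED_OVERRIDES.get(feed_url, {})}
    normalized = {}
    for name, tags in fields.items():
        # Take the first candidate tag that is present and non-empty.
        normalized[name] = next((raw[tag] for tag in tags if raw.get(tag)), None)
    return normalized
```

The cost is that someone has to inspect each misbehaving feed and write its override by hand, which is the scaling problem described above.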
And the etag thing. Yeah…
One thing he didn’t mention is media. I think the HN crowd really likes RSS because the mostly-text tech blogs they like to read all support it, and it seems to work fine. But a lot of the population likes to read content that has embedded images and videos. Even slideshows sometimes. There are RSS extensions for this, but they suck for all the same reasons.
At my company we ended up abandoning RSS and writing a customizable web scraper instead (ingesting HTML pages). It was actually a lot easier than dealing with RSS.
PS: For the (ongoing) struggle (trying) to "normalize" the RSS and ATOM feed formats (or JSON Feeds) see the feedparser gem - https://github.com/rubycocos/feedparser
I learnt a lot. My goal was getting something working that the ttrss android app would connect to and I reasonably succeeded there, running it for a few years.
I went back to hosting the full ttrss application at some point.
There have been some attempts at tackling this problem, but none have managed to get it right and become truly universal, as far as I know.
If you don't agree with the philosophy, kindly move along, no need to downvote.
I hesitated for a long time too. One day I just decided to keep at it and launch.