The struggles of building a feed reader

(jackevansevo.github.io)

129 points · Jackevansevo · 3y ago · 49 comments

40 comments · 20 top-level

viraptor3y ago· 7 in thread

> Including an ETag or Last-Modified header in the body of a request when fetching a feed is a mechanism to tell the server to only return new/modified entries/items (aka: a changeset) since a specific date.

That's not right, is it? The headers are defined at the http level and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries. Are there many servers optimising that to a shorter feed?

At the very least, static blogs will not filter the entries - they're serving / not serving the same file, regardless of etag.

Animats3y ago

Un-duplicating RSS items is hard. Timestamps can't be trusted. IDs can't be trusted. Some sources will resend the same item with a different timestamp. RSS servers behind a load balancer may return different IDs and timestamps for the same items. I had to compute a hash of each item to reliably remove duplicates.

Here's an un-duplicator in a feed reader I wrote back in 2009.[1] This is used for printing RSS feeds on antique teletype machines. Reliable duplicate removal is essential when printing at 5 characters per second.

[1] https://github.com/John-Nagle/baudotrss/blob/master/messager...
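
A minimal sketch of the hash-based approach described above, not the code from the linked repository. It assumes the title, link and description are the fields worth hashing; the names `item_fingerprint` and `Deduplicator` are made up for illustration.

```python
import hashlib


def item_fingerprint(title: str, link: str, description: str) -> str:
    """Hash an item's content, deliberately ignoring its ID and timestamp."""
    # Collapse runs of whitespace so trivial re-serialisation does not change the hash.
    parts = [" ".join(field.split()) for field in (title, link, description)]
    # Join with a separator that is unlikely to appear in the fields themselves.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints of items that have already been seen."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, title: str, link: str, description: str) -> bool:
        fingerprint = item_fingerprint(title, link, description)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

Because the GUID and date never enter the hash, a resent item with a fresh timestamp still maps to the same fingerprint. The trade-off is that a genuinely edited item shows up as new.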

thorio3y ago

True, I realized this when developing the initial version of zebra I posted a few days back. However relying on a SQL server that requires a unique URL turned out to be the easiest and most effective solution.

https://play.google.com/store/apps/details?id=thorio.solutio...

Beyond that you could probably cover the last 2 or so percent using string comparison against title and description or puppeteering the website.
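
A rough sketch of the unique-URL idea, assuming SQLite in place of whatever SQL server is actually used; the table layout and the names `open_store` and `insert_if_new` are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) an item store where the URL column must be unique."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " url TEXT NOT NULL UNIQUE,"
        " title TEXT NOT NULL)"
    )
    return conn


def insert_if_new(conn: sqlite3.Connection, url: str, title: str) -> bool:
    """Insert the item; return False if the URL was already stored."""
    # INSERT OR IGNORE lets the UNIQUE constraint silently reject duplicates.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cursor.rowcount == 1
```

The database does the duplicate check, so the reader never needs its own lookup before inserting.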

Zebra as a social feed reader was a great learning experience: for example, I learned that a lot of sites circulate their content multiple times in different packages (/tiles) and very few flag paywalled content. I'm still working on recognizing that. Any hints for a good way to detect it when investigating the URLs?

masklinn3y ago

> Are there many servers optimising that to a shorter feed?

That's certainly not the official semantics for these headers: they're validators, so that the server can tell the client nothing changed (304). I would assume overriding these for some sort of pagination would also hinder intermediate caches. I guess that has become less of an issue now that HTTPS is everywhere, but edge caches performing HTTPS termination might still take this information into account?

As far as HTTP caching semantics are concerned, the new version of the resource would replace the old one, and new clients would be served the latest cached version, truncated.

In fact the request headers make that very clear, as they're called If-None-Match and If-Modified-Since respectively.

Incidentally there are also If-Match and If-Unmodified-Since headers (POST, PUT, DELETE), but I don't know if anyone actually uses them in the wild. IIRC they were intended for "transactional" update guarantees: you'd fetch a resource, then PUT to it with If-Match and / or If-Unmodified-Since, and you'd get a 412 (Precondition Failed) if the resource had been modified in the meantime.
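
For the read side, a minimal sketch of a conditional GET using only the standard library. The function names are made up, and it assumes the caller stored the ETag and Last-Modified values from the previous response.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build validator headers from what the previous response gave us."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url: str, etag: Optional[str] = None, last_modified: Optional[str] = None):
    """Return (status, body, etag, last_modified); body is None on a 304."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (
                response.status,
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError; it only means "reuse your copy".
        if error.code == 304:
            return 304, None, etag, last_modified
        raise
```

A 200 here still carries the whole current feed, so the reader has to work out which entries are new by itself.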

aendruk3y ago

Yeah, the whole post is riddled with little technical misunderstandings. It’s nice to see someone working things out in the open though.

JackevansevoOP3y ago

Author of the post, could you clarify?

denton-scratch3y ago

I gave up around paragraph four; the article isn't a technical article, it's a touchy-feely people story about an old, blind man with a very white beard.

JackevansevoOP3y ago

It's a tad confusing. I believe you're totally correct, this is how those headers should behave.

> The headers are defined at the http level and with caching layers in between, the endpoint is free to return the current feed - with both new and seen entries.

A lot of feeds behave this way. Especially if it's just a static blog

> Are there many servers optimising that to a shorter feed?

I've come across a fair few that behave with the semantics I described

rambambram3y ago· 3 in thread

Where's your own RSS icon then!? ;)

All jokes aside, you just described literally all the points I encountered while developing the built-in feed reader for HeyHomepage.com. Good summary!

One thing I notice a lot of people say - like you - is "forcing users to link through to read the article on the original site (semi defeating the point of subscribing via feedreader)".

I don't really agree and my own approach focuses explicitly on sending visitors to the original site. I only show the snippet, even when the full content is in the feed. Imagine you did your best for your website, made it nice and shiny, you want people to see the site as well. The original site usually contains more content, like a photo or image, which might also be useful for visitors. Besides, I want the webmasters to know I was there by showing up in the visitor statistics (I attach a '&rss_ref=heyhomepage.com' to the end of the link to the original site).

I'm not saying one way is good and the other bad - there are valid reasons for seeing a feed reader more as an aggregator - but I wanted to point out there are valid reasons for doing the opposite as well.

derekzhouzhen3y ago

His homepage has a `link rel="alternate"` meta and that's all that matters.

Original link should be preserved, but I disagree with you on the moral of appending '&rss_ref=heyhomepage.com' to the links. It is still tracking. Besides, server-side feed aggregators have a valid reason to cache the feed and canonicalize the item url to avoid dupes.
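
A sketch of what canonicalizing an item URL might look like, assuming the aggregator drops fragments and a hypothetical list of tracking parameters (`utm_*` and `rss_ref`). The exact rules are a judgment call, not a standard.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; a real aggregator would tune this list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"rss_ref"}


def canonicalize(url: str) -> str:
    """Normalise a URL so that cosmetic variants compare equal."""
    parts = urlsplit(url)
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # host names are case-insensitive
        parts.path or "/",     # treat an empty path as the root
        urlencode(query),
        "",                    # drop the fragment
    ))
```

The canonical form is only for comparison; the original link can still be what the reader shows and opens.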

rambambram3y ago

An RSS icon has an important signaling function, if you ask me. Automatic discovery is good, but why not also have a textual link or icon pointing to your feed!?

The original link to the feed is preserved for my users to click on. My system - in use with the user - pointing the user to someone else's website accompanied by a GET var containing the user's own URL is not tracking. The end website can also know that info from the HTTP referrer. It's a very crude implementation of a webmention, in a sense. Because it's not necessarily about the linking website, but about telling someone's RSS feed is in use!

cxr3y ago

> I attach a '&rss_ref=heyhomepage.com' to the end of the link to the original site

That is incredibly unsound (on a technical level, not merely in matters of taste).
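
One possible reading of the technical problem (an assumption, since it isn't spelled out here): blindly appending `&rss_ref=...` produces a broken URL when the link has no query string, and ends up inside the fragment when the link ends in `#something`. A sketch of adding the parameter through a URL parser instead, with a made-up helper name:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, name: str, value: str) -> str:
    """Append a query parameter without disturbing the path or fragment."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((name, value))
    # Rebuild the URL so the parameter lands in the query string, before any fragment.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_query_param("https://example.com/post", "rss_ref", "heyhomepage.com")` gives `https://example.com/post?rss_ref=heyhomepage.com`, where plain string concatenation would give `https://example.com/post&rss_ref=heyhomepage.com`.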

ryangittins3y ago· 2 in thread

I ran into a number of finicky issues building siftrss[1] a few years back. One I toiled over quite a bit was the discovery that Feedly, a very popular feed reader, does not support gzip. I haven't checked in recent years, but they may still not.

It's frustrating when you're forced to change the behavior of your "agnostic" application for the sake of a large, commonly-used third party tool in the ecosystem.

[1] https://siftrss.com/

coder5433y ago

I don't understand how feedly is the issue here. If the client doesn't say they accept gzip encoding, why are you sending gzip encoded content? It would be slightly weird if the feedly client doesn't ask for gzip, but this is standard HTTP content negotiation.

If an HTTP server is ignoring the Accept-Encoding header and choosing to serve a Content-Encoding that the client can't accept, that is the problem here. If the server and client can't come to an agreement, isn't that the purpose of HTTP 406? But, being able to serve both gzip'd and plain text versions of an XML file doesn't seem that crazy.
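
A simplified sketch of that negotiation on the server side. It only looks for an explicit `gzip` token (no `*` wildcard handling), and the function names are made up for illustration.

```python
import gzip


def accepts_gzip(accept_encoding: str) -> bool:
    """True if the Accept-Encoding value lists gzip with a non-zero q-value."""
    for token in accept_encoding.split(","):
        coding, _, params = token.partition(";")
        if coding.strip().lower() != "gzip":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def encode_body(body: bytes, accept_encoding: str):
    """Return (payload, extra_headers), compressing only when the client asked for it."""
    headers = {"Vary": "Accept-Encoding"}  # caches must key on the request header
    if accepts_gzip(accept_encoding):
        headers["Content-Encoding"] = "gzip"
        return gzip.compress(body), headers
    return body, headers
```

A client that sends no Accept-Encoding header gets the plain XML back, which is the behaviour a reader without gzip support would need.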

ryangittins3y ago

I'm fuzzy on the details as it's been 5+ years since I looked at it, but it wasn't as simple as that. I think it may have been that it worked over HTTP but not HTTPS, and/or they did say that they accepted it but it broke under some circumstances.

thirdplace_3y ago· 2 in thread

I also attempted to build a feed reader a while back. In the process I built a feed discovery service:

https://discovery.thirdplace.no/?q=jackevansevo.github.io

It's not perfect but it's better than a simple parsing of <link> tags in the html.

simonw3y ago

Does your implementation there parse HTML and look for link tags or is it doing something else as well?

Edit: figured it out from https://discovery.thirdplace.no/about - it looks like it's using link tags but also has a big list of baked in known-patterns, e.g. these:

https://git.sr.ht/~thirdplace/feed-finder/tree/main/item/src...
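
A sketch of the two-step idea (link tags first, then guessed paths). This is not the feed-finder code; the `COMMON_PATHS` list is a hypothetical stand-in for its baked-in patterns.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}
# Hypothetical guesses to try when the page advertises nothing.
COMMON_PATHS = ("/feed", "/rss.xml", "/atom.xml", "/index.xml")


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> elements that point at feeds."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html: str, base_url: str) -> list:
    """Return feed URLs advertised in the page's link tags."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


def fallback_candidates(base_url: str) -> list:
    """Return guessed feed URLs to probe when discover_feeds finds nothing."""
    return [urljoin(base_url, path) for path in COMMON_PATHS]
```

The fallback URLs are only candidates; each one still has to be fetched and checked to see whether it really is a feed.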

rambambram3y ago

Nice app with a clear use case!

lloydatkinson3y ago· 2 in thread

This is a great read and I will be sure to use this when a project I have in mind needs to parse a variety of feeds. So far the default .NET SyndicationFeed class works well though.

Always wished RSS/ATOM had a dedicated field for images. Why didn’t they? Currently it always seems to involve some inline HTML in a CDATA element. Pretty gross.

rambambram3y ago

There's "enclosure" for RSS. And Atom can have "<link rel='enclosure'>".

How I parse enclosures in my own timeline you can see here: https://www.heyhomepage.com/?module=timeline&post=4 (also with a nice link to the original source)
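
A minimal sketch of pulling image enclosures out of both formats with the standard library. It is not the HeyHomepage implementation, and it assumes the feed declares a MIME type on each enclosure.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def image_enclosures(feed_xml: str) -> list:
    """Return the URLs of image enclosures found in an RSS or Atom document."""
    root = ET.fromstring(feed_xml)
    urls = []
    # RSS 2.0: <item><enclosure url="..." type="image/..." length="..."/></item>
    for enclosure in root.iter("enclosure"):
        if enclosure.get("type", "").startswith("image/") and enclosure.get("url"):
            urls.append(enclosure.get("url"))
    # Atom: <entry><link rel="enclosure" href="..." type="image/..."/></entry>
    for link in root.iter(ATOM_NS + "link"):
        if (
            link.get("rel") == "enclosure"
            and link.get("type", "").startswith("image/")
            and link.get("href")
        ):
            urls.append(link.get("href"))
    return urls
```

Feeds that leave out the `type` attribute would be skipped by this sketch, so a real reader may want to fall back to the file extension.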

lloydatkinson3y ago

I did try that but none of the rss apps I tried displayed the image

denton-scratch3y ago· 2 in thread

> Coördinated Universal Time

The o-umlaut doesn't occur in English, and "Nate Hopper" sounds like an English name.

owenm3y ago

It's a diaeresis symbol rather than an umlaut, used (infrequently) to show that it's pronounced co-or rather than coor. Quite archaic, unless you're the New York Times, who use it as part of their house style, but not wrong!

cldellow3y ago

I think you're thinking of The New Yorker, not the New York Times.

smilbandit3y ago· 1 in thread

Feed readers are my learning project; I use them to learn new languages. I've built and rebuilt readers in VBScript, VB.NET, C#, PHP and Python. PHP and Python have been the easiest since they have good parser libraries. I've also used SQL Server, MySQL, SQLite and just JSON flat files. I think I've built something like 10 or so variations. In the last few I've expanded beyond RSS to pull from Hacker News and Twitter, plus an enhanced pull for Reddit feeds. I'm not pulling Twitter currently, though, because of some API changes that I haven't bothered to spend time on.

Helpful hint: if you need favicons for your reader, you can use Google.

https://www.google.com/s2/favicons?domain=techmeme.com

The above is a load balancer for this url where the t1 subdomain may change to t[1-9] but this URL allows you to change the image size.

https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&...

I use it to grab and store sizes 16,32,48,64 of the icons with a monthly update ping.

My current iteration is built in Python with a MySQL backend. It's set up in a river-of-news style, with an everything river and one for each feed, and I also generate topic bundles. The feed engine runs every 15 minutes, grabbing 40 feeds at a time, but the static site generator only runs every 6 hours to keep me from spending all my time reading news. Since I pull in Reddit feeds, I've found that it's great for feed discovery.

jcgoette3y ago

Just discovered this favicon "trick" the other day in RSS-Bridge code:

https://github.com/RSS-Bridge/rss-bridge/blob/5e664d9b2b0cb0...

mawise3y ago· 1 in thread

This is great! I recognize a lot of the challenges I ran into (or decided to ignore!) when building the reader for https://havenweb.org . I had a particular chuckle at "#just for sorting", remembering feeds that kept bumping themselves to the top of my reader!

rambambram3y ago

Hey, you're in my OPML list of shared links: https://www.heyhomepage.com/?module=timeline&view=sharedlist

butz3y ago

An annoying thing with RSS readers is that when a website implements some sort of "security feature", the reader might not be able to download any feeds. I had one occurrence where the feed reader was asked to complete a captcha to reach the content. Being a "bot", it of course failed. Another time a website was blocking all traffic from abroad, so the RSS reader just got access errors, as its server is located in another country.

apeace3y ago

This is a good list. I did this at a medium scale once (about 10,000 feeds that needed to be checked once per minute).

My favorite thing he mentioned is that various tags can have different meanings. Published, updated, description, content, subtitle. To do this at scale you need some configurations for each feed to specify where you can get information. Does <published> mean published, or does it actually mean updated? Everyone does it differently.
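
A sketch of what such per-feed configuration could look like, assuming RSS-style items with un-namespaced tags. The `FieldMap` class, the override table and the example URL are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMap:
    """Which tag to read for each logical field of an item."""
    published: str = "pubDate"
    updated: str = "pubDate"
    summary: str = "description"


DEFAULT_MAP = FieldMap()
# Per-feed overrides for publishers that use the tags differently.
OVERRIDES = {
    "https://example.com/feed.xml": FieldMap(published="date", updated="pubDate"),
}


def extract(feed_url: str, item: ET.Element) -> dict:
    """Read an item's fields using the mapping configured for its feed."""
    fields = OVERRIDES.get(feed_url, DEFAULT_MAP)
    return {
        "published": item.findtext(fields.published),
        "updated": item.findtext(fields.updated),
        "summary": item.findtext(fields.summary),
    }
```

The parsing code stays the same for every feed; only the mapping table grows as new quirks turn up.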

And the etag thing. Yeah…

One thing he didn’t mention is media. I think the HN crowd really likes RSS because the mostly-text tech blogs they like to read all support it, and it seems to work fine. But a lot of the population likes to read content that has embedded images and videos. Even slideshows sometimes. There are RSS extensions for this, but they suck for all the same reasons.

At my company we ended up abandoning RSS and writing a customizable web scraper instead (ingesting HTML pages). It was actually a lot easier than dealing with RSS.

bitforger3y ago

I haven't used a feed reader in a long time, but I had a brief period when I was obsessed with Fraidycat. Worth a look if you're interested in a different approach to keeping up with people.

https://fraidyc.at/

geraldbauer3y ago

FYI: Another feed reader I built (called pluto with sqlite as feed / data storage) see https://github.com/feedreader - used by OpenStreetMaps Blogs, Planet KDE, and others.

PS: For the (ongoing) struggle (trying) to "normalize" the RSS and ATOM feed formats (or JSON Feeds) see the feedparser gem - https://github.com/rubycocos/feedparser

derekzhouzhen3y ago

Been there, done that. A lot of feeds, I mean 99%+, have subtle bugs in the metadata that can be easily fixed; fixing them makes feed reader writers' lives easier and broadens your readership. There are RSS validators; please make use of them. I have a lint tool for your blog that cross-checks metadata from the feed against metadata from the post:

https://roastidio.us/lint

pricechild3y ago

A long time back I had a go at this too, but reimplementing ttrss's api instead of writing my own frontend: https://github.com/nvtrss/nvtrss

I learnt a lot. My goal was getting something working that the ttrss android app would connect to and I reasonably succeeded there, running it for a few years.

I went back to hosting the full ttrss application at some point.

rsolva3y ago

What I am missing is a robust solution for keeping my feeds (blogs, podcasts etc) in sync between multiple devices, using a standardised protocol that enables the usage of many different clients on any platform.

There have been some attempts at tackling this problem, but none have managed to get it right and become truly universal, as far as I know.

mcfunley3y ago

I worked on a feed reader back in 2006. The worst feed discovery kluge I can recall needing to special case was that certainly the most popular blog at the time (Cute Overload) was a frameset around blogger. That was typical though, people’s sites are a mess.

denton-scratch3y ago

https://archive.ph/tmbk6

lormayna3y ago

Writing my own feed reader was one of my unfinished side projects. Thank you for sharing your struggles.

animitronix3y ago

I'd like to know what issues the author has with ttrss

ernsheong3y ago

FWIW, I made https://readerize.com that doesn't rely on RSS. Freemium is coming soon, kindly bear with me. For now, signup/trial is free without needing a credit card.

If you don't agree with the philosophy, kindly move along, no need to downvote.

I hesitated for a long time too. One day I just decided to keep at it and launch.
