Generate RSS feed for any website using CSS selectors

(rss-bridge.org)

203 points · thirdplace_ · 2y ago · 53 comments

48 comments · 22 top-level

PaulHoule2y ago· 5 in thread

I've wondered why people have tried all sorts of cumbersome ways to splice metadata onto HTML like RDFa but never tried the obvious approach of basing extraction rules on CSS selectors... Often these work without the cooperation of the target site so long as they use CSS the way it was supposed to be used (e.g. not tailwind, bootstrap, etc.)

ttepasse2y ago

Back in the optimistic 2000s there was the idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff, e.g. microformats, HTML meta, FOAF, etc:

https://www.w3.org/TR/grddl/

account-52y ago

Having learned xpath and a little xslt I've always wondered why it isn't more popular. It seems like a powerhouse for reading and transforming data from XML type documents. I've found it hard to find decent resources to learn more than the basics (and none for xquery) because of lack of popularity nowadays, but I do think it's a skill you should have like SQL and regex. Seems a no-brainer.

bubblematrix2y ago

CSS selectors have been common in the scrapers I've been using for years.

kybernetikos2y ago

I quite like the microformats approach to this. https://developer.mozilla.org/en-US/docs/Web/HTML/microforma...

k1m2y ago

Sadly the trend does seem to be a move away from semantic CSS. I get the appeal of Tailwind for creating components and custom designs, but it's surprising when you see content heavy sites like the BBC no longer using class attributes in their news articles the way they used to.

dagurp2y ago· 4 in thread

These days I just let ChatGPT generate a script that scrapes a site and spits out an RSS file. Then I run it with cron.

notadev2y ago

I’m guessing they paste a portion of the website’s source then tell ChatGPT to generate a script that can generate an RSS feed from that site.

dagurp2y ago

Yeah I just copy the html that's relevant. There's some manual work involved but it doesn't take a lot of time.
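
For readers wondering what the "spits out an RSS file" half of such a script might look like, here is a minimal sketch using only the Python standard library. It assumes the scraping step has already produced a list of dicts; the `build_rss` helper, the dict keys, and the example.com URLs are made up for illustration and are not taken from anyone's actual script.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document (as a string) for a list of item dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # Reuse the link as the guid so readers can de-duplicate items.
        ET.SubElement(node, "guid").text = item["link"]
        # RSS uses RFC 822 style dates, which format_datetime produces.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical scraped data; a real script would fill this from the page.
scraped = [
    {
        "title": "First post",
        "link": "https://example.com/first",
        "published": datetime(2023, 7, 14, tzinfo=timezone.utc),
    },
]
feed_xml = build_rss("Example site", "https://example.com", "Scraped feed", scraped)
print(feed_xml)
```

A cron job could then redirect the script's output to a file that a feed reader polls.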

dopidopHN2y ago

Are you not limited by the cutoff date of the content the model was trained on?

pinkcan2y ago

1. the script is generated by the llm

2. the user runs the script that does the scraping

these are temporally separate actions

eviks2y ago· 3 in thread

What's the easiest way to also run a few basic filters on the site/RSS feed's content to make it truly shine vs simplistic scraping, like

- splitting the full feed by theme of the article into separate feeds and at the same time

- remove a few keywords and also

- get article length and split into a long / short feed

- Or maybe get what you used to have on some news sites - subscribe only to a specific author instead of getting bombarded with hundreds of items in a feed

rakoo2y ago

Write a parser for rss-bridge that takes a rss feed in, does what you need, and spits a feed out

I don't know any service that does that automatically but it's attainable to have a generic way of doing what you need. That's the power of rss-bridge: make the feed you want from content that already exists
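
As a rough illustration of the "feed in, filtered feed out" idea, here is a standalone Python sketch. It is not an RSS-Bridge bridge (those are written in PHP); the `filter_feed` function, its parameters, and the sample feed are assumptions made up for this example. It keeps only items by one author and drops items whose title contains a blocked keyword.

```python
import xml.etree.ElementTree as ET


def filter_feed(rss_xml, author=None, blocked_words=()):
    """Return the feed with items removed that fail the author/keyword filters."""
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    # findall returns a list, so removing items while looping is safe.
    for item in channel.findall("item"):
        title = (item.findtext("title") or "").lower()
        item_author = item.findtext("author") or ""
        wrong_author = author is not None and item_author != author
        blocked = any(word.lower() in title for word in blocked_words)
        if wrong_author or blocked:
            channel.remove(item)
    return ET.tostring(root, encoding="unicode")


# Made-up input feed for demonstration.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>Rust tips</title><author>alice</author></item>
<item><title>Celebrity gossip</title><author>alice</author></item>
<item><title>Go tips</title><author>bob</author></item>
</channel></rss>"""

filtered = filter_feed(SAMPLE, author="alice", blocked_words=["gossip"])
kept_titles = [t.text for t in ET.fromstring(filtered).iter("title")][1:]
print(kept_titles)
```

Splitting by theme or length would follow the same pattern: run the function once per output feed with a different predicate.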

pinkcan2y ago

you could start by pushing all articles into a database; have another process quickly label/tag the entries based on the criteria you care about; web or tui app to show you only the entries you care about; slower clean up job for entries you don't care to keep around anymore
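
The store-then-label pipeline described above could be sketched with SQLite roughly as follows. The table layout, function names, and the 1000-character threshold are all assumptions picked for illustration; the labelling rule here is just article length, standing in for whatever criteria you care about.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "link TEXT PRIMARY KEY, title TEXT, body TEXT, tag TEXT)"
    )
    return db


def store(db, entries):
    # INSERT OR IGNORE keeps re-scraped articles from being duplicated.
    db.executemany(
        "INSERT OR IGNORE INTO entries (link, title, body) VALUES (?, ?, ?)",
        entries,
    )
    db.commit()


def label(db, long_threshold=1000):
    # Separate pass: tag only the entries that have no tag yet.
    db.execute(
        "UPDATE entries SET tag = CASE WHEN length(body) >= ? "
        "THEN 'long' ELSE 'short' END WHERE tag IS NULL",
        (long_threshold,),
    )
    db.commit()


def wanted(db, tag):
    rows = db.execute(
        "SELECT title FROM entries WHERE tag = ? ORDER BY title", (tag,)
    )
    return [title for (title,) in rows]


db = open_store()
store(db, [
    ("https://example.com/a", "Deep dive", "x" * 1500),
    ("https://example.com/b", "Quick note", "short text"),
])
label(db)
print(wanted(db, "long"))
```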

eviks2y ago

Thanks, but I meant which of the RSS services offers this basic filtering? From a dozen I know of, including paid ones, at most you get keyword black/white lists, which is too limiting. I used to use Huginn for that on Heroku.

1vuio0pswjnm72y ago· 3 in thread

"Generate RSS feed for any website using CSS selectors"

For me, "CSS selectors" always seems like a deceptive term, if it means selecting HTML tag elements. What if the website does not use styling.

I read 1000s of websites, including all HN submissions, without using CSS. When I want to extract information from a website, I focus on patterns in the page. They might be HTML, they might be style elements, but they could be anything. I never assume that all websites will wrap the information I want in certain elements. There is a ridiculous amount of random variation amongst websites.

mmcwilliams2y ago

I'm not sure that CSS being used on the page is a requirement. `h1 a` is a valid CSS selector whether or not anything styles it; in this case, it is not required that the element be styled by a style sheet.

The key here is that it uses selectors, not the style sheets themselves.
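
To make that concrete, here is a small Python sketch that applies a selector like `h1 a` to a page with no style sheet at all. It is a hand-rolled toy that only understands tag names joined by the descendant combinator (a space), not real CSS selector syntax; the class and function names are invented for this example, and a real scraper would normally use a proper selector library.

```python
from html.parser import HTMLParser


class DescendantSelector(HTMLParser):
    """Collect the text of elements matching a selector such as 'h1 a'."""

    def __init__(self, selector):
        super().__init__()
        self.parts = selector.split()
        self.stack = []            # currently open tags, outermost first
        self.capture_depth = None  # stack depth of the element being captured
        self.matches = []

    def _stack_matches(self):
        # The last selector part must be the innermost open tag; the earlier
        # parts must appear, in order, somewhere among its ancestors.
        if not self.stack or self.stack[-1] != self.parts[-1]:
            return False
        ancestors = iter(self.stack[:-1])
        return all(part in ancestors for part in self.parts[:-1])

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self.capture_depth is None and self._stack_matches():
            self.capture_depth = len(self.stack)
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the closing tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.matches[-1] += data


def select_text(html, selector):
    parser = DescendantSelector(selector)
    parser.feed(html)
    parser.close()
    return [m.strip() for m in parser.matches]


# An unstyled page: no <style>, no classes, yet the selector still applies.
PAGE = (
    "<html><body><h1><a href='/x'>First post</a></h1>"
    "<p><a href='/y'>Other</a></p></body></html>"
)
titles = select_text(PAGE, "h1 a")
print(titles)
```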

daniel-s2y ago

You just need to use the same logic and syntax as CSS selectors to pick out content from the page. That's a little different from using CSS to style.

1vuio0pswjnm72y ago

Using CSS selectors, exclusively, is brittle and prone to failure.

toastal2y ago· 2 in thread

CSS selectors were more useful before the Tailwind fad of dropping meaningful class names in favor of recreating inline styles, but with abbreviations to memorize. I use μBlock Origin + userStyles a lot, both of which also use CSS selectors, & over the last couple of years everything has become a lot harder for the end user to tweak/fix. If you’re lucky now, you’ll have some ARIA attributes to select on.

zelphirkalt2y ago

And it also became harder due to people thinking random ids and class names are totally fine. Super annoyed by that. It feels like they are forcing their vision onto the user, while the user does not want their vision and could not care less.

toastal2y ago

The web was nicer when you could inspect, learn, & riff off of what others were doing in the industry–like the old music industry used to do when covering & borrowing a phrase was considered homage not grounds for lawsuit. It’s now all meant to be closed off & behind build tools that complect the output where most folks don’t even know how their pipeline works; and this is strange since the simple tools of HTML, CSS, & JS simply construct the web without any build steps at all if you wanted.

nfriedly2y ago· 2 in thread

I run my own instance of RSS Bridge to keep track of authors that I like on Goodreads.

It works pretty well, although every once in a while Goodreads hiccups, and then RSS bridge gives me a bunch of "new posts" that are actually error messages.

captn3m02y ago

Hey, I wrote the Goodreads bridge for exactly this usecase. I’ll try to see if I can filter out the error messages.

nfriedly2y ago

Thanks! I've been meaning to play with the code and see if I could figure out how to add a few more features:

* Generate RSS feeds from book series

* Filter out translations

* Filter out compilations (not sure if this one is really plausible)

Any pointers on how I might accomplish some of those?

bubblematrix2y ago· 2 in thread

This honestly is standard web scraping but these projects always catch my attention.

You're at the mercy of rate-limiting firewalls (so you'll have to rotate proxies if you intend on using this heavily) on top of the standard CloudFront bot detection recaptcha, and div-obfuscation (a good example of this is Facebook).

captn3m02y ago

RSS-Bridge has decent caching support, customisable on a bridge level, so that comes pre-tuned and works well at low volumes for personal use.

At large scale, like the kind of traffic I started seeing when I ran a public rss-bridge Instagram/Telegram bridge - rate limits are unavoidable.
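
For anyone rolling their own scraper, the general idea of a per-source cache lifetime can be sketched in a few lines of Python. This is not RSS-Bridge's implementation; the `TTLCache` class, its method names, and the one-hour lifetime are assumptions for illustration. The upstream site is hit at most once per lifetime, however often feed readers poll.

```python
import time


class TTLCache:
    """Cache fetched pages for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no request to the upstream site
        value = fetch(key)
        self.entries[key] = (now + self.ttl, value)
        return value


# Demonstration with a fake clock and a fake fetcher that counts its calls.
fake_now = [0.0]
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"page for {url}"


cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
cache.get_or_fetch("https://example.com/feed", fake_fetch)
cache.get_or_fetch("https://example.com/feed", fake_fetch)  # served from cache
fake_now[0] = 4000.0                                        # lifetime has passed
cache.get_or_fetch("https://example.com/feed", fake_fetch)  # fetched again
print(len(calls))
```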

k1m2y ago

That's been my experience too. Some of the bridges take into account the rate limits imposed by the platforms, and the steps required to get content without an API key.

So using RSS Bridge to generate feeds from large platforms is often a lot more reliable than the typical scraping script I'd code up myself for other sites.

snthd2y ago· 1 in thread

RSSHub[0] is in the same ballpark, but consists of a large library of site-specific code[1][2].

[0]https://github.com/DIYgod/RSSHub/

[1]https://github.com/DIYgod/RSSHub/tree/master/lib/routes

[2]https://github.com/DIYgod/RSSHub/tree/master/lib/v2

PurpleRamen2y ago

RSS Bridge also has a large library of site-specific code; CSS is just one of the hundreds of solutions they offer. And there are some other projects collecting and maintaining recipes for scraping data from sites, Calibre for example, and youtube-dl/yt-dlp for videos. Seeing so many projects all doing the same thing, I kinda feel sad that they are not cooperating to maintain a central recipe collection.

solardev2y ago· 1 in thread

It ded.

Archive: https://web.archive.org/web/20230714202418/https://rss-bridg...

Sample feed: https://web.archive.org/web/20230308160413/https://rss-bridg...

crtasm2y ago

List of public instances: https://rss-bridge.github.io/rss-bridge/General/Public_Hosts...

edit: but the few I tried did not have the CSS Selector Bridge enabled so go with the original link or archive of it.

ChrisArchitect2y ago· 1 in thread

Other services like this: https://www.fivefilters.org/feed-creator/

k1m2y ago

I created Feed Creator, so nice to see it mentioned in the comments :)

I've written two blog posts about how we go about using CSS selectors when working with Feed Creator. Might be useful for those looking to do the same with RSS-Bridge.

How to turn a webpage into an RSS feed using Feed Creator

Part 1: https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...

Part 2 (using more advanced selectors): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...

treyd2y ago· 1 in thread

I wonder if this would work better / be more expressive with XPATH-style selectors?

thirdplace_OP2y ago

rss-bridge also has xpath-style bridge: https://rss-bridge.org/bridge01/#bridge-XPathBridge
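
For comparison with CSS selectors, here is a small Python sketch of XPath-style extraction. It uses the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup, so treat it as an illustration of the expression style and not of how the XPathBridge works internally; the sample page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page; real-world HTML often needs a lenient parser.
PAGE = """<html><body>
<div class="post"><h2><a href="/a">First</a></h2></div>
<div class="sidebar"><h2><a href="/ads">Sponsored</a></h2></div>
<div class="post"><h2><a href="/b">Second</a></h2></div>
</body></html>"""

root = ET.fromstring(PAGE)
# Select links in post headings only, skipping the sidebar.
links = [
    (a.text, a.get("href"))
    for a in root.findall(".//div[@class='post']/h2/a")
]
print(links)
```

The rough CSS equivalent would be `div.post > h2 > a`, though note the XPath predicate above matches the `class` attribute exactly, whereas a CSS class selector matches any element that lists that class.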

skribanto2y ago· 1 in thread

Getting 502 Bad Gateway

kalupa2y ago

yea, suffering from success ...

awesomegoat_com2y ago

I was always afraid to use one of these. I thought that the CSS selectors would be too brittle and ultimately break.

I have built my own solution that is automagical at https://awesomegoat.com/ but I am running into the next set of issues, which are various scraping protections. It seems that a reasonable RSS gateway today needs to include a botnet of residential proxies just to read content on the internet.

xnx2y ago

This is a great tool! Before I learned about nitter, this was my primary way to follow people on Twitter. I love the idea of trying to wrestle unsupported feeds (Twitter, Instagram, etc.) into a standard/open format.

jasonlotito2y ago

The lack of feed generation is why so many of the latest blog platforms are non-starters in my book. It boggles my mind. Honestly, if you don't generate a feed of some sort, I really can't take you seriously.

okuntilnow2y ago

Huginn is another useful tool that allows you to wrangle CSS selectors and XPath nodes to create RSS feeds.

I use it quite successfully to get data out of undocumented APIs and out into RSS.

https://github.com/huginn/huginn

CoBE102y ago

For me PolitePol is best because it doesn't limit the number of feeds and the free plan is pretty good: https://politepol.com

account-52y ago

Is there a standalone application that can do something similar? One that doesn't require a web server to run, like an RSS reader you'd run on your desktop or phone? I'd definitely be interested in that.

Hamuko2y ago

FreshRSS has XPath scraping.

https://danq.me/2022/09/27/freshrss-xpath/

midasz2y ago

Does it work for websites that fetch content async? I've had success with https://morss.it instead (which can also be selfhosted)

simonjgreen2y ago

This is very similar to how you can scrape data from the web with Power Query

kayson2y ago

FreshRSS has this feature built in. But you can use rss-bridge for far more complicated scenarios too
