[0] https://github.com/DIYgod/RSSHub/
Edit: the few I tried did not have the CSS Selector Bridge enabled, so go with the original link or an archive of it.
I have built my own solution that is automagical at https://awesomegoat.com/ but I am running into the next set of issues, which are various scraping protections. It seems that a reasonable RSS gateway today needs to include a botnet of residential proxies just to read content on the internet.
It works pretty well, although every once in a while Goodreads hiccups, and then RSS bridge gives me a bunch of "new posts" that are actually error messages.
* Generate RSS feeds from book series
* Filter out translations
* Filter out compilations (not sure if this one is really plausible)
Any pointers on how I might accomplish some of those?
I use it quite successfully to get data out of undocumented APIs and out into RSS.
You're at the mercy of rate-limiting firewalls (so you'll have to rotate proxies if you intend to use this heavily), on top of the standard CloudFront bot detection, reCAPTCHA, and div obfuscation (Facebook is a good example of this).
At large scale, like the kind of traffic I started seeing when I ran a public rss-bridge Instagram/Telegram bridge, rate limits are unavoidable.
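For what it's worth, the polite half of coping with rate limits (as opposed to rotating proxies) can be sketched as retrying with exponential backoff. This is only a sketch: `RateLimited`, `fetch_with_backoff`, and the `fetch` callable are hypothetical names of my own, not part of rss-bridge, and it will not get you past bot detection.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when the site answers with a rate limit."""


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each rate-limited attempt.

    `fetch` is any callable that returns the page body or raises RateLimited.
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of attempts, let the caller decide what to do
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

You would pass in whatever function actually does the HTTP request and have it raise `RateLimited` on the response your target uses for throttling (often HTTP 429, but that is an assumption about the site).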
So using RSS Bridge to generate feeds from large platforms is often a lot more reliable than the typical scraping script I'd code up myself for other sites.
I've written two blog posts about how we go about using CSS selectors when working with Feed Creator. Might be useful for those looking to do the same with RSS-Bridge.
How to turn a webpage into an RSS feed using Feed Creator
Part 1: https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
Part 2 (using more advanced selectors): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
- splitting the full feed into separate feeds by article theme, and at the same time
- removing a few keywords, and also
- getting the article length and splitting into a long / short feed
- or maybe getting what you used to have on some news sites: subscribing only to a specific author instead of getting bombarded with hundreds of items in a feed
I don't know of any service that does that automatically, but a generic way of doing what you need is attainable. That's the power of rss-bridge: make the feed you want from content that already exists.
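To make that a bit more concrete, here is a rough sketch of those filters in Python rather than as an actual rss-bridge bridge. Everything in it is made up for illustration: the sample feed, the field names, and the `parse_items` / `split_feed` helpers. It assumes each item carries a usable `<category>` and `<author>`, which many real feeds do not.

```python
import xml.etree.ElementTree as ET

# Invented sample feed, just enough RSS 2.0 to exercise the filters.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>Budget vote delayed</title><author>alice</author><category>politics</category><description>Short note.</description></item>
<item><title>Sponsored: buy now</title><author>bob</author><category>tech</category><description>Ad copy here.</description></item>
<item><title>Kernel scheduler deep dive</title><author>alice</author><category>tech</category><description>word word word word word word word word</description></item>
</channel></rss>"""


def parse_items(rss_xml):
    """Turn an RSS 2.0 document into a list of plain dicts."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": node.findtext("title", default=""),
            "author": node.findtext("author", default=""),
            "category": node.findtext("category", default=""),
            "text": node.findtext("description", default=""),
        }
        for node in root.iter("item")
    ]


def split_feed(items, banned=(), long_words=5, author=None):
    """Drop unwanted items, then group the rest by theme and by length."""
    by_theme = {}
    by_length = {"long": [], "short": []}
    for item in items:
        title = item["title"].lower()
        if any(word.lower() in title for word in banned):
            continue  # keyword filter
        if author is not None and item["author"] != author:
            continue  # author-only subscription
        by_theme.setdefault(item["category"], []).append(item["title"])
        # Word count of the description stands in for article length.
        bucket = "long" if len(item["text"].split()) >= long_words else "short"
        by_length[bucket].append(item["title"])
    return by_theme, by_length
```

Calling `split_feed(parse_items(SAMPLE_RSS), banned=["sponsored"])` gives one dict keyed by theme and one keyed by long / short; each of those lists could then be written out as its own feed.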
My take is that some specifications can be written out in a linear way where you can start reading at the beginning and work to the end and not feel like you need to read ahead.
Some specs have a minor discontinuity. I remember perceiving it in the K&R book on C, but it seemed like there was just one kink in it, and if you read the book twice you’d do OK.
Books in C++ are worse and have numerous topics that resist being put in the right order. It’s not unusual for “resource acquisition is initialization” to be repeated hundreds of times before it is defined, for instance.
That circularity is both a function of the domain and a function of the text. I think a certain amount of circularity is inherent to many domains, but frequently you can bootstrap a domain by dividing it into numerous layers and putting the circularity into a layer built just to manage the circularity.
XSLT, XML Schema, and many XML specs have that kind of circular structure. You are left wondering what exact kind of machine is required to implement it, so you can look at the spec and have a hard time understanding how to do easy things, with no grasp of the hard-looking things that are actually easy. Couple that with numerous sharp edges in XML, such as numeric values not being allowed in ID or IDREF fields (hate to break it to them, but numeric identifiers are rampant in the industry), and it is no wonder people would rather use deeply lame ‘standards’ like JSON that lack comments, aren’t really clear about the semantics of numbers, and don’t have the moral authority to say “quit screwing around and just use ISO 8601 dates”.
Now I finally realized the OWL spec is perfectly clear in the sense that you can understand what it really does by understanding the mapping of OWL axioms to first-order logic, but the trouble is that logic is the most treacherous branch of mathematics.
For me, "CSS selectors" always seems like a deceptive term, if it means selecting HTML tag elements. What if the website does not use styling?
I read 1000s of websites, including all HN submissions, without using CSS. When I want to extract information from a website, I focus on patterns in the page. They might be HTML, they might be style elements, but they could be anything. I never assume that all websites will wrap the information I want in certain elements. There is a ridiculous amount of random variation amongst websites.
The key here is that it uses selectors, not the style sheets themselves.
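A toy illustration of that point, assuming nothing beyond the Python standard library: the page below has no stylesheet at all, yet a selector like `h2.story` still matches, because it is evaluated against the tags and attributes in the HTML. The `SelectText` class and `select` helper are hypothetical and only understand `tag` or `tag.class`; real tools use a full selector engine.

```python
from html.parser import HTMLParser


class SelectText(HTMLParser):
    """Collect the text of elements matching a tiny 'tag' or 'tag.class' selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.depth = 0       # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select(html, selector):
    parser = SelectText(selector)
    parser.feed(html)
    parser.close()
    return parser.results


# No <style>, no <link rel="stylesheet">: the class names are just attributes.
PAGE = (
    '<html><body><h2 class="story">First headline</h2><p>Filler</p>'
    '<h2 class="story featured">Second <em>headline</em></h2>'
    "<h2>Not a story</h2></body></html>"
)

headlines = select(PAGE, "h2.story")
```

If a site wraps what you want in nothing consistent, a selector has nothing to hold on to and you are back to matching patterns in the raw text.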