
0 points · PaulHoule · 2y ago · 0 comments

Wrote my own RSS reader. It ingests 3000 articles a day from 100 feeds and picks out 300 a day to show me. I give every one of those a thumbs up or thumbs down, and that trains a machine learning model to pick the next 300.

At least that’s the plan. It also shows me the “top 100 likely to front page HN” and “top 100 likely to get a lot of comments on HN” and I scan those quickly. I usually take more than a day to read the 300, which is OK because more articles go into the process, so what gets selected is even better. If you look at what I submit to HN, those articles were all selected by YOShInOn first and me second.

That gives a good idea of what my feed looks like, but I also like articles about sports, which I rarely submit to HN.


2 comments · 1 top-level

johhns4 · 2y ago · 1 in thread

Okay interesting! So how did you build it?

PaulHoule (OP) · 2y ago

One of these days I am going to blog about it but here is the short of it:

I thought RSS readers sucked in 2004 ("mark as read?" really?) and was involved in text classification enough to know a machine learning powered RSS reader was possible.

I thought about it on and off for a while but it was last December when the Twitter crisis broke that I felt I had to act.

It uses Superfeedr to ingest RSS feeds. It costs about $10 a month to process 100 feeds, which I can afford, but I could not really afford to ingest 1000 independent blogs, which is too bad. Superfeedr posts the items to an AWS Lambda function, which puts them in an SQS queue; YOShInOn takes items from the queue at its convenience.
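For anyone wondering what that webhook-to-queue hop could look like, here is a minimal sketch, not the actual YOShInOn code. It assumes the Lambda sits behind an HTTP endpoint that delivers the request as an event with `body` and `isBase64Encoded` keys, and it treats the payload as opaque text. The names `make_webhook_handler`, `drain_queue`, and the `QUEUE_URL` environment variable are made up for illustration; the only AWS calls used are boto3's SQS `send_message`, `receive_message`, and `delete_message`.

```python
import base64
import json
import os


def make_webhook_handler(sqs_client, queue_url):
    """Build a Lambda-style handler that forwards each webhook body to SQS."""

    def handler(event, context):
        body = event.get("body") or ""
        if event.get("isBase64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        if not body:
            # SQS rejects empty message bodies, so refuse them here.
            return {"statusCode": 400, "body": json.dumps({"queued": False})}
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
        return {"statusCode": 200, "body": json.dumps({"queued": True})}

    return handler


def drain_queue(sqs_client, queue_url, handle_item, max_batches=10):
    """Pull waiting items and delete each one only after it has been handled."""
    handled = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # queue is empty for now
        for message in messages:
            handle_item(message["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            handled += 1
    return handled


def lambda_handler(event, context):
    """Entry point on AWS; QUEUE_URL is a hypothetical environment variable."""
    import boto3  # imported lazily so the helpers above can be tested offline

    client = boto3.client("sqs")
    return make_webhook_handler(client, os.environ["QUEUE_URL"])(event, context)
```

The desktop side would call `drain_queue` whenever it wakes up, which is what lets it consume items at its own pace instead of having to be online when the feed updates.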

Something that crawls RSS feeds directly is possible but a hassle in terms of another thing to code and the system having to do frequent polling. I get asked all the time about making it open source and one of the problems is I don't think many people will want to have the cloud component.

Those items go into an ArangoDB database. I use SBERT to tokenize and embed documents, then models from scikit-learn to classify and cluster. This is all in Anaconda Python. It uses the GPU on a very powerful PC but takes just a minute or so to process a day's worth of data, so if it took 10 or 20 minutes on a lesser PC it might still be viable.
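Roughly, the embed-then-classify step could be sketched like this. It is an illustration under assumptions, not the real pipeline: the SBERT model name `all-MiniLM-L6-v2` and the choice of logistic regression are my guesses at a reasonable default, since the comment only says "SBERT" and "models from scikit-learn". The function names `embed`, `train_ranker`, and `pick_top` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Embed texts with SBERT. The model name is an assumption."""
    # Heavy import, loaded lazily so the ranking code runs without it.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts))


def train_ranker(vectors, labels):
    """Fit a classifier on embeddings; labels are 1 (thumbs up) or 0 (thumbs down)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(vectors), np.asarray(labels))
    return clf


def pick_top(clf, vectors, k):
    """Return indices of the k candidates with the highest thumbs-up probability."""
    scores = clf.predict_proba(np.asarray(vectors))[:, 1]
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order]
```

In use, you would embed the day's articles with `embed`, train on everything rated so far, and call `pick_top(clf, todays_vectors, 300)` to get the next batch to show. Because the features are just the embedding, no per-topic feature engineering is involved.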

There is a front end web server using aiohttp and HTMX which implements the user interface. There is another Python package called "mastodonster", which runs in the Python.org Python because Anaconda doesn't support the Mastodon client.

It has a feature to 'favorite' content; I can then click another button to submit a favorite to Hacker News or Mastodon. Posts to HN get queued up so I can select them when I have time, and then it dribbles them out so you never see two of my posts on the "new" page.
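The "dribbling" part can be sketched as a queue that releases at most one item per minimum gap. This is only a guess at the idea, not the actual implementation: the class name `DribbleQueue` and the gap length are hypothetical, and how long a post stays on the "new" page is something you would have to measure yourself.

```python
from collections import deque


class DribbleQueue:
    """Release queued submissions one at a time, at least min_gap_seconds apart."""

    def __init__(self, min_gap_seconds):
        self.min_gap = min_gap_seconds
        self._pending = deque()
        self._last_sent = None  # timestamp of the most recent release

    def add(self, url):
        self._pending.append(url)

    def pop_due(self, now):
        """Return the next URL if enough time has passed, else None."""
        if not self._pending:
            return None
        if self._last_sent is not None and now - self._last_sent < self.min_gap:
            return None
        self._last_sent = now
        return self._pending.popleft()
```

A periodic job would call `pop_due(time.time())` and submit whatever comes back, so a burst of favorites selected in one sitting still goes out spaced apart.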

It spins like a top, despite the fact that I am doing the silly thing of throwing in articles about Postgres and arXiv preprints and news articles from The Guardian about soccer w/o any feature engineering. I'm not afraid to demo it because it never screws up but it is weird to demo because the content it shows seems totally random to anyone else.

I work on it furtively, focused on "pain-driven development" around using it to curate links for social media.

One weakness it has is that you have to evaluate a few hundred articles, maybe even 1000, for the recommendations to get good. If somebody is committed to it they can get to a place where it is great, but I think a lot of people would try it out and expect it to be doing a good job with 10 articles, and it just isn't like that. A commercial version of it would have strong pressures to use some kind of "collaborative filtering" as opposed to content-based recommendation because of that. I have a huge amount of training data that could be repurposed to experiment with a model that learns with less data, but it's not something I'm that interested in doing for myself because I am swimming in data now.

There are a lot of directions it could go in if I get more time to work on it:

1. A cloud-based service that works basically like YOShInOn does now

2. An open source project that does what it does right now but is cleaned up to be easy to install and work on, and

3. A "pro" edition that would be for professional searchers such as recruiters, salespeople, patent searchers, people who do meta-analysis, reporters, etc. That would support multiple workflows so you could have a few different classifications going on.

My current north star is (3) through the social media curation direction. The thing is that my current use case is not that interesting from the viewpoint of advanced A.I. because my problem is very "fuzzy": whether or not I like something is a bit random, so I can't train a highly accurate classifier on it. One of these days I want to use the same U.I. to make a training set where 95% accuracy would be possible.
