
0 points · ernsheong · 8y ago · 0 comments

Thanks :) Yup, I source news from the local news portals listed on the website. Every hour, each source is crawled in each relevant category for the latest news items, which are stored in a database. A second job then fetches the body of each of these articles. (I use https://github.com/robfig/cron for the cron jobs.) The server and the HTML templates are both written in Go. As for the aggregation (grouping) algorithm, I'll just say that it's straight out of the textbook: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf


3 comments · 1 top-level

e15ctr0n · 8y ago · 2 in thread

Thanks for the detailed reply. :-)

> As for the aggregation (grouping) algorithm, I'll just say that it's straight out of the textbook http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

So, in other words, you're using the MinHash algorithm as well as Locality-sensitive hashing (LSH)? How much volume are you able to process in how much time?

By the way, I first learned about this topic through Stanford’s “Mining of Massive Datasets” (MMDS) course that used to be free on Coursera. So it's thrilling to see someone put it to use in the real world and talk about it, too! :-)

ernsheong (OP) · 8y ago

Yup, MinHash with LSH. It's quite fast and not compute-intensive, because the articles shown are limited by recency (e.g. the past 24 hours): on the order of hundreds to thousands of articles, processed in a few seconds. Someone wrote an open-source LSH library in Go on GitHub, so no credit to me :) I probably would not have been able to code LSH myself.
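
For readers who haven't seen the technique, here's a minimal MinHash + banded LSH sketch in Go using only the standard library. It is not the open-source library mentioned above and not the site's code. The parameters (64 hash functions, 16 bands, 2-word shingles) and all function names are assumptions picked for illustration, and seeding FNV-1a with a prefix is just a cheap way to get a family of hash functions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
	"strings"
)

const (
	numHashes = 64 // signature length (assumed)
	bands     = 16 // LSH bands, so 4 rows per band (assumed)
)

// shingles returns the set of k-word shingles of a text.
func shingles(text string, k int) map[string]struct{} {
	words := strings.Fields(strings.ToLower(text))
	set := make(map[string]struct{})
	for i := 0; i+k <= len(words); i++ {
		set[strings.Join(words[i:i+k], " ")] = struct{}{}
	}
	return set
}

// hash64 hashes a shingle with FNV-1a, prefixed by a seed so that each
// seed behaves like a different hash function.
func hash64(s string, seed uint64) uint64 {
	h := fnv.New64a()
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], seed)
	h.Write(b[:])
	h.Write([]byte(s))
	return h.Sum64()
}

// signature keeps, for each hash function, the minimum hash over the set.
func signature(set map[string]struct{}) []uint64 {
	sig := make([]uint64, numHashes)
	for i := range sig {
		sig[i] = math.MaxUint64
	}
	for s := range set {
		for i := range sig {
			if h := hash64(s, uint64(i)); h < sig[i] {
				sig[i] = h
			}
		}
	}
	return sig
}

// similarity estimates Jaccard similarity as the fraction of matching
// signature positions.
func similarity(a, b []uint64) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}

// candidatePairs buckets each band of each signature and returns every
// pair of document indexes that share at least one bucket.
func candidatePairs(sigs [][]uint64) map[[2]int]struct{} {
	buckets := make(map[string][]int)
	for id, sig := range sigs {
		rows := len(sig) / bands
		for b := 0; b < bands; b++ {
			key := fmt.Sprint(b, sig[b*rows:(b+1)*rows])
			buckets[key] = append(buckets[key], id)
		}
	}
	pairs := make(map[[2]int]struct{})
	for _, ids := range buckets {
		for i := 0; i < len(ids); i++ {
			for j := i + 1; j < len(ids); j++ {
				pairs[[2]int{ids[i], ids[j]}] = struct{}{}
			}
		}
	}
	return pairs
}

func main() {
	docs := []string{
		"minister announces new budget for public transport",
		"minister announces new budget for public transport today",
		"local team wins the national football final",
	}
	sigs := make([][]uint64, len(docs))
	for i, d := range docs {
		sigs[i] = signature(shingles(d, 2))
	}
	fmt.Println("estimated similarity of docs 0 and 1:", similarity(sigs[0], sigs[1]))
	fmt.Println("candidate pairs:", candidatePairs(sigs))
}
```

The point of the banding step is that only documents landing in the same bucket get compared, so the work stays far below all-pairs comparison. With only a day's worth of articles in play, that fits the "few seconds" figure above.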

e15ctr0n · 8y ago

It would be awesome if you blogged about your entire experience setting up your news aggregator. But I guess your first priority is PageDash these days so I can keep dreaming. :-)
