It's privacy-focused: no cookies, no Google or Facebook components (like Google Analytics or Ads), and no data tracking on users whatsoever.
There are bugs I'm aware of, but I'm looking for feedback on whether this format and functionality are useful.
For those interested in the NLP side of this or the serverless side, I'd be happy to answer questions about how it was put together. The short version: I pull down RSS feeds from 33 news sites (roughly 1M stories in the database so far), store them, build a term-frequency model, cluster the most recent 10k stories based on TF similarity vectors, and then store the story similarities.
In the future I'd like to add paging, searching, and more filtering. I'm also thinking about having a URL for each story that would show all the similar articles. That way, if you don't want to link directly to a particular news source, you can link to the drewesnews aggregate URL, and readers can pick whichever source they want to read.
Any feedback would be much appreciated!
There are three major phases to the document matching: Corpus Management, Frequency Modeling, and Matching & Clustering.
1. Corpus Management
- First we pull the raw articles
- Then we build a bag-of-words representation of each document, based on our overall dictionary, so each document ends up as a set of numbers representing its words and their frequencies.
- We store all the stories for that day as bags-of-words, in Matrix Market format.
- The day's .mm file is uploaded to an S3 bucket every few minutes, along with the dictionary. This way I can easily compose a corpus of documents for a single day, multiple days, or all time.
- I use the Gensim library for Python to do most of the above.
2. Frequency Modeling
- As stories come in throughout the day, I periodically refresh a TF-IDF model for the entire corpus of stories.
- TF-IDF just lets us see which terms occur frequently in a story but relatively infrequently across all stories. So a word like "hacker" would be relevant if it appears frequently in one story, but "the" would not, since it occurs across all documents.
- TF-IDF modeling also uses the Gensim library.
- The TF-IDF model is stored in another S3 bucket, to be pulled on update by the Matching & Clustering job.
3. Matching & Clustering
- First we fit a limited corpus (usually the last 10k or so stories) to the TF-IDF model, and keep that as a sparse matrix.
- This gives us the ability to quickly determine the per-word importance of a document, and to represent that as a vector.
- Next we do a simple cosine similarity of the 10k documents against themselves. This tells us the angle between each pair of document vectors; in other words, it's a measure of similarity.
- We limit all of this to 10k documents, because it would be computationally prohibitive to compare ALL stories to ALL OTHER stories. Since most news stories are published relatively close to one another, we only have to compare recent stories. 10k seems to produce good results. We can further shard this data set by news category if we need to (i.e., compare only sports stories or political stories).
- Next we use SciPy to create a Ward linkage matrix.
- Next, also using SciPy, we use fcluster to extract flat clusters from that linkage. These last two steps produce a cluster tree that groups similar stories together.
- Finally, we slice apart the clustered data set at a certain similarity height. Kind of like cutting through a head of broccoli above the stalk. The clusters we're left with are the most similar articles of that batch of 10k.
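Put together, the similarity-and-clustering steps look roughly like this with SciPy. The tiny dense matrix and the cut height are made up for illustration; the real pipeline works on a sparse 10k-document TF-IDF matrix:

```python
# Sketch of the matching & clustering phase: cosine distances over TF-IDF
# vectors, a Ward linkage matrix, then fcluster to slice the tree.
# The matrix and threshold below are illustrative, not production values.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy TF-IDF matrix: 5 "stories" x 4 terms. Rows 0-1 and rows 2-3 are
# near-duplicates of each other; row 4 is unlike everything.
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.1, 0.8, 0.1],
    [0.3, 0.3, 0.3, 0.3],
])

# Condensed pairwise cosine distances (1 - cosine similarity).
dist = pdist(X, metric="cosine")

# Agglomerative clustering via a Ward linkage matrix.
Z = linkage(dist, method="ward")

# Cut the tree at a chosen height ("slicing the broccoli above the
# stalk"); stories sharing a flat cluster label are treated as similar.
labels = fcluster(Z, t=0.3, criterion="distance")
```

The cut height controls how aggressive the grouping is: lower values keep only near-duplicate stories together, higher values merge loosely related ones.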
After all that is done, we just store associations between articles that were determined to be similar to one another. Since we're constantly running this process, the 10k story window keeps sliding forward, so you're able to store similarities of stories that are similar to far older stories, provided there is another story in-between that both are similar to. For example, if story 12,000 is similar to story 9,000, and story 9,000 is similar to story 10, you'll end up storing that story 10 is similar to story 12,000, even though you didn't directly compare them.
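That transitive chaining can be illustrated with a tiny graph walk over stored associations. The story ids echo the example above and the structure is hypothetical; the real system stores associations in a database:

```python
# Illustration of chained similarity: stories never directly compared
# still end up connected through intermediate stories. Ids are made up.
from collections import defaultdict

# Pairwise associations stored by successive 10k-window runs.
associations = [(12000, 9000), (9000, 10)]

# Build an undirected association graph.
graph = defaultdict(set)
for a, b in associations:
    graph[a].add(b)
    graph[b].add(a)

def related(story_id):
    """All stories reachable from story_id through chained associations."""
    seen, stack = set(), [story_id]
    while stack:
        s = stack.pop()
        for nbr in graph[s]:
            if nbr not in seen and nbr != story_id:
                seen.add(nbr)
                stack.append(nbr)
    return seen
```

Here story 10 and story 12,000 were never in the same 10k window, yet they come back as related through story 9,000.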
Quick note: the first time I went to the site, it was blank except for the navbar. I refreshed, went back and then forward again, and finally tapped the hamburger icon, at which point the content showed.
I suspect the content payload is large, since the page looks very long. Maybe a spinner or a shorter page would prevent this UX delay.
Subsequent page loads bring up the content right away.
Although, I'm surprised you didn't see the page load UI. There should be a pulsating circle to indicate it's loading.
Thanks for the feedback!