0 points · nojs · 5y ago · 0 comments

Thanks, I’ll check it out!

jstrieb 5y ago · 1 in thread

It's worth noting that my extension is far from perfect – it turns out that determining whether a specific page has been submitted to Hacker News is far from a trivial problem to solve. In general, this is because multiple URLs can map to the same page.

Direct string comparison of the current URL to previously submitted ones doesn't work because there are many ways for two identical web pages to have different URLs. For example, the URL fragments can differ (the part after the "#" that may or may not be present). There can also be tracking parameters (often, but not necessarily, prefixed with "utm_"), which don't change anything about the page. But the URL parameters can't be entirely disregarded because some sites, forums in particular, rely on them – consider sites that use an "?id=..." parameter to distinguish between pages. Thus some parameters should be removed, but some shouldn't. The same website having different domains (or domains that change over time) further complicates the situation.

My solution was to "canonicalize" URLs by transforming them into a simplified form using some pretty rough heuristics for common sources of noise. The Python code to do that is here: https://github.com/jstrieb/hackernews-button/blob/master/can...

All of this to say that even though I've used my extension for months and have been quite happy, there will inevitably be false negatives.
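
For readers who want a concrete picture of the heuristics described above, here is a minimal sketch in Python. It is not the code from the linked repository; the function name canonicalize, the list of tracking parameters, and the decisions to drop the scheme, a leading "www.", and a trailing slash are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical lists of "noise" parameters; a real list would be longer.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def canonicalize(url: str) -> str:
    """Reduce a URL to a simplified form for approximate comparison."""
    parts = urlsplit(url)

    # Hostnames are case-insensitive; treating "www." as noise is an assumption.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Keep parameters such as "id", drop only the ones that look like tracking.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    )

    # Ignore a trailing slash; the scheme and the fragment are dropped entirely.
    path = parts.path.rstrip("/") or "/"

    result = host + path
    if query:
        result += "?" + urlencode(query)
    return result


a = canonicalize("https://www.Example.com/post/?utm_source=hn&id=42#comments")
b = canonicalize("http://example.com/post?id=42")
c = canonicalize("http://example.com/post?id=43")
print(a == b, a == c)
```

The first two URLs collapse to the same string while the third, which differs only in its "id" parameter, stays distinct. Any heuristic like this can still miss pages that live under different domains, which is one source of the false negatives mentioned above.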

moehm 5y ago

A solution designed to be used by search engines is the canonical link element[0] present in the HTML of the page. I'm not sure how this would work for a browser extension, as you would have to crawl every submitted site once and save the canonical version.

[0] https://en.wikipedia.org/wiki/Canonical_link_element
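
As a rough illustration of the crawling step mentioned above, the sketch below pulls the canonical link out of a page's HTML using only the Python standard library. The names CanonicalLinkParser and find_canonical are made up for this example, and it assumes the HTML has already been fetched; fetching every submitted page and storing the result is left out.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


page = (
    '<html><head><link rel="stylesheet" href="a.css">'
    '<link rel="canonical" href="https://example.com/post"></head></html>'
)
print(find_canonical(page))
```

Not every page declares a canonical link, so a lookup like this would probably need to fall back on some other way of comparing URLs when it returns None.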