0 points · marginalia_nu · 3mo ago

Anyone actually scraping git repos would probably just do a 'git clone'. Crawling git hosts is extremely expensive, as git servers have always been inadvertent crawler traps.

They generate a URL for every version of every file on every commit, branch, and tag, and, as if that weren't enough, n(n+1)/2 git diffs for every file across the n commits it has existed on. Even a relatively small git repo with a few hundred files and commits explodes into millions of URLs in the crawl frontier. Many of these are also very expensive to generate server-side, so it's really not a fantastic interaction for either the crawler or the git host.
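To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the worst case where every file exists on every commit, and the function name and parameters are made up for illustration; real web UIs will differ in which URLs they actually expose.

```python
def estimate_crawl_urls(files: int, commits: int, refs: int = 0) -> dict[str, int]:
    """Rough worst-case URL count for one repo behind a git web UI.

    Assumes every file exists on every commit (an upper bound, not a measurement).
    `refs` is the number of extra branches and tags that get their own file views.
    """
    # One file view per file per commit, plus one per file per branch/tag.
    file_views = files * (commits + refs)
    # n(n+1)/2 diff URLs per file across the n commits it exists on.
    diffs = files * commits * (commits + 1) // 2
    return {"file_views": file_views, "diffs": diffs, "total": file_views + diffs}


if __name__ == "__main__":
    # "A few hundred files and commits"
    print(estimate_crawl_urls(files=300, commits=300))
```

With 300 files and 300 commits the diff term alone comes to 13,545,000 URLs, which is where the "millions" comes from: the file views grow linearly with the commit count, the diffs quadratically.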

If you run a web crawler, you need to add git host detection to actively avoid walking into them.
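A minimal sketch of what such detection could look like, assuming a purely URL-based heuristic. The path segments below are ones commonly seen in git web UIs, but the list, the threshold, and both function names are illustrative assumptions, not how any particular crawler does it; a real implementation would likely also look at page markup and would need to handle false positives.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit

# Assumed path segments that git web UIs often use for per-commit views.
# This list is illustrative and will produce false positives on some sites.
_GIT_VIEW_RE = re.compile(r"/(?:commit|commits|blob|blame|tree|compare|diff|patch)(?:/|$)")


def looks_like_git_view(url: str) -> bool:
    """True if the URL path contains a segment typical of a git web UI view."""
    return bool(_GIT_VIEW_RE.search(urlsplit(url).path))


def is_probable_git_host(sampled_urls: Iterable[str], threshold: float = 0.5) -> bool:
    """Flag a host when at least `threshold` of its sampled frontier URLs look like git views."""
    urls = list(sampled_urls)
    if not urls:
        return False
    hits = sum(looks_like_git_view(u) for u in urls)
    return hits / len(urls) >= threshold
```

A crawler could run `is_probable_git_host` over the first few dozen links discovered on a host and, when it fires, stop following per-commit links there (or fall back to a `git clone` if the repo contents are actually wanted).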

6 comments · 3 top-level

Tharre · 3mo ago · 2 in thread

And yet, it's exactly what all the AI companies are doing. Whatever it costs them in server costs and goodwill seems to be worth less to them than the engineering time to special-case the major git web UIs.

marginalia_nu (OP) · 3mo ago

I doubt they're actually interested in the git repos.

From the shape of the traffic it just looks like a poorly implemented web crawler. By default, a crawler that does not take measures to actively avoid git hosts will get stuck there and spend days trying to exhaust the links of even a single repo.

kjuulh · 3mo ago

For me it was specifically crawlers from the large companies; they were at least announcing themselves as such. They did have different patterns: ByteDance was relatively well behaved, but some of the lesser-known ones had weird patterns of looking at comparisons.

I do think they care about repos, and not just the code but also how it evolves over time. I can see some use, if marginal, in those traits. But if they really wanted that, I'd rather they clone my repos; I'd be totally fine with that. But I guess they'd have to deal with state, and they likely don't want to deal with that. Rather just increase my energy bill ;)

Eldt · 3mo ago · 1 in thread

How probable is your "probably"?

marginalia_nu (OP) · 3mo ago

Well, one is 60 repos per hour, and the other is 60 hours per repo.

account42 · 2mo ago

If you are hosting your own git repos you don't really need to provide diffs between arbitrary revisions - just pregenerate diffs between each commit and its parent(s) and tell people to clone the repo if they want anything fancier. Maybe add a few more cases like diffs between releases if you are feeling nice.
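As a rough sketch of that pregeneration step, assuming the `git` CLI is on the PATH: `git rev-list --parents` prints each commit followed by its parents, which is enough to enumerate the commit/parent pairs. The function names and the output file naming are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def parent_pairs(rev_list_output: str) -> list[tuple[str, str]]:
    """Parse `git rev-list --parents` output into (parent, commit) pairs.

    Each line is "<commit> <parent1> <parent2> ...". Root commits have no
    parents and therefore yield no pairs.
    """
    pairs = []
    for line in rev_list_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        commit, parents = fields[0], fields[1:]
        pairs.extend((parent, commit) for parent in parents)
    return pairs


def pregenerate_diffs(repo_dir: str, out_dir: str) -> int:
    """Write one diff file per (parent, commit) pair reachable from HEAD."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    listing = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--parents", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    pairs = parent_pairs(listing)
    for parent, commit in pairs:
        # Capture bytes rather than text: diffs may contain non-UTF-8 content.
        diff = subprocess.run(
            ["git", "-C", repo_dir, "diff", parent, commit],
            check=True, capture_output=True,
        ).stdout
        (out / f"{parent}..{commit}.diff").write_bytes(diff)
    return len(pairs)
```

For a linear history of n commits this produces n-1 static files, so the number of diff pages grows linearly with history instead of quadratically.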

And you also don't need to host a version of each file for each commit - those should just be HTTP redirects to a unique URL for that version of the file, e.g. to the commit that last changed it - or just don't provide it at all since most people are only going to be interested in branches anyway and others can clone the repo.

The same goes for many other expensive operations that other websites (including blogs and forums) do that cause the website to go down when a bad crawler finds it. It's almost all self-inflicted pain that doesn't even provide meaningful features to real users compared to a better designed website with a finite number of pages that you can even host statically if you want.
