
0 points · _c_ · 9y ago · 0 comments

Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.

They can scrape your website, and then they prevent you from scraping your own data back.

The whole process is silly; it reflects the duct tape and chicken wire nature of the www.

No one should have to "scrape" or "crawl".

Data should be put into an open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.

This is to bridge the gap until we reach a more content-addressable system (cf. location-based).
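To make "content addressable" concrete, here is a rough sketch, not a real protocol. The address of a piece of data is a hash of its bytes, so any mirror can serve it and the client can check what it got. The `Mirror` class, `address_of` and `fetch` are made-up names for illustration, and SHA-256 is just an assumed choice of hash.

```python
import hashlib


def address_of(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class Mirror:
    """Hypothetical mirror: an in-memory store keyed by content address."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        addr = address_of(data)
        self._blobs[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        data = self._blobs[addr]  # raises KeyError if this mirror lacks it
        if address_of(data) != addr:
            raise ValueError("mirror returned data that does not match its address")
        return data


def fetch(addr: str, mirrors) -> bytes:
    """Ask each mirror in turn; which one answers does not matter."""
    for mirror in mirrors:
        try:
            return mirror.get(addr)
        except (KeyError, ValueError):
            continue
    raise LookupError(f"no mirror has {addr}")
```

The point of the design is that the name says what the data is, not where it lives, so no single host has to be trusted or even reachable.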

Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.
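As a sketch of what "transform at the edge" could look like: assume, purely for illustration, that the universal format is a few `key: value` lines, a blank line, then body text. That layout and the function names below are my assumptions, not an existing standard. The same record can then become HTML in one client and wrapped plain text in another.

```python
import html
import textwrap


def parse_record(text: str) -> dict:
    """Split the assumed format into header fields and a body."""
    head, _, body = text.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    fields["body"] = body.strip()
    return fields


def paragraphs(record: dict) -> list:
    """Body paragraphs are separated by blank lines."""
    return [p.strip() for p in record["body"].split("\n\n") if p.strip()]


def to_html(record: dict) -> str:
    """One possible client: generate the markup locally."""
    title = html.escape(record.get("title", ""))
    body = "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs(record))
    return f"<article><h1>{title}</h1>{body}</article>"


def to_text(record: dict, width: int = 72) -> str:
    """Another client: a plain text reader, no markup at all."""
    lines = [record.get("title", "").upper(), ""]
    for p in paragraphs(record):
        lines.append(textwrap.fill(p, width))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Nothing stored or mirrored contains tags; the tags only exist after a client that wants them has generated them.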

"Crawling" should not be necessary.

No one should have to store HTML tags and other window dressing for data.

Dream on.


2 comments · 1 top-level

NamTaf · 9y ago · 1 in thread

That's the antithesis of the world wide web because you've just centralised data storage, which makes someone 'own' the www.

_c_ (OP) · 9y ago

I do not understand your argument.

To give an example, there is a lot of free open source software mirrored all over the internet, mostly on ftp servers, but also on http, rsync, etc.

If you use Linux or BSD you probably are using some of this software. If you use the www, then you are probably accessing computers that use this software. If you drive a new Mercedes you are probably using some of this software. There are a lot of copies of this code in a lot of places.

Is that centralized? Does anyone hosting a mirror ("repository") "own" the software? Is it the same person or entity hosting every mirror?

Compare Google's copies of everyone else's data, also replicated in a lot of places around the world. Who "owns" this data?
