Wouldn't it be fun to build your own Google?

(radar.oreilly.com)

150 points · martinkl · 11y ago · 61 comments

Smerity · 11y ago

[lightly modified version of a comment I put on the article as I love HN for discussion!]

Great article -- we're excited there's so much interest in the web as a dataset! I'm part of the team at Common Crawl and thought I'd clarify some points in the article.

The most important is that you can download all the data that Common Crawl provides completely for free, without the need to pay S3 transfer fees or process it only in an EC2 cluster. You don't even need to have an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.

Also, we're performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. As such, we've actually crawled and stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)

Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers used for Internet Archive were for all of the crawls they've done, and in our case the numbers were for a single month's crawl.

We're excited to see Martin use one of our crawl archives in his work -- seeing these experiments come to life is the best part of working at Common Crawl! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!

[1]: http://blog.commoncrawl.org/2014/11/october-2014-crawl-archi...

marktangotango · 11y ago

Before Google's page rank algorithm there was a lot of research into document search. A favorite of mine was 'scatter gather'[1].

I often wanted the ability to filter and group when staring at pages of results from the Red Hat JBoss support forums while trying to fix a dead JBoss cluster. But I quit that job, so I haven't had the need recently.

Edit: point being, it would be nice if someone came up with a service that implemented the ideas in this paper :)

[1] http://www-users.cs.umn.edu/~han/dmclass/scatter.pdf

adwf · 11y ago

Is there any chance we might see an updated index for the Common Crawl? I've tried using the dataset before, but I found it difficult given that you have to process the entire thing in order to find the particular pages you are looking for.

As an example, I was trying a project to look at the top news sites, like BBC, CNN, Al Jazeera, etc., then match articles about the same news topic on each site, before finally fact-checking for differences between the stories (e.g. 20,000 homes were without power vs. 50,000).

That kind of project requires a load of crawling, but I can't look for specific pages without processing the entire CC set first.

I love the project though, so thank you for doing it!

stevesalevan · 11y ago

So awesome to see that folks are discovering Common Crawl, it's a really great project and the amount of data they've crawled has been tremendous. If you're looking to get into machine learning, it's perhaps the most comprehensive dataset you can get your hands on.

jtoy · 11y ago

Any way we can get an open-sourced version of that optimized job for us to play with?

Smerity · 11y ago

There are two levels of optimizations that come into play: AWS setup and the choice of primary language. I'm always happy to speak about both as we love seeing experiments run over the data!

AWS optimizations: For that level of cost efficiency, you really need to use spot instances. A cluster of 100 m1.xlarge machines (1.5TB of RAM, 400 cores, and 168TB of magnetic disk storage) will only cost you $3 per hour using spot instances, rather than $30 on-demand. You should pay on-demand prices for the Hadoop master however.

You'll also want to roll your own Hadoop cluster as opposed to using Elastic MapReduce (EMR). EMR is amazing but the cost overhead when using it on spot instances is ~100%.

For the code itself, this is a situation where you'll want to stick with the programming language that has both the best performance and best ecosystem. I'm personally not a big fan of Java, but it really does win out here -- it's close to C or C++ for performance and has the advantage of the Hadoop ecosystem behind it. Other languages are certainly usable, but even if LanguageX only ran 4x slower than Java, the resulting job would be 4x more expensive due to paying by the hour.

Other than that, it's really just a standard MapReduce job using Hadoop. You can see three examples for the three different data formats we use at: https://github.com/commoncrawl/cc-warc-examples/
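
To make the cost arithmetic above concrete, here is a rough Python sketch (not Common Crawl's actual tooling). The hourly cluster rates and the ~100% EMR overhead are the figures quoted in this comment; the function name and the 5-hour baseline runtime are assumptions for illustration only, and the on-demand Hadoop master is left out.

```python
# Figures quoted in the comment above; treat them as 2014-era approximations.
SPOT_CLUSTER_PER_HOUR = 3.0        # USD, 100 x m1.xlarge on spot
ON_DEMAND_CLUSTER_PER_HOUR = 30.0  # USD, same cluster on-demand
EMR_OVERHEAD_ON_SPOT = 1.0         # "~100%" surcharge quoted for EMR on spot


def job_cost(hours, cluster_rate, slowdown=1.0, overhead=0.0):
    """Estimate the cost of a job that is billed by the hour.

    hours:        runtime of the baseline implementation (hypothetical input)
    cluster_rate: USD per hour for the whole cluster
    slowdown:     runtime multiplier relative to the baseline (4.0 = 4x slower)
    overhead:     fractional surcharge on the hourly rate (1.0 = +100%)
    """
    return hours * slowdown * cluster_rate * (1.0 + overhead)


if __name__ == "__main__":
    baseline_hours = 5  # assumption: "an afternoon"
    scenarios = {
        "spot": job_cost(baseline_hours, SPOT_CLUSTER_PER_HOUR),
        "on-demand": job_cost(baseline_hours, ON_DEMAND_CLUSTER_PER_HOUR),
        "spot, 4x slower language": job_cost(
            baseline_hours, SPOT_CLUSTER_PER_HOUR, slowdown=4.0
        ),
        "spot via EMR": job_cost(
            baseline_hours, SPOT_CLUSTER_PER_HOUR, overhead=EMR_OVERHEAD_ON_SPOT
        ),
    }
    for name, cost in scenarios.items():
        print(f"{name}: ${cost:.2f}")
```

Because the bill scales linearly with runtime, a slower language and a pricier cluster multiply each other, which is why both choices matter.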

curiously · 11y ago

What would be the benefit of having the web as a dataset when it is rife with copyright restrictions (see Craigslist), monopoly businesses that viciously protect their human-uploaded content and profiles, and the majority of the data from the web being useless without the context and purpose of the searcher?

I'm just curious as to how Common Crawl compares with Kimono Labs and import.io, as they seem to have the same goal of turning the internet into a dataset or an API. I can't help but feel like it's just solving another 'semantic web' problem that nobody asked for.

It is funny that the most demanding customers of semantic web are also the ones who are willing to pay the least amount of time and money.

jobposter1234 · 11y ago

Regarding copyright issues, I believe you can still use copyrighted data as long as it's transformed. E.g., building language models, or doing a search engine like Google. In fact, I can think of more computational uses for copyrighted data, while on the "banned" side, I can only think of... SEO.

Regarding point two: "monopoly businesses who viciously protect their human uploaded content". I spend a lot of time scraping these monopoly businesses, and it seems to me they do a decent job of letting their users decide what data is exposed. Facebook, LinkedIn, and Google are all decent about letting me scrape their public info. That's all I have a right to -- private info should stay private, at the behest of the owner (the User in UGC).

You are correct regarding the third point, but I don't see that as a problem. This isn't a solution in search of a problem -- it's a problem without a solution at the moment.

Here's a toy example of something I'd like to do: calculate the positive / negative sentiment of commenters at particular baseball fan sites, so I can hide the content I don't like, and show that which I do. Having a common crawl of the site would be immensely useful (and is indeed a prereq) for this. I wouldn't need to republish it, just compute on it.

pasteurquadrant · 11y ago

andrewhillman · 11y ago

Yeah, this sounds all well and good in theory, but after visiting thousands of sites over the years, it might be a better idea to help engineers build a search engine for their own site/data first. I can't recall many websites that have amazing search. It's a problem when I have to use Google to find what I want on xyz.com because, if I search for it on xyz.com itself, I cannot find it even though I know it's on that site.

It would be so nice to go to xyz.com and actually find what I am looking for in under 1 second.

wslh · 11y ago

I precisely ranted about this issue as a way to challenge Google: http://blog.databigbang.com/letters-from-the-future-challeng...

I think we need to optimize local search engines and aggregate (in some ways) all this into a global search engine.

dugmartin · 11y ago

As a site owner you can pay Google for this:

https://www.google.com/work/search/products/gss.html

I used it several years ago to power this site search:

http://www.poetryfoundation.org/search/articles#qs=ginsberg

It works pretty well (disclaimer: I'm not sure this is still being used on that site)

andrewhillman · 11y ago

Yes. It's been around for years. I have used it, but this either misses my point or proves it. Either way, sites need to figure out how to power their own search.

arthurcolle · 11y ago

I'm pretty sure Elastic is heading in the right direction in this regard, don't you think?

nemothekid · 11y ago

I think "full-text indexing" is a very different problem than "search."

Full-text indexing (what ES provides) has been around almost forever; ES just does a way better job of productizing/delivering it.

However, Google is far more than a text index. Ranking is currently still very difficult and requires messing around with facet and weighting parameters.
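
To illustrate the distinction, here is a toy Python sketch. It is not how Elasticsearch or Google work internally; the function names and the plain TF-IDF weighting are assumptions chosen for brevity. `match` is the "full-text index" part (which documents contain the terms), while `rank` is a first, naive step toward "search" (which of those documents should come first).

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def match(index, query):
    """Full-text retrieval: ids of docs containing every query term, unranked."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


def rank(index, query, num_docs):
    """Toy TF-IDF ranking: rarer terms and repeated terms score higher."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


if __name__ == "__main__":
    docs = {
        "a": "crawl the web",
        "b": "web web search engine",
        "c": "search the archive",
    }
    index = build_index(docs)
    print(match(index, "web search"))
    print(rank(index, "web search", len(docs)))
```

Everything hard about ranking (link signals, spam resistance, tuning the weights) lives outside a sketch like this, which is the point being made above.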

andrewhillman · 11y ago

I don't think so. I just went to their site and the first case study I opened was from theguardian.com. I went to theguardian.com and went to do a search. Guess who they are using to power their search function? Google.

In my opinion, which means nothing, sites need to figure out how to power their own search. Using a third party isn't going to work for most. Maybe people need to focus on building custom architecture that indexes the data in a more structured way, rather than cobbling systems together that ultimately hinder search efforts when it's time to get the user what they want. I don't know the answer, but somebody eventually will. Maybe WordPress will create a powerful search for all those WordPress sites.

Igglyboo · 11y ago

Elastic is more like a quick way to search through huge amounts of text, not a way to rank which hits are more relevant.

pjbrunet · 11y ago

At first Google was a search algorithm, but at some point they decided to have humans review and rank the important queries. Important as in query volume.

Why use humans? People can decide if your navigation is intuitive. They can decide if your page looks like crap. If 230,000 people are searching for "coconut oil" per month (actual numbers) then it's worth having an intern spend 15 minutes to make sure page 1 of "coconut oil" looks right.

Google can afford that. They need a human to decide if the "user experience" is actually good vs. disallowing the back button and forcing the browser to crash, which is how I suppose you could fake a "time on site" metric if this was just an algorithmic problem.

Google is now more like playing Zork. You type "Go North" like 10 million other people before you typed "Go North" and Google has already crafted the experience you'll find in the next room. (Which makes me wonder, do they score how boring you are based on predictability?) This is becoming more and more obvious over time, as a search for "calculator" shows you an actual calculator that a human at Google created. That's not an algorithmic response.

Similarly, I see that human touch coming more into play with voice recognition, Google Glass, Siri, etc. Call that "AI" or whatever. You ask Google a question and Google has already sculpted a slick answer based on tons of testing. That's how I see Google as a search engine now. Part of the crawling is interesting (recognizing objects in photos?) but I think human reviews of all the important websites and SERPs, that's harder for a competitor to reproduce.

Animats · 11y ago

>I think human reviews of all the important websites and SERPs, that's harder for a competitor to reproduce.

Google was forced into that by improved "search engine optimization". SEO used to be about things like keyword stuffing, but as Google made their search engine smarter, SEO companies made their search spamming smarter. There are now SEO operations using machine learning to reverse engineer Google's algorithms and then automatically spam just enough to stay under the threshold.

In 2010, Google tried using "local" data to improve search. That turned out to be extremely easy to spam. A classic example of this can be found by searching for "laptop repair bradford pa". This brings up "Illusory Laptop Repair", located in the middle of a railroad crossing. A SEO expert created that phony business listing to demonstrate how bad Google was at detecting such spam. Google still thinks it's real.

In 2012, Google tried using "social" data to improve search. That worked even worse. Fake Google accounts created to create fake "+1"s may have exceeded the number of real ones. Google "+1" are still for sale; the going rate is about $0.10 each.

Meanwhile, links aren't as useful as they used to be. Who creates a link to a retail outlet other than on social media any more? Google is trying all sorts of "signals", but in heavily spammed areas, they're not doing all that well.

Yandex has been trying search that doesn't weight links at all for some heavily spammed categories in the Moscow area. It seems to be working for fake real estate ads.

(We have a partial solution - find the real-world business behind the web site and check it out in hard data sources, such as Dun and Bradstreet or Experian, which have business credit data. See "sitetruth.com/doc".)

pjbrunet · 11y ago

I agree. And what are the implications? Wasn't Google sued by people unhappy with their rankings (Howard Stern?) and wasn't Google's defense that their algorithm was unbiased? (Not a rhetorical question, I never followed how that played out.) Once you introduce human reviewers, you're going to have more unhappy businesses. I'd rather a SEO spammer push me down (nothing personal) than know an actual Googler secretly decided my website wasn't good enough--a personal bias against me in particular. Who are the reviewers, what do we know about them, what are they looking for, not looking for, have they reviewed you personally, etc. I think Google may call it a "manual action" in Webmaster Tools, an ambiguous way of saying somebody at Google manipulated the algorithm against you. Do they have different levels of manual actions? Are manual actions always disclosed?

rjaco31 · 11y ago

>A classic example of this can be found by searching for "laptop repair bradford pa". This brings up "Illusory Laptop Repair", located in the middle of a railroad crossing. A SEO expert created that phony business listing to demonstrate how bad Google was at detecting such spam. Google still thinks it's real.

This doesn't seem to work.

angersock · 11y ago

For anyone interested, there's a hilariously bitter and practical paper on the trials and tribulations of building a search engine:

http://queue.acm.org/detail.cfm?id=988407

EDIT:

Article is clearly from an earlier era, but it's really cool to see how far we've come and how much more computing power we have available now. There are entire categories of problems that simply don't exist anymore.

Animats · 11y ago

She later designed the search engine of Cuil. While Cuil failed, it only cost them about $30 million to do most of what Google does.

It's surprising to me that there aren't search engines from Comcast, AT&T, and Apple. If you have customers, why give up all that ad revenue to Google? Google is paying some big players a lot of money not to do that. They were paying Apple $1 billion a year to be the default on Apple products. Apple switched from Google to Bing anyway.

desdiv · 11y ago

>While Cuil failed, it only cost them about $30 million to do most of what Google does.

They raised ~$30 million in two rounds, but their valuation was at $200 million by round two. I agree with your point though; the cost to develop a good search engine is dirt cheap compared to the value it brings.

Anthony-G · 11y ago

I really enjoyed that article. I read it over a year ago while I was doing Udacity’s [Intro to Computer Science](https://www.udacity.com/course/cs101) course where you learn to build a web crawler and implement a basic page ranking algorithm.

discardorama · 11y ago

Google's power comes not from the crawling, but from the retrieval and ranking. They use many more signals than the hyperlinks and anchor text (which is all you'd have if you crawled yourself). Indexing crawled content would have been OK in the year 2000; but today, the users demand more. Relevance is the top priority, and no one does it better than El Goog.

threeseed · 11y ago

Sorry but Google's ranking algorithm for me is far from brilliant.

To give you an example, search for "webhcat primary key" (without quotes) and note how the top three search results do not actually contain the term webhcat. Google constantly does this. It randomly ignores search terms unless you explicitly quote them.

I believe that there is still a market for a technical/advanced search engine.

jobposter1234 · 11y ago

Isn't Google doing that because it detected the semantic information was on the page, even if the exact term wasn't? Is your issue with the fact that they're doing more than just keyword retrieval, or with the fact that they're doing it poorly?

dredmorbius · 11y ago

Also on their dominance of the advertising market from which the value of crawling comes.

mjklin · 11y ago

I thought Wikimedia tried this once. Big announcement, then nothing. Is that code still available?

Arkanosis · 11y ago

That was Wikia, not Wikimedia, but yes, the code is still available:

- crawler: http://sourceforge.net/projects/grub/
- search engine: http://nutch.apache.org/

runarb · 11y ago

If my memory serves me correctly, only the client part of Grub is open source. Without the server part, one cannot use it to set up one’s own crawl.

runarb · 11y ago

It is dead. More info at https://en.wikipedia.org/wiki/Wikia_Search.

sparkzilla · 11y ago

The problem with algorithmic/scraper search methods is that they only work with existing data. For example, most Google searches give a list of websites on one side, and some data scraped from Wikipedia on the other. There is not much meaning there. That's because Google's algorithm cannot combine the results into something original, because that would require human creativity. As such, I see the rise of different kinds of search based on what humans create, rather than what computers can scrape. I wrote a (longish) blog post on this problem: http://newslines.org/blog/googles-black-hole/

minthd · 11y ago

You really discount the value of the long tail of search, which is where you get the best info from Google.

sparkzilla · 11y ago

You may get the best (more precise) info in the long tail, but I'm pretty sure that Google makes most of its money from the most popular searches.

mark_l_watson · 11y ago

A really nice idea.

I volunteered a bit early this year for Common Crawl (not much, just some Java and Clojure examples for fetching and using the new archive format).

Common Crawl already has many volunteers (and a professional management and technical staff) so it would seem like a good idea to merge some of the author's goals with the existing Common Crawl organization. Perhaps more frequent Common Crawl web fetches and also making the data available on Azure, Google Cloud, etc. would satisfy the author's desire to have more immediacy and have the data available from multiple sources.

mark_l_watson · 11y ago

some edits:

Most of the Common Crawl data is on Microsoft Azure, but not all of it.

The Common Crawl is a great resource that deserves attention from more companies and developers.

JDDunn9 · 11y ago

I've always wanted to experiment with my own search algorithm. Unfortunately, I think this is still out of the budget of average programmers. Just the hard drives to download 1.3 petabytes would cost six figures.[1][2]

[1] https://www.backblaze.com/petabytes-on-a-budget-how-to-build...

[2] https://www.backblaze.com/blog/why-now-is-the-time-for-backb...

smoyer · 11y ago

A couple thoughts:

1) I like the idea of human curation, but in combination with some sort of automated crawler (or other tool) that helps in the browser.

2) Why can't we also distribute the act of crawling, the maintenance of the index and the map-reduce (or other algorithm) that produces the data?

I've been thinking about architectures that would allow (in essence) a P2P search system. Would anyone be interested in talking about architectures to make this work? There are millions of computers on the web at any given time ... if it's built into the browser (or plugs in), you could have human input at the same time.

ryanthejuggler · 11y ago

This would be really cool to participate in, especially if it could be packaged à la Folding@Home/SETI@Home and widely distributed. I wonder if there's some clever method using crypto that can provably discourage bad actors if the network has certain properties (e.g. Bitcoin is nearly impossible to cheat unless one group owns >50% of the network).

swah · 11y ago

Maybe more people should start crawling and seeing what they can extract? I remember seeing DuckDuckGo Instant Answers and thinking what a valuable resource that would be (having a database like DDG must have, I mean).

Then one would be able to do some "stuff Google can do" - say, analysing trends - albeit with worse sampling, and not depend that much on them.

dmritard96 · 11y ago

Surprised not to see a mention of a Bloom filter in URL dedupe. Another tough problem now is the portion of the web that sits in walled gardens or is expensive to crawl (needs a JS context).
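
For anyone who hasn't used one, here is a minimal Python sketch of Bloom-filter URL dedupe. It is an illustration only, not what the article's crawler or Common Crawl does: the class, the `dedupe` helper, the double-hashing scheme over SHA-256, and the size parameters are all assumptions. The filter never misses a URL it has seen, but it can occasionally claim to have seen a new one, so a crawler using it may skip a small fraction of pages.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so strides vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe(urls, seen):
    """Yield each URL the filter has not (probably) seen before."""
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        yield url


if __name__ == "__main__":
    frontier = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/a",
    ]
    for url in dedupe(frontier, BloomFilter(num_bits=1 << 20, num_hashes=7)):
        print(url)
```

The appeal for a crawl is memory: the filter's size is fixed up front no matter how long the URLs are, at the price of a tunable false-positive rate.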

thewarrior · 11y ago

Hmmm I'd think that ChuckMcM would have some interesting views about this.

imranq · 11y ago

What about Algolia? HN uses it.

smartpants · 11y ago

Good Read
