Show HN: Advertising-Free Search-Engine

(deusu.org)

55 points · deusu · 11y ago · 59 comments

rileyteige · 11y ago · 5 in thread

I searched for "History of aviation" (no quotes) and got

1) History in the HeadlinesThe Secrets of Ancient Roman Concret

2) Watch History Topics Videos Online - History.com

3) Ohio — History.com Articles, Video, Pictures and Facts

as my top three results. It's an ambitious problem to tackle, but it has to be able to give me better relevance for me to consider using it.

EDIT: For a comparison, I ran the same query on Google:

1) History of aviation - Wikipedia, the free encyclopedia

2) History of Flight - NASA

3) history of flight | aviation | Encyclopedia Britannica

userbinator · 11y ago

I tried the same phrase, quoted, and received the exact same results.

The 6th result is the only result that actually has the phrase "history of aviation" highlighted. The other results appear to be a conjunction of pages containing "history", "of", and "aviation", in no particular order.

I understand the advanced ranking algorithms of Google are probably too much to expect of something like this, but "find all the pages containing this phrase" shouldn't be. (The query "find all the pages containing this phrase", quoted, produces http://en.wikipedia.org/wiki/Vladimir_Putin as its first result. What?)
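
The gap between "pages containing all the words" and "pages containing this phrase" can be shown with a positional inverted index. This is a minimal sketch, not DeuSu's actual implementation: the function names and the toy documents are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def and_query(index, query):
    """Docs containing every query term, in any order (a conjunction)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result

def phrase_query(index, query):
    """Docs containing the query terms adjacent and in order."""
    terms = query.lower().split()
    hits = set()
    # Only docs that contain all terms can contain the phrase.
    for doc_id in and_query(index, query):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[terms[i]][doc_id]
                   for i in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits

# Hypothetical toy corpus.
docs = {
    1: "history of aviation and flight",
    2: "aviation in the history of ohio",
    3: "ancient roman concrete",
}
index = build_positional_index(docs)
```

Here `and_query(index, "history of aviation")` matches documents 1 and 2, while `phrase_query` matches only document 1, which is the behavior a quoted query would be expected to have.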

deusu (OP) · 11y ago

Please keep in mind that this is running on one server. Google has what? Several hundred thousand servers?

Yes, it needs to get better. But it's a good start, I think.

fiatjaf · 11y ago

It is not a good start. It needs to get much better to be compared to Google, and it will not, if it is a one-guy work and you haven't discovered a new fantabulous almost magical new algorithm for crawling and indexing.

Keep in mind that Google is not a propaganda machine, it really really _really_ solves problems and finds things. Any search engine willing to compete will have to do this.

I'm not saying it is impossible, but that it is not sufficient to be "advertising-free" to be good.

rileyteige · 11y ago

In no way do I intend to discourage your efforts. I was offering an example in which your search engine seems unable to identify the context that I had in mind.

What are some of your plans for search engine improvement? Ad-free aside, why should I fund your project? DuckDuckGo seems to offer itself up as a great alternative to Google and I have enjoyed its clean UI immensely for several months now. What is my incentive to help drive your project?

ChuckMcM · 11y ago

It's a fun way to look at what might be involved in writing a search engine and to start experiencing the myriad of ways in which people "search" and what they expect when they do.

I encourage you to continue in your efforts, and I will add some suggestions on how you might proceed which will get you further along.

First, in order to scale, it will have to be able to run on 'n' servers. So look at ways of breaking up your index across multiple machines and operating on the index in parallel.

Second, crawling can often be a bigger challenge than providing search results, so work on building a system that can crawl a bunch of different web sites. Folks like Octopart have shown that topical search engines are really useful. So consider putting together a billion document index of say "news sources".

In the 'everything that is old is new again' theme, consider building simply a 'blog search engine', one that focuses on various blog content. The old Technorati did that and later was subsumed, and it's become harder for blog writers to rank in Google's results, so perhaps there is an opening for a place to search about who is talking about 'x' for some X.

I happen to know you can run a pretty decent crawler and indexer for about $2M a year :-) so consider targeting your donation rate to hit about $150,000 a month.

orbifold · 11y ago · 2 in thread

I have the suspicion that Google user tracking is not just for better advertising. Their original page rank algorithm is a random walk on the web graph, every user similarly performs some walk on this graph, in principle it should be possible to use that walk to improve the relevance ranking. This is just one example, there are many others how user interaction could be used as feedback into the search engine.
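
The "random walk on the web graph" idea can be sketched with the textbook power-iteration form of PageRank. This is an illustrative sketch only: the damping factor of 0.85 is the commonly quoted default, the link graph is a hypothetical toy, and nothing here claims to be what Google runs today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the stationary distribution of a random surfer.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
```

In this toy graph page `c` ends up with the highest rank because most links point to it. Observed user walks could in principle replace or reweight the uniform "follow a random link" step, which is the kind of feedback the comment speculates about.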

ukandy · 11y ago

It does reduce your view of what's out there to some extent, giving seemingly undue bias to certain services or opinions.

on_and_off · 11y ago

Google does indeed use my data in order to personalize search results. I remember being impressed to see that for a query that could have several different meanings, the search engine gave me results centered on my engineering field, which was what I was looking for. I am not that convinced that it blindfolds you to some services or opinions though. So far I have only seen differences in understanding the context of the query, which I don't see as an 'undue bias' but an added value (especially if you read the anecdote about the 'history of aviation' query in this thread).

Some other thoughts on the bubble : http://www.blindfiveyearold.com/the-preference-bubble .

huhtenberg · 11y ago · 2 in thread

I applaud the effort, but the search results quality is really poor. It's basically an Altavista style laundry list of anything remotely relevant in no particular order.

That said, do you do your own crawling or do you source results from someone?

deusu (OP) · 11y ago

I do the crawling myself. There are currently about 320 million pages in the search-index.

frik · 11y ago

How many GB of data are 320 million (HTML) pages? How long does it take to refresh the index (with a single crawler on a 1GB/s connection)?
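
The thread does not give an answer, but the shape of the estimate is simple arithmetic. In the sketch below the average page size of 50 KB is purely an assumption for illustration (the real figure is not stated anywhere in the thread), and the link speed is taken literally as 1 gigabyte per second; politeness delays, DNS, and server latency are ignored, so the result is only a lower bound on transfer time.

```python
def crawl_estimate(pages, avg_page_bytes, link_bytes_per_sec):
    """Return (total size in GB, hours needed at full line rate)."""
    total_bytes = pages * avg_page_bytes
    seconds = total_bytes / link_bytes_per_sec
    return total_bytes / 1e9, seconds / 3600

# 320 million pages is from the thread; 50,000 bytes per page is an assumption.
total_gb, hours = crawl_estimate(320_000_000, 50_000, 1e9)
print(f"{total_gb:,.0f} GB, {hours:.1f} hours at line rate")
```

Under those assumptions the raw HTML would be about 16,000 GB and take roughly 4.4 hours to transfer; changing the assumed page size scales both numbers linearly.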

diegolo · 11y ago · 2 in thread

Why reimplement a Search Engine from scratch instead of contributing to lucene/solr/elastic search? did you add something new?

deusu (OP) · 11y ago

I started writing the software a LONG time ago. As far as I know Solr/Elastic Search didn't exist back then.

Query logs are preserved, but they do NOT contain the IP address of the user. So all I know is that someone of the billions of Internet users searched for something embarrassing. :)

diegolo · 11y ago

btw from the website is not clear if query logs are preserved and/or shared with third parties.

concerto · 11y ago · 1 in thread

I think relevance is the thing to work on. I did a search in a space I know a lot about and the majority of first page results were domains for sale. The company I have an interest in, which is the first result for certain keywords on Google and Bing, wasn't to be found anywhere in the first 4 pages on here. It would be interesting to hear a bit more about your plans for moving this forward.

I have a few questions it would be great to hear about either here or in a follow-up blog post:

* Is your reimplementation in JS a translation of your current Pascal implementation or a redevelopment?

* Why have you decided to move from Pascal to JS rather than spending the effort improving the current implementation?

* What lessons have you learned from your current implementation that you are attempting to overcome with your new version in JS?

deusu (OP) · 11y ago

The rewrite in JavaScript will be a mixture of porting and redesign. Some parts of the Pascal-sources are relatively new and can almost be translated as is. But the older the Pascal-source is, the more a redesign is a better idea.

The old parts of the sources are in VERY bad shape. This project started more than 15 years ago on a much, MUCH smaller scale (about 2 million pages compared to 320 million now). Some design-choices simply aren't valid anymore. Also the use of JS has the advantage that it will make the project attractive to a lot more programmers than staying with Pascal could achieve.

The current implementation keeps the search-index as one big thing. I will split that up into many smaller pieces. That way it can run multi-threaded or even on multiple servers. Currently queries are running single-threaded.

Another shortcoming is that building an index involves several manual steps. I want to automate that.
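
Splitting one big index into smaller pieces that are queried in parallel can be sketched as follows. This is a hedged illustration, not the planned DeuSu design: the `Shard` and `ShardedIndex` classes, routing by `doc_id % num_shards`, and the precomputed per-document score are all assumptions. Threads stand in for separate processes or servers here; in CPython they would not give real CPU parallelism.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """One slice of the index: term -> list of (score, doc_id)."""

    def __init__(self):
        self.postings = {}

    def add(self, doc_id, text, score):
        for term in set(text.lower().split()):
            self.postings.setdefault(term, []).append((score, doc_id))

    def search(self, term, k):
        # Each shard returns only its own top k hits.
        return heapq.nlargest(k, self.postings.get(term.lower(), []))

class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text, score):
        # Route by document id so each document lives in exactly one shard.
        self.shards[doc_id % len(self.shards)].add(doc_id, text, score)

    def search(self, term, k=10):
        # Query every shard concurrently, then merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = pool.map(lambda shard: shard.search(term, k), self.shards)
            merged = [hit for part in partials for hit in part]
        return heapq.nlargest(k, merged)

# Hypothetical usage.
idx = ShardedIndex(3)
idx.add(1, "history of aviation", 0.9)
idx.add(2, "aviation museum", 0.5)
idx.add(3, "roman concrete", 0.7)
idx.add(4, "aviation news", 0.8)
top = idx.search("aviation", k=2)
```

Because every shard returns its own top k, merging the partial lists and taking the top k again gives the same answer a single big index would, while each shard only scans a fraction of the postings.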

frik · 11y ago · 1 in thread

It reminds me of a similar open search engine with good search results: http://www.gigablast.com, https://news.ycombinator.com/item?id=6152839

Ten years ago, the EU sponsored Exalead to become a Google competitor. Nowadays the company is owned by Dassault Systèmes (3D CAD Catia): https://www.exalead.com/search/ , http://www.heise.de/newsticker/meldung/Quaero-Erster-Vorlaeu...

Others mentioned DuckDuckGo, but one has to differentiate. DDG uses the Yahoo search API, and Yahoo itself uses the Bing search from Microsoft. DDG parses the query string and tries to add some snippets from Wikipedia/Yelp/etc. There are only 4 big search engines left: Google, Bing, Yandex, Baidu. There are some minor ones like Gigablast, Exalead and now Deusu. And there are meta search engines like DDG and Yahoo.

@Deusu dev: Good luck with the refactoring from Pascal to JavaScript, sounds like a good idea! Do you use a page-rank or how else do you score?

deusu (OP) · 11y ago

No pagerank. But I use backlink-counts and Alexa.com data. Also position of keywords in url, title and snippet. Length of url and number of elements in the url are also a part.
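
A score built from those kinds of signals might look like the sketch below. Every weight and formula is a made-up assumption for illustration; the comment names the signals but not how they are combined, and the Alexa data is left out because its format is not described.

```python
import math
from urllib.parse import urlparse

def position_score(text, keyword):
    """1.0 if the keyword starts the text, lower the later it appears, 0.0 if absent."""
    pos = text.lower().find(keyword.lower())
    if pos < 0:
        return 0.0
    return 1.0 - pos / max(len(text), 1)

def score_page(keyword, url, title, snippet, backlinks):
    """Combine the signals with hypothetical weights; higher is better."""
    path_elements = [p for p in urlparse(url).path.split("/") if p]
    score = 0.0
    score += 3.0 * position_score(title, keyword)    # keyword position in title
    score += 2.0 * position_score(url, keyword)      # keyword position in URL
    score += 1.0 * position_score(snippet, keyword)  # keyword position in snippet
    score += 1.0 * math.log10(1 + backlinks)         # backlink count, dampened
    score -= 0.01 * len(url)                         # shorter URLs preferred
    score -= 0.2 * len(path_elements)                # fewer URL elements preferred
    return score

# Hypothetical pages with the same backlink count.
a = score_page("aviation", "http://example.org/aviation",
               "Aviation history", "A history of aviation", 100)
b = score_page("aviation", "http://example.org/a/b/c/d/page.html",
               "Ohio facts", "Ohio has aviation history", 100)
```

With equal backlinks, page `a` outscores page `b` because the keyword appears early in its title and URL and the URL is short and shallow. Signals like these say nothing about whether the words form the phrase the user typed, which may explain some of the results reported elsewhere in the thread.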

Animats · 11y ago · 1 in thread

A user-supported search engine has promise. A telco, a handset maker, or a country could do an ad-free search engine. Cuil had about 30 people, and eventually got a half-decent search engine. It's surprising that Apple doesn't have their own search engine. Their negative experience with the map business may have scared them.

The economics of search are strange. Search has negative market value - Google paid about $100M/year to be the default on Firefox, and Google still pays to be on iPhones. It's like the ad channels on cable. The Jewelry Channel pays to be carried; the NFL gets paid to be carried. Google is in the Jewelry Channel position.

As compute power increases, the cost of doing search goes down. But that hasn't been reflected in the "selling price" of search, which is expressed as ad density. Google is a high-margin business because of that. That makes them vulnerable.

DuckDuckGo and Blekko, although tiny, are quite profitable. Even InfoSeek and Ask apparently still make money. There's room at the bottom.

datacog · 11y ago

> Even InfoSeek and Ask apparently still make money

Ask is more of an 'ads search engine', and they make money mainly because of their arbitrage model

jasonkostempski · 11y ago · 1 in thread

I searched for nothing but got suggestions for things I've searched for on Google this week. Since, for some, tracking is just as big a concern as ads, can you explain why and how that is happening? Does DeuSu now know those things or anything else about me even if that data is anonymized? If so, will that data be sold?

deusu (OP) · 11y ago

That is probably your browser keeping track of what you entered into input-fields that are named "q" for "query".

And no, this data does NOT get transmitted. So DeuSu does not know about what you searched-for on Google.

empressplay · 11y ago · 1 in thread

I wonder where the raw crawler data is coming from? Current events-related stuff is months (years?) out of date. For example it seems to think Tony Abbott is still the Australian opposition leader.

Did you snarf an old database from another search engine? Just curious.

deusu (OP) · 11y ago

The main crawl is several months old. There is a separate news-crawler which checks a few dozen news sites every 10 minutes or so.

ilaksh · 11y ago · 1 in thread

I believe we will probably eventually transition to some kind of distributed peer-to-peer semantic-ish system rather than having a giant company crawl plain text pages and control the vast majority of global advertising and quite a bit of its data.

barryhunter · 11y ago

Seems to be the premise of http://www.majestic12.co.uk/ - looks like it might be failing.

supercoder · 11y ago · 1 in thread

https://deusu.org/query?q=deusu

Does not manage to return deusu as the first result, or in any of the results for that matter.

deusu (OP) · 11y ago

And why should it? You don't need DeuSu to find itself. You are already on the site. :)

wbsun · 11y ago · 1 in thread

I appreciate such work that people are working hard on, but how can this last long? Google started out ad-free too. A search engine will need money to run anyway. Have you figured out a better idea to get money without ads?

afshin · 11y ago

Perhaps you would know the answer to that if you had actually visited the link before commenting about it.

DanBC · 11y ago

I searched for my name and none of the first ten results was about me, or about any of the other people called "Dan Beale". But nicely it didn't return a bunch of filth based on my other surname, "Cocks", so that was nice.

It's a shame that a couple of replies are so harshly negative. Perhaps this submission would have been better received if it had been a blog post about how you created your engine; how it works; problems you have with it; and so on.

Search is not --despite the Google behemoth-- a solved problem so there's still space for creative thinking.

EG the search on a manufacturer's website is often hopeless. You end up with a list of 8,000 widgets and need to just scroll through them page by page. Amazon search is bafflingly poor. Ebay search has some sub-optimalities. (People can list "case for mp3 player £1.99" and "mp3 player £35" in the same listing, so a search and sort by price will sort by the cheaper case price.)

ignoramous · 11y ago

Query: that movie where jim carrey plays dad to three black men

DeuSu: https://deusu.org/query?q=that+movie+where+jim+carrey+plays+...

Google: https://www.google.co.in/search?q=that+movie+where+jim+carre...

For me, queries like these are why Google wins. NB: DuckDuckGo and Bing seem to do fine on this query as well. Bing does better than Google[0] when you omit words like 'three' or half-spell words like 'carrey' as 'car'. Surprising, since it has often been the case that Google is better at returning/ranking results for most of the programming-related queries I try on a daily basis [1].

[0] Probably, because Google takes into account the fact that the user ignored the type-ahead suggestion (car -> carrey) for a reason and omits all 'jim carrey' related results from the list.

[1] http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGood...

qzcx · 11y ago

I searched for "XPS 13" and I got several pages in German.

bdcravens · 11y ago

I searched for "fedex refunds". On Google you get what you expect on the first page: some links to Fedex's site, and a list of Fedex shipping auditors. On Deusu.org, the first non-Fedex page is Cosco, and not a single shipping auditor. The rest of the links are just as irrelevant.

leereeves · 11y ago

Needs some work to improve both relevance and ranking algorithms. For example:

I'm watching the movie JFK, so I tried a search for Jack Ruby.

I found a teacher named Jack Ruby's videos, the Ruby Fortune online casino, the bar Ruby Tuesday, and a lot of other irrelevant results.

Nothing about the famous Jack Ruby in the first four pages.

austenallred · 11y ago

Searched for my company, and the meta information is years old. Tried a few other searches, and the results weren't even related. Sorry, but for me it's not even close to worth switching for.

supercoder · 11y ago

Searched 'Google' got:

http://www.google.com/chromeframe/?redirect=true

As first result

mdturnerphys · 11y ago

Tried searching for "BYU" and "UW" and got everything except the universities' main websites. Do you just not crawl .edu addresses?

zenincognito · 11y ago

Advertising free means it would not be able to support itself if it does not show value instantaneously. No value proposition here.

hudell · 11y ago

I searched for my game (Orange Season) and the first result was Vladimir Putin
