1) History in the Headlines: The Secrets of Ancient Roman Concrete
2) Watch History Topics Videos Online - History.com
3) Ohio — History.com Articles, Video, Pictures and Facts
as my top three results. It's an ambitious problem to tackle, but it has to be able to give me better relevance for me to consider using it.
EDIT: For a comparison, I ran the same query on Google:
1) History of aviation - Wikipedia, the free encyclopedia
2) History of Flight - NASA
3) history of flight | aviation | Encyclopedia Britannica
The 6th result is the only result that actually has the phrase "history of aviation" highlighted. The other results appear to be a conjunction of pages containing "history", "of", and "aviation", in no particular order.
I understand the advanced ranking algorithms of Google are probably too much to expect of something like this, but "find all the pages containing this phrase" shouldn't be. (The query "find all the pages containing this phrase", quoted, produces http://en.wikipedia.org/wiki/Vladimir_Putin as its first result. What?)
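For what it's worth, exact-phrase matching is usually done with a positional index. Here is a minimal sketch in Python, assuming naive whitespace tokenization and a tiny in-memory corpus; the names `build_positional_index`, `phrase_search`, and `DOCS` are made up for illustration and say nothing about how DeuSu or Google actually work.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def build_positional_index(docs):
    # term -> doc_id -> list of positions where the term occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids that contain the terms of `phrase` consecutively."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: plain conjunction, i.e. docs that contain every term somewhere.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    # Step 2: keep only docs where the terms sit at adjacent positions.
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + i + 1 in positions for i, positions in enumerate(later)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical three-page corpus.
DOCS = {
    "a": "the history of aviation begins with kites",
    "b": "aviation has a long history of innovation",
    "c": "a short history of ohio",
}
INDEX = build_positional_index(DOCS)
```

Step 1 alone gives the "conjunction in no particular order" behaviour described above (both "a" and "b" contain all three words); step 2 is what narrows `phrase_search(INDEX, "history of aviation")` down to page "a".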
Yes, it needs to get better. But it's a good start, I think.
Keep in mind that Google is not a propaganda machine, it really really _really_ solves problems and finds things. Any search engine willing to compete will have to do this.
I'm not saying it is impossible, but that it is not sufficient to be "advertising-free" to be good.
What are some of your plans for search engine improvement? Ad-free aside, why should I fund your project? DuckDuckGo seems to offer itself up as a great alternative to Google and I have enjoyed its clean UI immensely for several months now. What is my incentive to help drive your project?
I encourage you to continue in your efforts, and I will add some suggestions on how you might proceed which will get you further along.
First, in order to scale, it will have to be able to run on 'n' servers. So look at ways of breaking up your index across multiple machines and operating on the index in parallel.
Second, crawling can often be a bigger challenge than providing search results, so work on building a system that can crawl a bunch of different web sites. Folks like Octopart have shown that topical search engines are really useful. So consider putting together a billion document index of say "news sources".
In the 'everything that is old is new again' theme, consider building simply a 'blog search engine', one that focuses on blog content. The old Technorati did that and was later subsumed, and it's become harder for blog writers to rank in Google's results, so perhaps there is an opening for a place to search for who is talking about 'x' for some X.
I happen to know you can run a pretty decent crawler and indexer for about $2M a year :-) so consider targeting your donation rate to hit about $150,000 a month.
Some other thoughts on the bubble : http://www.blindfiveyearold.com/the-preference-bubble .
That said, do you do your own crawling or do you source results from someone?
Query logs are preserved, but they do NOT contain the IP address of the user. So all I know is that one of the billions of Internet users searched for something embarrassing. :)
I have a few questions it would be great to hear about either here or in a follow-up blog post:
* Is your reimplementation in JS a translation of your current Pascal code or a redevelopment?
* Why have you decided to move from Pascal to JS rather than spending the effort improving the current implementation?
* What lessons have you learned from your current implementation that you are attempting to overcome with your new version in JS?
The old parts of the sources are in VERY bad shape. This project started more than 15 years ago on a much, MUCH smaller scale (about 2 million pages compared to 320 million now). Some design-choices simply aren't valid anymore. Also the use of JS has the advantage that it will make the project attractive to a lot more programmers than staying with Pascal could achieve.
The current implementation keeps the search-index as one big thing. I will split that up into many smaller pieces. That way it can run multi-threaded or even on multiple servers. Currently queries are running single-threaded.
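To illustrate the idea of splitting one big index into smaller pieces, here is a rough Python sketch of a document-partitioned index that queries its pieces in parallel. Everything in it (`Shard`, `ShardedIndex`, hashing the document id with CRC32 to pick a shard) is an assumption made for the example, not a description of the actual DeuSu code.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One small, independent piece of the index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, terms):
        # AND query: docs in this shard that contain every term.
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
            if not result:
                return set()
        return result


class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Each document lives in exactly one shard, chosen by a stable hash,
        # so every shard can answer a query on its own.
        shard_no = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[shard_no].add(doc_id, text)

    def search(self, query):
        terms = query.lower().split()
        # Fan the query out to all shards, then merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(lambda shard: shard.search(terms), self.shards))
        merged = set()
        for partial in partials:
            merged |= partial
        return sorted(merged)


# Hypothetical pages, just to exercise the sketch.
PAGES = {
    "wiki/aviation": "history of aviation",
    "nasa/flight": "history of flight",
    "history.com/ohio": "ohio history articles",
}
sharded = ShardedIndex(num_shards=4)
for page_id, page_text in PAGES.items():
    sharded.add(page_id, page_text)
```

The threads here only show the structure. In CPython they will not speed up CPU-bound lookups much, but because each shard is self-contained, the same fan-out and merge works when shards sit in separate processes or on separate servers.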
Another shortcoming is that building an index involves several manual steps. I want to automate that.
Ten years ago, the EU sponsored Exalead to become a Google competitor. Nowadays the company is owned by Dassault Systèmes (3D CAD Catia): https://www.exalead.com/search/ , http://www.heise.de/newsticker/meldung/Quaero-Erster-Vorlaeu...
Others mentioned DuckDuckGo, but I have to differ. DDG uses the Yahoo search API, and Yahoo itself uses the Bing search from Microsoft. DDG parses the query string and tries to add some snippets from Wikipedia/Yelp/etc. There are only 4 big search engines left: Google, Bing, Yandex, Baidu. There are some minor ones like Gigablast, Exalead and now DeuSu. And there are meta search engines like DDG and Yahoo.
@Deusu dev: Good luck with the refactoring from Pascal to JavaScript, sounds like a good idea! Do you use PageRank, or how else do you score?
The economics of search are strange. Search has negative market value - Google paid about $100M/year to be the default on Firefox, and Google still pays to be on iPhones. It's like the ad channels on cable. The Jewelry Channel pays to be carried; the NFL gets paid to be carried. Google is in the Jewelry Channel position.
As compute power increases, the cost of doing search goes down. But that hasn't been reflected in the "selling price" of search, which is expressed as ad density. Google is a high-margin business because of that. That makes them vulnerable.
DuckDuckGo and Blekko, although tiny, are quite profitable. Even InfoSeek and Ask apparently still make money. There's room at the bottom.
> Ask is more of an 'ads search engine', and they make money mainly because of their arbitrage model
And no, this data does NOT get transmitted. So DeuSu does not know what you searched for on Google.
Did you snarf an old database from another search engine? Just curious.
Does not manage to return deusu as the first result, or in any of the results for that matter.
It's a shame that a couple of replies are so harshly negative. Perhaps this submission would have been better received if it had been a blog post about how you created your engine; how it works; problems you have with it; and so on.
Search is not --despite the Google behemoth-- a solved problem so there's still space for creative thinking.
E.g. the search on a manufacturer's website is often hopeless. You end up with a list of 8,000 widgets and need to just scroll through them page by page. Amazon search is bafflingly poor. eBay search has some sub-optimalities. (People can list "case for mp3 player £1.99" and "mp3 player £35" in the same listing, so a search and sort by price will sort by the cheaper case price.)
DeuSu: https://deusu.org/query?q=that+movie+where+jim+carrey+plays+...
Google: https://www.google.co.in/search?q=that+movie+where+jim+carre...
For me, queries like these are why Google wins. NB: DuckDuckGo and Bing seem to do fine on this query as well. Bing does better than Google[0] when you omit words like 'three' or half-spell words like 'carrey' as 'car'. Surprising, since it has often been the case that Google is better at returning/ranking results for most of the programming-related queries I try on a daily basis [1].
[0] Probably, because Google takes into account the fact that the user ignored the type-ahead suggestion (car -> carrey) for a reason and omits all 'jim carrey' related results from the list.
[1] http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGood...
I'm watching the movie JFK, so I tried a search for Jack Ruby.
I found videos by a teacher named Jack Ruby, the Ruby Fortune online casino, the bar Ruby Tuesday, and a lot of other irrelevant results.
Nothing about the famous Jack Ruby in the first four pages.
http://www.google.com/chromeframe/?redirect=true
As first result