A New Search Engine (opens in new tab)

(0x65.dev)

173 pointschrmod6y ago102 comments

102 comments

They talk about using query logs to optimize their search results:

>Queries performed by people, if associated to a web page, serve as even cleaner summaries than anchor text. This is because all the logic put in place by the search engine, who resolved the query with a list of web pages, and all human understanding and experience that led one to select the best page from the offered result list end up embedded in the association <query, url>.

This would seem to present a "rich get richer" problem where the oldest links that have the largest click-through tend to float to the top making it difficult for a new result that may be "better" to appear high in the search results. Anyone know how search engines tackle this problem?

bryanrasmussen6y ago

If I had a big enough user base I would give some users non optimized queries and check if the choices still matched up to previous choices, if they did I would increase rank on previous choices, if not I would start to down prioritize and increase the choices that were chosen. An ongoing test of is result X for search term y still best?

toast06y ago

Going strictly by volume of clicks (or volume of clicks for a keyword) is not going to be fair or particularly useful, you need to compare clicks received to clicks expected.

If it's shown in position one, it should be expected to get more clicks than position ten. If something gets more than expected, move it up.

solso6y ago

[Disclaimer, I work at Cliqz]

Your point is spot on. Old pages tend to have more association to seen queries, which does not play in favor for new pages.

That said, however, there are a couple of things to consider: 1) seen queries is not the only way to create queries, we are pretty good creating synthetic queries based on the content, descriptions, etc. This queries are more noisy that the seen queries of course, but good enough. And 2)novelty, freshness and popularity are very important features on the ranking. Feel free to try out any new topic you might think of on https://beta.cliqz.com, you will see that is not only "stale" content.

mgreg6y ago

Thank you for the additional detail and this certainly appears to be a challenging problem.

It is still a little fuzzy to me. What is a "synthetic query"? Is this basically generating queries that would match the content (i.e. essentially reversing the process)?

Novelty, freshness are interesting but can lead back to the noise problem mentioned in the blog. If many pages are created that may match the query (e.g. "best new movies") many young pages will match this. Popularity would be useful but difficult to establish and then there's the clickbait and other gaming problems.

kick6y ago

Why does Cliqz use an analytics domain with a typo in it to get around user tracker-blockers? That's incredibly scummy, given how much Cliqz has been shouting about privacy.

https://anolysis.privacy.cliqz.com/

__ka6y ago

Anolysis stands for Ano[nymized] [Ana]lysis. It's a new approach to do telemetry without sending unique identifiers (like most analytics / telemetry) systems do - but focus on goal attainment at a group level. This makes work harder of course, but it's a price we've been willing to pay. It is a pity you would take a domain name as evidence of malice. We should have a paper coming up at some point on the approach.

1 more reply

JadoJodo6y ago

Seems like an excellent opportunity to apply Hanlon's Razor.

"Never attribute to malice that which can be adequately explained by stupidity."

1 more reply

ben5096y ago

How does using an odd spelling thwart user-tracker blockers?

1 more reply

MayeulC6y ago

Wouldn't a multi-armed bandit help alleviate this issue? (Basically, randomly display a few other links, and use bayesian stats to figure out if the new links are more optimal).

That, or any kind of exploration/optimization algorithm, to be honest.

mc9876y ago

Surfacing new content in search engines is a very challenging problem. I am guessing they use a combination of social signals (twitter, facebook) popularity and domain popularity amongst other signals.

pheug6y ago

Google pretty much knows (or can accurately estimate) exactly when a new document appears on the web and how many people are visiting it, they don't even need to rely on second hand social signals for this. They control the web's dominant crawler (Googlebot), browser (Chrome, which sends everything you type in the address bar to them by default), ads (Adsense) and tracking (Google Analytics) platforms.

netankit6y ago

[Disclaimer: I work at Cliqz]

You are on point. Recency is a challenging problem in multiple ways for search engines. Not just limited to discovering new content, but also how does one index it? How does one balance out when you have for the same query "very new", "new", "slightly old" and "really old" results during ranking. This involves both news as well as new webpages surfacing on the web.

On top of this, we have to remember that this is a fully autonomous real time system which requires solving some of the most difficult engineering challenges at scale and at the same time being mindful of the latency and quality constraints.

At the end of the day, it's all about the final user experience that we ship. We are very much mindful of the same. We will be publishing more details about Cliqz search, on our blog https://0x65.dev/ in the coming days, so stay tuned.

alexis_fr6y ago

GPS tracking must also help Google a lot to determine mortar shop popularity.

stanislavb6y ago

Yes, that's an interesting problem to solve. Maybe some A/B testing and showing new results to a percentage of all searches.

lawrenceyan6y ago

> Money and Resources : We have been lucky enough to have fantastic investors, who fund and help us in our journey.

Just as an FYI, this company Cliqz is owned by Hubert Burda Media, a large media conglomerate based in Germany.

That doesn't necessarily inherently mean anything negative, but it's important to understand the potential underlying incentives given their marketing as such a strongly privacy oriented service.

solso6y ago

An excerpt from the 1st post of this series: "Why would a team be motivated to build another search engine? Why would Hubert Burda Media finance this over several years (they continued to back us especially in times when things got tough)?" https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-w...

hobofan6y ago

Yes, they mention it, which is a good move regarding transparency, but they still don't answer the question as to why the finance them.

They mostly push a narrative of privacy and censorship, when in the end the answer is probably close to "we want a piece of the pie" or "we want to be that monopoly".

solso6y ago

[Disclaimer, I work at Cliqz] I cannot answer for the "true" motivation of the investors, but their pitch and actions so far are well align with the fight against monopolies narrative. Do they want to get return on investment (eventually)? I would assume so, and I believe it would be fair. I do not see them as mutually exclusive. Of course, this is my personal opinion.

1 more reply

mfer6y ago

Isn’t Google basically an add company (if you go by their revenue). A search engine provided by a media company or ad company. I’m curious what people think the debate is between these.

MR4D6y ago

Differing biases.

Choice is good.

stanislavb6y ago

I don't know how many of you tried the engine, but there are 2 features that instantly took my attention:

1) Trackers Stats. Essentially, you can see how many and what trackers there are on the page you are about to visit. Before visiting it.

2) Page previews (I'm not sure about whether I like that)

__ka6y ago

> 1) Trackers Stats.

This feature is powered by another project we run, where we measure the tracking landscape in the web (most popular domains): https://whotracks.me. Details on how that works can be found in our paper [0]. Also - we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like smth like this.

> 2) Page previews (I'm not sure about whether I like that)

At the moment it's only a placeholder for a lengthier title and description (if available), but we are planning to use the space for rendering a short summary of the content/media in that site + similar sites in terms of content (query-relevant of course). This is more work in progress as we want to make sure content creators are on board. Again: would love to hear your thoughts on that.

Disclaimer: I work at Cliqz.

[0] WhoTracks .Me: Shedding light on the opaque world of online tracking - https://arxiv.org/abs/1804.08959

martinlaz6y ago

> we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like smth like this.

That's definitely the right way to go. I would also very much appreciate an option not to show in the result list sites/pages having any trackers.

trisch6y ago

I'm afraid this will remove any results from page :-D

1 more reply

rabidrat6y ago

What I really want, is not another search engine for contextless queries. Except for really basic queries (which Google/etc already do a good job at), I'm trying to answer a question, perhaps open-ended, and it will take multiple queries to resolve. And it's not a linear process of narrowing down with + or - keywords. It's establishing a context: I'm searching for something relevant to "go" the language, not "go" the english verb or "go" the board game.

I want to be able to open a jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts, from which I can establish the gross context for my entire search. After this first click, none of the results should be about the board game or the english word--unless the relevant search results happen to include an implementation of Go the board game in Go the language.

And then when I'm done, I want to name and archive that notebook so I can return to it at a later date--whether to refresh my memory of the ultimate answer, or to continue the search.

I guess I would call this a 'research engine' instead of a search engine.

solso6y ago

[Disclaimer: I work at Cliqz]

I hate to answer this one, becasue it looks too much marketing-speech but this feature exists. Not on beta.cliqz.com but on the drop-down search on Cliqz browser.

Based on the tabs you have opened, different query expansions are selected. For instance as you type "hotel in ma..." probably would show you results for Mallorca, but if you have "Madrid" on a tab, then it will show results for "hotel in madrid".

There will be a blog post about this contextual search because it's our showcase that is possible to do personalization without compromising privacy. All this is done privately, the browser receives results for multiple expansions and can chose which one to display based on local information. We never track or collect sessions of users.

amelius6y ago

Ok, but can't you then figure out what the browser was displaying using Javascript?

solso6y ago

Not sure I get your point. But contextual search only works for the search within Cliqz browser, on the address bar dropdown, on the client space. The same approach cannot be done on the (web-page SERP page, beta.cliqz.com), because from a web we have no access to the tabs opened. It could only possible via tracking and user-profiling, which is something that we do not do, or want to do.

whatshisface6y ago

They could, but then everyone would see them do it, and kind of the whole point is that they won't do it.

1 more reply

HenriNext6y ago

I had very similar idea a few years ago, did some quick numbers and came to conclusion that it was not going to fly commercially.

Interestingly the working name for my idea was also "ResearchEngine", so i guess it summarizes pretty well the unmet need you and me have.

basq6y ago

Agreed. tried replicating this with org mode, which required an extensive learning curve to meet the needs akin to what you listed. Although at some level it did the job, I could imagine a tool that does it better.

bsder6y ago

> I want to be able to open a jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts

AltaVista used to do this. I miss it terribly.

nl6y ago

I think you are mis-remembering this. I used Altavista heavily until Google came out and made search work, and I don't remember any feature anything like this.

bsder6y ago

AltaVista used to have a Java Applet that would draw the clusters. So "python" would get you a cluster that was "reptiles" and one that was "programming". You could then click on that cluster and it would "zoom in" on the cluster and then redraw the probability clusters again.

whalesalad6y ago

I believe this is called faceted search.

rabidrat6y ago

For the first level context, you're right. Does faceted search also allow the ability to specify "near this link"? Like a "warmer" or "colder" approach to searching, where I can train the engine during the search?

I also want to maintain a longer-term list of pages/sites to exclude. Like, unless otherwise specified, I never want results from w3schools or expertsexchange.

This field is ripe for disruption, in my opinion. We can do so much better, but I've not seen any serious attempts.

nostromo6y ago

The nationalism on the homepage is a little odd... in particular since they're still essentially building on top of Google.

https://cliqz.com/en/

> Europe has failed to build its own digital infrastructure. US companies such as Google have thus been able to secure supremacy. They ruthlessly exploit our data. They skim off all profits. They impose their rules on us. In order not to become completely dependent and end up as a digital colony, we Europeans must now build our own independent infrastructure as the foundation for a sovereign future. A future in which our values apply. A future in which Europe receives a fair share of the added value. A future in which we all have sovereign control over our data and our digital lives. And this is exactly why we at Cliqz are developing digital key technologies made in Germany.

autoexec6y ago

There's been a lot of discussion about how US-centric the internet is in general even discounting how many massively popular internet companies are US based. I don't think it's unreasonable for Europeans or other nations to try to be less dependent on the US and US based services. As an American, I think it's the smartest thing they can do and I welcome it.

Semiapies6y ago

As far as that goes, I agree.

I have an uncomfortable feeling, however that this is when the walls really start going up in the internet, beyond just the dictatorships.

yellowapple6y ago

I also welcome it, as an American. Hopefully it means more competition and therefore more motivation for US-based products to improve rather than stagnate.

TurkishPoptart6y ago

Well, I think they're right (American here).

ben5096y ago

> The nationalism on the homepage is a little odd.

Especially seeing as how Europe is not a nation.

> A future in which we all have sovereign control over our data and our digital lives. ... And this is exactly why we at Cliqz are developing digital key technologies made in Germany.

Ahhh, that makes more sense, "sovereign" == "made in Germany."

jsnell6y ago

Ah, yes. The old "hiybbprqag" method. Worked great for Bing.

cpeterso6y ago

TIL:

"Hiybbprqag?" How Google Tripped Up Microsoft

https://www.cbsnews.com/news/hiybbprqag-how-google-tripped-u...

jackbravo6y ago

And they even link to it on the notes of the blog post.

robgibbons6y ago

I like Cliqz and am already impressed with the results for several of my test queries, though they certainly have a ways to go.

But what I would really like to see, as has been mentioned in other threads, is an open-source or community-funded search engine. Something that "belongs to the web" itself, so to speak, and not to any particular corporation.

BlackLotus896y ago

Like yacy? https://yacy.net/

Tried many times to use it but since it was a resource hungry java application that required me to use the web through a http proxy to contribute it wasn't really useable for me at any time. Also the search results were mostly garbage for me.

z3t46y ago

If you are going to make a new search engine, you need to attack a problem that people have, like Duckduckgo solving privacy issues. I don't want to install something that collects a bunch of personal info about me. A better idea is to search bookmarked sites and the cache. And do it locally.

ricardo816y ago

To be fair, Mojeek addressed the search engine privacy problem long before DDG existed.

fractalf6y ago

Wow, just wow. I've been working with Odoo for a couple of years now and it's been a frustrating experience because it's documentation suck badly and it's really freakin hard to get relevant answers from DuckDuckGo or Google when stuck. I tried out a search now on Cliqz and can't believe how good and relevant the result was. Could be a lucky shot, but I'm definitly gonna try this out more. Great work guys! :)

1 more reply

topicseed6y ago

Anybody knows Cliqz's database stack? Curious to see what powers a large scale information retrieval index of this sort.

ssubu6y ago

[Disclaimer: I work at Cliqz]

We will have a blog post tomorrow on this very topic, but in short, we use a combination of Keyvi, Granne (both in-house) along with Cassandra and RocksDB.

Though our approach mentioned in this blogpost significantly reduces the storage needed to host the index, we still have an index of around 50 TB of data.

solso6y ago

[Disclaimer I work at Cliqz] There is a lot of systems under the hood, depending if it's the main index or the freshness index. But if I have to pick one as database it should be Keyvi (https://github.com/KeyviDev/keyvi).

FYI, there will be a bunch of articles regarding search in the next week.

superkuh6y ago

There was another one of these posts a week or so ago. In that one I was one of many that complained the search engine was unusable without javascript enabled.

Now you can search without javascript enabled. Thanks, cliqz devs.

__ka6y ago

It's still not perfect, but should be usable. Many thanks for the feedback :))

marcell6y ago

Do you (founders/employees) have any example queries that don’t work well with Google, and correspondingly what pages you think should be top ranked for those queries?

stevenicr6y ago

'sex chat' currently, from my location, 'free chat now' has 2 of the top 3 results. 'i sexy chat' has 2 of the top ten results.

'chat i w ' is number 13, and has been top 10 for much of past couple / few years.. yet they are 'not an adult site' since they run GGL ads...

should be top 5 again.. sexchatsexchat.com has way more content and history..

there are many more sites I could suggest that actually have chat systems running (unlike the porn dood site which is a top 20 link list)

there are many good sites that aren't even in the results at all... these are being gamed by well connected linkers, not ranked by amount of content and length of time people would stay and enjoy.

imbo - in my biased opinion, I have more to add but wonder if it does any good.

a new engine that handles adult better, I would help with.. the other sites listed here do not do these results justice either.. again imbo, ymmv.

cirno6y ago

Since the Cliqz devs are here, and this engine is based in Germany, a question: does your search engine have any mechanisms for reporting abusive URLs (doxxing, targeted harassment, revenge porn, etc) beyond right-to-be-forgotten, or are you more a lassiez-faire, everything-goes kind of search company?

I noticed that your engine ranks some of the nastier sites on the internet far higher than any other search engine I've looked into.

trisch6y ago

[Disclaimer: I work at Cliqz] Yes, there is a way to report such urls https://cliqz.com/en/report-url

We do have a list of blacklisted urls/domains mostly regarding adult topic (child porno etc). If you have noticed some bad sites in our results, please feel free to drop a line to our support team using link I provided

cirno6y ago

Thanks for the reply. A bit disappointed it only counts for extremely illegal content. There's a lot of really negative stuff out there that is blatantly false and manipulative (Ripoff Report, Tumblr callout posts, etc) and it's always a shame that this kind of negative toxicity gets promoted so high in SERPs.

I'd really like it if there were an ethical SERP that at least had some integrity with its results. Reporting factual unflattering statements is one thing (and ideal), but promoting libel feels really dirty, and so far Cliqz seems to be the worst at that of any search engines I've used, and your reporting link seems as though Cliqz is okay with that.

cocktailpeanuts6y ago

I would like to read this but I can't reach the web server. Is it just me?

Had the same issue with another article from this same site a couple of days ago. Looks like everyone else is able to read it but for some reason not me.

Anyone know what's going on?

kkm6y ago

Hi,

Interesting, could you tell what's the error to see.

Other ways you can reach the blog: If you use Tor browser can you try opening: http://cliqzdevxo33b4h6.onion/

Or if you use Beaker browser: dat://ee172d7cd9235b2cf86ea9481e8a40e48cea29c743036621edc79a4765aa0281

Disclaimer: I work for Cliqz.

cocktailpeanuts6y ago

I get the following error:

This site can’t be reached0x65.dev refused to connect.

Try:

Checking the connection Checking the proxy and the firewall ERR_CONNECTION_REFUSED

This happens on Chrome, Firefox, Safari, and Opera on my Mac.

kkm6y ago

First, let's check if you can open another domain on .dev TLD, like web.dev, if not then:

Seems like you have some mapping for .dev TLD. Assuming based on your mention of Safari, that you are using Mac.

Could you check if you have some setting in your /etc/resolver for dev TLD, or if you are using some service like dnsmasq which is trying to resolve .dev to a non-existent location.

1 more reply

heisenhuegel6y ago

Maybe it is just you. For me, it works. Not sure why it does not load for you.

thorwasdfasdf6y ago

it worked for me.

dsunku6y ago

> The experts, who chose to answer, suggested that we should first start with crawling the whole web. We were told that this would take between 1 and 2 years to complete, and would cost a minimum of $1 billion

Why are costs so high for crawling?

philippclassen6y ago

(Disclaimer: I work at Clizq) I don't work on the search, but did some work recently on the crawling part. What I know is that crawling is far more difficult if you are not a big player. Sites will quickly block you once you hit a rate limit.

We have to be very careful, since when we get blocked there is normally no way to get unblocked again. You can try to send them an email to unblock you, but it is unlikely that you get a response. This is one part of the explanation why crawling is slow. The other part is more obvious: the internet is large.

The blocking part is hard to overcome as a small player, while for Google it is the opposite as sites simply cannot afford being exclude from the index. If we would not have to care about rate limits, it would simplify the problem.

shriphani6y ago

I did a large-scale crawl of the web some years ago and we put together a billion pages.

The bulk of the content ended up being index pages - i.e. large list of links taking you to the content - pagination, other breadcrumbs etc.

You can exhaust a lot of resources without getting anything useful.

It is no longer a point and go kind of thing unfortunately, you need a good understanding of page structure, estimate what kinds of links are vital, etc. else there's a ton of crap you'll pick up.

Or maybe I was doing something wrong.

555556y ago

There is no way it costs near a billion dollars to do an effective crawl of the web. That's laughable. I do wonder how much Googlebot spends on bandwidth and servers though.

theblackcat20046y ago

Wouldn’t common crawl content be enough? If not what are the issues?

trisch6y ago

No, it's not enough and had poor coverage outside of USA. We have also answered this question (it appears to be popular) in today's post about technical details of our search https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...

disordinary6y ago

I remember when Cuil launched about 10 years ago they suggested that 1% of the search market was worth a billion dollars so it's big money if you can get inroads. Of course search is probably less important now than it was back then, with discovery happening on social media more and more, but the internet as a whole is much larger than 10 years ago so I wouldn't be surprised if 1% is worth more now days.

ragerino6y ago

We need a smarter search.

One that is based on analyzing the content of a page then on it's page rank.

Self speaking that it has to be open source.

Apache SOLR would be a good starting point.

BlackLotus896y ago

Cliqz nearly made me stop using firefox a while back https://www.heise.de/-3852129 (german article)

Tldr 1% of german firefox installations automatically uploaded search queries to cliquz. I wont trust a search engine like this with any of my data.

philippclassen6y ago

(Disclaimer: I work at Cliqz) Just read the article. It is from 2017 and very short. For the non-German speakers, I have to translate the relevant part:

> Rund ein Prozent der Firefox-Downloads enthalten künftig das Add-On Cliqz, das bereits beim Eintippen Vorschläge für Webseiten anzeigt. Dafür wertet es die Surf-Aktivitäten aller Nutzer aus.

About 1% of the Firefox downloads will contain the Cliqz Addon, which will show you search suggestions for websites while you type. For that, it uses browsing activities of all users.

---

The last "of all users" is important. Yes, our search is built on data collected from users, but the point is we cannot build profiles of single users; we are only seeing what the whole group of users does. I cannot stress that part enough. We are not Avast.

In fact, we are very open about our data collection system called Human Web:

* https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

And this article explains how we provide anonymity while sending:

* https://0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn...

I can understand that you did not like the way that Mozilla rolled it out in 2017. I'm also not glad about how it went (my personal opinion). But from the technical side, I'm more than happy to take any question on that topic (how we collect data in Cliqz).

autoexec6y ago

Would you trust google who does the same thing with chrome? Bing which does the same thing with IE (or whatever it's called now)? Blame firefox for selling out their users not the search engine. I didn't close my account with amazon when Ubuntu started sending searches to them, I just switched distros.

yellowapple6y ago

> Would you trust google who does the same thing with chrome?

No, which is why I use neither Google nor Chrome.

> Bing which does the same thing with IE (or whatever it's called now)?

No, which is why I use neither Bing (besides indirect use via DuckDuckGo including it as a data source) nor IE/Edge.

> I didn't close my account with amazon when Ubuntu started sending searches to them,

Amazon didn't and doesn't (last I checked) have a financial stake in Canonical, nor did/does Canonical have one in Amazon. No need to blame Amazon here; that was just Canonical being stupid.

However, per Wikipedia, the same disconnection can't be asserted between Mozilla and Cliqz:

> In August 2016, Mozilla, developer of Firefox, made a minority investment in Cliqz. Cliqz plans to eventually monetize the software through a program known as Cliqz Offers, which will deliver sponsored offers to users based on their interests and browsing history.

> [...]

> On 6 October 2017, Mozilla announced a test where approximately 1% of users downloading Firefox in Germany would receive a version with Cliqz software included. The feature provided recommendations directly in the browser's search field, including for news, weather, sports, and other websites, based on the user's browsing history and activities.

That is: Mozilla invests in a company whose stated business model is literally to scrape my browsing history and shove ads into my browser, and then a year or so later starts A/B testing this as something baked into Firefox for German users. That's scummy, no matter which way you look at it, and no matter how many times Cliqz assures users that "we pinky swear we're not collecting any personally-identifiable data".

I'm taking Cliqz' "we care about your privacy" claims with a hydrostatically-equilibrious and possibly-neighborhood-clearing grain of salt.

BlackLotus896y ago

I use mainly duckduckgo and only use google if duckduckgo doesn't bring up anything useable (which is far too often for me tbh). And yes I blame firefox for every mishap over the last few years like the certificate expiration, the mr robot "advertising", the cloudflare dns and so on. But I see the good things as well like trowing out avast. So I trust them more than google.

But still I will never trust something like cliqz which belongs to the media gmbh which produces Schund like die bunte.

Ps I tested the beta and the search results werent good Pps I will read the search engine articles thought

nostromo6y ago

There's a difference.

When I use a service from Google, I expect that my data will be parsed by Google. And I can decide if I trust Google or not.

But Firefox sending the urls I visit to a third party (Cliqz) silently and without permission is shady and deceptive.

And then, after all this, Cliqz claims that it's a company built on privacy... sheesh.

galaxyLogic6y ago

An interesting quote for their article: " Philosophically, we believe copying is a loaded term, we prefer to use the term learning. Learning from each other is something all of us do"

What is the difference between copying and learning?

amelius6y ago

What I miss in this post is a list of references to the huge literature that exists on this topic, and related fields such as NLP.

Also I don't see a clear problem description. What is a search engine, really? How would you compare the quality of two search engines, objectively?

charlesism6y ago

   > Why the second constraint? one might 
   > ask. Besides the obvious potential for 
   > profitability, our mission was

The search engine the world needs is one with independence and non-profitability. If the creators are preoccupied with turning a profit, they’ll introduce the same garbage features as Google. It’s a shame, because a good search engine could shorten the time humanity has to wait for advances (eg: cures for cancers, cheaper energy, etc)

erulabs6y ago

Indexing the web is a resource intensive activity tho - if it was federated then the resource cost only increases. I suppose a non-profit is the alternative, but non-profits are not exactly independent unless they have some sort of massive endowment. I'm not trying to disagree with you, it's just a paradoxical problem: to resolve the issue, resources must be accumulated. Accumulating resources means it's hard to resolve the issue (of an independent search engine).

jayess6y ago

Weird, your domain (cliqz.com) was blocked by my pihole.

philippclassen6y ago

(Disclaimer: I work at Cliqz) We had problems with being blocked in the past.

In cases, where we got a chance to explain, they agree that it is a false positive and took us off the block list. At least, that happened so far in all cases that I'm aware of. However, there are so many lists that it is hard to keep track of them. Would be nice if you could provide some information which block list it is, so we can contact them.

The reason why we end on the blocklist is normally a misconception of our data collection system Human Web: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

If someone does not want to send Human Web data, the feature can also be disabled through the UI. Same if you browse in a private window; Human Web is automatically disabled there. There is no need to configure blocking rules.

solso6y ago

Yes, we are on some minor blocking lists because of our data collection, even though is anonymous (please check the articles about Human Web on https://0x65.dev/) sending data, no matter what data, is a sin that has to be punished. A disservice to you ask me, but what can we do. [Disclaimer: I work at Cliqz]

yurisokolov6y ago

There is a dark side to this story. With Burda https://en.wikipedia.org/wiki/Hubert_Burda_Media, the same people who are behind the Cliqz search engine were originally also behind the German Leistungsschutzrecht. https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_... This law, heavily lobbied for by publishers, forces every search engine and everybody else using content from the internet to pay a private tax of 6% of the revenue (not from profit!). https://www.vg-media.de/de/digitale-verlegerische-angebote/f... As the profit of most internet companies is below this margin, it is essentially forcing many companies out of business.

This tax is enforced and collected by VG Media, the German collecting society representing rights of a group of German publishers. https://www.vg-media.de Between 2013 and 2016 Burda was a shareholder of VG Media, which was commissioned to enforce the tax in its name.

The evil thing of this law is, that the publishers are not required to mark their content in machine-readable form as paid content. And a manual selection is infeasible for internet-scale with billions of pages. So a search engine has no means to bypass the paid content and indexing only free content, e.g. like Wikipedia which makes the majority of the internet content. Essentially the "Leistungsschutzrecht" takes the free content hostage to extort money for using the internet, even if you don't use paid content of the publishers (the just 200 publications the VG Media represents).

So while Burda's Cliqz write on their blog "The world needs more search engines" https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-t... they supported a law that made it impossible for many search engines to operate in Germany (and in the EU via the similar EU law "Extra copyright for news sites" (“Link tax”) https://juliareda.eu/eu-copyright-reform/extra-copyright-for... And while today they are not anymore shareholder of the VG Media, they still benefit from the suppressive legal environment they helped to create, as it prevents any new independent competition to enter the search market

solso6y ago

[Disclaimer: I work at Cliqz]

Sorry for taking so long to reply, I was personally trying to dig some information about this. An additional disclaimer: not a lawyer either.

Honestly, I have little idea of how this law affects search engines. What I can say is that we are no paying anything, as AFAIK we do not know anyone who is. Moreover, if some publisher would complain, even one in Burda, we would stop crawling by domain, there is no technical issue here, properties are known by the imprint. We have no say on what the investors do but I can assure you that we have no pressure. For instance, our ad-blocker works everywhere, regardless if the sites are from Burda or not.

On a general level, assuming that what you say is factually correct, I must personally agree that regulation is a bitch. It's typically designed fro big companies to control other big companies, but small ones get negatively affected if only because of the lack of resources. We recently had to suffer all the overhead of GDPR, which consumed a fair amount of our time, relatively we paid a higher price that Google.

Personally, I cannot respond for all the decisions made by the people funding Cliqz, I do not even think I can judge it either. They might be complaining and lobbying, no idea. But they are also putting good money to build a privacy-preserving search engine and a browser, something that no-one else is doing, so on my account they are on the positive side.

Braggadocious6y ago

How many algorithms are there in chrome alone? I remember when people realized that they could game Facebook shares for higher rankings on chrome and for a while buzzfeed top ten lists outranked Wikipedia every fucking time. I guess that’s still going on. What a clusterfuck search results are nowadays.

If anyone builds anything, please make it so algorithms or queries are archived. I hate how I can’t find anything on the internet that I searched for and found years ago. Its like the history of the internet evaporates every year. I don’t even know if some websites still exist or if I simply can’t find them because rankings are terrible.

I’m to the point that I haven’t been on a new website in years. How do you find new websites in this day and age when the same websites are ranked at the top every time?

j / k navigate · click thread line to collapse

102 comments

mgreg6y ago

They talk about using query logs to optimize their search results:

bryanrasmussen6y ago

toast06y ago

Going strictly by volume of clicks (or volume of clicks for a keyword) is not going to be fair or particularly useful, you need to compare clicks received to clicks expected.

If it's shown in position one, it should be expected to get more clicks than position ten. If something gets more than expected, move it up.

solso6y ago

[Disclaimer, I work at Cliqz]

Your point is spot on. Old pages tend to have more association to seen queries, which does not play in favor for new pages.

mgreg6y ago

Thank you for the additional detail and this certainly appears to be a challenging problem.

It is still a little fuzzy to me. What is a "synthetic query"? Is this basically generating queries that would match the content (i.e. essentially reversing the process)?

kick6y ago

Why does Cliqz use an analytics domain with a typo in it to get around user tracker-blockers? That's incredibly scummy, given how much Cliqz has been shouting about privacy.

https://anolysis.privacy.cliqz.com/

__ka6y ago

1 more reply

JadoJodo6y ago

Seems like an excellent opportunity to apply Hanlon's Razor.

"Never attribute to malice that which can be adequately explained by stupidity."

1 more reply

ben5096y ago

How does using an odd spelling thwart user-tracker blockers?

1 more reply

MayeulC6y ago

Wouldn't a multi-armed bandit help alleviate this issue? (Basically, randomly display a few other links, and use bayesian stats to figure out if the new links are more optimal).

That, or any kind of exploration/optimization algorithm, to be honest.

mc9876y ago

pheug6y ago

netankit6y ago

[Disclaimer: I work at Cliqz]

alexis_fr6y ago

GPS tracking must also help Google a lot to determine mortar shop popularity.

stanislavb6y ago

Yes, that's an interesting problem to solve. Maybe some A/B testing and showing new results to a percentage of all searches.

lawrenceyan6y ago

> Money and Resources : We have been lucky enough to have fantastic investors, who fund and help us in our journey.

Just as an FYI, this company Cliqz is owned by Hubert Burda Media, a large media conglomerate based in Germany.

That doesn't necessarily inherently mean anything negative, but it's important to understand the potential underlying incentives given their marketing as such a strongly privacy oriented service.

solso6y ago

hobofan6y ago

Yes, they mention it, which is a good move regarding transparency, but they still don't answer the question as to why the finance them.

They mostly push a narrative of privacy and censorship, when in the end the answer is probably close to "we want a piece of the pie" or "we want to be that monopoly".

solso6y ago

1 more reply

mfer6y ago

Isn’t Google basically an add company (if you go by their revenue). A search engine provided by a media company or ad company. I’m curious what people think the debate is between these.

MR4D6y ago

Differing biases.

Choice is good.

stanislavb6y ago

I don't know how many of you tried the engine, but there are 2 features that instantly took my attention:

1) Trackers Stats. Essentially, you can see how many and what trackers there are on the page you are about to visit. Before visiting it.

2) Page previews (I'm not sure about whether I like that)

__ka6y ago

> 1) Trackers Stats.

> 2) Page previews (I'm not sure about whether I like that)

Disclaimer: I work at Cliqz.

[0] WhoTracks .Me: Shedding light on the opaque world of online tracking - https://arxiv.org/abs/1804.08959

martinlaz6y ago

> we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like smth like this.

That's definitely the right way to go. I would also very much appreciate an option not to show in the result list sites/pages having any trackers.

trisch6y ago

I'm afraid this will remove any results from page :-D

1 more reply

rabidrat6y ago

And then when I'm done, I want to name and archive that notebook so I can return to it at a later date--whether to refresh my memory of the ultimate answer, or to continue the search.

I guess I would call this a 'research engine' instead of a search engine.

solso6y ago

[Disclaimer: I work at Cliqz]

I hate to answer this one, becasue it looks too much marketing-speech but this feature exists. Not on beta.cliqz.com but on the drop-down search on Cliqz browser.

amelius6y ago

Ok, but can't you then figure out what the browser was displaying using Javascript?

solso6y ago

whatshisface6y ago

They could, but then everyone would see them do it, and kind of the whole point is that they won't do it.

1 more reply

HenriNext6y ago

I had very similar idea a few years ago, did some quick numbers and came to conclusion that it was not going to fly commercially.

Interestingly the working name for my idea was also "ResearchEngine", so i guess it summarizes pretty well the unmet need you and me have.

basq6y ago

bsder6y ago

> I want to be able to open a jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts

AltaVista used to do this. I miss it terribly.

nl6y ago

I think you are mis-remembering this. I used Altavista heavily until Google came out and made search work, and I don't remember any feature anything like this.

bsder6y ago

whalesalad6y ago

I believe this is called faceted search.

rabidrat6y ago

I also want to maintain a longer-term list of pages/sites to exclude. Like, unless otherwise specified, I never want results from w3schools or expertsexchange.

This field is ripe for disruption, in my opinion. We can do so much better, but I've not seen any serious attempts.

nostromo6y ago

The nationalism on the homepage is a little odd... in particular since they're still essentially building on top of Google.

https://cliqz.com/en/

autoexec6y ago

Semiapies6y ago

As far as that goes, I agree.

I have an uncomfortable feeling, however that this is when the walls really start going up in the internet, beyond just the dictatorships.

yellowapple6y ago

I also welcome it, as an American. Hopefully it means more competition and therefore more motivation for US-based products to improve rather than stagnate.

TurkishPoptart6y ago

Well, I think they're right (American here).

ben5096y ago

> The nationalism on the homepage is a little odd.

Especially seeing as how Europe is not a nation.

> A future in which we all have sovereign control over our data and our digital lives. ... And this is exactly why we at Cliqz are developing digital key technologies made in Germany.

Ahhh, that makes more sense, "sovereign" == "made in Germany."

jsnell6y ago

Ah, yes. The old "hiybbprqag" method. Worked great for Bing.

cpeterso6y ago

TIL:

"Hiybbprqag?" How Google Tripped Up Microsoft

https://www.cbsnews.com/news/hiybbprqag-how-google-tripped-u...

jackbravo6y ago

And they even link to it on the notes of the blog post.

robgibbons6y ago

I like Cliqz and am already impressed with the results for several of my test queries, though they certainly have a ways to go.

BlackLotus896y ago

Like yacy? https://yacy.net/

z3t46y ago

ricardo816y ago

To be fair, Mojeek addressed the search engine privacy problem long before DDG existed.

fractalf6y ago

1 more reply

topicseed6y ago

Anybody knows Cliqz's database stack? Curious to see what powers a large scale information retrieval index of this sort.

ssubu6y ago

[Disclaimer: I work at Cliqz]

We will have a blog post tomorrow on this very topic, but in short, we use a combination of Keyvi, Granne (both in-house) along with Cassandra and RocksDB.

Though our approach mentioned in this blogpost significantly reduces the storage needed to host the index, we still have an index of around 50 TB of data.

solso6y ago

FYI, there will be a bunch of articles regarding search in the next week.

superkuh6y ago

There was another one of these posts a week or so ago. In that one I was one of many that complained the search engine was unusable without javascript enabled.

Now you can search without javascript enabled. Thanks, cliqz devs.

__ka6y ago

It's still not perfect, but should be usable. Many thanks for the feedback :))

marcell6y ago

Do you (founders/employees) have any example queries that don’t work well with Google, and correspondingly what pages you think should be top ranked for those queries?

stevenicr6y ago

'sex chat' currently, from my location, 'free chat now' has 2 of the top 3 results. 'i sexy chat' has 2 of the top ten results.

'chat i w ' is number 13, and has been top 10 for much of past couple / few years.. yet they are 'not an adult site' since they run GGL ads...

should be top 5 again.. sexchatsexchat.com has way more content and history..

there are many more sites I could suggest that actually have chat systems running (unlike the porn dood site which is a top 20 link list)

there are many good sites that aren't even in the results at all... these are being gamed by well connected linkers, not ranked by amount of content and length of time people would stay and enjoy.

imbo - in my biased opinion, I have more to add but wonder if it does any good.

a new engine that handles adult better, I would help with.. the other sites listed here do not do these results justice either.. again imbo, ymmv.

cirno6y ago

I noticed that your engine ranks some of the nastier sites on the internet far higher than any other search engine I've looked into.

trisch6y ago

[Disclaimer: I work at Cliqz] Yes, there is a way to report such urls https://cliqz.com/en/report-url

cirno6y ago

cocktailpeanuts6y ago

I would like to read this but I can't reach the web server. Is it just me?

Had the same issue with another article from this same site a couple of days ago. Looks like everyone else is able to read it but for some reason not me.

Anyone know what's going on?

kkm6y ago

Hi,

Interesting, could you tell what's the error to see.

Other ways you can reach the blog: If you use Tor browser can you try opening: http://cliqzdevxo33b4h6.onion/

Or if you use Beaker browser: dat://ee172d7cd9235b2cf86ea9481e8a40e48cea29c743036621edc79a4765aa0281

Disclaimer: I work for Cliqz.

cocktailpeanuts6y ago

I get the following error:

This site can’t be reached0x65.dev refused to connect.

Try:

Checking the connection Checking the proxy and the firewall ERR_CONNECTION_REFUSED

This happens on Chrome, Firefox, Safari, and Opera on my Mac.

kkm6y ago

First, let's check if you can open another domain on .dev TLD, like web.dev, if not then:

Seems like you have some mapping for .dev TLD. Assuming based on your mention of Safari, that you are using Mac.

Could you check if you have some setting in your /etc/resolver for dev TLD, or if you are using some service like dnsmasq which is trying to resolve .dev to a non-existent location.

1 more reply

heisenhuegel6y ago

Maybe it is just you. For me, it works. Not sure why it does not load for you.

thorwasdfasdf6y ago

it worked for me.

dsunku6y ago

Why are costs so high for crawling?

philippclassen6y ago

shriphani6y ago

I did a large-scale crawl of the web some years ago and we put together a billion pages.

The bulk of the content ended up being index pages - i.e. large list of links taking you to the content - pagination, other breadcrumbs etc.

You can exhaust a lot of resources without getting anything useful.

It is no longer a point and go kind of thing unfortunately, you need a good understanding of page structure, estimate what kinds of links are vital, etc. else there's a ton of crap you'll pick up.

Or maybe I was doing something wrong.

555556y ago

There is no way it costs near a billion dollars to do an effective crawl of the web. That's laughable. I do wonder how much Googlebot spends on bandwidth and servers though.

theblackcat20046y ago

Wouldn’t common crawl content be enough? If not what are the issues?

trisch6y ago

disordinary6y ago

ragerino6y ago

We need a smarter search.

One that is based on analyzing the content of a page then on it's page rank.

Self speaking that it has to be open source.

Apache SOLR would be a good starting point.

BlackLotus896y ago

Cliqz nearly made me stop using firefox a while back https://www.heise.de/-3852129 (german article)

Tldr 1% of german firefox installations automatically uploaded search queries to cliquz. I wont trust a search engine like this with any of my data.

philippclassen6y ago

(Disclaimer: I work at Cliqz) Just read the article. It is from 2017 and very short. For the non-German speakers, I have to translate the relevant part:

> Rund ein Prozent der Firefox-Downloads enthalten künftig das Add-On Cliqz, das bereits beim Eintippen Vorschläge für Webseiten anzeigt. Dafür wertet es die Surf-Aktivitäten aller Nutzer aus.

About 1% of the Firefox downloads will contain the Cliqz Addon, which will show you search suggestions for websites while you type. For that, it uses browsing activities of all users.

---

In fact, we are very open about our data collection system called Human Web:

* https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

And this article explains how we provide anonymity while sending:

* https://0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn...

autoexec6y ago

yellowapple6y ago

> Would you trust google who does the same thing with chrome?

No, which is why I use neither Google nor Chrome.

> Bing which does the same thing with IE (or whatever it's called now)?

No, which is why I use neither Bing (besides indirect use via DuckDuckGo including it as a data source) nor IE/Edge.

> I didn't close my account with amazon when Ubuntu started sending searches to them,

Amazon didn't and doesn't (last I checked) have a financial stake in Canonical, nor did/does Canonical have one in Amazon. No need to blame Amazon here; that was just Canonical being stupid.

However, per Wikipedia, the same disconnection can't be asserted between Mozilla and Cliqz:

> [...]

I'm taking Cliqz' "we care about your privacy" claims with a hydrostatically-equilibrious and possibly-neighborhood-clearing grain of salt.

BlackLotus896y ago

But still I will never trust something like cliqz which belongs to the media gmbh which produces Schund like die bunte.

Ps I tested the beta and the search results werent good Pps I will read the search engine articles thought

nostromo6y ago

There's a difference.

When I use a service from Google, I expect that my data will be parsed by Google. And I can decide if I trust Google or not.

But Firefox sending the urls I visit to a third party (Cliqz) silently and without permission is shady and deceptive.

And then, after all this, Cliqz claims that it's a company built on privacy... sheesh.

galaxyLogic6y ago

An interesting quote for their article: " Philosophically, we believe copying is a loaded term, we prefer to use the term learning. Learning from each other is something all of us do"

What is the difference between copying and learning?

amelius6y ago

What I miss in this post is a list of references to the huge literature that exists on this topic, and related fields such as NLP.

Also I don't see a clear problem description. What is a search engine, really? How would you compare the quality of two search engines, objectively?

charlesism6y ago

   > Why the second constraint? one might 
   > ask. Besides the obvious potential for 
   > profitability, our mission was

erulabs6y ago

jayess6y ago

Weird, your domain (cliqz.com) was blocked by my pihole.

philippclassen6y ago

(Disclaimer: I work at Cliqz) We had problems with being blocked in the past.

The reason why we end on the blocklist is normally a misconception of our data collection system Human Web: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

solso6y ago

yurisokolov6y ago

solso6y ago

[Disclaimer: I work at Cliqz]

Sorry for taking so long to reply, I was personally trying to dig some information about this. An additional disclaimer: not a lawyer either.

Braggadocious6y ago

I’m to the point that I haven’t been on a new website in years. How do you find new websites in this day and age when the same websites are ranked at the top every time?

j / k navigate · click thread line to collapse