I worked on a Learning To Rank implementation a year or so ago. What struck me then (and now, reading about Shopify's implementation) is that the approach is often very similar across sites, but the implementation is usually rather tailored. You see the same patterns: online/offline metrics; nDCG; click models and implicit/explicit relevance judgements; re-ranking the top-k results, and so on.
Unfortunately there doesn't seem to be a technology tying all of the components of an LtR system together. A managed service like Algolia could be an answer. I wonder if the industry will eventually converge on a framework, such as an extension to OpenSource Connections' Elasticsearch Learning to Rank plugin (https://diff.wikimedia.org/2017/10/17/elasticsearch-learning...).
It's a really interesting area of theory and practice - I hope Shopify write more about their implementation!
I'd also recommend reading Airbnb's really excellent paper - https://arxiv.org/pdf/1810.09591.pdf.
For real-time feedback, we've implemented (on another search product at Shopify, not the Help Center) a "near-real-time" feedback loop using implicit judgments to alter search results. Perhaps I'll write a post about that one soon :). My colleague Doug talks a bit about the new systems we're building in this blog post - https://shopify.engineering/apache-beam-for-search-getting-s....
That's where NDCG comes in! Basically it gives a score for your search rankings that you can use to compare different search algorithms. The higher the score, the closer your algorithm was to producing the expected search results. This is super useful as you can try lots of experiments and get a good sense of whether the experiment is promising or not.
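To make that concrete, here's a rough sketch of how DCG/NDCG can be computed over a ranked list of graded relevance judgments (this uses the linear-gain formulation; the grades and example lists are made up):

```python
# Minimal sketch of DCG/NDCG over graded relevance labels
# (e.g. 0 = irrelevant, 1 = partially relevant, 2 = relevant).
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG normalised by the DCG of the ideal (perfectly sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Ranking A puts the most relevant documents first; ranking B does not.
print(ndcg([2, 2, 1, 0, 0]))  # 1.0 - already the ideal ordering
print(ndcg([0, 1, 0, 2, 2]))  # ~0.60 - noticeably lower
```

Averaging that score over a set of benchmark queries is what lets you compare algorithm A against algorithm B offline.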
In contrast, MAP and NDCG (and others, like Precision, Recall, F1-score, MRR) are _evaluation_ metrics.
So the former are part of systems, the latter are part of measuring the quality of systems.
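For comparison, here are minimal sketches of two of the other evaluation metrics mentioned above, precision@k and (per-query) reciprocal rank, assuming binary relevance judgments; the example lists are invented:

```python
# Rough sketches of two offline evaluation metrics, assuming binary
# relevance judgments (1 = relevant, 0 = not relevant) per result position.

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevances[:k]) / k

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Averaged over many queries these become P@k and MRR.
queries = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(precision_at_k([0, 1, 1, 0], k=3), mrr)  # 0.666..., 0.5
```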
As the article states, human-evaluated rankings are the best. But that is of course a relatively expensive process. And getting any group of people to do things in a consistent/systematic way is a challenge in itself. Is the group of people you pick to do this representative of your user base? Is what you let them evaluate representative of what your user base actually does? And what is the shelf life of these evaluations? Even something as simple as cultural and racial biases can skew results quite a bit unintentionally, and more than a few big-name companies have fallen into this particular trap.
Relevance is inherently subjective. What's relevant for you might not be the same as what is relevant for me, probably because we don't share the same context (goals, intentions, preferences, circumstances, locale, environment, etc.). Many ML-based search solutions are effectively one-size-fits-all solutions that end up locking you into recommendation bubbles. I usually jokingly call this the "more of the FFing same algorithm". A lot of ML devolves into that.
I've been involved with a few machine-learning-based ranking projects over the years. To be clear, I'm not an ML expert and instead usually work on non-ML search projects (mostly Elasticsearch in recent years). With Elasticsearch, the Learning to Rank plugin is one of several ways you can leverage ML to improve ranking. It works best in very stable, well-understood domains where the feature set and data are relatively static and where there is an abundance of user data to work with. The few times I've seen rigorous and effective A/B testing was on teams that had this.
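For anyone who hasn't seen that plugin, this is roughly how it gets wired into a query: a cheap first-pass match, then an `sltr` rescore over the top window using a previously trained and uploaded model. The index name, model name, and parameters below are placeholders rather than a real deployment, and it assumes the plugin is installed and a model has already been uploaded:

```python
# Sketch (assumed names) of querying Elasticsearch with the LTR plugin:
# cheap first-pass retrieval, then re-rank the top window with a model.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="help_articles",                         # hypothetical index
    query={"match": {"body": "refund policy"}},    # cheap first-pass query
    rescore={
        "window_size": 200,                        # only re-rank the top 200 hits
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": "refund policy"},  # query-time feature params
                    "model": "help_center_ltr_v1",            # hypothetical uploaded model
                }
            }
        },
    },
)
```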
This particular form of ML is of course hardly state of the art at this point. It involves manual feature extraction (i.e. no deep learning here) and relatively simplistic algorithms to tune things like boost factors and other query parameters. Variations of logistic regression, basically. It's really expensive to do and more expensive to do well. Your starting point is basically a manually crafted query that already mostly does the right things.
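As a toy illustration of that style of pointwise learning to rank (not any particular team's pipeline): hand-extracted features per query-document pair, an implicit clicked/not-clicked label, and a logistic regression whose learned weights end up playing the role of tuned boost factors. All numbers here are invented:

```python
# Toy pointwise learning-to-rank sketch: hand-extracted features per
# (query, document) pair, clicked/not-clicked as the label, and a logistic
# regression whose learned weights act like tuned boost factors.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature columns: [bm25_title, bm25_body, log_popularity, age_days]
X = np.array([
    [12.1, 4.3, 5.2,   3],
    [ 2.4, 6.1, 7.9,  40],
    [ 0.5, 1.2, 2.1, 200],
    [ 9.8, 3.9, 1.0,  10],
    [ 1.1, 0.4, 6.5, 365],
])
y = np.array([1, 1, 0, 1, 0])  # implicit judgment: clicked or not

model = LogisticRegression().fit(X, y)

# At query time, score candidate documents and re-rank the top-k by this score.
candidates = np.array([[8.0, 2.5, 4.0, 7], [1.0, 0.9, 8.0, 90]])
scores = model.predict_proba(candidates)[:, 1]
print(scores.argsort()[::-1])  # candidate indices, best first
```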
And even so, in all these teams I noticed a pattern of things hitting a local optimum where obvious low-hanging fruit in query improvements became hard to pick, simply because of the overhead of "proving" the change did not negatively impact rankings. Teams like this become change-resistant. I've seen teams insist their ranking was awesome when users and product owners clearly had different ideas about that (i.e. they were raising valid complaints).
The paradox is that most good changes inevitably degrade the experience until you dial them in over subsequent releases. If you obsessively avoid that kind of disruptive change, your system stops improving altogether. If you have people obsessing over sub-percent deviations in the metrics they are tracking, those changes never happen and teams get stuck chasing their own tails. I'm not kidding, I've been in meetings where people were debating fraction-of-a-percent changes in metrics. Analysis paralysis is a thing with ML teams. They end up codifying their own biases and are then stuck with them until somebody shakes things up a little.
This seems like a fairly tricky ranking function. I wonder if they compared it to combining TF-IDF and the page popularity. This would help with the problem they explained.
It'd be interesting to see more details about how they implemented the query-specific page rank.
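For what it's worth, one simple (hypothetical, not necessarily what Shopify did) way to blend a per-query text score like TF-IDF/BM25 with a query-independent popularity signal is a linear combination with a log-damped popularity term:

```python
# Assumed/illustrative blend of a text relevance score with page popularity;
# the weights and example documents are made up.
import math

def blended_score(text_score, pageviews, alpha=0.75):
    """Linear blend of text relevance and log-damped popularity."""
    popularity = math.log1p(pageviews)
    return alpha * text_score + (1 - alpha) * popularity

docs = [
    {"id": "popular-but-off-topic", "text_score": 1.2, "pageviews": 50_000},
    {"id": "on-topic-long-tail",    "text_score": 6.8, "pageviews": 120},
]
ranked = sorted(docs, key=lambda d: blended_score(d["text_score"], d["pageviews"]), reverse=True)
print([d["id"] for d in ranked])  # the on-topic page wins despite far fewer views
```

The log damping keeps a hugely popular but barely relevant page from drowning out a genuinely relevant long-tail page, which is roughly the failure mode described in the article.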