I worked on a Learning To Rank implementation a year or so ago. What struck me then (and now, reading about Shopify's implementation) is that the approach is often very similar across sites, but the implementation is usually rather tailored. You see the same patterns: online/offline metrics; nDCG; click models and implicit/explicit relevance judgements; re-ranking the top-k results, and so on.
Unfortunately there doesn't seem to be a technology tying all of the components of an LtR system together. A managed service like Algolia could be an answer. I wonder if the industry will eventually converge on a framework, such as an extension to OpenSource Connections' Elasticsearch Learning to Rank plugin (https://diff.wikimedia.org/2017/10/17/elasticsearch-learning...).
It's a really interesting area of theory and practice - I hope Shopify write more about their implementation!
I'd also recommend reading Airbnb's really excellent paper - https://arxiv.org/pdf/1810.09591.pdf.
For real-time feedback, we've implemented (on another search product at Shopify, not the Help Center) a "near-real-time" feedback loop using implicit judgments to alter search results. Perhaps I'll write a post about that one soon :). My colleague Doug talks a bit about the new systems we're building in this blog post - https://shopify.engineering/apache-beam-for-search-getting-s....
That's where NDCG comes in! Basically it gives a score for your search rankings that you can use to compare different search algorithms. The higher the score, the closer your algorithm was to producing the expected search results. This is super useful as you can try lots of experiments and get a good sense of whether the experiment is promising or not.
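To make that concrete, here's a rough sketch of how DCG/NDCG can be computed over a ranked list of graded relevance judgments (this uses the linear-gain formulation; the grades and example lists are made up):

```python
# Minimal sketch of DCG/NDCG over graded relevance labels
# (e.g. 0 = irrelevant, 1 = partially relevant, 2 = relevant).
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG normalised by the DCG of the ideal (perfectly sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Ranking A puts the most relevant documents first; ranking B does not.
print(ndcg([2, 2, 1, 0, 0]))  # 1.0 - already the ideal ordering
print(ndcg([0, 1, 0, 2, 2]))  # ~0.60 - noticeably lower
```

Averaging that score over a set of benchmark queries is what lets you compare algorithm A against algorithm B offline.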
In contrast, MAP and NDCG (and others, like Precision, Recall, F1-score, MRR) are _evaluation_ metrics.
So the former are part of systems, the latter are part of measuring the quality of systems.
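For comparison, here are minimal sketches of two of the other evaluation metrics mentioned above, precision@k and (per-query) reciprocal rank, assuming binary relevance judgments; the example lists are invented:

```python
# Rough sketches of two offline evaluation metrics, assuming binary
# relevance judgments (1 = relevant, 0 = not relevant) per result position.

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevances[:k]) / k

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Averaged over many queries these become P@k and MRR.
queries = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(precision_at_k([0, 1, 1, 0], k=3), mrr)  # 0.666..., 0.5
```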
As the article states, human-evaluated rankings are the best. But that is of course a relatively expensive process. And getting any group of people to do things in a consistent/systematic way is a challenge in itself. Is the group of people you pick to do this representative of your user base? Is what you let them evaluate representative of what your user base actually does? And what is the shelf life of these evaluations? Even something as simple as cultural and racial biases can skew results quite a bit unintentionally, and more than a few big-name companies have fallen into this particular trap.
Relevance is inherently subjective. What's relevant for you might not be the same as what is relevant for me, probably because we don't share the same context (goals, intentions, preferences, circumstances, locale, environment, etc.). Many ML-based search solutions are effectively one-size-fits-all solutions that end up locking you into recommendation bubbles. I usually jokingly call this the "more of the FFing same algorithm". A lot of ML devolves into that.
I've been involved with a few machine-learning-based ranking projects over the years. To be clear, I'm not an ML expert and instead usually work on non-ML search projects (mostly Elasticsearch in recent years). With Elasticsearch, the Learning to Rank plugin is one of several ways you can leverage ML to improve ranking. It works best in very stable, well-understood domains where the feature set and data are relatively static and where there is an abundance of user data to work with. The few times I've seen rigorous and effective A/B testing was on teams that had this.
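For anyone who hasn't seen that plugin, this is roughly how it gets wired into a query: a cheap first-pass match, then an `sltr` rescore over the top window using a previously trained and uploaded model. The index name, model name, and parameters below are placeholders rather than a real deployment, and it assumes the plugin is installed and a model has already been uploaded:

```python
# Sketch (assumed names) of querying Elasticsearch with the LTR plugin:
# cheap first-pass retrieval, then re-rank the top window with a model.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="help_articles",                         # hypothetical index
    query={"match": {"body": "refund policy"}},    # cheap first-pass query
    rescore={
        "window_size": 200,                        # only re-rank the top 200 hits
        "query": {
            "rescore_query": {
                "sltr": {
                    "params": {"keywords": "refund policy"},  # query-time feature params
                    "model": "help_center_ltr_v1",            # hypothetical uploaded model
                }
            }
        },
    },
)
```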
This particular form of ML is of course hardly state of the art at this point. It involves manual feature extraction (i.e. no deep learning here) and relatively simplistic algorithms to tune things like boost factors and other query parameters. Variations of logistic regression, basically. It's really expensive to do and more expensive to do well. Your starting point is basically a manually crafted query that already mostly does the right things.
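As a toy illustration of that style of pointwise learning to rank (not any particular team's pipeline): hand-extracted features per query-document pair, an implicit clicked/not-clicked label, and a logistic regression whose learned weights end up playing the role of tuned boost factors. All numbers here are invented:

```python
# Toy pointwise learning-to-rank sketch: hand-extracted features per
# (query, document) pair, clicked/not-clicked as the label, and a logistic
# regression whose learned weights act like tuned boost factors.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature columns: [bm25_title, bm25_body, log_popularity, age_days]
X = np.array([
    [12.1, 4.3, 5.2,   3],
    [ 2.4, 6.1, 7.9,  40],
    [ 0.5, 1.2, 2.1, 200],
    [ 9.8, 3.9, 1.0,  10],
    [ 1.1, 0.4, 6.5, 365],
])
y = np.array([1, 1, 0, 1, 0])  # implicit judgment: clicked or not

model = LogisticRegression().fit(X, y)

# At query time, score candidate documents and re-rank the top-k by this score.
candidates = np.array([[8.0, 2.5, 4.0, 7], [1.0, 0.9, 8.0, 90]])
scores = model.predict_proba(candidates)[:, 1]
print(scores.argsort()[::-1])  # candidate indices, best first
```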
And even so, in all these teams I noticed a pattern of things hitting a local optimum where obvious low-hanging fruit in query improvements became hard to pick, simply because of the overhead of "proving" the change did not negatively impact rankings. Teams like this become change-resistant. I've seen teams insist their ranking was awesome when users and product owners clearly had different ideas about that (i.e. they were raising valid complaints).
The paradox is that most good changes inevitably degrade the experience until you dial them in over subsequent releases. If you obsessively avoid that kind of disruptive change, your system stops improving altogether. If you have people obsessing over sub-percent deviations in the metrics they are tracking, those changes never happen and teams get stuck chasing their own tails. I'm not kidding, I've been in meetings where people were debating fraction-of-a-percent changes in metrics. Analysis paralysis is a thing with ML teams. They end up codifying their own biases and are then stuck with them until somebody shakes things up a little.
This seems like a fairly tricky ranking function. I wonder if they compared it to combining TF-IDF and the page popularity. This would help with the problem they explained.
It'd be interesting to see more details about how they implemented the query-specific page rank.
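For what it's worth, one simple (hypothetical, not necessarily what Shopify did) way to blend a per-query text score like TF-IDF/BM25 with a query-independent popularity signal is a linear combination with a log-damped popularity term:

```python
# Assumed/illustrative blend of a text relevance score with page popularity;
# the weights and example documents are made up.
import math

def blended_score(text_score, pageviews, alpha=0.75):
    """Linear blend of text relevance and log-damped popularity."""
    popularity = math.log1p(pageviews)
    return alpha * text_score + (1 - alpha) * popularity

docs = [
    {"id": "popular-but-off-topic", "text_score": 1.2, "pageviews": 50_000},
    {"id": "on-topic-long-tail",    "text_score": 6.8, "pageviews": 120},
]
ranked = sorted(docs, key=lambda d: blended_score(d["text_score"], d["pageviews"]), reverse=True)
print([d["id"] for d in ranked])  # the on-topic page wins despite far fewer views
```

The log damping keeps a hugely popular but barely relevant page from drowning out a genuinely relevant long-tail page, which is roughly the failure mode described in the article.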