I hadn't seen that approach to evaluating search engines. Looking at the GitHub repo, though, I'm not quite following how I would use it, what the standard approach to scoring relevance ranking is, or how this approach differs from that standard. If it's not too much trouble, a TL;DR on that would be a really useful intro.
I do intend to set up the UX so that users can re-rank results, or at least flag which results were irrelevant, to help with re-training the model over time. For applications where very limited domain knowledge is required (for example, is this a hardware product or a software product? Or is this a product or a service?), I can also use Mechanical Turk or a similar service to label data, and I fully intend to do that.