I imagine if your volume is high enough, it could be worthwhile to at least check whether simple preprocessing gets you anywhere.
Basically, compare model performance on a bunch of problems, and see whether the queries that actually require an expensive model have anything in common: low Flesch-Kincaid readability, a bag-of-words signal for subordinate clauses or potentially ambiguous pronouns, word rarity, or whatever.
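As a rough sketch of what that preprocessing could look like: a feature extractor that computes a Flesch-Kincaid grade level plus a few cheap lexical rates. The pronoun and subordinator word lists here are made up for illustration, and the syllable counter is a crude vowel-group heuristic, so treat this as a starting point rather than a real difficulty detector.

```python
import re

# Hypothetical word lists -- tune these against your own data.
AMBIGUOUS_PRONOUNS = {"it", "this", "that", "they", "them", "which"}
SUBORDINATORS = {"although", "because", "whereas", "unless", "while"}

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, minimum one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def difficulty_features(query: str) -> dict:
    words = re.findall(r"[A-Za-z']+", query)
    n = max(1, len(words))
    sentences = max(1, len(re.findall(r"[.!?]+", query)))
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid grade level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    fk_grade = 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59
    return {
        "fk_grade": fk_grade,
        "pronoun_rate": sum(w.lower() in AMBIGUOUS_PRONOUNS for w in words) / n,
        "subordinator_rate": sum(w.lower() in SUBORDINATORS for w in words) / n,
        "mean_word_len": sum(map(len, words)) / n,
    }
```

Each feature is a single pass over the tokens, so the whole thing is trivially cheap next to a model call.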
Maybe my knowledge of old-school NLP methods is useful after all :-) Those methods are generally far less compute-intensive. If you wanted to go really crazy on performance, you might even use a Bloom filter to do fast, imprecise counting of words of various types.
Then you could add some old-school, compute-lite ML, like an ordinary linear regression on the old-school-NLP-derived features.
Really the win would be for a company like Hypermode to implement this automatically for customers who want it (high volume customers who don't mind saving money).
Actually, a company like Hypermode might be uniquely well-positioned to offer this service to smaller customers too: with access to data from a large variety of customers, they could look for query-difficulty heuristics that generalize well across different workloads.