I've found that a 1-1.5 hour test, deliberately designed to realistically depict the work the candidate will do, works and scales just fine.
In practice, this means creating a small example project with all of the tedious boilerplate already in place and asking the candidate to implement 1-3 relevant stories on it.
It's not the streetlight effect that got the industry into this sorry state. It's a combination of cargo culting and laziness.