Most design houses don’t write their own macro placers but customize commercial flows for their designs.
The problem with macro placement as an RL technology demonstrator is that to evaluate quality you have to run large parts of the design flow, which means using other commercial tools. That makes it incredibly hard to demonstrate superiority, since all those steps and tools add noise.
Easier problems would have been using RL to minimize the number of gates in a logic circuit, or focusing on placement alone with half-perimeter wirelength as the metric (I think this is what you mean by your grad student example). Essentially, solve point problems in the design flow and evaluate quality improvements locally.
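To make "local" concrete, here is a minimal sketch of half-perimeter wirelength: for each net, take the bounding box of its pins and add the box's width and height. The names (`hpwl`, `positions`, `nets`) and the toy coordinates are my own illustration, not taken from any particular tool, and real placers usually add pin offsets and net weights on top of this.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(nets: List[List[str]], positions: Dict[str, Point]) -> float:
    """Sum over all nets of the half-perimeter of each net's pin bounding box."""
    total = 0.0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        # Half-perimeter = bounding-box width + bounding-box height.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Toy, made-up placement: three cells and three nets.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b"], ["b", "c"], ["a", "b", "c"]]

print(hpwl(nets, positions))  # 7 + 5 + 7 = 19.0
```

The point is that this number depends only on the placement itself, so two placers can be compared without running routing, timing, or any other downstream tool.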
They evaluated quality globally, and only globally. In this business that destroys credibility because of the noise involved, unless you have lots of examples, can show statistical significance, and (unfortunately for the authors) can also show local improvements.
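As a rough illustration of why "lots of examples" matters, here is a sketch of an exact paired sign-flip permutation test on per-design results. The test is my choice of a simple method, not something the paper or the follow-on studies are known to have used, and the numbers are made up purely for illustration.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Exact two-sided paired sign-flip permutation test on the summed difference."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to be
    # positive or negative, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical metric (lower is better) on five designs; not real data.
baseline = [100.0, 250.0, 80.0, 410.0, 190.0]
candidate = [97.0, 244.0, 79.0, 401.0, 188.0]

print(sign_flip_p_value(baseline, candidate))  # 2 / 32 = 0.0625
```

Even though the hypothetical candidate wins on all five designs, the two-sided p-value cannot go below 2/32, so a handful of benchmarks is not enough to claim significance at the usual 0.05 level, and that is before accounting for the noise the rest of the flow adds.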
That’s what the follow-on studies did, and that’s why the community has lost faith in this particular algorithm.