> Most design houses don’t write their own macro placers but customize commercial flows for their designs.
Most I don't know, but all the mid-to-large ones have automated macro placers. Obviously, the output is introduced into the commercial flow, generally by setting placement constraints. The larger houses go much further and may even override specific parts of the flow, but not basing it on an commercial flow is out of the question right now.
> The problem with macro placement as an RL technology demonstrator is that to evaluate quality you need to go through large parts of the design flow which involves using other commercial tools.
Not really, not any more than any other optimization such as e.g. frontend which I'm more familiar with. If you don't want to go through the full design flow (which I agree introduces noise more than anything else), then benchmark your floorplans in some easily calculable metric (e.g., HPWL). Likewise, if you want to test the quality of some logic simplification _in theory_ you'd have to also go through the entire flow (backend included), but no one does that and you just evaluate some easily calculable metric e.g. number of gates. These distinctions are traditional more than anything else.
Academic macro placers generally have limited access to commercial flows (either due to licensing issues or computing resource availability) so it is rather common to benchmark them in other metrics. Google paper tried to be too smart for its own good and therefore incomparable to anything academic.