This isn't a quality-of-fit issue (and even if it were, linear models are not always sufficient). The problem is that different causal structures can entail the same set of correlations, which makes them impossible to distinguish through observation alone.
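To make that concrete, here is a minimal sketch with toy numbers of my own (the 0.6 and 0.8 coefficients, the function names, and the sample size are all assumptions for illustration, not anything from a real model). One generator has X causing Y, the other has Y causing X, and both are built to have unit variances and a correlation of 0.6, so the observational data look the same.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sample_x_causes_y(n, rng):
    # Hypothetical structure A: X is exogenous, Y depends on X.
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [0.6 * x + 0.8 * rng.gauss(0, 1) for x in xs]
    return xs, ys


def sample_y_causes_x(n, rng):
    # Hypothetical structure B: Y is exogenous, X depends on Y.
    ys = [rng.gauss(0, 1) for _ in range(n)]
    xs = [0.6 * y + 0.8 * rng.gauss(0, 1) for y in ys]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
r_a = pearson(*sample_x_causes_y(50_000, rng))
r_b = pearson(*sample_y_causes_x(50_000, rng))

# Both correlations come out close to 0.6, despite opposite causal directions.
print(f"X -> Y: r = {r_a:.3f}")
print(f"Y -> X: r = {r_b:.3f}")
```

Any model fit to the first dataset fits the second about equally well, so the fit tells you nothing about which way the arrow points. You would need an intervention (set X yourself and see whether Y moves) to tell them apart.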
Grandparent commenter here -- I'm glad I communicated my concern well enough; I feel like you and mjburgess have nailed it. Fit metrics alone aren't sufficient to determine whether a model is appropriate to use (even ignoring the issues of p-hacking and other ills).