I don't think the Bitter Lesson applies to ASTs.
From the Bitter Lesson:
"Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better."
Those models are taking advantage of inductive biases. Every model has them, including massive language models. Inductive biases are not the same as engineered features (such as SIFT) or heuristics.
Using the AST is just another way of looking at the code already in your dataset. For the model to understand what it is writing, it needs to map text sequences to ASTs anyway. It can attempt to learn this mapping, but the 12B model still produces syntactically invalid Python, so it clearly hasn't.
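For illustration, the text-to-AST mapping the model would have to learn internally is exactly what Python's own `ast` module computes explicitly. A minimal sketch (the helper function name is my own) of checking whether a generated text sequence corresponds to a legal AST:

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the text sequence maps to a well-formed Python AST."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# A well-formed snippet parses into an AST; a malformed one does not.
print(is_valid_python("def f(x):\n    return x + 1"))  # True
print(is_valid_python("def f(x) return x + 1"))        # False
```

A model that had truly internalized the grammar would never emit text for which this check fails, which is the sense in which illegal output shows the mapping hasn't been learned.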