In general, having an AST is always better than having a plain text file, unless you want to read it. But then you can easily dump the AST back to text whenever you want.
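To make the "dump it back to text" point concrete, here is a minimal sketch using Python's standard `ast` module. Python is only an assumption for illustration, since the thread doesn't say what language the codebase is in, and `round_trip` is a made-up helper name. One caveat: this particular parser drops comments and original formatting, so the round trip preserves structure, not the exact text.

```python
import ast

def round_trip(source):
    """Parse source text into an AST, then render the AST back to text."""
    tree = ast.parse(source)
    # ast.unparse is available from Python 3.9 onwards.
    return tree, ast.unparse(tree)

source = "def add(a, b):\n    # comments do not survive the trip\n    return a + b\n"
tree, text = round_trip(source)

# The regenerated text parses to a structurally identical tree.
same_structure = ast.dump(ast.parse(text)) == ast.dump(tree)
```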
Yeah, building an AST will help you analyse your codebase programmatically, which in turn will let you understand the codebase better and faster. This is some very basic programming knowledge, I think. Or is it not? Some commenters here don't know what an AST even is - is this the state of PL knowledge in the mainstream? Lisp and Smalltalk people would be very, very sad if it were so.
You'd be better off starting with just the build scripts and build tools.
ASTs are great for increasing understanding of much smaller projects, but for something this size you'd likely end up with very little to show for your effort except the crash logs of your tools.
You need to go 'coarse' before you can go 'fine' on something of this magnitude.
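As a rough idea of what a 'coarse' first pass could look like: a sketch that just counts lines per top-level directory and file extension, without parsing anything. The function name `survey` and the grouping key are assumptions made for illustration, not something from the thread.

```python
import os
from collections import Counter
from pathlib import Path

def survey(root):
    """Count lines per (top-level directory, file extension) under root."""
    root = Path(root)
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        # Files directly under root are grouped under ".".
        top = rel.parts[0] if rel.parts else "."
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # Read as bytes so odd encodings don't abort the scan.
                with open(path, "rb") as fh:
                    lines = sum(1 for _ in fh)
            except OSError:
                continue  # unreadable file: skip it, keep going
            totals[(top, path.suffix or "(none)")] += lines
    return totals
```

Something like `survey(".").most_common(20)` then shows where the bulk of the lines live, which is usually enough to decide where to look next.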
This is not a 3-week project; just mapping the thing properly will take (man-)years.
Yeah, I started commenting before the realization of how HUGE this thing would be hit me, sorry :)
I have written some Scheme and I still can't say I need to screw around with ASTs. Maybe I will be enlightened some day?
A UML diagram for this level of hugeness would be a really useful thing in my opinion, much, much better than an AST.
As for this:
> A UML diagram for this level of hugeness would be a really useful thing
we actually agree 100% here. What I mean is that having an AST is meaningless by itself, but you need an AST if you want to generate a UML diagram from the code. Or generate a call graph. Or find similarities or duplication in the code. Or indeed perform any kind of automatic code transformation.
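For example, a very small call-graph extractor on top of an AST might look like the sketch below. It uses Python's standard `ast` module purely for illustration (the thread doesn't name the codebase's language), and `call_graph` is a hypothetical helper. It only resolves calls by simple name, and calls made inside a nested function are also attributed to the enclosing one, so treat it as a starting point rather than a real analysis.

```python
import ast
from collections import defaultdict

def call_graph(source):
    """Map each function name to the set of names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if not isinstance(inner, ast.Call):
                    continue
                func = inner.func
                if isinstance(func, ast.Name):         # plain call: foo()
                    graph[node.name].add(func.id)
                elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                    graph[node.name].add(func.attr)
    return dict(graph)

example = "def a():\n    b()\n    obj.c()\n\ndef b():\n    pass\n"
edges = call_graph(example)
```

Functions that call nothing (like `b` here) simply don't appear as keys.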
So extracting an AST is the first step to developing your own tools for working with a codebase. And with a codebase of this size you just have to write your own tools, adapted to the nature of this particular codebase. So while "trying to make sense using ASTs" really is a bit hard to imagine, trying to make sense of a codebase using all the tools an AST enables you to write is what I had in mind.
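Duplicate detection is another tool of that kind. A minimal sketch, again assuming Python and its `ast` module just for illustration: hash each function's parameters and body while ignoring its name, then group the functions whose hashes collide. `duplicate_functions` is a made-up name, and this only catches exact structural copies, not near-duplicates with renamed variables.

```python
import ast
import hashlib
from collections import defaultdict

def duplicate_functions(source):
    """Return groups of function names whose parameters and bodies match."""
    tree = ast.parse(source)
    groups = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line/column info by default, so two copies at
            # different positions in the file produce the same string.
            body = "".join(ast.dump(stmt) for stmt in node.body)
            key = ast.dump(node.args) + body
            digest = hashlib.sha1(key.encode()).hexdigest()
            groups[digest].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

example = (
    "def f(x):\n    return x * 2\n\n"
    "def g(x):\n    return x * 2\n\n"
    "def h(x):\n    return x + 1\n"
)
dupes = duplicate_functions(example)
```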
An AST for 100M lines would be absolute madness. A call graph just might work, and I'm somewhat hoping that it turns out to be mostly generated or duplicated code.
I also wonder if the OP isn't out of his depth based on the question(s) asked.