I had to google... because it doesn't actually say that it stands for Abstract Syntax Tree haha
Would be nice to highlight what AST stands for in the first sentence of that section! :D
Reading this article might be confusing to someone who's trying to learn what an AST is. ASTs are not unique to Python, they're just a common data structure used in compiler design.
ASTs are used by compilers like this:
1) A compiler will take source code and process it into little pieces called tokens (e.g., a number, an equals sign, a variable type, etc) with a little program called a "lexer".
2) Then, those tokens are processed by a "parser" -- which is a little program that takes as input the tokens from the lexer, as well as a description of a programming language (e.g. a Chomsky context-free grammar in Backus-Naur Form), and outputs an AST.
3) Then finally, the AST nodes are walked and machine code is generated.
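The three steps above can be sketched with Python's own standard library. This is only a rough illustration: the `tokenize` module stands in for the lexer stage, and `compile()` produces CPython bytecode rather than machine code. The sample source string and variable names are made up for the example.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"

# Step 1: lexing. tokenize yields TokenInfo tuples for the source text.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# Drop the NEWLINE and ENDMARKER tokens, whose strings are blank.
token_strings = [tok.string for tok in tokens if tok.string.strip()]
# token_strings == ['x', '=', '1', '+', '2']

# Step 2: parsing. ast.parse returns a Module node, the root of the AST.
tree = ast.parse(source)

# Step 3: code generation. compile() accepts an AST and returns a code object
# (bytecode for the CPython virtual machine, not machine code).
code = compile(tree, filename="<example>", mode="exec")

# Running the generated code binds x in the namespace we hand to exec().
namespace = {}
exec(code, namespace)
```

Anything that wants to analyze code rather than run it can simply stop after step 2 and work on `tree`.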
This article hooks into the AST inside the Python "compiler" between steps 2 and 3 to do some analysis on the AST instead of converting it to something that can be executed (e.g. machine code or some other IR). That is a very useful thing to do, but it's probably not a good introduction to compilers.
If you're new to compilers, I suggest staying away from the Python "ast" module until you're comfortable with general compiler design. Maybe start with playing around with something like PLY instead -- create a simple little language yourself and write a compiler for it:
PLY on the other hand is an amazing resource, thanks for linking it here.
A tangent, but as it relates to that, if anyone reading has ideas on how to apply traditional computer science curriculum, I would love to hear it. I can think of toy CPU emulators, system architecture diagramming, language creation... But not sure if there's a thing I can build that would say, "I understand computer science."
If you've ever taught a course with intro-level Python, it quickly becomes apparent how repeatable the mistakes are, or where you didn't spend enough time. As a student, this is frustrating because the correction comes too late. It's why having someone knowledgeable over your shoulder can speed up your learning.
The challenge that I believe ASTs present is that they can only be built from syntactically valid code. So if someone makes a syntax error, it becomes a whole new ball game. I'd glanced at tree-sitter to see if this could fix some of these issues, but I think it's a more fundamental problem than that.
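To make the limitation concrete, here is a small sketch: `ast.parse` raises `SyntaxError` instead of returning a partial tree. The `try_parse` helper and the sample snippets are made up for illustration.

```python
import ast

def try_parse(source):
    """Return the AST for source, or None if it is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        # There is no partial tree to inspect; parsing is all-or-nothing.
        return None

# A typical beginner mistake: the colon after the for header is missing.
valid = try_parse("for i in range(3):\n    print(i)\n")
broken = try_parse("for i in range(3)\n    print(i)\n")
```

Here `valid` is an `ast.Module` while `broken` is `None`, so any AST-based feedback tool needs a separate path for code that does not parse.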
Existing code editors like VSCode do exactly this for better syntax highlighting of incomplete code.
The Roslyn SDK exposes its syntax tree, symbol table, and semantic model, with the primary use being for custom code analysis. I surprisingly easily made a linter ('analyzer') for a personal style preference, along with 'code fix' (lightbulb suggestion that appears in Visual Studio) through the quick-start tutorial. The resulting .NET assembly integrated impressively with msbuild and Visual Studio, my custom analyzer being indistinguishable in UX from the built-in ones. Seeing the actual syntax tree, especially where the compiler had recovered from syntax errors, also seemed a great learning experience for getting a feel of how the compiler treated errors.
It seems to now be fairly common for .NET projects to develop their own analyzers to enforce specialized best-practices; I wonder if other languages have similar customs?
Besides lisps?
The one big drawback is that the AST for even trivial code patterns has had a history of changing between Python versions. This makes it more annoying than usual to support multiple versions at the same time. Luckily 3.9 and 3.10 haven't brought any changes that impacted my codebase, as far as I've noticed.
The introduction of Constant also deprecated everything it replaced (Str, Num, Bytes, and NameConstant).
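A quick way to see this, assuming Python 3.8 or later, is to parse a few literals and look at the node types. The variable names are just for the example.

```python
import ast

# Four different kinds of literal, one assignment per line.
tree = ast.parse("a = 'text'\nb = 42\nc = b'raw'\nd = None\n")

# Each assignment's right-hand side is a single literal node.
literals = [stmt.value for stmt in tree.body]

# Pair the AST node type with the Python type of the value it carries.
kinds = [(type(node).__name__, type(node.value).__name__) for node in literals]
# kinds == [('Constant', 'str'), ('Constant', 'int'),
#           ('Constant', 'bytes'), ('Constant', 'NoneType')]
```

Code that has to support older interpreters as well would need to handle the legacy node classes too, which is exactly the multi-version annoyance described above.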
There's also the introduction of f-strings (represented in the AST as JoinedStr), and various nodes being duplicated for their async versions (e.g. AsyncFunctionDef, AsyncFor, AsyncWith).
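For reference, a small sketch of what those nodes look like, assuming Python 3.8 or later. The snippet contents (`name`, `fetch`, `stream`, `lock`) are arbitrary; nothing is executed, only parsed.

```python
import ast

# An f-string parses to a JoinedStr whose values alternate between
# literal text and FormattedValue placeholders.
fstring = ast.parse('f"hello {name}!"', mode="eval").body
part_types = [type(part).__name__ for part in fstring.values]
# part_types == ['Constant', 'FormattedValue', 'Constant']

# Async constructs get their own node classes rather than a flag
# on the synchronous ones.
async_source = (
    "async def fetch():\n"
    "    async for item in stream:\n"
    "        pass\n"
    "    async with lock:\n"
    "        pass\n"
)
func = ast.parse(async_source).body[0]
async_types = [type(func).__name__] + [type(stmt).__name__ for stmt in func.body]
# async_types == ['AsyncFunctionDef', 'AsyncFor', 'AsyncWith']
```

A visitor that only defines `visit_FunctionDef` will silently skip `async def` functions, which is an easy thing to miss.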
Probably more relevant to automatically discovering signatures would be the addition of positional-only arguments to the `arguments` object.
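As a sketch of what that means for signature discovery (Python 3.8+, where the `/` syntax exists), the three kinds of named parameters end up in three separate lists. The function `f` is a throwaway example.

```python
import ast

# a and b are positional-only, c is a regular parameter, d is keyword-only.
func = ast.parse("def f(a, b, /, c, *, d): pass").body[0]
args = func.args

posonly = [a.arg for a in args.posonlyargs]  # ['a', 'b']
regular = [a.arg for a in args.args]         # ['c']
kwonly = [a.arg for a in args.kwonlyargs]    # ['d']
```

Code written before 3.8 that only reads `args.args` and `args.kwonlyargs` would miss `a` and `b` entirely.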
But messing with the AST is definitely a lot more stable than messing with the bytecode.
I've written simple code using the AST-visitor approach to enforce some common standards on code within our company. Simple things like ensuring that when we use Troposphere to generate AWS CloudFormation templates we always set up some specific values. (For example, I wrote a checker to ensure that every time an ECR instance is created we must enable ScanOnPush, or every time we declare a security group we must have a comment "[cloudformation] ..." with it, so that manual edits stand out.)
The stuff that ASTs let you do really flexibly is almost always lost to people because they're not aware of it. A lot of other developers would try to do this with string or regex matching, and that often leads to painful experiences.
A call to function "Foo" must always have an argument matching the regexp "/blah/". Otherwise raise an error.
And they're so lightweight you can add them to any CI/CD/automation steps in your repository. Once you get a few things like that, or validating naming-standards, you can roll them up into a simple "linter".
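A minimal sketch of the "Foo" rule as an `ast.NodeVisitor` might look like the following. The class name, the `check` helper, and the decision to only look at string-literal arguments are my assumptions, not anyone's actual company checker.

```python
import ast
import re

class FooArgumentChecker(ast.NodeVisitor):
    """Flag calls to Foo(...) that lack a string literal matching a pattern."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
        self.errors = []

    def visit_Call(self, node):
        # Only plain calls like Foo(...) are checked, not module.Foo(...).
        if isinstance(node.func, ast.Name) and node.func.id == "Foo":
            candidates = list(node.args) + [kw.value for kw in node.keywords]
            # Assumption: only string literals can satisfy the rule.
            literals = [
                arg.value
                for arg in candidates
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str)
            ]
            if not any(self.pattern.search(text) for text in literals):
                self.errors.append(
                    f"line {node.lineno}: Foo() call has no argument "
                    f"matching {self.pattern.pattern!r}"
                )
        # Keep walking so nested calls are checked too.
        self.generic_visit(node)

def check(source, pattern="blah"):
    """Return a list of error messages for the given source text."""
    checker = FooArgumentChecker(pattern)
    checker.visit(ast.parse(source))
    return checker.errors
```

In CI you would run something like `check(path.read_text())` over each file and fail the build if the list is non-empty. Unlike a regex over the raw text, this is not fooled by `Foo` appearing in comments, strings, or across line breaks.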
Why does the Name() need to know its own context? It seems like the manner in which Name() is used, whether load, store, or delete, is fully determined by the Name()'s parent and whether the parent uses the Name() as a target or as a value.
But not having ctx available as an explicit value would make certain AST manipulations, as well as bytecode generation by the interpreter, more complicated. This is because in both scenarios, checking the context would involve checking the parent instead of a child, and finding the parent in a tree is often much harder than finding a child.
-- the following expression can appear in assignment context
| Attribute(expr value, identifier attr, expr_context ctx)
| Subscript(expr value, expr slice, expr_context ctx)
| Starred(expr value, expr_context ctx)
| Name(identifier id, expr_context ctx)
| List(expr* elts, expr_context ctx)
| Tuple(expr* elts, expr_context ctx)
Perhaps it's just neater to have the context there on every one of them.

Also, from a pragmatic standpoint, when actually processing the AST and analyzing it semantically, you're pretty much always going to be handling expressions and patterns (= rvalues/lvalues) differently, because they mean completely different things! Having the context right there makes that more convenient to handle. So when designing an AST datatype, you could just as well not include the context in Name() and it'd be fine, but the Python `ast` module's primary use case is compiling the AST to bytecode, where it's more convenient to just have that info around, so that's what they did.
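To see the contexts in practice, here is a small sketch that collects the ctx of every Name in a snippet without ever looking at a parent node. The sample statements are arbitrary.

```python
import ast

tree = ast.parse("a = b\ndel c\nobj.attr = items[0]\n")

# Each Name carries its own ctx, so a flat walk is enough;
# no parent lookup is required to tell loads from stores.
name_contexts = {
    node.id: type(node.ctx).__name__
    for node in ast.walk(tree)
    if isinstance(node, ast.Name)
}
# name_contexts == {'a': 'Store', 'b': 'Load', 'c': 'Del',
#                   'obj': 'Load', 'items': 'Load'}

# The assignment target obj.attr is an Attribute in Store context,
# even though the Name inside it (obj) is only loaded.
target = tree.body[2].targets[0]
target_context = (type(target).__name__, type(target.ctx).__name__)
# target_context == ('Attribute', 'Store')
```

Note that `ast` nodes have no parent pointer, so without ctx a visitor would have to track the enclosing statement itself to answer the same question.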
However, Python has a built-in tokenize module for that as well.
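As a rough illustration, `tokenize` can still split up some source that `ast.parse` rejects outright. The `token_pairs` helper and the deliberately broken input are made up for the example, and tokenize has its own failure modes (an unclosed bracket, for instance), so this is not a general answer to invalid code.

```python
import ast
import io
import tokenize

def token_pairs(source):
    """Return (token type name, token text) pairs, skipping layout tokens."""
    readline = io.StringIO(source).readline
    skipped = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skipped
    ]

bad_source = "x = = 1\n"  # a doubled equals sign, not valid Python

pairs = token_pairs(bad_source)
# pairs == [('NAME', 'x'), ('OP', '='), ('OP', '='), ('NUMBER', '1')]

try:
    ast.parse(bad_source)
    parsed = True
except SyntaxError:
    parsed = False
# parsed is False: the same text has no AST.
```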