Learn Python ASTs, by building your own linter | Better HN

32 comments

31 comments · 9 top-level

jmac014y ago· 5 in thread

> "So what is an AST?"

I had to google... because it doesn't actually say that it stands for Abstract Syntax Tree haha

Would be nice to highlight what AST stands for in the first sentence of that section! :D

In the context of this article -- it's mostly just talking about Python-specific ASTs.

Reading this article might be confusing to someone who's trying to learn what an AST is. ASTs are not unique to Python, they're just a common data structure used in compiler design.

ASTs are used by compilers like this:

1) A compiler will take source code and process it into little pieces called tokens (e.g., a number, an equals sign, a variable type, etc) with a little program called a "lexer".

2) Then, those tokens are processed by a "parser" -- which is a little program that inputs the tokens from the lexer, as well as a description of a programming language (e.g. a Chomsky context-free grammar in Backus Naur Form) and outputs an AST.

3) Then finally, the AST nodes are walked and machine code is generated.

This article hooks into the AST inside the Python "compiler" between steps 2 and 3 to do some analysis on the AST instead of converting it to something that can be executed (e.g. machine code or some other IR). That is a very useful thing, but probably not a good introduction to compilers.
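For a rough feel of those three steps in CPython terms, here is a minimal sketch using only the standard library. The sample source string and variable names are made up for illustration. Note that CPython emits bytecode rather than machine code, and `ast.parse` does its own tokenizing internally, so the `tokenize` call is only there to make step 1 visible.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"  # hypothetical one-line program

# Step 1: lexing. The tokenize module yields NAME, OP, NUMBER, ... tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# Drop the NEWLINE and ENDMARKER tokens, whose text is blank.
token_strings = [tok.string for tok in tokens if tok.string.strip()]

# Step 2: parsing. ast.parse tokenizes and parses the text and returns the tree.
tree = ast.parse(source)

# A linter stops here and inspects the tree instead of generating code.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Step 3: code generation. compile() turns the tree into a bytecode object.
code = compile(tree, filename="<example>", mode="exec")
namespace = {}
exec(code, namespace)

print(token_strings)
print(node_types)
print(namespace["x"])
```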

If you're new to compilers, I suggest staying away from the Python "ast" module until you're comfortable with general compiler design. Maybe start with playing around with something like PLY instead -- create a simple little language yourself and write a compiler for it:

<https://www.dabeaz.com/ply/ply.html#ply_nn2>

tusharsadhwaniOP4y ago

I'll agree, it's not a good introduction to compilers, but it isn't meant to be.

PLY on the other hand is an amazing resource, thanks for linking it here.

jmac014y ago

Thanks!

tusharsadhwaniOP4y ago

My bad xD, to my credit it's mentioned later in the article. But you're right, I should add that in the beginning.

jmac014y ago

NP all good I'm just a complete noob so was confused and waiting for that haha! Thanks :)

popotamonga4y ago· 4 in thread

In general, for me at least, I find the best way to learn about something is to work in the 'internals'. For instance, when React came out I couldn't wrap my head around it, so I started my own JS framework, and it ended up almost exactly like React (then I dumped it, as it ended up just being a learning exercise).

sarupbanskota4y ago

You'll enjoy https://codecrafters.io

I'm trying to become a top 0.01% JS user and creating linters, flavors, etc is my plan as well. I've read through and annotated the React codebase but it didn't stick very well. I would have done better to create my own framework! I keep having to relearn that lesson... I can have a lot of knowledge about a thing through reading, but knowledge of the thing requires some practical application.

A tangent, but as it relates to that, if anyone reading has ideas on how to apply traditional computer science curriculum, I would love to hear it. I can think of toy CPU emulators, system architecture diagramming, language creation... But not sure if there's a thing I can build that would say, "I understand computer science."

agumonkey4y ago

I forget who coined the saying (Feynman or someone else), but I'm definitely in the camp that needs to build something to feel at home with it.

alansammarone4y ago

I believe you are referring to "What I cannot create, I do not understand", which is indeed by Feynman.

benhoff4y ago· 3 in thread

I've often thought it would be cool to use ASTs, and perhaps code embeddings generated from machine learning, as a tool to help students improve.

If you've ever taught a course with intro-level Python, it quickly becomes apparent how repeatable the mistakes are, or where you didn't spend enough time. As a student, this is frustrating because the correction comes too late; that's why having someone knowledgeable over your shoulder can speed up your learning.

The challenge that I believe ASTs present is that a standard parser only produces one for syntactically valid code. So if someone makes a syntax error, it becomes a whole new ball game. I'd glanced at tree-sitter to see if it could fix some of these issues, but I think it's a more fundamental problem than that.
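To make the limitation concrete, here is a small sketch of what the standard `ast` module does with a student-style mistake. The helper name `try_parse` and the sample snippet are hypothetical. The parser does not return a partial tree; it raises `SyntaxError`, which at least carries a line number.

```python
import ast

def try_parse(source):
    """Return (tree, None) on success, or (None, error) on a syntax error."""
    try:
        return ast.parse(source), None
    except SyntaxError as exc:
        return None, exc

# Missing colon after the function signature.
broken = "def greet(name)\n    print(name)\n"
tree, error = try_parse(broken)

if error is not None:
    # lineno and msg are standard SyntaxError attributes.
    print(f"line {error.lineno}: {error.msg}")
```

Everything after the first error is lost, which is why feedback tooling built on `ast` alone only works once the code parses.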

tusharsadhwaniOP4y ago

tree-sitter can definitely help with this problem, but so can regular AST parsers. The idea is the same: just add code or grammar rules that will parse the "invalid" syntax, mark it as invalid, and continue parsing valid code as soon as possible.

Existing code editors like VSCode do exactly this for better syntax highlighting of incomplete code.

hsbauauvhabzb4y ago

Wouldn’t that be impossible? The structure of python is finite, and invalid deviations are infinite. Sure any language AST compiler could be more helpful, but they can’t take trash and turn it to gold.

TuringTest4y ago

In the context of programming lessons, there is a known correct program. Wouldn't it be possible to calculate the distance between what has been typed and how the correct finished program should be, to guide students into correcting the non-parsing parts of their code?
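One cheap way to try this idea without parsing at all is a text diff against the reference solution. This is only a sketch using the standard `difflib` module; the function names and the sample programs are invented here, and a real tool would probably want something smarter than line equality.

```python
import difflib

def similarity(attempt, reference):
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    return difflib.SequenceMatcher(None, attempt, reference).ratio()

def differing_lines(attempt, reference):
    """Return 1-based line numbers of the attempt that differ from the reference.

    Lines that exist only in the reference (pure insertions) are not reported.
    """
    a = attempt.splitlines()
    b = reference.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b)
    lines = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            lines.extend(range(i1 + 1, i2 + 1))
    return lines

reference = "def add(a, b):\n    return a + b\n"
attempt = "def add(a, b)\n    return a + b\n"  # missing colon, does not parse

print(differing_lines(attempt, reference))
print(round(similarity(attempt, reference), 3))
```

Because it works on raw text, this keeps working on code that does not parse, at the cost of flagging correct solutions that are merely written differently.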

mgdlbp4y ago· 2 in thread

Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.

The Roslyn SDK exposes its syntax tree, symbol table, and semantic model, with the primary use being for custom code analysis. I surprisingly easily made a linter ('analyzer') for a personal style preference, along with 'code fix' (lightbulb suggestion that appears in Visual Studio) through the quick-start tutorial. The resulting .NET assembly integrated impressively with msbuild and Visual Studio, my custom analyzer being indistinguishable in UX from the built-in ones. Seeing the actual syntax tree, especially where the compiler had recovered from syntax errors, also seemed a great learning experience for getting a feel of how the compiler treated errors.

It seems to now be fairly common for .NET projects to develop their own analyzers to enforce specialized best-practices; I wonder if other languages have similar customs?

https://docs.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/

eyelidlessness4y ago

> Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.

Besides lisps?

tusharsadhwaniOP4y ago

That's amazing. I'm interested in C# now

apurtbapurt4y ago· 2 in thread

I maintain some code that relies on the Python AST for finding and packaging modules with appropriate class signatures when building customer-specific distributions. It works really well most of the time. And it is a lot easier to maintain than 50+ separate wheel definitions.

The one big drawback is that the AST for even trivial code patterns has had a history of changing between Python versions. This makes it more annoying than usual to support multiple versions at the same time. Luckily 3.9 and 3.10 haven't brought any changes that impacted my codebase, as far as I've noticed.

tusharsadhwaniOP4y ago

The only major changes that I'm aware of since Python 3 have been the change with keyword arguments in 3.6, and the deprecation of Index and introduction of Constant more recently. Those are big changes, but relatively small and maintainable imo. What challenges have you faced?

> the deprecation of Index and introduction of Constant more recently.

The introduction of Constant also deprecated everything it replaced (Str, Num, Bytes, and NameConstant).

There's also the introduction of f-strings (ast'd as JoinedStr), and various nodes being duplicated for their async version.

Probably more relevant to automatically discovering signatures would be the addition of positional-only arguments to the `arguments` object.

But messing with the AST is definitely a lot more stable than messing with the bytecode.
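A quick way to see several of these node shapes on your own interpreter is to parse a few snippets and look at the result. This sketch assumes Python 3.8 or newer (where `posonlyargs` exists and literals parse as `Constant`); the sample function `f` is made up.

```python
import ast

# Positional-only (before /) and keyword-only (after *) parameters.
func = ast.parse("def f(a, /, b, *, c): return 'hi'").body[0]

posonly = [arg.arg for arg in func.args.posonlyargs]
regular = [arg.arg for arg in func.args.args]
kwonly = [arg.arg for arg in func.args.kwonlyargs]

# String literals parse as ast.Constant rather than the older ast.Str.
literal = func.body[0].value

# f-strings parse as ast.JoinedStr.
fstring = ast.parse("f'{x}!'", mode="eval").body

# async def gets its own node type instead of reusing FunctionDef.
async_func = ast.parse("async def g(): pass").body[0]

print(posonly, regular, kwonly)
print(type(literal).__name__, type(fstring).__name__, type(async_func).__name__)
```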

stevekemp4y ago· 2 in thread

That's a pretty awesome read, and the approach is pretty flexible.

I've written simple code using the AST-visitor approach to enforce some common standards on code within our company. Simple things like ensuring that when we use Troposphere to generate AWS CloudFormation templates we always set up some specific values. (For example I wrote a checker to ensure that every time an ECR instance is created we must enable ScanOnPush, or every time we declare a security-group we must have a comment "[cloudformation] ..." with it - so that manual edits stand out.)

tusharsadhwaniOP4y ago

Thanks!

The stuff that ASTs let you do really flexibly is almost always lost to people because they're not aware of it. A lot of other developers would try to do this with string or regex matching, and that often leads to painful experiences.

stevekemp4y ago

Agreed. Simple checks like these are trivial:

A call to function "Foo" must always have an argument matching the regexp "/blah/". Otherwise raise an error.

And they're so lightweight you can add them to any CI/CD/automation steps in your repository. Once you get a few things like that, or validating naming-standards, you can roll them up into a simple "linter".
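A check like the "Foo" one could look roughly like this. It is a sketch, not the commenter's actual code: the class name, the `check` helper, and the decision to look only at positional string-literal arguments of a bare `Foo(...)` call are all assumptions made for the example.

```python
import ast
import re

class FooArgumentChecker(ast.NodeVisitor):
    """Record the line of every Foo(...) call lacking a matching string argument."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
        self.errors = []

    def visit_Call(self, node):
        # Only handles plain names like Foo(...), not module.Foo(...).
        if isinstance(node.func, ast.Name) and node.func.id == "Foo":
            string_args = [
                arg.value
                for arg in node.args
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str)
            ]
            if not any(self.pattern.search(value) for value in string_args):
                self.errors.append(node.lineno)
        # Keep walking so nested calls are checked too.
        self.generic_visit(node)

def check(source, pattern="blah"):
    checker = FooArgumentChecker(pattern)
    checker.visit(ast.parse(source))
    return checker.errors

sample = 'Foo("blah blah")\nFoo("nope")\nBar("x")\n'
print(check(sample))
```

In CI, a non-empty list from `check` would translate into a non-zero exit code.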

jpeloquin4y ago· 2 in thread

One thing I don't understand from the article: "Hopefully this explains why we explicitly need to tell each variable whether it is in a load or store context in the AST."

Why does the Name() need to know its own context? It seems like the manner in which Name() is used, whether load, store, or delete, is fully determined by the Name()'s parent and whether the parent uses the Name() as a target or as a value.

tusharsadhwaniOP4y ago

I can't think of any exceptions to this right away, so you might be right.

But not having ctx available as an explicit value would make certain AST manipulations, as well as bytecode generation by the interpreter, more complicated. This is because in both scenarios, checking for the context would involve checking the parent instead of a child. And finding the parent in a tree is often much harder than finding a child.
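As an illustration of that convenience, here is a small sketch that sorts names into "read" and "written" using nothing but `ctx`. The function name and sample statement are hypothetical. Since `ast` nodes carry no parent pointers, doing the same thing without `ctx` would mean tracking the parent yourself during the walk.

```python
import ast

def names_by_context(source):
    """Return (loaded, stored) name lists, each sorted alphabetically."""
    loaded, stored = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stored.append(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.append(node.id)
    return sorted(loaded), sorted(stored)

loaded, stored = names_by_context("total = price * count\n")
print(loaded, stored)
```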

uryga4y ago

yeah, that's not really explained in the article, and it's not explained in the `ast` docs either [1]. for reference, the full list of assignment targets is:

         -- the following expression can appear in assignment context
         | Attribute(expr value, identifier attr, expr_context ctx)
         | Subscript(expr value, expr slice, expr_context ctx)
         | Starred(expr value, expr_context ctx)
         | Name(identifier id, expr_context ctx)
         | List(expr* elts, expr_context ctx)
         | Tuple(expr* elts, expr_context ctx)

perhaps it's just neater to just have the context there on every one of them.

also, from a pragmatic standpoint, when actually processing the AST and analyzing it semantically, you're pretty much always going to be handling expressions and patterns (= rvalues/lvalues) differently, bc they mean completely different things! and having the context right there makes it more convenient to handle. so like, when designing an AST datatype, you could just as well not include the context in Name() and it'd be fine, but the python `ast` module's primary usecase is compiling the AST to bytecode, where it's more convenient to just have that info around, so that's what they did.

[1] https://docs.python.org/3/library/ast.html#ast.Load

wbkang4y ago· 1 in thread

This is a cool post thank you. I knew about ASTs but did not know how to build them easily for Python so the second half was very useful for me.

tusharsadhwaniOP4y ago

I really wasn't expecting anyone to read all of it. I was afraid people would find it either too trivial or too complex, depending on skill level. So that's great to hear.

ambrose24y ago· 1 in thread

This was a really nice read! The best part was learning that there’s no need to actually parse tokens when building a Python linter (well, maybe there’s an exception) because you can leverage the already parsed AST or CST.

tusharsadhwaniOP4y ago

True! Although there are some lints that would require you to parse tokens, such as checking for single vs. double quotes, or the number of spaces used for indentation.

However, python has a builtin tokenize module for that as well.
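A quote-style lint on top of `tokenize` might be sketched like this. The function name and the sample input are made up, and it only looks at plain `STRING` tokens; on Python 3.12+ f-strings are split into separate token types and would not be caught by this check.

```python
import io
import tokenize

def double_quoted_strings(source):
    """Return (line, text) for each string token written with double quotes."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            # Skip any prefix such as r, b, u or f before checking the quote.
            body = tok.string.lstrip("rRbBuUfF")
            if body.startswith('"'):
                found.append((tok.start[0], tok.string))
    return found

sample = "a = 'x'\nb = \"y\"\n"
print(double_quoted_strings(sample))
```

The AST cannot answer this question because both literals become the same `Constant` node; the quote character only survives at the token level.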
