Formatting a 25M-line codebase overnight

(stripe.dev)

184 points | r00k 5d ago | 91 comments

CrzyLngPwd 5d ago

One of my first jobs was at a small software company that wrote software for a small number of clients, in MS BASIC PDS.

The lead developer didn't like to bother with formatting code, so I wrote a tool called makenice to format his nasty spaghetti gibberish into something with good indents and layout to make it easier for us normal people to parse.

He was furious, literally spun in circles about it right in the office in front of everyone, so I wrote makenasty to format code into the way he appeared to like.

I only shared makenasty/nice with a couple of the team, who loved it, as it allowed easy conversion between something readable and something the team lead liked.

He never knew about makenasty.

munificent 5d ago

> We chose a Saturday to format the entire codebase to avoid merge conflicts. And while our test suite gave us high confidence we'd gotten everything right, it's always a bit daunting to have a diff so large that GitHub can't render it.

The Dart formatter has an internal sanity check. It walks through the unformatted and formatted strings in parallel, skipping any whitespace. If any non-whitespace characters don't match, it immediately aborts. This ensures that the only thing the formatter changes is whitespace, and makes it much less spooky to run it blind on a huge codebase.

That sanity check has saved my ass a couple of times when weird bugs crept in, usually around unusual combinations of language features around new syntax.

(Unfortunately, over the past year the formatter has gotten a little more flexible about the kinds of changes it makes, including sometimes moving comments relative to commas and brackets, so this sanity check skips some punctuation characters too, making it a little less reliable.)
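For illustration, the check described above might look roughly like this. It is a Python sketch, not the Dart formatter's actual code; the function name and the `ignored` parameter are invented here, and `ignored` only approximates the "skips some punctuation" behavior.

```python
def only_whitespace_changed(before: str, after: str, ignored: str = "") -> bool:
    """Walk both strings in parallel, skipping whitespace (and any characters
    listed in `ignored`), and stop at the first mismatch."""

    def significant(text: str):
        # Lazily yield only the characters the check cares about.
        return (ch for ch in text if not ch.isspace() and ch not in ignored)

    a, b = significant(before), significant(after)
    end = object()  # sentinel marking "this side ran out of characters"
    while True:
        x, y = next(a, end), next(b, end)
        if x != y:
            return False  # characters differ, or one side ended early
        if x is end:
            return True  # both sides ended together with no mismatch
```

With this sketch, `only_whitespace_changed("foo( a,b )", "foo(a, b)")` is true and a dropped argument such as `"foo(a)"` is rejected. Passing `ignored=",;"` accepts `"[a, b,]"` versus `"[a, b]"`, which is exactly where the check gets weaker.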

Terr_ 5d ago

I imagine a fancier version would be to compare the Abstract Syntax Trees.
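A rough sketch of that idea, using Python's built-in `ast` module as a stand-in for whatever parser the language in question provides (`same_ast` is a made-up name, not part of any formatter):

```python
import ast


def same_ast(before: str, after: str) -> bool:
    """Return True if both sources parse to the same syntax tree."""
    # ast.dump() leaves out line and column attributes by default,
    # so pure layout differences do not show up in the comparison.
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))
```

Note that Python's `ast` does not keep comments, so with this particular parser a formatter that dropped a comment would still pass.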

munificent 5d ago

The balancing act is that the fancier your sanity check, the greater the chance of something slipping through its cracks too. Walking two strings in parallel is very simple and hard to get wrong. Traversing an AST and skipping a branch is exactly the kind of easy-to-make bug that the sanity check is designed to catch.

What I'd like to do is something somewhere in the middle where I walk the token stream and check that every token of the input ended up in the output, but I haven't figured out a simple and fast way to do that yet. Performance is particularly tricky because I obviously don't want to burn a bunch of CPU cycles on a sanity check that exists only to catch bugs.
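A naive version of that middle ground, sketched in Python. The regex tokenizer is a deliberately crude assumption (it knows nothing about string literals, comments, or multi-character operators), the names are invented, and it makes no attempt to address the performance concern:

```python
import re

# Hypothetical tokenizer: a run of word characters is one token,
# and any other non-space character is a token by itself.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokens(source: str) -> list[str]:
    return TOKEN.findall(source)


def same_tokens(before: str, after: str) -> bool:
    """Return True if every token survives, in the same order."""
    return tokens(before) == tokens(after)
```

Unlike a character-by-character walk, this notices when two tokens are glued together: `same_tokens("a 1", "a1")` is false because `a1` is a single token.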

saghm 5d ago

I've always thought it would make sense for formatters to be baked into the toolchain so that they can reuse the language's parser (presumably exposed as a library). They would parse to an AST and then print it back out, so the output is guaranteed to be correct and normalized. This doesn't seem to be how most formatters work in practice, though I'm not sure whether that's for performance reasons or because language toolchains don't expose the parser.
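As a small illustration of "parse to an AST, then print it back out", Python's standard library can do this with `ast.parse` and `ast.unparse` (Python 3.9+). This is only a sketch of the idea; `normalize` is a made-up name, and real formatters do much more (line wrapping, style options).

```python
import ast


def normalize(source: str) -> str:
    """Parse the source and print the tree back out in a canonical layout."""
    return ast.unparse(ast.parse(source))


print(normalize("x=[1,\n   2 ]"))  # x = [1, 2]
print(normalize("x = 1  # why"))   # x = 1
```

The second call shows one possible reason formatters rarely work this way: a plain AST like Python's drops comments, so a formatter needs a tree (or side channel) that preserves them.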

kevin_thibedeau 5d ago

That is essentially what clang-format is.

caminanteblanco 5d ago

The only issue is that you're then at the mercy of whatever parser your formatter uses to construct the AST.

Terr_ 5d ago

Well, if any (common, non-hobby) parser is thrown off by the reformatting, then it's probably not a safe reformatting either way.

hyperhello 5d ago

Strictly speaking that wouldn’t work, since a1 is different from a 1, for example.

saghm 5d ago

I don't think they're trying to say that this is a sufficient test for correctness, just a necessary one.

munificent 5d ago

Correct. It won't catch 100% of possible bugs, but it will catch most.

The easiest kind of bug to write in a formatter is dropping a bit of syntax on the floor and forgetting to include it in the output, and the sanity check will catch those.

It's also definitely possible to miss some whitespace that's necessary for things like identifier separation, but... <shrug> it's a sanity check, not a proof of correctness.

Skeime 5d ago

Lots of formatters also unify things like trailing commas, so it would be slightly more involved than this.

munificent 5d ago

Yes, the Dart formatter does that now too. So the sanity check ignores commas and semicolons, which makes it less robust as a sanity check, unfortunately.

hobofan 5d ago

I'm surprised they went with an all-at-once reformat. Even when doing it over a weekend, this is bound to mess with a lot of open PRs at their scale.

I've had to introduce a formatter in a few sizeable codebases in the past (a few 100k to a few million LOC), and I always did it incrementally, via a script that reformatted all files not touched in any open PR. The initial run reformatted 95% of all files. Then I ran the script every day for ~two weeks and got up to 99.5% of all files, and after that ran it manually each time one of the remaining ~dozen PRs that had been WIP for longer was merged.
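The core of such a script is just a set difference. A minimal Python sketch, assuming you already have the list of files each open PR touches (how to fetch that depends on your hosting platform and is left out; the function name, PR numbers, and paths below are all made up):

```python
def files_safe_to_format(
    all_files: list[str], open_pr_files: dict[int, set[str]]
) -> list[str]:
    """Return the files that no open PR touches, in sorted order."""
    touched: set[str] = set()
    for files in open_pr_files.values():
        touched |= files
    return sorted(set(all_files) - touched)


all_files = ["api/charge.rb", "api/refund.rb", "lib/util.rb"]
open_prs = {101: {"api/refund.rb"}, 102: {"api/refund.rb", "lib/util.rb"}}
print(files_safe_to_format(all_files, open_prs))  # ['api/charge.rb']
```

Re-running it daily picks up more files as PRs merge, which matches the 95% then 99.5% progression described above.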

rileymichael 5d ago

both options have their pros and cons. if you utilize some form of ratcheting[1], you can sneak it in without your team knowing.. but all of your PRs for the foreseeable future will have a ton of reformatting screwing with your git blame. if you do it all at once, someone will have to sort out conflicts, but you can utilize `blame.ignoreRevsFile`[2] so that your history remains useful

[1] https://github.com/diffplug/spotless/tree/main/plugin-gradle...

[2] https://git-scm.com/docs/git-blame#Documentation/git-blame.t...

yorwba 5d ago

Even if you spread out reformatting over multiple commits in different PRs, you can still make use of blame.ignoreRevsFile as long as your pull request workflow doesn't enforce squash merges even when somebody took extra care to produce a nice commit history.

hobofan 5d ago

Yes, that is a good point. This is also why I personally would recommend letting a central person/team handle the reformatting rather than sneaking it into every PR (see my sibling comment). That way you can be in charge of having a uniform style of commit messages, which makes the reformat commits easy to identify, and of creating a well-kept ignoreRevsFile. I think that provides the best of both worlds.
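As a sketch of how a uniform commit message style helps: given (full SHA, subject) pairs, for example parsed from `git log`, a few lines of Python can produce the contents of an ignore-revs file. The `style: reformat` prefix and the function name are assumptions for illustration; `.git-blame-ignore-revs` is only a commonly used file name.

```python
def ignore_revs_file(
    commits: list[tuple[str, str]], prefix: str = "style: reformat"
) -> str:
    """Build ignore-revs file contents from (full_sha, subject) pairs.

    The file format is one full commit hash per line; lines starting
    with '#' are comments.
    """
    lines = ["# Formatting-only commits to skip in git blame"]
    for sha, subject in commits:
        if subject.startswith(prefix):
            lines.append(f"# {subject}")
            lines.append(sha)
    return "\n".join(lines) + "\n"
```

Write the result to `.git-blame-ignore-revs` and point Git at it with `git config blame.ignoreRevsFile .git-blame-ignore-revs`.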

BobbyTables2 5d ago

That’s a neat feature, thanks for sharing.

Unfortunately, I find that code bases lacking auto-formatting are often littered with non-functional changes, as developers temporarily instrument code, remove it, but leave whitespace changes behind.

In terms of tracking code changes, one really would have to rewrite the entire history with each commit reformatted.

WhyNotHugo 5d ago

> I'm surprised they went with an all-at-once reformat. Even when doing it over a weekend, this is bound to mess with a lot of open PRs at their scale.

Rebasing PRs should be trivial: rewrite each commit in the PR, reformatting the files it touches, then rebase, `git checkout --theirs`, run the formatter again, and `git rebase --continue`. It's methodical and scriptable; you don't need to resolve any conflicts manually.

skydhash 5d ago

You can always let the team know so that they can apply the formatter on their PR branch.

hobofan 5d ago

In the smaller migrations I did, I tried that, but one way or another a decent chunk of people still managed to get stuck in merge/rebase conflicts. I would almost explicitly recommend against giving that advice to teams.

My rough blueprint for introducing a formatter or linter nowadays would be:

- Recorded knowledge share session around how to set up the tools for local use 1-2 weeks before the initial rollout, and outline how the process will take place

- On the day of the initial rollout send out a reminder + the recording again

- Do the initial PR

- Incrementally do the rest of the migration, and subscribe to the PRs that drag out the process

jrajav 5d ago

This is exactly the remedy to the PR issue. I've "lucked" into owning a Prettier formatting pass at two different places now, and did the same process at each - full pass on master, simple step-by-step process to follow to update any PR by running the format script.

zx8080 5d ago

It's probably because the author can shit on others (let me guess, a senior principal something engineer).

varun_ch 5d ago

I’m shocked at the 25M line part! That is a completely unfathomable amount of code for one codebase. I really want to know more about that.

bruckie 5d ago

Only 25 million? :) Google had billions a decade ago...

https://research.google/pubs/why-google-stores-billions-of-l...

jsnell 5d ago

Right, where is the rest of the code?

mr_mitm 5d ago

They're up to 42 million now, as per the article.

lukan 5d ago

That sounds even more insane to me, but I guess most of that code does not really touch financial transactions; otherwise it would be a nightmare to be responsible for verifying it.

eigenblake 5d ago

Really reminds me that there's nothing in principle stopping us from storing parse trees and exposing them via something Git-like, so we can avoid even needing to format, let alone needing to resolve a whole category of merge conflicts caused by that formatting. I mean a format is just a theme over your data -- I mean code.

dgrin91 5d ago

I don't understand why they felt the need to do a big-bang merge like this. It's a formatter, so the files should be functionally equivalent before and after. Why not just enable it for new and edited files for a while, then, once comfortable, apply it to old files in batches? What advantage does the big-bang merge give? It seems higher risk for the same reward.

fsckboy 5d ago

it could introduce subtle bugs, so you do it all at once while it's at the front of everybody's mind, with as much review and testing of the parts or the whole as it takes to satisfy everybody. if you don't do it all at once, you'd need to repeat the same amount of testing multiple times.

>files should be functionally equivalent before and after

when you say something like this, the road you are on is paved with good intentions.

riffraff 5d ago

You can also introduce subtle bugs in your own feature development, and if you change formatting there it's also at the front of your mind.

I think the main argument for doing a big bang rewrite is that you have a defined before/after; otherwise you're stuck in an endless in-between.

nitwit005 5d ago

> Given that complexity, the hypothesis was simple: tackle the hardest syntax first and the rest will follow.

Always nice to see. I've seen people fall into the trap of designing for the common case, not realizing most of the code will be to deal with the less common cases.

burnte 5d ago

The floating spiral thing is so distracting I spent more time deleting it in Inspector than reading the article. I feel like they hate their readers. Awful.

tmaly 5d ago

How did I know this was going to be a rewrite in Rust?

comrade1234 5d ago

Man, must be nice to have the time to put so much work into tabs.

ryanisnan 5d ago

Cool story. The treat at the end was fun as well, thank you!

hokkos 5d ago

Now it makes me wonder, are those 45M LoC untyped?

c3ab8ff137 5d ago

No, Stripe has its own Ruby typechecker - https://sorbet.org/

m12k 5d ago

https://brandur.org/nanoglyphs/015-ruby-typing#ruby-typing

CrzyLngPwd 5d ago

Surely, it no longer needs to be human-readable, and the era of write-only code is finally upon us with the dawn of AI writing our mealtickets.

Why bother formatting 25m lines of slop, and why is AI wasting tokens on making code look human-readable anyway?

throwatdem12311 5d ago

What is even the point of formatting code anymore?

cadamsdotcom 5d ago

An insight about code is that, compared to the scale at which we operate on data, code as text is tiny. Instantaneous git operations and “run this tool over all the code” are the norm, even while we wait for LLMs to stream their tokens back so tool calls can operate on them.

That insight might seem obvious - but if you stay cognizant of it as you work, you can invent some pretty amazing tooling for yourself & your team.
