The lead developer didn't like to bother with formatting code, so I wrote a tool called makenice to format his nasty spaghetti gibberish into something with good indents and layout to make it easier for us normal people to parse.
He was furious, literally spun in circles about it right in the office in front of everyone, so I wrote makenasty to format code into the way he appeared to like.
I only shared makenasty/nice with a couple of the team, who loved it, as it allowed easy conversion between something readable and something the team lead liked.
He never knew about makenasty.
The dart formatter has an internal sanity check. It walks through the unformatted and formatted strings in parallel skipping any whitespace. If any non-whitespace characters don't match, it immediately aborts. This ensures that the only thing the formatter changes is whitespace, and makes it much less spooky to run it blind on a huge codebase.
That sanity check has saved my ass a couple of times when weird bugs crept in, usually around unusual combinations of language features around new syntax.
(Unfortunately, the formatter in the past year has gotten a little more flexible about the kinds of changes it makes, including sometimes moving comments relative to commas and brackets, so this sanity check skips some punctuation characters too, making it a little less reliable.)
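For anyone who wants to bolt the same kind of check onto their own formatter, here is a minimal sketch of the whitespace-skipping comparison described above. It is written in Python rather than Dart, the function name `only_whitespace_changed` is made up for illustration, and it is not the dart formatter's actual code (it also does not model the newer punctuation-skipping behavior).

```python
def only_whitespace_changed(before: str, after: str) -> bool:
    """Return True if the two strings are identical once whitespace is ignored.

    Hypothetical helper: walks both strings in parallel, skipping whitespace,
    and bails out at the first non-whitespace character that does not match.
    """
    i, j = 0, 0
    n, m = len(before), len(after)
    while True:
        # Skip any run of whitespace on each side.
        while i < n and before[i].isspace():
            i += 1
        while j < m and after[j].isspace():
            j += 1
        # If either side is exhausted, both must be for the check to pass.
        if i == n or j == m:
            return i == n and j == m
        if before[i] != after[j]:
            return False
        i += 1
        j += 1
```

It is a single linear pass with no allocation, which is roughly why this style of check is cheap enough to leave on all the time.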
What I'd like to do is something somewhere in the middle where I walk the token stream and check that every token of the input ended up in the output, but I haven't figured out a simple and fast way to do that yet. Performance is particularly tricky because I obviously don't want to burn a bunch of CPU cycles on a sanity check that exists only to catch bugs.
The easiest kind of bug to write in a formatter is dropping a bit of syntax on the floor and forgetting to include it in the output, and the sanity check will catch those.
It's also definitely possible to miss some whitespace that's necessary for things like identifier separation, but... <shrug> it's a sanity check, not a proof of correctness.
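To make the token-level idea concrete, here is a rough sketch of what "every token of the input ended up in the output" could look like. Everything in it is an assumption for illustration: the regex is a toy tokenizer (line comments, identifiers, numbers, single punctuation characters), not a real Dart lexer, and `tokens` and `missing_tokens` are made-up names. It compares token counts rather than token order, so a comment moving past a comma would not trip it.

```python
import re
from collections import Counter

# Toy tokenizer (an assumption, not a real lexer): a line comment, an
# identifier, a run of digits, or any other single non-whitespace character.
_TOKEN = re.compile(r"//[^\n]*|[A-Za-z_]\w*|\d+|\S")


def tokens(source: str) -> list[str]:
    """Split source into toy tokens, trimming trailing spaces on comments."""
    return [t.rstrip() for t in _TOKEN.findall(source)]


def missing_tokens(before: str, after: str) -> Counter:
    """Return the tokens (with counts) present in `before` but not in `after`."""
    return Counter(tokens(before)) - Counter(tokens(after))
```

With this, `missing_tokens("f(a, // note\n b)", "f(a, b) // note")` comes back empty even though a pure whitespace-skipping comparison would reject that pair, and two identifiers accidentally glued together show up as missing. The cost is a tokenizing pass over both strings, which is the performance worry mentioned above.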
I had to introduce a formatter in a few sizeable codebases in the past (a few 100k to a few million LOC), and I always did it incrementally via a script that reformatted all files that were not touched in any open PR. The initial run reformatted 95% of all files. Then I ran the script every day for ~two weeks and got up to 99.5% of all files, and after that I ran it manually each time one of the remaining ~dozen longer-running WIP PRs was merged.
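The selection step of such a script can be sketched roughly like this. It is only an illustration under assumptions: the function name and the shape of `open_pr_files` (PR id mapped to the paths that PR touches) are hypothetical, and actually fetching that mapping from your code host and invoking the formatter are left out.

```python
def files_safe_to_reformat(
    all_files: list[str],
    open_pr_files: dict[str, set[str]],
) -> list[str]:
    """Return the files that no open PR touches, in sorted order.

    `open_pr_files` is a hypothetical mapping from a PR identifier to the
    set of paths that PR modifies.
    """
    # Union of every path touched by any open PR.
    touched = set().union(*open_pr_files.values())
    return sorted(f for f in all_files if f not in touched)


if __name__ == "__main__":
    repo = ["a.dart", "b.dart", "c.dart", "d.dart"]
    prs = {"PR-1": {"b.dart"}, "PR-2": {"b.dart", "d.dart"}}
    print(files_safe_to_reformat(repo, prs))
```

Rerunning this daily naturally picks up files as their PRs merge, since merged PRs drop out of the open set.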
[1] https://github.com/diffplug/spotless/tree/main/plugin-gradle...
[2] https://git-scm.com/docs/git-blame#Documentation/git-blame.t...
Unfortunately, I find that codebases lacking auto-formatting are often littered with non-functional changes, as developers temporarily instrument code and remove it but leave whitespace changes behind.
In terms of tracking code changes, one really would have to rewrite the entire history with each commit reformatted.
Rebasing PRs should be trivial: rewrite each commit, reformatting the files it touches, then rebase, `git checkout --theirs`, run the formatter again, and `git rebase --continue`. It's methodical and scriptable, and you don't need to manually resolve any conflicts.
My rough blueprint for introducing a formatter or linter nowadays would be:
- Recorded knowledge share session around how to set up the tools for local use 1-2 weeks before the initial rollout, and outline how the process will take place
- On the day of the initial rollout send out a reminder + the recording again
- Do the initial PR
- Incrementally do the rest of the migration, and subscribe to the PRs that drag out the process
https://research.google/pubs/why-google-stores-billions-of-l...
>files should be functionally equivalent before and after
when you say something like this, the road you are on is paved with good intentions.
I think the main argument for doing a big bang rewrite is that you have a defined before/after; otherwise you're stuck in an endless in-between.
Always nice to see. I've seen people fall into the trap of designing for the common case, not realizing most of the code will be to deal with the less common cases.
Why bother formatting 25m lines of slop, and why is AI wasting tokens on making code look human-readable anyway?
That insight might seem obvious - but if you stay cognizant of it as you work, you can invent some pretty amazing tooling for yourself & your team.