Do you have favorite repos that highlight this?
I have an irrational fear of unknown codebases since it feels like most of the code is either boilerplate or tied to some framework.
Do you have tips and tricks you use to read codebases?
The reason it always impresses me is that C can look like gobbledygook, yet this codebase is clean and understandable.
I seem to remember Postgres and SQLite were relatively accessible to a low intermediate C programmer. When I've had to look at Android code (more C++ admittedly) I've started to get lost very quickly.
The best way to level up is to code. Reading code can be a complementary activity that can bring insights but it's not a way to level up. Active > passive.
> Do you have favorite repos that highlight this?
For what language? Desktop, mobile? Systems programming or web development? Linux/BSD/etc. all have source code available. I believe Microsoft has open sourced the .NET Framework or parts of it.
It's like you are learning a foreign language and want us to recommend good books. Can't really help you if you don't tell us the foreign language and your goals for the language (casual conversation, business, translation, etc.).
Up to a point, yes. But beyond that point, in my experience, a deliberate study of software architecture is required to move forward. That and mentorship/code reviews by people who have a deeper appreciation of software architecture.
You start by wanting to learn how to code, then you write a lot of code, then you progress by learning how to write less code and less complex code.
Both writing and reading code are important. It's just that most people, in my experience, do not actively search out code to read and spend more time writing code.
> The best way to level up is to code.
I think it's much more subtle than either of these.
First of all, "excellent code" is an extremely subjective thing. I once worked with this one developer. He could cook up solutions to complex problems very quickly. But he didn't comment or docstring any of his code, he favored writing his own libraries and frameworks rather than pull in dependencies, and every single thing he wrote was grossly over-engineered once you managed to figure out what it was doing.
Which is a long way of saying, he was a brilliant programmer who wrote very shitty code. And unfortunately, there are a large number of open source projects and maintainers like this, so picking some at random to study may not get you very far.
Counter-point: As a professional developer one might spend far more time reading code than writing code. In my experience, all the good developers I've worked with have the ability to skim through large code bases and quickly zero in on the parts that interest them. It is a very deliberate skill to cultivate.
I once put down my thoughts on this : http://lonetwin.net/20090829/hacks-you-can-live-without/on-r...
Reading text to research is different than reading text to learn to be a better writer.
I'll (tongue-in-cheekly) prompt you with the following:
Language: Whatever qiskit is most familiar with OR has a favorite recommendation for (based on qiskit's interests). Domain: Whatever qiskit is most familiar with OR has a favorite recommendation for (based on qiskit's interests).
https://mitchellh.com/writing/contributing-to-complex-projec...
> The first step to understanding the internals of any project is to become a user of the project.
It's normally easier to figure out complex behaviour from the spec/doc/interaction than from the code.
EDIT: Found the interview: <https://docs.microsoft.com/en-us/shows/Careers-Behind-the-Co...>
Get a list of all the files, sorted however (`find . -name '*.foo'` works) and start going through them top to bottom, or bottom to top if that's a clearer convention of the language. Maybe shuffle order a bit if you discover unit tests (nearby, or found by asking a tool to cross-reference a call) to read the code and the test around the same time, but resist the urge to jump around too much or too deeply. Jot down short notes about what seems to be the main purpose(s) of the file, and move on. Keep going and keep track of what you've seen. Your first goal is to do a complete survey of all the files and not get too distracted by fully understanding new syntax (Java annotations and Python decorators can both be understood as high level declarative tags even though under the hood they're quite different) or endless note revisions from new insights as you progress and start seeing connections or just finally understanding terminology ("wtf is a 'hero'?").
You'd be surprised how fast you can do a single (high level, shallow, skimming in places) pass even for larger code bases, by the end of it you'll also have found the/an entry point, and are in a better place for followup study or producing materials that can help the next person (like an architecture diagram that lists the files involved in each element, at least at that moment, or just some important cross references you've noted that a tool isn't necessarily going to make clear). And for easy code, a single pass may be all you ever need, even if you read it in a strange order. A completed puzzle is perfectly clear regardless of the order you put the pieces down.
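The survey pass described above can be sketched as a small script. This is only an illustration, assuming a Python codebase; the function names `survey` and `notes_template` and the checklist format are hypothetical, not taken from any tool.

```python
import tempfile
from pathlib import Path


def survey(root, suffixes):
    """Return every file under root with one of the suffixes, sorted by path."""
    root = Path(root)
    return sorted(
        p.relative_to(root)
        for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes
    )


def notes_template(paths):
    """Build a plain-text checklist with one line per file for short notes."""
    return "\n".join(f"[ ] {p.as_posix()}: " for p in paths)


if __name__ == "__main__":
    # Demo on a throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "pkg").mkdir()
        (Path(tmp) / "pkg" / "core.py").write_text("x = 1\n")
        (Path(tmp) / "main.py").write_text("print('hi')\n")
        print(notes_template(survey(tmp, {".py"})))
```

Working down the checklist one file at a time, and writing a single line of notes per file, keeps the pass shallow and makes it obvious which files are still unread.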
git clone <repo> ;
open project in editor/IDE
Read the readme.md to get an idea of the author's opinion
Start at `func main(){}` and find what I find.
Longer answer can be taught by taking the patterns out of
https://www.goodreads.com/book/show/567610.How_to_Read_a_Boo...
Both languages are extremely readable, even when looking at unfamiliar code.
The Zig standard library is small, yet covers a lot of common tools and structures. Every file contains implementations of one particular thing, so you can casually browse random files and understand what's going on without having to understand the entire context.
I don't write much Go anymore but my hunch is that it's a combination of the package layout and auto method delegation for embedded structs. Even Java does a much better job of helping the developer at obviating the interfaces between different subsystems.
Python does the file level thing and it's a source of constant annoyance with cyclical imports. Ugh, I've wasted so many hours fixing it.
Well, bad on a large scale. Go in particular has some nice tools to ensure code at a small scale is always good (enforcing syntax style), but no language can stop you from having a bad project architecture.
But you can easily code yourself into a corner with Go too. If someone doesn’t know how to use concurrency well they can do bad things like overcreating goroutines or making a mess with channels. And some of the common patterns (particularly excessively overriding things) can be considered an anti-pattern in terms of understanding (IMO).
I don't know boilerplate-heavy systems like Rails or Django too well. But I just wouldn't suggest starting with reading web app code (though maybe I've ignored reading too much web app code over time).
The easiest code to start thinking about is libraries and things you use today already like the nginx code base or the CPython code base or your logging library or your web server library code.
In these cases maybe you download the repo, build it, see how you could make a small tweak and run it. And soon you're looking through its code to understand how it works.
Another maybe easier technique to start reading more is when you are programming and have an error in a 3rd party library, use grep to find that error in 3rd library code and just start poking around when you do. Maybe add some print statements to it so you can see more of what goes wrong. Try to solve the problem just looking at the code and modifying it instead of using google.
If you ever get into it I'd love to hear from you. Email is on my site and Discord is in my HN profile.
My best guess is that Django must be great for writing big important relational-database-backed apps, or rolling your own CMS, or something else people get paid a lot of money to do. But personally, for my small projects I get more mileage out of starting with a micro-framework and just choosing and bolting on the bits I need.
My trick is to dig in when something doesn’t work the way I expect. Or someone says “I don’t think there’s a way to do X with blah”. My immediate reaction is to clone the code and take a look. I have a “tools” folder on my local machine that contains many of the tools / libraries I use.
Orientation is easier than you expect. The easiest scenarios are around “why did I get that error” situations. Grep for the error and away you go. But having a question to answer will definitely give you a direction to investigate.
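As a rough sketch of "grep for the error", here is the same idea in Python for when you'd rather stay inside a script or REPL. The function name `find_message` and its parameters are made up for this example; plain `grep -rn` does the same job from a shell.

```python
from pathlib import Path


def find_message(root, message, suffixes=(".py",)):
    """Yield (path, line_number, line) for each source line containing message."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the search
        for number, line in enumerate(text.splitlines(), start=1):
            if message in line:
                yield path, number, line.strip()
```

Point `root` at the directory of the installed library that raised the error and pass the fixed part of the message (the part without interpolated values). Each hit is a place to start reading or to drop in a print statement.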
For packages to avoid, stay away from Celery. It's just... icky.
Weirdly enough, I enjoy digging through C and Java libs more, mostly because they’re more unfamiliar to me. I’d spend more time in Postgres / fontforge / mupdf / pdfbox / nginx / uwsgi on the whole.
If so, then here's distributed consensus in Zig:
https://github.com/coilhq/tigerbeetle/blob/main/src/vsr/repl...
Something that differentiates this from many consensus implementations is that there's no boilerplate networking/multithreading code leaking through, it's all message passing, so that it can be deterministically fuzz tested.
I learned so much, and had so much fun writing this, that I also hope it's an enjoyable read—or please let me know what can be improved!
I have some qualms with these one-line early returns though:
if (commit <= self.commit_min) return;
Control flow statements should always be on their own lines, then it's easy to find all of them by visually scanning top-down, without needing to look all the way down each line.
[1]: https://github.com/coilhq/tigerbeetle/blob/main/src/vsr/repl...
[2]: https://github.com/coilhq/tigerbeetle/blob/main/src/vsr/repl...
e.g. "assert_equal" is really just "expected == actual" at its core but it uses both a block param (a kind of closure) for composing a default message and calls "diff" which is a dumb wrapper around the system "diff" utility (horrors!). There is even some evolved nastiness in there for an API change that uses the existing assert/refute logic to raise an informative message. This is handled with a simple if and not some sort of complex hard-to-follow factory pattern or dependency injection misuse.
https://github.com/seattlerb/minitest/blob/master/lib/minite...
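To make the shape of that concrete, here is a loose Python analogue. It is not minitest's code: a zero-argument callable stands in for the Ruby block param, and the standard library's `difflib` stands in for shelling out to the system "diff".

```python
import difflib


def assert_equal(expected, actual, message=None):
    """Raise AssertionError unless expected == actual.

    message may be a string or a zero-argument callable. The callable is only
    invoked on failure, so an expensive message costs nothing on success.
    """
    if expected == actual:
        return
    if callable(message):
        message = message()
    if message is None:
        # Default message: a unified diff of the two values' string forms.
        diff = difflib.unified_diff(
            str(expected).splitlines(),
            str(actual).splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
        message = "\n".join(diff)
    raise AssertionError(message)
```

The core really is one comparison; everything else exists to produce a helpful failure message.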
It feels like it has more comments than code. The comments are written in a very nice, understandable language that even actively teaches about concepts that are only adjacent to the code at hand.
E.g. https://github.com/grbl/grbl/blob/master/grbl/stepper.c#L142 or https://github.com/grbl/grbl/blob/master/grbl/stepper.c#L233
...
char old_value; // Old EEPROM value.
char diff_mask; // Difference mask, i.e. old value XOR new value.
cli(); // Ensure atomic operation for the write operation.
...
You can remove the need for the first comment by calling the variable old_eeprom_value. Boom, simple and obvious. Commenting cli() is similarly ridiculous: call the function disable_interrupts() and it's completely obvious what it's doing. Later on:
sei(); // Restore interrupt flag state.
This is incorrect. It's enabling interrupts, not restoring them. If the intent was actually to restore the interrupt disable flag to its original state then this function is buggy and will unintentionally enable them. It would be far better to document the expected semantics in the documentation for the function; instead, you have to read the code of eeprom_put_char() to figure out what the semantics are. What would be better is to have a comment in the function description saying "this function can only be called with interrupts enabled" or "this function is atomic and can be safely called from an interrupt handler or with interrupts enabled". Then it's obvious when reading the code which semantics are guaranteed / expected.
So, sure, overly commented code makes it easy to figure things out, but this is a sign of a junior developer who is focused too much on the code and not enough on the overall system. This isn't something I'd point a developer at who is looking to learn good habits. Good habits are telling other developers what they can expect from a function. Bad habits are making them read the code to figure that out.
Often this will be along the lines of "How does it do X?" - where X is something I either didn't know was possible or that I suspect to be really difficult.
Then I can dive in to the codebase (usually starting with GitHub code search) and try to figure out how they do it.
This helps me skip straight past the boilerplate and means I often get to a satisfying conclusion - where I've learned something new - in a very small amount of time.
And along the way I pick up knowledge about how their code is organized and often a few other tricks too.
A couple of searches against https://github.com/python/cpython lead me to this code here: https://github.com/python/cpython/blob/4674fd4e938eb4a29ccd5...
Are you interested in any particular languages?
For Python, take a look at: https://github.com/psf/requests
This is what I recommend to everybody who wants to read code. Why? Because the books explain the design behind the code, without which it is quite difficult to understand the code-base. Also you get exposure to a bazillion different projects, which is crucial to "grok" Software Architecture and Large-Scale Design.
I’ve also published an open-source iOS + Android app to the App Stores, called GitTrends that leverages my AsyncAwaitBestPractices library if anyone wants to see how to use it in a real/live production app!
The source code for GitTrends is available here: https://gittrends.com
Starting to explore scheme more and would be interested in some good pointers
https://github.com/google/leveldb
Jeff Dean and Sanjay Ghemawat are amazing engineers and this code is (/was?) nice.
Here's a window manager (dwm) and its docs and build system in 13 files and just around 3000 lines of code.
https://git.suckless.org/dwm/files.html
And sbase, a sort of "busybox-like" set of common *NIX base utils written to be small and portable. Some of the commands are just a few dozen lines.
It's funny because I remember comparing it to mine that I had tried to write during college, and appreciating how much better it is.
Pay attention to how there's a bunch of different types of chess in there too, and how that's factored.
I'd suggest this codebase as an excellent lesson in how bloat and complexity enter into the picture over time - I wish the actual commit history was available, but unfortunately the open source release was just a snapshot in time.
It is easy to read and has taught me some neat Python-isms.
I find a lot of code fairly alienating to read. Lots of codebases require you to get into the "mindset" of the person who wrote the code: their idioms, assumptions, patterns they lean on, etc. So unless you've got the time to get deep into it, the insights you can draw from reading it are minimal.
Ramda, by comparison, is just a library of utility functions, and all of those utilities perform very simple operations: merging, plucking, appending, equality checking, etc.
There's a lot of intention in the Ramda API as well. All functions are "data last," meaning that the actual piece of data you're operating on is the final argument to every function. This enables you to write Ramda code that is very structurally consistent: function parameters first, data last, every time.
It gives me a sense of empowerment, reading the code. It's like "This doesn't have to be rocket science. If you just start from these basic operations, and write those basic operations with a simple but strict ideology of 'data last' every time, and stick them together like lego blocks using compose, then you can achieve some very cool stuff with very little code."
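A tiny sketch of the "data last" idea, written in Python rather than JavaScript. The helpers below borrow names from similar Ramda utilities but are simplified stand-ins written for this example, not Ramda's implementations.

```python
from functools import reduce


def compose(*funcs):
    """Compose right to left: compose(f, g)(x) == f(g(x))."""
    return lambda data: reduce(lambda acc, f: f(acc), reversed(funcs), data)


# Each helper takes its configuration first and returns a function
# that is still waiting for the data: "data last".
def pluck(key):
    return lambda rows: [row[key] for row in rows]


def reject(predicate):
    return lambda items: [item for item in items if not predicate(item)]


def append(item):
    return lambda items: [*items, item]


# Reads bottom to top: drop minors, take the names, add a marker.
names_of_adults = compose(
    append("(end)"),
    pluck("name"),
    reject(lambda person: person["age"] < 18),
)

people = [{"name": "Ada", "age": 36}, {"name": "Tim", "age": 9}]
print(names_of_adults(people))  # ['Ada', '(end)']
```

Because every helper ends by accepting the data, the pieces snap together without any glue code, which is the lego-block feeling described above.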
I’ve got two main strategies:
1) I look at the part of the app I want to modify when I use the app and search for that part in the code. Once I’ve found that code I roughly try to find out how that code works by adding exploratory code (you can also use a debugger). Once I “think” I know what is going on I try to modify the code. This is where you usually find some exceptions or misunderstandings on your part if you haven’t touched the code before. If you are lucky and work in a team somebody can tell you in a code review what you didn’t understand. If you are alone you will have to see things blow up, debug and fix the problem.
2) You can try to figure out from the main entry point how the app works. This works better for some apps than for others. If you have an event based app this is most likely just a supplement to method 1, if you have a cli app or some type of data munching app this can replace method 1.
3) You can try looking at early versions of a code base in Git to get an understanding of its architecture before the app became “more complex”.
You will always be a bit overwhelmed by any code base and many code bases are just too large for a single person so get comfortable working on “parts” of an app first rather than working on or understanding “the whole thing”. Also, code reading is not like reading books, code is way way denser than any book you can read (and that includes Heidegger) so you will not just “read” it, you will need to work with it. Zed Shaw’s “Learn X the Hard Way” series relies on you working with the code to understand it. The same holds true for code you “read”, you will at least need to try to “run” the code in your mind if you can’t run it for real.
You might also want to get over your thing about frameworks. Qt, GTK, Ruby on Rails, React, ncurses, frameworks and libs are in just about any app and many apps that get larger might extract significant parts of their functionality into libs or frameworks. A lot of boilerplate is usually a good indication that an app could benefit from a framework. I never understood the “I want to be free from the constraints of frameworks” people. Their code bases usually have the start of multiple architectures and a lot of boilerplate code. I think they always search for some “perfect” solution and just can’t find it. The truth is, libs and frameworks are great, they give you an easy in on a new app and they give you documentation that probably wouldn’t exist on fully home grown code. In other words, they make “reading” code easier.
HashiCorp projects also seem very well done, especially given how extensible they are.
If you are using ruby, for instance, just search for https://github.com/search?q=language%3Aruby and look for popular codebases. You can decide which are beautiful for yourself.
In terms of tips and tricks, I often start looking at new code by trying to write out, in plain English prose, a bit of a story of how the code works. Almost like I'm writing a blog post explaining how things work to someone else. Often this process uncovers rabbit holes that I need to go down to understand isolated bits of logic before I can return to building this big picture view, which is sort of the point.
If you have a Linux machine, you can compile and install manually by just following the instructions on the README.
Then you can customize the window manager by copying and pasting the patches into your version and recompiling. That forces you to learn how to build and extend your own window manager in pure C. And it isn’t hard at all, even to a beginner.
That inspired the creation of many tiling window managers, because people understood the code and decided to build their own, like i3 or xmonad.
The project also features other easy to read C apps, like ST terminal and the surf web browser.
I see every day code that is elegant but has bugs, ugly code that is foolproof, optimized code that performs abysmally because of some architecture change that happened in between, and a lot of abominations that make the code bad for guy A and good for guy B (e.g. a neat typechecked, object-oriented, very elegant, Pythonic numerical code that is 100 times more confusing for your research level numerical analyst than an uglier but functional Matlab script).
What I agree on is "the best way to improve X in my code" is "read code that has quality X".
Given the broadness of your question I suspect you are still finding your way around programming in general. If that's the case my method is to be driven by curiosity.
- Why does macOS behave this way? Let's look up xnu's code
- I wonder about list implementation... Let's look at CPython code for appending items to a list
And so on... There is a lot of open code for stuff we are using everyday. It is interesting to get into it.
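For the list example, a quick experiment can tell you what to look for before you open the C source. This sketch assumes CPython (where `sys.getsizeof` is supported) and records each point at which appending changed the list's reported size; the function name `growth_points` is made up for this example.

```python
import sys


def growth_points(count):
    """Return (length, size_in_bytes) each time appending changed the list's size."""
    items = []
    last_size = sys.getsizeof(items)
    points = [(0, last_size)]
    for i in range(count):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            points.append((len(items), size))
            last_size = size
    return points


if __name__ == "__main__":
    for length, size in growth_points(64):
        print(f"len={length:3d} size={size} bytes")
```

On CPython the size changes far less often than once per append, because the list over-allocates when it grows. The exact numbers vary by version and platform, which is a good reason to go and read the resize logic in the list implementation itself.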
For C++, try Chromium: https://chromium.googlesource.com/chromium/chromium/+/refs/h...
One upside of this might also be that it's not, as you said, boilerplate, because it's very foundational and not heavily using other stuff. It also is well documented, so you'll find good explanations of why things are the way they are.
I've also heard good things said for OpenBSD's readability.
Working through some badly written code that actually performs well can be a real eye opener. I mainly work in C and reading some legacy code (sometimes even my own) can be a challenge to work out exactly what's going on.
If you want to learn how an algorithm works, then a good clean codebase with lots of comments is a good way to go. If you want to learn the details of a particular language, then just read a lot of code in that language whether it’s good or bad.
https://www.doomworld.com/idgames/utils/level_edit/deu/deu52...
For jumping into new codebases I stick to the JetBrains toolbox because it’s usually a consistent enough environment to investigate a new codebase. I also greatly appreciate the indexing.
Also, a lot of "clean code" stuff can be confusing dogma.
You should try building things you find interesting, and try to build them in a way that "feels correct", and try to empathize - what if someone else was reading this? What if someone else dived into this codebase to add this feature? Could they?
- Lua
- Redis
- idtech3
- libuv
- linux kernel
- sqlite
As much as Ruby, Python, and Go are touted as elegant or clean to read, they are pretty horrible to read in the wild. C is where it's at.
#1: If the codebase is huge, you can't read all of it. So you'd best know how to navigate it.
#2: You need an IDE or cscope-like tool to navigate a codebase. The codebase is like a web of, say, Wikipedia articles, and you're going to have to browse it a lot like how you'd browse Wikipedia. Symbols are links!
#3: It helps to understand the big picture. What does this codebase implement? Where are the "entry points" -- where to start reading? What's the architecture? (E.g., Java is a byte-compiled language with a bytecode interpreter known as a JVM.) What's the design look like?
#4: If it's just for fun, well, just browse till you find something interesting, then read it carefully, and go spelunking like it's a Wikipedia article.
#5: If you're reading it to debug something, you need to first find the relevant entry points.
#6: If you're reading it to add features, you really need to read the developer docs (if they exist), the internals docs (if they exist), and figure out a lot of things like APIs exported, internal utilities libraries, portability layers, external dependencies, protocols, etc. This will take time, and that's ok. Start with small features, and work your way. You'll build a deeper understanding as you go.
#7: You don't have to understand all that much about the codebase in question, and it might not be possible to if we're talking about a codebase that's in the hundreds of millions of lines of code. You'll have to specialize as you dive deep, and generalize as you wade "near the top".
#8: It can take time to pick up these skills to the point where you can do this quickly. And even then, it can take time to understand a large codebase well enough. There's just a ton of detail that you have to digest into a mental picture that's sufficiently high-level that you can use it productively. So be patient, and keep on going. Just because it's a lot to learn, you shouldn't be discouraged.
To really deal with huge codebases, you have to be a bit like a generalist who can specialize as needed.
For example, if you're reading the OpenJDK, you'll want to understand what Java is, what the JVM is, and so on. You won't have to understand all of that if you just want to read the OpenJDK implementation of, say, TLS, but you will sometimes have to navigate outside that particular bit of the OpenJDK. And if you tease out code threads far enough, you will probably learn a thing or three about seemingly unrelated things like the GC.
Get comfortable doing these things, and you'll be able to deal with codebases in the millions of lines of code.
Adding on Tailwind, nothing locks you in.