Analysis of longevity of code across many popular projects (opens in new tab)

(erikbern.com)

202 pointsalexlikeits19999y ago51 comments

51 comments

47 comments · 25 top-level

Look at how consistently the lines of code grow for these projects. I doubt that is surprising but think about the implications. Linux is a pretty old open source project and still on balance the lines of code just grow.

How many lines of code will it be in fifty years? Will we have to come up with new systems to manage the fact that individuals only really understand smaller and smaller pieces of it? Will it reach a mass where like a black hole it collapses from some uncomprehensible failure?

There have never been things like this that just grow in complexity forever.

buzzybee9y ago

There's a pretty definite threshold, perhaps somewhere in the early 10,000's of lines(depending on the language and how densely you wrote it), where the system becomes bigger than your headspace, and at that point, I think, the real challenge for code reduction begins. And there are some very big wins possible by consistently reducing the problem into "language for solving the problem plus trivial code in that language" - VPRI's STEPS articles [0] demonstrate as much.

But it's perceptually less work to accrete a few hundred new lines to the system than to engineer a 10x reduction that introduces an entirely new architecture layer, so nobody does, or only a few research-driven projects. Programmers are, after all, lazy and deadline-driven. And while proprietary software tends to accrete due to corporate dynamics creating an environment of continuous growth, open-source projects are often built on accretion as an assumed benefit.

My own software goal, having viewed this landscape, is to be stubborn enough for long enough to score some of the 10x wins and share those so that "more with less" becomes a little more viable. I do think that all codebases go obsolete eventually, but the overall ecosystem doesn't have to depend on them so long there's data and protocol compatibility.

[0] http://vpri.org/html/writings.php

jtigger9y ago

> There have never been things like this that just grow in complexity forever.

The evolution of the species? Homo sapiens seem rather complex.

Perhaps what we'll see is that over time, large portions of what have been long list of pedantic instructions become decision models that get tweaked over time. Imagine when referring to the size of a software system using two numbers: MLOCs and MDPs (Matured Decision Points). Not unlike industrial revolution assembly lines compared to the automated factories of today — modern factory workers are some degrees removed from the step-by-step movements.

JoachimSchipper9y ago

Linux already relies on specialized tools - like Cocinelle=spatch - to migrate very large codebases at once. Linux has also grown from a hobbyist project into a fairly complex (informal!) organization of highly-skilled full-time developers, just to keep development manageable.

rmchugh9y ago

I think one of the main reasons that something like Linux keeps on growing is that it needs to support more drivers as they come onto the market. So unless you want to remove support for older hardware, you're more or less going to grow endlessly.

georgeecollins9y ago

Right, but nothing can grow endlessly. Yet these projects do grow relentlessly, almost never getting any smaller, for as long as we have kept track. It can't go on forever, so how can it end?

2 more replies

cyphar9y ago· 4 in thread

It looks like the exponential model isn't a good fit at all -- in all cases it undershoots the decay at the start of the graph and overshoots at the tail end. So while it might "look close" there is some systematic that your model doesn't account for. In particular, I don't agree that all code in a codebase has a constant risk of being replaced -- most projects have different components that are developed at different rates. Some components are legacy code that is likely to never change, while other parts are under rapid development. In fact, I'd argue that's why the tail is so long -- legacy code is called "legacy" for a reason. And the tip of the graph dives down so quickly because code being rapidly developed has a higher chance of being replaced.

BlackFly9y ago

Agreed; you can reject exponential decay a priori: a code base has some minimal set of functionality that it specifies and there is some minimal set of lines required to provide that functionality. If you believe this, then the "decay" curves, must illustrate an asymptote. The asymptote only gets reduced by breaking backwards compatibility. Such an action would include projects like Angular that go ahead and throw away a ton of core functionality in moving from 1.x to 2.x.

The decay isn't actually decay at all, but represents the complement of lines that define peripheral functionality. Lines defining peripheral functionality typically require modification (refactoring) as additional functionality is added. The asymptote which all the curves illustrate but the fit cannot capture represents the proportion of irreducible core functionality.

Simply adding a constant to the fit might fix all the problems.

Nomentatus9y ago

This is a close fit to Fechner's Law (not Weber's) relating perceived intensity to stimulus in animal vision. Also, thanks so much, author.

phkahler9y ago

It might be useful to fit an exponential for each years code. The oldest code seems to decay to some constant and then linger forever at some LOC - it does not decay to zero LOC. It may be that new code is inherently inefficient and over time is gets cleaned up. Also some functionality may be abstracted and built into new code and removed from the old. Over time you end up with a chunk of code doing its one thing and doing it well. This is all speculation of course. I hope this gets looked at a lot more.

klodolph9y ago

Agreed... looking at the graphs, the fit screams "non-exponential" to me.

aamederen9y ago· 4 in thread

Just Amazing.

I wonder if there any research articles discussing the correlation between code-change and other metrics like product quality, change frequency of team members, estimation success, etc.

spawarotti9y ago

See the paper "Change bursts as defect predictors" by Nagappan et al. and other papers by the authors.

siscia9y ago

This is not a research article? What is the difference between this article and one printed in PDF with more complex words and uglier images?

My question sparked for your request of a RESEARCH article and not a "normal" blog post.

I would like to know why you want a "research" article, that I assumed being an academic article, instead of a blog post.

aamederen9y ago

What I interested in was not the "research article" part but the comparison case.

This article comes up with a tool measuring the half-life of the code and demonstrates it on some project. What I was requested as an addition is the discussion of correlation of this metric with other metrics.

Being said that, a "paper" makes a difference compared to a "blog post" at some cases. Sometimes, in order to convince your directors and project managers about your change proposals to the programming processes, you need to support your idea with more serious work.

For example, in my previous company, I could use such an academic research in order to demand more time budget for "code-cleanup" periods where the team focuses just to re-writing the parts of legacy code, instead of bugfixes and new features.

I am surprised that this small request offended someone.

1 more reply

spawarotti9y ago

Academic research articles have to follow a quality standard enforced by a peer review process.

1 more reply

mmerickel9y ago· 2 in thread

This is awesome. I ran it on pyramid for fun. http://imgur.com/a/KZ9KR

AstralStorm9y ago

A catastrophe based model seems more appropriate. You see big refactoring taking place. Only in a big number sum those Poisson-like events turn into exponentials.

mmerickel9y ago

2010 was a major transition for pyramid as it was re-branded from repoze.bfg and merged with the pylons project.

xiphias9y ago· 2 in thread

Maybe Linux is not rewritten because it doesn't have tests

AstralStorm9y ago

It is being rewritten all the time, but new code per arch or driver dwarfs common.

xiphias9y ago

Still, without a high and good test coverage that makes sure that the drivers don't break when the core is changed, it's not worth to do refactoring just to make the code cleaner. Still, it's amazing how much extra functionality was possible to be added

jsjohnst9y ago· 1 in thread

I'd posit that the reason is a nuance of #2, more thought was put into older code on the design of how it should work before making it work. Now we write code so fast that we have to scrap it all and do it again a second time to fix the mistakes of the first time [0]. I'm of the mind that upfront planning would've likely taken less time, but that's simply my opinion and I don't have anything to back it up besides anecdotal experience. The current practice of "move fast and break things" very well could be a better approach.

[0] I'm only mentioning this footnote due to the article picking on Angular (fairly or unfairly), but the point I made that this is footnoting is relevant to them potentially.

techiferous9y ago

> The current practice of "move fast and break things" very well could be a better approach.

It's all about context. What's the cost of "breaking things"? Does your not-yet-monetized social network startup go down for a couple of hours? Or does someone die?

Also, I have witnessed first-hand the slowdown in productivity that a "move fast and break things" approach has when what you are breaking is your team's ability to work quickly and confidently with the code base.

yoavfr9y ago· 1 in thread

Results for WordPress https://blog.yoavfarhi.com/2016/12/06/half-life-wordpress-co...

jwdunne9y ago

It's interesting that it seems better engineered projects tend to retain more initial lines of code. Almost no first version code exists in WordPress and that project's codebase truly is hodge bodge, where as Git is considered a better piece of emgineering. In fact, Linus has been asked specifically about the good design of his programs and how he achieves it.

inputcoffee9y ago· 1 in thread

I really like this. Suppose we were to accept the suggestion that perhaps linearly decaying code is better built, and more robust, than exponentially decaying code.

Would this give newbies a great new tool to answer their question of which framework or language to learn?

Rails or Django? Django lasts longer. Angular or React? VueJs just a trend?

You could answer all these questions with this kind of analysis.

If someone wants to make a genuine contribution, a blog post contrasting the various decays of Javascript frameworks would be a hit.

ElonsMosque9y ago

Ditto. That would be useful to know as newbie, I'm hoping someone can give an answer or atleast point into the right direction.

msluyter9y ago· 1 in thread

A lot of projects I've worked on have utility libraries consisting of mostly stateless, pure functions -- I have a theory that these constitute some of the longest lived code. That, and database models, which tend to be easier to expand than to contract. I'd be curious to see some analysis along these lines.

jtigger9y ago

Nice hypothesis! Reminds me of the shearing layers from Foote & Yoder's "Big Ball of Mud" — http://www.laputan.org/mud/mud.html#ShearingLayers

azag09y ago· 1 in thread

The simplest explanation is that the exponential model is not a good one (does not correspond to the underlying dynamics), and so the half time value is not an inherent (time-independent) property of the codebase, but depends on its age. It seems to me that in most projects, the code evolves quickly in the beginning and then stabilizes on a slower linear decay. This would explain the observed dependence of the fitted half life on age. It might be more meaningful to fit a linear dependence at the beginning of the project and in the asymptotic regime and also look at their ratio. This should be more stable, would tell you how well was the project designed from the beginning, and would also indicate whether the project has already stabilized or not.

jsjohnst9y ago

He addressed this partly in the article. Even when looking just at the early days of the more mature projects the code churned much less than projects less mature now.

aristus9y ago

My rule of thumb is that if a line of code survives its first five years it'll live forever. The age of a piece of code is the single greatest predictor of its future.

skeltoac9y ago

Code from year one may still be the same code but when it gets moved or reformatted its cohort is updated. If a bad change is reverted, will the cohort for those lines of code also be reverted? The effect of understated longevity is not so obvious when it is gradual and organic. Sometimes an event in a project's history makes the effect very obvious.

https://blog.yoavfarhi.com/2016/12/06/half-life-wordpress-co...

puredanger9y ago

Clojure takes a remarkably stable and additive approach to maintenance and growth. Graph for it is here: http://imgur.com/a/rH8DC

Insanity9y ago

I'm not a dad and I actually appreciated the "Git of Theseus" pun. The ship is an interesting thought experience on identity, maybe you can modernize it to introduce philosophy to computer science students ^^

But more on-topic, nice article, well done!

koja869y ago

Good job and great tool. Thanks.

This might actually be an interesting metric in regards to project architecture and/or project management.

Totally agree that exponential model explanatory power is great.

shivpat9y ago

Great stuff here.

Definitely some merit to reason #3 - people are more willing to work on something that they can easily build on top of.

alenmilk9y ago

Interesting, but it is not surprising that git and redis are more stable than node and angular. Git and redis are well defined problems that won't change that much. Angular is a framework and node is a platform. They should change more. But then again... javascript fatigue is a thing from what I heard.

qume9y ago

Would love to see this restricted to which lines are actually executed in a typical use. I.e. ignore dead code which is still in the repo.

Probably wouldn't be too hard for the interpreted languages.

jtigger9y ago

I wish we could capture the "inventiveness" of a particular project — how well the problem was understood when the project initiated.

There had been _many_ *nix'es by 2006, so the territory had significant prior art and with it collective deep understanding of the problems being solved. Angular sprouted alongside a number of other SPA frameworks in an ecosystem that was experiencing a "growth-spurt" (using that term loosely) — lots of variables.

jtigger9y ago

Another feature of the model could be stability of product vision. Is there a correlation between the half-life of committer membership and that of the code? How has the problem space of the product changed over time?

Perhaps we could talk about "intrinsic churn" vs. "accidental churn". The former results from the codebase keeping up with the "drift" in the problem space; the latter comes from having to learn.

malkia9y ago

There are still AAA games shipped with code written by id Software 20 or more years ago. Tools too :)

esmi9y ago

Super interesting.

But it doesn't seem fair to directly compare 2005 14 year old stable 2.6 Linux to projects which did basically foundational initial releases and not mention it.

twelvechairs9y ago

Great work. Would be interesting to compare by language - see if particular languages need more or less refactoring.

throwawygybj9y ago

So its true... Angular is the code abyss, my colleagues said it was a legend but I have seen it with mine own eyes.

Thank you

beefman9y ago

Would be interesting to see the results as a function of repository size.

j / k navigate · click thread line to collapse

51 comments

47 comments · 25 top-level

georgeecollins9y ago· 5 in thread

There have never been things like this that just grow in complexity forever.

buzzybee9y ago

[0] http://vpri.org/html/writings.php

jtigger9y ago

> There have never been things like this that just grow in complexity forever.

The evolution of the species? Homo sapiens seem rather complex.

JoachimSchipper9y ago

rmchugh9y ago

georgeecollins9y ago

Right, but nothing can grow endlessly. Yet these projects do grow relentlessly, almost never getting any smaller, for as long as we have kept track. It can't go on forever, so how can it end?

2 more replies

cyphar9y ago· 4 in thread

BlackFly9y ago

Simply adding a constant to the fit might fix all the problems.

Nomentatus9y ago

This is a close fit to Fechner's Law (not Weber's) relating perceived intensity to stimulus in animal vision. Also, thanks so much, author.

phkahler9y ago

klodolph9y ago

Agreed... looking at the graphs, the fit screams "non-exponential" to me.

aamederen9y ago· 4 in thread

Just Amazing.

I wonder if there any research articles discussing the correlation between code-change and other metrics like product quality, change frequency of team members, estimation success, etc.

spawarotti9y ago

See the paper "Change bursts as defect predictors" by Nagappan et al. and other papers by the authors.

siscia9y ago

This is not a research article? What is the difference between this article and one printed in PDF with more complex words and uglier images?

My question sparked for your request of a RESEARCH article and not a "normal" blog post.

I would like to know why you want a "research" article, that I assumed being an academic article, instead of a blog post.

aamederen9y ago

What I interested in was not the "research article" part but the comparison case.

I am surprised that this small request offended someone.

1 more reply

spawarotti9y ago

Academic research articles have to follow a quality standard enforced by a peer review process.

1 more reply

mmerickel9y ago· 2 in thread

This is awesome. I ran it on pyramid for fun. http://imgur.com/a/KZ9KR

AstralStorm9y ago

A catastrophe based model seems more appropriate. You see big refactoring taking place. Only in a big number sum those Poisson-like events turn into exponentials.

mmerickel9y ago

2010 was a major transition for pyramid as it was re-branded from repoze.bfg and merged with the pylons project.

xiphias9y ago· 2 in thread

Maybe Linux is not rewritten because it doesn't have tests

AstralStorm9y ago

It is being rewritten all the time, but new code per arch or driver dwarfs common.

xiphias9y ago

jsjohnst9y ago· 1 in thread

[0] I'm only mentioning this footnote due to the article picking on Angular (fairly or unfairly), but the point I made that this is footnoting is relevant to them potentially.

techiferous9y ago

> The current practice of "move fast and break things" very well could be a better approach.

It's all about context. What's the cost of "breaking things"? Does your not-yet-monetized social network startup go down for a couple of hours? Or does someone die?

yoavfr9y ago· 1 in thread

Results for WordPress https://blog.yoavfarhi.com/2016/12/06/half-life-wordpress-co...

jwdunne9y ago

inputcoffee9y ago· 1 in thread

I really like this. Suppose we were to accept the suggestion that perhaps linearly decaying code is better built, and more robust, than exponentially decaying code.

Would this give newbies a great new tool to answer their question of which framework or language to learn?

Rails or Django? Django lasts longer. Angular or React? VueJs just a trend?

You could answer all these questions with this kind of analysis.

If someone wants to make a genuine contribution, a blog post contrasting the various decays of Javascript frameworks would be a hit.

ElonsMosque9y ago

Ditto. That would be useful to know as newbie, I'm hoping someone can give an answer or atleast point into the right direction.

msluyter9y ago· 1 in thread

jtigger9y ago

Nice hypothesis! Reminds me of the shearing layers from Foote & Yoder's "Big Ball of Mud" — http://www.laputan.org/mud/mud.html#ShearingLayers

azag09y ago· 1 in thread

jsjohnst9y ago

He addressed this partly in the article. Even when looking just at the early days of the more mature projects the code churned much less than projects less mature now.

aristus9y ago

My rule of thumb is that if a line of code survives its first five years it'll live forever. The age of a piece of code is the single greatest predictor of its future.

skeltoac9y ago

https://blog.yoavfarhi.com/2016/12/06/half-life-wordpress-co...

puredanger9y ago

Clojure takes a remarkably stable and additive approach to maintenance and growth. Graph for it is here: http://imgur.com/a/rH8DC

Insanity9y ago

But more on-topic, nice article, well done!

koja869y ago

Good job and great tool. Thanks.

This might actually be an interesting metric in regards to project architecture and/or project management.

Totally agree that exponential model explanatory power is great.

shivpat9y ago

Great stuff here.

Definitely some merit to reason #3 - people are more willing to work on something that they can easily build on top of.

alenmilk9y ago

qume9y ago

Would love to see this restricted to which lines are actually executed in a typical use. I.e. ignore dead code which is still in the repo.

Probably wouldn't be too hard for the interpreted languages.

jtigger9y ago

I wish we could capture the "inventiveness" of a particular project — how well the problem was understood when the project initiated.

jtigger9y ago

Perhaps we could talk about "intrinsic churn" vs. "accidental churn". The former results from the codebase keeping up with the "drift" in the problem space; the latter comes from having to learn.

malkia9y ago

There are still AAA games shipped with code written by id Software 20 or more years ago. Tools too :)

esmi9y ago

Super interesting.

But it doesn't seem fair to directly compare 2005 14 year old stable 2.6 Linux to projects which did basically foundational initial releases and not mention it.

twelvechairs9y ago

Great work. Would be interesting to compare by language - see if particular languages need more or less refactoring.

throwawygybj9y ago

So its true... Angular is the code abyss, my colleagues said it was a legend but I have seen it with mine own eyes.

Thank you

beefman9y ago

Would be interesting to see the results as a function of repository size.

j / k navigate · click thread line to collapse