Ask HN: I have to analyze 100M lines of Java – where do I start? | Better HN


sergiosgc11y ago· 6 in thread

For rewrite from scratch projects, I always start by identifying the use cases covered by the application. You don't need the code for that. Just run the application and identify what it is that it does. Then, work backwards. For each use case, use the existing code as specification of the use case behavior.

At 100 million lines, I'd suspect this is either an extremely large project, where a rewrite from scratch is inadvisable, or that there is a code generator at work. If it is the latter, you want to analyze the code generating source, not the end result.

Anyhow, generically, for a first contact with a new code base, code coverage tools are a good start, as is a call graph debug run of the project. It'll let you spot dead code as well as hot code (code being called at every run of the application). It'll highlight the important and non-important code parts, allowing you to read less code and get a grasp on the architecture.

rkalla11y ago

This, 100x this.

You and your team don't just have to build an understanding of the code (e.g. frameworks, patterns, DBs, etc.) but of the app itself, so you can truly understand its purpose.

Your rewrite (hopefully) won't also be 100mm lines, so if you understand the high-level purpose of the app completely and then dive deep from there, you may find many places where the system can be simplified.

Are you really not exaggerating? 100mm lines of source? Yiiiikes...

fleitz11y ago

Yup, start the profiler up, use the application, check the call graph.

At 100 MLOC the code base is probably a complete mess.

Also, simian is probably your friend as it will identify large chunks of duplicate code.

Source control can be your friend as well: the older the source is, the more likely it is to contain useful code. The files with the most changes will usually be where the bugs are.
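
A minimal sketch of the "files with the most changes" idea, assuming the history lives in git and that you feed the function the text printed by `git log --name-only --pretty=format:`. The sample log and file names below are made up for illustration.

```python
from collections import Counter


def churn_by_file(git_log_output: str) -> Counter:
    """Count how many commits touched each path.

    Expects the text printed by:
        git log --name-only --pretty=format:
    which lists the files of each commit, one path per line, with blank
    lines between commits.
    """
    counts = Counter()
    for line in git_log_output.splitlines():
        path = line.strip()
        if path:  # skip the blank separator lines
            counts[path] += 1
    return counts


# Hypothetical sample output, made up for illustration.
sample_log = """\
src/billing/Invoice.java
src/billing/TaxRules.java

src/billing/TaxRules.java

src/util/Dates.java
src/billing/TaxRules.java
"""

for path, commits in churn_by_file(sample_log).most_common(3):
    print(f"{commits:3d}  {path}")
```

Renamed files show up under each of their names, so treat the ranking as a hint about where to look, not as a measurement.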

jacquesm11y ago

Why did this get downvoted? It's actionable and on topic.

stronglikedan11y ago

I could be wrong, but I think the "M" means thousands in this case. I know it means thousands when dealing with inventory UOMs. 100 million lines just seems like a lot, even for a code generator, and not something that would be turned over to a single team.

crpatino11y ago

I don't think there are that many companies with ~10^8 LOC code bases in the world, let alone for a single product[1]:

Facebook (webapp): ~3*10^7 LOC

Linux kernel: ~2*10^7 LOC

Windows XP: ~5*10^7 LOC

My impression is that a 10 million LOC codebase is relatively common... but past that, the size of the organization (company or volunteers) needed becomes a major sorting criterion.

[1] http://www.wired.com/2013/04/facebook-windows/

johan_larson11y ago

Identify the users. Figure out who actually does anything with the system. They'll be invaluable as you try to determine which parts of the system are active.

Look for a test suite. You'll need one once you start making changes, to keep from breaking anything. If necessary, create one based on actual jobs run on the system. You want integration tests, for big parts of the system, rather than unit tests.

Once you have a test suite, start with dead code analysis. Any codebase as big as this one will have a lot of accumulated cruft that is just getting in the way. Delete it.
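
One way to picture the dead code analysis step: given a call graph as (caller, callee) pairs and a list of known entry points, anything not reachable from an entry point is a candidate for deletion. This is only a sketch. The edge list and method names are hypothetical, and in a real Java system reflection, dependency injection and configuration-driven dispatch will hide calls from a static graph, so the result is a list of candidates to check, not a verdict.

```python
from collections import deque


def unreachable_methods(call_edges, entry_points):
    """Return methods never reached from any entry point.

    call_edges: iterable of (caller, callee) pairs.
    entry_points: methods known to be invoked from outside (main methods,
    request handlers, scheduled jobs, ...).
    """
    graph = {}
    all_methods = set(entry_points)
    for caller, callee in call_edges:
        graph.setdefault(caller, set()).add(callee)
        all_methods.update((caller, callee))

    # Breadth-first search starting from every entry point.
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return all_methods - seen


# Hypothetical call graph, made up for illustration.
edges = [
    ("App.main", "Billing.run"),
    ("Billing.run", "Tax.compute"),
    ("OldReport.build", "Tax.compute"),
    ("OldReport.build", "OldReport.format"),
]
print(sorted(unreachable_methods(edges, ["App.main"])))
```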

goshx11y ago· 5 in thread

10 years ago I worked on a large project to re-write a code base written in C. Our approach was to forget about the code and document everything it did from the user's perspective. Once everything was mapped out we decided on what we were going to keep, modify or remove, and then started building everything from scratch. You can always go back to the original code to see how a particular feature was implemented and perhaps re-use the same logic.

burnte11y ago

Having done projects like this before as well, this is the best method. Knowing WHAT needs to be done today and tomorrow is far more important than knowing HOW it was done before. The how is only important once you know what you need to do.

k__11y ago

lol, so true...

Funny thing is, often the users tell you "I used it for XYZ" and it...

...was never written for XYZ

...never DID XYZ, all the results/numbers were trash, but no one noticed

_em_11y ago

I will totally second that approach. Rather than spending time reading code line by line, you should have your team spend time (as testers) figuring out use cases, one by one. The end result should be a use-case/requirements document that you can feed into development cycles to start building from scratch.

You can break the steps down in a more agile way and have 3 sub-teams:

1. Figuring out use cases
2. Developers
3. Testers

astockwell11y ago

This. Also from a business- (and CYA-) perspective, this gives you a punchlist of functionality to give to executive management, which can be used for anything from doing a proper scoping exercise to actually giving you a metric to show progress against.

pan6911y ago

Indeed. Code is useless without the people who wrote it being around. It's not the code that's valuable, it's the experience of the programmers who wrote it that's valuable.

bjackman11y ago· 4 in thread

You haven't really described your goals: What do you want to extract from your analysis? Metrics to tell you what's "wrong" with the existing code base? Some sort of model of the system's semantics?

user1241320OP11y ago

We'd love to know what these lines do. For example what part of this codebase deals with the DB and what part does not. And then go deeper.

The final goal is to re-do what these lines do :(
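
A cheap first pass at the "which part deals with the DB" question is to classify source files by their import statements. The sketch below assumes that database access shows up as imports from a handful of packages; the prefix list is an assumption to extend with whatever the codebase actually uses, and the two sample sources are made up. It will miss fully qualified names used inline and in-house wrapper libraries.

```python
import re

# Package prefixes treated as "talks to a database". This list is an
# assumption for illustration; extend it to match the codebase.
DB_PACKAGE_PREFIXES = ("java.sql", "javax.sql", "javax.persistence", "org.hibernate")

# Captures the dotted name after "import" (or "import static").
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)", re.MULTILINE)


def touches_database(java_source: str) -> bool:
    """Return True if the source imports anything from a DB-related package."""
    for imported in IMPORT_RE.findall(java_source):
        for prefix in DB_PACKAGE_PREFIXES:
            if imported == prefix or imported.startswith(prefix + "."):
                return True
    return False


# Hypothetical sources, made up for illustration.
dao_source = "package app;\nimport java.sql.Connection;\nimport java.util.List;\nclass OrderDao {}\n"
util_source = "package app;\nimport java.util.List;\nclass Strings {}\n"

print(touches_database(dao_source), touches_database(util_source))
```

Run over every file, this gives a rough DB / non-DB partition to refine by hand before going deeper.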

grey-area11y ago

100,000,000 lines of code is a huge amount and would take you over 1000 days just to read at 1 line per second, and 1000 man-years to fully understand. If your final goal is to rewrite all of it you are probably doomed to fail. You should first ask yourself (and your clients) some simple questions about why this insane project has been dumped on you and what the goal is:

What is the order of priority of services - which services/apps are critical, and which are not very important?

Which services actually need to be rewritten and which are working just fine?

Which services have a clearly defined interface and can be rewritten?

Which tests are in place to test the existing services, and which will you have to write?

I wouldn't touch the code till you have answered those questions, and once you have those answers, having some sort of overview of code coverage etc is going to seem less important, because it will become obvious which bits need to be touched first (the ones that are both mission critical and broken), and which bits you can easily isolate.

You will find it very very hard to show concrete progress if you try to change all of this code at once, in a global way (for example by tidying up every single reference to a db to use a new db interface, or things like that). If you do, you'll never reach your final goal, and end up spending months tidying up without actually delivering value to the business.

jacquesm11y ago

> The final goal is to re-do what these lines do

That is quite possibly a huge mistake. (And a very costly one too!)

jacquesm11y ago

At a guess, understanding and ability to fix/change, that's the usual with projects like these, bringing the code back into a state of maintenance which could be described as 'under control'.

jacquesm11y ago· 4 in thread

Callgraph. Then document the larger chunks, working your way down.

It's like having a map versus having no map at all.

And 100M lines? Are you sure there is no code generator at work here?
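
To get the "map" out of a method-level call graph, one option is to collapse it to package level and document the heaviest edges first. A rough sketch, assuming the edges are already available as (caller, callee) pairs of fully qualified `package.Class.method` names; the names below are hypothetical, and nested classes would need extra handling.

```python
from collections import Counter


def package_of(method: str) -> str:
    """'com.acme.billing.Invoice.total' -> 'com.acme.billing'."""
    # Drop the method name and the class name from the right.
    return method.rsplit(".", 2)[0]


def package_graph(call_edges):
    """Collapse (caller, callee) method pairs into weighted package edges."""
    edges = Counter()
    for caller, callee in call_edges:
        src, dst = package_of(caller), package_of(callee)
        if src != dst:  # calls inside one package are left off the map
            edges[(src, dst)] += 1
    return edges


# Hypothetical method-level edges, made up for illustration.
method_edges = [
    ("com.acme.web.OrderPage.submit", "com.acme.billing.Invoice.total"),
    ("com.acme.web.OrderPage.render", "com.acme.billing.Invoice.total"),
    ("com.acme.billing.Invoice.total", "com.acme.billing.TaxRules.rate"),
    ("com.acme.billing.Invoice.save", "com.acme.db.Orders.insert"),
]

for (src, dst), weight in package_graph(method_edges).most_common():
    print(f"{src} -> {dst}  ({weight} calls)")
```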

user1241320OP11y ago

It's code that's been developed and it's been running for decades, I'm afraid.

jacquesm11y ago

How does that stop you from making a callgraph?

(On the off chance that you don't know what that is: http://en.wikipedia.org/wiki/Call_graph)

Satoshietal11y ago

Java was apparently introduced in 1995, and thus has not quite been around for "decades". Close, but 19 years is not quite "decades". LOL

Presumably somebody didn't write 100M lines of code on Day 1.

venkyk11y ago

developed in java and running for decades? how long has the language been around?

andrewljohnson11y ago· 3 in thread

The code length has to be overstated, by including libraries, generated files, or data files. Is any real code that long?

I bet the core java code the team actually wrote is 2 orders of magnitude smaller.

pacofvf11y ago

Yeah, the LOC count at one of the biggest banks in the world is close to 10M (including: Trading [eq & ficc], Credit, Mortgages, Wholesale, Wealth Management, Quant, Risk, and also UIs, tests, reports, recons, interfaces). I don't think a task like that was given to a single person; usually, with a code base that size, a big IT consulting company is hired.

Satoshietal11y ago

99M loc for dealing with timezones. Before it was a library call.

jacquesm11y ago

And they knew what they were doing.

mattgibson11y ago· 3 in thread

Just curious: what application needs 100M lines of anything?

user1241320OP11y ago

It's not just one application. It's many of them.

eCa11y ago

I agree with kubiiii.

Since a rewrite is in the cards (hopefully) there is something wrong with the entire system.

* Identify where the applications interact with each other.

* Identify the most problematic applications.

* Rewrite those (starting with the smallest) while trying to keep the interfaces between applications constant.

And involve end users as much as possible.

kubiiii11y ago

Why not (sounds easy eh?) go one application at a time? Are there interactions between apps? Maybe you can start with documenting all the interaction (~ api). Then you'd go deeper in each app.

michaelvkpdx11y ago· 2 in thread

With a codebase like that, it's better to look at it through the users' eyes, rather than trying to reverse engineer the business from the code. Things that look like bugs in the code may actually be features for the users, or may have been absorbed so long ago that they've fundamentally changed the nature of the business.

You don't need to understand the whole codebase. It will take years. Best to focus on what the users need and analyze small chunks. If it's truly 100M lines, there's not going to be any semblance of consistency in the code.

You can also slap New Relic on it and you may be amazed at what you learn, right away.

Don't waste too much time trying to understand all the code. Focus on a couple of issues first, form some hypotheses, and then see how well your understanding of the code fits the bigger picture. Refactor and repeat.

user1241320OP11y ago

Any human line-by-line/application-by-application analysis is (for this particular discussion) out of the scope.

The size of the thing and the way we thought we were going to work is quite different.

For instance, suppose we produce ASTs for all the routines/pieces of logic/you_name_it. We wanted to then find similar patterns or clusters that would give us hints, so we could work in a "pareto-like" way.

As already stated, it's not ONE project; it's an old (but still running), poorly documented codebase produced over decades around this big firm we work for.

jerven11y ago

Don't try to figure out how the code does what it does yet. Figure out what systems exist inside it:

  1.  What kind of modules?
  2.  Which servers/hardware?
  3.  Which databases/datastores?
  4.  What systems talk to what?
  5.  What test systems exist or existed?
  6.  Which APIs/frameworks were used?
  7.  Who is currently working on them/maintaining it?
  8.  Is anyone left who used to?
  9.  Why is a rewrite on the table?
  10. Is there any way you can work on smaller pieces at a time?
  11. What are the pain points of the current users (will tell you what area to focus on)?
  12. Can you document what comes in and out?

In my experience with such large code bases, there is never one way to do things. E.g. I once worked on a smaller system with 4 ways to talk to the same database. On one with 100 million lines I would expect even more ways to Rome ;)

If you do want to go down the static analysis path, start with existing tools before trying to build your own. If needed get external help for this.

100 million lines of code is not so bizarre. The project I work on is currently about 300,000 lines and a project some 300 times larger is quite imaginable for me.

rashthedude11y ago· 2 in thread

What type of application has 100 million LOC? Windows 7 has 40 million lines of code, so I'm wondering what type of application/software it is.

gerhardi11y ago

I have quite a strong feeling that certain complex automated systems within the financial services / insurance domain could reach those LOC levels, once you include all the frontend side, internal backend logic, possible web services, internal tools, tests, tens to hundreds of interfaces to different kinds of external services, report generation, libraries, etc.

user1241320OP11y ago

BINGO!

radicalbyte11y ago· 2 in thread

Do you and your team have experience of Java development?

Your question sounds like something that someone with either no real experience and/or no experience of an object-oriented language would ask.

100 million lines is a lot of code. Why do you need to "parse it to extract the AST"? That's crazy.

Do you have the original design documents and architectural documentation? If you do, read it.

klibertp11y ago

Downvoted.

Even if the design docs existed, it would take months to read them, without any guarantee that they correspond to reality.

Meanwhile, automated analysis of actual code can give you at least high-level overview of the codebase and maybe a hint where to start digging. Getting AST is a first step required for most automated tools to do their work.

EDIT: I acted too rashly and downvoted your post before I realized what we are really talking about. Sorry about this. I am still convinced that automatic, static analysis of the code is the way to go, but you obviously don't deserve a downvote for having a different opinion. I'll try to make it up to you by being more careful in the future :)

radicalbyte11y ago

Hahaha, no problem.

Static Analysis would be my second step, but first I'd have a look at the architectural documentation. I can't imagine that a project of this size wouldn't at least have a Powerpoint explaining the structure and concepts of the code.

Then it's time to start using tools.

darrelld11y ago· 2 in thread

Start at the main function. See how things get set up and walk through the code from there. Keep notes on the structure and flow of things (if any). There isn't really an easy way to do this unless it had been documented properly before.

AstroGrep is a good Windows-based tool that allows you to search within files, so you could use it to find which files spit out a particular output to screen.

Not sure what you mean by using ASTs though.

jacquesm11y ago

AST = abstract syntax tree.

With 100M lines your process will take many years.

eklavya11y ago

Am I missing something or are you saying that making an AST like a compiler will help you understand a huge codebase better and faster?

bilalhusain11y ago· 2 in thread

Please elaborate. Why is there a need for extracting ASTs? Is it a single 100M-line Java source file? AFAIK Java has limitations on method size. If the code is already organized into files and methods, try to come up with some sort of UML representation. (I am assuming you are trying to understand the code base, not profiling or doing code analysis.)

MaxBarraclough11y ago

Seconded.

> I have to analyse [...]

> We've started parsing it and tried to work on extracting abstract syntax trees and all that.

Why? How will this help? What are you really after?

user1241320OP11y ago

Not a single line. The whole Java codebase.

logn11y ago· 1 in thread

Source to UML: http://www.architexa.com/

Getting call paths: https://github.com/gousiosg/java-callgraph

Line coverage from instrumented jars: http://emma.sourceforge.net/

For this type of request, I'd push back and say, let's identify very small parts of this and begin rewriting those one at a time in an isolated project. Kind of an agile rewrite that will combine the legacy project with the slowly rewritten one. Use the tools to identify parts of the project that can be isolated. Build new interfaces or services to let the old project communicate with the new one. Get a history of the source repository to see where recent edits are and prioritize those to be rewritten first (presuming they want a rewrite to lower maintenance costs).

ramon11y ago

nice links! :)

cyrillevincey11y ago· 1 in thread

Before rebuilding any piece of software from scratch, I would give a serious look at this amazing bunch of wisdom: http://www.joelonsoftware.com/articles/fog0000000069.html

jebblue11y ago

Great article, it hits at the core point. There's an old saying, don't throw the baby out with the bathwater.

mml11y ago· 1 in thread

Funny, I have 50,000 individual Java apps to analyze. I started with a copy/paste detector. Pmd has a free one. Good luck!
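
To show what a copy/paste detector does at its core, here is a toy line-based version: slide a window over the normalized lines of every file and report windows that occur in more than one place. This is not how PMD's CPD is implemented (real detectors compare token streams and scale far better); the file names and snippets are made up.

```python
from collections import defaultdict


def find_duplicates(files, window=3):
    """Map each repeated run of `window` normalized lines to its locations.

    files: dict of file name -> source text.
    Returns {tuple_of_lines: [(file_name, first_line_number), ...]} for runs
    that occur at least twice.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep original line numbers, but ignore indentation and blank lines.
        lines = [(i, line.strip()) for i, line in enumerate(text.splitlines(), 1)]
        lines = [(i, s) for i, s in lines if s]
        for start in range(len(lines) - window + 1):
            chunk = tuple(s for _, s in lines[start:start + window])
            seen[chunk].append((name, lines[start][0]))
    return {chunk: locs for chunk, locs in seen.items() if len(locs) > 1}


# Hypothetical Java snippets, made up for illustration.
file_a = "int total = 0;\nfor (Item i : items) {\n    total += i.price();\n}\nreturn total;\n"
file_b = "log.info(\"start\");\nint total = 0;\nfor (Item i : items) {\n  total += i.price();\n}\n"

duplicates = find_duplicates({"A.java": file_a, "B.java": file_b})
for chunk, locations in duplicates.items():
    print(locations, "|", " ".join(chunk))
```

At 100M lines you would want an existing tool for this, but the toy version makes it clear why whitespace changes are caught and renamed variables are not.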

tcopeland11y ago

Wow, a blast from my distant past... a link to the copy/paste detector:

http://pmd.sourceforge.net/pmd-5.1.3/cpd-usage.html

dugmartin11y ago· 1 in thread

Understanding the "shape" of a codebase is something I've always been interested in and I started building a tool to help me understand and traverse code here:

http://sherlockcode.com/

However I don't think it would scale to 100M lines of code. I have run Linux through it and it was acceptable (both in run times and browse times). At 100M lines of code you need some way to see an overall "map" of the codebase and then drill in to the bits you are interested in. Just linking via symbols like SherlockCode does is too micro of a view.

There are a lot of interesting visualization tools out there both commercial and academic. I don't have any Java specific ones to recommend but a quick Google search for "java code visualization tools" shows a lot of promise.

dprice111y ago

This thing seems like a good start but I have a bug report for you. For me at least, it's a deal-breaker. When browsing a source file, (firefox 32.0 on macos) pageup/pagedown/spacebar and up/down arrows do not scroll the code, even when the code pane has focus. Pressing any of these give focus to the search box. I need to be able to use keyboard navigation at least for scrolling.

dmead11y ago· 1 in thread

why would you need the ast?

if given a large chunk of code to maintain i'll usually run doxygen on it to generate the xmlish kind of chart that it makes. at least it gives me a roadmap to start, but it's not super great.

jebblue11y ago

I used this on some projects, works well. Building with the Dot diagrams is a good visualization, in the final HTML documents choosing Classes | Class Hierarchy shows them.

EtienneK11y ago

1) Focus on the functional use-cases and not code.

2) Identify integration points to other systems and ask why they are there

3) Realize that a "big-bang" rebuild never works and that it's better to break up the system into smaller pieces and replace them piece by piece.

sp33211y ago

As a first pass, try deleting as much code as possible :) If there are files or whole projects that aren't needed anymore, they're just slowing down your analysis. Also some dead-code analysis could be helpful, at least in broad strokes. You could instrument the code with a test coverage tool, then run the code instead of the tests to see what code gets reached.

Edit: You could also look for duplicated code, and quickly refactor that to just be in one place.

xradionut11y ago

Here's a suggestion I haven't seen: Unless you have full management support, a skilled team, valid business reasons for this conversion, and expectations of succeeding, consider moving to another company/job.

You've been given the task of digital archeology/septic cleanup. Unless you like the tedium and stank, it's not going to bode well...

myang11y ago

A very rough estimate: assume you have 10 experienced developers on the team, and each can read and comprehend 1,000 lines of code per hour. Given a 10-hour workday, the team can digest 100,000 lines of code per day. Just reviewing the code will take 1,000 working days, about 2 years and 9 months even with no days off. Not sure how much time you have and when you would be expected to deliver the final product. On the other hand, if you can find out the use cases and even take a look at the current product, you may not have to review the source code at all and can just go ahead implementing the features.
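
The estimate above is easy to check and to re-run with your own numbers. In the sketch below the team size and reading speed come from the comment; the 250 working days per year is an extra assumption added for comparison.

```python
def review_days(total_lines, developers, lines_per_hour, hours_per_day):
    """Working days needed for the team to read every line once."""
    lines_per_day = developers * lines_per_hour * hours_per_day
    return total_lines / lines_per_day


days = review_days(100_000_000, developers=10, lines_per_hour=1_000, hours_per_day=10)
print(days)        # working days of pure reading
print(days / 365)  # years if every calendar day is a workday
print(days / 250)  # years at roughly 250 working days per year (assumption)
```

With weekends and holidays counted, the same 1,000 working days stretch to about four calendar years, before anyone has written a line of the replacement.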

FollowSteph311y ago

I don't think you can just do a cold re-write of that size without domain knowledge. I would first try to refactor the existing system just to reduce the code size. That big a system probably has horrific code and you can easily shrink it quickly. Just finding duplicate code will have an impact. So will replacing hand-rolled code, such as file utilities, with open source libraries.

Basically I would first try to reduce the size of the problem while trying to get domain expertise. I wouldn't consider a rewrite at this stage...

Koziolek11y ago

1. Configure Jenkins builds
2. Add PMD for code analysis
3. Add Sonar for code analysis (it has different rules than PMD)
4. Use Archeology 3d (https://github.com/pslusarz/archeology3d) to visualize your code stats.

But before you start just pray to Omnissiah (http://warhammer40k.wikia.com/wiki/Machine_God).

sgt10111y ago

First thing is to go find the key users (start from the CEO and work down) and find out what it does that is important to them. Map that.

Find anyone technical who is still around and can talk sensibly about it and find out what they think is important. Map that.

Use anything automated to map what it's up to (calling...) and find out where the core of it is.

You may know what is important by this time, you will be able to make some sort of start...

weinzierl11y ago

Large projects tend to accumulate lots of unused cruft. Coverage tools like emma/jacoco can help, but I have also successfully used UCDetector [1]. It's not bulletproof, but it helped me a lot when analyzing code to remove the unused parts, even if there are false positives sometimes. [1] http://www.ucdetector.org/

fiatmoney11y ago

The hardest part is often figuring out what the inner loop actually looks like. The best way to find it is to hook up a profiler, and look at a bunch of stack traces. That'll let you find the most common entry points and calling patterns, which will go a long way towards understanding it.
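
A sketch of what "look at a bunch of stack traces" can mean in practice: once you have sampled stacks (from a profiler or periodic thread dumps), count the outermost frames to find entry points and count adjacent frame pairs to find common calling patterns. The frames below are hypothetical. Java thread dumps list the innermost frame first, so reverse each stack before passing it in.

```python
from collections import Counter


def summarize_stacks(stacks):
    """Count entry points and caller->callee pairs across sampled stacks.

    stacks: list of stack traces, each a list of frames ordered from the
    outermost frame (entry point) to the innermost (currently running).
    """
    entry_points = Counter()
    call_pairs = Counter()
    for frames in stacks:
        if not frames:
            continue
        entry_points[frames[0]] += 1
        for caller, callee in zip(frames, frames[1:]):
            call_pairs[(caller, callee)] += 1
    return entry_points, call_pairs


# Hypothetical samples, made up for illustration.
samples = [
    ["BatchJob.run", "Ledger.post", "Db.write"],
    ["BatchJob.run", "Ledger.post", "Tax.compute"],
    ["HttpHandler.handle", "Ledger.post", "Db.write"],
]

entries, pairs = summarize_stacks(samples)
print(entries.most_common())
print(pairs.most_common(2))
```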

mokeefe11y ago

Read this, then try to convince them not to re-build from scratch: http://www.informit.com/articles/article.aspx?p=1235624&seqN...

ufmace11y ago

A few ideas that have worked for me in the past:

Map the control flow. This code/app/whatever is doing something in production right now. What tells it to start? How does the control flow from the start point to the stuff that takes in data to the stuff that writes the output or does whatever this app does? Whatever the options for how it works are, where are they set, how do they make it into the core of the application to affect whatever it does?

Map the data flow. Input must be coming into this thing somewhere. Find where it reads it in, where it writes it out, and how it gets from one to the other, what data structures and methods it passes through on the way.

atlantic11y ago

1) determine the use cases of the different applications involved (ie, what is each one used for, how does it fit into the company's workflow)

2) treat each app as a black box, understand the major data flows involved (which data sources is it interacting with? is it doing reads or writes? which tables)

3) treat each app as a black box, and try to understand the interactions between each app and any external components (other apps, web services, etc)

4) identify the overall architecture, determine the class hierarchy for each app, identify the major classes and functionality

By this stage you should be ready for a rewrite. At no point do you need to go into the code in any great depth.

mping11y ago

There is no programmatic way of doing this. You need to have guys with domain expertise to help you through. Obviously, the effort to fix/migrate will always be proportional to the time it took to create such a mess.

This is 100MM lines, it never will be easy. I would take some time to create the tooling to do this. Say, create a tool to add some bytecode to generate a pretty callgraph. Then, I'd run the use cases or functionalities individually, and save the callgraph somewhere. But in the end, you will always need domain knowledge expertise to guide you through the logic of it.

aragot11y ago

From the "decentralized web"/Agile spirit: Keep the original app online, separate it into several functional domains, and replace them progressively, month after month. This way, each iteration is a small manageable chunk, functional experts can have a complete understanding of their own scope, and the result is a set of independent scalable webapps with a clearly defined scope.

... assuming you have webapps.

PeterisP11y ago

Divide and conquer.

Find a way to split it into something like 10 pieces of 10m LOC each in a way where you can understand and [re-]document the data and control flow between them.

Repeat with further subdivision, as much as you have people.

Then, if you really need to re-build this from scratch, do it per component - first, make automated integration tests for its functionality, and only then attempt to rebuild that part of the system.

joshdance11y ago

I don't know if you want to re-write it. Having to identify all the use cases for a system that large will be horrific. Especially since people have built onto the broken ways. And a large customer will require that it be exactly like it was before, so you have to recreate the old broken way and the new shiny way. Who decided it needed to be re-written?

markc11y ago

Several years ago I saw an impressive demo of an analysis and refactoring tool for large Java codebases called SonarJ (now Sonargraph) by hello2morrow. There are a few other tools in this category (jdepend, agilej, jarchitect). They can give you dependency graph visualizations to help untangle the spaghetti and grok the higher-level structure.

V-211y ago

As for your first step - with a project that size (and thus of substantial age, I assume), there's surely tons and tons of dead code. I'd throw it away first. Slim the thing down. The very process of identifying unused code will already familiarize you roughly with the conceptual "shape" of the application

ramon11y ago

If you're a UML guy here's a great option http://www.altova.com/umodel/uml-reverse-engineering.html

Igglyboo11y ago

What are you actually trying to accomplish? "Analyze" is a broad term.

bettynormal11y ago

With 100 million lines of code I would:

1. Find out what is still used, remove the rest.
2. Split code into standalone supportable units (applications, libraries, etc.).
3. Rank units in order of new requirements and what code will need to be changed.
4. Divide code between teams.
5. Get code to build, pass any tests and match the last released versions.
6. Go back to management and get them to let you hire lots of people. A person per million lines would be very low...
7. Learn code in order of need.

stuaxo11y ago

Start with a static analysis tool - it will find lots of small possible bugs, by fixing them you will get good coverage of the code and insight into the structure + it will be better at the end.

jhawk2811y ago

Look into Structure101 (http://structure101.com/) to use static analysis to see the structure of the application.

bosky10111y ago

if you can run the program, i highly recommend searching HN for strace.

eg: "whats that program actually doing. start with strace" https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s...

i've successfully reverse engineered messaging protocols, written drivers for a different language, and ported large projects just by trying to see what it does over the file system, network.

~B

ramon11y ago

On Eclipse http://www.nwiresoftware.com/products/nwire-java

jebblue11y ago

jvisualvm comes with the JDK, I'd start there with profiling:

http://visualvm.java.net/profiler.html

Edit: Adding, I'd set some judicious breakpoints in the hot spot areas identified through profiling, along with some System.out.println's (or better, dump to a flat file database; SQL can work wonders for analysis, even on flat file data).
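
As an illustration of the "dump to a flat file and query it with SQL" idea, the sketch below loads trace lines into an in-memory SQLite table and asks which methods account for the most time. The `timestamp_ms,method,elapsed_ms` line format and the values are assumptions made up for the example; use whatever your println statements actually emit.

```python
import sqlite3

# Hypothetical trace lines in the assumed format "timestamp_ms,method,elapsed_ms".
trace_lines = [
    "1000,Ledger.post,12",
    "1015,Tax.compute,3",
    "1020,Ledger.post,18",
    "1042,Db.write,40",
    "1090,Ledger.post,15",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace (ts INTEGER, method TEXT, elapsed_ms INTEGER)")
rows = []
for line in trace_lines:
    ts, method, elapsed = line.split(",")
    rows.append((int(ts), method, int(elapsed)))
conn.executemany("INSERT INTO trace VALUES (?, ?, ?)", rows)

# Which methods are hit most often, and how much time do they account for?
summary = conn.execute(
    "SELECT method, COUNT(*) AS calls, SUM(elapsed_ms) AS total_ms "
    "FROM trace GROUP BY method ORDER BY total_ms DESC"
).fetchall()
for method, calls, total_ms in summary:
    print(f"{method:15s} calls={calls} total_ms={total_ms}")
```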

mschaef11y ago

You need to have a clear understanding of the point of the analysis before you analyze anything. What, specifically, does your team have to produce? How much time and how many people do you have to complete the work? If you're the leading edge of an effort to rewrite 100MLoc, my presumption is that your deliverable is mainly a 'gross anatomy' of the system... a basic description of the major structural components and how they interact with each other. If that's the case, I'd start by looking at the build scripts and the modules they build. Try to make a comprehensive list of major components. You'll get it wrong initially, but you'll need a starting point.

The next thing I'd do is take the top level list of modules and start assigning it to individual people within the team. Their responsibility is to produce some kind of top level description of how the individual modules work. A big part of this phase of the effort should be meetings or informal conversations as the per-module analysis progresses. As your team talks among itself, you should be able to find commonality between modules, communication links, etc. The key at this point is to keep it high level, and avoid getting too bogged down in the details. With this much code, there are plenty of details to get bogged down in. As a result, you'll probably have some mysteries about how the code actually works beneath various abstraction layers. Make and update a list of these 'mysteries' and keep it next to your team's list of modules. As you work through the list of modules, some of these will solve themselves, and some will be so obviously important that it's worth a detailed deep dive to really understand what's happening. Either way, there will be times that you have no idea what's going on in the codebase and you'll just have to trust that you'll figure it out later.

One final comment I'd like to make is that, as silly as SLoC is as a measure of the size of a software system, you're looking at a large software package. (Bigger than Windows, Facebook, Linux, OSX, etc.) If you take each line of code to have cost $5-10, then the system arguably cost $1B to build in the first place.

Because of the size of the system, you shouldn't expect your analysis work to be easy, fast, or cheap. Buy the tools you need to do the work. This means technical and domain training, software, hardware, process development, new staff,... basically whatever you need to make the work happen. You're at the point where long term investments are highly likely to pay off, because your scope is so large and your timeline is entirely in front of you.

I'd also highly recommend working this problem from two angles. You can understand the existing system by looking at the code, but you also need to clearly understand the system requirements from the 'business' point of view. If you're doing bottom-up analysis, then some other group needs to be doing top-down. Along those lines, you should also start thinking about deployment strategies. I highly recommend avoiding a big bang deployment of that large a system, so there will be some period of time when you're liable to be running both the 'old world' and the 'new world' systems at the same time. Think about how you want to do that...

There is lots to think about here, because this is a complex problem. Hopefully, I've given you at least a little bit to think about. Good luck.

andrewchambers11y ago

rebuild 100M lines from scratch? sounds impossible to me.

You can maybe reimplement the applications that are causing problems, one at a time.

I don't understand what you want to get syntax trees for, but it sounds like you are gonna need to store them in a database and do queries on it if there really is info that you need.
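
To make the "store them in a database and do queries" idea concrete, here is a minimal sketch. It is only an illustration under loud assumptions: a crude regex stands in for a real Java parser (a real effort would walk actual syntax trees), and the table name `decls`, the helper `index_source`, and the sample source are all made up for the example.

```python
import re
import sqlite3

# Crude stand-in for a parser: matches "class Foo" / "interface Foo" / "enum Foo".
# Assumption: a real analysis would extract these facts from proper syntax trees.
TYPE_DECL = re.compile(r"\b(class|interface|enum)\s+([A-Za-z_]\w*)")


def index_source(conn, path, text):
    """Record one row per type declaration found in `text` (hypothetical helper)."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, name in TYPE_DECL.findall(line):
            conn.execute(
                "INSERT INTO decls (path, line, kind, name) VALUES (?, ?, ?, ?)",
                (path, lineno, kind, name),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decls (path TEXT, line INTEGER, kind TEXT, name TEXT)")

# Made-up sample file, just to have something to index.
sample = """package billing;
public class Invoice {
    private enum Status { OPEN, PAID }
}
interface Payable {}
"""
index_source(conn, "billing/Invoice.java", sample)

# Once the facts are in a table, questions become ordinary SQL queries.
kinds = conn.execute(
    "SELECT kind, COUNT(*) FROM decls GROUP BY kind ORDER BY kind"
).fetchall()
print(kinds)
```

The point of the design is that the extraction runs once over the whole code base, and every later question ("which files declare the most types?", "where is this name defined?") is a cheap query instead of another full scan.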

mbrodersen11y ago

Read "Working Effectively with Legacy Code".

sgt10111y ago

Also - what is the history of getting into this mess?

DonPellegrino11y ago

check out https://github.com/facebook/pfff

clavalle11y ago

I would build a general profile of the application and then drill in as needed rather than try to grok the whole buffet at once.

If the idea is to rebuild the application I would start at the beginning: what is the input and output? What does the user see? What are the various service hooks? How are they called? When are they called? Why are they called?

Then I would look at how the overall code is organized. What modules are there? Are there core utility modules that seem to be called by everything else? What are those doing? What are the most used business function modules?
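
A rough way to find the "called by everything else" modules is to count how many files import from each package. The sketch below is a guess at one cheap approach, not a proper dependency analysis: it reads `import` statements with a regex, and the function names `package_of` and `fan_in`, the `depth` cut-off, and the `com.acme` sample sources are all invented for illustration.

```python
import re
from collections import Counter

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.m;".
IMPORT = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)


def package_of(imported, depth=2):
    """Truncate an imported name to its first `depth` segments (illustrative choice)."""
    return ".".join(imported.split(".")[:depth])


def fan_in(sources, depth=2):
    """Count how many files import something from each package prefix."""
    counts = Counter()
    for text in sources.values():
        prefixes = {package_of(name, depth) for name in IMPORT.findall(text)}
        counts.update(prefixes)  # each file counts at most once per prefix
    return counts


# Made-up sources: file name -> file contents.
sources = {
    "Invoice.java": "import com.acme.util.Strings;\nimport com.acme.db.Dao;\n",
    "Order.java": "import com.acme.util.Dates;\nimport com.acme.util.Strings;\n",
    "Report.java": "import com.acme.util.*;\n",
}
ranking = fan_in(sources, depth=3).most_common()
print(ranking)
```

Packages at the top of the ranking are candidates for the core utility modules; it will miss same-package use and fully qualified names, so treat it as a first pass only.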

Then I would look at the build process. What external dependencies are there? What are they used for? Are there modern alternatives? What about internal dependencies? Does the build process look organized and sane or a chaotic mess cobbled together over the years?

Do you have logs? What is the most utilized part of the application?
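
If the logs name the class or logger that wrote each line, a simple tally gives a first answer to "what is the most utilized part". This sketch assumes a line shape of `<date> <time> <LEVEL> <logger> - <message>`, which is purely a guess; the real format, the helper name `logger_counts`, and the sample lines are hypothetical and would need adjusting.

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time> <LEVEL> <logger> - <message>".
LOG_LINE = re.compile(r"^\S+ \S+ [A-Z]+ ([\w.$]+) - ")


def logger_counts(lines):
    """Tally log lines per logger name, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up log excerpt.
log = [
    "2014-05-01 12:00:00 INFO com.acme.billing.InvoiceService - created invoice",
    "2014-05-01 12:00:01 INFO com.acme.billing.InvoiceService - created invoice",
    "2014-05-01 12:00:02 WARN com.acme.auth.Login - bad password",
    "not a log line",
]
busiest = logger_counts(log).most_common(1)
print(busiest)
```

Log volume is only a proxy for usage (a chatty module is not necessarily an important one), so it is best read alongside the other signals.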

Then I would look at the database. What tables seem to be the most important (if you could get usage stats from a running and used application that could help, but otherwise you could look at which tables are keyed off of the most)? What data is most critical? What modules interact with that data? What tables are essential for supporting this data?
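
For "which tables are keyed off of the most", counting inbound foreign keys is one quick measure. The sketch below uses an in-memory SQLite database only so that it is self-contained; the real system almost certainly uses a different database, where the same information would come from that database's own catalog. The schema and the helper name `inbound_foreign_keys` are made up.

```python
import sqlite3
from collections import Counter


def inbound_foreign_keys(conn):
    """Count, per table, how many foreign-key columns in the schema reference it."""
    counts = Counter()
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    for table in tables:
        # Each row describes one foreign-key column; index 2 is the referenced table.
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            counts[fk[2]] += 1
    return counts


conn = sqlite3.connect(":memory:")
# Made-up schema for illustration.
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY,
                          customer_id INTEGER REFERENCES customer(id));
    CREATE TABLE payment (id INTEGER PRIMARY KEY,
                          invoice_id INTEGER REFERENCES invoice(id),
                          customer_id INTEGER REFERENCES customer(id));
    """
)
ranking = inbound_foreign_keys(conn).most_common()
print(ranking)
```

Old schemas often leave relationships undeclared, so a table with few declared foreign keys pointing at it may still be central; usage stats from a running system remain the better signal when available.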

Answering these questions will start to fill out a nice 30,000 ft view of the application and how it is actually used.

You are going to get the most bang for your re-implementation buck by identifying and replacing often used utilities (especially if they are custom built or built before a good de-facto standard was formed for that particular task) with modern, well known, alternatives. Then follow the execution path of the most often used modules and the modules that work with the most critical data and work down the list.

With a 100 million line application, you are looking at many years to understand all of it and many years to re-implement. To get anything useful in a reasonable amount of time you are going to have to boil it down as much as possible, then break what's left down into independent functional areas and tackle it an area at a time.

The code is important, but if it were me, I'd try to analyze how the users and processes work before I'd dig into the nitty gritty of the code too much if at all possible. I'd build the smallest functional unit from what I deem to be the most important and critical module(s) trying to cut as much cruft from the application and database as possible. I'd get users and processes to start banging on the new app as soon as possible. I'd keep the old application up and running and available to analyze (not for the users but for the developers and analysts) as the team works down the most often used parts. I would not try to analyze the whole mess in one go beyond finding waypoints as described above. If possible I'd also try to get users to understand that the old way is not necessarily the right way. Much pain has been caused trying to make new systems work exactly like the old systems when the new systems don't face the same constraints. It is just too tempting to say 'make it work like it did'.

mathusan11y ago

box

hartikainen11y ago

You could try to reduce the lines of code by removing all the useless whitespace.
