We've started parsing it and tried to work on extracting abstract syntax trees and all that.
Any idea would help us a great deal.
Thanks.
At 100 million lines, I'd suspect this is either an extremely large project, where a rewrite from scratch is inadvisable, or that there is a code generator at work. If it is the latter, you want to analyze the code generating source, not the end result.
Anyhow, generically, for a first contact with a new code base, code coverage tools are a good start, as is a call graph debug run of the project. It'll let you spot dead code as well as hot code (code being called at every run of the application). It'll highlight the important and non-important code parts, allowing you to read less code and get a grasp on the architecture.
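To make the hot/dead split above concrete, here is a minimal sketch. It assumes your coverage tool can export, per run, the set of methods that were executed; the function name and the method-name strings are made up for illustration. Note that "dead" here only means "never seen in the runs you recorded", not provably unused.

```python
from typing import Dict, Iterable, List, Set


def classify_methods(all_methods: Iterable[str],
                     runs: List[Set[str]]) -> Dict[str, Set[str]]:
    """Split methods into 'dead' (never executed), 'hot' (executed in every run)
    and 'other' (executed in some runs only)."""
    all_methods = set(all_methods)
    # Union of everything that ran at least once.
    executed = set().union(*runs) if runs else set()
    # Intersection: methods that ran in every single recorded run.
    hot = set.intersection(*runs) if runs else set()
    return {
        "dead": all_methods - executed,
        "hot": hot & all_methods,
        "other": (executed - hot) & all_methods,
    }
```

The more representative the recorded runs are of real production usage, the more you can trust the "dead" bucket.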
You and your team don't just have to build an understanding of the code (e.g. frameworks, patterns, DBs, etc.) but also of the app itself, so you can truly understand its purpose.
Your rewrite (hopefully?) won't also be 100M lines, so if you understand the high-level purpose of the app completely and then dive deep from there, you may (hopefully) find many places where the system can be simplified.
Are you really not exaggerating? 100mm lines of source? Yiiiikes...
At 100 MLOC the code base is probably a complete mess.
Also, Simian is probably your friend, as it will identify large chunks of duplicate code.
Source control can be your friend as well: the older the source is, the more likely it is to contain useful code. The files with the most changes will usually be where the bugs are.
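A rough sketch of the "most changed files" idea, assuming the history lives in git (the thread doesn't say which VCS is used) and that you feed in the text printed by `git log --name-only --pretty=format:`. The function names are made up.

```python
from collections import Counter
from typing import List, Tuple


def count_file_changes(git_log_output: str) -> Counter:
    """Count how many commits touched each path.

    Expects the output of: git log --name-only --pretty=format:
    (one path per line, blank lines between commits).
    """
    return Counter(
        line.strip() for line in git_log_output.splitlines() if line.strip()
    )


def most_changed(git_log_output: str, top: int = 10) -> List[Tuple[str, int]]:
    """Return the `top` most frequently changed paths with their commit counts."""
    return count_file_changes(git_log_output).most_common(top)
```

On a repository this size you would probably run it per module or per time window rather than over the whole history at once.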
Facebook (webapp): ~3*10^7 LOC
Linux Kernel: ~2*10^7 LOC
Windows XP: ~5*10^7 LOC
My impression is that a 10 million LOC codebase is relatively common... but past that, the size of the organization (company or volunteers) needed becomes a major sorting criterion.
Look for a test suite. You'll need one once you start making changes, to keep from breaking anything. If necessary, create one based on actual jobs run on the system. You want integration tests, for big parts of the system, rather than unit tests.
Once you have a test suite, start with dead code analysis. Any codebase as big as this one will have a lot of accumulated cruft that is just getting in the way. Delete it.
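For the dead code step, one simple static approach is reachability over a call graph: anything no entry point can reach is a deletion candidate. The sketch below assumes you already have the graph as a plain dict of caller to callees (how you extract it is up to your tooling) and the method names are invented. Calls made through reflection or dependency injection won't show up in a static graph, which is exactly why you want the test suite first.

```python
from collections import deque
from typing import Dict, Iterable, Set


def unreachable_methods(call_graph: Dict[str, Set[str]],
                        entry_points: Iterable[str]) -> Set[str]:
    """Return methods that no entry point can reach through the call graph."""
    seen: Set[str] = set()
    queue = deque(entry_points)
    # Breadth-first walk from every entry point.
    while queue:
        method = queue.popleft()
        if method in seen:
            continue
        seen.add(method)
        queue.extend(call_graph.get(method, ()))
    # Every method mentioned anywhere in the graph, as caller or callee.
    known = set(call_graph) | {
        callee for callees in call_graph.values() for callee in callees
    }
    return known - seen
```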
Funny thing is, often the users tell you "I used it for XYZ" and it...
...was never written for XYZ
...never DID XYZ, all the results/numbers were trash, but no one noticed
You can break the work into steps in a more agile fashion and have 3 sub-teams:
1. Figuring out use cases
2. Developers
3. Testers
The final goal is to re-do what these lines do :(
What is the order of priority of services - which services/apps are critical, and which are not very important?
Which services actually need to be rewritten and which are working just fine?
Which services have a clearly defined interface and can be rewritten?
Which tests are in place to test the existing services, and which will you have to write?
I wouldn't touch the code till you have answered those questions, and once you have those answers, having some sort of overview of code coverage etc is going to seem less important, because it will become obvious which bits need to be touched first (the ones that are both mission critical and broken), and which bits you can easily isolate.
You will find it very very hard to show concrete progress if you try to change all of this code at once, in a global way (for example by tidying up every single reference to a db to use a new db interface, or things like that). If you do, you'll never reach your final goal, and end up spending months tidying up without actually delivering value to the business.
That is quite possibly a huge mistake. (And a very costly one too!)
It's like having a map versus having no map at all.
And 100M lines? Are you sure there is no code generator at work here?
(On the off chance that you don't know what that is: http://en.wikipedia.org/wiki/Call_graph)
Presumably somebody didn't write 100M lines of code on Day 1.
I bet the core java code the team actually wrote is 2 orders of magnitude smaller.
Since a rewrite is in the cards (hopefully) there is something wrong with the entire system.
* Identify where the applications interact with each other.
* Identify the most problematic applications.
* Rewrite those (starting with the smallest) while trying to keep the interfaces between applications constant.
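The ordering in the list above (problematic applications only, smallest first) can be written down in a few lines. This is just a sketch: the `App` record, the "open problems" measure and the threshold are all hypothetical stand-ins for whatever inventory you actually have.

```python
from typing import List, NamedTuple


class App(NamedTuple):
    name: str
    lines_of_code: int
    open_problems: int  # hypothetical measure, e.g. open bug tickets


def rewrite_order(apps: List[App], problem_threshold: int) -> List[str]:
    """Names of the problematic apps only, smallest first."""
    problematic = [a for a in apps if a.open_problems >= problem_threshold]
    return [a.name for a in sorted(problematic, key=lambda a: a.lines_of_code)]
```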
And involve end users as much as possible.
You don't need to understand the whole codebase. It will take years. Best to focus on what the users need and analyze small chunks. If it's truly 100M lines, there's not going to be any semblance of consistency in the code.
You can also slap New Relic on it and you may be amazed at what you learn, right away.
Don't waste too much time trying to understand all the code. Focus on a couple of issues first, form some hypotheses, and then see how well your understanding of the code fits the bigger picture. Refactor and repeat.
The size of the thing and the way we thought we were going to work is quite different.
For instance, suppose we produce ASTs for all the routines/pieces of logic/you_name_it: we wanted to then find similar patterns or clusters that would give us hints, so we could work in a "pareto-like" way.
As already stated, it's not ONE project; it's an old (but still running), poorly documented codebase produced over decades across this big firm we work for.
1. What kind of modules?
2. Which servers/hardware?
3. Which databases/datastores?
4. What systems talk to what?
5. What test systems exist or existed?
6. Which APIs/frameworks were used?
7. Who is currently working on/maintaining them?
8. Is anyone left who used to?
9. Why is a rewrite on the table?
10. Is there any way you can work on smaller pieces at a time?
11. What are the pain points of the current users (will tell you what area to focus on)?
12. Can you document what comes in and out?
In my experience with such large code bases, there is never one way to do things. E.g., I once worked on a smaller system with 4 ways to talk to the same database. On one with 100 million lines I would expect even more ways to Rome ;)
If you do want to go down the static analysis path, start with existing tools before trying to build your own. If needed, get external help for this.
100 million lines of code is not so bizarre. The project I work on is currently about 300,000 lines and a project some 300 times larger is quite imaginable for me.
Your question sounds like something that someone with either no real experience and/or no experience of an object-oriented language would ask.
100 million lines is a lot of code. Why do you need to "parse it to extract the AST"? That's crazy.
Do you have the original design documents and architectural documentation? If you do, read it.
Even if the design docs existed, it would take months to read them, without any guarantee that they correspond to reality.
Meanwhile, automated analysis of the actual code can give you at least a high-level overview of the codebase and maybe a hint where to start digging. Getting an AST is the first step required for most automated tools to do their work.
EDIT: I acted too rashly and downvoted your post before I realized what we are really talking about. Sorry about this. I am still convinced that automatic, static analysis of the code is the way to go, but you obviously don't deserve a downvote for having a different opinion. I'll try to make it up to you by being more careful in the future :)
Static Analysis would be my second step, but first I'd have a look at the architectural documentation. I can't imagine that a project of this size wouldn't at least have a Powerpoint explaining the structure and concepts of the code.
Then it's time to start using tools.
AstroGrep is a good Windows-based tool that allows you to search within files, so you could use it to find which files spit out a particular output to screen.
Not sure what you mean by using ASTs though.
> I have to analyse [...]
> We've started parsing it and tried to work on extracting abstract syntax trees and all that.
Why? How will this help? What are you really after?
Getting call paths: https://github.com/gousiosg/java-callgraph
Line coverage from instrumented jars: http://emma.sourceforge.net/
For this type of request, I'd push back and say, let's identify very small parts of this and begin rewriting those one at a time in an isolated project. Kind of an agile rewrite that will combine the legacy project with the slowly rewritten one. Use the tools to identify parts of the project that can be isolated. Build new interfaces or services to let the old project communicate with the new one. Get a history of the source repository to see where recent edits are and prioritize those to be rewritten first (presuming they want a rewrite to lower maintenance costs).
However I don't think it would scale to 100M lines of code. I have run Linux through it and it was acceptable (both in run times and browse times). At 100M lines of code you need some way to see an overall "map" of the codebase and then drill in to the bits you are interested in. Just linking via symbols like SherlockCode does is too micro of a view.
There are a lot of interesting visualization tools out there both commercial and academic. I don't have any Java specific ones to recommend but a quick Google search for "java code visualization tools" shows a lot of promise.
if given a large chunk of code to maintain i'll usually run doxygen on it to generate the xml-ish kind of chart that it makes. at least it gives me a roadmap to start, but it's not super great.
2) Identify integration points to other systems and ask why they are there
3) Realize that a "big-bang" rebuild never works and that it's better to break up the system into smaller pieces and replace them piece by piece.
Edit: You could also look for duplicated code, and quickly refactor that to just be in one place.
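As a very rough idea of what looking for duplicated code involves, here is a sketch that hashes sliding windows of whitespace-normalised lines and reports windows that occur in more than one place. This is a naive illustration, not how any particular duplicate-detection tool works, and the function name and window size are arbitrary.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicate_blocks(files: Dict[str, str],
                          window: int = 6) -> Dict[str, List[Tuple[str, int]]]:
    """Map a block hash to every (file, 1-based start line) where that block occurs.

    Only blocks seen in more than one place are returned.
    """
    seen = defaultdict(list)
    for path, text in files.items():
        # Normalise: strip indentation, drop blank lines, keep original line numbers.
        lines = [(number + 1, line.strip())
                 for number, line in enumerate(text.splitlines())
                 if line.strip()]
        for start in range(len(lines) - window + 1):
            chunk = "\n".join(line for _, line in lines[start:start + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((path, lines[start][0]))
    return {digest: locs for digest, locs in seen.items() if len(locs) > 1}
```

It only catches exact copies after whitespace normalisation; renamed variables or reordered statements need a token- or AST-based tool.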
You've been given the task of digital archeology/septic cleanup. Unless you like the tedium and stank, it's not going to bode well...
Basically I would first try to reduce the size of the problem while trying to get domain expertise. I wouldn't consider a rewrite at this stage...
But before you start just pray to Omnissiah (http://warhammer40k.wikia.com/wiki/Machine_God).
Find anyone technical who is still around and can talk sensibly about it and find out what they think is important. Map that.
Use anything automated to map what it's up to (calling...) and find out where the core of it is.
By this time you may know what is important, and you will be able to make some sort of start...
Map the control flow. This code/app/whatever is doing something in production right now. What tells it to start? How does the control flow from the start point to the stuff that takes in data to the stuff that writes the output or does whatever this app does? Whatever the options for how it works are, where are they set, how do they make it into the core of the application to affect whatever it does?
Map the data flow. Input must be coming into this thing somewhere. Find where it reads it in, where it writes it out, and how it gets from one to the other, what data structures and methods it passes through on the way.
2) treat each app as a black box, understand the major data flows involved (which data sources is it interacting with? is it doing reads or writes? which tables)
3) treat each app as a black box, and try to understand the interactions between each app and any external components (other apps, web services, etc)
4) identify the overall architecture, determine the class hierarchy for each app, identify the major classes and functionality
By this stage you should be ready for a rewrite. At no point do you need to go into the code in any great depth.
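The black-box data flow steps above boil down to a small table of "which app reads or writes which table". A sketch of turning that into app-to-app dependencies follows; the record format and all names are assumptions, and you would fill the records in from DB logs, configs or interviews.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def data_dependencies(
        accesses: Iterable[Tuple[str, str, str]]) -> Set[Tuple[str, str, str]]:
    """Derive (writer_app, table, reader_app) triples from (app, table, mode) records.

    mode must be "read" or "write"; an app reading its own writes is ignored.
    """
    writers = defaultdict(set)
    readers = defaultdict(set)
    for app, table, mode in accesses:
        if mode == "write":
            writers[table].add(app)
        elif mode == "read":
            readers[table].add(app)
        else:
            raise ValueError(f"unknown access mode: {mode!r}")
    return {
        (writer, table, reader)
        for table in writers
        for writer in writers[table]
        for reader in readers.get(table, ())
        if writer != reader
    }
```

Each triple is a hidden interface between two apps that a rewrite has to keep working.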
This is 100MM lines, it never will be easy. I would take some time to create the tooling to do this. Say, create a tool to add some bytecode to generate a pretty callgraph. Then, I'd run the use cases or functionalities individually, and save the callgraph somewhere. But in the end, you will always need domain knowledge expertise to guide you through the logic of it.
... assuming you have webapps.
Find a way to split it into something like 10 pieces of 10m LOC each in a way where you can understand and [re-]document the data and control flow between them.
Repeat with further subdivision, as much as you have people.
Then, if you really need to re-build this from scratch, do it per component - first, make automated integration tests for its functionality, and only then attempt to rebuild that part of the system.
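One way to get those integration tests cheaply is to treat the old component as the oracle: replay recorded inputs through both the old and the new implementation and list every difference. A minimal sketch, assuming both can be wrapped as plain callables (the names here are invented):

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_mismatches(legacy: Callable[[Any], Any],
                    rewrite: Callable[[Any], Any],
                    recorded_inputs: Iterable[Any]) -> List[Tuple[Any, Any, Any]]:
    """Return (input, legacy_output, rewrite_output) for every input where the two differ."""
    mismatches = []
    for item in recorded_inputs:
        expected = legacy(item)   # the old system defines the expected behaviour
        actual = rewrite(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches
```

Not every mismatch is a bug in the rewrite; sometimes it exposes behaviour in the old system that nobody wants to keep, and that has to be decided with the users.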
eg: "whats that program actually doing. start with strace" https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s...
i've successfully reverse engineered messaging protocols, written drivers for a different language, and ported large projects just by trying to see what it does over the file system, network.
~B
http://visualvm.java.net/profiler.html
Edit: Adding, I'd set some judicious breakpoints in the hot spot areas identified through profiling, along with some System.out.println's (or better, dump to a flat file database; SQL can work wonders for analysis, even on flat file data).
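To illustrate the "dump it and use SQL" idea: the sketch below loads trace lines into an in-memory SQLite table and asks which methods eat the most time. The `timestamp_ms,method,elapsed_ms` line format is purely an assumption about what your print statements would write; adapt the parsing to whatever you actually log.

```python
import sqlite3
from typing import Iterable, List, Tuple


def load_trace(lines: Iterable[str]) -> sqlite3.Connection:
    """Load 'timestamp_ms,method,elapsed_ms' records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trace (ts_ms INTEGER, method TEXT, elapsed_ms REAL)")
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        ts, method, elapsed = line.split(",")
        rows.append((int(ts), method, float(elapsed)))
    conn.executemany("INSERT INTO trace VALUES (?, ?, ?)", rows)
    return conn


def hottest_methods(conn: sqlite3.Connection,
                    limit: int = 5) -> List[Tuple[str, int, float]]:
    """Return (method, call count, total elapsed ms), ordered by total time spent."""
    cursor = conn.execute(
        "SELECT method, COUNT(*), SUM(elapsed_ms) FROM trace "
        "GROUP BY method ORDER BY SUM(elapsed_ms) DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Once the data is in a table, follow-up questions (calls per hour, slowest single call, and so on) are one query each.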
The next thing I'd do is take the top level list of modules and start assigning it to individual people within the team. Their responsibility is to produce some kind of top level description of how the individual modules work. A big part of this phase of the effort should be meetings or informal conversations as the per-module analysis progresses. As your team talks among itself, you should be able to find commonality between modules, communication links, etc. The key at this point is to keep it high level, and avoid getting too bogged down in the details. With this much code, there are plenty of details to get bogged down in. As a result, you'll probably have some mysteries about how the code actually works beneath various abstraction layers. Make and update a list of these 'mysteries' and keep it next to your team's list of modules. As you work through the list of modules, some of these will solve themselves, and some will be so obviously important that it's worth a detailed deep dive to really understand what's happening. Either way, there will be times that you have no idea what's going on in the codebase and you'll just have to trust that you'll figure it out later.
One final comment I'd like to make is that, as silly as SLoC is as a measure of the size of a software system, you're looking at a large software package. (Bigger than Windows, Facebook, Linux, OSX, etc.) If you take each line of code to have cost $5-10, then the system arguably cost $0.5-1B to build in the first place.
Because of the size of the system, you shouldn't expect your analysis work to be easy, fast, or cheap. Buy the tools you need to do the work. This means technical and domain training, software, hardware, process development, new staff,... basically whatever you need to make the work happen. You're at the point where long term investments are highly likely to pay off, because your scope is so large and your timeline is entirely in front of you.
I'd also highly recommend working this problem from two angles. You can understand the existing system by looking at the code, but you also need to clearly understand the system requirements from the 'business' point of view. If you're doing bottom-up analysis, then some other group needs to be doing top-down. Along those lines, you should also start thinking about deployment strategies. I highly recommend avoiding a big bang deployment of a system that large, so there will be some period of time when you're liable to be running both the 'old world' and the 'new world' systems at the same time. Think about how you want to do that...
There is lots to think about here, because this is a complex problem. Hopefully, I've given you at least a little bit to think about. Good luck.
You can reimplement the applications that are causing problems, maybe one at a time.
I don't understand what you want to get syntax trees for, but it sounds like you are gonna need to store them in a database and run queries on them if there really is info that you need.
If the idea is to rebuild the application I would start at the beginning: what is the input and output? What does the user see? What are the various service hooks? How are they called? When are they called? Why are they called?
Then I would look at how the overall code is organized. What modules are there? Are there core utility modules that seem to be called by everything else? What are those doing? What are the most used business function modules?
Then I would look at the build process. What external dependencies are there? What are they used for? Are there modern alternatives? What about internal dependencies? Does the build process look organized and sane or a chaotic mess cobbled together over the years?
Do you have logs? What is the most utilized part of the application?
Then I would look at the database. What tables seem to be the most important (if you could get usage stats from a running and used application that could help, but otherwise you could look at which tables are keyed off of the most)? What data is most critical? What modules interact with that data? What tables are essential for supporting this data?
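For "which tables are keyed off of the most", counting incoming foreign keys per table is a quick first pass. The sketch below uses SQLite only because it is self-contained; your production database is almost certainly something else, where you would query its own catalog tables instead. It also assumes the schema declares its foreign keys at all, which old schemas often don't.

```python
import sqlite3
from collections import Counter


def most_referenced_tables(conn: sqlite3.Connection) -> Counter:
    """Count, per table, how many foreign keys point at it."""
    counts = Counter()
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA foreign_key_list rows start with (id, seq, referenced_table, ...).
        for _fk_id, seq, parent, *_rest in conn.execute(
                f'PRAGMA foreign_key_list("{table}")'):
            if seq == 0:  # count multi-column keys only once
                counts[parent] += 1
    return counts
```

`most_referenced_tables(conn).most_common(10)` then gives a short list of candidate "core" tables to compare against real usage stats.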
Answering these questions will start to fill out a nice 30,000 ft view of the application and how it is actually used.
You are going to get the most bang for your re-implementation buck by identifying and replacing often used utilities (especially if they are custom built or built before a good de-facto standard was formed for that particular task) with modern, well known, alternatives. Then follow the execution path of the most often used modules and the modules that work with the most critical data and work down the list.
With a 100 million line application, you are looking at many years to understand all of it and many years to re-implement. To get anything useful in a reasonable amount of time you are going to have to boil it down as much as possible, then break what's left down into independent functional areas and tackle it an area at a time.
The code is important, but if it were me, I'd try to analyze how the users and processes work before I'd dig into the nitty gritty of the code too much if at all possible. I'd build the smallest functional unit from what I deem to be the most important and critical module(s) trying to cut as much cruft from the application and database as possible. I'd get users and processes to start banging on the new app as soon as possible. I'd keep the old application up and running and available to analyze (not for the users but for the developers and analysts) as the team works down the most often used parts. I would not try to analyze the whole mess in one go beyond finding waypoints as described above. If possible I'd also try to get users to understand that the old way is not necessarily the right way. Much pain has been caused trying to make new systems work exactly like the old systems when the new systems don't face the same constraints. It is just too tempting to say 'make it work like it did'.