The iterative process is: download data (~10000 rows or more), clean it, come up with intermediate variables ("recoding"), tabulate these variables, graph them, write a draft that explains the results, see where I went wrong, rinse, repeat.
I generally start a directory per project. Within that, I create a subdirectory into which I download raw data and where I do various munging operations on it with Unix tools. I also start a file into which I paste the commands and small scripts I use (either at the command line or in the R environment). Then I just go crazy, doing analysis, generating intermediate files with weird naming conventions, saving stuff, and every once in a while cleaning my workspace of intermediate files by "mv -b ./JUNK".
At some point I decide I am done and put my tables into an Excel spreadsheet (don't hate me!) and my graphics (usually PDFs) into their own subdirectory. This is the last stop before I open Word or InDesign and start writing the report, importing tables and graphs as I build a story around the data. I save multiple versions of this, like "myreport-01.doc", "myreport-02.doc", etc., until I am finished. The final product is usually a printed or PDF'ed document, meant to be read as such.
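To make that concrete, the layout looks roughly like the sketch below. Every name in it ("myproject", "raw", "commands.log", "data.csv", "tmp-values.csv") is a placeholder made up for illustration, the toy CSV stands in for a real download, and `mv -b` assumes GNU coreutils.

```shell
#!/bin/sh
# Rough sketch of the per-project layout described above.
# All file and directory names here are hypothetical placeholders.
set -eu

base=$(mktemp -d)             # scratch location so the sketch is safe to run
proj="$base/myproject"

# One directory per project, with a subdirectory for raw data and one for junk.
mkdir -p "$proj/raw" "$proj/JUNK"

# The file into which commands and small scripts get pasted.
: > "$proj/commands.log"

# Stand-in for a downloaded raw data file.
printf 'id,value\n1,10\n2,20\n' > "$proj/raw/data.csv"

# A typical munging step: strip the header, producing an intermediate file.
tail -n +2 "$proj/raw/data.csv" > "$proj/tmp-values.csv"
echo 'tail -n +2 raw/data.csv > tmp-values.csv' >> "$proj/commands.log"

# Periodic cleanup: move intermediates out of the way. With GNU mv, -b keeps
# a backup copy if a file of the same name is already sitting in JUNK.
mv -b "$proj/tmp-values.csv" "$proj/JUNK/"
```

The weak spot is visible even in this toy version: the log records what was typed, but nothing ties an intermediate file in JUNK back to the command that produced it.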
I would REALLY, REALLY like a more systematic approach.
(If I were writing queries against a regular business database, storing them, and running them regularly, my problem might be a little easier, but the reports are completely ad hoc.)
I have wondered about using version control aggressively, working more with central data stores (I already use the workspace feature of R), and focusing more on writing code to do everything from beginning to end. But I am a little bit at my wits' end.
So ... everyone please weigh in and tell me how to streamline my process!