Which isn't to say that's a bad idea...
Google Refine is nice for cleaning up and pre-processing data files before exporting them elsewhere for analysis.
Still, the Unix toolset (awk, grep, sort) beats both for most tasks and for huge data sets.
That's when the export/import processing steps feature comes in handy.
Installation was a breeze. I couldn't find any instructions, but it was as simple as downloading the Linux build, extracting it, then running the shell script.
http://code.google.com/p/google-refine/downloads/detail?name...
The application automatically opens in a new Chrome window.
From here, I grabbed a data dump from one of our external providers.
We work with a lot of providers who are really technologically challenged. I'd love to be able to say, "Here you are, here is our API, start pushing your content to us." But in practice they don't even know what their XML feeds do. We need their data, but getting a consistent dataset from them is a pain when they seem to change their format regularly! And when we're importing only 10 or so items at a time, it's excruciatingly painful.
Today I learnt how easy that can be with Google Refine!
It focuses on more mechanical transformations but has the ability to save the steps to a program which you can then use in a process pipeline.
(Disclaimer: I haven't played with it in a few months, so this is from memory.)
Tip: GR can have a bit of a wobble from time to time; restarting the process will usually sort things out.
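To make the "save the steps and reuse them in a pipeline" idea concrete, here is a minimal Python sketch. It assumes the steps have been exported as a JSON list of operations (as far as I recall, Refine offers this from the Undo / Redo tab). The operation names and fields below (core/column-rename, core/column-removal, oldColumnName and so on) are written from memory and should be treated as assumptions, and apply_operations and clean_csv are hypothetical helpers, not part of Refine. The sketch only replays two simple operations on CSV rows; it is not a substitute for applying the history inside Refine itself.

```python
import csv
import io
import json

# Hypothetical operation history, shaped like the JSON that Refine exports.
# The field names are from memory and may not match a real export exactly.
OPERATIONS_JSON = """
[
  {"op": "core/column-rename", "oldColumnName": "Title ", "newColumnName": "title"},
  {"op": "core/column-removal", "columnName": "internal_id"}
]
"""


def apply_operations(rows, operations):
    """Replay a small, hand-picked subset of operations on a list of dict rows."""
    for op in operations:
        kind = op["op"]
        if kind == "core/column-rename":
            old, new = op["oldColumnName"], op["newColumnName"]
            # Rebuild each row, swapping the old key for the new one.
            rows = [{(new if k == old else k): v for k, v in row.items()}
                    for row in rows]
        elif kind == "core/column-removal":
            drop = op["columnName"]
            rows = [{k: v for k, v in row.items() if k != drop} for row in rows]
        else:
            # Fail loudly rather than silently skip a step we cannot replay.
            raise ValueError(f"unsupported operation: {kind}")
    return rows


def clean_csv(text, operations_json=OPERATIONS_JSON):
    """Parse CSV text and run the saved operations over it."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return apply_operations(rows, json.loads(operations_json))


if __name__ == "__main__":
    sample = "Title ,internal_id,price\nWidget,17,9.99\n"
    print(clean_csv(sample))
```

The point of keeping the steps as data is that a provider's next dump can go through the same cleanup without anyone clicking through the GUI again. An unknown operation raises an error here, which seems safer than guessing when a provider changes their format.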
Have a quick look over the screencasts. If you're familiar with those tools you'll map the concepts pretty quickly.
Am I correct in that understanding, or did I miss the boat?
That's how good Refine is... it adds an extra, GUI-driven step to the workflow, but it's so well executed that it makes data exploration (and cleaning) effortless.
I wrote a tutorial a while back about how I used it in an investigative reporting project: http://www.propublica.org/nerds/item/using-google-refine-for...