I just wish Yahoo would open source Everest (their multi-PB column store DB based on PostgreSQL) -- this would be ideal for building an open source Splunk competitor.
If it's sensitive data, I'd recommend just spinning up your own cluster and installing the tool.
All that said, good luck in your endeavors!
I disagree. Large software companies already exist in this space: Splunk, LogLogic, ArcSight, etc. It sounds like the author reinvented Splunk in particular. (Disclaimer: I work for Splunk and can see everything in your datacenter with a few finger presses. Call me Geordi LaForge.)
Um... what? Have the authors gotten stuck in a time vortex and been dropped off before, y'know, awk? Much less perl, or any of those new-fangled toys.
I mean: writing scripts to do log analysis is a pretty fundamental problem for server-side development, and lots of very smart people have spent the last two^H^H^Hthree decades working on tools to address the issue.
I don't even see how this (indexing the entries across a Hadoop cluster) is all that useful. In general, you don't do log analysis by asking "give me all the entries that match this pattern"; you do it by walking them in order, extracting one or two fields from each line, and building some kind of result data structure. This thing would be fine if you were asking for all the log messages that mentioned "coffee", I guess. But what if you wanted a histogram of hit counts per page per day-of-week?
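To make the "walk them in order" approach concrete, here is a minimal sketch of that histogram in Python. It assumes Apache common-log-style lines; the regex, the DAYS list, and the function name hits_per_page_per_weekday are all illustrative, not taken from any particular tool.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Apache common-log-style lines, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<page>\S+)')

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def hits_per_page_per_weekday(lines):
    """Walk the lines once, pull out two fields, and count (page, weekday) pairs."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not look like access-log entries
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[(m.group("page"), DAYS[ts.weekday()])] += 1
    return counts
```

Feeding it an open file object (for example, hits_per_page_per_weekday(open("access.log")), where the filename is a placeholder) streams the log without loading it into memory, and nothing needs to be indexed first.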
For analytics, you're right, search is only part of the equation. That's why we make MapReduce easy to use on a cluster. You can write Pig or Hive scr
We also have templates for common data formats (and ways to roll your own) so you can turn unstructured log text into structured data, so that a histogram of hit counts per day-of-week is just a few lines of a script (or maybe even a search).
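As a rough illustration of the idea (not our actual template format), a template can be thought of as a named-group regex that turns each raw line into a record. Everything below is hypothetical: the TEMPLATES table, the "simple_access" format, and both function names are made up for the sketch.

```python
import re
from collections import Counter
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical templates: one named-group regex per log format.
# Assumed line shape: "2011-03-07 12:00:01 GET /index.html 200"
TEMPLATES = {
    "simple_access": re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
        r"(?P<method>[A-Z]+) (?P<page>\S+) (?P<status>\d{3})"
    ),
}

def structure(template_name, lines):
    """Yield one dict of named fields per line that matches the template."""
    pattern = TEMPLATES[template_name]
    for line in lines:
        m = pattern.match(line)
        if m:
            yield m.groupdict()

def hits_per_weekday(records):
    """Once the data is structured, the histogram is a single expression."""
    return Counter(DAYS[date.fromisoformat(r["date"]).weekday()] for r in records)
```

Rolling your own format, in this sketch, would just mean adding another entry to TEMPLATES.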
From my perspective the main hurdle to log aggregation/correlation is not scalability. If Splunk doesn't cut it for your performance needs, you have probably hit the price point where you can afford a LogLogic or similar appliance.
Instead the barrier to entry is in the number of applications supported by a particular log archival product, and the ability to correlate across the different applications.
As I'm sure you know at this point, adding support for log types is a painstaking task. Most vendors punt on this and tell customers to do it themselves.
If there is a niche available to you as a startup I would think that it would be in offering a very low turnaround time in supporting new log types. For example: give us some log sources and we'll support and categorize your logs with our service.
As for running in the cloud on large datasets, I think you'll find that most customers are not going to want to double or triple their outgoing bandwidth, and they will also have concerns from a security compliance standpoint.
That being said, good luck in your venture. Logging is a mess, and could certainly use some clean up. :)