Readings in Database Systems, 5th Edition (2015)

(redbook.io)

225 points | kediz | 5y ago | 30 comments

pixelmonkey 5y ago

Michael Stonebraker has an interesting set of conclusions in his assessment of the MapReduce vendor market in 2015 from the "Dataflow" chapter here:

"- Just because Google thinks something is a good idea does not mean you should adopt it.

- Disbelieve all marketing spin, and figure out what benefit any given product actually has. This should be especially applied to performance claims.

- The community of programmers has a love affair with “the next shiny object”. This is likely to create “churn” in your organization, as the “half-life” of shiny objects may be quite short."

sradman 5y ago

I reread DeWitt and Stonebraker’s (D&S) MapReduce criticism [1] and I still find it misguided 12 years later.

Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause. This mimics the Extract and Transform stages in a SQL ETL pipeline. The Extract is implied by the input format.

Reduce() is very much equivalent to a user-defined Aggregate Function. D&S accurately criticize the sub-optimal materialization of intermediate data sets, but they underappreciate the implicit input split and distributed sorting mechanism which dominated the Terasort benchmark at the time (a Jim Gray creation).
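
To make the analogy concrete, here is a minimal single-process Python sketch of it (an illustration only, not code from the D&S post or from any real MapReduce framework). The names `map_words`, `reduce_count`, and `run_mapreduce` are hypothetical. The mapper behaves like a table function (one input record yields zero or more rows), the grouping and key sort stand in for the implicit shuffle, and the reducer behaves like an aggregate function over each group.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Table-function-like step: one input record yields zero or more (key, value) rows."""
    for word in line.lower().split():
        yield word, 1


def reduce_count(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Aggregate-function-like step: folds all values of one key into a single row."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    # Shuffle: group every mapper output row by its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Visiting keys in sorted order stands in for the distributed sort
    # that a real framework performs between the map and reduce phases.
    return [reducer(key, groups[key]) for key in sorted(groups)]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_mapreduce(lines, map_words, reduce_count)
print(result)
```

In this sketch the grouping is done by the driver, not by the mapper or the reducer, which is the part of the model the comment describes as implicit.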

On-premises commodity Hadoop clusters lost out to public Infrastructure-as-a-Service clusters. None of the five takedown categories turned out to be important. The tools have evolved and cloud-native data warehouses and ETL systems are now the best of both worlds.

[1] https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_...

mrits 5y ago

"Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause."

No, the projection doesn't remove redundancy in most cases. There also isn't any reason you couldn't have UDFs in the GROUP BY clause. I've written implementations of both and I think GROUP BY is an excellent comparison for understanding Map in MapReduce systems.
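
As a rough illustration of a UDF in a GROUP BY clause, here is a small sketch using Python's built-in `sqlite3` module. The table `words` and the function `first_letter` are made up for the example; SQLite is used only because it is easy to run, not because the comment refers to it.

```python
import sqlite3


def first_letter(word: str) -> str:
    """Hypothetical user-defined function: derive the grouping key from a row."""
    return word[0].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it with one argument.
conn.create_function("first_letter", 1, first_letter)

conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("apple",), ("Avocado",), ("banana",), ("cherry",), ("Blueberry",)],
)

# The UDF computes the key; GROUP BY does the grouping and COUNT(*) aggregates.
rows = conn.execute(
    "SELECT first_letter(word) AS k, COUNT(*) "
    "FROM words GROUP BY first_letter(word) ORDER BY k"
).fetchall()
print(rows)
conn.close()
```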

ramraj07 5y ago

When you say cloud-native data warehouses, do you mean things like Snowflake/Redshift/BigQuery or something else? As part of an org making the transition from Spark to these, I can definitely agree that these tools are better suited for practical data engineering at medium-big-data scale (anything not Google/Facebook).

jrumbut 5y ago

I think it's a bit of a shame that the MapReduce concept got the shiny object treatment since I thought it was a nice pragmatic approach to a useful set of problems that are faced all the time and often addressed with ad-hoc programs that make a mess.

People always looked down on those that used Hadoop or somesuch for <1GB of data, but while it wasn't needed from a technology perspective it gave a structure to the project.

Now many places are back in the world of one-off scripts, and I think something of value was lost (even if it was a little ridiculous to fire up a cluster for something Excel or SQLite could handle).

throwaway_pdp09 5y ago

> People always looked down on those that used Hadoop or somesuch for <1GB of data, but while it wasn't needed from a technology perspective it gave a structure to the project.

What 'structure'? Why is it so important that it makes it worthwhile firing up a large, complex framework? I'm beyond baffled.

o1lab 5y ago

>> (even if it was a little ridiculous to fire up a cluster for something Excel or SQLite could handle)

I know the above comment will get lost, but this is such a genuine truth.

exdsq 5y ago

> The community of programmers has a love affair with “the next shiny object”. This is likely to create “churn” in your organization, as the “half-life” of shiny objects may be quite short.

This is an interesting thought. A company uses shiny tech because programmers like using it, for whatever reason. This attracts employees who want to use this tech too. The half-life of shiny tech is short, so these developers move on to shinier pastures. I wonder if this explains why people change jobs so often in tech? I’m sure I read the average tenure is much lower (~1.5 years) compared to other industries.

dinosaurdynasty 5y ago

I'm pretty sure it's the raises people tend to get jumping companies compared to what they get if they stay at a company.

atombender 5y ago

Well, he's absolutely right. MapReduce apparently didn't last that long at Google — it's long been supplanted by other technologies internally.

century19 5y ago

But MapReduce has long been superseded by Spark outside of Google, right?

vikiomega9 5y ago

Which ones?

i0exception 5y ago

This needs a [2015] tag.

The thing that makes the redbook special in my opinion is that the editors have been able to apply their research to solve actual problems for paying customers! You don't get to see enough of that in academia.

chmaynard 5y ago

> This needs a [2015] tag.

That's debatable. It's a book, not a blog post.

asah 5y ago

Technically it's a book but the structure is actually more like a series of blog posts...

razornova 5y ago

Where would one start if one wanted more up-to-date material on this domain?

willvarfar 5y ago

It is up to date. Things haven't changed substantially, and they probably won't change soon either. There's nothing in the book that you'll have to unlearn or avoid applying.

It's an interesting book in that 2015 was in the middle of the NoSQL hype. Since then, people have started looking for results and being more critical.

There's a gazillion technologies that we could list that are newer, and claims that any of them are the next big thing and will fundamentally change everything are, obviously, exaggerated.

asavinov 5y ago

You might look at the concept-oriented model [1], which is a major alternative to set-oriented approaches (including RM and MapReduce). In short, instead of viewing data processing as a graph of set operations, this approach treats it as a graph of operations on functions, which makes many data modeling/processing tasks simpler and more natural in comparison to the conventional purely set-oriented approach.

[1] Concept-oriented model: Modeling and processing data using functions: https://www.researchgate.net/publication/337336089_Concept-o...

(Disclaimer: I am the author)

omginternets 5y ago

Relatedly: I’ve been trying to wrap my head around MVCC (I’d like to write my own implementation). Any recommendations for a thorough overview of the subject?

oftenwrong 5y ago

https://vladmihalcea.com/how-does-mvcc-multi-version-concurr...

http://www.interdb.jp/pg/pgsql05.html

omginternets 5y ago

Thanks! The second link is really excellent (as is the first, but I've already read it).

I also found this paper in the refs, which seems really good. [0]

[0] https://drkp.net/papers/ssi-vldb12.pdf
