
sllrpr · 14y ago · 0 points

> Collaborative filtering.

What collaborative filtering algorithm are you using that requires terabytes of intermediate storage for gigabytes of input data?

I'm familiar with most approaches to CF (SVD, gradient descent, etc) and I can't think of any that require large amounts of intermediate storage.

> By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage

I can't think of a single practical situation where you couldn't do your sorting online as you progress through the data. Again, the overhead of moving the data to-and-from S3 would be greater than processing the data locally (unless Amazon's LAN is faster than a SATA bus, which is unlikely).

> The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice.

You keep attacking the author in various ad hominem ways, yet you haven't yet provided a single uncontrived example of the small input data, large intermediate data scenario that your argument relies upon.

gfodor · 14y ago · 2 in thread

My argument does not rely upon it; it was an example of one of several reasons running map reduce jobs on the AWS cloud has nothing to do with the amount of input data you are moving around. I am not going to go off into even more detail about specific jobs I run daily that generate a large amount of intermediate data because unless I paste the source code in this thread and write a paper on it I assume you won't believe me that there are in fact, in the space of "all map reduce jobs", jobs that can generate more data than they input.

If you write a trivial map reduce job using cascading that has 10 reducers and each reduce step shuffles the data on a different grouping key you will find that Hadoop alone is generating more data than you input. But again, this isn't the point. The point is the author is calling anyone using AWS for map reduce a "cargo cult" based upon an academic argument that the sole purpose of map reduce is to move computation to your data, hence if you copy your data you are missing the point. In practice, the cost of uploading your data to S3 is a footnote compared to the computational flexibility and use cases that become possible once you are able to run arbitrary transformations on that data via EMR. You keep ignoring my main point and are focused on my simplistic examples, reading way more into them than was intended.

sllrpr (OP) · 14y ago

> it was an example of one of several reasons running map reduce jobs on the AWS cloud has nothing to do with the amount of input data you are moving around

It would be an example if you had backed up your assertion that collaborative filtering required large amounts of intermediate data, but apparently you are unwilling or, more likely, unable to do this.

> I am not going to go off into even more detail about specific jobs I run daily that generate a large amount of intermediate data because unless I paste the source code in this thread and write a paper on it I assume you won't believe me that there are in fact, in the space of "all map reduce jobs", jobs that can generate more data than they input.

Even more detail? You haven't given me any detail! You've yet to give me a single example of a practical situation where a task involves much larger amounts of intermediate data than its input data. I'm asking you to back up your argument; I'm not asking for access to your source code.

> If you write a trivial map reduce job using cascading that has 10 reducers and each reduce step shuffles the data on a different grouping key you will find that Hadoop alone is generating more data than you input

If it's so trivial, why can't you give me a single practical use-case?

> In practice, the cost of uploading your data to S3 is a footnote compared to the computational flexibility and use cases that become possible once you are able to run arbitrary transformations on that data via EMR

Yes, apparently so many use-cases that you can't provide a single example of one!

gfodor · 14y ago

Apparently I am either horrible at explaining myself or you are being deliberately obtuse. The argument about intermediate data size is a sufficient but unnecessary argument to prove the author has no point to make.

First, let's show that I can write a job that can produce more data than it inputs. I have a map of user to score, and want to compute pairwise summed scores for every user pair. This is clearly O(n^2) outputs. Computing pairwise scores is a common algorithm for recommender systems. (I realize in practice you generally do not compute the entire space because it will be too slow. However, your output will be closer to n^2 than n, i.e., it will be much larger than your input.)
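
(A minimal sketch of the pairwise example, for illustration only: it runs in memory rather than as a map reduce job, and the user names, scores, and the helper names `pairwise_summed_scores` and `pair_count` are hypothetical. It only shows how the number of output rows grows relative to the input.)

```python
from itertools import combinations


def pairwise_summed_scores(scores):
    """Yield ((user_a, user_b), score_a + score_b) for every unordered user pair."""
    # Sorting the items makes the pair keys deterministic: (smaller name, larger name).
    for (a, score_a), (b, score_b) in combinations(sorted(scores.items()), 2):
        yield (a, b), score_a + score_b


def pair_count(n):
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical input: a map of user to score.
scores = {"alice": 3, "bob": 5, "carol": 2, "dave": 7}
pairs = dict(pairwise_summed_scores(scores))

print(len(scores), "input rows ->", len(pairs), "output rows")
print("1,000 users ->", pair_count(1000), "pairs")
```

With n users the full pair space has n(n-1)/2 entries, so the output outgrows the input quickly even before any pruning is applied.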

Now, this is a complicated example. A simpler example is "I want to compute aggregations on all the fields in my log file." If you have N fields, Hadoop is going to sort the data N times. I.e., you will be producing lots of intermediate data, almost certainly more than the input size, just by using map reduce (the code doing this merge sort is not your code, it's Hadoop).
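
(A toy model of that claim, not Hadoop itself: it assumes each grouping field triggers its own shuffle and that every shuffle carries the full record alongside its key. Real jobs can shrink this with combiners, projection, or compression, none of which is modeled here. The log records and the helper name `shuffle_bytes` are hypothetical.)

```python
import json


def shuffle_bytes(records, group_fields):
    """Toy model: one shuffle per grouping field, each carrying the full record."""
    total = 0
    for field in group_fields:
        # "Map" step: emit (key, serialized record); sorting stands in for shuffle/sort.
        emitted = sorted(
            (str(record[field]), json.dumps(record, sort_keys=True))
            for record in records
        )
        total += sum(len(key) + len(value) for key, value in emitted)
    return total


# Hypothetical log records with three fields.
records = [
    {"host": "a", "status": 200, "path": "/x"},
    {"host": "b", "status": 404, "path": "/y"},
    {"host": "a", "status": 200, "path": "/y"},
]

input_bytes = sum(len(json.dumps(record, sort_keys=True)) for record in records)
shuffled = shuffle_bytes(records, ["host", "status", "path"])

print("input bytes:", input_bytes)
print("bytes shuffled across 3 groupings:", shuffled)
```

Under these assumptions the shuffled volume is at least N times the input for N grouping fields; how close a real job comes to that depends on how it is written.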

But again, the point you keep missing and conveniently ignore when you quote my posts is that the point of map reduce on AWS is not about data locality (obviously) but about downstream flexibility. I can run 100 jobs, in parallel, on 10k machines, and output much more data than I input, without running a cluster of my own, and I get to pay by the hour. I am isolated from other devs and spin the machines down when finished. If you buy the argument that this is a useful feature (as any EMR customer would attest to), then this too is a separate, more pragmatic reason why the author has no idea what he is talking about when he says all EMR customers are a cargo cult.
