I think there are a lot of interesting applications for using function-as-a-service platforms as executors in data processing frameworks such as this.
If you’re more interested in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...
[1]: https://github.com/Miserlou/Zappa [2]: https://serverless.com/
* Processing speed - that is, how long does it take to do that word count example on a nontrivial dataset? Something that takes hours on a local machine, vs minutes in map/reduce. Comparing local to this to e.g. Hadoop or Google BigQuery or whatever viable alternative there is.
* Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or highly resource / i/o / processing intensive operations, but, I'd love to be proven wrong.
* Actually, mostly just cost vs performance.
I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.
For a small map reduce load, say a terabyte (to replace a single MR node), how much would you estimate the aws cost would be?
As a very rough estimate, for a light-to-medium load of 1 TB, the cost would probably be in the ballpark of ~$0.50. AWS's own reference MR framework[2] (which is mostly a tech demo) quotes prices in a similar order of magnitude.
Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or take a lot of time with each record. But, for small-ish low-overhead jobs, it can pretty easily beat the pricing and hassle of using something like EMR.
[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail... [2]: https://github.com/awslabs/lambda-refarch-mapreduce/
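To make the back-of-envelope arithmetic concrete, here is a minimal sketch of how such an estimate could be computed. The two prices are Lambda's published per-request and per-GB-second rates as I recall them (check the pricing page); the chunk size, per-chunk runtime, and memory size are made-up inputs that happen to land in the same ballpark, not measurements from corral. S3 request costs, the free tier, and the reduce phase are ignored.

```python
# Lambda list prices (assumed from the public pricing page; verify before relying on them).
PRICE_PER_GB_SECOND = 0.00001667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def estimate_lambda_cost(total_bytes, chunk_bytes, seconds_per_chunk, memory_gb):
    """Rough map-phase cost: one invocation per input chunk."""
    invocations = -(-total_bytes // chunk_bytes)  # ceiling division
    compute = invocations * seconds_per_chunk * memory_gb * PRICE_PER_GB_SECOND
    requests = invocations * PRICE_PER_REQUEST
    return compute + requests


# Hypothetical workload: 1 TB of input, 100 MB per invocation,
# 3 seconds of work per chunk, functions configured with 1 GB of memory.
cost = estimate_lambda_cost(
    total_bytes=10**12,
    chunk_bytes=100 * 10**6,
    seconds_per_chunk=3.0,
    memory_gb=1.0,
)
print(f"${cost:.2f}")
```

Doubling either the memory setting or the time per chunk doubles the compute term, which is why processing-heavy jobs get expensive quickly.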
How do you deal with the 5 min (IIRC) execution time limit of Lambda?
Corral deals with this by splitting input data into small enough chunks that each chunk can be processed within the timeout -- I exposed options for setting the amount of data that each Lambda function has to process. However, if each data item requires more than 5 min of processing, then corral won't work for you.
The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.
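A rough sketch of the chunking idea described above, not corral's actual implementation: pick a split size that should finish well inside the timeout, then cut every input file into byte ranges of at most that size. The function names, the throughput figure, and the safety factor are all hypothetical.

```python
def max_split_bytes_for(timeout_s, bytes_per_second, safety=0.5):
    """Largest split that should finish within a fraction of the timeout."""
    return int(timeout_s * bytes_per_second * safety)


def plan_splits(files, max_split_bytes):
    """Cut each (name, size_bytes) file into (name, start, end) byte ranges."""
    splits = []
    for name, size in files:
        start = 0
        while start < size:
            end = min(start + max_split_bytes, size)
            splits.append((name, start, end))
            start = end
    return splits


# Assumed numbers: 300 s timeout, ~1 MB/s per function, use half the budget.
split_size = max_split_bytes_for(timeout_s=300, bytes_per_second=1_000_000)
splits = plan_splits([("a.log", 400_000_000), ("b.log", 100_000_000)], split_size)
for split in splits:
    print(split)
```

A real implementation also has to deal with records that straddle a split boundary; this sketch ignores that. It also shows why a single record needing more than the timeout cannot be rescued by smaller splits.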
If you want reliable infrastructure, first you must become a master system, database and network administrator, then you must apprentice with a mentor to become a system engineer, and finally after several decades of practicing as one, you will have enough experience and insight to become a system architect. There is no way around that, no punting will help.
The meanings of words morph over time. When developers mention serverless, everyone knows what it means in that context. Just like when someone says there is a bug in their code, no one thinks that there are roaches running around in their computer.
and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else,
When I write a program, I'm not writing assembly language. I'm also "sticking my head in the sand" about how assembly works. AWS has a whole team of people that know how to do that stuff.
hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.
AWS is probably more reliable than what you could do on prem or at a colo,
If you want reliable infrastructure, first you must become a master system, database and network administrator, then you must apprentice with a mentor to become a system engineer, and finally after several decades of practicing as one, you will have enough experience and insight to become a system architect. There is no way around that, no punting will help.
Tell that to Netflix. They host everything on AWS. They purposefully moved from an on prem architecture to AWS because they realized where their core competence was.
It's a misnomer, but it's no worse than "the cloud" or how "artificial intelligence" has come to mean anything to do with machine learning.
Isn't hiring other people who know better than you to do this kind of stuff kind of the point? Like, a lot of people's jobs are based on that idea, including almost everyone in the IT industry. I'm confused by your point. It almost looks like sarcasm. Getting some serious "Poe's Law" here.
You're even, then, because I've got the allergy to this trivial monotonous whinging about the by-now-well-understood meaning of the term "serverless". I'm not the only one.
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET
First off, when I see "map" and "reduce" I think of the functional programming/data processing equivalents of mapping, meaning to apply a function to every element in a set (like capitalizing strings or dividing everything by two or something) and reducing, meaning to iterate over a set, processing it and combining it with some accumulator (like taking a sum).
What a MapReduce framework seems to do is take these two functions and run them in parallel, splitting the data to take advantage of the independent nature of these two functions. Data can be split however is convenient, because the map function doesn't need to worry about any data other than its own, and it can run in as many processes as you can manage. Any mapped data can be put into parallel reduce processes, which can be run in any order because the order of the data shouldn't matter.
All of that I get (although if I'm wrong that might explain why I'm confused). I guess my main confusion is why the reduce function doesn't really fit with the idea that I just put forward. I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data. Perhaps the idea is that the reduce is actually just any function that can only work on post-mapped functions, or even the only one that's supposed to change state in some way?
Can anyone shed some light on my confusion? What is the reduce function actually supposed to do, if not what I just laid out?
It may be easier to think of the reduce step more like a SQL GROUP BY rather than a function of a list. The map phase emits a bunch of (key, value) pairs, and all values with the same key are processed by the same reducer function (but each key gets a new reducer, modulo implementation details).
So in your paradigm, there are many reduce functions, each starting with a null accumulated value, resulting in many outputs rather than a single one.
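A toy, single-process sketch of that model may help (this is an illustration of the general idea, not how any particular framework implements it; all names here are made up). The mapper emits (key, value) pairs, the "shuffle" groups values by key like a GROUP BY, and the reducer runs once per key with a fresh accumulator, so the output is one result per key rather than a single value.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)


def reduce_counts(key, values):
    """Reducer: called once per key, with its own accumulator starting at zero."""
    total = 0
    for value in values:
        total += value
    return (key, total)


def run_mapreduce(records, mapper, reducer):
    # Shuffle step: group every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # One reducer call per key, hence many outputs.
    return dict(reducer(key, values) for key, values in groups.items())


result = run_mapreduce(["the cat", "The dog the end"], map_words, reduce_counts)
print(result)
```

In a distributed setting the groups are spread across reducers, and each reducer typically writes its own output file, which is why you end up with more files rather than one accumulated value.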
The use of S3 ListObjects is an immediate deal breaker though: its eventual consistency can cause silent data corruption. To avoid the List, you'd need to write a file manifest somewhere that contains a list of all S3 objects. If it were me, I'd use DynamoDB and append keys to a StringSet on a single item (if you use S3 for the manifest, it needs to be a single object, which means you need to aggregate the keys first, which sounds tricky with Lambda). You'll eventually hit DDB's item size limit. If you want to avoid that, writing an item per mapper with the same hash key and a different range key might be better; then you'd do a strongly consistent query to reconstruct the manifest.
Source code: https://github.com/joyent/manta
I believe the cloud version has since been renamed to "Converged Analytics", so this is probably the same thing: https://www.joyent.com/triton/analytics