Show HN: Corral – A Serverless MapReduce Framework (opens in new tab)

(github.com)

92 pointsbcongdon8y ago36 comments

36 comments

26 comments · 6 top-level

bcongdonOP8y ago· 7 in thread

Hi HN, author here. Corral is my attempt at a performant, easy-to-deploy MapReduce. Unlike traditional frameworks like Hadoop, it uses AWS Lambda for execution and is “serverless” as a result. It was initially kicked-off by AWS adding Lambda support for Go, but draws on experience I’ve had using Lambda tools like Zappa[1] and Serverless[2] in the past.

I think there’s a lot of interesting applications for using function-as-a-service platforms as executors in data processing frameworks such as this.

If you’re interested more in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...

[1]: https://github.com/Miserlou/Zappa [2]: https://serverless.com/

Cthulhu_8y ago

I'd like to see your readme expanded with some figures:

* Processing speed - that is, how long does it take to do that word count example on a nontrivial dataset? Something that takes hours on a local machine, vs minutes in map/reduce. Comparing local to this to e.g. Hadoop or Google BigQuery or whatever viable alternative there is. * Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or highly resource / i/o / processing intensive operations, but, I'd love to be proven wrong. * Actually, mostly just cost vs performance.

I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.

quickben8y ago

Hi,

For a small map reduce load, say a terabyte (to replace a single MR node), how much would you estimate the aws cost would be?

bcongdonOP8y ago

Pricing depends a lot on how much memory your job requires[1] and how much processing each record requires -- i.e. the pricing is more sensitive to usage.

As a very rough estimate, for a light-to-medium load of 1Tb, the cost would probably be in the ballpark of ~$0.50. AWS's own reference MR framework[2] (which is mostly a tech demo) quotes prices in a similar order of magnitude.

Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or take a lot of time with each record. But, for small-ish low-overhead jobs, it can pretty easily beat the pricing and hassle of using something like EMR.

[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail... [2]: https://github.com/awslabs/lambda-refarch-mapreduce/

1 more reply

eranation8y ago

Nice! Always thought this would be cool. Few thought questions though: how do you get things like consistent hashing? Spark for example can shuffle data somewhat efficiently by sending the data to the right node / getting data from the right node by the hash key, right? How in a serverless stateless world you call a specific Serverless function instance? Assuming it can’t be done, arent you losing performance gained by data locality? Eg data has to be saved in a massive and efficient key value store? Isn’t it much slower than spark’s in memory / data locality (bring the compute to the data and not vice versa). Would love to see benchmarks on this. This is the future IMHO... well done.

optimuspaul8y ago

Nice! I've been thinking about doing this kind of thing for a while. I have long experimented with doing map reduce style work outside of things like Hadoop and gotten much better results due to being able to tune for different things much more quickly.

navaati8y ago

Hi.

How do you deal with the 5min (IIRC) execution time limit of Lambda ?

bcongdonOP8y ago

Yeah, max execution time (and max memory usage) are the main constraints of using Lambda.

Corral deals with this by splitting input data into small enough chunks that each chunk can be processed within the timeout -- I exposed options for setting the amount of data that each Lambda function has to process. However, if each data item requires more than 5 min of processing, then corral won't work for you.

The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.

Annatar8y ago· 7 in thread

Any time I read "serverless", I get a violent allergic reaction. There is no such thing, as software requires hardware to run on, and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else, hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.

If you want reliable infrastructure, first you must become a master system, database and network administrator, then you must apprentice with a mentor to become a system engineer, and finally after several decades of practitioning as one, you will have enough experience and insight to become a system architect. There is no way around that, no punting will help.

scarface748y ago

Any time I read "serverless", I get a violent allergic reaction. There is no such thing, as software requires hardware to run on,

The meanings of words morph over time. When developers mention serverless everyone knows what it means it that context. Just like when someone says there is a bug in their code no one thinks that there are roaches running around in their computer.

and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else,

When I write a program, I'm not writing assembly language. I'm also "sticking my head in the sand" about the how assembly works. AWS has a whole team of people that know how to do that stuff.

hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.

AWS is probably more reliable than what you could do on prem or at a colo,

Tell that to Netflix. They host everything on AWS. They purposefully moved from an on prem architecture to AWS because they realized where their core competence was.

Annatar8y ago

Netflix has an army of professional system and kernel engineers who optimize and engineer the OSes they run on; they just use AWS for virtual machine and datacenter capacity.

1 more reply

rockostrich8y ago

The term "serverless" means you don't need to manage the server resources. There is an "infinite" on-demand amount of resources that are available at request and they are available to run any process you want with a single command/hook/etc.

It's a misnomer, but it's no worse than "the cloud" or how "artificial intelligence" has come to mean anything to do with machine learning.

Annatar8y ago

There is no such thing as infinite server resources.

jedimastert8y ago

>punt the OS engineering and hardware design and maintenance off to someone else

Isn't hiring other people who know better than you to do this kind of stuff kind of the point? Like, a lot of people's jobs are based on that idea, including almost everyone in the IT industry. I'm confused by your point. It almost looks like sarcasm. Getting some serious "Poe's Law" here.

Annatar8y ago

One cannot design and write high quality software without throughly understanding the substrate underneath and how to engineer for it; from doing this for over 30 years, I’m telling you it’s impossible.

jessaustin8y ago

Any time I read "serverless", I get a violent allergic reaction.

You're even, then, because I've got the allergy to this trivial monotonous whinging about the by-now-well-understood meaning of the term "serverless". I'm not the only one.

amelius8y ago· 4 in thread

Why call it "serverless" if I need to provide stuff like:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET

Willson508y ago

Because you don't need to manage any servers.

jokh8y ago

Serverless, not configless.

neolefty8y ago

Serverless doesn't mean there isn't a server—it just means you don't have to manage server instances. You can treat all of AWS as your server. Which means there must be a way to separate your work from everybody else's work who is sharing AWS with you. Thus the configuration.

a_imho8y ago

buzzword compliance

jedimastert8y ago· 2 in thread

I'm a complete novice when it comes to distributed computing, so I'm not too sure I get the idea of a MapReduce framework. I'm gonna try to suss it out here as I understand it, and if someone could correct me where I'm wrong, that'd be awesome.

First off, when I see "map" and "reduce" I think of the functional programming/data processing equivalents of mapping, meaning to apply a function to every element in a set (like capitalizing strings or dividing everything by two or something) and reducing, meaning to iterate over a set, processing it and combining it with some accumulator (like taking a sum).

What a MapReduce framework seems to do is take these two function and run them in parallel, splitting the data to take advantage of the independent nature of these two functions. Data can be split however is convenient because the map function doesn't need to worry about another data than itself, and run in as many processes you can manage. Any mapped-data can be put into parallel reduce processes, which can be run in any order because the order of the data shouldn't matter.

All of that I get (although if I'm wrong that might explain why I'm confused). I guess my main confusion is why the reduce function doesn't really fit with the idea that I just put forward. I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data. Perhaps the idea is that the reduce is actually just any function that can only work on post-mapped functions, or even the only one that's supposed to change state in some way?

Can anyone shed some light on my confusion? What is the reduce function actually supposed to do, if not what I just laid out.

mayank8y ago

> I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data.

It may be easier to think of the reduce step more like a SQL GROUP BY rather than a function of a list. The map phase emits a bunch of (key, value) pairs, and all values with the same key are processed by the same reducer function (but each key gets a new reducer, modulo implementation details).

So in your paradigm, there are many reduce functions, each starting with a null accumulated value, resulting in many outputs rather than a single one.

jedimastert8y ago

Interesting. That actually makes a lot of sense. I've done a touch more reading, and I actually only just now realized that the "word count" example doesn't count all of the words in a document, but rather all occurrences of each word. That clears a lot of questions up for me.

cle8y ago

Nice!

The use of S3 ListObjects is an immediate deal breaker though, its eventual consistency can cause silent data corruption. To avoid the List, you'd need to write a file manifest somewhere that contains a list of all S3 objects. If it were me, I'd use DynamoDB and append keys to a StringSet on a single item (if you use S3 for the manifest, it needs to be a single object, which means you need to aggregate the keys first, which sounds tricky with Lambda). You'll hit a scaling limit with DDB's item size limit, if you want to avoid that, perhaps writing an item per mapper with the same hash key and a different range key might be better, then you'd do a strongly consistent query to reconstruct the manifest.

mjn8y ago

Very different technical construction, but reminds me a bit of Joyent's Manta, that also lets you run map-reduce jobs over an object store without explicit server management: https://news.ycombinator.com/item?id=5939340

Source code: https://github.com/joyent/manta

I believe the cloud version has since been renamed to "Converged Analytics", so this is probably the same thing: https://www.joyent.com/triton/analytics

j / k navigate · click thread line to collapse

36 comments

26 comments · 6 top-level

bcongdonOP8y ago· 7 in thread

I think there’s a lot of interesting applications for using function-as-a-service platforms as executors in data processing frameworks such as this.

If you’re interested more in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...

[1]: https://github.com/Miserlou/Zappa [2]: https://serverless.com/

Cthulhu_8y ago

I'd like to see your readme expanded with some figures:

I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.

quickben8y ago

Hi,

For a small map reduce load, say a terabyte (to replace a single MR node), how much would you estimate the aws cost would be?

bcongdonOP8y ago

Pricing depends a lot on how much memory your job requires[1] and how much processing each record requires -- i.e. the pricing is more sensitive to usage.

[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail... [2]: https://github.com/awslabs/lambda-refarch-mapreduce/

1 more reply

eranation8y ago

optimuspaul8y ago

navaati8y ago

Hi.

How do you deal with the 5min (IIRC) execution time limit of Lambda ?

bcongdonOP8y ago

Yeah, max execution time (and max memory usage) are the main constraints of using Lambda.

The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.

Annatar8y ago· 7 in thread

scarface748y ago

Any time I read "serverless", I get a violent allergic reaction. There is no such thing, as software requires hardware to run on,

and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else,

When I write a program, I'm not writing assembly language. I'm also "sticking my head in the sand" about the how assembly works. AWS has a whole team of people that know how to do that stuff.

hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.

AWS is probably more reliable than what you could do on prem or at a colo,

Tell that to Netflix. They host everything on AWS. They purposefully moved from an on prem architecture to AWS because they realized where their core competence was.

Annatar8y ago

Netflix has an army of professional system and kernel engineers who optimize and engineer the OSes they run on; they just use AWS for virtual machine and datacenter capacity.

1 more reply

rockostrich8y ago

It's a misnomer, but it's no worse than "the cloud" or how "artificial intelligence" has come to mean anything to do with machine learning.

Annatar8y ago

There is no such thing as infinite server resources.

jedimastert8y ago

>punt the OS engineering and hardware design and maintenance off to someone else

Annatar8y ago

jessaustin8y ago

Any time I read "serverless", I get a violent allergic reaction.

You're even, then, because I've got the allergy to this trivial monotonous whinging about the by-now-well-understood meaning of the term "serverless". I'm not the only one.

amelius8y ago· 4 in thread

Why call it "serverless" if I need to provide stuff like:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET

Willson508y ago

Because you don't need to manage any servers.

jokh8y ago

Serverless, not configless.

neolefty8y ago

a_imho8y ago

buzzword compliance

jedimastert8y ago· 2 in thread

Can anyone shed some light on my confusion? What is the reduce function actually supposed to do, if not what I just laid out.

mayank8y ago

> I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data.

So in your paradigm, there are many reduce functions, each starting with a null accumulated value, resulting in many outputs rather than a single one.

jedimastert8y ago

cle8y ago

Nice!

mjn8y ago

Source code: https://github.com/joyent/manta

I believe the cloud version has since been renamed to "Converged Analytics", so this is probably the same thing: https://www.joyent.com/triton/analytics

j / k navigate · click thread line to collapse