We reduced the AWS costs of our streaming data pipeline

(taloflow.ai)

142 points · cloudfalcon · 6y ago · 83 comments

QuinnyPig · 6y ago · 17 in thread

Hmm. This looks to me like a lot of the savings were realized by moving away from managed services into a scenario where there’s more operator overhead. The AWS bill gets lower, but what about the cost of the engineering work?

alharith · 6y ago

Does anyone else find the costs associated with running well-tested, well-developed systems overblown? Like if you know how to adjust some basic parameters, you will solve for 99% of use cases (adjust memory, adjust ram).

The examples I can think of are RabbitMQ and Cassandra. But in general, we have some really battle-tested software these days that has become simpler to configure and run over time. People seem scared to run their own these days.

owenmarshall · 6y ago

I vouched for this comment because it’s a valid point and I’m not sure why it was killed.

I happen to disagree strongly, though: lots of engineers in my experience undervalue the work of systems administrators and underestimate the effort needed to operationalize any technology.

Running your own is absolutely fine if you are willing to keep your stack small and invest time learning the tools you pick. But there are still horror stories of people thinking snapshots are backups, turning the wrong knobs and turning off fsync on their databases, ...

chrischen · 6y ago

Yea exactly and unless you are FB scale you can just run a single docker container and never really have to worry (granted you know how to use Docker).

Most small startups are actually the ones who don’t really need SaaS services.

csharptwdec19 · 6y ago

Sometimes.

But developers are part of this problem too. There's plenty of times where I see devs immediately reach for tools instead of learning just a little bit more about what they already have. My favorite example is when folks want to add a NoSQL db into the mix on top of a traditional db. Not because there's a real performance need, but because for their use case it is 'easier'. Never mind that their problem possibly could have been solved by just writing their own SQL instead of trusting a garbage ORM...

Cthulhu_ · 6y ago

This is probably a tradeoff in a lot of AWS related stuff; you pay a premium for convenience. But depending on your workload it can pay itself off fairly quickly; AWS bills can go up quite fast, whereas personnel costs are fairly predictable.

nojito · 6y ago

False equivalence. The engineer will be doing more than just cloud work.

This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money.

vageli · 6y ago

> False equivalence. The engineer will be doing more than just cloud work.

> This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money

Time is of a limited quantity and time spent managing postgres backups (for example) is time not spent doing other (possibly more meaningful/impactful _to the business_) work.

fiddlerwoaroof · 6y ago

Yes. I’ve found that the amount you have to learn to use a managed service often equals or exceeds the amount you have to learn to run something on EC2 or on-prem. The automation/management costs of AWS or equivalent are a lot higher than people think and not significantly different from the costs to learn Linux and enough networking to do an “old-fashioned” deploy.

QuinnyPig · 6y ago

I don't have a dog in this race; I'm not partnered with any cloud provider. I will say that based upon what the article discusses, they save a bunch of money on AWS Glue by... running their own ETL pipeline inside of ECS instead. What's the maintenance burden of that decision? It's certainly not zero.

acdha · 6y ago

It’s a “false equivalence” or “flawed sales tactic” to suggest planning using total costs? That’s what both engineers and business people are supposed to do - and reflexively attacking it really does not cast your motives in a good light.

brodouevencode · 6y ago

While I don't entirely disagree I think it should be made clear that both can be true, even at the same time. To spin up a 9 node managed Elasticsearch cluster load-balanced across two regions takes a competent engineer with practice roughly a couple of hours, or twenty to thirty minutes if they were smart and terraformed it out previously. Now there's a whole host of potential problems that come along with using that managed Elasticsearch cluster too (no access to "cluster mode", no tunability, etc). But if those potential problems don't apply to you and a very vanilla ES cluster suits your use case then you're fine.

Alternatively that practiced engineer could have spun up a self-managed ES cluster in a couple of DCs in about the same time, but now has the obligation to maintain those servers (patching, etc.). Maybe that marginal cost is damn near zero - chef has been deployed to all instances and enforces patching and there's already good security monitoring in place, etc. The cost of that engineer managing that box, as with a managed ES in AWS, is practically nothing.

TL;DR: as in all cases, it depends.

brodouevencode · 6y ago

Agreed - there is a tradeoff that must factor in many things: engineer competency or the ability to get competent engineers, state of the product itself (maybe Elasticsearch as a service was an interim step in a longer term vision), complexity of the managed service itself, integratability (is that a word?) into other AWS services, maturity of the managed service, and probably a few other things I'm missing.

We've seen our teams go both from managed to non-managed and non-managed to managed with relative success - to give scale, across all of our accounts we spend way north of $3 million/month at AWS so this has happened within our realm quite a few times. The short, unsatisfying answer is that _it depends_. We have an internal policy from the suits that "if there's a managed version, use it" but most of our teams are thankfully smart enough not to take that at face value and to do their own analysis.

gberger · 6y ago

> integratability

interoperability?

dan_quixote · 6y ago

My personal favorite is "Move off of AWS MSK". No big deal - we just fire up some kafka brokers and zookeeper nodes in ECS! All we gotta do is run several more supporting services to keep the cluster healthy and deal with the nightmare of Apache security ourselves.

As far as I'm concerned MSK is cheap - one broker is priced roughly the same as 2 equivalent EC2 instances. And you don't have to worry about zookeeper at all!

cloudfalcon (OP) · 6y ago

Hi Corey! I'm the author of the blog post - I definitely agree with you. 9 times out of 10 engineering teams underestimate the cost of engineering work as well as the opportunity cost of managing non-core functionality internally or moving away from managed services.

For us, our pipeline was actually easier to work with in Flink than in Glue because of the restrictions that Amazon placed on it, and so that factored into our decision.

runT1ME · 6y ago

If you're already paying the cost (both engineering time and compute-wise) for EMR, I can't imagine it takes more effort to create a new Flink job than a new Glue job?

The advantage of Glue or the corresponding serverless GCP ETL option (Dataflow) is that it's serverless and elastic, but it sounds like that didn't apply to their workload.

brodouevencode · 6y ago

Unfortunately that's not nearly how AWS works. AWS breaks down everything and charges you for it separately. Flink and Glue are entirely different animals.

aritraghosh007 · 6y ago · 3 in thread

Back when AWS started, there would be articles about the work to master scalability and performance for the modern web but as things matured, we somehow ended up in a much larger heap of literature around AWS cost optimization.

derex · 6y ago

In some sense this is a good problem to have. With on-prem you used to have very limited resources to start with, so cost efficiency is a baked-in requirement. With cloud providers you seem to have limitless resources and the new problem of cost optimization arises.

Admittedly there’s a difference between optimizing fully-controlled resources and cloud provider managed services. For one, low visibility into cloud service internals makes such optimization harder.

acdha · 6y ago

Engineering is always balancing capabilities and costs. Before AWS existed there were plenty of stories about people over-buying to handle peak loads, optimizing workloads to fit a particular budget (especially during the dotcom era when VCs stopped underwriting huge sales for Sun, et al.), etc.

Cloud services gave new options for variable use and reallocating management costs but they also did something which most places were not used to: expose every detail as an itemized bill. That makes costs more visible than they’d been for most organizations which is good in the sense that people can make architectural decisions with pretty detailed numbers but bad in that many CIOs get sticker shock unless they’d done a well above average job calculating on-premise TCO.

ooobit2 · 6y ago

We ended up here because Amazon can't scale. It's just uncool to admit you have to notice the pink elephant. Why? I don't know. Maybe it has to do with cred in engineering teams or for engineering teams in the broader org structure.

But the problem with AWS, with a lot of the "cloud", is the pitch that remote centralization of a service scales ad infinitum. It's still subject to the same constraints as self-managed, even if those constraints appear at a higher limit.

The greatest constraint is the per-unit pricing. You buy self-managed, you have huge upfront and period costs, but with remote, you see the $.03/MB price and assume that variable cost is more manageable over the long run. And it is... until price changes, overhead changes, bandwidth changes, or worse, accessibility changes. And suddenly, what you had cost-effective scaling on 18 months ago now has a massive deficit affixed to it. Because that's how most people used the platform... or because removing A or B features reduced maintenance costs or freed up bandwidth.

AWS is an experiment. Does it work in many or even most use cases? Yes. For now.

I love engineers. A lot. In fact, being in sales, I would give up a deal with an engineering team unless I knew for sure my ROI basis was solid. That said, I do know sales and marketing rhetoric. And having spent hundreds of hours in meetings with product, marketing and dev professionals, I wish I could record the stress-induced breakdowns I've seen in engineers and executives who had everything running buttery, "and then [provider] pushed [update]..." and they then have executives breathing on the back of their neck 16 hours a day, entire teams offline or unable to do basic tasks, etc. I just want to play that shit to people and say, "This is why you don't overpromise."

agounaris · 6y ago · 3 in thread

I am curious about the actual cost in $! Managing your own kafka or observability infra is expensive, you need a team to do this.

A 67% reduction doesn't tell the whole truth. They have more services to manage now, which means they need more people and more time to do this.

Saving 10k from your AWS bill by hiring 2 more engineers is not cost effective.
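
The trade-off in this comment can be put into numbers with a rough sketch. Everything below is an assumption for illustration: the thread does not say whether the 10k is per month or per year (the sketch treats it as monthly), the loaded cost per engineer is a made-up figure, and `time_fraction` is a hypothetical knob for how much of each engineer's time actually goes into running the self-managed services.

```python
def net_monthly_saving(cloud_saving, engineers, monthly_cost_per_engineer, time_fraction=1.0):
    """Cloud bill saving minus the engineering time spent to achieve and keep it.

    All inputs are illustrative assumptions, not figures from the article.
    time_fraction is the share of each engineer's time spent on the
    self-managed services (1.0 = full-time on operations).
    """
    engineering_cost = engineers * monthly_cost_per_engineer * time_fraction
    return cloud_saving - engineering_cost


# Two full-time hires at a hypothetical 12,000/month loaded cost each.
full_time = net_monthly_saving(10_000, 2, 12_000)

# The same two engineers spending a hypothetical 20% of their time on operations.
part_time = net_monthly_saving(10_000, 2, 12_000, time_fraction=0.2)

print(full_time, part_time)
```

A negative result means the move costs more than it saves. The sign flips depending on how much engineering time the self-managed services really consume, which is the number the rest of this thread argues about.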

nojito · 6y ago

Of course it is.

Where did we get the idea that engineers are hired to do only one thing?

This has never ever been the case in my experience.

Also, Kafka being hard to manage is not the case. A simple look into many small companies and startups running their own clusters shows otherwise.

agounaris · 6y ago

Engineers are hired to deliver and produce value. Tooling can be a part of it but if you can outsource something which is not your source of income, you have to do it. Engineering time is more valuable.

I also know many startups and small companies investing 5 people and 6 months to get an observability platform up and running while they could just get Datadog or New Relic for half the price... and I'm not even taking into account outages and updates to the platform.

I remember a recent Uber blog post on how they moved from build tool A to build tool B and a couple of weeks later, 3000 people were laid off. It's important to spend development time on revenue streams.

This is some nice piece of advice https://nav.al/build-a-team-that-ships

"Outsource everything that isn’t core. Resist the urge to pick up that last dollar. Founders do Customer Service."

cthalupa · 6y ago

>Where did we get the idea that engineers are hired to do only one thing?

At a certain size or number of self run services, they very well might be. I used to be the guy that did the set up for these sort of self managed solutions, and ran them day to day. In some shops the workload was high enough we needed multiple people like me doing it. Or a whole team. Doing DevOps style management of them just let us do it with fewer people - it certainly didn't make it feasible for developers to do the day to day management of these services and still write code.

throwaway888abc · 6y ago · 2 in thread

"Eliminate unused EC2 instances" -27% of cost

Haha, so cleaned the internal IT / DevOps mess and call it a day and then blog post it

tjbiddle · 6y ago

Eh, you're pulling a quote out of context. It's 27% reduction in EC2 use, which was only 18.5% total. So this only accounted for ~5% of total savings.
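
The arithmetic behind that ~5% can be checked with a short sketch. The two percentages are the ones quoted in this comment; the function name is just illustrative.

```python
def share_of_total_bill_saved(service_share, service_reduction):
    """Saving on one service, expressed as a fraction of the whole bill."""
    return service_share * service_reduction


# EC2 was 18.5% of the bill, and EC2 spend was cut by 27%.
saved = share_of_total_bill_saved(0.185, 0.27)
print(f"{saved:.1%}")  # 5.0%
```

So a 27% cut to a line item that is 18.5% of the bill trims roughly 5% off the total.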

dannyw · 6y ago

I mean if a quarter of your EC2 instances were unused, that is absolutely an internal devops / IT mess.

The whole point of AWS is to use services on demand; it's like buying 133 conference tickets for your 100 person company.

tyingq · 6y ago · 2 in thread

The initial pie chart seems to indicate that either AWS Glue is significantly overpriced, or that they were doing something wrong.

brodouevencode · 6y ago

As with all things AWS the more "magic" there is to it, the more expensive it is.

csharptwdec19 · 6y ago

Huge part of why I always try to build applications as platform agnostic as possible.

If I make a .NET service or site, I know (with the tools I use) I can deploy it on any Linux or Windows machine without issue. I can take it anywhere that I can run any software.

Sure, may need more glue for certain scenarios, but you know that you can move as soon as a provider shows its fangs.

meritt · 6y ago · 1 in thread

I find it highly entertaining a two-year old company who was founded on the basis of helping slash cloud spending found so much waste in their own AWS spend. This is not an example of dogfooding, but an example of sheer incompetency and massive technical debt.

I'd really like to start seeing a series of blog posts from companies who are running extremely lean and efficient tech environments by utilizing cloud in an intelligent manner and avoiding the expensive and unnecessary bullshit that's so prevalent today. The ones that can brag "How we run a $4M/yr SaaS on $40k/yr of AWS spend!" are far more interesting than "How we stopped incinerating millions of VC money by simply turning off shit we didn't need"

tilolebo · 6y ago

Two-year-old companies have limited resources. It might have been a deliberate trade-off to focus on work that produces value for the customers.

Maybe the blog post would have been "How we run a $1M/yr SaaS on $40k/yr of AWS spend!" instead of $4M?

anthonysarkis · 6y ago · 1 in thread

It seems reasonable to make some of these cost comparisons more visible.

i.e. when working on a new product or feature, to understand upfront that "this managed service is x% more than the more bare-bones option", etc.

essentially turning an alchemy into a science

Cthulhu_ · 6y ago

AWS offers a cost calculator for just that purpose; they offer 'easier' products if you can't be arsed to dive into AWS costs and technologies yourself.

I think a lot of people make the mistake of assuming AWS is just an easy off-the-shelf thing you can just grab, but if you use it seriously it's a full-time job and its own expertise.

Source: I've done some AWS certifications, never was able to put them into practice though. I've also worked in multiple organizations that migrated to AWS, they all had a full-time team of people managing it.

It's a full-time, specialist job and you can't just palm it off to your engineers as a background thing.

theatraine · 6y ago

Interesting idea. Does anyone do this for Azure?

dirtydroog · 6y ago

We went through a similar process with GCP, which was annoying since GCP was sold as being cheaper than AWS.
