People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.
Having gone downward in scale over the last few years of my career, it has been eye-opening how many people tell me X won't work due to "our scale", when I can point out that I've already used X in prior jobs at a scale much larger than what we have.
Sometimes they're making $10+ million decisions off that gut feel, with literally zero data on what is actually going on.
It rarely works out well, but hey, you have to leave that opening for the competition somehow, I guess?
And I'm not talking about "why didn't they spend 6 months optimizing that one call to save $50" type stuff. I mean literally zero idea of what is going on, what the actual performance issues are, etc.
The last 4-5 years, though, seem to have made it super common again. Bubble, maybe?
Huge horde of newbs?
Maybe I’m getting crustier.
I remember it was SUPER bad before the dot-com crash, with all the fake-it-'til-you-make-it too. I even had someone claim 10 years of Java experience who couldn't write out a basic class on a whiteboard at all, and tons of folks starting out who literally couldn't write a hello world in the language they claimed experience in. And this was before decent GUI IDEs.
Cloud providers have successfully redefined the baseline performance of a server in the minds of a lot of developers. Many people don't understand just how powerful (and at the same time cheap) a single physical machine can be when all they've used is shitty overpriced AWS instances. So no wonder they have no confidence in putting a standard RDBMS on one when anything above 4GB of RAM will cost you an arm and a leg. Instead they look for "magic" workarounds, which the business often accepts: it's easier to get them to pay lots of $$$$ for running a "web-scale" DB than to pay the same amount for a Postgres instance or, God forbid, actually opt for a bare-metal server outside of the cloud.
In my career I've seen a significant amount of time and effort wasted on workarounds such as deferring very trivial tasks onto queues, or building an insanely distributed system where the proper solution would've been to throw more hardware at it (even expensive AWS instances would've been cost-effective if you count the amount of developer time spent working around the problem).
I feel like sometimes the pendulum has swung too far the other way, where people deny that there ARE people dealing with actual scale problems.
I think it is the same frustration I get when I call my ISP for tech support and they tell me to reboot my computer. I realize that they are giving advice for the average person, but it sucks having to sit through it.
I feel similar frustrations with commenters saying I am doing it wrong by not moving everything to the cloud… I work for a CDN, we would go out of business pretty quickly if we moved everything to the cloud. Oh well.
Now, when dealing with someone convinced that their single TB of data is Google scale, the harder issue is changing that belief. But at least you know where they stand.
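For a sense of proportion, here's a back-of-envelope sketch of what 1 TB means on a single ordinary machine. All throughput figures are assumed, order-of-magnitude numbers (roughly a mid-range NVMe SSD and main-memory bandwidth), not measurements of any particular box:

```python
# Back-of-envelope: how "big" is 1 TB on one machine?
TB = 1e12  # bytes

# Assumed, order-of-magnitude throughput figures:
nvme_seq_read = 3e9    # ~3 GB/s sequential read from a mid-range NVMe SSD
ram_bandwidth = 20e9   # ~20 GB/s sustained read from main memory

# Time to read the ENTIRE dataset once, end to end:
full_scan_nvme = TB / nvme_seq_read  # seconds
full_scan_ram = TB / ram_bandwidth   # seconds

print(f"Full scan from NVMe: ~{full_scan_nvme / 60:.0f} min")
print(f"Full scan from RAM:  ~{full_scan_ram / 60:.1f} min")
```

Under those assumptions, a brute-force scan of the whole terabyte from local disk finishes in minutes, and anything indexed is far faster still, which is why "our single TB" rarely justifies distributed machinery on its own.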
One gotcha here is that not all PBs are equal. My field is also one where multi-PB datastores are common. However, for the most part that data sits at rest in S3 or similar; it'll occasionally be pulled in chunks of a couple TB at most. But when you talk to people, they'll flash their "do you know how big our storage budget is?" badge at the drop of a hat, and it gets used to justify all sorts of compute patterns. Meanwhile, all they need is a large S3 footprint and a machine with a reasonable amount of RAM.