We survived 10k requests/second: Switching to signed asset URLs in an emergency

(hardcover.app)

147 points · dyogenez · 1y ago · 168 comments

131 comments · 34 top-level

flockonus1y ago· 12 in thread

Have you considered putting cloudflare or similar CDN with unlimited egress in front of your bucket?

Reading your blogpost I don't fully get how the current signing implementation can halt massive downloads, or the "attacker"(?) would just adapt their methods to get the signed URLs first and then proceed to download what they are after anyway?

frankjr1y ago

You cannot just put Cloudflare in front of your Google-hosted bucket; that's against CF's terms of service. In order to do that you would also have to host the content itself on Cloudflare R2/Images etc. There also used to be an HTML-only restriction, but that's no longer the case.

> Next, we got rid of the antiquated HTML vs. non-HTML construct, which was far too broad. Finally, we made it clear that customers can serve video and other large files using the CDN so long as that content is hosted by a Cloudflare service like Stream, Images, or R2.

https://blog.cloudflare.com/updated-tos/

voxic111y ago

Lots of people do this, so you definitely can do this even if it's against CF's terms of service, which is something I can't find evidence of.

ghayes1y ago

Where is this against GCP's or Cloudflare's TOS?

KomoD1y ago

You totally can, just not a "disproportionate percentage"

JohnMakin1y ago

This is absolutely nuts to me and would immediately rule out ever hosting anything on google storage for me

paxys1y ago

Yup. The only mitigation here is that there is a limit to how many different asset URLs they will be able to generate, but if they want to be malicious they can download the same file over and over again and still make you rack up a huge bill.

dyogenezOP1y ago

This is true. I’d still need a CDN in front of the actual files to prevent that. That’s a takeaway for me from this feedback.

l5870uoo9y1y ago

10k req/s would potentially crash the Ruby proxy server, halting image serving.

Cloudflare is the way to go. I generally serve heavy files, e.g. videos, from a Cloudflare bucket to avoid expensive bills from primary host.

ezekg1y ago

Honestly, I would just move to R2 and save on egress fees even without the CDN. Runaway egress bills are no fun.

I saved myself thousands of dollars a month by moving to R2.

ksnsnsj1y ago

What is R2?

dyogenezOP1y ago

Putting a CDN in front would prevent this at the bucket level, but then someone could still hit the CDN at 10k requests/second. We could rate limit it there though, which would be nice.

The downside is that people already have the URLs for the existing bucket directly. So we'd need to change those either way.

The reason why the attacker couldn't just hit the API to get the signed URLs is due to rate limiting that I go over using the rack-attack ruby gem. Since that's limited to 60/second, that's more like 43k images/day max.
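
A minimal sketch of the kind of per-client throttle described above, written in Python rather than Ruby. It is not rack-attack's API: the class name, the fixed-window strategy, and the `clock` parameter are assumptions made for illustration.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window`-second slot."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        # (client, window index) -> requests seen in that window.
        # Old windows are never evicted here; a real deployment would
        # keep these counters in a cache with an expiry.
        self.counts = defaultdict(int)

    def allow(self, client):
        slot = int(self.clock() // self.window)
        key = (client, slot)
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def daily_ceiling(limit, window):
    """Upper bound on requests one client can make per day under the limit."""
    return limit * (86_400 // window)
```

With this helper, a limit of 60 requests per 60-second window works out to `daily_ceiling(60, 60) == 86_400` requests per client per day. The ceiling is per client key, so it only bounds an attacker who cannot cheaply rotate keys such as IP addresses.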

flockonus1y ago

> someone could still hit the CDN at 10k requests/second

CDNs have rate-limiting mechanisms that you can easily configure, and they will be better at this than a Ruby gem (no offence to that).

On Ruby you're pushing the rate-limiting job down to your CPU, with limited visibility per IP... idk man, Cloudflare is $20/month.

qaq1y ago· 12 in thread

Beauty of cloud :) This could be easily served by a $100/month DO droplet with 0 worries about $.

atrus1y ago

Not on DO. A ~$100/month droplet gets you about 5TB of transfer out. They pulled 15TB in 7 hours. That's ~1,440,000 (16 × 3 × 30) on overage or about $15k extra.

daemonologist1y ago

Doesn't DO charge $0.01/GB for egress overage? That's $150, not $15k. (Although Hetzner or something would've been even less.)
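
A quick check of the two figures above, as a sketch. It assumes the $0.01/GB overage rate quoted in this reply, decimal units (1 TB = 1000 GB), and one reading of the ~1,440,000 figure: roughly 16 TB per 8 hours, 3 times a day, for 30 days, expressed in GB. The function name is illustrative.

```python
GB_PER_TB = 1000  # decimal units (assumption)


def overage_cost(transfer_tb, included_tb, usd_per_gb):
    """Cost in USD of transfer beyond the included allowance."""
    billable_gb = max(transfer_tb - included_tb, 0) * GB_PER_TB
    return billable_gb * usd_per_gb


# The 15 TB burst on its own, ignoring any included allowance.
incident = overage_cost(15, 0, 0.01)

# The same burst with the ~5 TB allowance mentioned above subtracted.
incident_after_allowance = overage_cost(15, 5, 0.01)

# The same rate sustained for a month: 16 TB x 3 per day x 30 days.
sustained_month = overage_cost(16 * 3 * 30, 0, 0.01)

print(round(incident), round(incident_after_allowance), round(sustained_month))
```

Under these assumptions the 7-hour incident costs about $150 (or about $100 after the allowance), while about $15k only appears if the same traffic continues for a full month.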

qaq1y ago

Didn't pay attention to the transfer figure. Let's switch from DO to a CCX43 on Hetzner for $50 more.

sroussey1y ago

I used to have my own half server rack and unlimited bandwidth for $500/mo.

My own machines, of course.

rsstack1y ago

DO _is_ cloud. Using their droplets compared to someone more sophisticated on GCP is an engineering choice, but both are cloud and both have upsides and downsides, and one needs to understand their needs to make the correct decision both among the different providers and within a provider on the right setup.

account421y ago

The billing model for VPSs and real big cloud(TM) providers is very different. This is especially true for bandwidth.

paxys1y ago

Does DO have free bandwidth? If not how exactly does that solve the problem?

Alifatisk1y ago

I don't think they have unmetered bandwidth?

ponytech1y ago

I rent a bare metal server for $50/month with unlimited bandwidth...

kawera1y ago

Where?

ksnsnsj1y ago

There is no such thing as unlimited bandwidth.

What I'm aware of are services which do not charge extra for egress but severely limit your egress bandwidth (like 10 Gbit peak, 100 Mbit avg)

And limiting egress bandwidth is better done in the service, per client, than by the hoster for your whole system.

lionkor1y ago

Hetzner rootservers have no in- or outgoing data limit either

andrewstuart1y ago· 11 in thread

I'm always surprised to read how much money companies are willing to spend on things that can be done for essentially nothing.

I had a look at the site - why does this need to run on a major cloud provider at all? Why use VERY expensive cloud storage at 9 cents per gigabyte? Why use very expensive image conversion at $50/month when you can run sharp on a Linux server?

I shouldn't be surprised - the world is all in on very expensive cloud computing.

There's another way though, assuming you are running something fairly "normal" (whatever that means) - run your own Linux servers. Serve data from those Linux computers. I use CloudFlare R2 to serve your files - its free. You probably don't need most of your fancy architecture - run a fast server on Ionos or Hetzner or something and stop angsting about budget alerts from Google for things that should be free and running on your own computers - simple, straightforward and without IAM spaghetti and all that garbage.

EDIT: I just had a look at the architecture diagram - this is overarchitected. This is a single server application that almost has no architecture - Caddy as a web server - a local queue - serve images from R2 - should be running on a single machine on a host that charges nothing or trivial amount for data.

Spivak1y ago

Don't use cloud, use these two other clouds. This right here is the issue, the skills and know how to buy hardware, install it in a data center, and get it on the internet are niche beyond niche.

Entering the world where you're dealing with Cogent, your Dell and Fortinet reps, suddenly having strong opinions about iDRAC vs iLO and hardware RAID is well beyond what anyone wants to care about just to run some web servers.

When people talk about major cloud providers being expensive the alternative is never /really/ to do it yourself but move to a discount hosting provider. And it's not as if there isn't savings to be found there but it's just another form of cloud optimization. We're talking about a story where $100 of spend triggers an alert. The difference is so minuscule.

ksnsnsj1y ago

I have read this argument before. Of course you can do everything yourself _but it is not free_

You are missing both development cost and much more importantly opportunity cost

If I spent a person-year on a cheap-to-run architecture while my competitor spent a person-year on a value-add feature, he will win

cuu5081y ago

Depends on what skills you have, but running everything on a single machine rather than messing with multiple cloud services can also be cheaper in development cost.

dyogenezOP1y ago

If you're able to do that, then you have a huge skill! I'm not much of a devops engineer myself, so I'm leveraging work done by others. My skills are in application design. For hosting I try to rely on what others have built and host there.

If I had your skills then our costs would be much smaller. As it stands now we pay about $700/month for everything - the bulk of it for a 16gb ram / 512gb space database.

BigParm1y ago

How much does it cost to have an ISP let you do that? What are the barriers generally?

hypeatei1y ago

If you're referring to hosting on a home network, you'll probably be behind CGNAT. Your ISP can give you a dedicated IP but it'll most likely cost something.

andrewstuart1y ago

Let you do what? What barriers do you see?

frankjr1y ago

> I use CloudFlare R2 to serve your files - its free.

I mean technically it's not free. It's just that they have a very generous "Forever Free" number of read operations (10M/month, $0.36 per million after).

rob1y ago

Looks like a site you could build in WordPress with some custom plugins like ACF and host on a single VPS for the most part.

blibble1y ago

yeah, as a crotchety old unix guy, 10k requests a second was a benchmark 30 years ago on an actual server

today a raspberry pi 5 can do 50k/s with TLS no sweat

BenjiWiebe1y ago

Can you give me an example of how to do 50k/s with TLS on an rpi? Also what do you use to measure that?

I've tried a little with httpd (apache) on an older desktop I use as my home server and got terrible results. I can't remember but it might have been single digit or low double digit rps.

dyogenezOP1y ago· 9 in thread

Earlier this week someone started hitting our Google Cloud Storage bucket with 10k requests a second... for 7 hours. I realized this while working from a coffee shop and spent the rest of the day putting in place a fix.

This post goes over what happened, how we put a solution in place in hours and how we landed on the route we took.

I'm curious to hear how others have solved this same problem – generating authenticated URLs when you have a public API.

wrs1y ago

It sounds like you had public list access to your bucket, which is always bad. However, you can prevent list access, but keep read access to individual objects public. As long as your object names are unguessable (say, a 16-byte random number), you won’t have the problem you had.

I haven’t used Rails since they integrated storage, but gems like Paperclip used to do this for you by hashing the image parameters with a secret seed to generate the object name.
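
A minimal sketch of both naming ideas, in Python. This is not Paperclip's actual scheme: the function names, the `record_id`/`style` parameters, and the choice of HMAC-SHA256 are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def random_object_name(extension):
    """Object name built from 16 random bytes (128 bits), hex-encoded."""
    return f"{secrets.token_hex(16)}.{extension}"


def derived_object_name(secret, record_id, style, extension):
    """Deterministic name: HMAC of the image parameters under a server secret.

    Anyone without `secret` cannot compute the name from the parameters,
    but the server can always recompute it instead of storing it.
    """
    message = f"{record_id}/{style}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{digest[:32]}.{extension}"


if __name__ == "__main__":
    print(random_object_name("jpg"))
    print(derived_object_name(b"server-side-secret", 42, "thumb", "jpg"))
```

Either way the name only protects objects whose URLs have not already been handed out, for example through an API response.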

Using signed URLs is solving a different problem: making people hit your API at least once a day to get a working GCS URL for the image. It’s not clear if that’s an actual problem, as if people want to enumerate your API (as opposed to your bucket), they can do that with the new system too.

That aside, I’m confused about the 250ms thing. You don’t have to hit a Google API to construct a signed URL. It should just be a signature calculation done locally in your server. [0]

[0] https://cloud.google.com/storage/docs/access-control/signing...

dyogenezOP1y ago

Thanks for the comment! Few things to reply to from here.

We didn’t have list access enabled, but someone did get a list of files from our API.

Rails with Paperclip and Active Storage is amazing. Our front end is in Next.js though, so we have people upload straight from Next to GCS, then we send the GCS URL to Rails. We don’t do pre-processing of images, so just storing the original is good.

They can still download every image, but they’ll be throttled now and kicked out at the middleware level, or permission denied from GCP. 60/min vs 10k/min.

The signature calculation might not be hitting Google in that case. I noticed a long data dump in the console after requesting the signed URL, combined with the additional latency, and assumed it was. Maybe it’s just a cryptographically expensive calculation like bcrypt and it takes a while. Will have to check, because it’d be great to not need a network-reliant call for that.

hereonout21y ago

This was my understanding of signed URLs also. I was wondering why they needed to be cached, then afterwards wondering why the generation was so slow when I read the 250ms part.

deeebug1y ago

> That aside, I’m confused about the 250ms thing. You don’t have to hit a Google API to construct a signed URL. It should just be a signature calculation done locally in your server. [0]

I assume the additional latency is the initial cred fetch from the VM Metadata Service to perform that sign, no?

dantiberian1y ago

Could you explain more why you were not able to sign the URLs at request time? Creating an HMAC is very fast.

dyogenezOP1y ago

I’m going to have to look into this today. I assumed generating the URLs hit an API, but if those can happen fast locally that changes things.

tayo421y ago

> I'm curious to hear how others have solved this same problem

I think this is interesting to ask, because I often have problems where I'm almost certain it's been solved before, just people don't bother to write about it. Where can people congregate to discuss questions like this?

dyogenezOP1y ago

Hopefully here. Sometimes the best way to get people to respond is to be wrong. I'm sure I've done a bunch of things wrong.

wordofx1y ago

> I'm curious to hear how others have solved this same problem

Not use Google to start with. And not make S3 buckets public. Must be accessed via CloudFront or CF Signed URLs. Making stuff public is dumb.

paxys1y ago· 6 in thread

Quick feedback – you've used the term "signed URL" over 50 times in the post without once explaining what it is or how it works.

telotortium1y ago

Until the author fixes the post, this is what they're talking about: https://cloud.google.com/storage/docs/access-control/signed-.... Essentially, it ensures that a URL is invalid unless the server signs it with a secret key controlled by the server, which means that clients can't access your assets just by guessing the URL. In addition to signing the URL, the signature can contain metadata such as permissions and expiration time.
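
To make the idea concrete, here is a minimal sketch of a signed URL with an expiry, in Python. It is not the GCS signing algorithm: the `expires` and `signature` query parameter names, the payload layout, and the use of HMAC-SHA256 with a shared secret are all assumptions for illustration. It also assumes the input URL has no query string of its own.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(secret, path, expires):
    # The signature covers the method, the path, and the expiry time.
    payload = f"GET\n{path}\n{expires}".encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def sign_url(secret, url, ttl_seconds, now=None):
    """Return `url` with an expiry and a signature appended."""
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    signature = _signature(secret, urlsplit(url).path, expires)
    return f"{url}?{urlencode({'expires': expires, 'signature': signature})}"


def verify_url(secret, signed_url, now=None):
    """True only if the signature matches and the URL has not expired."""
    now = int(time.time()) if now is None else now
    parts = urlsplit(signed_url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _signature(secret, parts.path, expires)
    # Constant-time comparison on bytes, so odd input cannot raise.
    return hmac.compare_digest(signature.encode(), expected.encode())
```

Signing and verifying are both local computations over a secret, so no network call is needed in this sketch. Changing the path or the expiry invalidates the signature, which is what stops a client from guessing or extending URLs.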

riedel1y ago

Is there any advantage over JWT other than one can put the token into the URL itself (which is technically also possible with JWT I guess, with the downside that it will be probably exposed in logs, etc.)?

shortrounddev21y ago

Rather than allowing any object on a bucket to be downloaded by its raw URL (e.g. http://mycdn.io/abcdefg.jpeg), the backend service needs to generate a "signed" URL, which is a short-lived URL that grants the user a single request against that resource (GET, POST, PUT, etc.), e.g. http://mycdn.io/abcdefg.jpeg?signed={securerandomstring}. So you can only use the URL to download it once, and you need to go through the backend API to generate the presigned URL. This could result in your backend getting hammered, but you can also use DDoS protection to prevent 10k requests a second from going through your backend.

They're also a good way to allow users to upload images to your CDN without having to actually upload that data to your web API backend; you just give the user a presigned PUT request URL and they get a one-time ticket to upload to your bucket.

taeric1y ago

Worth calling out that the big benefit is you basically lean on the service provider for streaming the data, without having to form a trust relationship between them and the receiver of the data.

That is, the entire point is to not put more compute between the requester and the data. The absolute worst place to be would be to have compute that is streaming from the data provider, so that they can stream to the end user.

Right?

ddorian431y ago

It's not single-use; it's valid until a TTL expires.

dyogenezOP1y ago

Ohh good catch. Just updated the post with a section mentioning what signed URLs are before jumping into the solution.

arcfour1y ago· 6 in thread

I immediately groaned when I read "public bucket."

On AWS you'd put CloudFront in front of the (now-private) bucket as a CDN, then use WAF for rate limiting, bot control, etc. In my experience GCP's services work similarly to AWS, so...is this not possible with GCP, or why wasn't this the setup from the get-go? That's the proper way to do things IMO.

Signed URLs I only think of when I think of like, paid content or other "semi-public" content.

0xbadcafebee1y ago

Google Cloud makes it insanely difficult/non-obvious what services you should use to solve these problems (or how to use them, because they're always difficult to use). They have a maze of unintuitive product names and sub-products and sub-sub-products, finding them in a UX is ridiculous, there's no useful tips/links/walkthroughs in the wizards, and their docs are terrible. It's like being trapped in the goddamn catacombs of Paris. On AWS, using buckets with CDN, ALB & WAF are obvious and easy, but on GCP it's a quagmire.

The other thing is, AWS WAF was released in 2015, and the Google Cloud Armor WAF feature (the what now?) was released in 2020.

written-beyond1y ago

Honestly this is exactly how I felt about GCP when I was building something that would be used by millions of people. At that scale it's very easy to shoot yourself in the foot and boy does Google make that easy.

There were so many things that were outright wrong in their documentation that caused me many sleepless nights. Like not recommending using a pool or closing Cloud SQL connections in serverless functions because they'll be closed automatically when the instance spins down.

Don't get me wrong I had used pools extensively before, and I knew you had to close connections but their docs and examples would explicitly show the connections not being closed, just left for them to close when the instance spins down.

Idk why they never thought that an instance might never spin down if it's getting hammered with requests, and you end up with hundreds of open connections over multiple instances until GCP starts killing your requests, telling you "out of connections" in a serverless instance. It's the vaguest possible error, and after a lot of debugging you understand that you can't have more than 100 open connections on a single function instance, but you were technically never supposed to have more than one open at any given time.

sigh

dyogenezOP1y ago

That's a good idea. I probably could've put a CDN in front of this and rate limited there while keeping things public. That might've been faster than using Ruby to be honest. The downside was that our API already shared the non-CDN URLs, so that would leave the problem open for anyone who already had that data.

arcfour1y ago

The bucket is private though, only accessible through the CDN. The old URLs would cease to function. On AWS this is implemented through OAI/OAC, granting the CloudFront distribution access via its own unique principal. AWS has had a baseline security recommendation for years now to disable S3 public access at the account/org level.

Maybe this breaks things, maybe you need to expire some caches, but (forgive me for being blunt, I can't think of a better way to say it) that's the cost of not doing things correctly to begin with.

My first thought as a security engineer when setting something up to be public has always been "how hard could someone hit this, and how much would it cost/affect availability?"

antihero1y ago

That said, if you use CF in front of S3 (which you should), anyone with a gigabit connection can easily cost you hundreds of dollars. I know this because I did this to myself accidentally.

arcfour1y ago

With WAF simple IP-based rate limiting is very simple & cheap. More complex types of limits aren't too difficult either, but even just deploying that is a few clicks.

hansvm1y ago· 4 in thread

What I just read is that for the cost of a single 16TB hard drive, they were able to rent a hard drive for 7 hours to stream 16TB, and they still had to devote meaningful engineering resources to avoid the cost overrun.

Does anybody here have a success story where AWS was either much cheaper to operate or to develop for (ideally both) than the normal alternatives?

lionkor1y ago

Yeah, I'm confused, too - a $60 server with any decent web server on it should be happy chugging along at 5-15k req/s, right?

hansvm1y ago

In general, yes. My rule of thumb for a basic web server is 100k QPS per physical core on cheap hardware, slowing down if it's doing anything intensive (depending on the nature of the images being requested and how the requests are distributed relative to the disks' layouts, they could have been pegged at the disks' throughput for example), speeding up if you have a particularly light workload or better hardware.

jeffhuys1y ago

They don't use AWS, by the way. This was GCP.

hansvm1y ago

Oops, missed that. The question still stands, but read "AWS" as "AWS or a similar service."

intelVISA1y ago· 4 in thread

10k/s... is that a lot? Computers are insanely fast nowadays..!

lionkor1y ago

No. It's not a lot. 20-30k req/s is easy for serving simple, small files. If you have a beefy machine (say, a $50 Hetzner rootserver), you get a few TB of storage and unlimited or cheap bandwidth.

8-16 cores can easily(!!!) push this kind of data without even heating up, not sure wtf OP is doing. Well, I know what OP is doing - they fell for the idea that the cloud is more scalable.

The issue with this is that the cloud™ starts "scaling" at the first user, whereas a baremetal server needs to scale when you have saturated an 8-16 core modern CPU, a 1-10 Gb/s NIC, 30-60 GB of RAM. In other words, baremetal needs to scale when you actually run out of hardware resources, which is tens or hundreds of thousands of users later.

Edit: for example, at BeamMP, we run on a few bare metal servers, and serve 22k unique users per day in a multiplayer videogame service. Funded by around 800 people donating.

jeroenhd1y ago

Based on the names of the endpoints, I get the idea that they're altering the image files on the fly (and probably caching processed files) based on the URL. I've seen this quite often on blogs and such. Serving files shouldn't take much CPU power, but resizing images can get quite expensive, especially if you want to achieve lower egress fees by using better compression methods.

Still, you need to deal with bad scrapers. Plus, this scraper downloaded at a consistent 650mbps, taking up half the unlimited Hetzner pipe by itself; if you'd go for a 10gbps Hetzner machine, you suddenly start paying egress fees once you hit 20TB of traffic. Even then, if you go the cheapo Hetzner server route, you probably still want at least some kind of CDN to keep latency down. Add to that costs of backups and synchronising failovers, and you may end up with more traffic than you'd expect.

I think going bare metal would save more than the signed URLs would, at least until the ten thousandth customer, but not everyone is proficient in maintaining servers. A lot of cloud projects I see are coming from programmers who don't want to/don't know how to maintain a Linux server and just want to run their code. If you're in that category, taking time off to learn server maintenance or hiring a sysadmin can easily be a lot more expensive than paying the extortionate rates cloud providers demand.

nirui1y ago

I'm not a fan of cloud either, but I have to admit that the networks these big cloud providers have built are just better than self-hosted ones. When they say they'll distribute your file globally, they mean it, as long as you pay of course.

But I would rather say, cloud is not for everyone. Especially in the case mentioned in the article. Think this: do you really REALLY need to distribute enlarged images globally at top speed? I bet most people just don't.

Same thing goes for "scaling", it's true the cloud can do that very well, but do you really need it that bad?

quectophoton1y ago

You also need to take into account the size of each response, how long your server needs to keep the data in memory (e.g. because of latency, the requester's bandwidth, etc), whether requests to the same file can share a buffer or not, how much data you can be sending at the same time while still being responsive (e.g. without slowing down other responses, causing them to take longer, requiring you to keep those resources in memory for longer, and snowballing from there), ..., stuff like that.

For short text messages, probably not an issue. With larger stuff like images or video, I would be more careful.

Still, even for text-only, if you're using PostgreSQL, by default you have a limit of (I think) 100 parallel connections (or 97, because I think 3 are reserved for superusers), but each connection can only be executing one transaction at a time, so that can quickly become a bottleneck depending on your application and how fast you need to make queries vs how long your queries take to return a response. So then you might need to tune some PostgreSQL settings, or add caching, or some other way to work around the issue.

If you add more services, then you also need to keep in mind the latency between those services.

And so on and so on. So RAM and network would probably become an issue way earlier than CPU in most cases.

TL;DR: "It depends".

languagehacker1y ago· 3 in thread

Did this guy just write a blog post about how he completely rewrote a functional feature to save $800?

In all seriousness, the devil is in the details around this kind of stuff, but I do worry that doing something not even clever, but just nonstandard, introduces a larger maintenance effort than necessary.

Interesting problem, and an interesting solution, but I'd probably rather just throw money at it until it gets to a scale that merits further bot prevention measures.

dyogenezOP1y ago

If this were a business and someone else's money I'd do the same. This is a bootstrapped side project coming out of my own wallet.

If money wasn't an issue, I'd probably just allow people to download images for free.

languagehacker1y ago

Good point! My POV assumed some amount of revenue generation.

underwater1y ago

It was $800 so far.

Your point is valid for normal usage patterns where there is a direct relationship between active users and cost. But an attack meant OP’s costs were sky rocketing even though usage was flat.

Waterluvian1y ago· 3 in thread

Do any cloud providers have a sensible default or easy-to-enable mode for “you literally cannot spend one penny until you set specific quotas/limits for each resource you’re allocating”?

paxys1y ago

No, because surprise runaway costs are their entire business model.

hinkley1y ago

Cloud is the new gym membership.

ksnsnsj1y ago

Not really, because those clients will be unhappy and cause trouble.

They like the clients which expand slowly.

So going from $100 to $100k in a month by accident they want to avoid while still being able to go from $1k to $100k in a year

EGreg1y ago· 3 in thread

We've designed our system for this very use case. Whether it's on commodity hardware or in the cloud, whether or not it's using a CDN and edge servers, there are ways to "nip things in the bud", as it were, by rejecting requests without a proper signed payload.

For example, the value of session ID cookies should actually be signed with an HMAC, and checked at the edge by the CDN. Session cookies that represent an authenticated session should also look different than unauthenticated ones. The checks should all happen at the edge, at your reverse proxy, without doing any I/O or calling your "fastcgi" process manager.

But let's get to the juicy part... hosting files. Ideally, you shouldn't have "secret URLs" for files, because then they can be shared and even (gasp) hotlinked from websites. Instead, you should use features like X-Accel-Redirect in NGINX to let your app server determine access to these gated resources. Apache has similar things.

Anyway, here is a write-up which goes into much more detail: https://community.qbix.com/t/files-and-storage/286
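
A minimal sketch of the X-Accel-Redirect pattern mentioned above, as a Python WSGI app. The `X-Session-Id` header, the `ALLOWED_SESSIONS` set, and the `/protected/` prefix are hypothetical stand-ins; the sketch also assumes NGINX proxies to this app and has a matching location marked `internal` that maps `/protected/` to the files on disk.

```python
# Hypothetical stand-in for a real session store or signed-cookie check.
ALLOWED_SESSIONS = {"demo-session"}


def app(environ, start_response):
    """Decide access in the app, then let NGINX stream the file itself."""
    session_id = environ.get("HTTP_X_SESSION_ID", "")
    # Keep only the last path segment so the client cannot pick a directory.
    filename = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]

    allowed = (
        session_id in ALLOWED_SESSIONS
        and filename
        and not filename.startswith(".")
    )
    if not allowed:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"forbidden\n"]

    # NGINX intercepts this header and serves the internal location;
    # the client never sees the /protected/ path.
    start_response("200 OK", [("X-Accel-Redirect", f"/protected/{filename}")])
    return [b""]
```

The design point is that the app server only spends time on the access decision, while the bytes are served by NGINX, and there is no shareable secret URL for the file.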

MassPikeMike1y ago

Thanks for making me aware of X-Accel-Redirect!

The write-up discusses X-Accel-Redirect's functionality in the context of qbix. For me, the two were rather hard to tease apart in that context. So for others who feel that way, I would recommend starting with Grant Horwood's introduction to X-Accel-Redirect at

https://gbh.fruitbat.io/2024/05/12/nginx-serving-private-fil...

dyogenezOP1y ago

Ohh, using the session ID in the URL would be a nice addition to this. We already generate session tokens for every user - guests and logged in users. We could pass that through to segment on rather than IP address.

EGreg1y ago

Glad I could help… consider using the session to determine access and then just send an HTTP 403 or whatever instead of the actual images

upon_drumhead1y ago· 2 in thread

Given that you want to be good stewards of book data, have you considered publishing bulk snapshots to archive.org on a set cadence? It would strongly reduce any needs to do any sort of bulk scraping and also ensure that should something happen to your service, the data isn't lost forever.

dyogenezOP1y ago

I hadn't thought of that, but I love the idea! How's that work?

upon_drumhead1y ago

Register for an account and create a new item. You can replace files in the item, and update the description to indicate what date the snapshot was made and what it contains.

https://help.archive.org/help/managing-and-editing-your-item...

It's a very open platform. Think up what the best format for your data is and upload a compressed zip file or tar.gz of the data.

I'd likely do different archives for images and metadata, so people that want to just process metadata can download that specific data and work on it.

Luckily as you can edit over time, you can experiment and adjust based upon user's feedback.

1a527dd51y ago· 2 in thread

I don't understand, why wasn't there a CDN in front of the public GCS bucket resources?

ksnsnsj1y ago

While this is normally done due to the reasons mentioned, to me that is a significant downside.

Why can't GCS act as a CDN, too?

hinkley1y ago

Because then they can’t sell you two products.

feurio1y ago· 2 in thread

Maybe it's just me, but isn't ~10K r/s pretty much just, well, normal?

cassonmars1y ago

I came here to ask the same thing.

intelVISA1y ago

CDNs make $$ convincing you it is.

hypeatei1y ago· 2 in thread

So your fix was to move the responsibility to the web server and Redis instance? I guess that works but introduces a whole lot more complexity (you mentioned adding rate limiting) and potential for complete outage in the event a lot of requests for images come in again.

dyogenezOP1y ago

That's my worry too. Our server load for our Rails server hasn't gone up even though our throughput has maxed out at 76k requests/second (which I think is a bunch of people from Hacker News going to the Hardcover homepage and downloading 100 images).

I don't like that if Rails goes down our images go down. I'd much prefer to separate these out and show the signed URLs in Next.js and be able to generate them through the API. I think we'll get there, but that's a bigger change than I could reliably make in a day.

hinkley1y ago

I don’t have a ton of use cases for functions where they make great sense, not just fill in a bingo card, but generating access errors cheaply is a big one.

taeric1y ago· 2 in thread

I'm confused, isn't this literally the use case for a CDN?

Edit: I see this is discussed in other threads.

dyogenezOP1y ago

That would solve some of the problems. If the site was previously behind a CDN with a rate limit, I don't think we would have even had this problem.

Given that we have the problem now, and that people already have the non-CDN URLs, we needed a solution that allowed us to roll out something ASAP, while allowing people that use our API to continue using the image URLs they've downloaded.

taeric1y ago

Makes sense. And kudos on getting a solution that works for you! :D

the84721y ago· 2 in thread

The dreaded C10k problem, remaining unsolved to this day.

ksnsnsj1y ago

Unlike the original C10k problem, serving that number of connections has now morphed from a technical problem into an economic one.

the84721y ago

I don't think the economics of serving 1Gbit have ever added up to 300$ over two days.
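For a rough sense of scale, the arithmetic behind that remark can be sketched as below. This assumes the link is saturated at 1 Gbit/s for the full 48 hours and uses decimal units; the $300 is the figure quoted above, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_transferred(bits_per_second: float, days: float) -> float:
    """Bytes moved by a link running flat out for the given number of days."""
    return bits_per_second / 8 * days * SECONDS_PER_DAY


def implied_price_per_gb(total_cost: float, total_bytes: float) -> float:
    """Cost per decimal gigabyte implied by a bill and a transfer volume."""
    return total_cost / (total_bytes / 1e9)


total = bytes_transferred(1e9, 2)  # 1 Gbit/s sustained for two days
print(f"{total / 1e12:.1f} TB transferred")
print(f"${implied_price_per_gb(300, total):.4f} per GB implied by a $300 bill")
```

Under those assumptions the transfer comes to 21.6 TB, so a $300 bill works out to a bit under 1.4 cents per GB.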

1 more reply

busymom01y ago· 2 in thread

> The previous day I was experimenting with Google Cloud Run, trying to migrate our Next.js staging environment from Vercel to there to save some money. I assumed I misconfigured that service and turned it off and went about my day.

I am sorry but who sees a $100 sudden charge, assumes misconfiguration and just goes about their day without digging deeper right away?

hinkley1y ago

Are you one of those devs who mistakenly assumes that your salary constitutes 90% of your cost to the company, when in fact it’s closer to 40%?

You want me to spend an hour trying to save the company $100? We just spent $250. And that’s not the half of it. If the company is expecting me to result in $5 in revenue for every dollar they spend on me, we really just lost out on more than $1000.

I’ve worked many places where we didn’t think about opportunity costs. I’ve also been laid off many times.

busymom01y ago

The author in another comment posted this which very clearly indicated they are bootstrapping a low cash side project from their own wallet:

> If this were a business and someone else's money I'd do the same. This is a bootstrapped side project coming out of my own wallet. If money wasn't an issue, I'd probably just allow people to download images for free.

2 more replies

Sytten1y ago· 1 in thread

I really hope this is not the whole of your code, otherwise you have a nice open redirect vulnerability on your hands and possibly a private bucket leak if you don't check which bucket you are signing the request for. Never, for the love of security, take a URL as input from a user without doing a whole lot of checks and sanitization. And don't expect your language's URL parser to be perfect; Orange Tsai demonstrated they can get confused [1].

[1] https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-Ne...

dyogenezOP1y ago

I left off the method that generates the signed URL. It limits the bucket to a specific one per env and blocks some protected folders and file types. I left that out in case someone used it to find an opening to attack.
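A minimal sketch of the kind of checks described in this exchange, not the method that was left out of the post. The bucket name, blocked folders and allowed extensions are hypothetical placeholders, and the path-style `storage.googleapis.com/<bucket>/<object>` URL shape is an assumption about how the URLs look.

```python
from typing import Optional
from urllib.parse import urlsplit
import posixpath

# Hypothetical values for illustration; none of these come from the post.
ALLOWED_HOST = "storage.googleapis.com"
ALLOWED_BUCKET = "example-assets-production"
BLOCKED_PREFIXES = ("private/", "backups/")
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def object_path_if_allowed(url: str) -> Optional[str]:
    """Return the object path inside the allowed bucket, or None if rejected."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
        has_credentials = bool(parts.username or parts.password)
    except ValueError:
        return None  # malformed URL
    if parts.scheme != "https" or host != ALLOWED_HOST or has_credentials:
        return None
    path = parts.path
    # Refuse encoded or backslash paths rather than trying to decode them safely.
    if not path.startswith("/") or "%" in path or "\\" in path:
        return None
    bucket, _, obj = path[1:].partition("/")
    if bucket != ALLOWED_BUCKET or not obj:
        return None
    # Reject empty, "." and ".." segments so the path cannot escape or alias.
    if any(segment in ("", ".", "..") for segment in obj.split("/")):
        return None
    if obj.lower().startswith(BLOCKED_PREFIXES):
        return None
    if posixpath.splitext(obj)[1].lower() not in ALLOWED_EXTENSIONS:
        return None
    return obj
```

The idea is to sign only the object path this function returns and never the caller's original string, so whatever the parser was tricked into accepting never reaches the signer.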

inopinatus1y ago· 1 in thread

I recall reports of cases like this nearly every day at AWS, and that was a decade ago.

It wasn't unusual, for first-time victims at least, that we'd a) waive the fees and b) schedule a solution architect to talk them through using signed URLs or some other mitigation. I have no visibility into current practice either at AWS or GCP but I'd encourage OP to seek billing relief nevertheless, it can't hurt to ask. Sustainable customer growth is the public cloud business model, of which billing surprises are the antithesis.

dyogenezOP1y ago

I recently had a call with Google and have a sales/solution person I’ve been talking to about moving more services there. I’ll share what happened and see what they say.

sakopov1y ago· 1 in thread

I must be missing something obvious, but what do signed URLs have to do with requests going directly to resources in a bucket instead of a CDN of some sort like Cloudflare? Signed URLs are typically used to provide secure access to a resource in a private bucket. But it seems like it's used as a cache of sorts?

dyogenezOP1y ago

I think you have it right. Signed URLs are a way of giving people an address for the files through our API; they then have to call it again to get the keys. I suspect that once we put the files behind a CDN with signed keys, we’ll have even more security here.

elliot071y ago· 1 in thread

One suggestion to speed up perf. Use bucket#signed_url instead of file#signed_url, otherwise it's doing an HTTP request to Google every generation.

dyogenezOP1y ago

Thank you! I was wondering where the 250ms of latency was coming from. I’ll change this up today.

rcarmo1y ago· 1 in thread

I had to do a similar thing a decade ago when someone started scraping my site by brute force. At the time I was using CoralCDN already, but my server was getting hammered, so I just started serving up assets with hashed URLs and changing the key every 24h--their scraper was dumb enough to not start again from scratch.

I ended up using the exact same code for sharding, and later to move to a static site with Azure Storage (which lets me use SAS tokens for timed expiry if I want to).
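One way hashed asset URLs with a key that changes every 24 hours could look is sketched below. The comment doesn't describe the actual scheme, so this is an assumption throughout: the `/assets/<tag>/<name>` layout, the function names and the tag length are all invented for illustration.

```python
import hashlib
import hmac

ROTATION_SECONDS = 24 * 60 * 60  # new key every 24 hours


def period_key(master_secret: bytes, now: float) -> bytes:
    """Derive the key for the 24-hour period that contains `now`."""
    period = int(now // ROTATION_SECONDS)
    return hmac.new(master_secret, str(period).encode(), hashlib.sha256).digest()


def hashed_asset_path(master_secret: bytes, asset: str, now: float) -> str:
    """Build a URL path whose tag is only valid during the current period."""
    key = period_key(master_secret, now)
    tag = hmac.new(key, asset.encode(), hashlib.sha256).hexdigest()[:32]
    return f"/assets/{tag}/{asset}"


def is_valid(master_secret: bytes, path: str, now: float) -> bool:
    """Check a requested path against the tag expected for the current period."""
    parts = path.split("/", 3)  # ["", "assets", tag, asset]
    if len(parts) != 4 or parts[1] != "assets":
        return False
    expected = hashed_asset_path(master_secret, parts[3], now)
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(expected.encode(), path.encode())
```

A scraper holding yesterday's URLs gets rejected until it re-crawls the pages. A real deployment would probably also accept the previous period's tag for a while, so that pages rendered just before a rotation don't break.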

BonoboIO1y ago

It would be funny to give the scraper some „funny“ pictures :D

0xbadcafebee1y ago· 1 in thread

Rate limiting (and its important cousin, back-off retries) is an important feature of any service being consumed by an "outside entity". There are many different reasons you'll want rate limiting at every layer of your stack, for every request you have: brute-force resistance, [accidental] DDoS protection, resiliency, performance testing, service quality, billing/quotas, and more.

Every important service always eventually gets rate limiting. The more of it you have, the more problems you can solve. Put in the rate limits you think you need (based on performance testing) and only raise them when you need to. It's one of those features nobody adds until it's too late. If you're designing a system from scratch, add rate limiting early on. (you'll want to control the limit per session/identity, as well as in bulk)

tetha1y ago

Very much what I recommend to our teams as well. And you can totally start with something careful. Does a single IP really need 50 requests per second?

Like, sure, I have services at work where the answer is "yes". But I have 10 - 20 times more services for which I could cut that to 5 and still be fine.
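As a rough illustration of a per-client limit like the 5 requests per second mentioned above, here is a minimal in-memory token bucket. The class and parameter names are made up, and it is a single-process sketch: anything running on more than one server would need shared state (Redis, for example) and some expiry for idle clients.

```python
import time
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float   # requests the client may still make right now
    updated: float  # clock reading when tokens were last refilled


class PerClientRateLimiter:
    """Token bucket keyed by client identity (IP address, user ID, ...)."""

    def __init__(self, rate_per_second: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = self.buckets[client_id] = Bucket(float(self.burst), now)
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False


limiter = PerClientRateLimiter(rate_per_second=5, burst=5)
if not limiter.allow("203.0.113.7"):
    print("429 Too Many Requests")
```

Starting with a low limit and raising it only when a real client needs more matches the advice in this thread; the `clock` parameter exists only so the behaviour can be tested without sleeping.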

solatic1y ago· 1 in thread

Did you try sticking your bucket behind Cloud CDN?

Google's documentation is inconsistent, but you do not need to make your bucket public, you can instead grant read access only to Cloud CDN: https://cloud.google.com/cdn/docs/using-signed-cookies#confi...

Dangerously incorrect documentation claiming the bucket must be public: https://cloud.google.com/cdn/docs/setting-up-cdn-with-bucket...

dyogenezOP1y ago

This sounds like a solid next step. I’d like to stop storing URLs we don’t control in our DB and share URLs to these images behind a CDN. We could slowly roll that out and update each image url in our database over time with both continuing to work.

I didn’t realize you could do this with a private bucket by granting it access either. That combined with IP throttling at the CDN level might be a good replacement for this and cut out the need for Rails.

twothamendment1y ago· 1 in thread

We recently had a bot from Taiwan downloading all of our images, over and over and over - similar to the author. By the time we noticed they had downloaded them many times over and showed no signs of stopping!

Bots these days are out of control and have lost their minds!

jeroenhd1y ago

I recently found out that Bytedance was scraping a website of mine over and over again. I don't care about their stupid AI crawler scanning my cheapo server, but they were hitting the same files from different IP addresses, all from the same /56 China Telecom subnet.

I added a firewall rule to block the subnet and that seems to have worked. Earlier attempts involving robots.txt failed and my logs still got spammed by all the HTTPS requests when I blocked the bots in Nginx.
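The block described here was done in the firewall, which is the cheaper place to do it. Where only the application layer is available, the same subnet test can be sketched with the standard `ipaddress` module. The prefix below comes from the IPv6 documentation range and is a stand-in, since the real subnet isn't given.

```python
import ipaddress

# Stand-in prefix from the documentation range; not the subnet from the comment.
BLOCKED_NETWORKS = [ipaddress.ip_network("2001:db8:abcd:1200::/56")]


def is_blocked(remote_addr: str) -> bool:
    """True if the address falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return True  # unparseable address: refuse rather than guess
    return any(address in network for network in BLOCKED_NETWORKS)
```

Blocking the whole /56 rather than individual addresses is what makes this work against a crawler that rotates through addresses within one allocation.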

I don't understand how you could write a scraper like that and not notice that you're downloading the same files over and over again.

Alifatisk1y ago· 1 in thread

I can't describe the surprise when I saw RoR being mentioned, that was unexpected but made the article way more exciting to read.

Wouldn't this be solved by using Cloudflare R2 though?

dyogenezOP1y ago

That's good to hear! Any chance to bring in Ruby.

I'm not familiar with Cloudflare R2, so I'll have to check it out. I do like that we can rate limit based on either User ID requesting an image from the API, or by IP address. I'm not sure how we'd handle segmenting by user id with a CDN (but I'd have to read more to understand if that's a possibility).

mannyv1y ago· 1 in thread

We put assets in backblase and use fastly to cdn, because the cost is super low. It's a bit more work but super cheap.

mannyv1y ago

Oop, backblaze. Dang autocorr4ct

austin-cheney1y ago

10k requests per second has historically been a lower challenge to overcome than 10k concurrent sessions on a single box. 10k concurrent sessions was the historic design goal for standing up Node.js 15 years ago.

For everything high traffic and/or concurrency related my go to solution is dedicated sockets. Sockets are inherently session-oriented which makes everything related to security and routing more simple. If there is something about a request you don’t like then just destroy the socket. If you believe there is a DOS flood attack then keep the socket open and discard its messaging. If there are too many simultaneous sockets then jitter traffic processing via load balancer as resources become available.

paulddraper1y ago

Remember kids, CDNs are your friend.

You can roll/host your own anything. Except CDN, if you care about uptime.

nirui1y ago

In addition to "signing" the URL, you may also require users to login to view the original image, and serve visitors a compressed version. This could give you the benefit of gaining users (good for VC) while respecting the guests, as well as protecting your investments.

Back in the old days when everyone operated their own server, another thing you could do was just set up per-IP traffic throttling with iptables (`-m recent` or `-m hashlimit`). Just something to consider in case one day you grow tired of Google Cloud Storage too ;)

aftbit1y ago

I solved this problem for free with storage on B2 and a Cloudflare worker which offers free egress from B2. I don't know if they'd still offer it for free at 10k rps though!

quectophoton1y ago

Thank you for saying it as 10k requests/second. It makes it way more clear than if you had instead said requests/minute, or worse, requests/day.

lfmunoz41y ago

Don't understand why hosting providers charge for egress. Why isn't it free? Doesn't that mean that we don't have an open internet, isn't that against net neutrality?
