Reading your blog post, I don't fully get how the current signing implementation can halt massive downloads. Wouldn't the "attacker"(?) just adapt their methods to get the signed URLs first and then proceed to download what they are after anyway?
> Next, we got rid of the antiquated HTML vs. non-HTML construct, which was far too broad. Finally, we made it clear that customers can serve video and other large files using the CDN so long as that content is hosted by a Cloudflare service like Stream, Images, or R2.
Cloudflare is the way to go. I generally serve heavy files, e.g. videos, from a Cloudflare bucket to avoid expensive bills from my primary host.
I saved myself thousands $/mo moving to R2.
The downside is that people already have the direct URLs for the existing bucket. So we'd need to change those either way.
The reason why the attacker couldn't just hit the API to get the signed URLs is due to rate limiting that I go over using the rack-attack ruby gem. Since that's limited to 60/second, that's more like 43k images/day max.
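For anyone wondering what a per-IP throttle like that boils down to, here is a minimal sketch using a fixed-window counter, which is one common way to do it. This is not rack-attack's code; the class name, the in-memory dict, and the injectable clock are all my own assumptions for illustration.

```python
import time


class FixedWindowThrottle:
    """Allow at most `limit` requests per `period` seconds for each key (e.g. an IP)."""

    def __init__(self, limit, period, clock=time.monotonic):
        self.limit = limit
        self.period = period
        self.clock = clock          # injectable so the behaviour can be tested
        self.state = {}             # key -> (window index, requests seen in that window)

    def allow(self, key):
        window = int(self.clock() // self.period)
        seen_window, count = self.state.get(key, (window, 0))
        if seen_window != window:   # a new window has started, so reset the counter
            count = 0
        count += 1
        self.state[key] = (window, count)
        return count <= self.limit
```

A real deployment would keep the counters in a shared cache rather than process memory, otherwise every app server enforces its own separate limit.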
CDNs have rate-limiting mechanisms that you can easily configure, and they will be better at this than a Ruby gem (no offence to that).
In Ruby you're pushing the rate limiting job down to your own CPU, with limited visibility per IP... idk man, Cloudflare is 20/month.
My own machines, of course.
What I'm aware of are services which do not charge extra for egress but severely limit your egress bandwidth (like 10 Gbit peak, 100 Mbit avg)
And limiting egress bandwidth is better done in the service, per client, than by the hoster for your system
I had a look at the site - why does this need to run on a major cloud provider at all? Why use VERY expensive cloud storage at 9 cents per gigabyte? Why use very expensive image conversion at $50/month when you can run sharp on a Linux server?
I shouldn't be surprised - the world is all in on very expensive cloud computing.
There's another way though, assuming you are running something fairly "normal" (whatever that means) - run your own Linux servers. Serve data from those Linux computers. Use Cloudflare R2 to serve your files - it's free. You probably don't need most of your fancy architecture - run a fast server on Ionos or Hetzner or something and stop angsting about budget alerts from Google for things that should be free and running on your own computers - simple, straightforward and without IAM spaghetti and all that garbage.
EDIT: I just had a look at the architecture diagram - this is overarchitected. This is a single server application that almost has no architecture - Caddy as a web server - a local queue - serve images from R2 - should be running on a single machine on a host that charges nothing or trivial amount for data.
Entering the world where you're dealing with Cogent, your Dell and Fortinet reps, suddenly having strong opinions about iDRAC vs iLO and hardware RAID is well beyond what anyone wants to care about just to run some web servers.
When people talk about major cloud providers being expensive, the alternative is never /really/ to do it yourself but to move to a discount hosting provider. And it's not as if there aren't savings to be found there, but it's just another form of cloud optimization. We're talking about a story where $100 of spend triggers an alert. The difference is so minuscule.
You are missing both development cost and much more importantly opportunity cost
If I spend a person-year on a cheap-to-run architecture while my competitor spends a person-year on a value-add feature, he will win
If I had your skills then our costs would be much smaller. As it stands now we pay about $700/month for everything - the bulk of it for a 16gb ram / 512gb space database.
I mean technically it's not free. It's just that they have a very generous "Forever Free" number of read operations (10M/month, $0.36 per million after).
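Taking those numbers at face value, the monthly bill for reads is easy to estimate. A rough sketch; the function name is made up, and I'm assuming reads past the free tier are billed linearly, which may not match how the invoice is actually rounded.

```python
def monthly_read_cost(reads, free_reads=10_000_000, usd_per_million=0.36):
    """Estimated USD cost of read operations for one month."""
    billable = max(0, reads - free_reads)   # only reads beyond the free tier are charged
    return billable / 1_000_000 * usd_per_million


print(monthly_read_cost(5_000_000))    # inside the free tier
print(monthly_read_cost(60_000_000))   # 50M billable reads
```

So even 60M reads a month comes out to roughly $18 under these assumptions.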
today a raspberry pi 5 can do 50k/s with TLS no sweat
I've tried a little with httpd (apache) on an older desktop I use as my home server and got terrible results. I can't remember but it might have been single digit or low double digit rps.
This post goes over what happened, how we put a solution in place in hours, and how we landed on the route we took.
I'm curious to hear how others have solved this same problem – generating authenticated URLs when you have a public API.
I haven’t used Rails since they integrated storage, but gems like Paperclip used to do this for you by hashing the image parameters with a secret seed to generate the object name.
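Roughly, the idea is something like the sketch below. This is not Paperclip's actual scheme, just an illustration of the technique; the function name, the path layout, and the choice of SHA-256 are my assumptions.

```python
import hashlib
import hmac


def object_name(secret: bytes, image_id: str, style: str, ext: str) -> str:
    """Derive an unguessable object name from the image parameters and a secret."""
    message = f"{image_id}/{style}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{image_id}/{style}/{digest}.{ext}"


print(object_name(b"server-side-secret", "42", "thumb", "png"))
```

The server can always recompute the name from the parameters, but without the secret nobody else can guess the URL for image 43 from the URL for image 42. It only helps if bucket listing is disabled.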
Using signed URLs is solving a different problem: making people hit your API at least once a day to get a working GCS URL for the image. It’s not clear if that’s an actual problem, as if people want to enumerate your API (as opposed to your bucket), they can do that with the new system too.
That aside, I’m confused about the 250ms thing. You don’t have to hit a Google API to construct a signed URL. It should just be a signature calculation done locally in your server. [0]
[0] https://cloud.google.com/storage/docs/access-control/signing...
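To show what "a signature calculation done locally" means in general, here is a generic HMAC-based sketch of an expiring signed URL. To be clear, this is not the GCS signing format from [0]; the query parameter names, the message layout, and both function names are assumptions for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(secret: bytes, path: str, expires: int) -> str:
    return hmac.new(secret, f"{path}\n{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(secret: bytes, path: str, ttl: int, now=None) -> str:
    """Return `path` with an expiry timestamp and signature appended. No network call."""
    expires = int((time.time() if now is None else now) + ttl)
    query = urlencode({"expires": expires, "sig": _signature(secret, path, expires)})
    return f"{path}?{query}"


def verify_url(secret: bytes, url: str, now=None) -> bool:
    """Check the signature and expiry of a URL produced by sign_url."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(secret, parsed.path, expires)
    current = time.time() if now is None else now
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(sig.encode(), expected.encode()) and current < expires
```

Both functions are pure computation over a key the server already holds, so there is nothing here that should cost hundreds of milliseconds.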
We didn’t have list access enabled, but someone did get a list of files from our API.
Rails with Paperclip and Active Storage is amazing. Our front end is in Next.js though, so we have people upload straight from Next to GCS, then we send the GCS URL to Rails. We don’t do pre-processing of images, so just storing the original is good.
They can still download every image, but they’ll be throttled now and kicked out at the middleware level, or permission denied from GCP. 60/min vs 10k/min.
The signature calculation might not be hitting Google in that case. I noticed a long data dump in the console after requesting the signed URL, combined with the additional latency, and assumed it was. Maybe it’s just a cryptographically difficult calculation like bcrypt and it takes a while. Will have to check, because it’d be great to not need a network-reliant call for that.
I assume the additional latency is the initial cred fetch from the VM Metadata Service to perform that sign, no?
I think this is interesting to ask, because I often have problems where I'm almost certain it's been solved before, just people don't bother to write about it. Where can people congregate to discuss questions like this?
Not use Google to start with. And not make S3 buckets public. Must be accessed via CloudFront or CF Signed URLs. Making stuff public is dumb.
They're also a good way to allow users to upload images to your CDN without having to actually upload that data to your web API backend; you just give the user a presigned PUT request URL and they get a one-time ticket to upload to your bucket
That is, the entire point is to not put more compute between the requester and the data. The absolute worst place to be would be to have compute that is streaming from the data provider, so that they can stream to the end user.
Right?
On AWS you'd put CloudFront in front of the (now-private) bucket as a CDN, then use WAF for rate limiting, bot control, etc. In my experience GCP's services work similarly to AWS, so...is this not possible with GCP, or why wasn't this the setup from the get-go? That's the proper way to do things IMO.
Signed URLs I only think of when I think of like, paid content or other "semi-public" content.
The other thing is, AWS WAF was released in 2015, and the Google Cloud Armor WAF feature (the what now?) was released in 2020.
There were so many things that were outright wrong in their documentation that caused me many sleepless nights. Like not recommending using a pool or closing Cloud SQL connections in serverless functions because they'll be closed automatically when the instance spins down.
Don't get me wrong I had used pools extensively before, and I knew you had to close connections but their docs and examples would explicitly show the connections not being closed, just left for them to close when the instance spins down.
Idk why they never thought that an instance might never spin down if it's getting hammered with requests, so you end up with hundreds of open connections over multiple instances until GCP starts killing your requests, telling you "out of connections" in a serverless instance. It's the vaguest possible error; after a lot of debugging you work out that you can't have more than 100 open connections on a single function instance, but you were technically never supposed to have more than one open at any given time.
sigh
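For contrast, this is the shape I'd expect the docs to have shown: a small bounded pool created once per instance, with every connection handed back when the request is done. It's only a sketch; `sqlite3` stands in for the real database driver, and the `BoundedPool` class is hypothetical rather than any library's API.

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedPool:
    """Hold a fixed number of connections; callers borrow one and must give it back."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Raises queue.Empty if nothing frees up within `timeout` seconds,
        # instead of silently opening yet another connection.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)    # always returned, even if the handler raised


# Created once at import time, so it is reused across requests on a warm instance.
pool = BoundedPool(lambda: sqlite3.connect(":memory:"), size=1)


def handler():
    with pool.connection() as conn:
        return conn.execute("SELECT 1").fetchone()
```

With the pool size fixed up front, a busy instance that never spins down still can't hold more connections than you planned for.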
Maybe this breaks things, maybe you need to expire some caches, but (forgive me for being blunt, I can't think of a better way to say it) that's the cost of not doing things correctly to begin with.
My first thought as a security engineer when setting something up to be public has always been "how hard could someone hit this, and how much would it cost/affect availability?"
Does anybody here have a success story where AWS was either much cheaper to operate or to develop for (ideally both) than the normal alternatives?
8-16 cores can easily(!!!) push this kind of data without even heating up, not sure wtf OP is doing. Well, I know what OP is doing - they fell for the idea that the cloud is more scalable.
The issue with this is that the cloud™ starts "scaling" at the first user, whereas a baremetal server needs to scale when you have saturated an 8-16 core modern CPU, a 1-10 Gb/s NIC, 30-60 GB of RAM. In other words, baremetal needs to scale when you actually run out of hardware resources, which is tens or hundreds of thousands of users later.
Edit: for example, at BeamMP, we run on a few bare metal servers, and serve 22k unique users per day in a multiplayer videogame service. Funded by around 800 people donating.
Still, you need to deal with bad scrapers. Plus, this scraper downloaded at a consistent 650mbps, taking up half the unlimited Hetzner pipe by itself; if you'd go for a 10gbps Hetzner machine, you suddenly start paying egress fees once you hit 20TB of traffic. Even then, if you go the cheapo Hetzner server route, you probably still want at least some kind of CDN to keep latency down. Add to that costs of backups and synchronising failovers, and you may end up with more traffic than you'd expect.
I think going bare metal would save more than the signed URLs would, at least until the ten thousandth customer, but not everyone is proficient in maintaining servers. A lot of cloud projects I see are coming from programmers who don't want to/don't know how to maintain a Linux server and just want to run their code. If you're in that category, taking time off to learn server maintenance or hiring a sysadmin can easily be a lot more expensive than paying the extortionate rates cloud providers demand.
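To put the 650mbps and 20TB figures above in perspective, here is the back-of-the-envelope arithmetic. The helper names are made up, and I'm assuming decimal units (1 TB = 10^12 bytes) and a scraper that runs flat out around the clock.

```python
SECONDS_PER_DAY = 86_400


def terabytes_per_day(megabits_per_second):
    """Data moved in one day at a sustained rate, in decimal terabytes."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def days_until(quota_tb, megabits_per_second):
    """Days of sustained traffic before a monthly quota is used up."""
    return quota_tb / terabytes_per_day(megabits_per_second)


print(terabytes_per_day(650))   # roughly 7 TB per day
print(days_until(20, 650))      # a bit under 3 days
```

So a single scraper at that rate would burn through a 20TB allowance in about three days.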
But I would rather say, cloud is not for everyone. Especially in the case mentioned in the article. Think this: do you really REALLY need to distribute enlarged images globally at top speed? I bet most people just don't.
Same thing goes for "scaling", it's true the cloud can do that very well, but do you really need it that bad?
For short text messages, probably not an issue. With larger stuff like images or video, I would be more careful.
Still, even for text-only, if you're using PostgreSQL, by default you have a limit of (I think) 100 parallel connections (or 97, because I think 3 are reserved for superusers), but each connection can only be executing one transaction at a time, so that can quickly become a bottleneck depending on your application and how fast you need to make queries vs how long your queries take to return a response. So then you might need to tune some PostgreSQL settings, or add caching, or some other way to work around the issue.
If you add more services, then you also need to keep in mind the latency between those services.
And so on and so on. So RAM and network would probably become an issue way earlier than CPU in most cases.
TL;DR: "It depends".
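To make the connection-limit point above concrete: if each connection runs one transaction at a time, the ceiling on throughput is just connections divided by average transaction time. A tiny sketch, where the 97 usable connections come from the comment above and the 20 ms and 500 ms durations are numbers I picked for illustration.

```python
def max_transactions_per_second(connections, avg_transaction_ms):
    """Upper bound on throughput when each connection handles one transaction at a time."""
    return connections * 1000 / avg_transaction_ms


print(max_transactions_per_second(97, 20))    # 4850.0 with fast 20 ms transactions
print(max_transactions_per_second(97, 500))   # 194.0 with slow 500 ms transactions
```

Same connection limit, a 25x difference in what the database can absorb, which is why "it depends".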
In all seriousness, the devil is in the details around this kind of stuff, but I do worry that doing something not even clever, but just nonstandard, introduces a larger maintenance effort than necessary.
Interesting problem, and an interesting solution, but I'd probably rather just throw money at it until it gets to a scale that merits further bot prevention measures.
If money wasn't an issue, I'd probably just allow people to download images for free.
Your point is valid for normal usage patterns where there is a direct relationship between active users and cost. But an attack meant OP’s costs were skyrocketing even though usage was flat.
For example, the value of session ID cookies should actually be signed with an HMAC, and checked at the edge by the CDN. Session cookies that represent an authenticated session should also look different than unauthenticated ones. The checks should all happen at the edge, at your reverse proxy, without doing any I/O or calling your "fastcgi" process manager.
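A minimal sketch of what such a cookie check could look like, assuming a shared secret at the edge. The cookie layout (`prefix.session_id.tag`) and the `a`/`g` prefixes are my own invention for illustration, not a standard.

```python
import hashlib
import hmac
import secrets


def _tag(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_cookie(secret: bytes, authenticated: bool) -> str:
    """Build a signed session cookie; the prefix marks authenticated vs guest sessions."""
    prefix = "a" if authenticated else "g"
    payload = f"{prefix}.{secrets.token_urlsafe(16)}"
    return f"{payload}.{_tag(secret, payload)}"


def check_cookie(secret: bytes, cookie: str):
    """Return 'authenticated', 'guest', or None for a forged cookie. Does no I/O."""
    payload, sep, tag = cookie.rpartition(".")
    if not sep:
        return None
    if not hmac.compare_digest(tag.encode(), _tag(secret, payload).encode()):
        return None
    return {"a": "authenticated", "g": "guest"}.get(payload.split(".", 1)[0])
```

Because the check is a single HMAC over the cookie value, junk traffic with made-up session IDs can be dropped before it ever reaches the app server or the session store.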
But let's get to the juicy part... hosting files. Ideally, you shouldn't have "secret URLs" for files, because then they can be shared and even (gasp) hotlinked from websites. Instead, you should use features like X-Accel-Redirect in NGINX to let your app server determine access to these gated resources. Apache has similar things.
Anyway, here is a write-up which goes into much more detail: https://community.qbix.com/t/files-and-storage/286
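For readers who haven't seen X-Accel-Redirect before, the app side can be as small as this WSGI sketch. The `/files/` and `/protected/` paths, the demo header, and `user_may_read` are all hypothetical; the NGINX side is assumed to have a matching `internal` location, shown in the comment.

```python
# Assumed NGINX config (not part of this script):
#   location /protected/ { internal; alias /var/data/files/; }
# "internal" means clients cannot request /protected/ directly; only a
# response carrying X-Accel-Redirect can make NGINX serve from it.

def user_may_read(environ, filename):
    """Hypothetical access check; a real app would look at the session here."""
    return environ.get("HTTP_X_DEMO_USER") == "alice"


def app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    prefix = "/files/"
    filename = path[len(prefix):] if path.startswith(prefix) else ""
    # Reject anything that is not a plain file name, to avoid path traversal.
    if not filename or "/" in filename or filename.startswith("."):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    if not user_may_read(environ, filename):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"forbidden"]
    # The app only decides; NGINX streams the bytes from disk.
    start_response("200 OK", [("X-Accel-Redirect", f"/protected/{filename}")])
    return [b""]
```

The public URL stays the same for everyone, so sharing or hotlinking it gets an outsider nothing but a 403.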
The write-up discusses X-Accel-Redirect's functionality in the context of qbix. For me, the two were rather hard to tease apart in that context. So for others who feel that way, I would recommend starting with Grant Horwood's introduction to X-Accel-Redirect at
https://gbh.fruitbat.io/2024/05/12/nginx-serving-private-fil...
https://help.archive.org/help/managing-and-editing-your-item...
It's a very open platform. Think up what the best format for your data is and upload a compressed zip file or tar.gz of the data.
I'd likely do different archives for images and metadata, so people that want to just process metadata can download that specific data and work on it.
Luckily, as you can edit over time, you can experiment and adjust based upon users' feedback.
I don't like that if Rails goes down our images go down. I'd much prefer to separate these out and show the signed URLs in Next.js and be able to generate them through the API. I think we'll get there, but that's a bigger change than I could reliably make in a day.
Edit: I see this is discussed in other threads.
Given that we have the problem now, and that people already have the non-CDN URLs, we needed a solution that allowed us to roll out something ASAP, while allowing people that use our API to continue using the image URLs they've downloaded.
I am sorry but who sees a $100 sudden charge, assumes misconfiguration and just goes about their day without digging deeper right away?
You want me to spend an hour trying to save the company $100? We just spent $250. And that’s not the half of it. If the company is expecting me to result in $5 in revenue for every dollar they spend on me, we really just lost out on more than $1000.
I’ve worked many places where we didn’t think about opportunity costs. I’ve also been laid off many times.
> If this were a business and someone else's money I'd do the same. This is a bootstrapped side project coming out of my own wallet. If money wasn't an issue, I'd probably just allow people to download images for free.
[1] https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-Ne...
It wasn't unusual, for first-time victims at least, that we'd a) waive the fees and b) schedule a solution architect to talk them through using signed URLs or some other mitigation. I have no visibility into current practice either at AWS or GCP but I'd encourage OP to seek billing relief nevertheless, it can't hurt to ask. Sustainable customer growth is the public cloud business model, of which billing surprises are the antithesis.
I ended up using the exact same code for sharding, and later to move to a static site with Azure Storage (which lets me use SAS tokens for timed expiry if I want to).
Every important service always eventually gets rate limiting. The more of it you have, the more problems you can solve. Put in the rate limits you think you need (based on performance testing) and only raise them when you need to. It's one of those features nobody adds until it's too late. If you're designing a system from scratch, add rate limiting early on. (you'll want to control the limit per session/identity, as well as in bulk)
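As a sketch of "per session/identity, as well as in bulk": two layers of token buckets, one per identity and one shared by everyone. The class names and the `(rate, capacity)` tuples are my own choices for illustration; a production version would need shared storage and eviction of idle identities.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.updated = clock()

    def has_token(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        return self.tokens >= 1

    def take(self):
        self.tokens -= 1


class LayeredLimiter:
    """A request must fit both its identity's bucket and the overall bucket."""

    def __init__(self, per_identity, overall, clock=time.monotonic):
        self.per_identity = per_identity          # (rate, capacity) for each identity
        self.clock = clock
        self.overall = TokenBucket(*overall, clock=clock)
        self.buckets = {}

    def allow(self, identity):
        if identity not in self.buckets:
            self.buckets[identity] = TokenBucket(*self.per_identity, clock=self.clock)
        bucket = self.buckets[identity]
        # Only spend tokens when both layers agree, so a rejected request costs nothing.
        if bucket.has_token() and self.overall.has_token():
            bucket.take()
            self.overall.take()
            return True
        return False
```

The per-identity layer stops one noisy client; the overall layer protects the backend when many well-behaved clients add up to too much.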
Like, sure, I have services at work where the answer is "yes". But I have 10 - 20 times more services for which I could cut that to 5 and still be fine.
Google's documentation is inconsistent, but you do not need to make your bucket public, you can instead grant read access only to Cloud CDN: https://cloud.google.com/cdn/docs/using-signed-cookies#confi...
Dangerously incorrect documentation claiming the bucket must be public: https://cloud.google.com/cdn/docs/setting-up-cdn-with-bucket...
I didn’t realize you could do this with a private bucket by granting it access either. That combined with IP throttling at the CDN level might be a good replacement for this and cut out the need for Rails.
Bots these days are out of control and have lost their minds!
I added a firewall rule to block the subnet and that seems to have worked. Earlier attempts involving robots.txt failed and my logs still got spammed by all the HTTPS requests when I blocked the bots in Nginx.
I don't understand how you could write a scraper like that and not notice that you're downloading the same files over and over again.
Wouldn't this be solved by using Cloudflare R2 though?
I'm not familiar with Cloudflare R2, so I'll have to check it out. I do like that we can rate limit based on either User ID requesting an image from the API, or by IP address. I'm not sure how we'd handle segmenting by user id with a CDN (but I'd have to read more to understand if that's a possibility).
For everything high traffic and/or concurrency related my go to solution is dedicated sockets. Sockets are inherently session-oriented which makes everything related to security and routing more simple. If there is something about a request you don’t like then just destroy the socket. If you believe there is a DOS flood attack then keep the socket open and discard its messaging. If there are too many simultaneous sockets then jitter traffic processing via load balancer as resources become available.
You can roll/host your own anything. Except CDN, if you care about uptime.
Back in the old days when everyone operated their own server, another thing you could do was to just set up per-IP traffic throttling with iptables (`-m recent` or `-m hashlimit`). Just something to consider in case one day you might grow tired of Google Cloud Storage too ;)