Amazon S3 now supports the ability to append data to an object (opens in new tab)

(aws.amazon.com)

199 pointsnotgiorgi1y ago67 comments

67 comments

Wrote some notes on this here: https://simonwillison.net/2024/Nov/22/amazon-s3-append-data/

Key points:

- It's just for the "S3 Express One Zone" bucket class, which is more expensive (16c/GB/month compared to 2.3c for S3 standard tier) and less highly available, since it lives in just one availability zone

- "With each successful append operation, you create a part of the object and each object can have up to 10,000 parts. This means you can append data to an object up to 10,000 times."

That 10,000 parts limit means this isn't quite the solution for writing log files directly to S3.

jiggawatts1y ago

Wow, I'm surprised it took AWS this long to (mostly) catch up to Azure, which had this feature back in 2015: https://learn.microsoft.com/en-us/rest/api/storageservices/u...

Azure supports 50,000 parts, zone-redundancy, and append blobs are supported in the normal "Hot" tier, which is their low-budget mechanical drive storage.

Note that both 10K and 50K parts means that you can use a single blob to store a day's worth of logs and flush every minute (1,440 parts). Conversely, hourly blobs can support flushing every second (3,600 parts). Neither support daily blobs with per-second flushing for a whole day (86,400 parts).

Typical designs involve a per-server log, per hour. So the blob path looks like:

    "{account}/{path}/{year}/{month}/{day}/{hour}_{servername}.txt"

This seems insane, but it's not a file system! You don't need to create directories, and you're not supposed to read these using VIM, Notepad, or whatever.

The typical workflow is to run a daily consolidation into an indexed columnstore format like Parquet, or send it off to Splunk, Log Analytics, or whatever...

sofixa1y ago

> Wow, I'm surprised it took AWS this long to (mostly) catch up to Azure, which had this feature back in 2015:

Microsoft had the benefit of starting later and learning from Amazon's failures and successes. S3 dates from 2006.

That being said, both Microsoft and Google learned a lot, but also failed at learning different things.

GCP has a lovely global network, which makes multi-region easy. But they spent way too much time on GCE and lost the early advantage they had with Google App Engine.

Azure severely lacks in security (check out how many critical cross-tenant security vulnerabilities they've had in the past few years) and reliability (how many times have there been various outages due to a single DC in Texas failing; availability zones still aren't the default there).

ak2171y ago

Microsoft did this by sacrificing other features of object storage that S3 and GS had since the beginning, primarily performance, automatic scaling, unlimited provisioning and cross-sectional (region wide) bandwidth. Azure blob storage did not have parity on those features back in 2015 and data platform applications could not be implemented on top of it as a result. Since then they fixed some of these, but there are still areas where Azure lacks scaling features that are taken for granted on AWS and GCP.

jiggawatts1y ago

Today I learned that there is a 5 PB soft capacity limit for Azure blob storage: https://learn.microsoft.com/en-us/azure/storage/common/scala...

Also, a 200 Gbps egress limit.

How does that compare to S3?

Mind you, at this scale the storage cost is about $15K/mo, so it would be cost effective to throw some developer time at the problem of scaling out between multiple storage accounts. Or just call support to have the soft limit cap raised…

cedilla1y ago

If I need to consolidate anyway, is this really a win for this use case? I could just upload with {hour}_{minute}.txt instead of appending every minute, right?

jiggawatts1y ago

Consolidation is for archival cost efficiency and long-term analytics. If you don't append regularly, you can lose up to 59 minutes of data.

zaphirplane1y ago

In all fairness. Shipping unreliable features for unreliable services is a lot easier

kochie1y ago

AWS ranks features based on potential income from customers. Normally there’s a fairly big customer PFR needed to get a service team to implement a new feature.

jiggawatts1y ago

I always found it strange that AWS seems to have 2-3x as many products or services as Azure, but it has these bizarre feature gaps where as an Azure user I think: "Really? Now? In this year you're finally getting this?"

(Conversely, Azure's low-level performance is woeful in comparison to AWS and they're still slow-walking the rollout of their vaguely equivalent networking and storage called Azure Boost.)

2 more replies

omeid21y ago

Not directly, but enough to write once every hour for more than a year!

simonw1y ago

Yeah, or I guess log rotation will work well - you can write 10,000 lines to one key and then switch to a new key name.

santiagobasulto1y ago

This will require some serious buffering.

electroly1y ago

The original title is "Amazon S3 Express One Zone now supports the ability to append data to an object" and the difference is extremely important! I was excited for a moment.

teractiveodular1y ago

For comparison, while GCS doesn't support appends directly, there's hacky but effective workaround in that you can compose existing objects together into new objects, without having to read & write the data. If you have existing object A, upload new object B, and compose A and B together so that the resulting object is also called A, this effectively functions the same as appending B into A.

https://cloud.google.com/storage/docs/composite-objects#appe...

pclmulqdq1y ago

Colossus allows appends, so it would make sense that there's a backdoor way to take advantage of that in GCS. It seemed silly of me that Google didn't just directly allow appends given the architecture.

CharlieDigital1y ago

There are some limitations[0] to work around (can only compose 32 at a time and it doesn't auto delete composed parts), but I find this approach super useful for data ingest and ETL processing flows while being quite easy to use.

[0] https://chrlschn.dev/blog/2024/07/merging-objects-in-google-...

vdm1y ago

S3 can also do this https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPa...

thinkharderdev1y ago

It's similar but no really the same thing. It has to be done up front by initiating a multi-part upload to start. The parts are still technically accessible as S3 objects but through a different API. But the biggest limitation is that each part has to be >5MB (except for the final part)

new_user_final1y ago

It's totally different thing and requires special way to initiate multi-part uploading.

vdm1y ago

totally different how?

1 more reply

sureIy1y ago

It's crazy to me that anyone would still consider S3 after R2 was made available, given the egress fees. I regularly see people switching to R2 and saving thousands or hundreds of thousands by switching.

JonoBB1y ago

For the most part I agree, but we have found that R2 does not handle large files (hundreds of GB or larger) very well. It will often silently fail with nothing being returned, so it’s not possible to handle it gracefully.

remus1y ago

Depends a bit on your use case. If you've got lots of other infra on AWS and you don't need to store that much data then the downside of using another platform can outweigh the cost savings.

compootr1y ago

doesn't s3 have 'free' (subsidized!) transfer to other products like ec2 though? it might look better to businesspeople that "well, we're doing this processing on ec2, why not keep our objects in s3 since it's a bit cheaper!"

mjlee1y ago

S3 has free data transfer within the same region.

dragonwriter1y ago

> It's crazy to me that anyone would still consider S3 after R2 was made available, given the egress fees.

If your compute is on AWS, using R2 (or anything outside of AWS) for object storage means you pay AWS egress for “in-system” operations rather than at the system boundary, which is often much more expensive (plus, you also probably add a bunch of latency compared to staying on AWS infra.) And unless you are exposing your object store directly externally as your interface to the world, you still pay AWS egress at the boundary.

Now, if all you use AWS for is S3, R2 may be, from a cost perspective, a no brainer, but who does that?

YetAnotherNick1y ago

In most cases S3 data is not directly exposed to the client. If the middleware is EC2, then you need to pay the same egress fee, but you you will have lower latency with S3, as EC2 shares the same datacenter as S3.

lijok1y ago

People still use S3 because doing business with Cloudflare is a liability.

bobnamob1y ago

Elaborate?

mxuribe1y ago

Genuinely curious what you meant by this?

anentropic1y ago

There are reasons like this: https://github.com/cloudflare/terraform-provider-cloudflare/...

crest1y ago

Just wait until Cloudflare decides you can afford to be on a different plan.

ChrisArchitect1y ago

Please fix title: Amazon S3 Express One Zone now supports the ability to append data to an object

supermatt1y ago

This doesnt seem very useful for many cases, given that you NEED to specify a write offset in order for it to work. So you need to either track the size (which becomes more complex if you have multiple writers), or need to first request the size every time you want to do a write and then race for it using the checksum of the current object... Urghhh.

thecleaner1y ago

I don't understand the bashing in the comments. I image this is a tough distributed systems challenge (as with anything S3). Of course AWS is charging more since they've cracked it.

Does anybody know if appending still has that 5TB file limit ?

taeric1y ago

I'm curious on the different use cases for this? Firehose/kinesis whatever the name seems to have the append case covered in ways that I would think has fewer foot guns?

styx311y ago

I am surprised it was not supported until now? How does it compare to azure blob append (which exists for years)?

I have been using azure storage append blob to store logs of long running tasks with periodic flush (see https://learn.microsoft.com/en-us/rest/api/storageservices/u...)

xkqd1y ago

I know the whole point of cloud services is to pick and choose, but in general I wouldn’t express “outrage” or scoff when comparing Azure to AWS. I recommend Azure to the smallest and leanest of shops, but when you compare functionality matrices and maturity Azure is a children’s toy.

To compare the other way, Azure write blocks target replication blob containers. I consider that a primitive and yet they just outright say you can’t do it. When I engaged our TPM on this we were just told our expectations were wrong and we were thinking about the problem wrong.

styx311y ago

I did not want to express any outrage (even sarcastically), just surprise and the fact that I don't know very well the AWS offer.

> Azure write blocks target replication blob containers

I am sorry but what does it mean?

The goal of my question was about what are the differences between the two solutions: I know HN is a place where I can read technical arguments based on actual experience.

merek1y ago

This is specifically for S3 "Express One Zone"

exac1y ago

I wonder what the implications for all the s3-like APIs is going to be.

insomniacity1y ago

Good point - plus the conditional writes recently announced. Did anyone else implement that?

100pctremote1y ago

MinIO actually supported conditional writes a year before S3

from-nibly1y ago

Logs are a terrible usecase for this. Loki already existed and it uses the cheaper more highly available s3

kylegalbraith1y ago

I got excited until I saw the one zone part. That is a critical difference in terms of cost.

juancampa1y ago

Isn't it cheaper than normal S3 though?

> S3 Express One Zone delivers data access speed up to 10x faster and request costs up to 50% lower than S3 Standard [0]

The critical difference seems to be in availability (1 AZ)

[0] https://aws.amazon.com/s3/storage-classes/express-one-zone/

rrampage1y ago

Storage cost is much higher. Express One Zone is $0.16 per GB while standard S3 is $0.023 per GB ( https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 )

datatrashfire1y ago

Ingress pricing is indeed cheaper. POST is $0.005 per thousand requests on standard and $0.0025 on express one.

Egress and storage however are more expensive on express one than any other tier. For comparison, glacier (instant), standard and express are $0.004, $0.023 and $0.16 per GB. Although slight, standard tier also receives additional discounts above 50 TB.

Maxious1y ago

> Compared to S3 Standard, even though the new class offers 50% cheaper request pricing—storage is almost seven times more expensive.

The house always wins https://www.vantage.sh/blog/amazon-s3-express-one-zone

wood_spirit1y ago

Will be exciting to see what adaptations are needed and how performance and cost changes for delta lake and iceberg and other cloud mutable data storage formats. It could be really dramatic!

S3 is often used as a lowest common denominator, and a lot of the features of azure and gcs aren’t leveraged by libraries and formats that try to be cross platform so only want to expose features that are available everywhere.

If these days all object stores do append then perhaps all the data storage formats and libs can start leveraging it?

crest1y ago

I wonder at which point they'll admit they've added back all the complexity of a hierarchical filesystem.

klysm1y ago

This sounds like a thin wrapper over an underlying object store

msoad1y ago

Does it really work for livestreams? Can I stream read and write on the same video file? That is huge if true!

Edit: oh it’s only in one AZ

jrpelkonen1y ago

Livestream is not usually done by writing to/reeding from a single media file. Instead, the media is broken into few second long segments. The current set of segments are indicated by a HLS and/or DASH manifest, which is updated as new segments appear.

chx1y ago

If you want to know the differences between Express One Zone and normal, check https://www.vantage.sh/blog/amazon-s3-express-one-zone this blog post. I had no idea this even existed. tl;dr: it's x7 expensive.

andrewstuart1y ago

There are many, many S3 compatible storage services out there provided by other companies.

Most of them cheaper, some MUCH cheaper.

orthoxerox1y ago

But most of them are compliant with the standard S3 API. If I use AWS SDK to write data to my on-prem Ceph/Minio/SeaweedFS/Hitachi storage, I want this SDK to support the concept of appending data to an object.

water91y ago

Incredible breakthrough. What will they come up with next the ability to remove data from an object? It’s clear that not working from home is really working out for them

maryndisouza1y ago

This is a fantastic addition to Amazon S3 Express One Zone! The ability to directly append data to existing objects opens up new possibilities for real-time data processing applications. Whether it's continuously adding log entries or appending video segments in a media workflow, this feature will streamline workflows and improve efficiency for many use cases. It's great to see AWS continuing to innovate and make data management even more flexible and user-friendly. Excited to see how this feature enhances the scalability of applications across various industries!

andrewstuart1y ago

Amazon has no right to do this - it no longer owns the S3 standard and should respect the ecosystem and community.

S3 has stagnated for a long time, allowing it to become a standard.

Third parties have cloned the storage service and a vast array of software is compatible. There’s drivers, there’s file transfer programs and utilities.

What does it mean that Amazon is now changing it.

Does Amazon even really own the standard any more, does it have the right to break the long standing standards?

I’m reminded of IBM when they broke compatibility of the PS/2 computers just so it could maintain dominance.

j / k navigate · click thread line to collapse

67 comments

simonw1y ago

Wrote some notes on this here: https://simonwillison.net/2024/Nov/22/amazon-s3-append-data/

Key points:

- "With each successful append operation, you create a part of the object and each object can have up to 10,000 parts. This means you can append data to an object up to 10,000 times."

That 10,000 parts limit means this isn't quite the solution for writing log files directly to S3.

jiggawatts1y ago

Wow, I'm surprised it took AWS this long to (mostly) catch up to Azure, which had this feature back in 2015: https://learn.microsoft.com/en-us/rest/api/storageservices/u...

Azure supports 50,000 parts, zone-redundancy, and append blobs are supported in the normal "Hot" tier, which is their low-budget mechanical drive storage.

Typical designs involve a per-server log, per hour. So the blob path looks like:

    "{account}/{path}/{year}/{month}/{day}/{hour}_{servername}.txt"

This seems insane, but it's not a file system! You don't need to create directories, and you're not supposed to read these using VIM, Notepad, or whatever.

The typical workflow is to run a daily consolidation into an indexed columnstore format like Parquet, or send it off to Splunk, Log Analytics, or whatever...

sofixa1y ago

> Wow, I'm surprised it took AWS this long to (mostly) catch up to Azure, which had this feature back in 2015:

Microsoft had the benefit of starting later and learning from Amazon's failures and successes. S3 dates from 2006.

That being said, both Microsoft and Google learned a lot, but also failed at learning different things.

GCP has a lovely global network, which makes multi-region easy. But they spent way too much time on GCE and lost the early advantage they had with Google App Engine.

ak2171y ago

jiggawatts1y ago

Today I learned that there is a 5 PB soft capacity limit for Azure blob storage: https://learn.microsoft.com/en-us/azure/storage/common/scala...

Also, a 200 Gbps egress limit.

How does that compare to S3?

cedilla1y ago

If I need to consolidate anyway, is this really a win for this use case? I could just upload with {hour}_{minute}.txt instead of appending every minute, right?

jiggawatts1y ago

Consolidation is for archival cost efficiency and long-term analytics. If you don't append regularly, you can lose up to 59 minutes of data.

zaphirplane1y ago

In all fairness. Shipping unreliable features for unreliable services is a lot easier

kochie1y ago

AWS ranks features based on potential income from customers. Normally there’s a fairly big customer PFR needed to get a service team to implement a new feature.

jiggawatts1y ago

(Conversely, Azure's low-level performance is woeful in comparison to AWS and they're still slow-walking the rollout of their vaguely equivalent networking and storage called Azure Boost.)

2 more replies

omeid21y ago

Not directly, but enough to write once every hour for more than a year!

simonw1y ago

Yeah, or I guess log rotation will work well - you can write 10,000 lines to one key and then switch to a new key name.

santiagobasulto1y ago

This will require some serious buffering.

electroly1y ago

The original title is "Amazon S3 Express One Zone now supports the ability to append data to an object" and the difference is extremely important! I was excited for a moment.

teractiveodular1y ago

https://cloud.google.com/storage/docs/composite-objects#appe...

pclmulqdq1y ago

CharlieDigital1y ago

[0] https://chrlschn.dev/blog/2024/07/merging-objects-in-google-...

vdm1y ago

S3 can also do this https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPa...

thinkharderdev1y ago

new_user_final1y ago

It's totally different thing and requires special way to initiate multi-part uploading.

vdm1y ago

totally different how?

1 more reply

sureIy1y ago

JonoBB1y ago

remus1y ago

Depends a bit on your use case. If you've got lots of other infra on AWS and you don't need to store that much data then the downside of using another platform can outweigh the cost savings.

compootr1y ago

mjlee1y ago

S3 has free data transfer within the same region.

dragonwriter1y ago

> It's crazy to me that anyone would still consider S3 after R2 was made available, given the egress fees.

Now, if all you use AWS for is S3, R2 may be, from a cost perspective, a no brainer, but who does that?

YetAnotherNick1y ago

lijok1y ago

People still use S3 because doing business with Cloudflare is a liability.

bobnamob1y ago

Elaborate?

mxuribe1y ago

Genuinely curious what you meant by this?

anentropic1y ago

There are reasons like this: https://github.com/cloudflare/terraform-provider-cloudflare/...

crest1y ago

Just wait until Cloudflare decides you can afford to be on a different plan.

ChrisArchitect1y ago

Please fix title: Amazon S3 Express One Zone now supports the ability to append data to an object

supermatt1y ago

thecleaner1y ago

I don't understand the bashing in the comments. I image this is a tough distributed systems challenge (as with anything S3). Of course AWS is charging more since they've cracked it.

Does anybody know if appending still has that 5TB file limit ?

taeric1y ago

I'm curious on the different use cases for this? Firehose/kinesis whatever the name seems to have the append case covered in ways that I would think has fewer foot guns?

styx311y ago

I am surprised it was not supported until now? How does it compare to azure blob append (which exists for years)?

I have been using azure storage append blob to store logs of long running tasks with periodic flush (see https://learn.microsoft.com/en-us/rest/api/storageservices/u...)

xkqd1y ago

styx311y ago

I did not want to express any outrage (even sarcastically), just surprise and the fact that I don't know very well the AWS offer.

> Azure write blocks target replication blob containers

I am sorry but what does it mean?

The goal of my question was about what are the differences between the two solutions: I know HN is a place where I can read technical arguments based on actual experience.

merek1y ago

This is specifically for S3 "Express One Zone"

exac1y ago

I wonder what the implications for all the s3-like APIs is going to be.

insomniacity1y ago

Good point - plus the conditional writes recently announced. Did anyone else implement that?

100pctremote1y ago

MinIO actually supported conditional writes a year before S3

from-nibly1y ago

Logs are a terrible usecase for this. Loki already existed and it uses the cheaper more highly available s3

kylegalbraith1y ago

I got excited until I saw the one zone part. That is a critical difference in terms of cost.

juancampa1y ago

Isn't it cheaper than normal S3 though?

> S3 Express One Zone delivers data access speed up to 10x faster and request costs up to 50% lower than S3 Standard [0]

The critical difference seems to be in availability (1 AZ)

[0] https://aws.amazon.com/s3/storage-classes/express-one-zone/

rrampage1y ago

Storage cost is much higher. Express One Zone is $0.16 per GB while standard S3 is $0.023 per GB ( https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 )

datatrashfire1y ago

Ingress pricing is indeed cheaper. POST is $0.005 per thousand requests on standard and $0.0025 on express one.

Maxious1y ago

> Compared to S3 Standard, even though the new class offers 50% cheaper request pricing—storage is almost seven times more expensive.

The house always wins https://www.vantage.sh/blog/amazon-s3-express-one-zone

wood_spirit1y ago

Will be exciting to see what adaptations are needed and how performance and cost changes for delta lake and iceberg and other cloud mutable data storage formats. It could be really dramatic!

If these days all object stores do append then perhaps all the data storage formats and libs can start leveraging it?

crest1y ago

I wonder at which point they'll admit they've added back all the complexity of a hierarchical filesystem.

klysm1y ago

This sounds like a thin wrapper over an underlying object store

msoad1y ago

Does it really work for livestreams? Can I stream read and write on the same video file? That is huge if true!

Edit: oh it’s only in one AZ

jrpelkonen1y ago

chx1y ago

andrewstuart1y ago

There are many, many S3 compatible storage services out there provided by other companies.

Most of them cheaper, some MUCH cheaper.

orthoxerox1y ago

water91y ago

Incredible breakthrough. What will they come up with next the ability to remove data from an object? It’s clear that not working from home is really working out for them

maryndisouza1y ago

andrewstuart1y ago

Amazon has no right to do this - it no longer owns the S3 standard and should respect the ecosystem and community.

S3 has stagnated for a long time, allowing it to become a standard.

Third parties have cloned the storage service and a vast array of software is compatible. There’s drivers, there’s file transfer programs and utilities.

What does it mean that Amazon is now changing it.

Does Amazon even really own the standard any more, does it have the right to break the long standing standards?

I’m reminded of IBM when they broke compatibility of the PS/2 computers just so it could maintain dominance.

j / k navigate · click thread line to collapse