I feel like a lot of complexity and performance overhead could be removed if you only stored immutable blobs under their hash (e.g. BLAKE3). Combined with soft deletes, this would make all operations idempotent, blobs trivially cacheable, and all state a CRDT (monotonically mergeable, coordination-free).
There is stuff like IPFS in the large, but I want this for local deployments as an S3 replacement, where the metadata is stored elsewhere, like in git or a database.
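To make the idea concrete, here is a rough sketch of what such a store could look like on a local disk. Everything in it is an assumption for illustration: the `BlobStore` class, the directory layout, and the use of `hashlib.blake2b` as a stand-in for BLAKE3 (which is not in the Python standard library). Tombstones win over later puts, which is what keeps the state monotonically mergeable.

```python
import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Hypothetical store: immutable blobs keyed by their hash, soft delete."""

    def __init__(self, root):
        self.blobs = Path(root) / "blobs"
        self.tombstones = Path(root) / "tombstones"
        self.blobs.mkdir(parents=True, exist_ok=True)
        self.tombstones.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def digest(data: bytes) -> str:
        # Assumption: BLAKE2b stands in for BLAKE3 here.
        return hashlib.blake2b(data, digest_size=32).hexdigest()

    def put(self, data: bytes) -> str:
        key = self.digest(data)
        path = self.blobs / key
        if not path.exists():  # same content => same key, so a re-put is a no-op
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(data)
            tmp.replace(path)  # rename into place so readers never see a partial blob
        return key

    def delete(self, key: str) -> None:
        # Soft delete: only ever add a tombstone, never remove state.
        (self.tombstones / key).touch()

    def get(self, key: str):
        path = self.blobs / key
        if (self.tombstones / key).exists() or not path.exists():
            return None
        data = path.read_bytes()
        if self.digest(data) != key:  # content verification on read
            raise ValueError(f"blob {key} is corrupt")
        return data
```

Calling `put` or `delete` twice has the same effect as calling it once, and two replicas could be merged by taking the union of their `blobs` and `tombstones` directories.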
IOW I would settle for content verification even without content addressing.
S3 has an extremely half-hearted implementation of this for “integrity”.
They probably don't expose it publicly though.
Tbh I'm not sure content-aware chunking isn't a siren's call:
- It sounds great on paper, but once you start storing encrypted (which you have to do if you want e2e encryption) or compressed blobs (e.g. images) it won't work anymore.
- Ideally you would store things as blobs fine-grained enough that blob-level deduplication would suffice.
- Storing a blob across your cluster has additional compute, lookup, bookkeeping, and communication overhead, resulting in worse latency. Storing an object as a contiguous unit makes the cache/storage hierarchies happy and allows for optimisations like using `sendfile`.
- Storing the blobs as a unit makes computational storage easier to implement, where instead of reading the blob and processing it, you would send a small WASM program to the storage server (or drive? https://semiconductor.samsung.com/us/ssd/smart-ssd/) and only receive the computation result back.
Open source project written in Rust that uses BLAKE3 (and QUIC, which you mentioned in another comment)
E.g. it tries to solve the "mutability problem" (having human readable identifiers point to changing blobs); there are blobs and collections and documents; there is a whole resolver system with their ticket stuff
All of these things are interesting problems that I'd definitely like to see solved some day, but I'd be more than happy with an "S3 for blobs" :D.
So in the hope of triggering someone to give me the missing link (maybe even a hyperlink) for me to understand it, here is the situation:
I'm a SW dev who has also done a lot of sysadmin work. Yes, I have managed to install it. And that is about it. There seem to be so many features there, but I really, really don't understand how I am supposed to use the product, or the documentation for that matter.
I could start an import of Twitter or something else and it kind of shows up. Same with anything else: photos etc.
It clearly does something, but it was impossible to understand what I am supposed to do next, both from the UI and from the docs.
Recently I benchmarked the latency to some popular RPC, cache, and DB platforms and was shocked at how high the latency was. Everyone still talks about 1 ms as the latency floor, when it should be the ceiling.
It's a distributed hash table where the value mapped to a hash is immutable after it is STOREd (at least in the implementations that I know)
> Trac detected an internal error:
> IOError: [Errno 28] No space left on device
So it looks like it is pretty dead like most projects in this space?
Files are stored by hash on S3. Metadata is stored in a database. I run it locally and access it just like an S3 store. Metadata is in a Postgres DB.
- It works pretty well, at least up to the 15B objects I am using it for. Running on 2 machines with about 300TB (500 raw) of storage on each.
- The documentation, specifically with regard to operations like how to back things up, or the different failure modes of the components, can be sparse.
- One example of the above: I spun up a second filer instance (which is supposed to sync automatically), which caused the master server to emit an error while it was syncing. The only way to know if it was working was to watch the new filer's storage slowly grow.
- Seaweed has a pretty high bus factor, though the dev is pretty responsive and seems to accept PRs at a steady rate.
(Disclaimer: ex-Ceph employee.)
S3 is a horrible interface with a terrible lack of features. It’s just file storage without any of the benefits of a file system: no metadata, no directory structures, no ability to search, sort, or filter.
Combine that with high latency network file access and an overly verbose API. You literally have a bucket for storing files, when you used to have a toolbox with drawers, folders, and labels.
Replicating a real file system is not that hard, and when you lose the original reason for using a bucket (because you were stuck in the swamp with nothing else to carry your files in), why keep using it when you’re out of the mud?
Does your remote file server magically avoid network latency? Mine doesn’t.
In case you didn’t know, inside the bucket you can use a full path for S3 files. So you can have directories or folders or whatever.
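As a rough illustration of how "directories" fall out of a flat key space: listing with a prefix and a delimiter groups everything below the next `/` into a single entry. The `list_keys` helper and the sample keys below are made up for the example; it only mimics the general prefix/delimiter listing idea and is not any particular server's implementation.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`.

    Hypothetical helper: `keys` is just an iterable of flat key strings.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper in the "tree": report only the "subdirectory".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = ["notes.txt", "photos/readme.txt", "photos/2023/a.jpg", "photos/2023/b.jpg"]
top_level = list_keys(keys)
in_photos = list_keys(keys, prefix="photos/")
```

Here `top_level` is `(["notes.txt"], ["photos/"])` and `in_photos` is `(["photos/readme.txt"], ["photos/2023/"])`. The folders exist only as a naming convention, so there is nothing to lock or rename atomically.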
One benefit of this system (KV-style access) is that it supports concurrent usage better. Not every system needs it, but if you’re using an object store you might.
What personal experience do you have in this area? In particular, how have you handled greater than single-server scale, storage-level corruption, network partitions, and atomicity under concurrent access?
Blob storage is easier than POSIX file systems:
You have server-client state. The concept of opened files, directories, and their states. Locks. The ability for multiple writers to write to the same file while still providing POSIX guarantees.
All of those need to correctly handle failure of both the client and the server.
CephFS implements that with a metadata server that has lots of logic and needs plenty of RAM.
A distributed file system like CephFS is more convenient than S3 in multiple ways, and I agree it's preferable for most use cases. But it's undoubtedly more complex to build.
Filesystems impose a lot of data-consistency constraints that make things slow, particularly when it comes to mutating the directory structure. There's another set of consistency constraints when it comes to dealing with a file's contents. Object stores relax or remove these constraints, which allows them to "go faster". You should, however, carefully consider whether the constraints are really unnecessary for your case. The typical use case for object stores is something like storing volume snapshots, VM images, layers of layered filesystems, etc. They would perform poorly if you wanted to use them to store the files of your programming project, for example.
Because it turns out that most applications do not require that many features when it comes to persistent storage.
It's mostly just S3, really. You don't see anywhere near as many "clones" of other AWS services like EC2, for instance.
And there's a ton of value in being able to develop against an S3 clone like Garage or MinIO and deploy against S3, or in being able to retarget an existing application that expected S3 to one of those clones.
Implementing a filesystem versus an object store involves severe tradeoffs in scalability and complexity that are rarely worth it for users that just want a giant bucket to dump things in.
The API doesn't matter that much, but everything already supports S3, so why not save time on client libraries and implement it? It's not like some alternative PUT/GET/DELETE API will be much simpler, though naturally LIST could be implemented in myriad ways.
By reducing the API surface (to essentially just GET, PUT, DELETE), it increases the flexibility of the backend. It's almost trivial to do a union mount with object storage, where half the files go to one server and half go to another (based on a hash of the name). This can be, and is, done with POSIX filesystems too, but it requires more work to fully satisfy the semantics. One of the biggest complications is having to support file modification and mmap. With S3 you can instead only modify a file by fully replacing it with PUT. That again might be unacceptable for a desktop OS filesystem, but many server applications already satisfy this constraint by default.
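A minimal sketch of that hash-based split, assuming nothing beyond GET/PUT/DELETE. The `UnionStore` class and `pick_backend` function are invented for the example, and plain dicts stand in for the servers.

```python
import hashlib


def pick_backend(key: str, n_backends: int) -> int:
    """Map an object name to a backend index via a hash of the name."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_backends


class UnionStore:
    """Hypothetical union of several stores; each dict stands in for a server."""

    def __init__(self, n_backends: int = 2):
        self.backends = [dict() for _ in range(n_backends)]

    def _backend(self, key: str) -> dict:
        return self.backends[pick_backend(key, len(self.backends))]

    def put(self, key: str, data: bytes) -> None:
        # Whole-object replace: no partial writes or mmap to coordinate.
        self._backend(key)[key] = data

    def get(self, key: str):
        return self._backend(key).get(key)

    def delete(self, key: str) -> None:
        self._backend(key).pop(key, None)
```

One caveat: plain modulo hashing moves most keys around when the number of backends changes, so a real deployment would more likely use consistent hashing or an explicit layout table.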
Ummmm what? Replicating a file system is insanely hard
Which, to your point, makes no sense because as you rightly point out, people use S3 because of the Amazon services and ecosystem it is integrated with - not at all because it is "good tech"
https://github.com/Peergos/Peergos/blob/master/src/peergos/s...
If this 'Garage' doesn't support the plain HTTP use case then it isn't S3 compatible.
There are a few "very minimal" sigv4 implementations ...
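For a sense of why a minimal implementation is feasible: the core of SigV4 is a chain of HMAC-SHA256 calls. The sketch below covers only the signing-key derivation and the final signature. Building the canonical request and the string to sign is left out, and the function names are my own, not from any of those libraries.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str = "s3") -> bytes:
    """Derive the SigV4 signing key; `date` is in YYYYMMDD form."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sigv4_signature(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature over an already-built string to sign (construction not shown)."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
```

The fiddly part in practice tends to be the canonicalization of headers and query strings, which is where implementations usually disagree.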
Edit: memory = RAM
https://github.com/cycneuramus/seaweedfs-docker-swarm/blob/m...
We've got Ceph, MinIO, SeaweedFS ... and a dozen others. I am genuinely curious what the goal is here?
Deuxfleurs thought long and hard about the kind of infra this would translate to. The base came fast enough: some kind of storage, based on a standard (even a de facto one is good, because it means it is proven), that would tolerate some nodes going down. The decision to build a Dynamo-like system accessed through S3 with eventual consistency made sense.
So Garage is not "simply" an S3 storage system: it is a system to store blobs in an unreliable but still trusted consumer-grade network of passable machines.
It cannot be overstated how slow Ceph/MinIO/etc. can be compared to local NVMe. There is plenty of room for improvement.
Object store = store blobs of bytes. Usually by bucket + key accessible over HTTP. No POSIX expectation.
Distributed = works spread across multiple servers in different locations.
> store blobs of bytes
Files
> by bucket
Directories
> key accessible
File names
> over HTTP
Web server
From the perspective of consistency guarantees, object storage gives fewer such guarantees (this is seen as allowing implementations to be faster than typical filesystems). For example, since there is no concept of directories in an object store, the implementation doesn't need to deal with the problems that arise when copying or moving directories while files in those directories are open.
There are some non-storage functions that are performed only by filesystems, but not object storage. For example, suid bits.
It's also much more common to use object stores for larger chunks of data, such as whole disk snapshots, VM images, etc., while filesystems aim for the middle size (small being the domain of RDBMSs), such as text files you'd open in a text editor. Consequently, each is optimized for those objectives. Filesystems care a lot about what happens when random, small, incremental, and possibly overlapping updates happen to the same file, while object stores care most about the performance of sequential reads and writes.
This excludes the notion of "distributed" as both can be distributed (and in different ways). I suppose you meant to ask about the difference between "distributed object storage" and "distributed filesystem".
We saw about a 20-30x performance gain overall after moving to Garage for our specific use case.
So we wanted lots of compliance features, like access logs, access approvals, short-lived (time-bound) access, etc.
How would you compare Garage vs MinIO on that front?
Are there other details you are willing/allowed to share, like the number of objects in the store and the number of servers you are balancing them on?
We use it for CI in ClickHouse, for example: https://github.com/ClickHouse/ClickHouse/blob/master/docker/...
Docker is young and fashionable, every Windows script kiddie uses it nowadays!
And then comes to the Docker forum complaining about strange issues, not realizing that Docker Desktop is a different product: it uses a Linux VM to run the Docker engine, which was built for Linux ;-)
I explicitly wrote "old-school Docker Swarm", as it has been missing love for years and everyone with 2 IT FTEs seems to be moving to k8s.
I find it interesting that they chose CRDTs over Raft for distributed consensus.
CRDTs do not have the same failure scenarios and favor uptime over consistency.
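To illustrate the trade-off, here is a sketch of one of the simplest CRDTs, a last-writer-wins register. This is a generic textbook construction and not a claim about how Garage implements it; the `LWWRegister` name and fields are assumptions, and it assumes each (timestamp, node_id) pair is used for at most one write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Last-writer-wins register: the highest (timestamp, node_id) wins."""

    timestamp: int
    node_id: str
    value: object

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Deterministic choice, so replicas converge no matter the merge order.
        return max(self, other, key=lambda r: (r.timestamp, r.node_id))


a = LWWRegister(1, "node-a", "v1")
b = LWWRegister(2, "node-b", "v2")
c = LWWRegister(2, "node-c", "v3")  # concurrent with b, tie broken by node_id

merged = a.merge(b).merge(c)
```

Any node can accept a write and merge later without a leader or a quorum, which is where the uptime comes from. The cost is visible in the example: `b` and `c` were concurrent, and `v2` is silently dropped in favor of `v3`, whereas Raft would have put both writes into a single agreed order.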
1. https://www.youtube.com/watch?v=H1DunJM1zoc 2. https://platform.swiftstack.com/docs/
The only thing I am missing is the ability to automatically replicate some buckets on AWS S3 for backup.
It is an object storage system and more..
I found this really difficult to achieve with MinIO, since it appears to require an AssumeRole request, which is barely documented, and I did not find a TypeScript example. Additionally, there's a weird set of restrictions in place for MinIO (and also AWS) that makes this really difficult to do, e.g. the size of policies is limited, which effectively limits the number of prefixes a user can share. I found this really difficult to work around.
Can anyone suggest a way to do this? Can Garage do this? Am I just approaching this from the wrong side?
Thanks