House-keeping UX is key for self-hosting by laypersons.
- Garage is easier to deploy and to operate: you don't have to manage independent components like the filer, the volume manager, the master, etc. It also seems that a bucket must be pinned to a volume server on SeaweedFS. In Garage, all buckets are spread across the whole cluster, so you do not have to worry about a bucket filling up one of your volume servers.
- Garage works better in the presence of crashes: I would be very interested in a deep analysis of Seaweed's "automatic master failover". They use Raft, I suppose either by running a health check every second, which leads to data loss on a crash, or by sending a request for each transaction, which creates a huge bottleneck in their design.
- Better scalability: because there is no special node, there are no bottlenecks. I suppose that with SeaweedFS, all the requests have to pass through the master. We do not have such limitations.
In conclusion, we chose a radically different design with Garage. We plan to do a more in-depth comparison in the future, but even today, I can say that even though we implement the same API, our radically different designs lead to radically different properties and trade-offs.
To me, the two key differentiators of Garage over its competitors are as follows:
- Garage contains an evolved metadata system that is based on CRDTs and consistent hashing inspired by Dynamo, solidly grounded in distributed systems theory. This allows us to be very efficient as we don't use Raft or other consensus algorithms between nodes, and we also do not rely on an external service for metadata storage (Postgres, Cassandra, whatever), meaning we don't pay an additional communication penalty.
- Garage was designed from the start to be multi-datacenter aware, again helped by insights from distributed systems theory. In practice we explicitly chose against implementing erasure coding; instead we spread three full copies of data over different zones so that overall availability is maintained with no degradation in performance when one full zone goes down, and data locality is preserved at all locations for faster access (in the case of a system with three zones, our ideal deployment scenario).
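To make the two points above a bit more concrete, here is a minimal sketch of zone-aware placement on a consistent-hashing ring. It is not Garage's actual layout algorithm: the node names, the zones, the `vnodes` parameter and the helper functions are all made up for illustration.

```python
import hashlib
from bisect import bisect_right

# Hypothetical cluster: node name -> zone (names are illustrative only).
NODES = {"n1": "paris", "n2": "paris", "n3": "lyon", "n4": "lyon", "n5": "rennes"}

def h(key: str) -> int:
    # Stable 64-bit hash used to place both nodes and keys on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def build_ring(nodes, vnodes=16):
    # Each node gets several points on the ring to smooth the distribution.
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def replicas(ring, nodes, key, copies=3):
    """Walk the ring clockwise from the key's hash, keeping one node per zone."""
    start = bisect_right(ring, (h(key), ""))
    chosen, zones = [], set()
    for i in range(len(ring)):
        _, node = ring[(start + i) % len(ring)]
        if nodes[node] not in zones:
            chosen.append(node)
            zones.add(nodes[node])
            if len(chosen) == copies:
                break
    # With fewer zones than copies this returns fewer nodes; a real system
    # would fall back to placing several copies in the same zone.
    return chosen

ring = build_ring(NODES)
print(replicas(ring, NODES, "my-bucket/photo.jpg"))
```

With three zones and three copies, every key ends up with exactly one copy per zone, which is why losing a whole zone still leaves two copies reachable.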
That's fine when there is absolute trust between the server operators, but am i correct to assume it's not the same threat model as encrypted/signed backups pushed to some friends lending storage space for you (who could corrupt your backup but can't corrupt your working data)?
If my understanding is correct, maybe your homepage should outline the threat model more clearly (i.e. a single trusted operator for the whole cluster) and point to other solutions like TAHOE-LAFS for other use-cases.
Congratulations on doing selfhosting with friends, that's pretty cool! Do you have any idea if some hosting cooperatives (CHATONS) or ISPs (FFDN) have practical use-cases in mind? They already have many physical locations and rather good bandwidth, but i personally can't think of an interesting idea.
Threat model? At a large enough scale, even CPUs can be bad actors: https://muratbuffalo.blogspot.com/2021/06/silent-data-corrup...
From an ideological perspective, we are strongly attached to the building of tight-knit communities in which strong trust bonds can emerge, as it gives us more meaning than living in an individualized society where all exchanges between individuals are mediated by a market, or worse, by blockchain technology. This means that trusting several system administrators makes sense to us.
Note that in the case of cooperatives such as CHATONS, most users are non-technical and have to trust their sysadmin anyways; here, they just have to trust several sysadmins instead of just one. We know that several hosting cooperatives of the CHATONS network are thinking like us and are interested in setting up systems such as Garage that work under this assumption.
In the meantime, we also do perfectly recognize the possibility of a variety of attack scenarios against which we want to consider practical defenses, such as the following:
1/ An honest-but-curious system administrator or an intruder in the network that wants to read users' private data, or equivalently, a police raid where servers are seized for inspection by state services;
2/ A malicious administrator or an intruder that wants to manipulate the users by introducing fake data;
3/ A malicious administrator or an intruder that simply wants to wreak havoc by deleting everything.
Point 1 is the biggest risk in my eyes. Several solutions can be built to add an encryption layer over S3 for different usage scenarios. For instance, for storing personal files, Rclone can be used to add simple file encryption to an S3 bucket and can also be mounted directly via FUSE, allowing us to access Garage as an end-to-end encrypted network drive. For backups, programs like Restic and Borg allow us to upload our backups encrypted into Garage.
Point 2 can be at least partially solved by adding signatures and verification in the encryption layer which is handled on the client (I don't know for sure if Rclone, Restic and Borg are doing this).
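As a rough illustration of what client-side verification for point 2 could look like, here is a minimal sketch that tags each object with an HMAC before upload and checks it after download. This is a symmetric MAC, not a public-key signature, and it is not a description of what Rclone, Restic or Borg actually do; the `seal` and `open_sealed` helpers are hypothetical names.

```python
import hashlib
import hmac
import os

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes

def seal(key: bytes, blob: bytes) -> bytes:
    """Prefix the (already encrypted) blob with an HMAC-SHA256 tag before upload."""
    return hmac.new(key, blob, hashlib.sha256).digest() + blob

def open_sealed(key: bytes, stored: bytes) -> bytes:
    """Verify the tag after download; raise if the stored object was altered."""
    tag, blob = stored[:TAG_LEN], stored[TAG_LEN:]
    expected = hmac.new(key, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("object was modified or truncated")
    return blob

key = os.urandom(32)  # stays on the client, never sent to the cluster
stored = seal(key, b"already-encrypted bytes")
assert open_sealed(key, stored) == b"already-encrypted bytes"
```

Note that this only detects modification of a single object. It does not stop a malicious node from serving an older valid version or swapping two validly tagged objects, which is part of why point 2 is only "partially" solved this way.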
Point 3 is harder and probably requires adaptation on the side of Garage to be solved, for instance by adding a restriction on which nodes are allowed to propagate updates in the network and thus establishing a hierarchy between two categories of nodes: those that implement the S3 gateway and thus have full power in the network, and those that are only responsible for storing data given by the gateway but cannot originate modifications themselves (under the assumption that modifications must be accompanied by a digital signature and that only gateway nodes have the private keys to generate such signatures). Then we could separate gateway nodes according to the different buckets in which they can write for better separation of concerns. However, as long as we are trying to implement the S3 protocol I think we are stuck with some imperfect solution like this, because S3 itself does not use public-key cryptography to attest operations sent by the users.
It is perfectly true that solutions that are explicitly designed to handle these threats (such as TAHOE-LAFS) would provide better guarantees in a system that is maybe more consistent as a whole. However we are also trying to juggle these security constraints with deployment constraints such as keeping compatibility with standard protocols such as S3 (and soon also IMAP for mailbox storage), which restricts us in the design choices we make.
Just to clarify, the fact that any node can tamper with any of the data in the cluster is not strictly linked to the fact that we use CRDTs internally. It is true that CRDTs make it more difficult, but we do believe that there are solutions to this and we intend to implement at least some of them in our next project that involves mailbox storage.
I understand the appeal of CRDTs for some use-cases (in fact, for many more we could develop), but isn't implementation of permissions on top quite a big enterprise? Isn't that what the Matrix project is about? How would your mitigations compare to Matrix, beyond optimizing for arbitrary large messages (files)?
> It is perfectly true that solutions that are explicitly designed to handle these threats
Please make it more explicit on the homepage. Strong personal opinion: i like it when projects make their design tradeoffs explicit and link to alternatives involving other tradeoffs; it helps me to avoid reading detailed specifications and source code to understand whether the tool is fitted to my use case.
> our next project that involves mailbox storage
That's pretty cool! Are you aware of existing solutions in this space and failed attempts at producing new ones? leap.se/bitmask.net had some interesting takes but had to lower their goals due to limited human resources, but there's still people very interested in that question over there. Mailbox encryption as implemented by riseup/posteo (source code available, private key unlocked with passphrase at login time) is also very interesting and i wish that approach was used with other protocols as well (eg. XMPP) though it raises some interesting questions in regards to allow/denylisting.
I wish you the best of luck, and certainly hope to read more from you. Don't hesitate to post around in seemingly-unrelated venues to gather critical feedback on your design before implementation.
From a quick look StorJ and fuelcoin both appear to be classic crypto-scams where the entire tech stack is controlled by a single company... talk about "zero trust" :-)
If you're really interested in low trust tech for file storage, i recommend you check out TAHOE-LAFS. It's pretty cool tech and there's no money to be made out of it, nor is there a company dictating who can join what pool and how much you have to pay for what kind of storage. It's an actually decentralized system, unlike most blockchains.
"Garage is a distributed storage solution, that automatically replicates your data on several servers. Garage takes into account the geographical location of servers, and ensures that copies of your data are located at different locations when possible for maximal redundancy, a unique feature in the landscape of distributed storage systems.
Garage implements the Amazon S3 protocol, a de-facto standard that makes it compatible with a large variety of existing software. For instance it can be used as a storage back-end for many self-hosted web applications such as NextCloud, Matrix, Mastodon, Peertube, and many others, replacing the local file system of a server by a distributed storage layer. Garage can also be used to synchronize your files or store your backups with utilities such as Rclone or Restic. Last but not least, Garage can be used to host static websites, such as the one you are currently reading, which is served directly by the Garage cluster we host at Deuxfleurs."
If it's written to attract potential users, however, then it's woefully light on useful content. I don't care in the slightest about your views on tech monopolies - why on earth waste such a huge amount of space talking about this? What I care about are three things: 1) what workloads does it do well, and which ones does it do poorly at (if you don't tell me about the latter I won't believe you - no storage system is great at everything), 2) what consistency/availability model does it use, and 3) how you've tested it.
As written, it's full of fluff and vague handwavy promises (we have PhDs!) - the only technical information in the entire post is that it's written in rust. For users of your application, what programming language it's written in is about the least interesting thing possible to say about it. Even going through your docs I can't find a single proper discussion of your consistency and availability model, which is a huge red flag to me.
In my setup, I don't care about redundancy. I much prefer to maximize storage capacity. Can Garage be configured without replication?
If not, maybe someone else can point me in the right direction. Something like a glusterfs distributed volume. At first glance, SeaweedFS has a "no replication" option, but their docs make it seem like they're geared towards handling billions of tiny files. I have about 24TB of 1GB to 20GB files.
If by "fails" you mean the network connection drops out, then yes, that would be a huge problem. I was hoping some project had a built-in solution to this. Currently, I'm using MergerFS to effectively create 1 disk out of 3 external USB drives and it handles accidental drive disconnects with no problems (I can't gush enough over how great mergerfs is).
But if by "fails" you mean actual hardware failure, then I don't really care. I keep 1 to 1 backups. A few days of downtime to restore the data isn't a big deal; this is just my home network.
> Or if you have the possibility of putting all your drives in a single box...
Unfortunately, I've maxed out the drive bays on my TS140 server. Buying new, larger drives to replace existing drives seems wasteful. Also, I've just been gifted another TS140, which is a good platform to start building another file server.
You've given me something to think about, thanks. I appreciate you taking the time to respond!
$ curl -v https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
* Trying 2001:41d0:8:ba0b::1:443...
* Trying 5.135.179.11:443...
* Immediate connect fail for 5.135.179.11: Network is unreachable
$ dig +noquestion +nocmd aaaa git.deuxfleurs.fr
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46366
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; ANSWER SECTION:
git.deuxfleurs.fr. 6822 IN CNAME hammerhead.machine.deuxfleurs.fr.
hammerhead.machine.deuxfleurs.fr. 6822 IN AAAA 2001:41d0:8:ba0b::1
This is from an IPv6-only network. Since you publish an IPv6 record, our DNS64/NAT64 gateway doesn't get involved, hence the immediate IPv4 connect failure.
I just deleted the AAAA entry for this machine. In the meantime, if the result is cached for you, you can pass the `-4` argument to force IPv4:
git clone -4 git@git.deuxfleurs.fr:Deuxfleurs/garage.git
As a second step, we will work on a better/working IPv6 configuration for our Git repository and all of our services (we use Nomad+Docker and did not find a way to expose IPv6 in a satisfying way yet).
Other comments from my colleagues also shed light on Garage's specific features (do check the discussion on Raft).
In a nutshell, Garage is designed for higher inter-node latency, thanks to fewer round trips for reads & writes. Garage does not intend to compete with MinIO, though - rather, to expand the application domain of object stores.
Due to our design choice, you can add and remove nodes without any constraint on number of nodes and size of the storage. So you do not have to overprovision your cluster as recommended by MinIO[0].
Additionally, and we planned a full blog post on this subject, adding or removing a node in the cluster does not lead to a full rebalance of the cluster. To understand why, I must explain how it works traditionally and how we improved on existing work.
When you initialize the cluster, we split the cluster into partitions, then assign partitions to nodes (see Maglev[1]). Later, based on their hash, we will store data in its corresponding partition. When a node is added or removed, traditional approaches rerun the whole algorithm and come up with a totally different partition assignment. Instead, we try to compute a new partition distribution that minimizes changes in partition assignment, which in the end minimizes the number of partitions moved.
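The difference between the two approaches can be sketched as follows. This is a deliberately simplified model (one copy per partition, no zones, equal node capacity) and not Garage's actual algorithm; `assign_from_scratch`, `rebalance` and `moved` are hypothetical helpers.

```python
from collections import Counter

def assign_from_scratch(n_partitions, nodes):
    # Naive approach: rerun a round-robin assignment, ignoring the previous one.
    return [nodes[p % len(nodes)] for p in range(n_partitions)]

def rebalance(old, nodes):
    """Reassign partitions to `nodes`, leaving each partition in place when possible."""
    base, extra = divmod(len(old), len(nodes))
    # Nodes that already hold the most partitions get the larger quotas.
    load = Counter(n for n in old if n in nodes)
    by_load = sorted(nodes, key=lambda n: -load[n])
    quota = {n: base + (1 if i < extra else 0) for i, n in enumerate(by_load)}
    new, used, homeless = list(old), Counter(), []
    for p, n in enumerate(old):
        if n in quota and used[n] < quota[n]:
            used[n] += 1          # partition stays where it is
        else:
            homeless.append(p)    # owner was removed or is over quota
    for n in nodes:
        while used[n] < quota[n]:
            new[homeless.pop()] = n   # hand a homeless partition to an underloaded node
            used[n] += 1
    return new

def moved(a, b):
    # Number of partitions whose owner differs between two assignments.
    return sum(x != y for x, y in zip(a, b))

old = assign_from_scratch(12, ["a", "b", "c"])
grown = rebalance(old, ["a", "b", "c", "d"])
naive = assign_from_scratch(12, ["a", "b", "c", "d"])
print(moved(old, grown), moved(old, naive))
```

In this toy setup, adding a fourth node moves 3 of the 12 partitions with `rebalance`, against 9 when the assignment is recomputed from scratch.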
On the drawback side, Garage does not implement erasure coding (as it is also the reason for many of MinIO's limitations) and duplicates data 3 times, which is less efficient. Garage also implements fewer S3 endpoints than MinIO (for example we do not support versioning); the full list is available in our documentation[2].
[0]: https://docs.min.io/minio/baremetal/installation/deploy-mini...
[1]: https://www.usenix.org/conference/nsdi16/technical-sessions/...
[2]: https://garagehq.deuxfleurs.fr/documentation/reference-manua...
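To put rough numbers on the storage drawback mentioned above, here is a small sketch comparing 3-way replication with an erasure-coding scheme. The 4+2 layout is only an example chosen for illustration, not a MinIO default or anything Garage supports.

```python
def replication(copies):
    # Every copy is a full object: overhead equals the number of copies,
    # and the data survives as long as one copy remains.
    return {"overhead": float(copies), "tolerated_losses": copies - 1}

def erasure_coding(k, m):
    # k data shards plus m parity shards; any k shards rebuild the object.
    return {"overhead": (k + m) / k, "tolerated_losses": m}

print(replication(3))        # three full copies, as in Garage
print(erasure_coding(4, 2))  # hypothetical 4+2 scheme
```

Both configurations survive the loss of two devices, but replication stores 3 bytes per byte of user data where the 4+2 scheme stores 1.5. What this arithmetic leaves out is the other side of the trade-off described earlier in the thread: with full copies, each zone can serve reads locally and nothing has to be reconstructed when a zone is down.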
but is this safe to use inside a commercial project as a backend s3 solution? i.e. just using it instead of local storage and not changing it?
On the plus side, it survived Hacker News Hug of Death. Indeed, the website we linked is hosted on our own Garage cluster made of old Lenovo ThinkCentre M83 (with Intel Pentium G3420 and 8GB of RAM) and the cluster seems fine. We also host more than 100k objects in our Matrix (a chat service) bucket.
On the minus side, this is the first time we have had so much coverage, so our software has not yet been tested by thousands of people. It is possible that in the near future, some edge cases we never triggered will be reported. This is the reason why most people wait until an application reaches a certain level of adoption before using it; in other words, they don't want to pay "the early adopter cost".
In the end, it's up to you :-)
> The Deuxfleurs association has received a grant from NGI POINTER[0], to fund 3 people working on Garage full-time for a year : from October 2021 to September 2022.
I strongly recommend NGI grants to any European citizen. Just look at the variety of profiles that NGI POINTER (one of the grant types) funded last year: https://pointer.ngi.eu/wp-content/uploads/2021/10/NGI-POINTE...
They even finance individuals wanting to contribute to FOSS up to 50k€ for a year.
> Why not using Riak and adding an S3 API around it.
Riak was developed by a company named Basho that went bankrupt some years ago; the software is not developed anymore. In fact, we do not need to add an S3 API around Riak KV: Basho even released "Riak Cloud Storage"[0], which does exactly this: provide an S3 API on top of the Riak KV architecture. We plan to release a comparison between Garage and Riak CS; Garage has some interesting features that Riak CS does not have! In practice, implementing an object store on top of a DynamoDB-like KV store is not that straightforward. For example, Exoscale, a cloud provider, went this way for the first implementation of their object store, Pithos[1], but rewrote it later, as you need special logic to handle your chunks (they did not publish Pithos v2).
> Most apps don't have S3 support
We are maintaining in our documentation an "integration" section listing all the compatible applications. Garage already works with Matrix, Mastodon, Peertube, Nextcloud, Restic (an alternative to Borg), Hugo and Publii (a static site generator with a GUI). These applications are only a fraction of all existing applications, but our software is targeted at its users/hosters.
> A distributed system is not necessarily highly available
I will not fight over the wording: we come from an academic background where the term "distributed computing" has a specific meaning that may differ outside. In our field, we define models where we study systems made of processes that can crash. Depending on your algorithms and the properties you want, you can prove that your system will work despite some crashes. We want to build software on these academic foundations. This is also the reason we put "Standing on the shoulders of giants" on our front page and link to research papers. In a nutshell, one criticism we address to other software is that it sometimes lacks theoretical/academic foundations, which leads to unexpected failures and more work for sysadmins. But on the theoretical point, Basho and Riak were exemplary and a model for us!
[0]: https://docs.riak.com/riak/cs/2.1.1/index.html [1]: https://github.com/exoscale/pithos
> I have been building distributed systems for 20 years, and they are not more reliable. They are probabilistically more likely to fail.
It depends on what kind of faults you want to protect against. In our case, we are hosting servers at home, meaning that any one of them could be disconnected at any time due to a power outage, a fiber being cut off, or any number of reasons. We are also running old hardware where individual machines are more likely to fail. We also do not run clusters composed of very large numbers of machines, meaning that the number of simultaneous failures that can be expected actually remains quite low. This means that the choices made by Garage's architecture make sense for us.
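Both sides of this argument can be seen in a quick back-of-the-envelope calculation. The sketch below assumes nodes fail independently with the same probability, which is a simplification, and the 5% figure and the 2-out-of-3 requirement are assumptions picked for illustration, not measurements or a statement about Garage's internals.

```python
from math import comb

def p_at_least(n, k, p):
    """Probability that at least k of n independent nodes are down,
    each being down with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p = 0.05  # assumed chance that a given home server is unreachable

print(p_at_least(1, 1, p))  # a single server is down
print(p_at_least(3, 1, p))  # at least one of three servers is down
print(p_at_least(3, 2, p))  # at least two of three servers are down
```

Something is broken somewhere more often in the three-node cluster (about 14% of the time against 5%), which is the commenter's point. But if the service only needs two of the three nodes, it is unavailable about 0.7% of the time, well below the single server.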
But maybe your point was about the distinction between distributed systems and high availability, in which case I agree. Several of us have studied distributed systems in the academic setting, and in our vocabulary, distributed systems almost by definition includes crash-tolerance and thus making systems HA. I understand that in the engineering community the vocabulary might be different and we might orient our communication more towards presenting Garage as HA thanks to your insight, as it is one of our core, defining features.
> However, this isn't it. This is distributed S3 with CRDTs. Still too application-specific, because every app that wants to use it has to be integrated with S3. They could have just downloaded something like Riak and added an S3 API around it.
Garage is almost that, except that we didn't download Riak but made our own CRDT-based distributed storage system. It's actually not the most complex part at all, and most of the development time was spent on S3 compatibility. Rewriting the storage layer means that we have better integration between components, as everything is built in Rust and heavily depends on the type system to ensure things work well together. In the future, we plan to reuse the storage layer we built for Garage for other projects, in particular to build an e-mail storage server.
One question on the network requirements. The web page says for networking: "200 ms or less, 50 Mbps or more".
How hard are these requirements? For folks like me that can't afford a guaranteed 50Mbps internet connection, is this still usable?
There are plenty of places in the world where 50Mbps internet connectivity would be a dream. Even here in Canada there are plenty of places with a max of 10Mbps. The African continent for example will have many more.
If I may add a few thoughts: you may be very well placed to be a more suitable replacement for Minio in data science / AI projects. Why? Because any serious MLOps construction has 2 hard requirements: an append-only storage and a way to stream large content to a single place. The first one is very hard to get right, the second is relatively easy.
Being CRDT-based it should be _very easy_ for you to provide an append-only storage that can store partitioned, ordered logs of immutable objects (think dataframes, and think kafka). Once you have that, it's really easy to build the remaining missing pieces (UI, API, ...) for creating a _much better_ (and distributed) version of MLFlow.
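For readers unfamiliar with the idea, an append-only log maps naturally onto a grow-only set, one of the simplest CRDTs. The sketch below is purely illustrative and says nothing about what Garage exposes; the `GrowOnlyLog` class and its entry layout are hypothetical.

```python
class GrowOnlyLog:
    """Grow-only set of immutable (partition, sequence, payload) entries."""

    def __init__(self, entries=()):
        self.entries = frozenset(entries)

    def append(self, partition, seq, payload):
        # Appending never removes or rewrites anything.
        return GrowOnlyLog(self.entries | {(partition, seq, payload)})

    def merge(self, other):
        # Set union is commutative, associative and idempotent, so replicas
        # converge whatever the order in which they exchange state.
        return GrowOnlyLog(self.entries | other.entries)

    def read(self, partition):
        # Entries of one partition, ordered by sequence number.
        return [payload for (p, _seq, payload) in sorted(self.entries) if p == partition]

replica_1 = GrowOnlyLog().append("p0", 0, "a").append("p0", 1, "b")
replica_2 = GrowOnlyLog().append("p0", 2, "c")
print(replica_1.merge(replica_2).read("p0"))
```

The merge itself is the easy part. The sketch sidesteps how sequence numbers get assigned when several writers append concurrently, which is where the "very easy" claim would need a closer look.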
Finally, S3 protocol is an "okay" version for file storage, but as you are probably aware, it's clearly a huge limiter. So, trash it. Provide a read-only S3 gateway for compatibility, but writes should use a different API.
PS: Galette-Saucisse <3
It's an interesting point you make about ML workloads. We haven't investigated that yet (next in line is e-mail: you see we target ubiquitous, low-tech needs that are horrendous to host). But we will definitely consider it for future work: small Breton food trucks do need their ML solutions for better galettes.
When failures occur, repair is done by workers that log when they launch, when they repair chunks, and when they exit. We also have `garage status` and `garage stats`. The first command displays healthy and unhealthy nodes; the second displays the queue length of our tables and chunks: if these values are greater than zero, the cluster is being repaired. We are documenting failure recovery in our documentation: https://garagehq.deuxfleurs.fr/documentation/cookbook/recove...
For the near future, we plan to integrate OpenTelemetry. But we are still discussing the design and information we want to track and report. We are currently discussing these questions in our issue tracker: https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/111 https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/207
If you have some knowledge/experience on this subject, feel free to share it in these issues.
For non-French speakers, Deuxfleurs is the French translation of Twoflower, the character appearing in The Colour of Magic, The Light Fantastic and Interesting Times from Terry Pratchett's Discworld book series.
Azure already does this, so claiming it's unique seems untrue.
You might have personal reasons for your current choice, but maybe the project could have a shot at success if it were in a more social and high-traffic ecosystem.
Really happy to see this being worked on by a dedicated team. Combined with a tool like https://github.com/slackhq/nebula this could help form the foundation of a fully autonomous, durable, online community.
We also made a presentation at FOSDEM'22, the video should be out soon.
- replication is way too expensive for what it's worth. As soon as you can afford it you really should go for some kind of erasure coding. The exact moment when erasure coding is more efficient (storage wise, transport wise) and more robust (in terms of devices you can lose) depends on the size of the objects (fixed cost of metadata).
- you don't need distributed consistency for the data storage actions, just for the metadata (the map that tells you where the data is). You can start pushing data, and if the metadata for the object is never created, you just pushed garbage which can be collected.
1) the license of Garage is AGPL, MinIO is also AGPL so I'm not sure what's the problem with it as a valid choice for self-hosting? You seem to not have a problem using Apache 2 licensed software (HashiCorp stack).
2) if you're a non-profit, how come MinIO is your "closest competitor"?