- https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver
- https://github.com/ofek/csi-gcs
Here is the initial commit: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/c...
Notice, for example, not just the code but also the associated files. The Dockerfile blatantly copies the one from my repo, down to the dual license I chose because I was very into Rust at the time. Or take a look at the deployment examples, which use Kustomize, a tool I like but one that is quite uncommon; most Kubernetes projects provide Helm charts instead.
They were most certainly aware of the project, because Google reached out to discuss potential collaboration and then never responded: https://imgur.com/a/KDuf9mj
Edit: I see you said it’s dual licensed. From the look of it both allow Google or any other company to copy and reuse code, so what are you upset about?
A lot of people treat licensing emotionally (e.g. choosing the WTFPL, picking licenses that feel good, or copying whatever they saw in another project); business people, however, are very logical and will unfortunately exploit this.
The irony is that Google probably would not have done this if the codebase just omitted a license entirely. When I worked there, they wouldn't allow OSS with no license.
edit: as I express in a sibling comment this act is legally allowed of course, but is bad practice
I am a contributor who works on the Google Cloud Storage FUSE CSI Driver project. The project is partially inspired by your CSI implementation. Thank you so much for the contribution to the Kubernetes community. However, I would like to clarify a few things regarding your post.
The Cloud Storage FUSE CSI Driver project does not have “in large part copied code” from your implementation. The initial commit you referred to in the post was based on a fork of another open source project: https://github.com/kubernetes-sigs/gcp-filestore-csi-driver. If you compare the Google Cloud Storage FUSE CSI Driver repo with the Google Cloud Filestore CSI Driver repo, you will notice the obvious similarities, in terms of the code structure, the Dockerfile, the usage of Kustomize, and the way the CSI is implemented. Moreover, the design of the Google Cloud Storage FUSE CSI Driver included a proxy server, and then evolved to a sidecar container mode, which are all significantly different from your implementation.
As for the Dockerfile annotations you pointed out in the initial commit, I did follow the pattern in your repo because I thought it was the standard way to declare the copyright. However, it didn't take me too long to realize that the Dockerfile annotations are not required, so I removed them.
Thank you again for your contribution to the open source community. I have included your project link on the readme page. I take the copyright very seriously, so please feel free to directly create issues or PRs on the Cloud Storage FUSE CSI Driver GitHub project page if I missed any other copyright information.
Are you saying you have an issue with them copying your MIT licensed code?
It makes me sad that no one here cares whether your accusation is true, and I'd expect you to provide more convincing evidence. From the looks of it, the accusation isn't even true. It's not fair to those contributors, man; I hope you can apologize.
Per the licenses they can copy, but they must maintain attribution, which has not been done.
For certain applications that consistently read limited subsets of the filesystem, this can be mitigated somewhat by the disk cache, but for applications that would thrash the cache, cloud buckets are simply not a good storage backend if you desire disk-like access.
What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts.
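A minimal sketch of that two-tier idea, assuming a hypothetical `TwoTierCache` class (a real implementation would need locking, size-based rather than count-based limits, and tuned eviction):

```python
import os
from collections import OrderedDict

class TwoTierCache:
    """LRU cache: hot entries live in RAM, evictions spill to a disk directory."""

    def __init__(self, ram_capacity, disk_dir):
        self.ram_capacity = ram_capacity  # max number of RAM-resident entries
        self.disk_dir = disk_dir
        self.ram = OrderedDict()          # key -> bytes, most recently used last

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, key)

    def put(self, key, data):
        self.ram[key] = data
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_capacity:
            # Evict the least recently used entry to the disk tier.
            old_key, old_data = self.ram.popitem(last=False)
            with open(self._disk_path(old_key), "wb") as f:
                f.write(old_data)

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)     # refresh recency
            return self.ram[key]
        path = self._disk_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = f.read()
            os.remove(path)
            self.put(key, data)           # promote back to the RAM tier
            return data
        return None                       # miss: would fall through to cloud storage
```

With a capacity of two RAM entries, a third `put` spills the coldest entry to disk, and a later `get` on it promotes it back, evicting something else.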
Sure you're not going to use this as a consumer in place of a local disk, nor are you going to use this as part of your web app.
But there are lots of situations in reporting, batch/cron jobs, data processing, and general file administration where it's incredibly easier to use the file system interface than to use an HTTP API via a cloud storage library. Which FUSE is a godsend for. The latency doesn't matter in these cases for one-off things or scripts that already take seconds/minutes/hours anyways.
So no this isn't niche or a toy. It's a fantastic production tool for a lot of different common uses. It's not for everything but nothing is. Use the right tool for the job.
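To make the convenience argument concrete, here is the kind of batch script that becomes trivial with a FUSE mount: plain `pathlib` code that works identically on a local directory or on a mounted bucket (the mount path is hypothetical), with no client library, pagination, or auth plumbing in sight:

```python
from pathlib import Path

def total_csv_rows(mount_point):
    """Count data rows across all CSVs under a directory tree.

    `mount_point` can be a local directory or a gcsfuse mount such as
    /mnt/my-bucket (hypothetical path); the code is identical either way,
    which is the appeal over hand-written HTTP/API listing and download calls.
    """
    total = 0
    for path in Path(mount_point).rglob("*.csv"):
        with path.open() as f:
            total += sum(1 for _ in f) - 1  # subtract the header line
    return total
```

For a one-off report that already takes minutes, the extra per-file latency of the FUSE layer is lost in the noise.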
I agree with you, I would prefer a local disk to one with 100+ msec of latency and local storage prices are at the point where the right answer is probably "just add local storage."
But I watch with some sympathy the small army of sysadmins (something like 15-20 people) responsible for managing the 3000+ Macs our company uses, and remember the two-person staff that supported the 1500+ diskless workstations from my years at a sadly defunct mini-supercomputer manufacturer. It was quite nice: you could go to any machine, log in, and your desktop would follow you. I'm told doing the same thing with MSFT requires 10-20 people just to manage the AD hardware (though as a unix-fan, I hang out with other unix-fans who are notoriously rude to MSFT, so maybe it's only 5-10 people needed to manage the AD instance.)
Just copying the file to a mounted bucket would make this a lot easier.
Then again, how does one get the metadata of the uploaded file?
My company uses GCSFuse for ad-hoc analysis/visualization of large but poorly structured output from our lifesciences jobs and it works just fine for that.
Is there any sort of Linux HSM (Hierarchical Storage Manager)? I haven't seen any and have been a bit surprised nothing has really developed there. They can manage putting hot data in RAM or on SSDs, colder or larger data on spinning rust, and deep-frozen data on a tape silo or cloud storage...
Some NAS devices and RAID cards can support two-tier caching or data migration using SSDs, where hot or highly random data (usually identified by smaller write sizes) goes to the SSDs and can later migrate to the spinning discs.
I've done some "poor mans" version of this using LVM, where I can "pvmove" blocks of a logical volume between spinning discs and SSDs, which is pretty slick, but a very crude tool.
Take a look at the CERN paper https://iopscience.iop.org/article/10.1088/1742-6596/331/5/0... as they have a large use case.
AWS Premium Support wisely advised me against it, not just because of latency but also because the abstraction makes /far/ more API calls than a native solution would.
After a bit of testing to confirm, I switched to using native API calls. That code was easy to write and the performance was great. I've been wary of cloud FUSE adapters ever since.
This is really hard to get right if the origin cloud storage is anything other than immutable. Otherwise you're in for a world of cache invalidation and consistency pain.
I've gradually come round to the other opinion: there should be devices that sit on the PCIe/NVMe bus and provide a blob storage API rather than a block one, and there should be an operating system blob API that is similar to but not identical to the filesystem one.
I'd be curious to see how it works running on EC2, especially with an S3 endpoint in the VPC. Although I still think you'd be better suited by using S3 as an object store, given the option to build it right.
1. Goofys for S3 FUSE
2. Catfs for local disk caching
3. Linux caches in memory
4. Mmap file means processes share it
5. One device then exports this over the network to other machines, each of which have an application layer disk cache.
6. Machines are linked via 10 GigE (we use SFP+).
Overall the goofys and catfs guy (kahing) wrote very performant software. Big fan.
Isn't this how most servers run normally? (parts of) files which are accessed are in page cache, the rest is on "disk"
gcsfuse latency is ok as it embodies "infinite sync & persistence" ;)
If it performs well there, I could imagine that being pretty useful.
I don’t even want to know how bad the latency would be outside of a cloud VM.
VMs and disk space I understand completely: having machines on-prem is too much of a hassle and the price isn't that bad. But for stuff like this, and for managed services, databases especially, you're just getting scammed.
From reading the docs, it looks very similar to `rclone mount` with `--vfs-cache-mode off` (the default). The limitations are almost identical.
* Metadata: Cloud Storage FUSE does not transfer object metadata when uploading files to Cloud Storage, with the exception of mtime and symlink targets. This means that you cannot set object metadata when you upload files using Cloud Storage FUSE. If you need to preserve object metadata, consider uploading files using gsutil, the JSON API, or the Google Cloud console.
* Concurrency: Cloud Storage FUSE does not provide concurrency control for multiple writes to the same file. When multiple writes try to replace a file, the last write wins and all previous writes are lost. There is no merging, version control, or user notification of the subsequent overwrite.
* Linking: Cloud Storage FUSE does not support hard links.
* File locking and file patching: Cloud Storage FUSE does not support file locking or file patching. As such, you should not store version control system repositories in Cloud Storage FUSE mount points, as version control systems rely on file locking and patching. Additionally, you should not use Cloud Storage FUSE as a filer replacement.
* Semantics: Semantics in Cloud Storage FUSE are different from semantics in a traditional file system. For example, metadata like last access time are not supported, and some metadata operations like directory renaming are not atomic. For a list of differences between Cloud Storage FUSE semantics and traditional file system semantics, see Semantics in the Cloud Storage FUSE GitHub documentation.
* Overwriting in the middle of a file: Cloud Storage FUSE does not support overwriting in the middle of a file. Only sequential writes are supported.
* Access: Authorization for files is governed by Cloud Storage permissions. POSIX-style access control does not work.
However rclone has `--vfs-cache-mode writes` which caches file writes to disk first to allow overwriting in the middle of a file and `--vfs-cache-mode full` to cache all objects on a LRU basis. They both make the file system a whole lot more POSIX compatible and most applications will run using `--vfs-cache-mode writes` unlike `--vfs-cache-mode off`.
And of course rclone supports s3/azureblob/b2/r2/sftp/webdav/etc/etc also...
I don't think it is possible to adapt something with cloud storage semantics to a file system without caching to disk, unless you are willing to leave behind the 1:1 mapping of files seen in the mount to object in the cloud storage.
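The underlying problem can be modeled in a few lines. With a strict 1:1 file-to-object mapping and no partial PUT, editing a single byte forces a full download and a full re-upload; a disk write cache (like rclone's `--vfs-cache-mode writes`) exists to batch many such edits locally and pay that whole-object upload once on close. A toy model (the class and function names are illustrative, not any real SDK):

```python
class MockObjectStore:
    """Toy stand-in for GCS/S3: objects are immutable blobs, replaced whole."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = bytes(data)   # whole-object upload

    def get(self, key):
        return self.objects[key]

def patch_byte(store, key, offset, value):
    """With a 1:1 file<->object mapping, editing one byte means
    re-uploading the entire object; there is no partial write."""
    data = bytearray(store.get(key))      # full download
    data[offset] = value
    store.put(key, data)                  # full re-upload
    return len(data)                      # bytes transferred back up
```

Give up the 1:1 mapping (chunked layouts, log-structured writes) and partial edits get cheap, but the bucket contents stop being directly readable as files.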
> export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
We don't use drive to store other files. Actually, we don't really "store files" since almost everything we need is remote.
See for instance this discussion: https://news.ycombinator.com/item?id=13561096
But if you're using the google cloud like you might use Box.Net or DropBox, it seems fine for light usage.
A more complex layer like https://objectivefs.com/ (based on the S3 API) would be more useful, although I would've expected the cloud providers to scale their own block-store/SANs backed with object-stores by now.
Adds a DBMS or key-value store for metadata, making the filesystem much faster (POSIX, small overwrites don't have to replace a full object in the GCS/S3 backend).
Almost certainly a better solution if you want to turn your object storage into a mountable filesystem, with the (big) caveat that you can't access the files directly in the bucket (they are not stored transparently).
Choosing an appropriate solution in this space still depends on what you need to do with the storage; a few other options are MooseFS (https://github.com/moosefs/moosefs), SeaweedFS (https://github.com/seaweedfs/seaweedfs), Curve (https://github.com/opencurve/curve), and GeeseFS (https://github.com/yandex-cloud/geesefs).
This seems like a big limitation?
You could split the file into smaller chunks and reassemble it at the application layer. That way you limit the cost of changing any byte to the chunk size.
That could also support inserting or removing a byte: you'd have a new chunk of DEFAULT_CHUNK_SIZE+1 (or -1), and you'd split and merge chunks when they get too large or too small.
Of course at some point if you are using a file metaphor you want a real file system.
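A rough sketch of the fixed-size variant of that chunking idea (the class name and the tiny chunk size are illustrative; the split/merge logic needed for inserts and removals is omitted):

```python
class ChunkedFile:
    """Store a file as independent chunk objects so a small edit
    rewrites one chunk instead of the whole object."""

    def __init__(self, data, chunk_size=4):  # real systems use e.g. megabytes
        self.chunk_size = chunk_size
        self.chunks = [bytearray(data[i:i + chunk_size])
                       for i in range(0, len(data), chunk_size)]
        self.rewritten = 0  # bytes re-uploaded so far, for cost accounting

    def write_byte(self, offset, value):
        idx, pos = divmod(offset, self.chunk_size)
        self.chunks[idx][pos] = value
        self.rewritten += len(self.chunks[idx])  # only this chunk is re-uploaded

    def read(self):
        return b"".join(bytes(c) for c in self.chunks)
```

Changing one byte of an 8-byte file with 4-byte chunks re-uploads 4 bytes instead of 8; at real object sizes that's the difference between rewriting megabytes and rewriting gigabytes.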
Or is there a large group of programs that only ever write sequentially?
"Cloud Storage FUSE is available free of charge, but the storage, metadata, and network I/O it generates to and from Cloud Storage are charged like any other Cloud Storage interface. In other words, all data transfer and operations performed by Cloud Storage FUSE map to Cloud Storage transfers and operations, and are charged accordingly."
For example, Cloud Storage never moves or renames your objects; it copies the object and deletes the original instead. This can end up costing quite a lot if you're storing data in anything other than "standard storage," because of minimum storage duration charges.
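That rename-as-copy-plus-delete behavior is easy to model. In the toy bucket below (class and method names are illustrative, not the real client library, and the operation tally is not the actual billing model), every "rename" shows up as two logged operations, and the delete is what can trigger minimum-storage-duration charges on colder storage classes:

```python
class ToyBucket:
    """Toy model of a bucket with no rename primitive."""

    def __init__(self):
        self.objects = {}
        self.ops = []  # operation log, to make the cost visible

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]
        self.ops.append(("copy", src, dst))    # one billable operation

    def delete(self, key):
        del self.objects[key]
        self.ops.append(("delete", key))       # another billable operation,
                                               # plus early-delete fees on
                                               # classes with minimum durations

    def rename(self, src, dst):
        self.copy(src, dst)   # a "rename" through the FUSE layer is really
        self.delete(src)      # a server-side copy followed by a delete
```

Renaming a whole directory tree multiplies this by the number of objects under the prefix, since each one is copied and deleted individually.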