undefined | Better HN

0 pointskorijn3y ago0 comments

We wrote our own LFS API server (which is actually not very hard, about 100 lines of python was enough and it performs at scale) so we can directly leverage azure blob storage. If you don't walk this path and enable LFS in github or something like that the costs are obscene, yes. For us it's dirt cheap.

If I check out head of my repo and don't filter anything, it's a couple GBs.

Inside the azure blob storage container that backs our LFS API server, there's probably terabytes of data. It's really very very much.

We don't have any performance problems. One API instance can handle it. Of course we did make sure to implement it well... It's Uvicorn/Starlette, all IO is async and all CPU "intensive" work like JSON (de)serialization runs in a background threadpool.

0 comments

3 comments · 1 top-level

eliomattia3y ago· 2 in thread

That is really interesting and begs the question of how frequently you have changes in your data that lead to new commits. I am assuming here that you don't dedupe anything, that is, you throw the entire files into Azure with each version, since it's cheap enough for your purposes. Also, how frequently do you move head, even without committing anything new, perhaps to use another branch?

korijnOP3y ago

LFS stores files by content hash, so deduplication happens that way. But you're right that if you frequently make small changes to a single large file, it's wasteful.

In our case though we don't frequently change files, we just get lots and lots of new big files coming in all the time.

Moving head, as in, to check out another branch locally? Somewhat regularly I guess. I suppose you're wondering about performance in that scenario? It's usually quite good since git-lfs does some local caching as well. I've never needed to wait longer than a couple of seconds. I'm usually on a wired 1000/1000 Mbit optic fibre connection, and transfers are directly to and from an azure blob storage container (the LFS API server only generates download and upload URLs, it intentionally doesn't transfer any data), with parallel connections and chunking etc, so it doesn't really get any better than that. And all of that is out of the box functionality too. :)

eliomattia3y ago

Sorry I should have been more specific, I meant block deduplication, or any form of deduplication at a level lower than the entire file. File deduplication can only get you so far, depending on the use case. XetHub does block deduplication, whereas I am implementing data-level deduplication, which is slower in recreating dataset snapshots (can be parallelized and delegated), but allows savings on disk space with small but frequent changes and can be tied to collaborative features to show diffs, comment on them, and revert or edit changes where needed, all while pointing clearly to specific commits. And also potentially fork data or cumulative changes.

Yes I meant either checking out other branches locally, or in the general case pointing to another branch to indicate to any services to make data from that branch available to wherever it's consumed. I am assuming that each incoming new file is then added to data pipelines, possibly just a few. Sounds like you are in the sweet spot where you have the speed you want and, given unfrequent changes, you are fine with the versions taking up terabytes on Azure, since they are mostly new data.

j / k navigate · click thread line to collapse