Writing a file system from scratch in Rust

(blog.carlosgaldino.com)

331 points · carlosgaldino · 5y ago · 59 comments

dm319 · 5y ago · 7 in thread

It would be nice if the intro had a brief explanation of why a disk needs to be divided into blocks. Otherwise, I really enjoyed this read from the perspective of a lay person.

masklinn · 5y ago

> It would be nice if the intro had a brief explanation of why a disk needs to be divided into blocks.

One reason is that HDDs simply don't have a byte-wise resolution, so there's little point talking to HDDs in sub-sector units. Sectors are usually 512 bytes to 4k.

A second reason is being able to address the drive simply. Using 32-bit indices, if you index bytewise you're limited to 4GB, which was already available in the early 90s. With 512-byte blocks, you get an addressing capacity of 2TB, and with 4k blocks (the AF format), you get 16TB. In fact I remember 'round the late 90s / early aughts we'd reformat our drives using larger blocks because the base couldn't see the entire thing.
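The arithmetic above can be checked with a short sketch. The function names `addressable_bytes` and `human` are made up for illustration, and the figures come out in binary units (GiB/TiB), which is what the 4GB / 2TB / 16TB numbers correspond to.

```rust
/// Largest capacity (in bytes) addressable with `index_bits`-wide block
/// indices when each block holds `block_size` bytes.
/// Hypothetical helper; assumes `index_bits` is well below 128.
fn addressable_bytes(index_bits: u32, block_size: u64) -> u128 {
    (1u128 << index_bits) * u128::from(block_size)
}

/// Formats a byte count that is an exact multiple of a binary unit,
/// e.g. 2^41 bytes becomes "2 TiB".
fn human(bytes: u128) -> String {
    const UNITS: [&str; 6] = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"];
    let mut value = bytes;
    let mut unit = 0;
    // Keep dividing by 1024 while the value stays a whole number.
    while value >= 1024 && value % 1024 == 0 && unit < UNITS.len() - 1 {
        value /= 1024;
        unit += 1;
    }
    format!("{} {}", value, UNITS[unit])
}
```

Calling `human(addressable_bytes(32, 1))`, then with block sizes 512 and 4096, reproduces the three limits quoted in the comment.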

jagged-chisel · 5y ago

> HDDs simply don't have a byte-wise resolution

Sufficient explanation for code. Now why is it that disks lack byte-wise resolution?

monadic2 · 5y ago

Well, one benefit would be a relaxation of constraints on filesystems. This might be worth performance loss.

baruch · 5y ago

A drive (be it HDD or SSD) is a block device; all drives are divided into blocks/sectors because they are a rather lossy medium. Data written to a bit may not be read back accurately, so the solution is an error-correcting code (ECC) that can recover from some number of badly read bits (beyond that, you get a medium error from the drive). Since an ECC per byte would be excessive, drives use blocks of 512 bytes or 4096 bytes; one reason for the move from 512 to 4096 is that the ECC becomes more efficient.

HDDs also have small "gaps" that have headers to locate the sectors (in the distant past you could do a low-level format to correct these gaps as well).

SSDs do not have these gaps but they do need the ECC.
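To see why error correction gets cheaper per bit as the block grows, here is a rough sketch using a single-error-correcting Hamming code. This is only an illustration chosen because the math is simple: real drives use much stronger codes, and the function names below are hypothetical.

```rust
/// Minimum number of parity bits `r` for a single-error-correcting Hamming
/// code protecting `data_bits` bits: the smallest r with
/// 2^r >= data_bits + r + 1.
fn hamming_parity_bits(data_bits: u64) -> u32 {
    let mut r = 0u32;
    while (1u64 << r) < data_bits + u64::from(r) + 1 {
        r += 1;
    }
    r
}

/// Parity overhead as a percentage of the protected data.
fn overhead_percent(data_bits: u64) -> f64 {
    100.0 * f64::from(hamming_parity_bits(data_bits)) / data_bits as f64
}
```

Under this toy code a single byte needs 4 parity bits (50% overhead), a 512-byte sector needs 13, and a 4096-byte sector needs 16, so the relative cost drops sharply as blocks get larger.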

im3w1l · 5y ago

In the bad old days we stored data on spinning-rust hard drives. In order to find the data you wanted you had to position the reader head. This took several milliseconds. You could read a lot of data in that time. So basically if you took the trouble to seek somewhere you'd better read a bunch of data to make it worthwhile.

People would even run defragmenters, which would rearrange file blocks so they were stored contiguously and the whole file could be read without seeking.
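A back-of-the-envelope sketch of the seek trade-off: the 10 ms seek time and 100 MB/s sequential throughput used below are assumed round numbers, not measurements of any particular drive, and `bytes_per_seek` is a made-up helper.

```rust
/// Bytes that could have been streamed sequentially in the time one seek
/// takes. Both inputs are assumptions supplied by the caller.
fn bytes_per_seek(seek_ms: u64, throughput_bytes_per_sec: u64) -> u64 {
    throughput_bytes_per_sec * seek_ms / 1000
}

/// The same quantity expressed in whole blocks of `block_size` bytes.
fn blocks_per_seek(seek_ms: u64, throughput_bytes_per_sec: u64, block_size: u64) -> u64 {
    bytes_per_seek(seek_ms, throughput_bytes_per_sec) / block_size
}
```

With those assumed numbers, one seek costs about as much time as reading 1,000,000 bytes, or 244 blocks of 4 KiB, which is why reading a single byte per seek would be so wasteful.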

unethical_ban · 5y ago

The disk / inodes need to know where to start looking for a file's contents, like the address in memory for RAM. Or like the mail: We subdivide by city, then ZIP, then street, then address.

So the inode says "The data for my file starts at block 72 and is 3 blocks long" (or something like that). The disk then goes there, and reads blocks 72,73,74.

Each block is often 4KiB, so a 10KiB file still takes up ceiling(file size / block size) blocks: three blocks, or 12KiB on disk.

That's why there is a difference between "File size" and "Size on disk" when you look at disk usage summaries.
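The block math above, as a minimal sketch. The `Extent` struct and the 4 KiB `BLOCK_SIZE` are illustrative assumptions and are not taken from the article's code.

```rust
/// Assumed block size for this sketch (4 KiB).
const BLOCK_SIZE: u64 = 4096;

/// A contiguous run of blocks, as in "starts at block 72 and is 3 blocks long".
#[derive(Debug, PartialEq)]
struct Extent {
    start: u64,
    len: u64,
}

impl Extent {
    /// The block numbers the disk would have to read for this extent.
    fn blocks(&self) -> std::ops::Range<u64> {
        self.start..self.start + self.len
    }
}

/// ceiling(file size / block size), written with integer arithmetic.
fn blocks_needed(file_size: u64) -> u64 {
    (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE
}

/// "Size on disk": whole blocks, even if the last one is mostly empty.
fn size_on_disk(file_size: u64) -> u64 {
    blocks_needed(file_size) * BLOCK_SIZE
}
```

For a 10 KiB file, `blocks_needed(10 * 1024)` is 3 and `size_on_disk` is 12288 bytes, and `Extent { start: 72, len: 3 }.blocks()` yields 72, 73, 74.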

saurik · 5y ago

This doesn't explain why blocks are valuable as you could use byte addressing. The reason why blocks are valuable is for similar reasons to why memory is divided into pages. (Which is all I am going to write, as I don't have the time to answer this well today. But "you need to know where things are on disk" isn't an answer.)

Immortal333 · 5y ago · 6 in thread

Shameless plug. I did similar in my OS course. But, in C. Github: https://github.com/immortal3/EbFS

Warning: Terribly written. many hacks.

RealityVoid · 5y ago

Soooo... how does it work?

I'm not asking about the structure or how it's organized. I mean... is the filesystem in a file or... how?

Background: I mostly do embedded stuff so at a glance I would have expected low level primitives (like, HW interactions, registers and stuff) but I see none. So maybe my expectation, when tackling a problem, of interacting with the HW directly does not stand in modern environments.

Even better, but unrelated question... how the heck does an x86 OS request data from the HDD?

mcpherrinm · 5y ago

You'd presumably have some "block device" abstraction between your filesystem and your device driver. Don't want to re-implement a FS for each type of hardware. On a Linux system, you can read, eg, /dev/sda1 from userspace, which is what it looks like this filesystem probably does.

As for how you actually request data from the hard drive: There are older ATA interfaces, and BIOS routines for them, which I suspect is what most hobbyist OSes would use.

A more modern interface is AHCI. The OSDev wiki has an overview, where you can see how the registers work: https://wiki.osdev.org/AHCI
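A minimal sketch of the "block device" abstraction described above, assuming a 512-byte block and any backing store that implements `Read + Write + Seek` (a disk image file, a device node opened as a `std::fs::File`, or an in-memory buffer). The `BlockDevice` type and its methods are hypothetical names, not the article's API.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Assumed block size for this sketch.
const BLOCK_SIZE: usize = 512;

/// Wraps any seekable byte store and exposes it in whole blocks, so a
/// filesystem built on top never deals with byte offsets directly.
struct BlockDevice<T> {
    inner: T,
}

impl<T: Read + Write + Seek> BlockDevice<T> {
    fn new(inner: T) -> Self {
        Self { inner }
    }

    /// Reads block `index`; fails if the store ends before a full block.
    fn read_block(&mut self, index: u64) -> io::Result<[u8; BLOCK_SIZE]> {
        let mut buf = [0u8; BLOCK_SIZE];
        self.inner.seek(SeekFrom::Start(index * BLOCK_SIZE as u64))?;
        self.inner.read_exact(&mut buf)?;
        Ok(buf)
    }

    /// Overwrites block `index` with `data`.
    fn write_block(&mut self, index: u64, data: &[u8; BLOCK_SIZE]) -> io::Result<()> {
        self.inner.seek(SeekFrom::Start(index * BLOCK_SIZE as u64))?;
        self.inner.write_all(data)
    }
}
```

Because the wrapper is generic, the same filesystem code can be exercised against `std::io::Cursor<Vec<u8>>` in tests and against a real file or device later.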

keithnz · 5y ago

As an aside, for our embedded system we use https://github.com/ARMmbed/littlefs for our flash file system. It has a bit of a description of its design and its copy-on-write system, which lets it handle random power loss. It would be nice to see some of these kinds of libraries done in Nim or Rust.

brandmeyer · 5y ago

> how the heck does an x86 OS request data from the HDD?

Entirely too short summary: Use PCI to discover the various devices attached to the CPU. One or more of them are AHCI or NVMe devices. The AHCI and NVMe standards each describe sets of memory-mapped configuration registers and DMA engines. Eventually, you get to a point where you can describe linked lists of transactions to be executed that are semantically similar to preadv, pwritev, and so on.

There's tons of info on osdev.org, such as https://wiki.osdev.org/AHCI

rrdharan · 5y ago

https://en.m.wikipedia.org/wiki/INT_13H

pkaye · 5y ago

Looks like a filesystem in a file.

ravenstine · 5y ago · 5 in thread

Is there any advantage in writing a custom file system for a niche purpose? It seems like most file systems are just different variations of managing where/when files are written simultaneously. Could a file system written specifically for something like PostgreSQL cut out the middle-man and increase performance?

topspin · 5y ago

Yes. Oracle has done this (ASM) to eliminate overhead, implement fault tolerance and provide a storage management interface based on SQL, for example.

I once made a 'file system' to mount cpio archives (read-only) in an embedded system. Cpio is an extremely simple format to generate and edit (in code) and mounting it directly was very effective.

formerly_proven · 5y ago

I suspect operating on block storage directly may both be easier and more reliable for databases, since about 75% of the complication in writing transactional I/O software is working around the kernel's behavior.

anitil · 5y ago

Wow this comment just made me fall down a rabbit hole. I've only just surfaced. The Kaitai project actually comes with some pre-defined bindings for cpio which meant I was up and running very quickly.

https://formats.kaitai.io/cpio_old_le/index.html

tene · 5y ago

You may be interested in a paper written by the Ceph team: "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution"

https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf

There are definitely some significant benefits you can get from managing your own storage, rather than using a filesystem.

jandrewrogers · 5y ago

Yes, this is common in database engines. Doing so allows you to optimize the file system along a very different set of performance tradeoffs and assumptions than a typical generic file system. Beyond that, it also gives you direct control of file system behavior, the lack of which is a source of code complexity and edge cases. This is not transparent to the database; something like PostgreSQL would need to have its storage layer redesigned to explicitly take advantage of the guarantees.

It isn't just about performance gains, which are substantial; it also greatly simplifies the design and code by eliminating edge cases, undesirable behaviors, and variability in behavior across different deployment environments.

unethical_ban · 5y ago · 3 in thread

I have read down to the implementation section, but for my money, this is the best way to describe the high level function and behavior of a filesystem that I have ever seen.

ridiculous_fish · 5y ago

A very accessible (though dated) intro to filesystems is Practical File System Design, by Dominic Giampaolo.

PDF link: http://www.nobius.org/practical-file-system-design.pdf

Q6T46nT668w6i3m · 5y ago

Frankly, not too much has changed since Giampaolo. In fact, it is still standard reading in many graduate seminars on the subject!

vondur · 5y ago

Is he the guy who did the BeOS filesystem?

bluejekyll · 5y ago · 2 in thread

Always fun to see this type of work. I notice the usage of OsString, and it made me wonder: does the way an OS encodes its strings potentially make this FS non-portable between OSes? If I want to mount a drive formatted with this FS, would the OsString be potentially non-portable?
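For context on the portability question: in Rust's standard library an `OsString` holds arbitrary bytes on Unix and potentially ill-formed UTF-16 on Windows, so its in-memory form is platform-specific. One possible way to sidestep that, sketched below, is to store names on disk as UTF-8 and reject anything else. This is an assumption for illustration, not a description of how the article's file system serializes names, and `encode_name` / `decode_name` are hypothetical helpers.

```rust
use std::ffi::{OsStr, OsString};

/// Converts a file name to the bytes that would be written to disk.
/// Returns `None` if the name is not valid Unicode on this platform.
fn encode_name(name: &OsStr) -> Option<Vec<u8>> {
    name.to_str().map(|s| s.as_bytes().to_vec())
}

/// Converts on-disk bytes back into a platform string.
/// Returns `None` if the stored bytes are not valid UTF-8.
fn decode_name(bytes: &[u8]) -> Option<OsString> {
    std::str::from_utf8(bytes).ok().map(|s| OsString::from(s))
}
```

The trade-off is that names which are legal on the host but not valid Unicode cannot be stored at all, in exchange for an on-disk format that reads the same on every OS.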

There was a lot of discussion in the past around TFS https://github.com/redox-os/tfs. My understanding is that effort has kinda lost steam.

fiddlerwoaroof · 5y ago

This is really cool, I wish someone would fund it.

still_grokking · 5y ago

That dead[1] project?

Actually everything around "Redox" looks like:

https://gitlab.redox-os.org/redox-os/tfs/issues/66

[1] https://gitlab.redox-os.org/redox-os/tfs/issues/80

blackrock · 5y ago · 2 in thread

Once you have the file system, and a scheduler, don’t you have a basic rudimentary operating system?

How soon until someone builds an Operating System developed in Rust? Maybe make it microkernel-based this time.

smt88 · 5y ago

> How soon until someone builds an Operating System developed in Rust?

Redox[1] has been around for almost as long as Rust has. I first heard about it 4-5 years ago.

They had an interesting competition a while back challenging people to figure out how to crash it.

1. https://www.redox-os.org/

blackrock · 5y ago

Yeah, I heard about this project. But there’s a graveyard of dead OS projects out there.

What’s the progress and potential of Redox?

azhenley · 5y ago · 1 in thread

There’s also this file system chapter from a series on writing an OS in Rust: http://osblog.stephenmarz.com/ch10.html

est31 · 5y ago

And for code there is TFS https://github.com/redox-os/tfs

phjesusthatguy3 · 5y ago · 1 in thread

We've attempted this as well and it's not as simple as it seems. The issues we've run into have made us reconsider porting our FS handlers to Rust, although we are cautiously optimistic about later results.

ianlevesque · 5y ago

Any more specifics?

vinc · 5y ago

I recently wrote a very simple and naive filesystem in Rust for a toy OS I'm building and it was quite an interesting thing to do: https://github.com/vinc/moros/blob/master/doc/filesystem.md

Then I implemented a little FUSE driver in Python to read the disk image from the host system and it was wonderful to mount it the first time and see the files! https://github.com/vinc/moros-fuse

Ericson2314 · 5y ago

My dream is to add enough type parameters so in-memory collections can also work as (not horribly tuned!) on-disk data structures.

It's a nice ambitious goal which can really drive language and library design.

sjwright · 5y ago

I'd be curious to experiment with a file system where all of the file and path metadata is centrally stored in a sqlite blob. Is sqlite fast enough for dealing with file system metadata requests?

shmerl · 5y ago

Something like bcachefs could have been written in Rust.
