> We do not need replication right now because our user base is low
What happens when the hard drive or drives in this machine fail?
> I don't see how an hierarchical file structure would prevent us from doing it.
Because you are tied to one machine. Replication across machines creates a consistency problem: how do you ensure consistency? Either you write to N machines synchronously, which means when one fails the write doesn't land there and you have to reconcile later, or you write to one and asynchronously replicate to the others, which means a failure before replication happens loses data.
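To make the reconciliation problem concrete, here is a toy sketch (Replica, replicated_write, and the repair queue are all made-up names, not a real library): a write fans out to N replicas, and any replica that misses the write gets queued for later repair.

```python
# Toy fan-out write with a repair queue for missed replicas.
class Replica:
    def __init__(self):
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise IOError("replica down")
        self.data[key] = value

def replicated_write(replicas, key, value, repair_queue):
    acks = 0
    for r in replicas:
        try:
            r.write(key, value)
            acks += 1
        except IOError:
            # this replica missed the write; reconcile it later
            repair_queue.append((r, key, value))
    return acks

replicas = [Replica(), Replica(), Replica()]
replicas[1].alive = False  # simulate a machine failure mid-write
queue = []
acks = replicated_write(replicas, "k", b"v", queue)
# acks == 2, and the queue holds one pending repair
```

The repair queue is exactly the "reconcile later" work; real systems (quorums, anti-entropy) are variations on this theme.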
(I am not a database guy and I'm on a programming hiatus, so maybe I don't understand)
If you cronned rsync or a backup script, how much data are you willing to lose? If a change is made right after backup happens, is it a big deal to lose it if a machine fails before the next backup?
The hash question was more to assess why your files need to be in a hierarchical structure rather than in a key value store, where the key is the file hash, and the value is the contents of that file. I was imagining an example being that nginx serves files directly out of these folders.
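For what I mean by a content-addressed key value store, a minimal sketch (ContentStore is a made-up name): the key is the hash of the file's bytes, so identical files dedupe for free and the directory hierarchy stops being load-bearing.

```python
import hashlib

# Toy content-addressed store: key = sha256 of the contents.
class ContentStore:
    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        self._blobs[key] = payload
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentStore()
key = store.put(b"hello")
assert store.get(key) == b"hello"
```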
If you think you are going to grow significantly, tying information directly to a machine will create incredible operational pain in the future. Abstracting it away from a machine lets you do things like stick your data into s3, which will let you avoid a large number of operational pitfalls.
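By "abstracting it away from a machine" I mean something like the following sketch (every name here is hypothetical): callers talk to a store interface, and whether the backend is a local disk or s3 becomes a config detail rather than a rewrite.

```python
# Callers depend on write/read, not on which machine holds the bytes.
class MemoryStore:
    """Stand-in for a local-disk backend; same interface an S3 backend gets."""
    def __init__(self):
        self._objects = {}

    def write(self, path, payload):
        self._objects[path] = payload

    def read(self, path):
        return self._objects[path]

class S3Store:
    """With boto3 this would wrap put_object/get_object; stubbed out here."""
    def __init__(self, bucket):
        self.bucket = bucket

    def write(self, path, payload):
        raise NotImplementedError  # boto3 put_object would go here

    def read(self, path):
        raise NotImplementedError  # boto3 get_object would go here

def save_report(store, name, data):
    store.write("reports/" + name, data)

store = MemoryStore()  # swap in S3Store("my-bucket") without touching callers
save_report(store, "jan.csv", b"a,b\n1,2\n")
```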
> Using your solution would abstract the filesystem
That's where I am going. If you do grow and your data out grows a machine or machines, I imagine you will have to do this. If your data is in s3, I imagine you are going to define config files which get consumed by wherever your "write" call is made and consumed by a script that validates those assumptions and emails a report of assumption violations.
I would assume a client and a cronned validation script that look something like:
    class ValidatedS3Writer:
        def __init__(self, s3, validator):
            self.s3 = s3
            self.validator = validator

        def write(self, path, payload):
            self.validator.validate(path, payload)
            self.s3.write(path, payload)

    s3 = ValidatedS3Writer(S3(), Validator(VALIDATION_CONFIG_YAML))
----
    if __name__ == "__main__":
        validator = Validator(VALIDATION_CONFIG_YAML)
        for path in db.all_paths():
            validator.validate(path, s3.read(path))  # or s3.metadata(path)
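As a sketch of what that Validator might look like: the config shape here (max_bytes / allowed_prefixes) is purely an assumption, and whether validate raises or returns a report is your call; I return a list of violations so the cron job has something to email.

```python
# Hypothetical Validator: checks a path/payload against assumed config keys.
class Validator:
    def __init__(self, config):
        self.max_bytes = config.get("max_bytes", 10_000_000)
        self.allowed_prefixes = config.get("allowed_prefixes", [])

    def validate(self, path, payload):
        violations = []
        if self.allowed_prefixes and not any(
            path.startswith(p) for p in self.allowed_prefixes
        ):
            violations.append(f"{path}: unexpected prefix")
        if len(payload) > self.max_bytes:
            violations.append(f"{path}: too large ({len(payload)} bytes)")
        return violations  # the cron job would collect and email these

v = Validator({"max_bytes": 10, "allowed_prefixes": ["images/"]})
v.validate("images/a.png", b"ok")   # -> []
v.validate("tmp/a.png", b"x" * 20)  # -> two violations
```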
Anyway, I don't know your use case in detail; if this is a premature optimization, well, that is the root of all evil. However, if it is not an optimization but an architectural decision, I would encourage you to think about:
1. What happens when a machine fails?
2. What happens when you hit some machine-based limit?
If these files in a hierarchy rarely change and are pointed to by, say, an nginx config, then making sure your backups are sane (and testing them, actually testing them) and writing a validation script is probably the better investment, one that lets you spend more time on product.
Operational pain comes from how state is managed, from single points of failure (SPOFs), and from consistency models. So any time you have a SPOF, it merits thinking about what it would take to mitigate it.