I'm also unsure if Diskcache supports ignoring certain fields in the function call.
And yes, it does support ignoring certain args.
Decorator to wrap callable with memoizing function using cache. Repeated calls with the same arguments will lookup result in cache and avoid function evaluation.
mature and works well
It's a very mature library, too, nice and polished, I've never once experienced a bug with it.
My local file cache Python decorator also lets you define the hash manually, either through a decorator parameter (a function that plucks a value from the cached function's params) or by calling a global function from anywhere with any arbitrary value.
What's cool about caching results locally to files during development is how easy it is to invalidate caches: just delete the file named after the function and key you want.
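For anyone who wants to picture this, here is a minimal sketch of that kind of decorator. It is not the commenter's actual code: the name `file_cache`, the `key` and `cache_dir` parameters, and the `<function>_<key>.pkl` naming scheme are all assumptions made for illustration.

```python
import functools
import os
import pickle
import tempfile

# Hypothetical default location for the cache files.
DEFAULT_CACHE_DIR = os.path.join(tempfile.gettempdir(), "fn_cache_demo")


def file_cache(key, cache_dir=DEFAULT_CACHE_DIR):
    """Cache results in <cache_dir>/<function name>_<key>.pkl.

    `key` is a callable that receives the call's arguments and returns
    the value used as the cache key (it must be safe in a file name).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cache_dir, exist_ok=True)
            name = f"{func.__name__}_{key(*args, **kwargs)}.pkl"
            path = os.path.join(cache_dir, name)
            if os.path.exists(path):
                # Cache hit: load the stored result instead of recomputing.
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator
```

With something like `@file_cache(key=lambda user_id, verbose=False: user_id)` on a function `load`, a call to `load(7)` would write `load_7.pkl`, and deleting that one file invalidates just that entry.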
- from your code it seems you're not sorting kwargs, I would strongly recommend sorting them so that whether you call f(a=1, b=2) or f(b=2, a=1) the cache key is the same
- I use inspect.signature to convert all args to kwargs, so it doesn't matter how a function gets called and the cache logic is always consistent. I know this is relatively slow, but it only gets called once per function (I call it outside the wrapper) and the DX benefits are nice. On the same note, you could probably move the inspect.getsource call outside your wrapper fn for a speed boost.
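A rough sketch of the two suggestions in this list, assuming an in-memory dict in place of the real file cache. The decorator name `cached` and the `store` attribute are made up for the example, and functions that take `**kwargs` would need extra normalization that is not shown here.

```python
import functools
import hashlib
import inspect
import pickle


def cached(func):
    # The signature is computed once, when the decorator runs,
    # not on every call.
    sig = inspect.signature(func)
    store = {}  # stand-in for a file-backed cache

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        # Sort by parameter name so the key does not depend on call style
        # or on keyword order.
        items = sorted(bound.arguments.items())
        key = hashlib.sha1(pickle.dumps(items)).hexdigest()
        if key not in store:
            store[key] = func(*args, **kwargs)
        return store[key]

    wrapper.store = store
    return wrapper
```

With this, `f(1, 2)`, `f(a=1, b=2)` and `f(b=2, a=1)` all resolve to the same cache entry.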
I also took the opposite approach to ignore_params, and made the __dict__ params that get hashed opt-in, which works well when caching instance methods
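One way the opt-in approach could look, as a sketch only: the decorator name `cache_method` and its `fields` parameter are hypothetical, the listed attributes are read with `getattr` rather than from `__dict__` directly, and everything that goes into the key is assumed to be hashable.

```python
import functools


def cache_method(fields):
    """Memoize a method on the listed instance attributes plus its arguments."""
    def decorator(method):
        store = {}  # in-memory stand-in for the real cache

        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # Only the opted-in attributes take part in the key; anything
            # else on the instance (sessions, loggers, ...) is ignored.
            state = tuple(getattr(self, name) for name in fields)
            key = (state, args, tuple(sorted(kwargs.items())))
            if key not in store:
                store[key] = method(self, *args, **kwargs)
            return store[key]
        return wrapper
    return decorator
```

Decorating a method with `@cache_method(fields=("region",))` would then share cached results between two instances that have the same `region`, even if their other attributes differ.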
These tips make sense, we often use named args in our function calls (not using them has caused so many bugs), but we don't really enforce the order. Copilot doesn't always get it right either.
By moving inspect.getsource out of the wrapper, do you mean initializing it when the module is imported? I'm curious how that improves performance.
Re inspect.getsource, I'm not sure if it'd be a huge performance impact, but if it's in the wrapper fn it will get called every time the function gets called, while if it's outside it will be called only when the decorator runs (eg when the module containing the function being decorated is imported).
eg: https://gist.github.com/mpeg/ff1d99fde06f39916b5aaadd76b534f...
EDIT: on a quick test, over 100k function calls, with inspect.getsource inside the wrapper it runs in 2.7s on my Apple M2, and that's not even including the md5 hash, so I suspect this should dramatically improve performance for you
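To make the "outside the wrapper" point concrete, here is a small sketch (not the code from the gist). The name `with_code_hash` and the `fingerprint` parameter are assumptions for illustration, and a dict stands in for the file cache.

```python
import functools
import hashlib
import inspect


def with_code_hash(fingerprint=inspect.getsource):
    def decorator(func):
        # Runs once, when the decorator is applied (typically at import time).
        code_hash = hashlib.sha1(fingerprint(func).encode()).hexdigest()
        store = {}  # in-memory stand-in for the file cache

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Per-call work only reuses the precomputed hash.
            key = (code_hash, args, tuple(sorted(kwargs.items())))
            if key not in store:
                store[key] = func(*args, **kwargs)
            return store[key]

        wrapper.code_hash = code_hash
        return wrapper
    return decorator
```

Because `fingerprint(func)` sits in `decorator` rather than in `wrapper`, the source lookup happens once per decorated function no matter how many times the function is called.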
import hashlib

def hash_code(code):
    return hashlib.md5(code.encode()).hexdigest()
Be warned. The above function is used as part of the hash. The ostensible purpose is to prevent using cached values of functions whose code has changed, but it does not handle dependencies of that function.

Your example isn't really a problem with OP's utility, but a specific example of a broader dependency management problem that affects just about everything. The answers usually boil down to 1) invest heavily in a kick-ass test suite, 2) never upgrade or 3) upgrade and pray nothing breaks.
Pretty much. Recursively collect dependencies by analyzing the AST of the code.
> And then if one minor update

And let's say the dependency did change, but it's generally inert (more error handling around edge cases, for example), how do you factor that in? Blow up the whole cache?
You're saying that like it's some kind of ridiculous ask, but yes. The current implementation is already "Blow[ing] up the whole cache" whenever the code for the decorated function is changed anyways. I'd guess that additionally handling dependencies recursively would only modestly increase the rate of "Blow[ing] up the whole cache".
> Your example isn't really a problem with OPs utility...
Whether or not this is a problem in practice obviously depends on your use case. Maybe you don't generally care if functions return the correct result, but many do.
> [This is] a specific example of a broader dependency management problem that affects just about everything.
Dependency resolution is not trivial per se, but it's a pretty common problem. Every package manager and build system (make, etc.) has solved this.
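As a rough idea of what "recursively collect dependencies by analyzing the AST" could mean, here is a deliberately limited sketch. It only follows top-level functions defined in the same module source, ignores imports, classes, and attribute access, and the function name `dependency_hash` is made up for the example.

```python
import ast
import hashlib


def dependency_hash(module_source, func_name):
    """Hash func_name together with every same-module function it reaches."""
    tree = ast.parse(module_source)
    defs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    seen, stack = set(), [func_name]
    while stack:
        name = stack.pop()
        if name in seen or name not in defs:
            continue
        seen.add(name)
        # Every name that is read inside the function body is a candidate
        # dependency; only those that are module-level functions are kept.
        for node in ast.walk(defs[name]):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                stack.append(node.id)

    digest = hashlib.sha1()
    for name in sorted(seen):
        # ast.dump omits line numbers by default, so moving code around
        # or editing comments does not change the hash.
        digest.update(ast.dump(defs[name]).encode())
    return digest.hexdigest()
```

Changing a helper that the target function calls changes the hash, while editing an unrelated function in the same module does not.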
Sha1 is a better choice even for non-cryptographic use cases, it's quite a bit faster than md5. Even better would be something like xxhash!
According to a quick bash script I wrote to benchmark the popular hash functions, md5 comes out last compared to sha1, sha256, sha512, and blake2, and by a decent margin!
A good rule of thumb is to never use md5 at all. Not even for non-cryptographic use cases. It's not only broken, but also very slow!
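If you want to check this on your own machine, something like the following works as a quick comparison. It is a Python stand-in rather than the bash script mentioned above, the function name `bench` and its defaults are arbitrary, and results will vary with the CPU and the OpenSSL build, so no numbers are claimed here.

```python
import hashlib
import timeit


def bench(names=("md5", "sha1", "sha256", "sha512", "blake2b"),
          size=1 << 20, repeat=20):
    """Return {algorithm: seconds} for hashing `size` bytes `repeat` times."""
    data = b"x" * size
    return {
        name: timeit.timeit(lambda n=name: hashlib.new(n, data).digest(),
                            number=repeat)
        for name in names
    }


if __name__ == "__main__":
    # Print the algorithms from fastest to slowest.
    for name, secs in sorted(bench().items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {secs:.4f}s")
```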
https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/ca...
memory = joblib.memory.Memory(...)

@memory.cache
def slow_func(...):
    ...

""" Caching Libraries
joblib.Memory provides caching functions and works by explicitly saving the inputs and outputs to files. It is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.
"""
From https://pypi.org/project/diskcache/

I think it could help if you forget to save the output of a function within a single cell, like this:
1. print(f(x)) # -> check what happened
2. out = f(x) # -> turns out we want to save this, so we have to wait again
Does this also take care of the thundering herd problem? That was one of the cases where lru_cache really blows.
Sometimes caching can actually be slower for certain functions, because just performing that operation is faster than pickle.load/pickle.dump.
The main difference is that it stores the state of an object, not a function.
If your data is JSON serializable then it could be a cool way to save and resume application state.