Show HN: File-based cache for slow Python functions | Better HN

57 comments

rmholt · 2y ago · 10 in thread

I have extensively used https://pypi.org/project/diskcache/. Is there a reason you decided to make an in-house solution?

williamzeng0 (OP) · 2y ago

Diskcache works well; we just wanted a dependency-free version that we had more control over (easier cache key deletion). I think you'd have to write a custom hashing function for diskcache to use the function source code as a key.

I'm also unsure if Diskcache supports ignoring certain fields in the function call.
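For readers following along, here is a minimal sketch of those two ideas: keying the cache on the function's source code and ignoring selected arguments. It is not the OP's actual implementation. `file_cache`, `source_hash`, and the temp-directory location are names assumed for illustration, and this version can only ignore arguments passed by keyword.

```python
import functools
import hashlib
import inspect
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp(prefix="file_cache_"))  # assumed location


def source_hash(fn):
    """Hash the function's source text; fall back to bytecode if it is unavailable."""
    try:
        payload = inspect.getsource(fn).encode()
    except (OSError, TypeError):  # e.g. defined in a REPL or via exec()
        payload = fn.__code__.co_code
    return hashlib.sha256(payload).hexdigest()[:16]


def file_cache(ignore_params=()):
    def decorator(fn):
        code_hash = source_hash(fn)  # editing the function changes every file name

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Drop ignored keyword arguments, then sort so their order doesn't matter.
            kept = sorted((k, v) for k, v in kwargs.items() if k not in ignore_params)
            arg_hash = hashlib.sha256(pickle.dumps((args, kept))).hexdigest()[:16]
            path = CACHE_DIR / f"{fn.__name__}_{code_hash}_{arg_hash}.pickle"
            if path.exists():
                return pickle.loads(path.read_bytes())
            result = fn(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    return decorator


calls = []


@file_cache(ignore_params=("verbose",))
def slow_square(x, verbose=False):
    calls.append(x)  # records real executions so cache hits are visible
    return x * x
```

Calling `slow_square(3)` and then `slow_square(3, verbose=True)` should execute the body only once, since `verbose` is left out of the key.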

rmholt · 2y ago

Good questions! You made me check the docs because those seem like very legitimate issues. So firstly, DiskCache by default just checks the function name, not the source code, but you could hack it to include the source code. I personally usually just deleted the cache if I knew the function meaningfully changed.

And yes, it does support ignoring certain args.

CGamesPlay · 2y ago

I was curious to see an alternative to this, but how is this an alternative? You're saying I can implement my own caching of function calls that invalidates when the arguments or source code change...? These feel like entirely separate layers. Did I miss where diskcache does this stuff?

https://grantjenks.com/docs/diskcache/api.html#diskcache.Cac...

Decorator to wrap callable with memoizing function using cache. Repeated calls with the same arguments will lookup result in cache and avoid function evaluation.

anentropic · 2y ago

I was thinking of this as soon as I read the OP

mature and works well

quickslowdown · 2y ago

I found DiskCache sometime last year, and it's amazing. Very simple to set up and works great as a cache for so many different things.

dotancohen · 2y ago

What are you using it for? A disk based cache seems almost contradictory for my use cases, I would love to hear yours. Anything that I would store on disk, even as a cache, I can generally put in SQLite.

varispeed · 2y ago

How does it fare with several million cached objects?

BiteCode_dev · 2y ago

It performs like SQLite, which is the fastest you can get with Pareto (80/20) effort.

Being SQLite-backed, it's really fast and thread-safe; the cache is shared safely between all threads or processes.

It's a very mature library, too, nice and polished, I've never once experienced a bug with it.

wildermuthn · 2y ago · 7 in thread

If you aren’t caching LLM functions during development, then you’re an even greater glutton for punishment than the normal engineer.

My local file cache Python decorator also allows the decorator to define the hash manually, either by the decorator’s parameter function call that plucks a value from the cached function params, or by calling a global function from anywhere with any arbitrary value.

What’s cool about caching results locally to files during development is the ease of invalidating caches — just delete the file named after the function and key you want.

I feel like adding an argument to the decorator that labels the "version" of the function would make deliberate cache invalidation more straightforward for cache users.
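As a rough illustration of both points (files named after the function and key, plus an explicit version label), here is a sketch. The `version` parameter, the file-name layout, and the `invalidate` helper are assumptions made for the example, not the API of the OP's utility.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp(prefix="versioned_cache_"))  # assumed location


def file_cache(version="v1"):
    """Cache results in files named <function>_<version>_<arg hash>.pickle."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            raw_key = pickle.dumps((args, sorted(kwargs.items())))
            arg_hash = hashlib.sha1(raw_key).hexdigest()[:16]
            path = CACHE_DIR / f"{fn.__name__}_{version}_{arg_hash}.pickle"
            if path.exists():
                return pickle.loads(path.read_bytes())
            result = fn(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


def invalidate(fn_name):
    """Delete every cache file whose name starts with <fn_name>_; return the count."""
    removed = 0
    # Note: this prefix glob would also match a function named e.g. <fn_name>_extra.
    for path in CACHE_DIR.glob(f"{fn_name}_*.pickle"):
        path.unlink()
        removed += 1
    return removed


call_log = []


@file_cache(version="v1")
def summarize(text):
    call_log.append(text)  # records real executions
    return text.upper()
```

Bumping `version` to `"v2"` leaves the old files in place but stops them from being read, while `invalidate("summarize")` removes them outright.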

williamzeng0 (OP) · 2y ago

The version input makes sense, though I could also see some developers disliking that UX because of its verbosity. But to deliberately invalidate you have to make a manual effort in either case.

williamzeng0 (OP) · 2y ago

100%, invalidation needs to be fast or you're not really saving time. I'm curious about calling a global function, what's the use case for that?

mpeg · 2y ago

This is also why in my custom cache I back it with sqlite – much easier to delete one db file than thousands of pickle files.
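A sketch of what a SQLite-backed store along these lines might look like, using only the standard library. The table layout and the `cache_get` / `cache_set` / `cache_clear` helpers are assumptions for illustration, not mpeg's actual code.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent cache
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache ("
    "fn TEXT NOT NULL, key TEXT NOT NULL, value BLOB NOT NULL, "
    "PRIMARY KEY (fn, key))"
)


def cache_get(fn_name, key):
    """Return the stored value, or None on a miss (so None itself can't be cached here)."""
    row = conn.execute(
        "SELECT value FROM cache WHERE fn = ? AND key = ?", (fn_name, key)
    ).fetchone()
    return None if row is None else pickle.loads(row[0])


def cache_set(fn_name, key, value):
    conn.execute(
        "INSERT OR REPLACE INTO cache (fn, key, value) VALUES (?, ?, ?)",
        (fn_name, key, pickle.dumps(value)),
    )
    conn.commit()


def cache_clear(fn_name):
    """Drop every entry for one function; returns the number of rows deleted."""
    cursor = conn.execute("DELETE FROM cache WHERE fn = ?", (fn_name,))
    conn.commit()
    return cursor.rowcount
```

With this layout, clearing one function is a single `DELETE`, and clearing everything is a matter of deleting the one database file.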

AlecSchueler · 2y ago

Globs are a thing?

canadiantim · 2y ago

I'm sure this is a stupid question, but why is it much better to be caching LLM functions during development?

Because they are generally incredibly computationally expensive operations that can take hours/days to complete

mpeg · 2y ago · 5 in thread

I recently wrote a version of this that I use in my projects, some things I do differently that you may or may not care about:

- from your code it seems you're not sorting kwargs, I would strongly recommend sorting them so that whether you call f(a=1, b=2) or f(b=2, a=1) the cache key is the same

- I use inspect.signature to convert all args to kwargs, so it doesn't matter how a function gets called: the cache logic is always consistent. I know this is relatively slow, but it only gets called once per function (I call it outside the wrapper) and the DX benefits are nice (on this same note, you could probably move the inspect.getsource call outside your wrapper fn for a speed boost)

I also took the opposite approach to ignore_params, and made the __dict__ params that get hashed opt-in, which works well when caching instance methods
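The kwargs normalization described in the list above could look roughly like this. `make_key_fn` and `key_for` are hypothetical names, and hashing the pickled arguments is an assumption about how the key gets built (pickle output is not guaranteed to be canonical for every object type).

```python
import hashlib
import inspect
import pickle


def make_key_fn(fn):
    """Return a function mapping a call's arguments to a stable cache key."""
    sig = inspect.signature(fn)  # computed once, outside the per-call path

    def key_for(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        # Every argument is now name -> value; sort by name for a stable order.
        normalized = sorted(bound.arguments.items())
        return hashlib.sha1(pickle.dumps(normalized)).hexdigest()

    return key_for


def f(a, b=2):
    return a + b


key_for = make_key_fn(f)
```

Here `key_for(1, 2)`, `key_for(a=1, b=2)`, `key_for(b=2, a=1)`, and `key_for(1)` all map to the same key, because positional arguments and defaults are resolved to names before hashing.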

williamzeng0 (OP) · 2y ago

Making the __dict__ opt-in makes it a lot more user-friendly at the expense of a little verbosity. That makes sense.

These tips make sense, we often use named args in our function calls (not using them has caused so many bugs), but we don't really enforce the order. Copilot doesn't always get it right either.

By moving inspect.getsource out of the wrapper, do you mean initializing it when the module is imported? I'm curious how that improves performance.

mpeg · 2y ago

Yeah I too try to avoid positional args as much as possible, huge source of bugs and time wasting especially when refactoring code

Re inspect.getsource, I'm not sure if it'd be a huge performance impact, but if it's in the wrapper fn it will get called every time the function gets called, while if it's outside it will be called only when the decorator runs (eg when the module containing the function being decorated is imported).

eg: https://gist.github.com/mpeg/ff1d99fde06f39916b5aaadd76b534f...

EDIT: on a quick test, over 100k function calls, with inspect.getsource inside the wrapper it runs in 2.7s on my Apple M2, and that's not even including the md5 hash, so I suspect this should dramatically improve performance for you
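For anyone who wants to measure this on their own machine, a small sketch of the two placements follows. This is not the code from the gist; `hash_inside`, `hash_outside`, and `source_hash` are made-up names, and the bytecode fallback is only there so the snippet still runs where source text is unavailable.

```python
import hashlib
import inspect
import timeit


def source_hash(fn):
    """md5 of the function's source; falls back to bytecode if source is unavailable."""
    try:
        payload = inspect.getsource(fn).encode()
    except (OSError, TypeError):  # e.g. code run from a REPL or exec()
        payload = fn.__code__.co_code
    return hashlib.md5(payload).hexdigest()


def hash_inside(fn):
    def wrapper(*args, **kwargs):
        wrapper.code_hash = source_hash(fn)  # recomputed on every call
        return fn(*args, **kwargs)
    return wrapper


def hash_outside(fn):
    code_hash = source_hash(fn)  # computed once, when the decorator runs

    def wrapper(*args, **kwargs):
        wrapper.code_hash = code_hash
        return fn(*args, **kwargs)
    return wrapper


def add(a, b):
    return a + b


slow_add = hash_inside(add)
fast_add = hash_outside(add)

if __name__ == "__main__":
    n = 10_000
    print("inside :", timeit.timeit(lambda: slow_add(1, 2), number=n))
    print("outside:", timeit.timeit(lambda: fast_add(1, 2), number=n))
```

Both wrappers end up with the same hash; the only difference is how often it gets computed, so the gap between the two timings is the per-call overhead being discussed.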

AlecSchueler · 2y ago

Very insightful comment, but can I ask what DX stands for? Maybe I'm missing something obvious.

oulipo · 2y ago

"Developer Experience", e.g. good developer tools / libs

by_the_bay · 2y ago

Developer experience

epr · 2y ago · 5 in thread

  def hash_code(code):
      return hashlib.md5(code.encode()).hexdigest()

Be warned. The above function is used as part of the hash. The ostensible purpose is to prevent using cached values of functions whose code has changed, but it does not handle dependencies of that function.

martinky24 · 2y ago

How do you suggest one might fix that issue? Also pin the cache to a hash of all dependency versions? And then if one minor update And let's say the dependency did change, but it's generally inert (more error handling around edge cases, for example), how do you factor that in? Blow up the whole cache?

Your example isn't really a problem with OP's utility, but a specific example of a broader dependency management problem that affects just about everything. The answers usually boil down to 1) invest heavily in a kick-ass test suite, 2) never upgrade, or 3) upgrade and pray nothing breaks.

williamzeng0 (OP) · 2y ago

+1, we considered traversing the function's dependencies to key the cache on (not just the initial function source code), but decided to leave this in as a constraint. Otherwise we'd also be blowing up the cache when we didn't want that to happen.

epr · 2y ago

> How do you suggest one might fix that issue? Also pin the cache to a hash of all dependency versions?

Pretty much. Recursively collect dependencies by analyzing the AST of the code.
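A very rough sketch of that idea, limited to functions defined at the top level of one module and working from source text. `hash_with_dependencies` is a hypothetical helper; it does not follow imports, methods, or attribute access, and it over-approximates by treating any name that matches a module-level function as a dependency.

```python
import ast
import hashlib


def hash_with_dependencies(module_source, func_name):
    """Hash a function together with the same-module functions it references, recursively."""
    tree = ast.parse(module_source)
    functions = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if func_name not in functions:
        raise KeyError(func_name)

    visited = set()

    def collect(name):
        if name in visited or name not in functions:
            return
        visited.add(name)
        for node in ast.walk(functions[name]):
            if isinstance(node, ast.Name):
                collect(node.id)  # follow references to other module-level functions

    collect(func_name)
    digest = hashlib.sha1()
    for name in sorted(visited):
        # ast.dump omits line numbers and comments, so pure formatting edits don't count.
        digest.update(ast.dump(functions[name]).encode())
    return digest.hexdigest()


BEFORE = """
def helper(x):
    return x + 1

def cached(x):
    return helper(x) * 2
"""
AFTER = BEFORE.replace("x + 1", "x + 2")  # only the dependency changes
```

With this, the hash for `cached` differs between `BEFORE` and `AFTER` even though the text of `cached` itself is identical.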

> And then if one minor update And let's say the dependency did change, but it's generally inert (more error handling around edge cases, for example), how do you factor that in? Blow up the whole cache?

You're saying that like it's some kind of ridiculous ask, but yes. The current implementation is already "Blow[ing] up the whole cache" whenever the code for the decorated function is changed anyways. I'd guess that additionally handling dependencies recursively would only modestly increase the rate of "Blow[ing] up the whole cache".

> Your example isn't really a problem with OPs utility...

Whether or not this is a problem in practice obviously depends on your use case. Maybe you don't generally care if functions return the correct result, but many do.

> [This is] a specific example of a broader dependency management problem that affects just about everything.

Dependency resolution is not trivial per se, but it's a pretty common problem. Every single package manager, build system (make), etc. have all solved this.

Using md5 for this seems like an odd choice.

Sha1 is a better choice even for non-cryptographic use cases, it's quite a bit faster than md5. Even better would be something like xxhash!

According to a quick bash script I wrote to benchmark the popular hash functions, md5 comes out last compared to sha1, sha256, sha512, and blake2, and by a decent margin!

A good rule of thumb is to never use md5 at all. Not even for non-cryptographic use cases. It's not only broken, but also very slow!
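The bash script isn't shown, so here is a comparable sketch in Python for anyone who wants to check on their own hardware. The payload size and repeat count are arbitrary assumptions, and the ranking depends heavily on the CPU and the OpenSSL build, so no expected ordering is claimed here.

```python
import hashlib
import timeit

ALGORITHMS = ("md5", "sha1", "sha256", "sha512", "blake2b")


def time_hashes(payload, repeats=50):
    """Return {algorithm: seconds} for hashing `payload` `repeats` times."""
    timings = {}
    for name in ALGORITHMS:
        hasher = getattr(hashlib, name)
        timings[name] = timeit.timeit(lambda: hasher(payload).digest(), number=repeats)
    return timings


if __name__ == "__main__":
    payload = b"x" * 1_000_000  # 1 MB of dummy data
    for name, seconds in sorted(time_hashes(payload).items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {seconds:.4f}s")
```

For inputs the size of a single function's source, any of these finishes in microseconds, so the choice matters more when hashing large arguments.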

williamzeng0 (OP) · 2y ago

That sounds great, I'm going to see how Sweep does on this issue: https://github.com/sweepai/sweep/issues/3333

rassibassi · 2y ago · 2 in thread

What's the difference from using joblib's Memory class, similar to this implementation:

https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/ca...

I was going to mention this as well. It's fairly similar:

  memory = joblib.memory.Memory(...)
  
  @memory.cache
  def slow_func(...):
      ...

rassibassi · 2y ago

The diskcache docs state:

""" Caching Libraries

    joblib.Memory provides caching functions and works by explicitly saving the inputs and outputs to files. It is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.

""" From https://pypi.org/project/diskcache/

kapilsinha · 2y ago · 2 in thread

I like the simplicity. I definitely get the payoff for standalone Python scripts, where once the script errors out the memory is cleared. But do you see a similar payoff for Jupyter notebooks (or similar)?

williamzeng0 (OP) · 2y ago

I think the marginal gain would be a lot less for Jupyter notebooks, but I've definitely rerun individual cells and wasted time there before.

I think it could help if you forget to save the output of a function within a single cell like this:

  1. print(f(x))  # -> check what happened
  2. out = f(x)   # -> turns out we want to save this, so we have to wait again

FWIW, there is a built-in cache system for R Markdown documents. I'm not up to speed on their exact implementation but I have found it useful.

https://bookdown.org/yihui/rmarkdown-cookbook/cache.html

rthnbgrredf · 2y ago · 1 in thread

Recently, I experimented with various techniques to cache some JSON responses from FastAPI, using Python decorators for both in-memory and disk caching on a single machine. After benchmarking the performance, I found the results somewhat disappointing (500 req/s vs 5k req/s). While caching did lead to a tenfold improvement in speed compared to no caching, I believe the primary bottleneck was Python's inherent performance limitations, which made it X times slower than a comparable program written in C. Consequently, I decided to remove the cache decorator and instead put a simple nginx caching reverse proxy in front of FastAPI. This resulted in performance gains that were an order of magnitude better (60k req/s) than those achieved with Python-based caching.

kevinlu1248 · 2y ago

We also found a lot of cases where caching ends up being actually slower than doing the operation. The 100% solution would probably be to use a SQL db the way diskcache does it, but this is easier to use for us.

skp1995 · 2y ago · 1 in thread

This is a pretty good implementation. I like the simplicity of it; it reminds me of the SQLite-backed storage decorators we used to have, where the data was persisted to a DB instead of the file system (although that's just a different storage engine).

Does this also take care of the thundering herd problem? That was one of the cases where lru_cache really blows.
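For context, the thundering herd problem here means many concurrent callers missing the cache for the same key at once and all running the slow function. One common mitigation is a lock per key, sketched below for threads in a single process. `get_or_compute` and the in-memory dict are assumptions for illustration; a file cache shared across processes would need file locking instead.

```python
import threading
import time
from collections import defaultdict

_results = {}
_key_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()


def get_or_compute(key, compute):
    """Return the cached value for `key`, computing it at most once across threads."""
    with _registry_lock:  # guard creation of the per-key lock
        lock = _key_locks[key]
    with lock:  # concurrent callers for the same key queue up here
        if key not in _results:
            _results[key] = compute()
        return _results[key]


compute_calls = []


def expensive():
    compute_calls.append(1)  # records real executions
    time.sleep(0.05)  # stand-in for a slow function
    return 42


def run_herd(n_threads=8):
    """Start several threads that all ask for the same key at once."""
    outputs = []
    threads = [
        threading.Thread(target=lambda: outputs.append(get_or_compute("answer", expensive)))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```

All eight threads get the same result, but `expensive` runs only once because the other callers wait on the key's lock and then read the stored value.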

williamzeng0 (OP) · 2y ago

Unfortunately it doesn't, we typically don't expect to handle high load with this cache and actually disable it in production with another envvar.

Sometimes caching can actually be slower for certain functions, because just performing that operation is faster than pickle.load/pickle.dump.

andrewgazelka · 2y ago

This looks cool :). A while ago, I wrote something similar that analyzes bytecode and invalidates the cache if the bytecode changes.
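As a general illustration of bytecode-based invalidation (not a description of how smart-cache works internally), a fingerprint can be built from a function's code object. `bytecode_fingerprint` is a hypothetical helper, and using `repr()` on the constants is a simplifying assumption.

```python
import hashlib
import types


def bytecode_fingerprint(fn):
    """Hash a function's compiled bytecode plus its constants and referenced names."""
    digest = hashlib.sha1()

    def feed(code):
        digest.update(code.co_code)
        digest.update(repr(code.co_names).encode())
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                feed(const)  # nested functions, lambdas, comprehensions
            else:
                digest.update(repr(const).encode())

    feed(fn.__code__)
    return digest.hexdigest()


def v1(x):
    # a comment that does not reach the bytecode
    return x + 1


def v1_reformatted(x):
    return x + 1


def v2(x):
    return x + 2
```

Compared with hashing source text, this ignores comments and formatting, so `v1` and `v1_reformatted` share a fingerprint while `v2` gets a different one. The trade-off is that bytecode can change between Python versions, which would invalidate the cache on an interpreter upgrade.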

https://github.com/andrewgazelka/smart-cache

emilehere · 2y ago

Reminds me of a little prototype I wrote a while ago that tried to do something similar with JavaScript's Proxy class. https://github.com/emileindik/cashola

The main difference is that it stores the state of an object, not a function.

If your data is JSON serializable then it could be a cool way to save and resume application state.
