Caching the results of functions of your R package

Highlights

  • One principle of programming that’s often encountered is “DRY”, “Don’t Repeat Yourself”, that encourages e.g. the use of functions over duplicated (read: copy-pasted and slightly amended) code. You could also interpret it as don’t let the machine repeat its calculations if useless. How about for a function with the same inputs (or with no argument!), we only run it once e.g. per R session, and save the results for later? In this post, we shall go over ways to cache results of R functions, so that you don’t need to burden machines and humans.
  • Caching: what is it and why use it? Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.
  • Now, why use caching? * It might help save time. * It might help save other resources of users such as money: e.g. if the function calls a web API whose pricing depends on the number of hits. 😅 * It might be more polite. That’s similar to the second item but from the perspective of e.g. a web API you keep hitting when you could have saved the result. The polite package for polite webscraping caches results.
  • Caching can be about results of functions, but also about user inputs that won’t change during the session and that you don’t want to ask for every time you need them (being polite!). It can be per-session caching but also persistent caching. As an example, reticulate asks only once whether you want to install miniconda: if you say no, it stores your answer locally and does not ask again. (See the internal miniconda_install_prompt()).
  • The memoise package The memoise package by Jim Hester is easy to use. Say we want to cache a function that only sleeps. The second call to sleep() is much quicker because, well, it does not call the .sleep() function so there’s no sleep.
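A minimal sketch of the idea described above, assuming the memoise package is installed. The names .sleep() and sleep() come from the text; the body of .sleep() and the one-second duration are assumptions made for illustration.

```r
# Assumed body: a function that does nothing but sleep for one second.
.sleep <- function() {
  Sys.sleep(1)
  "done sleeping"
}

# memoise::memoise() returns a version of .sleep() that caches its results.
sleep <- memoise::memoise(.sleep)

first <- system.time(sleep())   # actually runs .sleep(), so it takes about 1 second
second <- system.time(sleep())  # read from the cache, so .sleep() is not called

# memoise::forget() clears the cache of a memoised function.
memoise::forget(sleep)
```

After forget(), the next call to sleep() runs .sleep() again.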
  • Function factory Now what if you want simple memoization and no dependency on the memoise package? In that case you might be interested in creating some sort of function factory. See the example below, with who_am_i_impl() the function factory. We use it to create the who_am_i() function whose results are then stored for the session.
  • This is a function factory (a function creating a function); the function it returns is a closure that keeps the cache in its enclosing environment. It takes any function and makes it cache its result in a list, keyed by the value of the first argument (here arg) when that value is a string. If the memoized function is called again with the same first argument arg, the result is retrieved from the list instead of the function being executed again.
  • In the Advanced R book by Hadley Wickham such function factories are called stateful functions that “allow you to maintain state across function invocations”.
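The mechanism described above can be sketched in base R as follows. This is not the post's original code: memoize_by_first_arg(), slow_upper() and fast_upper() are hypothetical names, and the sketch assumes that only the first argument matters for the cache key (any further arguments are passed on but ignored when looking up the cache).

```r
# Hypothetical function factory: returns a version of `fun` that caches
# its results in a list, keyed by the first argument when it is a string.
memoize_by_first_arg <- function(fun) {
  cache <- list()
  function(arg, ...) {
    is_string <- is.character(arg) && length(arg) == 1L &&
      !is.na(arg) && nzchar(arg)
    # Anything that is not a single non-empty string is never cached.
    if (!is_string) {
      return(fun(arg, ...))
    }
    if (!arg %in% names(cache)) {
      # `<<-` updates the list in the enclosing environment of the closure.
      # Wrapping in list() lets a NULL result be stored too.
      cache[arg] <<- list(fun(arg, ...))
    }
    cache[[arg]]
  }
}

# Usage: count how many times the underlying function really runs.
calls <- 0
slow_upper <- function(arg) {
  calls <<- calls + 1
  toupper(arg)
}
fast_upper <- memoize_by_first_arg(slow_upper)

fast_upper("hello")  # runs slow_upper()
fast_upper("hello")  # retrieved from the list
calls                # 1
```

The cache lives only as long as fast_upper() does, so it disappears with the R session.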
  • Saving results in an environment This is well suited for package development where: * You create an environment internal to your package where you store a value; * You can then store in it any computed value for the current R session where the package is loaded.
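A minimal base R sketch of that pattern, under stated assumptions: the environment name the, and the functions compute_answer(), get_answer() and clear_answer(), are hypothetical and only illustrate the two steps listed above.

```r
# Hypothetical internal environment, created once when the package is loaded.
the <- new.env(parent = emptyenv())

# Hypothetical expensive computation whose result is stable within a session.
compute_answer <- function() {
  Sys.sleep(0.1)
  42
}

# Compute the value on first use, then read it back from the environment.
get_answer <- function() {
  if (!exists("answer", envir = the, inherits = FALSE)) {
    assign("answer", compute_answer(), envir = the)
  }
  get("answer", envir = the, inherits = FALSE)
}

# Let users reset the cached value without restarting R.
clear_answer <- function() {
  if (exists("answer", envir = the, inherits = FALSE)) {
    rm(list = "answer", envir = the)
  }
  invisible(NULL)
}

get_answer()  # computed and stored
get_answer()  # read from the environment
```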
  • This example would be simpler if using rlang::env_cache() by Lionel Henry (in the development version of rlang at the time of writing).
  • Storing on disk? For persistent caching across R sessions you will need to store function results on disk. On that topic see also the R-hub blog post on persistent data and config for R packages. Where to store results on disk? Best practice is to use a user data directory, obtained via the rappdirs package or, from R version 4.0, via tools::R_user_dir(). You might also see caching local to the project, e.g. what httr::oauth2.0_token() does; in that case the .gitignore file is edited as well, since the cached result is a secret!
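As a rough sketch of persistent caching with base R only: the package name "mypkg", the helpers cached_on_disk() and clear_disk_cache(), and the choice of which = "cache" are all assumptions made for illustration. Keys are assumed to be valid file names.

```r
# Hypothetical helper: return the value stored under `key` if it exists on
# disk, otherwise call `compute()` and save its result as an .rds file.
cached_on_disk <- function(key, compute,
                           dir = tools::R_user_dir("mypkg", which = "cache")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Hypothetical helper to clear the cache, since restarting R will not do it.
clear_disk_cache <- function(dir = tools::R_user_dir("mypkg", which = "cache")) {
  unlink(dir, recursive = TRUE)
  invisible(NULL)
}

# Usage with a temporary directory, so nothing is written to the user's
# real cache directory while trying this out.
demo_dir <- file.path(tempdir(), "mypkg-cache-demo")
cached_on_disk("answer", function() 42, dir = demo_dir)  # computed and saved
cached_on_disk("answer", function() 42, dir = demo_dir)  # read from disk
clear_disk_cache(demo_dir)
```

In a real package the default directory would be used, and the stored files survive across R sessions until the cache is cleared.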
  • Packages for caching For further tooling around caching beside the memoise package and base R functions, refer to these packages (and their reverse dependencies!): * storr by Rich FitzJohn. Creates and manages simple key-value stores. These can use a variety of approaches for storing the data. This package implements the base methods and support for file system, in-memory and DBI-based database stores. * R.cache by Henrik Bengtsson. Fast and Light-Weight Caching (Memoization) of Objects and Results to Speed Up Computations. * cachem by Winston Chang. Key-value stores with automatic pruning. Caches can limit either their total size or the age of the oldest object (or both), automatically pruning objects to maintain the constraints.
  • Documenting caching If your package uses caching, * document that; * and also provide ways to clear the cache (see e.g. opencage docs); this is especially crucial for persistent caching, as it would not be enough to simply tell the user to restart the R session.
  • When not to cache We can’t end this post without a few words of caution. Here are three cases when it’d be bad to cache: * The gains in time and other resources are not worth the increased complexity. You decide what’s worth it. Think of future collaborators, some of whom might encounter caching for the first time. * The results of a function with the same input might change. E.g. the function you call gives you the current time. Or you call a web API whose data is updated very regularly (although in that case rather than not caching you might want to look into the validity time of your cache). * The function should not be called several times to begin with. I.e., do not use caching as a band-aid for bad code design.
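To illustrate what a validity time could look like, here is a minimal base R sketch. It is not taken from the post: make_expiring(), its max_age argument (in seconds) and the example functions are hypothetical, and the wrapped function is assumed to take no argument.

```r
# Hypothetical function factory: cache the result of `fun`, but recompute it
# once the stored value is older than `max_age` seconds.
make_expiring <- function(fun, max_age = 60) {
  value <- NULL
  stored_at <- NULL
  function() {
    now <- Sys.time()
    expired <- is.null(stored_at) ||
      as.numeric(difftime(now, stored_at, units = "secs")) > max_age
    if (expired) {
      value <<- fun()
      stored_at <<- now
    }
    value
  }
}

# Usage: the current time is only refreshed once the cached value is
# more than an hour old.
cached_time <- make_expiring(function() Sys.time(), max_age = 3600)
t1 <- cached_time()
t2 <- cached_time()
identical(t1, t2)  # TRUE while the cached value is still valid
```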
  • We have not covered other types of caching relevant for R users: caching for R Markdown, caching for Shiny, caching in projects via the use of the targets package (or its superseded predecessor drake). Lots to explore based on your use case! 😉 Have you used caching in one of your packages or scripts? What tool did you use?