The Store - Caching and (re-)using simulator results with SWYFT

Caching and reusing simulator results is central to how SWYFT works. Reuse is possible both within a single inference problem and between different experiments, provided the simulator used (including all its settings) is the same. It is the responsibility of the user to ensure that the simulator is consistent between experiments that share a store.

To this end SWYFT incorporates a Store class with two main implementations: a memory store, which holds data in the main memory, and a directory store, which saves data in files written to disk. Here we demonstrate the use of these stores.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
# DON'T FORGET TO ACTIVATE THE GPU when on google colab (Edit > Notebook settings)
from os import environ
GOOGLE_COLAB = "COLAB_GPU" in environ
if GOOGLE_COLAB:
    !pip install git+https://github.com/undark-lab/swyft.git
[3]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import os

import swyft

We again begin by defining some parameters, a toy simulator, and a prior.

[4]:
# Set randomness
np.random.seed(25)
torch.manual_seed(25)

# cwd
cwd = os.getcwd()

# swyft
device = 'cpu'
n_training_samples = 3000
n_parameters = 2
observation_key = "x"
[5]:
def model(v, sigma = 0.01):
    x = v + np.random.randn(n_parameters)*sigma
    return {observation_key: x}

v_o = np.zeros(n_parameters)
observation_o = model(v_o, sigma = 0.)

n_observation_features = observation_o[observation_key].shape[0]
observation_shapes = {key: value.shape for key, value in observation_o.items()}
[6]:
simulator = swyft.Simulator(
    model,
    n_parameters,
    sim_shapes=observation_shapes,
)

low = -1 * np.ones(n_parameters)
high = 1 * np.ones(n_parameters)
prior = swyft.get_uniform_prior(low, high)

store = swyft.Store.memory_store(simulator)
# Drawing samples from the store is Poisson distributed.
# Simulating slightly more than we need avoids attempting to draw more than we have.
store.add(n_training_samples + 0.01 * n_training_samples, prior)
store.simulate()
Creating new store.
Store: Adding 2993 new samples to simulator store.

The memory store

The memory store is SWYFT’s simplest store option: it holds all results in main memory using zarr.

An empty store can be instantiated as follows, requiring only the specification of an associated simulator.

[7]:
store = swyft.Store.memory_store(simulator)
Creating new store.

Parameters drawn according to the specified prior can then be added to the store with the add method.

NOTE: the store adds a Poisson-distributed number of samples, with the requested number as its parameter. When samples are drawn from the store, the number drawn is also Poisson-distributed. If more samples are drawn than exist within the store, an error is raised. To avoid this issue, add more samples to the store than you intend to draw from it.

[8]:
# Drawing samples from the store is Poisson distributed.
# Simulating slightly more than we need avoids attempting to draw more than we have.
store.add(n_training_samples + 0.01 * n_training_samples, prior=prior)
Store: Adding 2969 new samples to simulator store.
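
To get a feeling for how large a safety margin is useful, the chance that a Poisson-distributed draw exceeds a given number of stored samples can be estimated directly. The following is a minimal sketch using only the Python standard library. The helper `prob_draw_exceeds` is hypothetical and not part of SWYFT, and it assumes for simplicity that the number of stored samples is fixed rather than itself random.

```python
import math


def poisson_log_pmf(k, lam):
    # Logarithm of the Poisson probability mass function; working in log
    # space avoids overflow and underflow for large lam.
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def prob_draw_exceeds(stored, lam):
    # P(X > stored) for X ~ Poisson(lam), i.e. the chance that a
    # Poisson-distributed request asks for more samples than are stored.
    # Hypothetical helper for illustration only.
    cdf = math.fsum(math.exp(poisson_log_pmf(i, lam)) for i in range(stored + 1))
    return max(0.0, 1.0 - cdf)


n_training_samples = 3000
for margin in (0.0, 0.01, 0.05, 0.10):
    stored = int(n_training_samples * (1 + margin))
    p = prob_draw_exceeds(stored, n_training_samples)
    print(f"margin {margin:4.0%}: stored={stored}, P(draw > stored)={p:.3g}")
```

The probability falls quickly as the margin grows, because the standard deviation of a Poisson variable with parameter 3000 is only about 55.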

It is also possible to check whether entries in the store require simulator runs using

[9]:
needs_sim = store.requires_sim()
needs_sim
[9]:
True

Similarly, an overview of the exact simulation status of all entries can be obtained using

[10]:
store.get_simulation_status()
[10]:
array([0, 0, 0, ..., 0, 0, 0])

where a value of 0 corresponds to "not yet simulated".

The required simulations can then be run using the store’s simulate method.

[11]:
store.simulate()

Afterwards, all simulations have been run, and their status in the store has been updated (2 corresponds to successfully simulated).

[12]:
store.requires_sim()
[12]:
False
[13]:
store.get_simulation_status()
[13]:
array([2, 2, 2, ..., 2, 2, 2])
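
For larger stores, the truncated array printout is not very informative, and a count per status code can be easier to read. The sketch below tallies the codes in a status sequence. Only the meanings of 0 and 2 are stated in this notebook, so any other code is reported as unknown. The helper `summarize_status` is hypothetical and not part of SWYFT.

```python
from collections import Counter

# Only these two codes are described in this notebook.
STATUS_LABELS = {0: "not yet simulated", 2: "successfully simulated"}


def summarize_status(status):
    # Count how often each status code occurs and attach a readable label.
    counts = Counter(int(s) for s in status)
    return {
        STATUS_LABELS.get(code, f"unknown code {code}"): n
        for code, n in sorted(counts.items())
    }


print(summarize_status([0, 0, 2, 2, 2]))
```

In the notebook, the same helper could be applied to the array returned by store.get_simulation_status().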

Sample re-use and coverage

SWYFT’s store enables reuse of simulations. To check which fraction of a required number of samples can be reused, inspect the coverage of the store for the desired prior, i.e. the fraction of the desired number of samples to be drawn from the specified prior that is already available in the store. This can be done as follows.

[14]:
store.coverage(2*n_training_samples, prior=prior)
[14]:
0.5050005
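
The value above is numerically consistent with a simple ratio: the number of samples requested earlier (n_training_samples plus 1%) divided by the number requested now. The short calculation below reproduces this. It is an observation about the numbers in this notebook, not a description of how SWYFT computes coverage internally, and the variable names are illustrative.

```python
n_training_samples = 3000

# Number passed to store.add earlier in this notebook.
already_requested = n_training_samples + 0.01 * n_training_samples
# Number passed to store.coverage above.
now_requested = 2 * n_training_samples

ratio = already_requested / now_requested
# Under this reading, the expected number of samples still missing.
expected_missing = now_requested - already_requested

print(ratio, expected_missing)
```

The small difference between this ratio and the reported value may simply reflect floating-point precision.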

Adding a specified number of samples to the store then becomes a question of adding the missing number.

[15]:
store.add(2*n_training_samples, prior=prior)
Store: Adding 2934 new samples to simulator store.

These, however, do not yet have associated simulation results.

[16]:
store.requires_sim()
[16]:
True
[17]:
store._get_indices_to_simulate()
[17]:
array([2969, 2970, 2971, ..., 5900, 5901, 5902])

Saving and loading

A memory store can also be saved, i.e. serialized to disk as a directory store, using the save method which takes the desired path as an argument,

[18]:
store.save(cwd+'/SavedStore')

and be loaded into memory by specifying the path to a directory store and a simulator

[19]:
store2 = swyft.Store.load(cwd+'/SavedStore', simulator=simulator).to_memory()
Loading existing store.
Loading existing store.
[20]:
store2._get_indices_to_simulate()
[20]:
array([2969, 2970, 2971, ..., 5900, 5901, 5902])

The directory store

In many cases, running an instance of a simulator may be quite computationally expensive. For such simulators SWYFT’s ability to support reuse of simulations across different experiments is of paramount importance.

SWYFT provides this capability in the form of the directory store, which serializes the store to disk using zarr and keeps it up-to-date with regard to requested samples and parameters.

A directory store can be instantiated via the Store.directory_store() convenience method by providing a path and a simulator as arguments. In order to open an existing store, Store.load() can be employed.

[21]:
dirStore = swyft.Store.load(cwd+'/SavedStore')
Loading existing store.

While a simulator must be specified via the simulator keyword when a new directory store is instantiated, an existing store can be loaded without a simulator, and the simulator can be set afterwards.

[22]:
dirStore.set_simulator(simulator)

Updating on disk

We now briefly demonstrate the difference between a directory store and a memory store which has been loaded from an existing directory store.

In the example above, dirStore and store2 are currently equivalent in content. In dirStore we will now add simulations for half of the samples that currently lack them.

[23]:
all_to_sim = dirStore._get_indices_to_simulate()
sim_now = all_to_sim[0:int(len(all_to_sim)/2)]
dirStore.simulate(sim_now)

Here we have made use of the ability to explicitly specify the indices of samples to be simulated.
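
The slicing used in the cell above can be checked with plain Python. The sketch below assumes that the indices still to simulate form one contiguous range from 2969 to 5902, as the truncated output shown earlier suggests. This is an assumption, since the printed array elides the middle entries.

```python
# Assumed contiguous range of indices lacking simulations (2969 .. 5902).
all_to_sim = list(range(2969, 5903))

# Integer division gives the same split point as int(len(all_to_sim) / 2).
half = len(all_to_sim) // 2
sim_now = all_to_sim[:half]      # indices passed to simulate
remaining = all_to_sim[half:]    # indices left without simulations

print(len(all_to_sim), len(sim_now), remaining[0], remaining[-1])
```

Under this assumption, the first index left without a simulation result is the start of the range plus the number of indices simulated.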

The remaining samples lacking simulation results in the dirStore are now

[24]:
dirStore._get_indices_to_simulate()
[24]:
array([4436, 4437, 4438, ..., 5900, 5901, 5902])

i.e. the store has been updated on disk, while in comparison the samples lacking simulation results in store2 are still

[25]:
store2._get_indices_to_simulate()
[25]:
array([2969, 2970, 2971, ..., 5900, 5901, 5902])
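
The behaviour seen here, where an in-memory copy does not follow later changes on disk, is not specific to SWYFT. The sketch below reproduces the same pattern with a small JSON file standing in for the on-disk store. It uses only the standard library, does not involve zarr or SWYFT, and the file name is made up for the example.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "status.json")  # hypothetical stand-in for a store on disk

    # Write an initial status list: two simulated entries, two pending.
    with open(path, "w") as f:
        json.dump([2, 2, 0, 0], f)

    # Load a snapshot into memory (analogous to a memory store loaded from disk).
    with open(path) as f:
        snapshot = json.load(f)

    # Update the data on disk (analogous to simulating through the directory store).
    with open(path) as f:
        on_disk = json.load(f)
    on_disk[2] = 2
    with open(path, "w") as f:
        json.dump(on_disk, f)

    # Re-reading from disk shows the update; the earlier snapshot does not.
    with open(path) as f:
        reread = json.load(f)

print(snapshot, reread)
```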

Asynchronous usage

In contrast to the memory store, the directory store also supports asynchronous usage: when simulations are requested, control returns immediately, and the simulations and the updating of the store happen in the background.

This is particularly relevant for long-running simulators and parallelization using Dask, as is showcased in a separate notebook.

Here, as a small example, we simply add further samples to the store and then execute the associated simulations without waiting for the results.

[26]:
dirStore.add(5*n_training_samples,prior=prior)
Store: Adding 9064 new samples to simulator store.
[27]:
dirStore.simulate(wait_for_results=False)
[28]:
print('control returned')
control returned
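
The general idea of submitting work and getting control back immediately can be illustrated without SWYFT or Dask. The following sketch uses a thread pool from the Python standard library as a stand-in. It is an analogy for the pattern only, not a description of how the directory store is implemented, and `slow_simulator` is a made-up placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_simulator(v):
    # Placeholder for an expensive simulation.
    time.sleep(0.05)
    return v * 2


executor = ThreadPoolExecutor(max_workers=2)

# submit() returns a future immediately; the work runs in the background.
futures = [executor.submit(slow_simulator, v) for v in range(4)]
print("control returned")

# Results are collected later; result() blocks until each one is ready.
results = [f.result() for f in futures]
executor.shutdown()
print(results)
```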