EULA - laser-base/laser-core GitHub Wiki
EULA
Summary
EULA stands for Epidemiologically Uninteresting Light Agents. These can be 90% of the total population. The idea is to stop actively modeling agents who are permanently immune without completely losing track of them. Even with technology that already makes modeling faster, we think we can get roughly a 10x speedup by moving these agents out of the actively modeled population. And depending on the EULA approach, we may also be able to save memory. We've explored several options:
- Default/Nothing. Model them and work around them. This is the default.
- Delete them. (And remember the true population for reporting.) This obviously isn't a long-term solution if you are using transmission math that factors in the total human population. But as a quick test it confirms the speedup benefits across a variety of modeling technology choices (from SQL to NumPy to C).
- Downsample to N (+mcw). We deleted them but added a row to each node representing the EULA population, and we added a property "mcw" -- Monte Carlo weight -- to each agent. All the modeled agents have a weight of 1 and the EULA row has a weight equal to the number of EULAs it represents. This has the benefit of keeping the whole population in one place with just one way of doing things. But the EULAs need values for all the other attributes. Some, like immunity and immunity_timer, are easy, but others, like age, need synthetic values. We also need to make sure all our other code uses logic that doesn't actually operate on the EULAs. It can be tricky to always be writing model code and remembering that there are "dummy agents" that need to be ignored. Also, we were initially doing fertility by counting the number of women of childbearing age. That became impossible with this level of downsampling, so we switched to fertility by crude birth rate. Which is perfectly reasonable, but it might be nice not to rule out other, more detailed ways of modeling births. Also, we had started doing mortality by giving everyone a pre-calculated lifespan, but with downsampled EULAs we had to switch to a statistical way of doing natural mortality. Which is surprisingly expensive.
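The lumping step above can be sketched as follows. This is a hypothetical illustration, not laser-core code; the column names (`node`, `age`, `state`) and the use of pandas are assumptions, and the synthetic mean age stands in for whatever placeholder values the real attributes get.

```python
# Sketch of "downsample to N (+mcw)": collapse all EULA (recovered) agents
# in each node into a single row whose Monte Carlo weight (mcw) is the
# number of agents it represents. Column names are illustrative.
import pandas as pd

def downsample_eulas(pop: pd.DataFrame) -> pd.DataFrame:
    modeled = pop[pop["state"] != "R"].copy()
    modeled["mcw"] = 1  # every actively modeled agent keeps weight 1

    eulas = pop[pop["state"] == "R"]
    # one synthetic agent per node; age gets a synthetic value (the mean)
    lumped = (
        eulas.groupby("node")
        .agg(age=("age", "mean"), mcw=("age", "size"))
        .reset_index()
    )
    lumped["state"] = "R"
    return pd.concat([modeled, lumped], ignore_index=True)
```

The total of the `mcw` column still equals the original population, which is what the transmission math needs.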
- Downsample by yearly age bin (+mcw). An improvement on the above was to downsample less aggressively -- have an "mcw agent" for each yearly age bin (for each node). Using a EULA threshold age of 5, that's about 85 rows. With 1,000 nodes -- not yet tried personally -- that's 85,000 rows. Not nothing. This approach lets us do age-specific statistical mortality and also count the number of women of childbearing age. But we're starting to get the worst of both worlds: more complexity for the modeler and some cost in the total number of rows.
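The per-age-bin variant is the same lumping, just grouped one level finer. Again a hypothetical sketch with assumed column names, not the project's actual code:

```python
# Sketch of "downsample by yearly age bin (+mcw)": instead of one weighted
# row per node, keep one per (node, yearly age bin).
import pandas as pd

def lump_eulas_by_age(eulas: pd.DataFrame) -> pd.DataFrame:
    """eulas: one row per EULA agent, with 'node' and 'age' columns."""
    eulas = eulas.copy()
    eulas["age_bin"] = eulas["age"].astype(int)  # truncate to yearly bins
    return (
        eulas.groupby(["node", "age_bin"])
        .size()
        .rename("mcw")
        .reset_index()
    )
```

Because each row now carries an age bin, age-specific mortality and childbearing-age counts become possible again, at the cost of ~85 rows per node.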
- Separate. At this point we switched to the idea of putting the EULA pop into its own table/dataframe/set of vectors. We actually stopped downsampling at all. So we're using the same amount of RAM as we started with, but we get almost all the perf benefits of deletion and the coding simplicity of not having to remember that there are "dummy agents" in our table. The downside is we now have two ways of doing things. The modeler has to remember that when counting how many people are in a node, they have to go to two places. Is 2 just one more than 1, or twice as many as 1? :) If there are 2 places, is there a third place? I will also add a note here that so far all natural mortality is from R->D. But by retaining all the agents with no downsampling, we can go back to pre-calculated lifespans and kill people off the same way from both tables. But we can't really go back to population recycling, because the vast majority of deaths come from the EULA table/arrays while all the births go into the regular table/arrays.
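The "two places" bookkeeping looks roughly like this. A minimal sketch, assuming each table stores a node id per agent; the function name and array layout are illustrative:

```python
# With separated tables, any population query must sum over both the
# actively modeled table and the EULA table.
import numpy as np

def node_population(active_node_ids, eula_node_ids, num_nodes):
    """Per-node headcount across both tables (node ids are integer arrays)."""
    active = np.bincount(active_node_ids, minlength=num_nodes)
    eula = np.bincount(eula_node_ids, minlength=num_nodes)
    return active + eula
```

Every place the model needs a true population (e.g. the dilution term in transmission) has to call something like this rather than counting one table.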
- Pre-process and Separate. Since separation seemed to be a good idea, net net, it made sense to pre-process the population and do the population split before running the model. This adds some complexity in the sense that running the model becomes a multi-step activity, but longstanding tools like "make" can handle this easily. But maybe that's adding a new dependency and tool for some users? Should these intermediate files be retained? Does that feel like clutter to the end user?
- Database. We added a pre-processing step of converting the EULA csv to a sqlite db. The code then does the population-counting and mortality-updating queries in a potentially more RAM-parsimonious way. This turns out not to be performant for the numpy and C-accelerated numpy models, compared to simply storing the counts in a dictionary. One could use a dict to cache the initial query and maybe get the best of both worlds.
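The "cache the initial query" compromise might look like this. A sketch only; the table and column names are assumptions about the converted db:

```python
# Pull the per-node EULA counts out of the sqlite db a single time, then
# serve every later lookup from a plain dict instead of re-querying.
import sqlite3

def load_eula_counts(con: sqlite3.Connection) -> dict:
    rows = con.execute(
        "SELECT node, COUNT(*) FROM eula GROUP BY node"
    ).fetchall()
    return dict(rows)  # {node_id: eula_count}, queried once per run
```

Mortality updates would then decrement the dict and only write back to sqlite if persistence is needed.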
- Componentization. Finally, we note that we moved the two surviving EULA implementations into standalone submodules. These are very small dedicated "scripts" that have the same name and same function signature, but live in separate directories. This lets the user do "from sql_model import eula" or "from numpy_model import eula" with no other changes to code or settings. Not sure if that's preferable to changing config parameters.
UPDATE: Mortality
We have been handling the "Initial EULA Population" by identifying agents older than eula_age during preprocessing and representing them as a dataframe/table of initial populations by node and age bin, in units of years. So for a 60-node simulation, with EULA at age 5 and assuming 100 as the max age, we have 95 age bins and 60 nodes = 5,700 "cells".
Note that in a pop=100m measles sim, we could have 95m EULA agents. The only thing we really need to do with this pop is age & kill them so the populations are right over time for transmission math (dilution).
With our 5,700-bucket population, we have been killing them off using a plausible age-based mortality curve -- with the probability of dying at each age pre-calculated and returned LUT-style. And we return the number of expected deaths per node (summing over all the age bins and timesteps). Since this is a little expensive, we only do it once a month.
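The monthly mortality step described above can be sketched as below. The exponential curve is purely a placeholder, not the project's actual mortality data, and the array shapes are assumptions:

```python
# Bucketized mortality: a pre-computed lookup table of monthly death
# probability per yearly age bin, applied to the (node x age-bin)
# population matrix once a month.
import numpy as np

rng = np.random.default_rng(0)
ages = np.arange(5, 100)                      # 95 yearly EULA age bins
monthly_p_death = 1e-4 * np.exp(0.07 * ages)  # placeholder LUT

def monthly_deaths(pop_by_node_age: np.ndarray) -> np.ndarray:
    """pop_by_node_age: (nodes, age_bins) counts. Returns deaths per node."""
    deaths = rng.binomial(pop_by_node_age, monthly_p_death)  # draw per cell
    return deaths.sum(axis=1)
```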
This is tolerable from a perf pov but it's the rate-limiting step right now (outside of transmission itself) and should be sped up.
It occurred to me that recalculating these node-and-time populations every time -- for each sim -- is silly. Except for some stochasticity -- which is arguably inconsequential -- the EULA population for a given node over time, given a certain initial pop and mortality curve, is unchanged from sim to sim. It should probably be calculated once -- maybe even ahead of time -- and those values reused and returned each time.
There are 3 ways to do this that come to mind:

- Calculate a 3D table of data: 5700 * 12 * 20 = 1,368,000 data points. UPDATE: Actually the data we need to store doesn't need to be age-bucketized, and we don't have to store every timestep. 1b) Fit a curve to the data and interpolate.
- Do the math in real time.
- Train a NN on the dataset and use the trained network as an emulator.

We propose to quickly investigate all 3 of these solutions and compare and contrast them.
Update: What I've done is 1b. I ran the model standalone with just the population & mortality math (space = 60 nodes, time = 20 years) and wrote the populations to a csv file. I post-processed that so the data was aggregated across the age bins, since we don't need the age bins themselves for the disease modeling. (Yes, those first 2 steps could be combined into 1.) I fit 60 curves and saved the fit data as an .npy file. Then in the disease model itself, I load the fit data once at the beginning and get the population of each node at each timestep essentially by calling "predict" on the fitted model for that node. Seems to work great.
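Condensed, the 1b workflow looks something like this. The polynomial degree and the use of `np.polyfit` are illustrative choices standing in for whatever fitting the real pipeline uses:

```python
# Fit one curve per node to the age-aggregated population time series,
# save the coefficients, and later "predict" instead of re-simulating.
import numpy as np

def fit_node_curves(pop: np.ndarray, deg: int = 3) -> np.ndarray:
    """pop: (timesteps, nodes) EULA population. Returns (deg+1, nodes) coeffs."""
    t = np.arange(pop.shape[0])
    return np.polyfit(t, pop, deg)     # fits every node column at once

def predict_population(coeffs: np.ndarray, t: int) -> np.ndarray:
    return np.polyval(coeffs, t)       # population of every node at time t
```

The coefficient array is what gets saved to the .npy file (e.g. via `np.save`) and loaded once at model startup.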
I have simplified this workflow for a new model. Details to be added here.