Performance: Friendly Numba - laser-base/laser-core GitHub Wiki
A useful pattern for small-scale (≤ 100,000 agents) modeling with NumPy is "query-process": create a boolean mask of the agents matching our selection criteria (the query) and apply changes to those agents (the process). For example:
```python
import numpy as np

velocity = np.random.rand(10_000_000) * 100  # 10^7 agents with velocity in [0, 100)
speeders = velocity >= 80   # reads all of velocity, creating a 10,000,000-element boolean array (~10 MB)
velocity[speeders] -= 20    # iterates over velocity a second time; the last use of speeders
```

In this case the contents of `speeders` are ephemeral: calculated, used, and discarded (the mask is out of date as soon as `velocity` is updated).
At larger scales we want to avoid the query-process approach, with its temporary allocations and repeated passes over the data. The preferred approach, perhaps counter-intuitively, is to return to explicit `for` loops:
```python
for i in range(len(velocity)):
    if velocity[i] >= 80:
        velocity[i] -= 20
```

But this approach is painfully slow in pure Python (20x-40x slower than NumPy). Fortunately, extracting this snippet into its own function and applying the Numba `njit` decorator gets us back to NumPy performance and then some (up to 4.5x better than NumPy):
```python
from numba import njit

@njit
def reign_in_speeders(velocity):
    for i in range(len(velocity)):
        if velocity[i] >= 80:
            velocity[i] -= 20
    return
```

The real magic, though, is to use all the available CPU cores in parallel for our Numba-fied query-process function, which we do by adding the `parallel=True` argument to the `njit` decorator and replacing `range` with `prange`:
```python
from numba import njit, prange

@njit(parallel=True)
def reign_in_speeders(velocity):
    for i in prange(len(velocity)):
        if velocity[i] >= 80:
            velocity[i] -= 20
    return
```

Processing in parallel can be very advantageous: expect a speedup of roughly 0.75 × the number of cores, depending on hardware.
A common operation or question in LASER models is "How many agents with this property are in each location?" Let's call this select-and-bin.
The straightforward NumPy approach would be like this:
```python
deciles = (velocity / 10).astype(int)
bin_count = np.bincount(deciles)
```

But, again, we would like to use all our CPU cores in parallel. A naïve approach using `prange` won't work, however, because the basic method of updating the count in each bin, `bin_count[index] += 1`, doesn't produce a correct result when done in parallel. Note that `bin_count[index] += 1` is really shorthand for `bin_count[index] = bin_count[index] + 1`: read the current value of `bin_count[index]`, add 1, and store the new value back. If two or more CPU cores find agents with the same index at the same time, they can all read the same value of `bin_count[index]`, increment it by 1, and write it back, losing one or more counts. We deal with this race condition with the following pattern:
- allocate thread local storage, i.e., accumulators for each CPU core (thread)
- iterate in parallel over the population
- merge the results from the various CPU cores (threads) into a unified result
The Python looks like this:
```python
import numpy as np
import numba as nb
from numba import njit, prange

@njit(parallel=True)
def parallel_bincount(num_bins, property):
    # Allocate num_bins counters for each thread/core
    tls = np.zeros((nb.get_num_threads(), num_bins), np.int32)
    for i in prange(len(property)):  # iterate in parallel over agents
        tid = nb.get_thread_id()     # get _this_ thread's id
        index = property[i]          # get the bin index from the property array
        tls[tid, index] += 1         # increment the count in _this_ thread's row
    return tls.sum(axis=0)           # aggregate counts by summing across the thread id axis
```

A parallel implementation of `np.bincount()` with weights would look like this:
```python
import numpy as np
import numba as nb
from numba import njit, prange

@njit(parallel=True)
def parallel_bincount(num_bins, property, weights):
    # Allocate num_bins accumulators for each thread/core
    tls = np.zeros((nb.get_num_threads(), num_bins), np.float32)
    for i in prange(len(property)):    # iterate in parallel over agents
        tid = nb.get_thread_id()       # get _this_ thread's id
        index = property[i]            # get the bin index from the property array
        tls[tid, index] += weights[i]  # accumulate the weight in _this_ thread's row
    return tls.sum(axis=0)             # aggregate weights by summing across the thread id axis
```

The implementations above consider all agents in the population. Implementing the "select" portion of "select-and-bin" means adding the properties to test as arguments and an appropriate `if` clause so that only selected agents contribute to the bin count or weight sum.
An interesting variant on `np.bincount()`, with and without weights, is getting the average weighted value per bin, which combines the count functionality with the weighted functionality in a single pass. Example here: thread-local storage is allocated and summed outside the Numba-jitted function.
Note: this functionality will soon be in laser-core as `pbincount()`.