Random sampling optimisation resources

Random sampling is a key part of probabilistic programming, Bayesian deep learning and text generation.

For example, for text generation with a 4-character model "abcd", the network might predict the next-character probabilities as [0.1, 0.3, 0.4, 0.2], and we need to sample a character from this (multinomial) distribution.

Now imagine that we instead have a vocabulary of hundreds of thousands of words to sample from: this is where scalable performance becomes essential.
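As a minimal illustration of the toy example above (using Numpy's built-in weighted `choice`, which is revisited later in this page):

    import numpy as np

    # Toy next-character probabilities predicted by the network.
    vocab = np.array(['a', 'b', 'c', 'd'])
    probs = [0.1, 0.3, 0.4, 0.2]

    # Sample the next character from this (multinomial/categorical) distribution.
    next_char = np.random.choice(vocab, p=probs)
    print(next_char)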

Multinomial distribution - Weighted Random sampling

Use case

This is a critical distribution to optimise: it is the one used to sample the next word or character in NLP text generation. It is also used to give more weight to certain labels during training by sampling the training set with a skewed distribution (instead of scaling the gradient by a class weight). It is probably also the best way to implement stratified K-fold.
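As a rough sketch of the label-reweighting idea (the labels and weights below are purely illustrative), training indices can be sampled with per-class probabilities instead of scaling the gradient:

    import numpy as np

    # Illustrative, imbalanced toy training labels.
    labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

    # Weight each sample by the inverse frequency of its class,
    # then normalise into a probability distribution.
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]
    weights /= weights.sum()

    # Draw a batch: the rare class is now sampled as often as the common one.
    batch_idx = np.random.choice(len(labels), size=6, p=weights)
    print(batch_idx, labels[batch_idx])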

What is it

It is the generalisation of the binomial distribution to more than two outcomes: where a biased coin toss might have probabilities [0.25, 0.75], a multinomial distribution over four outcomes could be [0.1, 0.3, 0.4, 0.2].
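Sampling one index from a weight vector is a single draw from this generalised distribution (a categorical, i.e. a multinomial with one trial), which Numpy can illustrate directly:

    import numpy as np

    p = [0.1, 0.3, 0.4, 0.2]

    # One multinomial trial returns a one-hot count vector;
    # its argmax is the sampled index.
    index = np.argmax(np.random.multinomial(1, p))

    # Same distribution as sampling an index directly with weighted choice.
    index2 = np.random.choice(len(p), p=p)
    print(index, index2)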

Implementation

In research

Papers

Let's start with the main ones:

  • Linear search
  • Binary search on top of the cumulative distribution (see http://www.keithschwarz.com/darts-dice-coins/)
  • Alias method (a minimal sketch is given after the code below)
  • Using Fenwick trees (A Scalable Asynchronous Distributed Algorithm for Topic Modeling: state-of-the-art, scalable to billions of words, and parallel)
    • Data structure largely inspired by An Efficient Method for Weighted Sampling without Replacement, 1980, Wong et al.
    • Parallel Prefix Sum
    • Heaps for incremental computation, with example code:
      import numpy as np
      from numpy.random import uniform
      
      
      def update(S, k, v):
          "Update value position `k` in time O(log n)."
          d = S.shape[0]
          i = d//2 + k
          S[i] = v
          while i > 0:
              i //= 2
              S[i] = S[2*i] + S[2*i + 1]
      
      
      def sumheap(w):
          "Create sumheap from weights `w`."
          n = w.shape[0]
      
          # About the datastructure: Bottom-most level of the heap has size n' =
          # next-power-of-two(n). The number of internal nodes in the tree is n'-1. We
          # have a dummy node at position zero to make indexing math simpler. So, we
          # allocate twice the size of the bottom level to fit internal nodes. Thus,
          # the overall data structure is <4*n in the worst case because the next power
          # of two <2n and then we have another factor of two for internal nodes.
          d = int(2**np.ceil(np.log2(n)))
          S = np.zeros(2*d)
      
          # O(n) version (faster than calling update n times => O(n log n))
          S[d:d+n] = w
          for i in reversed(range(1, d)):
              S[i] = S[2*i] + S[2*i + 1]
      
          return S
      
      
      def check(S, i):
          "Check heap invariant."
          d = S.shape[0]
          if i >= d//2:   # only checks internal nodes.
              return
          assert S[i] == S[2*i] + S[2*i + 1]
          check(S, 2*i)
          check(S, 2*i + 1)
      
      
      def dump(S):
          "Print heap for debugging."
          for i in range(int(np.ceil(np.log2(len(S))))):
              print('depth', i, S[2**i:2**(i+1)])
      
      
      def sample(w, u):
          "Ordinary sampling method, O(n) to build heap, O(log n) per sample after that."
          c = w.cumsum()
          return c.searchsorted(u * c[-1])
      
      
      def hsample(S, u):
          "Sample from sumheap, O(log n) per sample."
          offset = S.shape[0]//2  # number of internal nodes.
          # random probe
          p = S[1] * u
          # Use binary search to find the index of the largest CDF (represented as a
          # heap) value that is less than a random probe.
          i = 1
          while i < offset:
              # Determine if the value is in the left or right subtree.
              i *= 2
              left = S[i]
              if p > left:
                  # Value is in right subtree. Subtract mass under left subtree.
                  p -= left
                  i += 1
          return i - offset
      
      
      def main():
          for n in np.random.choice(range(1, 100), size=10):
              print(n)
              w = np.round(uniform(0, 10, size=n), 1)
              S = sumheap(w)
              check(S, 1)
              for _ in range(100):
                  # test uses same random number because the methods should be identical up to ties.
                  u = uniform()
                  p1 = sample(w, u)
                  p2 = hsample(S, u)
                  assert p1 == p2
                  # change a random value in the weight array
                  c = np.random.randint(n)
                  v = uniform(0, 10)
                  w[c] = v
                  update(S, c, v)
                  check(S, 1)
      
      
      if __name__ == '__main__':
          main()
      

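For completeness, here is a minimal sketch of the alias method mentioned in the list above (Vose's variant, assuming a fixed weight vector): O(n) table construction, then O(1) per sample.

    import numpy as np

    def build_alias(p):
        "Vose's alias method: O(n) construction of the probability/alias tables."
        p = np.asarray(p, dtype=np.float64)
        n = p.size
        prob = np.zeros(n)                   # acceptance threshold per column
        alias = np.zeros(n, dtype=np.int64)  # fallback outcome per column
        scaled = p * n / p.sum()             # rescale so the average mass per column is 1

        small = [i for i in range(n) if scaled[i] < 1.0]
        large = [i for i in range(n) if scaled[i] >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s] = scaled[s]
            alias[s] = l
            # Hand the leftover mass of the large column back to a worklist.
            scaled[l] += scaled[s] - 1.0
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:              # remaining columns are (numerically) full
            prob[i] = 1.0
        return prob, alias

    def alias_sample(prob, alias, u1, u2):
        "O(1) per sample: pick a column uniformly, then a biased coin flip within it."
        i = int(u1 * prob.size)
        return i if u2 < prob[i] else int(alias[i])

    # Usage with the toy weights from the introduction.
    prob, alias = build_alias([0.1, 0.3, 0.4, 0.2])
    print([alias_sample(prob, alias, np.random.uniform(), np.random.uniform())
           for _ in range(10)])

Compared to the heap above, the trade-off is O(1) sampling but an O(n) rebuild whenever a weight changes, so it is best suited to static distributions.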
And a few curious ones:

Misc literature

Numpy

For `choice`, Numpy uses binary search on top of the cumulative distribution:

    def choice(self, a, size=None, replace=True, p=None):
        """
        Parameters
        ----------
        a : 1-D array-like or int
            If an ndarray, a random sample is generated from its elements.
            If an int, the random sample is generated as if a were np.arange(a).
        size : int or tuple of ints, optional
            Output shape.
        replace : boolean, optional
            Whether the sample is with or without replacement.
        p : 1-D array-like, optional
            The probabilities associated with each entry in a. If not given, the
            sample assumes a uniform distribution over all entries in a.
        Returns
        --------
        samples : single item or ndarray
            The generated random samples
        Examples
        ---------
        Generate a uniform random sample from np.arange(5) of size 3:
        >>> np.random.choice(5, 3)
        array([0, 3, 4])
        >>> #This is equivalent to np.random.randint(0,5,3)
        Generate a non-uniform random sample from np.arange(5) of size 3:
        >>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
        array([3, 3, 0])
        """

        ...

        # Actual sampling
        if replace:
            if p is not None:
                cdf = p.cumsum()
                cdf /= cdf[-1]
                uniform_samples = self.random_sample(shape)
                idx = cdf.searchsorted(uniform_samples, side='right')
                idx = np.array(idx, copy=False) # searchsorted returns a scalar
            else:
                idx = self.randint(0, pop_size, size=shape)
        else:
            if size > pop_size:
                raise ValueError("Cannot take a larger sample than "
                                 "population when 'replace=False'")

            if p is not None:
                if np.count_nonzero(p > 0) < size:
                    raise ValueError("Fewer non-zero entries in p than size")
                n_uniq = 0
                p = p.copy()
                found = np.zeros(shape, dtype=np.int)
                flat_found = found.ravel()
                while n_uniq < size:
                    x = self.rand(size - n_uniq)
                    if n_uniq > 0:
                        p[flat_found[0:n_uniq]] = 0
                    cdf = np.cumsum(p)
                    cdf /= cdf[-1]
                    new = cdf.searchsorted(x, side='right')
                    _, unique_indices = np.unique(new, return_index=True)
                    unique_indices.sort()
                    new = new.take(unique_indices)
                    flat_found[n_uniq:n_uniq + new.size] = new
                    n_uniq += new.size
                idx = found
            else:
                idx = self.permutation(pop_size)[:size]
                if shape is not None:
                    idx.shape = shape

Note there is also Numpy `multinomial`, which uses repeated binomial sampling, but that doesn't match our need.
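A quick illustration of the mismatch: `multinomial` returns per-category counts for a batch of trials, not the sampled indices themselves.

    import numpy as np

    # 100 trials over 4 outcomes: the result is a vector of counts
    # (e.g. array([11, 29, 42, 18])), not a sequence of sampled indices.
    counts = np.random.multinomial(100, [0.1, 0.3, 0.4, 0.2])
    print(counts)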

Implementation of searchsorted and Numpy binsearch

Equivalent CUDA code

PyTorch

PyTorch uses either the alias method or CDF + binary search.

Tensorflow

https://github.com/tensorflow/tensorflow/blob/125bf1dbb76c05bf5f88f14e77387ce35f986621/tensorflow/core/kernels/multinomial_op.cc

Reservoir sampling - one-pass sampling over the data

It is also possible to do weighted sampling in one pass over a stream of unknown length with a technique called reservoir sampling.
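As a sketch of one such scheme (the Efraimidis-Spirakis "A-Res" algorithm is assumed here; other weighted reservoir variants exist): each incoming item gets the key u^(1/w) with u uniform in (0, 1), and the k items with the largest keys are kept.

    import heapq
    import random

    def weighted_reservoir_sample(stream, k):
        "One-pass weighted sampling without replacement over (item, weight) pairs."
        heap = []  # min-heap of (key, item), size <= k
        for item, w in stream:
            key = random.random() ** (1.0 / w)  # larger weight => key closer to 1
            if len(heap) < k:
                heapq.heappush(heap, (key, item))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, item))
        return [item for _, item in heap]

    # Usage: higher-weight items are more likely to end up in the reservoir.
    stream = [('a', 0.1), ('b', 0.3), ('c', 0.4), ('d', 0.2)]
    print(weighted_reservoir_sample(stream, 2))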

Papers:

Articles:

Normal distribution

TODO

Importance sampling for deep learning