Memory Allocator of TensorFlow
This is a summary of memory management strategies in TensorFlow, centered on the BFC (best-fit with coalescing) memory allocator. The references are attached at the end.
Doug Lea Malloc
TensorFlow adopts Doug Lea's memory allocation / deallocation scheme. The algorithm is best-fit with coalescing: freed chunks are coalesced with neighboring ones and held in bins that are searched in size order. The two core elements of this algorithm are Boundary Tags and Binning:
Boundary Tags (coalescing via boundary tags)
Chunks of memory carry around with them size information fields both before and after the chunk. This allows for two important capabilities:
- Two bordering unused chunks can be coalesced into one larger chunk. This minimizes the number of unusable small chunks.
- All chunks can be traversed starting from any known chunk in either a forward or backward direction. The original versions implemented boundary tags exactly in this fashion. More recent versions omit trailer fields on chunks that are in use by the program. This is itself a minor trade-off: The fields are not ever used while chunks are active so need not be present. Eliminating them decreases overhead and wastage. However, lack of these fields weakens error detection a bit by making it impossible to check if users mistakenly overwrite fields that should have known values.
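To make the mechanism concrete, here is a minimal, illustrative sketch of a boundary-tagged chunk layout (not TensorFlow's or dlmalloc's actual data structures; field and function names are hypothetical). Each chunk records its own size plus the size of its predecessor, so a freed chunk can locate and merge with a free neighbor in constant time:
// Illustrative boundary-tag layout; names and layout are hypothetical, not dlmalloc's.
#include <cstddef>
#include <cstdint>
struct ChunkHeader {
  size_t prev_size;  // size of the chunk immediately before this one (boundary tag)
  size_t size;       // size of this chunk; low bit marks "in use"
};
constexpr size_t kInUseBit = 1;
inline size_t ChunkSize(const ChunkHeader* c) { return c->size & ~kInUseBit; }
inline bool InUse(const ChunkHeader* c) { return (c->size & kInUseBit) != 0; }
// The next chunk starts immediately after this one in memory.
inline ChunkHeader* NextChunk(ChunkHeader* c) {
  return reinterpret_cast<ChunkHeader*>(reinterpret_cast<uint8_t*>(c) + ChunkSize(c));
}
// The previous chunk is found via the boundary tag stored in this header.
inline ChunkHeader* PrevChunk(ChunkHeader* c) {
  return reinterpret_cast<ChunkHeader*>(reinterpret_cast<uint8_t*>(c) - c->prev_size);
}
// When a chunk is freed, merge it with a free right-hand neighbor
// (merging with a free left-hand neighbor via PrevChunk is symmetric).
inline void CoalesceWithNext(ChunkHeader* c) {
  ChunkHeader* next = NextChunk(c);
  if (!InUse(next)) {
    c->size = ChunkSize(c) + ChunkSize(next);   // one larger free chunk
    NextChunk(c)->prev_size = ChunkSize(c);     // fix the following chunk's boundary tag
  }
}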
Binning (best-fit via binning)
Available chunks are maintained in bins, grouped by size. There are a surprisingly large number (128) of fixed-width bins, approximately logarithmically spaced in size. Bins for sizes less than 512 bytes each hold only exactly one size (spaced 8 bytes apart, simplifying enforcement of 8-byte alignment). Searches for available chunks are processed in smallest-first, best-fit order. As shown by Wilson et al., best-fit schemes (of various kinds and approximations) tend to produce the least fragmentation on real loads compared to other general approaches such as first-fit.
Until the versions released in 1995, chunks were left unsorted within bins, so that the best-fit strategy was only approximate. More recent versions instead sort chunks by size within bins, with ties broken by an oldest-first rule. (This was done after finding that the minor time investment was worth it to avoid observed bad cases.)
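As an illustration of the binning scheme (the thresholds and bin count below are made up for the example and are not the exact values used by dlmalloc or by TensorFlow's BFC allocator), small requests can be mapped to exact-size bins spaced 8 bytes apart, while larger requests share approximately logarithmically spaced bins:
#include <cstddef>
// Hypothetical bin-index computation for a dlmalloc-style allocator.
int BinIndexForSize(size_t bytes) {
  if (bytes < 512) {
    return static_cast<int>(bytes / 8);           // exact-size small bins, 8 bytes apart
  }
  int index = 64;                                  // first of the "large" bins
  size_t bin_limit = 1024;                         // bin `index` covers [bin_limit/2, bin_limit)
  while (bytes >= bin_limit && index < 127) {      // size range roughly doubles per bin
    bin_limit *= 2;
    ++index;
  }
  return index;
}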
Advantages of BFC:
- This approach leads to fixed bookkeeping overhead per chunk. Because both size information and bin links must be held in each available chunk, the smallest allocatable chunk is 16 bytes in systems with 32-bit pointers and 24 bytes in systems with 64-bit pointers. These minimum sizes are larger than most people would like to see -- they can lead to significant wastage for example in applications allocating many tiny linked-list nodes. However, the 16 bytes minimum at least is characteristic of any system requiring 8-byte alignment in which there is any malloc bookkeeping overhead.
- This basic algorithm can be made to be very fast. Even though it rests upon a search mechanism to find best fits, the use of indexing techniques, exploitation of special cases, and careful coding lead to average cases requiring only a few dozen instructions, depending of course on the machine and the allocation pattern.
TensorFlow memory management
Memory Placement
In TensorFlow, the runtime is responsible for memory allocation and garbage collection. Since TensorFlow supports heterogeneous hardware platforms (e.g., CPU and GPU), the first decision is usually where the memory is placed (host memory or the memory of a particular accelerator). Since memory is associated with its operation, the operation will also be executed on that device. Operation / memory placement has a default setting, and it can also be managed manually by the programmer.
In the default (automatic) setting, if a TensorFlow operation has both CPU and GPU implementations, the GPU devices will be given priority when the operation is assigned to a device. For example, matmul has both CPU and GPU kernels. On a system with devices cpu:0 and gpu:0, gpu:0 will be selected to run matmul.
In manual management, if the programmer would like a particular operation to run on a device of their choice instead of what is automatically selected, they can use with tf.device to create a device context such that all the operations within that context will have the same device assignment.
# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
Programmer-visible memory management interface
Default setting
TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.
Dynamic memory allocation
The process can allocate a subset of the available memory, or grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this.
The first is the allow_growth option, which attempts to allocate only as much GPU memory as is needed based on runtime allocations: it starts out allocating very little memory, and as Sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process. Note that we do not release memory, since that can lead to even worse memory fragmentation. To turn this option on, set the option in the ConfigProto by:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. For example, you can tell TensorFlow to only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Using Python garbage collector
From the discussion in the referenced GitHub issue: as long as we have reference cycles, we're at the mercy of Python's garbage collection strategy. We could add some explicit gc.collect() calls in places (when popping the graph stack?), but for graph mode in TF 1.x that is about the best we can do.
Programmer-invisible memory management
All tensors in TensorFlow are refcounted, dynamically allocated, and freed. On the GPU, allocation goes through the device's own non-blocking BFC memory allocator. All tensors are freed when their refcount drops to zero. For temporary tensors, that is often when "Compute" finishes; for output tensors, that is when they finish being used as input tensors.
Code details of TensorFlow device memory management
Allocator is an abstract interface for allocating and deallocating device memory.
A TensorFlow Op may need access to different kinds of memory that are not simply a function of the device to which the Op has been assigned. For example, an Op executing on a GPU may still need to allocate CPU RAM for some purpose. Internal to the TensorFlow runtime we may choose to allocate CPU RAM from special regions that have been prepared for higher performance in some use contexts, e.g. doing DMA with particular devices. For these reasons, the Device interface does not expose just one memory Allocator, but instead provides an accessor that takes a specification of the desired memory attributes in order to select an Allocator. Example use:
// Allocator for ordinary device memory:
Allocator* a = allocator(AllocatorAttributes());
// ...
// Allocator for CPU RAM, regardless of where Op is executing:
AllocatorAttributes attr;
attr.set_on_host(true);
Allocator* a = allocator(attr);
TrackingAllocator is a wrapper for an Allocator. It keeps a running count of the number of bytes allocated through the wrapper. It is used by the Executor to "charge" allocations to particular Op executions. Each Op gets a separate TrackingAllocator wrapper around the underlying allocator.
- The implementation assumes the invariant that all calls to AllocateRaw by an Op (or work items spawned by the Op) will occur before the Op's Compute method returns. Thus the high watermark is established once Compute returns.
- DeallocateRaw can be called long after the Op has finished, e.g. when an output tensor is deallocated, and the wrapper cannot be deleted until the last of these calls has occurred. The TrackingAllocator keeps track of outstanding calls using a reference count, and deletes itself once the last call has been received and the high watermark has been retrieved.
- It relies on the RefCounted class:
// Initial reference count is one.
RefCounted();
// Increments reference count by one.
void Ref() const;
// Decrements reference count by one. If the count remains
// positive, returns false. When the count reaches zero, returns
// true and deletes this, in which case the caller must not access
// the object afterward.
bool Unref() const;
// Return whether the reference count is one.
// If the reference count is used in the conventional way, a
// reference count of 1 implies that the current thread owns the
// reference and no other thread shares it.
// This call performs the test for a reference count of one, and
// performs the memory barrier needed for the owning thread
// to act on the object, knowing that it has exclusive access to the
// object.
bool RefCountIsOne() const;
protected:
// Make destructor protected so that RefCounted objects cannot
// be instantiated directly. Only subclasses can be instantiated.
virtual ~RefCounted();
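The following is a minimal sketch of the byte-counting idea behind TrackingAllocator, written against a simplified, hypothetical allocator interface (the real class additionally handles thread safety, the outstanding-call reference count described above, and per-allocation bookkeeping):
#include <cstddef>
// Simplified stand-in for the real Allocator interface.
class SimpleAllocator {
 public:
  virtual ~SimpleAllocator() {}
  virtual void* AllocateRaw(size_t alignment, size_t num_bytes) = 0;
  virtual void DeallocateRaw(void* ptr) = 0;
  virtual size_t AllocatedSize(void* ptr) = 0;
};
// Wrapper that charges every allocation made through it to one Op execution.
class ByteCountingAllocator : public SimpleAllocator {
 public:
  explicit ByteCountingAllocator(SimpleAllocator* wrapped) : wrapped_(wrapped) {}
  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    void* ptr = wrapped_->AllocateRaw(alignment, num_bytes);
    if (ptr != nullptr) {
      allocated_ += wrapped_->AllocatedSize(ptr);
      if (allocated_ > high_watermark_) high_watermark_ = allocated_;  // track peak usage
    }
    return ptr;
  }
  void DeallocateRaw(void* ptr) override {
    allocated_ -= wrapped_->AllocatedSize(ptr);
    wrapped_->DeallocateRaw(ptr);
  }
  size_t AllocatedSize(void* ptr) override { return wrapped_->AllocatedSize(ptr); }
  size_t HighWatermark() const { return high_watermark_; }
 private:
  SimpleAllocator* wrapped_;
  size_t allocated_ = 0;        // bytes currently outstanding through this wrapper
  size_t high_watermark_ = 0;   // maximum outstanding bytes seen so far
};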
Subclass VisitableAllocator instead of Allocator when a memory allocator needs to enable some kind of registration/deregistration of memory areas.
- Visitor gets called with a pointer to a memory area and its size in bytes.
- Register a visitor guaranteed to be called exactly once on each chunk of memory newly allocated from the underlying device. Typically, chunks will be reused and possibly sub-divided by a pool manager, so the calls will happen only once per process execution, not once per tensor (re)allocation.
- Register a visitor guaranteed to be called on each chunk of memory returned to the underlying device.
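A sketch of that visitor mechanism (the class and member names below are illustrative, not the actual VisitableAllocator API): visitors registered for allocation are invoked once per region newly obtained from the device, and visitors registered for free are invoked when a region is returned to the device.
#include <cstddef>
#include <functional>
#include <vector>
// A visitor receives the base pointer and size of a memory region.
using Visitor = std::function<void(void* ptr, size_t num_bytes)>;
// Illustrative registry: alloc visitors run once per region obtained from the
// device; free visitors run when a region is returned to the device.
class VisitorRegistry {
 public:
  void AddAllocVisitor(Visitor v) { alloc_visitors_.push_back(v); }
  void AddFreeVisitor(Visitor v) { free_visitors_.push_back(v); }
  // Called by the allocator after it obtains a new region of device memory.
  void NotifyAlloc(void* ptr, size_t num_bytes) {
    for (const Visitor& v : alloc_visitors_) v(ptr, num_bytes);
  }
  // Called by the allocator just before it hands a region back to the device.
  void NotifyFree(void* ptr, size_t num_bytes) {
    for (const Visitor& v : free_visitors_) v(ptr, num_bytes);
  }
 private:
  std::vector<Visitor> alloc_visitors_;
  std::vector<Visitor> free_visitors_;
};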
BFCAllocator inherits from VisitableAllocator, which is a memory allocator that implements a "best-fit with coalescing" algorithm. This is essentially a very simple version of Doug Lea's malloc (dlmalloc). The goal of this allocator is to support defragmentation via coalescing. One assumption we make is that the process using this allocator owns pretty much all of the memory, and that nearly all requests to allocate memory go through this interface.
This allocator has a helper from class AllocatorRetry. Call "alloc_func" to obtain memory. On first call, "verbose_failure" will be false. If return value is nullptr, then wait up to "max_millis_to_wait" milliseconds, retrying each time a call to DeallocateRaw() is detected, until either a good pointer is returned or the deadline is exhausted. If the deadline is exhausted, try one more time with "verbose_failure" set to true. The value returned is either the first good pointer obtained from "alloc_func" or nullptr.
void* AllocateRaw(std::function<void*(size_t alignment, size_t num_bytes, bool verbose_failure)> alloc_func,
int max_millis_to_wait, size_t alignment, size_t bytes);
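A simplified sketch of the retry behavior described above; the real AllocatorRetry blocks on a condition variable that is signalled when a DeallocateRaw() is observed, whereas this illustration just sleeps briefly between attempts (names are placeholders):
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
// Retry an allocation function until it succeeds or the deadline passes, then
// make one final attempt with verbose failure logging enabled.
void* AllocateWithRetry(
    std::function<void*(size_t alignment, size_t num_bytes, bool verbose_failure)> alloc_func,
    int max_millis_to_wait, size_t alignment, size_t bytes) {
  auto deadline = std::chrono::steady_clock::now() +
                  std::chrono::milliseconds(max_millis_to_wait);
  while (std::chrono::steady_clock::now() < deadline) {
    void* ptr = alloc_func(alignment, bytes, /*verbose_failure=*/false);
    if (ptr != nullptr) return ptr;
    // The real helper waits for a DeallocateRaw() before retrying; here we just sleep.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  // Deadline exhausted: one last attempt with verbose failure reporting.
  return alloc_func(alignment, bytes, /*verbose_failure=*/true);
}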
BFCAllocator has a ChunkHandle which is an index into the chunks_ vector in BFCAllocator. kInvalidChunkHandle means an invalid chunk. It has a Chunk struct which points to memory. Their prev/next pointers form a doubly-linked list of addresses sorted by base address that must be contiguous. Chunks contain information about whether they are in use or whether they are free, and contain a pointer to the bin they are in. It has a Bin struct which is a collection of similar-sized free chunks.
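A condensed sketch of those data structures (the field names follow the description above but are simplified relative to the real bfc_allocator.h):
#include <cstddef>
#include <set>
#include <utility>
typedef size_t ChunkHandle;                 // index into the allocator's vector of chunks
const ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);
struct Chunk {
  void* ptr = nullptr;                      // start of the memory this chunk covers
  size_t size = 0;                          // total bytes in the chunk
  bool in_use = false;                      // allocated to a tensor, or free
  ChunkHandle prev = kInvalidChunkHandle;   // neighbor immediately before (contiguous address)
  ChunkHandle next = kInvalidChunkHandle;   // neighbor immediately after (contiguous address)
  int bin_num = -1;                         // which bin holds this chunk when it is free
};
// A bin groups free chunks of similar sizes, ordered so that the smallest
// chunk that satisfies a request (best fit) can be found quickly.
struct Bin {
  size_t min_size;                                        // smallest chunk size this bin accepts
  std::set<std::pair<size_t, ChunkHandle>> free_chunks;   // ordered by (size, handle)
};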
GPU memory allocator
A GPU memory allocator is created when a GPU device reference is created by "BaseGPUDeviceFactory::CreateGPUDevice".
Different GPU device objects that refer to the same physical device and stream group id use the same stream group object (and therefore the same CUDA streams). This is necessary since there is a single memory allocator per device (ProcessState::GetGPUAllocator) and allocators must not be shared across streams.
Allocator* gpu_allocator = process_state->GetGPUAllocator(
options.config.gpu_options(), tf_gpu_id, memory_limit);
The GPU allocator is defined in "process_state.h" as:
virtual Allocator* GetGPUAllocator(const GPUOptions& options,
TfGpuId tf_gpu_id, size_t total_bytes);
This returns the one GPU allocator used for the indexed GPU. 'total_bytes' is the total number of bytes that should be made available to the allocator. The first call to this function for a given tf_gpu_id creates the allocator, so only the total_bytes used on that first call is used.
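A simplified sketch of that create-on-first-call behavior (locking, GPUOptions, and the real Allocator type are omitted; the names below are placeholders):
#include <cstddef>
#include <map>
// Stand-ins for the real types; the actual code lives in process_state.cc and
// takes a GPUOptions proto rather than these simplified parameters.
struct FakeAllocator { size_t limit; };
FakeAllocator* CreateBFCAllocator(size_t total_bytes) {
  return new FakeAllocator{total_bytes};
}
// One allocator per TF GPU id: the first call creates it with total_bytes,
// later calls return the cached allocator and ignore total_bytes.
FakeAllocator* GetGPUAllocator(int tf_gpu_id, size_t total_bytes) {
  static std::map<int, FakeAllocator*> allocators;
  auto it = allocators.find(tf_gpu_id);
  if (it == allocators.end()) {
    it = allocators.emplace(tf_gpu_id, CreateBFCAllocator(total_bytes)).first;
  }
  return it->second;
}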
Currently, options.config.gpu_options.allocator_type can only be BFC (best-fit with coalescing).
The core functionality of BFC is implemented in "tensorflow/core/common_runtime/bfc_allocator.cc". Its class name is BFCAllocator, which inherits from VisitableAllocator.
REFs
- "Using GPU" - https://www.tensorflow.org/programmers_guide/using_gpu
- "dynamic memory allocation" - https://github.com/tensorflow/tensorflow/issues/3299
- "using python GC in tensorflow" - https://github.com/tensorflow/tensorflow/issues/14181
- "A Memory Allocator - Doug Lea" - http://g.oswego.edu/dl/html/malloc.html
- "Wiki page" - https://github.com/miglopst/cs263_spring2018