DirectML Backend Overview
This page provides a quick summary of the essential components of TensorFlow's architecture relevant to supporting a new device backend, like DirectML. This doc assumes a basic understanding of machine learning concepts such as models, inference, and training. A general overview of key TensorFlow concepts can be found here, and a more in-depth overview of TensorFlow internals can be found here.
Important TensorFlow Concepts
At its heart, TensorFlow is all about building up a computational graph, or model, and executing it efficiently. The graph nodes represent operations, like add, and the graph edges represent data (tensors). The graph is executed within a session, and a session may include one or more devices that carry out the actual computation. Users build a graph with APIs at varying levels of abstraction: the lower-level APIs allow users to explicitly link together individual operations (e.g. tf.raw_ops.Add), while higher-level APIs can represent entire subgraphs of operations (e.g. tf.keras.layers.Dense). Regardless of which APIs are invoked, the computation boils down to ops running on TF devices like the CPU, GPU (CUDA), or DirectML (internally named "DML").
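As a rough illustration, here is a minimal sketch using the TF 1.x Python API (the version this backend targets) that builds a tiny graph and executes it in a session; the runtime decides which registered device runs each op:

```python
import tensorflow as tf  # TF 1.x API, as used by tensorflow-directml

# Build a graph: nodes are ops (Const, MatMul, Add), edges are tensors.
graph = tf.Graph()
with graph.as_default():
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[5.0], [6.0]])
    product = tf.matmul(a, b)      # MatMul op
    result = tf.add(product, 1.0)  # Add op

# Execute the graph in a session; the runtime assigns each op to a device
# (CPU, GPU, or DML) that registers a kernel for it.
with tf.Session(graph=graph) as sess:
    print(sess.run(result))  # [[18.], [40.]]
```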
Below is a summary of important terms found throughout the TensorFlow project:
- Operation: an abstract function that transforms data (e.g. add, convolve, multiply, reshape). In this doc, we may also refer to these as simply ops or operators.
- Tensor: a multi-dimensional array of data
- Device: abstraction for hardware that implements execution logic, such as a CPU or GPU
- Kernel: a device-specific implementation of an operation
- Graph: a computational model that comprises operations (nodes) and tensors (edges)
- Session: the execution environment for a graph (contains runtime state)
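These terms map directly onto the Python API. The short sketch below (TF 1.x, introspection only) shows that a graph is just a collection of operations whose outputs are the tensors consumed by downstream operations:

```python
import tensorflow as tf  # TF 1.x API

# Operations are the nodes of a graph; tensors are its edges.
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
    y = tf.reduce_sum(tf.square(x), name="y")

for op in g.get_operations():
    # Each operation has a type (e.g. Placeholder, Square, Sum) and
    # produces output tensors consumed by downstream operations.
    print(op.type, "->", [t.name for t in op.outputs])
```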
TensorFlow has higher-level abstractions that build on the fundamental pieces above. These higher-level abstractions are not particularly important with respect to the backend, since internally the execution model still relies on graphs and sessions. However, it's worth noting these in case you encounter them in code:
- Estimator: encapsulates a graph and session for high-level training/inference scenarios (e.g. linear regression)
- Layer: a grouping of operators and tensors (e.g. dense)
- Optimizer: an implementation of a gradient-based optimization algorithm, such as Adam, used for training
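To see why these abstractions don't matter much to the backend, consider the sketch below (TF 1.x, assuming a plain Dense layer): a Keras layer is just a convenient way to emit a subgraph of primitive ops, and those primitive ops are all the device ever sees.

```python
import tensorflow as tf  # TF 1.x API

# Higher-level abstractions still lower to plain graph ops.
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 8])
    y = tf.keras.layers.Dense(units=4, activation="relu")(x)

# The Dense layer is expressed as primitive ops (MatMul, BiasAdd, Relu, ...),
# which the device backend executes via its registered kernels.
print(sorted({op.type for op in g.get_operations()}))
```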
DirectML Backend
For the purposes of DirectML acceleration, supporting a new device type in TensorFlow 1.x involves code that, for the most part, falls into these buckets:
- Device Runtime: all the "glue" code necessary for TensorFlow's core runtime to interface with a device-specific backend. A device runtime is responsible for enumerating physical hardware (e.g. GPUs, or adapters in DirectX terminology), allocating resources, transferring memory between devices (e.g. between CPU and DirectML), managing various kinds of state, and delegating the computation of certain operations to device-specific kernels. For DirectML, most of this code is found under tensorflow/core/common_runtime/dml. (Device enumeration is visible from Python; see the sketch after this list.)
- Device Kernels: ops like add and conv2d are abstract functions with no concrete implementation. The implementation of an op is called a kernel, and every TF device statically registers kernels for the ops that it supports. A device can register many kernels for a single op so long as the signature of the kernel (input data types, output data types, and other constraints) is unique. Many DirectML kernels are found in files prefixed with dml_ under the tensorflow/core/kernels directory; however, in some cases we modify existing source files when appropriate (e.g. when the implementation is purely a registration that invokes CPU code).
- External Code: DirectML itself, along with some support headers needed to compile for WSL, is introduced in the repo under the third_party/dml directory.
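The devices enumerated by the runtime can be inspected from Python. A minimal sketch, assuming the usual TF 1.x device_lib helper and that tensorflow-directml surfaces its adapters under the "DML" device type:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib  # internal module, but widely used

# Ask the runtime which devices its registered backends have enumerated.
# With tensorflow-directml installed, DirectML adapters are expected to
# show up alongside the CPU with the device type "DML".
for device in device_lib.list_local_devices():
    print(device.device_type, device.name)
```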
When TensorFlow executes a model, it partitions the underlying graph and assigns each partition to a device. The CPU might run part of the model, and DirectML might run another part. Most of this logic is handled automatically by the core TensorFlow runtime and is outside the scope of a device backend; however, there are ways to influence this partitioning both implicitly and explicitly. For example, the Grappler system can apply device-specific logic to simplify parts of the graph before it is executed. Additionally, heuristics determine the device placement of ops based on which kernels a device registers: if a graph has ops A, B, and C, and a device only implements op B, then TF may execute all three ops on the CPU anyway to avoid shuttling tensors between devices. In short, the more ops a device implements, the better the locality (and performance) is likely to be.
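Placement can also be influenced explicitly from user code. Below is a minimal sketch (TF 1.x, assuming the device string "/device:DML:0" given the internal device name "DML" noted above) that pins a few ops to a DirectML device and logs where the runtime ultimately placed everything:

```python
import tensorflow as tf  # TF 1.x API

graph = tf.Graph()
with graph.as_default():
    # Explicitly pin these ops to the first DirectML device; ops outside the
    # scope are placed by the runtime's own heuristics.
    with tf.device("/device:DML:0"):
        a = tf.random.uniform([1024, 1024])
        b = tf.random.uniform([1024, 1024])
        c = tf.matmul(a, b)

# allow_soft_placement lets TF fall back to another device (e.g. the CPU) if
# no DML kernel is registered; log_device_placement prints the final choice.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(c)
```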