
2. Glow Platform

This syllabus is divided into three parts. Part One, comprising sections 2.1 through 2.4, is a step-by-step reference guide for using the Glow platform: it contains instructions for reproducing everything from the installation and configuration of the environment to the complete execution of inference with a reference model. Part Two, comprising sections 2.5 through 2.8, covers the CPU backend and how Glow supports architecture-specific nodes and instructions; it then discusses the quantization feature and some architecture-independent optimizations that Glow provides. Part Three, comprising section 2.9, describes the P2 project and gives a practical guide for its implementation.

Part One

The first three sections, Building the Environment, Compiling the Model, and Running the Inference, cover, respectively, the preparation of the environment, the compilation of the model, and its execution. All these steps assume you are working directly on the host machine with your main operating system. The fourth section (preferable 👍), Using Containers, prepares the same environment inside Docker containers while sharing a work folder with the host machine.

  • Note: The reference operating system is Ubuntu 20.04, which ships with clang/llvm 10.0, the default compiler version used here. You can install a version other than 10, but you should not install two releases at the same time. If your environment requires a different version, replace the build line in item 7:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..

     with this:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=/usr/lib/llvm-XX/lib/cmake/llvm ..

     where XX is the version number you would like to use.

2.1 Building the Environment

  1. First, update your system. The command below also installs the dependencies needed to set up the environment:
sudo apt-get update && sudo apt-get install clang clang-tools \
  cmake graphviz libpng-dev libprotobuf-dev llvm llvm-dev \
  ninja-build protobuf-compiler wget libgoogle-glog-dev \
  libboost-all-dev libdouble-conversion-dev libevent-dev libssl-dev \
  libgflags-dev libjemalloc-dev libpthread-stubs0-dev liblz4-dev \
  libzstd-dev libbz2-dev libsodium-dev libfmt-dev pkg-config \
  apt-utils libfmt-dev libc6-dbg gdb valgrind git git-lfs doxygen \
  libopenblas-dev
  2. From a working directory named, for example, mo436, clone the repositories below:
git clone https://github.com/pytorch/glow.git
git clone https://github.com/MO436-MC934/work.git
  3. Glow depends on a few submodules: googletest, onnx, and a library for FP16 conversions. To get them, from the mo436 directory, run:
cd glow
git submodule update --init --recursive
cd ..
  4. Glow depends on fmt, which must be built from source:
git clone https://github.com/fmtlib/fmt
mkdir fmt/build
cd fmt/build
cmake ..
make
sudo make install
  5. Before configuring and building Glow, it may be desirable to use update-alternatives to manage the versions of clang/clang++ and python:
sudo update-alternatives --install /usr/bin/clang clang \
 /usr/lib/llvm-10/bin/clang 100
sudo update-alternatives --install /usr/bin/clang++ clang++ \
 /usr/lib/llvm-10/bin/clang++ 100
sudo update-alternatives --install /usr/bin/python python \
 /usr/bin/python3 30
  6. Glow uses the system default C/C++ compiler (/usr/bin/c++), so you may also want to switch your default C/C++ compiler to clang:
sudo update-alternatives --config cc
 # Select the option corresponding to /usr/bin/clang
sudo update-alternatives --config c++
 # Select the option corresponding to /usr/bin/clang++
  7. To build the Glow compiler, create a build directory and run cmake (assuming you are in the mo436 folder). This is a very time-consuming process, especially when compiling Glow from scratch:
cd glow
mkdir build
cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
ninja all

Building documentation can be enabled by passing an additional cmake parameter:

-DBUILD_DOCS=ON

The output will be placed in the docs/html subdirectory of the build output directory.

  8. To use our scripts, set the environment variable that defines the path to the Glow binaries so that you can compile the models:
export GLOWBIN=/path/to/glow/build/bin

2.2 Compiling the Model

  9. Compiling the CNN model mnist.

After completing the previous steps, you are ready to compile and run a CNN model. The work directory (created inside the mo436 directory) contains a set of ONNX models for image classification. The example in this guide is based on MNIST Handwritten Digit Recognition --- mnist. The example uses only the default configuration parameters.

cd /work/models/mnist

Compile the model to produce the bundle objects as shown below. Note that the compilation uses a set of default flags; a brief description of each compilation flag and how it works can be found in the help.

make -f compiler.mk

Build the application that drives the inference execution on the CPU:

make -f builder.mk

After this step, the workflow produces three files in the bin folder of the mnist directory, namely main.x, mnist.weights.bin, and mnist.weights.txt.

2.3 Running the Inference

  10. Executing the Inference.

The execution process obeys the directory structure below:

── work
   ├── datasets
   │   ├── imagenet
   │   ├── mnist
   ├── models
   │   ├── mnist
   │   │    ├── bin
   │   │    │   ├── main.x
   │   │    │   ├── mnist.weights.bin
   │   │    │   ├── mnist.weights.txt
   │   ├── ...
   ├── scripts
   │   ├── exec_accuracy.sh
   │   ├── ground_truth_imagenet.txt
   │   ├── ground_truth_mnist.txt
   │   ├── measure_acc.cpp

To manually perform inference on the images contained in the /work/datasets/mnist folder, do:

cd bin
./main.x ../../../datasets/mnist/*.png

The mnist application will show the top-5 predictions and the confidence of the top-1 prediction for each image in the dataset. We also provide execution scripts that measure accuracy automatically: they run inference for a set of images, collect the top-1 and top-5 predictions, and then compute the Top-1 accuracy, Top-5 accuracy, Precision, Recall, and F1-score. To run the scripts, go to the /work/scripts folder and do:

./exec_accuracy.sh -m mnist

2.4 Using Containers

The use of containers is often preferable because it isolates the tools and their dependencies from the main user environment, avoiding conflicting versions of the same tool and, above all, of the system libraries they rely on.

The steps that are common to both workflows, with or without containers, have already been described in the previous sections and will therefore only be referenced here. This section focuses on what differs when setting up the environment with containers. If you are new to using containers, see the documentation here.

  11. Creating the Dockerfile

The first step is to create a file named Dockerfile with the contents of the script below. You can create this file in the same directory (in our example, mo436) where you will download the repositories (step 13, below).

FROM ubuntu:20.04

WORKDIR /mo436

ENV TZ=America/Sao_Paulo
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone

RUN apt-get -y update && apt-get install -y clang clang-tools \
    cmake graphviz libpng-dev libprotobuf-dev llvm \
    llvm-dev ninja-build protobuf-compiler wget libgoogle-glog-dev \
    libboost-all-dev libdouble-conversion-dev libevent-dev libssl-dev \
    libgflags-dev libjemalloc-dev libpthread-stubs0-dev liblz4-dev libzstd-dev \
    libbz2-dev libsodium-dev libfmt-dev \
    pkg-config apt-utils libfmt-dev libc6-dbg gdb valgrind \
    git git-lfs doxygen libopenblas-dev zsh

RUN update-alternatives --install /usr/bin/clang clang /usr/lib/llvm-10/bin/clang 100 \
    && update-alternatives --install /usr/bin/clang++ clang++ /usr/lib/llvm-10/bin/clang++ 100 \
    && update-alternatives --install /usr/bin/cc  cc  /usr/bin/clang 100 \
    && update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++ 100 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3 30

ENV GLOWBIN="/mo436/glow/build/bin"

RUN git clone https://github.com/fmtlib/fmt.git \
    && cmake -S fmt -B fmt/build \
    && cmake --build fmt/build \
    && cmake --install fmt/build \
    && rm -rf fmt

  12. Building the Container

To build the image, run the following command (note the presence of a dot at the end of the command line). Depending on the computing power of your machine, this build process may take a while.

docker build -t ubuntu:mo436 .
  13. Running the Container

First, clone the repositories (steps 2 and 3 in the first section). Then run the container, mapping the current directory (the mo436 folder) of your host machine to the container's /mo436 working directory:

docker run --rm -it -v $PWD:/mo436 ubuntu:mo436 zsh
  14. Compiling the Glow/CPU

With the container running, you can compile Glow with the command below (don't forget to check that you are on the desired branch):

cmake -S glow -B glow/build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build glow/build

This is similar to step 7 in the section Building the Environment. After that, you can follow steps 9 and 10 to compile and run the model. Also remember that this is a very time-consuming process, especially when compiling Glow from scratch.

Part Two

2.5 Design of the Glow Intermediate Representation

This section describes the motivation behind the Glow intermediate representation (IR) and some implementation details.

High-level IR

The high-level IR is a dataflow, node-based graph representation that is similar to the graph that you may find in the ONNX format. When Glow loads a neural network model from a file, it constructs this graph with a direct translation of one operator to one or more nodes. The high-level IR is a simple graph that allows basic transformations such as replacing all uses of some node with another node and modifying the content of constant nodes. The graph is strongly typed, which means that inputs and outputs have a known tensor type (consisting of the tensor's shape and element type) and that the types of nodes are verified by the compiler. For example, the element-wise add instruction must operate on operands of the same type.


Constants

Constants are special nodes that represent tensors that are a part of the graph. These nodes can be used to represent things like the weights of neural networks. Constants are immutable during the execution of the program, but graph optimizations can access the constants and modify them. This feature is useful for transformations that prepare the weights by transposing them or quantizing them before the execution of the program.

Placeholders

Placeholders are symbolic nodes that are not backed by a concrete tensor during the compilation of the program. Inputs and outputs of Glow programs should be modeled using Placeholder nodes. Concrete tensors are attached to placeholder nodes during the compilation of the program, and not before. This means that, unlike constants, the optimizer can't inspect or mutate the content of Placeholder nodes. The same program could be compiled using different bound tensors without changing the semantics of the program.
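
As a rough illustration of how constants and placeholders show up when a graph is built through the C++ interface, the sketch below assembles a tiny fully-connected model. The builder calls (createPlaceholder, createConstant, createFullyConnected, createSave) follow Glow's graph API, but treat the exact signatures as assumptions and check the headers of your Glow checkout.

#include "glow/Graph/Graph.h" // Module, Function, and the node builders

using namespace glow;

void buildTinyNet() {
  Module mod;
  Function *F = mod.createFunction("main");

  // Placeholder: a symbolic input, bound to a concrete tensor only at runtime.
  auto *input = mod.createPlaceholder(ElemKind::FloatTy, {1, 784}, "input",
                                      /* isTrainable */ false);

  // Constants: weights and bias baked into the graph; optimizations may
  // transpose or quantize them ahead of time.
  auto *weights = mod.createConstant(ElemKind::FloatTy, {784, 10}, "weights");
  auto *bias = mod.createConstant(ElemKind::FloatTy, {10}, "bias");

  // A high-level FullyConnected node; it is lowered later (see Node Lowering).
  auto *fc = F->createFullyConnected("fc", input, weights, bias);
  F->createSave("save", fc);
}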

Node Lowering

Instead of compiling high-level operators directly, Glow performs "node lowering". In this phase, the compiler breaks the high-level operator nodes into low-level linear algebra operator nodes. For example, the FullyConnected layer is represented as a matrix multiplication followed by a broadcasted add. Different compiler backends do not have to implement the FullyConnected layer and a dozen other high-level opcodes, just the low-level matrix multiplication.
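
As a sketch of what this lowering produces, and assuming the builder methods createMatMul and createBatchedAdd used by Glow's in-tree lowering code, a FullyConnected node over the input, weights, and bias from the previous sketch becomes:

// Illustrative fragment, reusing F, input, weights, and bias from the
// snippet in the Placeholders subsection.
auto *mm = F->createMatMul("fc.dot", input, weights);        // X * W
auto *fcLowered = F->createBatchedAdd("fc.bias", mm, bias);  // + broadcasted bias
// All users of the original FullyConnected result are rewired to fcLowered.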

This lowering phase drives many of the design decisions of the compiler. In Glow, lowering is performed as part of the high-level graph as described above, prior to moving to low-level IR. This is due to a number of reasons. First, the new lowered graph may allow for additional graph-level optimizations. Second, the new graph structure may affect the decisions of the instruction scheduler. And third, after lowering we allow the backends to perform additional target-specific optimizations on the lowered graph.

The lowering phase comes after the graph is differentiated. Because the lowering transformation does not preserve the semantics of the graph, it is not possible to differentiate the graph for certain operators. For example, the Regression node (which produces gradient when optimizing total squared error) becomes a no-op for the inference case, but is translated into an element-wise subtract for the training case. Performing the lowering before differentiation would prevent us from performing the correct lowering of the Regression node.


When compiling the model, it is useful to view the final form of the graph after all the transformations and optimizations performed by Glow (which might differ from the initial model). You can generate the graph's visual representation in .dot format by using the -dump-graph-DAG and -dump-graph-DAG-before-compile options like this:

model-compiler -model=lenet.onnx \
    -dump-graph-DAG-before-compile=lenet-before.dot \
    -dump-graph-DAG=lenet-after.dot \
    ...

Additionally, you can convert the .dot files to .pdf format using the dot utility available on Linux like this:

dot -Tpdf lenet-before.dot -o lenet-before.pdf
dot -Tpdf lenet-after.dot -o lenet-after.pdf

Low-Level IR

After optimizing the graph with target-independent optimizations, and lowering from high-level operator nodes to linear algebra operator nodes, the code is further lowered into the low-level IR in a phase that is called "IRGen" (which stands for IR generation). This is a one-to-many translation where each high-level node is translated into one or more instructions.

The low-level IR enables a different kind of target-independent optimization that is not possible with the high-level graph format. This is an instruction-based representation that operates on tensors that are referenced by address. This gives the compiler the ability to perform low-level memory optimizations that are not possible at the high level, because there memory is not represented directly. An example of such a transformation is the optimization that allows certain operations to transform some buffers in-place, such as element-wise arithmetic.

The IR is strongly typed and each instruction operand kind has known parameter types. It is designed to be used as an in-memory form, though it can be dumped into a human-readable assembly-like format.

A function in IR form contains two sections: declare and program. In the first section of the IR, we declare a number of memory regions that live throughout the lifetime of the program. This is similar to global variables in C. The second part of the IR is a list of instructions. Each variable is annotated with the kind of initialization that the program should do.

There are two kinds of memory regions that correspond to these two sections: global memory regions (found in declare) and locally allocated regions (found in program). The locally allocated memory regions are similar to alloca in LLVM IR. Memory regions are strongly typed, which means that the kind of type of tensor that the region represents is known.

Instructions operate on either global variables or locally allocated buffers. Each operand is annotated with one of the qualifiers '@in'/'@out'/'@inout'. '@in' means that the buffer is read from. '@out' means that the buffer is written into. And '@inout' means that the instruction may both read from and write into the buffer. These operand qualifiers help the optimizer decide when it is legal to perform certain optimizations, such as copy elimination or buffer sharing. Instructions may have other attributes that specify the legality of some optimizations. For example, some instructions require that the data from the forward pass be kept around for the backward pass, so if the program is not optimized for inference-only mode then certain memory optimizations cannot happen.

Below is an example of unoptimized Glow IR. Note that the alloc instruction does not allocate memory; it just marks the lifetime of the activation. The low-level memory allocator is responsible for allocating all of the buffers into a single coalesced region.

  declare {
    %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
    %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
    %filter0 = weight float<16>, broadcast, 0.100
    %weights = weight float<10 x 144>, xavier, 144.0
    %bias = weight float<10>, broadcast, 0.100
    %selected = weight index<8 x 1>
    ...
    %result = weight float<8 x 10>
  }

  program {
    %allo = alloc float<8 x 28 x 28 x 16>
    %conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias0
    %allo0 = alloc float<8 x 28 x 28 x 16>
    %relu = relu @out %allo0, @in %allo
    %allo1 = alloc index<8 x 9 x 9 x 16 x 2>
    %allo2 = alloc float<8 x 9 x 9 x 16>
    %pool = pool max [3 3 0] @out %allo2, @in %allo0, @inout %allo1
    ...
    %deal6 = dealloc @out %allo6
    %deal7 = dealloc @out %allo7
    %deal8 = dealloc @out %allo8
    %deal9 = dealloc @out %allo9
  }

We have the option to print the IR after optimizations to stdout (or redirect it to a file). For example:

model-compiler -model=lenet.onnx \
    ... \
    -dump-ir > lenet.lir

The Lifetime of a Glow Instruction

This is a high-level overview of the compilation process:

  1. The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface.

  2. The graph is differentiated if needed.

  3. The graph is optimized.

  4. Linear algebra node lowering takes place.

  5. Additional rounds of optimizations occur, both target independent and target specific.

  6. The graph is scheduled into a linear sequence of nodes that minimizes memory usage.

  7. IRGen converts the low-level graph into instructions.

  8. Low-level IR optimizations are performed.

  9. Backend-specific optimizations and code generation are performed.

2.6 CPU Backend

This section gives an overview of the Glow infrastructure required by the CPU backend. Details about each of the changes and additions are described as comments in the code.

The implementation of the CPU backend is contained in the subdirectory lib/Backends/CPU. The CPU backend is registered through its own registration factory (see file CPUFactory.cpp) in order to be discovered by Glow. The CPU backend is derived from the abstract base class Backend and implements the virtual function compile, which takes a Function and the provided CPU backend options and compiles them.

Additionally, there are several virtual functions that the CPU backend overrides (a sketch of overriding two of them follows this list):

  • isOpSupported returns whether the provided node instruction is supported by the CPU backend;
  • supportsFusedActivation returns whether the node instruction supports fused activations and which kinds of nodes can be fused into it;
  • shouldLower prevents lowering for some nodes, like the Relu node, since the CPU runtime library has a kernel that implements it;
  • save, where the provided Function is compiled and then an object file and a header file are saved into the bundle folder.
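
The sketch below shows the shape of two of these overrides, assuming the Backend interface signatures bool shouldLower(const Node *) const and bool isOpSupported(const NodeInfo &) const; it is illustrative only, and the real CPUBackend.cpp contains the authoritative versions.

// Illustrative overrides for a CPU-like backend; check Backend.h for the
// exact virtual signatures in your Glow version.
bool CPUBackend::shouldLower(const Node *N) const {
  // Keep Relu as a node: the CPU runtime library (libjit) provides a kernel
  // for it, so there is no need to lower it to Max(Splat(0), X).
  if (N->getKind() == Kinded::Kind::ReluNodeKind)
    return false;
  return true;
}

bool CPUBackend::isOpSupported(const NodeInfo &NI) const {
  // Example policy: accept float and int8 nodes, reject everything else.
  return NI.allInputsAndOutputsHaveSameElemKind(
      {ElemKind::FloatTy, ElemKind::Int8QTy});
}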

Backend-Specific Nodes and Instructions

Backends in Glow have the opportunity to perform their own analysis and transformations after lowering. This is exposed via the transformPostLowering() hook, during which a backend can transform the graph however it desires. In our example, the CPU backend uses transformPostLowering(), defined in the file lib/Backends/CPU/Transforms.cpp, to search the graph for nodes such as Convolution and replace them with a CPU-specific node called CPUConvDKKC8. The CPUConvDKKC8 node operates on filter weight data in a non-standard format. The default format is DKKC, where D is the output depth of the filter, C is the input channel, and K is the kernel size. This optimization changes the data layout to [D/8, K, K, C, 8]: Glow pre-swizzles the data in the weights to make the access pattern more efficient, as sketched below.
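
A minimal sketch of what this pre-swizzling means for the weight buffer is shown below; the index arithmetic is illustrative only (it is not Glow's actual code) and assumes D is a multiple of 8.

#include <cstddef>

// Copy filter weights from the DKKC layout into [D/8, K, K, C, 8] so that
// eight consecutive output depths become adjacent in memory, which makes
// vectorized access patterns more efficient.
void preSwizzleDKKC8(float *dst, const float *src, size_t D, size_t K, size_t C) {
  for (size_t d = 0; d < D; d++)
    for (size_t ky = 0; ky < K; ky++)
      for (size_t kx = 0; kx < K; kx++)
        for (size_t c = 0; c < C; c++)
          dst[((((d / 8) * K + ky) * K + kx) * C + c) * 8 + (d % 8)] =
              src[((d * K + ky) * K + kx) * C + c];
}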

To do this, the CPU backend creates its own custom Node and Instruction. This is done via ClassGen¹ and implicitly included in tools/ClassGen/NodeGen.cpp and tools/ClassGen/InstrGen.cpp. The new node and instruction are defined inside the backend sub-directory, in the files lib/Backends/CPU/ClassGen/CPUSpecificNodes.h and lib/Backends/CPU/ClassGen/CPUSpecificInstrs.h.

1: Glow uses automatic code generation techniques (ClassGen) for defining instructions and nodes, helping developers maintain records of domain-specific information. The current system is capable of generating two kinds of classes: Nodes for the high-level IR and Instructions for the low-level IR.

Creating bundles for CPU Backend

In Glow, a bundle is a self-contained compiled network model that can be used to execute the model in standalone mode. The CPU backend generates bundles as object files containing all the code necessary to run the inference. It also produces a header file in the bundle that contains the information needed by the application that performs inference with the model.

The command used to build a bundle for the CPU backend is the following:

model-compiler -backend=CPU -model=<model-path> -emit-bundle=<bundle-dir>

After running the model-compiler tool, the following bundle artifacts will be generated in the output directory:

  • <network_name>.o - the bundle object file (code).
  • <network_name>.h - the bundle header file (API).
  • <network_name>.weights.bin - the model weights in binary format.
  • <network_name>.weights.txt - the model weights in text format as C text array.

Bundle memory layout

The memory of a bundle is organized in three separate memory regions which must be allocated by the user application code and provided through the bundle interface:

  • constantWeight - contains the model constant weights. The user application must:

    • allocate this memory region (statically or dynamically)
    • initialize this memory region with the content of the generated weights file in one of two possible formats:
      • binary format (<network_name>.weights.bin) used to initialize this memory region (allocated statically or dynamically) by loading the binary file dynamically at run-time using standard C function like fopen.
      • text format (<network_name>.weights.txt) used to initialize this memory region (only if statically allocated) by including the text file statically at compile-time as a C array using the #include pre-processor directive. This format is suitable for target architectures that do not have file systems (for example microcontrollers).
    • provide the base address of this memory region to the inference function
  • mutableWeight - contains all the model inputs and outputs (graph placeholders). The tensors corresponding to different inputs and outputs are identified using offsets relative to the base address of this memory region. The user application must:

    • allocate this memory region (statically or dynamically)
    • initialize the model input tensors from this memory region with the desired input data before running the inference
    • provide the base address of this memory region to the inference function
    • read the model output tensors from this memory region after running the inference
  • activations - this memory region is a scratch memory required for the bundle code to store the intermediate results of the graph computation (activations). The user application must:

    • allocate this memory region (statically or dynamically)
    • provide the base address of this memory region to the inference function
    • this memory region is NOT required to be initialized

The required sizes for all the memory regions described above are provided in the bundle interface. Also, all the memory regions must be allocated with a minimum alignment which is also provided in the interface (typically 64 bytes).
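
The sketch below shows what a minimal host application for the mnist bundle could look like. The macro names (MNIST_CONSTANT_MEM_SIZE, MNIST_MEM_ALIGN, ...) and the entry-point signature are assumptions based on the description above; the authoritative names, sizes, and placeholder offsets are in the generated <network_name>.h header.

// Hypothetical host-side driver for a bundle named "mnist".
#include <cstdint>
#include <cstdio>
#include <cstdlib>

#include "mnist.h" // generated bundle header (assumed name)

static uint8_t *alignedAlloc(size_t size, size_t align) {
  // The bundle regions must honor the alignment reported in the header;
  // aligned_alloc also requires the size to be a multiple of the alignment.
  size_t rounded = (size + align - 1) / align * align;
  return static_cast<uint8_t *>(aligned_alloc(align, rounded));
}

int main() {
  uint8_t *constantWeight = alignedAlloc(MNIST_CONSTANT_MEM_SIZE, MNIST_MEM_ALIGN);
  uint8_t *mutableWeight = alignedAlloc(MNIST_MUTABLE_MEM_SIZE, MNIST_MEM_ALIGN);
  uint8_t *activations = alignedAlloc(MNIST_ACTIVATIONS_MEM_SIZE, MNIST_MEM_ALIGN);

  // Initialize the constant weights from the file produced at compile time.
  FILE *f = fopen("mnist.weights.bin", "rb");
  if (!f || fread(constantWeight, 1, MNIST_CONSTANT_MEM_SIZE, f) !=
                MNIST_CONSTANT_MEM_SIZE)
    return 1;
  fclose(f);

  // ... write the input tensor into mutableWeight at the offset given in mnist.h ...

  mnist(constantWeight, mutableWeight, activations); // run the inference

  // ... read the output tensor from mutableWeight at its offset ...

  free(constantWeight);
  free(mutableWeight);
  free(activations);
  return 0;
}

On targets without a file system, the constantWeight region can instead be a statically allocated, suitably aligned array initialized at compile time by #include-ing <network_name>.weights.txt, as described above.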

2.7 Working with Quantization

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms.

Glow is able to convert floating-point networks into signed 8-bit or 16-bit integer networks. The CPU backend requires that the network be quantized to signed 8-bit integers, while the NMP backend, developed at UNICAMP for the LGE NeuroMorphic Processor², requires that the network be quantized to signed 16-bit integers. Glow's infrastructure supports, among other schemes, symmetric with power-of-2 scale, which produces quantized ranges centered on 0 (symmetric) but also restricts the scale parameter to be a power of 2.

Quantization in Glow

Restricting the scale parameter to be a power of 2 might result in poor exploitation of the quantized range (poor accuracy), but has the potential to provide better performance, since multiplications by the scale can be implemented as shifts. Therefore, it was acceptable to use it to quantize the models for the NMP.
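
For intuition, here is a minimal sketch (not Glow's implementation) of how int8 quantization parameters could be chosen under the symmetric with power-of-2 scale schema, given the range observed in the profile:

#include <cmath>
#include <cstdint>

struct QuantParams {
  float scale;    // real ≈ scale * quantized
  int32_t offset; // always 0 for symmetric schemas
};

// Pick int8 parameters for "symmetric with power-of-2 scale" given the
// profiled range [minVal, maxVal] of a tensor.
QuantParams chooseSymmetricPow2Int8(float minVal, float maxVal) {
  float absMax = std::fmax(std::fabs(minVal), std::fabs(maxVal));
  if (absMax == 0.0f)
    return {1.0f, 0};
  // Smallest power-of-2 scale that maps [-absMax, absMax] into [-127, 127].
  float scale = std::exp2(std::ceil(std::log2(absMax / 127.0f)));
  return {scale, 0};
}
// quantized = clip(round(real / scale), -128, 127)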


The fastest inference performance is achieved by quantizing the model, and the best way to quantize the model without losing much accuracy is to create a quantization profile. Because different parts of the neural network contain floating-point values in different ranges, Glow uses profile-guided information to estimate the possible numeric range for each stage of the neural network. The quantization conversion works as a two-phase process. First, Glow statically instruments the network with special profiling nodes that record the ranges of activations that flow through the network, optimizes the network including these profiling nodes, and then runs inference. Then, it recompiles the network using the profile information to convert the network into a quantized form, allowing for static optimization of the quantized graph.

In order to compute the quantization profile, one option is to use the model-profiler tool. This application is generic, works with any model, and requires a set of files (in either text or binary format) corresponding to the model input tensors in order to feed the model with a dataset and compute the profile. The command has the following format:

model-profiler -model=<model-path> \
  -dump-profile=profile.yaml \
  -input-dataset=<name1,format1,source1,opts1> \
  ...

The -input-dataset option specifies the dataset used to feed each of the model inputs. The -dump-profile=profile.yaml option dumps the profile data of each node's output into the profile.yaml file. This information is later used in the quantized conversion.

In order for the profiling phase to be correct, make sure the data used to feed the network is pre-processed in the same way as it would be for inference. For example, for an image classification model, make sure the raw input data:

  • has the correct data layout (NHWC or NCHW)
  • has the correct channel order (RGB or BGR)
  • is scaled properly: the values are in the range [0,1], [-1,1], [-128,127], [0,255], etc.

It is important to note that the profiling phase is independent of the quantization parameters, so there is no need to specify the quantization schema, precision, or other parameters.

Another tool that can compute the quantization profile is the image-classifier, which is specialized for image classification models only and requires a set of images to run the inference and compute the profile. This application has the benefit of loading PNG images directly and pre-processing them according to the model's needs (layout conversion, channel ordering, scaling). For example, you can use the Makefile below to capture the profile for the resnet18 model:

.PHONY : clean build all

MODEL ?= resnet18
MODELINPUT ?= "data"
PROFILE ?= $(MODEL).yaml
IMAGEMODE ?= 0to1
ORDER ?= RGB
CALIBRATION ?= ../calibration/images224

all: clean build

build: $(PROFILE)

$(PROFILE):
	@echo 'Build the $(MODEL) profiler $@:'
	${GLOWBIN}/image-classifier ${CALIBRATION}/*.png \
		-image-layout=NCHW \
		-image-mode=$(IMAGEMODE) \
		-image-channel-order=$(ORDER) \
		-use-imagenet-normalization \
		-model=$(MODEL).onnx \
		-model-input-name=$(MODELINPUT) \
		-dump-profile=$@

clean:
	rm -f $(PROFILE)

The calibration sub-directory provided to the image-classifier tool contains a small set of PNG images that were obtained by preprocessing a set of JPEG images from Imagenet, converting them from BGR to RGB format and rescaling them to size 3x224x224. Glow's default layout is NHWC, but for the NMP processor we use the NCHW layout, hence the need to indicate the layout change in the above script. Also, for ONNX models, we use the -use-imagenet-normalization flag to indicate that the images will be pre-processed and normalized with the parameters used during training on the Imagenet dataset.

After the quantization profile profile.yaml has been generated, we can use the model-compiler tool to compile the model into a bundle by loading the previously generated profile:

model-compiler ... -load-profile=profile.yaml -quantization-schema=<schema>

When compiling a quantized bundle with the model-compiler some quantization parameters can be specified:

  • quantization-schema specifies the quantization schema:
    • asymmetric for Asymmetric quantization schema (Default).
    • symmetric for Symmetric quantization schema.
    • symmetric_with_uint8 for SymmetricWithUint8 quantization schema.
    • symmetric_with_power2_scale for SymmetricWithPower2Scale quantization schema.
  • quantization-precision specifies the precision used to quantize the nodes:
    • Int8 for int8 quantization (Default).
    • Int16 for int16 quantization.
  • quantization-precision-bias specifies the precision used to quantize the bias operand of some of the nodes (e.g. FullyConnected, Convolution):
    • Int8 for int8 quantization.
    • Int16 for int16 quantization.
    • Int32 for int32 quantization (Default).

For example, in order to profile, quantize, and compile the ResNet18 model for the NMP processor, you can use the Makefile below:

.PHONY : clean build all

MODEL ?= resnet18
MODELINPUT ?= "data",float,[1,3,224,224]
PROFILE ?= $(MODEL).yaml
PRECISION ?= Int16
BUNDLE ?= bundle

all: clean build

build: ${BUNDLE}/$(MODEL).o

${BUNDLE}/$(MODEL).o : $(PROFILE)
	@echo 'Build the bundle object $@:'
	${GLOWBIN}/model-compiler \
		-load-profile=$< \
		-model=$(MODEL).onnx \
		-model-input=$(MODELINPUT) \
		-emit-bundle=$(BUNDLE) \
		-quantization-schema=symmetric_with_power2_scale \
		-quantization-precision=$(PRECISION) \
		-quantization-precision-bias=$(PRECISION) \
		-backend=NMP \
		-dump-graph-DAG=$(MODEL)-quant.dot

clean:
	rm -f ${BUNDLE}/$(MODEL).o

When the model is compiled, the quantization parameters are chosen in such a way that, for the given profile, no saturation occurs. Although this makes sense at first glance, there is actually a trade-off when choosing the quantization parameters for a given tensor: it might be beneficial overall to choose parameters that provide a smaller quantization step (i.e. a smaller scale parameter), which means a better representation of most of the tensor values (the bulk of the histogram), at the expense of saturating the extreme values (outliers). You can see the result of the quantization by looking at the .dot file generated by the optional flag -dump-graph-DAG.

Measured accuracy of resnet18 (ONNX version 1.2.1, opset version 7) running on the CPU and on the DQ1-A0 board of the NMP processor:

Target  Top-1 accuracy (%)  Top-5 accuracy (%)
CPU          69.93               89.29
NMP          67.40               88.01

2: The NeuroMorphic Processor (NMP) was developed by LG Electronics (LGE) as an accelerator device. The key idea behind the NMP architecture is to use RISC-V ISA extensions to design relevant CNN instructions such as convolution layers, fully-connected layers, pooling layers, element-wise operations, etc. The NMP architecture is a multicore NPU that contains an ARM57 processor that works as a host for a set of Tile (TLE) processors, each containing a set of Tilelet (TLT) cores. Each TLT has one RISC-V core and three on-chip (scratchpad) memories, which respectively store the input, weight, and output tiles of the operation data maps. Besides that, each TLT is also equipped with a MAC acceleration unit to execute CNN operations. The MAC unit execution is triggered by the RISC-V core and is capable of executing 8- and 16-bit fixed-point operations with the memory layout organized in NCHW format.

2.8 Glow Optimization Passes

This section describes the target-independent optimizations performed by Glow.

Glow has two different optimizers: the graph optimizer and the IR optimizer. The graph optimizer performs optimizations on the graph representation of a neural network model. The nodes of the graph usually represent coarser-grained operations than those represented by the IR instructions, and these operations do not explicitly represent memory allocations and buffers. The IR optimizer performs a number of optimizations on the IR representation of a neural network model; these are closer to classic compiler optimizations and are only listed briefly at the end of this section.

The optimizations have two major objectives. One is to improve the performance of training and inference steps. The other one is to reduce the memory consumption during the execution of neural network models. It is worth mentioning that performing optimizations to reduce memory consumption is easier at the IR level, because memory allocations and deallocations are explicitly represented in the IR, whereas they are not explicit in the graph representation.

Set of supported graph optimizations

Below you can see a non-exhaustive list of graph optimizations that are supported by Glow:

  • Dead code elimination (DCE)

    This optimization removes computations whose results or side effects are not used.

  • Optimization of transpose nodes

    This optimization combines multiple consecutive transpose nodes into a single node, eliminates identity transpose nodes, and optimizes transpose nodes into reshape nodes when they actually move no data (a small sketch of how consecutive transposes combine appears after this list).

  • Sinking of transpose operations below other operations

    This optimization sinks transposes below such operations as batch normalization, RELU, sigmoid, ChannelShuffle, etc. By doing this, many transpose operations are brought closer to each other, which creates more opportunities for the elimination of transpose operations.

  • Pool operations optimization

    This optimization swaps the order of Relu->MaxPool to perform the RELU operation on a smaller tensor. This is not a major performance win: the RELU operation takes a small fraction of the time, and reordering the nodes does not provide much speedup. However, reordering the buffers allows us to reuse the memory buffer of the pool operation and potentially save memory.

  • Optimization of concat nodes

    This optimization merges multiple consecutive concat nodes into a single concat node.

  • Common sub-expression elimination (CSE)

    This optimization performs a classic CSE with the goal of avoiding recomputation of any results that were already computed.

  • Optimization of ReduceMean nodes

    This optimization substitutes a ReduceMean node with an AvgPool node if the reduce parameters are suitable: the input is 4D and the last two dimensions are reduced.
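
As referenced in the transpose item above, the snippet below is a small illustration (not Glow's code) of why consecutive transpose nodes collapse into one: composing the two permutation masks yields the mask of the single equivalent transpose.

#include <cstddef>
#include <vector>

// If Y = transpose(X, p1) and Z = transpose(Y, p2), then Z = transpose(X, p)
// with p[i] = p1[p2[i]], so the two nodes can be merged into one.
std::vector<unsigned> composeTransposeMasks(const std::vector<unsigned> &p1,
                                            const std::vector<unsigned> &p2) {
  std::vector<unsigned> p(p2.size());
  for (size_t i = 0; i < p2.size(); i++)
    p[i] = p1[p2[i]];
  return p;
}
// An identity result (p[i] == i for all i) means the combined transpose
// moves no data and can be dropped or turned into a reshape.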

Quantization specific optimizations

The majority of the common optimizations above can also be used on a quantized graph. In addition to those, there are quantization-specific optimizations:

  • Quantize(Dequantize(X)) -> RescaleQuantized(X)

    If the Quantize-Dequantize sequence does not change the type, this sequence is simply dropped without adding a nop RescaleQuantized node. If the Dequantize node has an input type that is different from the Quantize node output type, then a RescaleQuantized node replaces the Quantize-Dequantize pair (a small numeric sketch of rescaling appears after this list).

  • Dequantize(Quantize(X))

    A sequence of Dequantize(Quantize(X)) is a nop transformation and can be completely removed.

  • Constants optimization

    Constants that have a single use can be quantized at the optimization phase. This optimization replaces Quantize(Constant) with just a Constant whose quantized weights are updated based on the quantization parameters from the Quantize node.

  • RescaleQuantized(Max(X,Y)) -> Max(RescaleQuantized(X), RescaleQuantized(Y))

    It's OK to rescale the operands because even if the output range is smaller, truncation would have happened during the rescaling anyway. For values that are outside of the range, we just move the truncation to a different location.

  • Combine RescaleQuantized operator up into the operation

    There are a number of operations that can operate on varying quantized parameters for the output type. It's safe to merge the RescaleQuantized node into the operator itself if the operator supports this, e.g., add, mul, etc.

    This optimization can be applied to:

    • Add
    • Sub
    • Mul
    • Div
    • Min
    • Max
    • Convolution
    • Splat
  • Combine Arithmetic operations into Batch Normalization.

    When a chain of Arithmetic nodes (each operating on a constant on one side) is right below the BatchNorm node, the chain is folded into the BatchNorm node. When a chain of Arithmetic nodes (each operating on a constant on one side) is right below the Convolution node, the chain is folded into a BatchNorm.

  • Combine RescaleQuantized operator down into the operation

    This optimization allows for eliminating redundant rescale operations when the next operation supports quantized inputs of different scales and offsets, e.g., normal arithmetic operations: Add, Sub, Mul, Div, Min, Max.

  • Sinking RescaleQuantized operator below other operators

    This optimization sinks RescaleQuantized node below such operations as slice, reshape, transpose, etc. By doing this, many RescaleQuantized operators are brought closer to each other, and it creates more opportunities for the elimination of RescaleQuantized operations.

  • RescaleQuantized(Quantize(X)) -> Quantize(X)

    A sequence of Quantize operation followed by RescaleQuantized operation is replaced by a single Quantize operation with the proper quantization parameters based on the RescaleQuantized operation.

  • Eliminate Max operation in Max(Splat(X), someOperand) or Max(someOperand, Splat(X))

    Splat and Max operations can be completely eliminated if the Splat value cannot impact the result of the Max operation. For example, Max and Splat are removed if the Splat value is smaller than the smallest possible value of the other operand. The smallest possible value of the operand can be calculated based on the quantization parameters, which represent the quantization range [min, max] in fp32.
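
As referenced above, a RescaleQuantized node simply re-expresses a quantized value under new quantization parameters. The sketch below (illustrative, symmetric int8 case) shows the arithmetic and why saturation can occur when the new range is narrower:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Re-express a symmetric int8 value quantized with scale fromScale under a
// new scale toScale; clipping happens when the target range is narrower.
int8_t rescaleQuantized(int8_t q, float fromScale, float toScale) {
  float real = fromScale * q;                  // dequantize
  long requant = std::lround(real / toScale);  // requantize
  return (int8_t)std::min(127L, std::max(-128L, requant)); // saturate
}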

Configuring a graph optimization pipeline

The graph optimizations listed above are each formulated as a FunctionPass, which is run on a Function (graph). A series of FunctionPasses, along with how to configure each one (via a FunctionPassConfig), constitutes a pipeline, which is passed to a PassManager that executes them. A pipeline is simply a vector of FunctionPassConfigs (a sketch of assembling one follows the list below). FunctionPassConfig is a class made up of:

  • FunctionPassID: An ID corresponding to a specific FunctionPass. For example, FunctionPassID::OptimizeArithmeticNodes. Default: EmptyPass (no-op pass).

  • ConvergenceMode: An enum corresponding to whether this FunctionPass should be run a single time (OnePass) or repeatedly until a fixed point is reached (UntilFixedPoint). Default: OnePass.

  • DCERequiredMode: An enum representing whether DCE is required before a pass is run (BeforePass), or not at all (None). Running DCE is often required in order to make sure the number of users for each Node is up to date. Default: BeforePass.
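
A hedged sketch of assembling such a pipeline is shown below. It assumes the class names described above (FunctionPassPipeline, FunctionPassManager) with initializer-list construction as found in recent Glow sources; the exact constructors may differ between versions.

// Build a small pipeline: run DCE once, then repeat arithmetic-node
// optimization until a fixed point is reached. Fields omitted from each
// FunctionPassConfig keep their defaults (OnePass, DCE BeforePass).
FunctionPassPipeline pipeline{
    {FunctionPassID::DCE},
    {FunctionPassID::OptimizeArithmeticNodes, ConvergenceMode::UntilFixedPoint},
};

FunctionPassManager FPM("MO436Pipeline", pipeline);
FPM.run(F, cctx); // F: the Function (graph); cctx: the CompilationContext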

Set of supported IR optimizations

Below you can see the list of currently supported optimizations:

  • Peephole optimizations

    These are small, local optimizations that look for specific sequences of instructions and replace them with more efficient sequences of instructions.

  • Dead store elimination (DSE)

    This optimization removes stores into weights or allocations if it can prove that the results of these stores are never going to be used.

  • Deallocations hoisting

    This optimization tries to place the buffer deallocation instructions right after the last use of a buffer. Doing so reduces the lifetime of the buffer and makes the freed memory available for the allocation of other buffers. It improves the memory consumption.

  • Allocations sinking

    This optimization tries to place the buffer allocation instructions right before the first use of a buffer. Doing so reduces the lifetime of the buffer and makes the unused memory available for the allocation of other buffers. It improves the memory consumption.

  • Dead allocations removal

    This optimization finds and removes allocations that are just allocated and deallocated, but are never used. Such situations may happen e.g. after performing other allocations. Performing this optimization improves the memory consumption.

  • Making weights constant

    This optimization marks weights that are never mutated as constant. This may allow for placing such weights in read-only memory segments and sharing them between simultaneous executions of the same neural network model.

  • Sharing of buffers

    The purpose of this optimization is to reduce the memory usage by reusing the memory buffers as much as possible. The overall idea is that it is fine to combine storage for two live intervals if they do not overlap. Typically, two live intervals are considered as candidates for sharing if they occur in the same instruction.

  • Stacking of data-parallel operations

    Stacking tries to combine multiple data parallel (i.e. element-wise) operations that work with the same shape of tensors into a single kernel.

    Executing such a kernel should be in theory more efficient than executing those operations sequentially one after the other, because such a combined kernel exposes a better cache locality.

    The stacked kernels should provide even more advantages on GPUs, because they reduce the number of kernel threads launches, which are rather expensive operations.

When compiling the model, you can visualize the transformations and optimizations performed by Glow using the -print-graph-passes and -print-ir-passes options like this:

model-compiler -model=lenet.onnx \
    -print-graph-passes \
    -print-ir-passes \
    ...

We also have the option to view the IR after listed passes (comma separated pass names). For example:

model-compiler -model=lenet.onnx \
    ... \
    -dump-ir-after-passes=DeleteDeadAllocs > lenet.lir

Part Three

2.9 Project P2

Project P2 requires you to implement a naive convolution on the CPU backend. Then, using the ONNX models resnet18, mobilenet, or squeezenet from the work repository, verify the correctness of your implementation and compare the performance of your convolution with the standard convolution used by the CPU backend. The steps for this experiment are:

  • It is useful to define a compilation flag that turns the code generation for your convolution on and off. Let's call this flag, for example, MO436-features, belonging to the CPU backend category. Here are the files you should modify:

    - include/glow/LLVMIRCodeGen/CommandLine.h
    - lib/LLVMIRCodeGen/CommandLine.cpp
    

    Don't forget to include CommandLine.h in the CPUBackend header file so that this flag is visible to the CPU backend modules:

    - lib/Backends/CPU/CPUBackend.h
    
  • Create a new Backend-Specific Node and a new Backend-Specific Instruction for the convolution that you will implement (see the ClassGen documentation for more details):

    - lib/Backends/CPU/ClassGen/CPUSpecificNodes.h
    - lib/Backends/CPU/ClassGen/CPUSpecificNodesVerification.h
    - lib/Backends/CPU/ClassGen/CPUSpecificInstrs.h
    - lib/Backends/CPU/ClassGen/CPUSpecificInstrsVerification.h
    
  • Disable support for fusing activations for the MO436 convolution:

    - lib/Backends/CPU/CPUBackend.cpp
    
  • Generate code in the CPUBackend::transformPostLowering function to replace the generic convolution with the new one you created:

    - lib/Backends/CPU/Transforms.cpp
    

    For the sake of completeness and correctness, we will need to bypass depthwise convolutions. Depthwise convolution is a type of convolution in which a single convolutional filter is applied to each input channel. In a regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. In Glow, regular 2D convolutions have group=1, so you should discard depthwise convolutions like this:

    if (CN->getGroup() != 1) {
      return nullptr;
    }
    
  • Using as a reference the code generation for the convolution instruction in the CPULLVMIRGen.cpp module (CPUConvDKKC8InstKind) and its implementation (libjit_convDKKC8_f) in the libjit_cpu_conv.cpp module, design and implement your convolution and the code generation to invoke it (pay attention to the prefix and suffix conventions used in the function names; a naive reference kernel is sketched after this list):

    -  lib/Backends/CPU/CPULLVMIRGen.cpp
    -  lib/Backends/CPU/libjit_cpu/libjit_cpu_conv.cpp
    
  • Finally, tell Glow how it should represent this new convolution in the ONNX model:

    -  lib/Backends/CPU/ONNX/CPUONNXModelWriter.cpp
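
As referenced above, the sketch below shows what a naive direct convolution kernel could look like in the libjit style (float data, NHWC input/output layout, DKKC filter layout, group = 1, square kernel, zero padding). The function name, parameter list, and layouts are assumptions for illustration; your kernel must match the signature emitted by your code generation in CPULLVMIRGen.cpp.

#include <cstddef>

void libjit_mo436_conv_f(float *out, const float *in, const float *filter,
                         const float *bias, size_t N, size_t H, size_t W,
                         size_t C, size_t D, size_t K, size_t stride,
                         size_t pad, size_t outH, size_t outW) {
  for (size_t n = 0; n < N; n++)               // batch
    for (size_t d = 0; d < D; d++)             // output depth (filters)
      for (size_t oy = 0; oy < outH; oy++)     // output rows
        for (size_t ox = 0; ox < outW; ox++) { // output columns
          float sum = bias[d];
          for (size_t ky = 0; ky < K; ky++)
            for (size_t kx = 0; kx < K; kx++) {
              long iy = (long)(oy * stride + ky) - (long)pad;
              long ix = (long)(ox * stride + kx) - (long)pad;
              if (iy < 0 || ix < 0 || iy >= (long)H || ix >= (long)W)
                continue; // zero padding: skip out-of-bounds taps
              for (size_t c = 0; c < C; c++)   // input channels
                sum += in[((n * H + iy) * W + ix) * C + c] *
                       filter[((d * K + ky) * K + kx) * C + c];
            }
          out[((n * outH + oy) * outW + ox) * D + d] = sum;
        }
}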
    

The delivery must include one ONNX model graph, after being optimized by Glow, in PDF format, and a report showing the accuracy (top-1 & top-5) and the speedup (or slowdown) of your implementation compared with the default implementation in Glow, for the 1000 PNG images contained in datasets/imagenet, for one of the three models available in the work repository, namely resnet18, mobilenet, or squeezenet.
