2. Glow Platform
This syllabus is divided into three parts. Part One, comprising sections 2.1 through 2.4, presents a step-by-step reference guide for using the Glow platform. It contains instructions so that one can reproduce everything from the installation of the environment and its configuration to the complete execution of inference with a reference model. Part Two, comprising sections 2.5 through 2.8, covers the CPU backend and how Glow provides support for architecture-specific nodes and instructions; next, it discusses the quantization feature and some architecture-independent optimizations that Glow provides. Part Three, comprising section 2.9, describes the P2 project and gives a practical guide for its implementation.
The first three sections, Building the Environment, Compiling the Model, and Running the Inference, cover, respectively, the preparation of the environment, the compilation of the model, and its execution. All these steps assume you are using the host machine directly with your main operating system. The fourth section (preferable 👍), Using Containers, describes an alternative way to prepare the environment using Docker containers while sharing a work folder with the host machine.
- Note: The reference operating system used was Ubuntu 20.04, which ships with clang/llvm 10.0, so the default clang/llvm compiler version used was 10. You can choose to install a version other than 10, but you should not install two releases at the same time. If a different version is a requirement for your environment, you will need to replace the build line in item 7:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
with this:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=/usr/lib/llvm-XX/lib/cmake/llvm ..
where `XX` is the version number you would like to use.
- First, it is important to update your system. The command below also installs some dependencies needed to set up our environment:
sudo apt-get update && sudo apt-get install clang clang-tools \
cmake graphviz libpng-dev libprotobuf-dev llvm llvm-dev \
ninja-build protobuf-compiler wget libgoogle-glog-dev \
libboost-all-dev libdouble-conversion-dev libevent-dev libssl-dev \
libgflags-dev libjemalloc-dev libpthread-stubs0-dev liblz4-dev \
libzstd-dev libbz2-dev libsodium-dev libfmt-dev pkg-config \
apt-utils libfmt-dev libc6-dbg gdb valgrind git git-lfs doxygen \
libopenblas-dev
- From a working directory named, for example, `mo436`, clone the repositories below:
git clone https://github.com/pytorch/glow.git
git clone https://github.com/MO436-MC934/work.git
- Glow depends on a few submodules: googletest, onnx, and a library for FP16
conversions. To get them, from the `mo436` directory, run:
cd glow
git submodule update --init --recursive
cd ..
- Glow depends on `fmt`, which must be built from source:
git clone https://github.com/fmtlib/fmt
mkdir fmt/build
cd fmt/build
cmake ..
make
sudo make install
- Before configuring and building Glow, it may be desirable to use `update-alternatives` to manage the versions of clang/clang++ and python:
sudo update-alternatives --install /usr/bin/clang clang \
/usr/lib/llvm-10/bin/clang 100
sudo update-alternatives --install /usr/bin/clang++ clang++ \
/usr/lib/llvm-10/bin/clang++ 100
sudo update-alternatives --install /usr/bin/python python \
/usr/bin/python3 30
- Glow uses the system default C/C++ compiler (`/usr/bin/c++`), and so you may also want to switch your default C/C++ compiler to clang:
sudo update-alternatives --config cc
# Select the option corresponding to /usr/bin/clang
sudo update-alternatives --config c++
# Select the option corresponding to /usr/bin/clang++
- To build the Glow compiler, create a build directory and run `cmake` (assuming you are in the `mo436` folder). This is a very time-consuming process, especially when compiling Glow from scratch:
cd glow
mkdir build
cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
ninja all
Building the documentation can be enabled by passing an additional cmake parameter, `-DBUILD_DOCS=ON`. The output will be placed in the `docs/html` subdirectory of the build output directory.
- To use our scripts, you must set the environment variable that defines the path to the Glow binaries so that you can compile the models:
export GLOWBIN=/path/to/glow/build/bin
- Compiling the CNN model `mnist`.
After completing the previous steps, you are ready to compile and run a CNN model. The `work` directory (created inside the `mo436` directory) contains a set of ONNX models for image classification. The example in this guide is based on MNIST handwritten digit recognition (`mnist`) and uses only the default configuration parameters.
cd /work/models/mnist
Compile the model to produce the bundle objects as shown below. Note that the compilation uses a set of default flags; a brief description of each compilation flag and how it works can be found in the tool's help output.
make -f compiler.mk
Build the application that guides the inference execution on the CPU board:
make -f builder.mk
After this step, the workflow produces three files in the `bin` folder of the `mnist` directory, namely: `main.x`, `mnist.weights.bin`, and `mnist.weights.txt`.
- Executing the Inference.
The execution process obeys the directory structure below:
── work
├── datasets
│ ├── imagenet
│ ├── mnist
├── models
│ ├── mnist
│ │ ├── bin
│ │ │ ├── main.x
│ │ │ ├── mnist.weights.bin
│ │ │ ├── mnist.weights.txt
│ ├── ...
├── scripts
│ ├── exec_accuracy.sh
├── ground_truth_imagenet.txt
├── ground_truth_mnist.txt
├── measure_acc.cpp
To manually perform the inference for the images contained in the /work/datasets/mnist folder, do:
cd bin
./main.x ../../../datasets/mnist/*.png
The mnist application will show the top-5 predictions and the confidence of the top-1 prediction for each image in the dataset. We also provide execution scripts that can be used to measure accuracy automatically. The script first runs inference over a set of images, recording the top-1 and top-5 predictions, and then computes the top-1 accuracy, top-5 accuracy, precision, recall, and F1-score. To run the scripts, go to the /work/scripts folder and do:
./exec_accuracy.sh -m mnist
The use of containers is often preferable because it isolates the tools and their dependencies from the main user environment, thereby avoiding conflicts between versions of the same tool and, more importantly, of the system libraries they depend on.
The steps that are common to using the tools, whether through containers or not, have already been described in the previous sections and, therefore, will only be referenced here. This section focuses only on what differs when setting up the environment with containers. If you are new to containers, see the documentation here.
- Creating the Dockerfile
The first step is to create a file named Dockerfile with the contents of the script below. You can create this file in the same directory (in our example, `mo436`) where you will download the repositories (step 13, below).
FROM ubuntu:20.04
WORKDIR /mo436
ENV TZ=America/Sao_Paulo
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone
RUN apt-get -y update && apt-get install -y clang clang-tools \
cmake graphviz libpng-dev libprotobuf-dev llvm \
llvm-dev ninja-build protobuf-compiler wget libgoogle-glog-dev \
libboost-all-dev libdouble-conversion-dev libevent-dev libssl-dev \
libgflags-dev libjemalloc-dev libpthread-stubs0-dev liblz4-dev libzstd-dev \
libbz2-dev libsodium-dev libfmt-dev \
pkg-config apt-utils libfmt-dev libc6-dbg gdb valgrind \
git git-lfs doxygen libopenblas-dev zsh
RUN update-alternatives --install /usr/bin/clang clang /usr/lib/llvm-10/bin/clang 100 \
&& update-alternatives --install /usr/bin/clang++ clang++ /usr/lib/llvm-10/bin/clang++ 100 \
&& update-alternatives --install /usr/bin/cc cc /usr/bin/clang 100 \
&& update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++ 100 \
&& update-alternatives --install /usr/bin/python python /usr/bin/python3 30
ENV GLOWBIN="/mo436/glow/build/bin"
RUN git clone https://github.com/fmtlib/fmt.git \
&& cmake -S fmt -B fmt/build \
&& cmake --build fmt/build \
&& cmake --install fmt/build \
&& rm -rf fmt
- Building the Container
To build the image, run the following command (note the presence of a dot
at the end of the command line). Depending on the computing power of your
machine, this build process may take a while.
docker build -t ubuntu:mo436 .
- Running the Container
First, clone the repositories (steps 2 and 3 of the first section). Then run the container, mapping the current directory (the `mo436` folder) of your host machine to the container's /mo436 working directory:
docker run --rm -it -v $PWD:/mo436 ubuntu:mo436 zsh
- Compiling the Glow/CPU
With the container running, you can compile Glow with the command below (don't forget to check that you are on the desired branch):
cmake -S glow -B glow/build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build glow/build
This is similar to step 8 in the section Building the Environment. After that, you can follow steps 9 and 10 to compile and run the model. Also remember that this is a very time-consuming process, especially when compiling Glow from scratch.
This section describes the motivation behind the Glow intermediate representation (IR) and some implementation details.
The high-level IR is a dataflow node-based graph representation that is similar to a graph that you may find in ONNX format. When Glow loads a neural network model from a file, it constructs this graph with a direct translation of one operator to one or more nodes. The high-level IR is a simple graph that allows basic transformations such as replacing all uses of some node with another node and modifying the content of constant nodes. The graph is strongly typed, which means that inputs and outputs have a known tensor type (consisting of the tensor's shape and element type) and that the types of nodes are verified by the compiler. For example, the element-wise add instruction must operate on operands of the same type.
Constants are special nodes that represent tensors that are a part of the graph. These nodes can be used to represent things like the weights of neural networks. Constants are immutable during the execution of the program, but graph optimizations can access the constants and modify them. This feature is useful for transformations that prepare the weights by transposing them or quantizing them before the execution of the program.
Placeholders are symbolic nodes that are not backed by a concrete tensor during the compilation of the program. Inputs and outputs of Glow programs should be modeled using Placeholder nodes. Concrete tensors are attached to placeholder nodes during the compilation of the program, and not before. This means that, unlike constants, the optimizer can't inspect or mutate the content of Placeholder nodes. The same program could be compiled using different bound tensors without changing the semantics of the program.
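To make the distinction between constants and placeholders concrete, the sketch below builds a tiny graph through Glow's C++ interface. It is a minimal illustration assuming the usual Graph API (`Module::createPlaceholder`, `Module::createConstant`, `Function::createFullyConnected`, `Function::createSave`); exact signatures may vary between Glow versions.

```cpp
#include "glow/Graph/Graph.h"

using namespace glow;

// Build a tiny strongly-typed graph: a FullyConnected layer whose weights and
// bias are Constants and whose input/output are Placeholders.
void buildTinyGraph(Module &mod) {
  Function *F = mod.createFunction("main");

  // Placeholder: symbolic input, bound to a concrete tensor only at runtime.
  auto *input =
      mod.createPlaceholder(ElemKind::FloatTy, {1, 32}, "input", false);

  // Constants: immutable weights that the optimizer may inspect and transform.
  auto *weights = mod.createConstant(ElemKind::FloatTy, {32, 10}, "weights");
  auto *bias = mod.createConstant(ElemKind::FloatTy, {10}, "bias");

  auto *fc = F->createFullyConnected("fc", input, weights, bias);

  // The result is written to an output Placeholder through a Save node.
  F->createSave("save", fc);
}
```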
Instead of compiling high-level operators directly, Glow performs "node lowering". In this phase, the compiler breaks the high-level operator nodes into low-level linear algebra operator nodes. For example, the FullyConnected layer is represented as a matrix multiplication followed by a broadcasted add. Different compiler backends do not have to implement the FullyConnected layer and a dozen other high-level opcodes, just the low-level matrix multiplication.
This lowering phase drives many of the design decisions of the compiler. In Glow, lowering is performed as part of the high-level graph as described above, prior to moving to low-level IR. This is due to a number of reasons. First, the new lowered graph may allow for additional graph-level optimizations. Second, the new graph structure may affect the decisions of the instruction scheduler. And third, after lowering we allow the backends to perform additional target-specific optimizations on the lowered graph.
The lowering phase comes after the graph is differentiated. Because the lowering transformation does not preserve the semantics of the graph, it is not possible to differentiate the graph for certain operators. For example, the Regression node (which produces gradient when optimizing total squared error) becomes a no-op for the inference case, but is translated into an element-wise subtract for the training case. Performing the lowering before differentiation would prevent us from performing the correct lowering of the Regression node.
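As a concrete illustration of what lowering a FullyConnected node means, the plain loops below spell out "matrix multiplication followed by a broadcasted add". This is only the mathematical semantics, not Glow code.

```cpp
#include <cstddef>
#include <vector>

// Semantics of lowering FullyConnected(X, W, B) into MatMul + broadcasted add.
// Shapes: X[N][K], W[K][M], B[M], Y[N][M].
void fullyConnectedLowered(const std::vector<float> &X,
                           const std::vector<float> &W,
                           const std::vector<float> &B, std::vector<float> &Y,
                           size_t N, size_t K, size_t M) {
  // MatMul: Y = X * W
  for (size_t n = 0; n < N; n++)
    for (size_t m = 0; m < M; m++) {
      float acc = 0.f;
      for (size_t k = 0; k < K; k++)
        acc += X[n * K + k] * W[k * M + m];
      Y[n * M + m] = acc;
    }
  // Broadcasted add: Y += B (B is broadcast along the batch dimension).
  for (size_t n = 0; n < N; n++)
    for (size_t m = 0; m < M; m++)
      Y[n * M + m] += B[m];
}
```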
When compiling the model, it is useful to view the final form of the graph after all the transformations and optimizations performed by Glow (which might differ from the initial model). You can generate visual representations of the graph in .dot format by using the `-dump-graph-DAG` and `-dump-graph-DAG-before-compile` options like this:
model-compiler -model=lenet.onnx \
-dump-graph-DAG-before-compile=lenet-before.dot \
-dump-graph-DAG=lenet-after.dot \
...
Additionally, you can convert the .dot files to .pdf format using the dot utility available on Linux like this:
dot -Tpdf lenet-before.dot -o lenet-before.pdf
dot -Tpdf lenet-after.dot -o lenet-after.pdf
After optimizing the graph with target-independent optimizations, and lowering from high-level operator nodes to linear algebra operator nodes, the code is further lowered into the low-level IR in a phase that is called "IRGen" (which stands for IR generation). This is a one-to-many translation where each high-level node is translated into one or more instructions.
The low-level IR enables a different kind of target-independent optimizations that are not possible with the high-level graph format. This is an instruction-based representation that operates on tensors that are referenced by address. This gives the compiler the ability to perform low-level memory optimizations that are not possible at the high-level, because memory is not represented directly. An example of such a transformation is the optimization that allows certain operations to transform some buffers in-place, such as element-wise arithmetic.
The IR is strongly typed and each instruction operand kind has known parameter types. The IR is designed to be used as an in-memory form, though it can be dumped into a human-readable assembly-like format.
A function in IR form contains two sections: `declare` and `program`. In the first section of the IR, we declare a number of memory regions that live throughout the lifetime of the program. This is similar to global variables in C. The second part of the IR is a list of instructions. Each variable is annotated with the kind of initialization that the program should do.
There are two kinds of memory regions that correspond to these two sections: global memory regions (found in `declare`) and locally allocated regions (found in `program`). The locally allocated memory regions are similar to `alloca` in LLVM IR. Memory regions are strongly typed, which means that the type of tensor that the region represents is known.
Instructions operate on either global variables or locally allocated buffers. Each operand is annotated with one of the qualifiers '@in'/'@out'/'@inout': '@in' means that the buffer is read from, '@out' means that the buffer is written into, and '@inout' means that the instruction may both read from and write into the buffer. These operand qualifiers help the optimizer decide when it is legal to perform certain optimizations, such as copy elimination or buffer sharing. Instructions may have other attributes that specify the legality of some optimizations. For example, some instructions require that the data from the forward pass be kept around for the backward pass, so if the program is not optimized for inference-only mode then certain memory optimizations cannot happen.
Below is an example of unoptimized Glow IR. Note that the `alloc` instruction does not allocate memory; it just marks the lifetime of the activation. The low-level memory allocator is responsible for allocating all of the buffers into a single coalesced region.
declare {
%input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
%filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
%filter0 = weight float<16>, broadcast, 0.100
%weights = weight float<10 x 144>, xavier, 144.0
%bias = weight float<10>, broadcast, 0.100
%selected = weight index<8 x 1>
...
%result = weight float<8 x 10>
}
program {
%allo = alloc float<8 x 28 x 28 x 16>
%conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias0
%allo0 = alloc float<8 x 28 x 28 x 16>
%relu = relu @out %allo0, @in %allo
%allo1 = alloc index<8 x 9 x 9 x 16 x 2>
%allo2 = alloc float<8 x 9 x 9 x 16>
%pool = pool max [3 3 0] @out %allo2, @in %allo0, @inout %allo1
...
%deal6 = dealloc @out %allo6
%deal7 = dealloc @out %allo7
%deal8 = dealloc @out %allo8
%deal9 = dealloc @out %allo9
}
We have the option to print the IR after optimizations to stdout (or redirect it to a file). For example:
model-compiler -model=lenet.onnx \
... \
-dump-ir > lenet.lir
This is a high-level overview of the compilation process:
- The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface.
- The graph is differentiated if needed.
- The graph is optimized.
- Linear algebra node lowering takes place.
- Additional rounds of optimizations occur, both target-independent and target-specific.
- The graph is scheduled into a linear sequence of nodes that minimizes memory usage.
- IRGen converts the low-level graph into instructions.
- Low-level IR optimizations are performed.
- Backend-specific optimizations and code generation are performed.
This section gives an overview of the Glow infrastructure required by the CPU processor. Details about each one of the changes and additions are described as comments in the code.
The implementation of the CPU backend is contained in the subdirectory `lib/Backends/CPU`. The CPU backend is registered through its own registration factory (see file `CPUFactory.cpp`) in order to be discovered by Glow. The CPU backend is derived from the abstract base class `Backend` and implements the virtual function `compile`, which takes a `Function` and the provided CPU backend options and compiles them.
Additionally, there are several virtual functions that the CPU backend overrides (a minimal sketch of such overrides is shown after this list):
- `isOpSupported` returns whether the provided node instruction is supported by the CPU backend;
- `supportsFusedActivation` returns whether the node instruction supports fused activations and which kinds of node instructions can be fused;
- `shouldLower` prevents lowering for some nodes, like the `relu` node instruction, since the CPU runtime library has a layer that implements it;
- `save`, where the provided `Function` is compiled and then an object file and a header file are saved into the bundle folder.
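The sketch below illustrates the kind of overrides described above. The class name is hypothetical and the method signatures are approximations of the Glow `Backend` interface (other required virtual methods such as `compile` are omitted); check `lib/Backends/CPU/CPUBackend.h` for the exact declarations.

```cpp
#include "glow/Backend/Backend.h"

using namespace glow;

// Hypothetical backend subclass showing the flavor of the overrides above.
class MO436CPUBackend : public Backend {
public:
  // Report whether a node kind (with its element types) can be code-generated
  // by this backend.
  bool isOpSupported(const NodeInfo &NI) const override {
    switch (NI.getKind()) {
    case Kinded::Kind::ConvolutionNodeKind:
    case Kinded::Kind::ReluNodeKind:
      return true;
    default:
      return false; // Reject anything else.
    }
  }

  // Keep Relu nodes un-lowered: the runtime library (libjit) already provides
  // an efficient implementation for them.
  bool shouldLower(const Node *N) const override {
    return N->getKind() != Kinded::Kind::ReluNodeKind;
  }
};
```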
Backends in Glow have the opportunity to perform their own analysis and transformations after lowering. This is exposed via the `transformPostLowering()` hook, during which a backend can transform the graph however it desires. In our example, the CPU backend uses `transformPostLowering()`, defined in the file `lib/Backends/CPU/Transforms.cpp`, to search the graph for nodes like `Convolution` and replace them with a CPU-specific node called `CPUConvDKKC8`. The `CPUConvDKKC8` node operates on filter weight data in a non-standard format. The default format is DKKC, where D is the output depth of the filter, K is the kernel size, and C is the input channel. This optimization changes the data layout to [D/8, K, K, C, 8]. Glow pre-swizzles the data in the weights to make the access pattern more efficient.
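The sketch below shows what this pre-swizzling means in terms of indexing. The function is purely illustrative (the real transform lives in `lib/Backends/CPU/Transforms.cpp`) and assumes, for simplicity, that D is a multiple of 8.

```cpp
#include <cstddef>
#include <vector>

// Repack filter weights from the default DKKC layout into [D/8, K, K, C, 8]
// so that 8 consecutive output depths become contiguous in memory.
std::vector<float> packDKKC8(const std::vector<float> &src, size_t D, size_t K,
                             size_t C) {
  std::vector<float> dst(D * K * K * C);
  for (size_t d = 0; d < D; d++)
    for (size_t kh = 0; kh < K; kh++)
      for (size_t kw = 0; kw < K; kw++)
        for (size_t c = 0; c < C; c++) {
          size_t srcIdx = ((d * K + kh) * K + kw) * C + c; // DKKC
          size_t dstIdx =                                  // [D/8, K, K, C, 8]
              ((((d / 8) * K + kh) * K + kw) * C + c) * 8 + (d % 8);
          dst[dstIdx] = src[srcIdx];
        }
  return dst;
}
```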
To do this, the CPU backend creates its own custom Node and Instruction. This is done via ClassGen¹ and implicitly included in `tools/ClassGen/NodeGen.cpp` and `tools/ClassGen/InstrGen.cpp`. The new node and instruction are defined inside the backend sub-directory, in the files `lib/Backends/CPU/ClassGen/CPUSpecificNodes.h` and `lib/Backends/CPU/ClassGen/CPUSpecificInstrs.h`.
1: Glow uses automatic code generation techniques (ClassGen) for defining instructions and nodes, and to help developers maintain records of domain-specific information. The current system is capable of generating two kinds of classes: Nodes for the high-level IR and Instructions for the low-level IR.↩
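For orientation, a backend-specific node definition in these ClassGen headers looks roughly like the sketch below. The node name and members are hypothetical, and the builder method names are assumptions modeled on the existing CPU-specific definitions, so check `CPUSpecificNodes.h` for the exact API.

```cpp
// Hypothetical ClassGen entry in the style of
// lib/Backends/CPU/ClassGen/CPUSpecificNodes.h. These headers are included
// inside NodeGen, where the builder object `BB` is available.
BB.newNode("CPUMyConv")
    .addInput("Input")
    .addInput("Filter")
    .addInput("Bias")
    .addMember(MemberType::VectorUnsigned, "Kernels")
    .addMember(MemberType::VectorUnsigned, "Strides")
    .addMember(MemberType::VectorUnsigned, "Pads")
    .addResultFromCtorArg()
    .setDocstring("A CPU-specific 2D convolution (illustrative only).");
```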
In Glow, a bundle is a self-contained compiled network model that can be used to execute the model in a standalone mode. The CPU backend generates bundles as object files containing all the necessary code to run the inference. The CPU backend also places a header file in the bundle that contains the information needed by the application that performs inference with the model.
The command used to build a bundle for the CPU backend is the following:
model-compiler -backend=CPU -model=<model-path> -emit-bundle=<bundle-dir>
After running the model-compiler tool, the following bundle artifacts will be generated in the output directory:
- `<network_name>.o` - the bundle object file (code).
- `<network_name>.h` - the bundle header file (API).
- `<network_name>.weights.bin` - the model weights in binary format.
- `<network_name>.weights.txt` - the model weights in text format, as a C text array.
The memory of a bundle is organized in three separate memory regions which must be allocated by the user application code and provided through the bundle interface:
- `constantWeight` - contains the model constant weights. The user application must:
  - allocate this memory region (statically or dynamically)
  - initialize this memory region with the content of the generated weights file, in one of two possible formats:
    - binary format (`<network_name>.weights.bin`), used to initialize this memory region (allocated statically or dynamically) by loading the binary file dynamically at run-time using standard C functions like fopen.
    - text format (`<network_name>.weights.txt`), used to initialize this memory region (only if statically allocated) by including the text file statically at compile-time as a C array using the #include pre-processor directive. This format is suitable for target architectures that do not have file systems (for example, microcontrollers).
  - provide the base address of this memory region to the inference function
- `mutableWeight` - contains all the model inputs and outputs (graph placeholders). The tensors corresponding to different inputs and outputs are identified using offsets relative to the base address of this memory region. The user application must:
  - allocate this memory region (statically or dynamically)
  - initialize the model input tensors in this memory region with the desired input data before running the inference
  - provide the base address of this memory region to the inference function
  - read the model output tensors from this memory region after running the inference
- `activations` - a scratch memory region required by the bundle code to store the intermediate results of the graph computation (activations). The user application must:
  - allocate this memory region (statically or dynamically)
  - provide the base address of this memory region to the inference function
  - note that this memory region is NOT required to be initialized
The required sizes for all the memory regions described above are provided in the bundle interface. Also, all the memory regions must be allocated with a minimum alignment which is also provided in the interface (typically 64 bytes).
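To show how the three regions and the bundle interface fit together, here is a minimal sketch of a user application for a hypothetical `mnist` bundle. The macro names for the region sizes and tensor offsets, as well as the entry-point signature, are illustrative assumptions; the actual symbols are declared in the generated `<network_name>.h` header.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

#include "mnist.h" // generated bundle header (API); name follows the model

// aligned_alloc requires the size to be a multiple of the alignment.
static uint8_t *allocRegion(size_t size) {
  size_t rounded = (size + 63) & ~size_t(63);
  return static_cast<uint8_t *>(std::aligned_alloc(64, rounded));
}

int main() {
  // Allocate the three memory regions with the alignment required by the
  // bundle interface (typically 64 bytes). The *_MEM_SIZE macros are
  // illustrative; the real names come from the generated header.
  uint8_t *constantWeight = allocRegion(MNIST_CONSTANT_MEM_SIZE);
  uint8_t *mutableWeight = allocRegion(MNIST_MUTABLE_MEM_SIZE);
  uint8_t *activations = allocRegion(MNIST_ACTIVATIONS_MEM_SIZE);

  // Initialize the constant region from the binary weights file.
  FILE *f = std::fopen("mnist.weights.bin", "rb");
  std::fread(constantWeight, 1, MNIST_CONSTANT_MEM_SIZE, f);
  std::fclose(f);

  // Fill the input tensor at its offset inside the mutable region
  // (the offset macro is illustrative as well).
  float *input = reinterpret_cast<float *>(mutableWeight + MNIST_input);
  // ... write a pre-processed image into `input` ...

  // Run the inference: the bundle entry point is named after the model.
  mnist(constantWeight, mutableWeight, activations);

  // Read the output tensor from the mutable region and post-process it.
  float *output = reinterpret_cast<float *>(mutableWeight + MNIST_output);
  // ... e.g., pick the argmax of `output` ...

  std::free(constantWeight);
  std::free(mutableWeight);
  std::free(activations);
  return 0;
}
```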
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms.
Glow is able to convert floating-point-based networks into signed 8-bit or 16-bit integer networks. The CPU backend requires that the network be quantized to signed 8-bit integers, while the NMP backend, developed at UNICAMP for the LGE NeuroMorphic Processor², requires that the network be quantized to signed 16-bit integers.
Glow's infrastructure supports, among other schemes, symmetric with power-of-2 scale, which produces quantized ranges centered on 0 (symmetric) but also restricts the scale parameter to be a power of 2.
Restricting the scale parameter to be a power of 2 might result in poor exploitation of the quantized range (poor accuracy) but has the potential to provide better performance. For this reason, it is an acceptable choice for quantizing the models targeted at the NMP.
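The sketch below illustrates this scheme for int16 under the assumption that the profiled range is symmetric, [-absMax, absMax]. It is a simplified illustration, not Glow's actual implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric power-of-2 quantization to int16 (assumes absMax > 0).
int16_t quantizeSymPow2(float x, float absMax) {
  // Smallest power-of-2 scale that covers the profiled range without
  // saturating: scale = 2^ceil(log2(absMax / 32767)).
  float scale = std::exp2(std::ceil(std::log2(absMax / 32767.0f)));
  // Symmetric scheme: offset is always 0, q = round(x / scale).
  float q = std::round(x / scale);
  q = std::min(std::max(q, -32768.0f), 32767.0f);
  return static_cast<int16_t>(q);
}
```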
The fastest inference performance is achieved by quantizing the model, and the best way to quantize the model without losing much accuracy is to create a quantization profile. Because different parts of the neural network contain floating-point values in different ranges, Glow uses profile-guided information to estimate the possible numeric range for each stage of the neural network. The quantization conversion works as a two-phase process. First, Glow statically instruments the network with special profiling nodes that record the ranges of activations that flow through the network, optimizes the network including these profiling nodes, and then runs inference. Then, it recompiles the network using the profile information to convert the network into a quantized form, allowing for static optimization of the quantized graph.
In order to compute the quantization profile, one option is to use the model-profiler tool. This application is generic and can be used with any model; it requires a set of files (in either text or binary format) corresponding to the model input tensors in order to feed the model with a dataset and compute the profile. The command has the following format:
model-profiler -model=<model-path> \
-dump-profile=profile.yaml \
-input-dataset=<name1,format1,source1,opts1> \
...
`-input-dataset` specifies the dataset used to feed each of the model inputs. The `-dump-profile=profile.yaml` option dumps each node's output profile data into the `profile.yaml` file. This information can later be used in the quantized conversion process.
In order for the profiling phase to be correct, make sure the data used to feed the network is pre-processed in the same way as it would be in the case of inference. For example, for an image classification model make sure the input raw data:
- has the correct data layout (NHWC or NCHW)
- has the correct channel order (RGB or BGR)
- is scaled properly: the values are in the range [0,1], [-1,1], [-128,127], [0,255], etc.
It is important to note that the profiling phase is independent of the quantization parameters, so there is no need to specify the quantization schema, precision, or other parameters.
Another tool used to compute the quantization profile is the image-classifier tool, which is specialized for image classification models only and requires a set of images to run the inference and compute the profile. This application has the benefit that it provides a mechanism to load PNG images directly and to pre-process them according to the model's needs (layout conversion, channel ordering, scaling). For example, you can use the following Makefile to capture the profile for the `resnet18` model:
.PHONY : clean build all
MODEL ?= resnet18
MODELINPUT ?= "data"
PROFILE ?= $(MODEL).yaml
IMAGEMODE ?= 0to1
ORDER ?= RGB
CALIBRATION ?= ../calibration/images224
all: clean build
build: $(PROFILE)
$(PROFILE):
@echo 'Build the $(MODEL) profiler $@:'
${GLOWBIN}/image-classifier ${CALIBRATION}/*.png \
-image-layout=NCHW \
-image-mode=$(IMAGEMODE) \
-image-channel-order=$(ORDER) \
-use-imagenet-normalization \
-model=$(MODEL).onnx \
-model-input-name=$(MODELINPUT) \
-dump-profile=$@
clean:
rm -f $(PROFILE)
The sub-directory `calibration` provided to the image-classifier tool contains a small set of PNG images that were obtained by preprocessing a set of JPEG images from Imagenet, converting them from BGR to RGB format and rescaling them to size 3x224x224. Glow's default layout is NHWC, but for the NMP processor we use the NCHW layout, hence the need to indicate the layout change in the above script. Also, for ONNX models, we use the `-use-imagenet-normalization` flag to indicate that the images will be pre-processed and normalized with the parameters used during training on the Imagenet dataset.
After the quantization profile `profile.yaml` has been generated, we can use the model-compiler tool to compile the model into a bundle by loading the previously generated profile:
model-compiler ... -load-profile=profile.yaml -quantization-schema=<schema>
When compiling a quantized bundle with the model-compiler, some quantization parameters can be specified:
- `quantization-schema` specifies the quantization schema:
  - `asymmetric` for the Asymmetric quantization schema (default).
  - `symmetric` for the Symmetric quantization schema.
  - `symmetric_with_uint8` for the SymmetricWithUint8 quantization schema.
  - `symmetric_with_power2_scale` for the SymmetricWithPower2Scale quantization schema.
- `quantization-precision` specifies the precision used to quantize the nodes:
  - `Int8` for int8 quantization (default).
  - `Int16` for int16 quantization.
- `quantization-precision-bias` specifies the precision used to quantize the bias operand of some of the nodes (e.g., FullyConnected, Convolution):
  - `Int8` for int8 quantization.
  - `Int16` for int16 quantization.
  - `Int32` for int32 quantization (default).
For example, in order to profile, quantize, and compile the ResNet18 model for the NMP processor, you can use the Makefile below:
.PHONY : clean build all
MODEL ?= resnet18
MODELINPUT ?= "data",float,[1,3,224,224]
PROFILE ?= $(MODEL).yaml
PRECISION ?= Int16
BUNDLE ?= bundle
all: clean build
build: ${BUNDLE}/$(MODEL).o
${BUNDLE}/$(MODEL).o : $(PROFILE)
@echo 'Build the bundle object $@:'
${GLOWBIN}/model-compiler \
-load-profile=$< \
-model=$(MODEL).onnx \
-model-input=$(MODELINPUT) \
-emit-bundle=$(BUNDLE) \
-quantization-schema=symmetric_with_power2_scale \
-quantization-precision=$(PRECISION) \
-quantization-precision-bias=$(PRECISION) \
-backend=NMP \
-dump-graph-DAG=$(MODEL)-quant.dot
clean:
rm -f ${BUNDLE}/$(MODEL).o
When the model is compiled, the quantization parameters are chosen in such a way that, for the given profile, no saturation occurs. Although this makes sense at first glance, there is actually a trade-off when choosing the quantization parameters for a given tensor: it might be beneficial overall to choose quantization parameters that provide a smaller quantization step (e.g., a smaller scale parameter), which means a better representation of most of the tensor values (the bulk of the histogram) at the expense of actually saturating the extreme values (outliers). You can see the result of the quantization by looking at the `.dot` file generated by the optional flag `-dump-graph-DAG`.
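The toy computation below (with made-up numbers) illustrates this trade-off for int8 symmetric quantization: clipping a rare outlier yields a much finer quantization step for the bulk of the values.

```cpp
#include <cstdio>

int main() {
  // Suppose the profiled range of a tensor is [-1.0, 100.0], but 99% of the
  // values fall in [-1.0, 1.0] and 100.0 is a rare outlier.
  float fullRange = 100.0f, bulkRange = 1.0f;
  // int8 symmetric quantization step (scale) for each choice of range:
  float scaleNoSat = fullRange / 127.0f; // ~0.787: no saturation, coarse step
  float scaleClip = bulkRange / 127.0f;  // ~0.0079: fine step, outliers saturate
  std::printf("step without saturation: %f\n", scaleNoSat);
  std::printf("step with outlier clipping: %f (about 100x finer)\n", scaleClip);
  return 0;
}
```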
Measuring the accuracy of resnet18 (ONNX version 1.2.1, opset version 7) running on the CPU and on the DQ1-A0 board of the NMP processor gives:
| Target | Top-1 accuracy (%) | Top-5 accuracy (%) |
|--------|--------------------|--------------------|
| CPU    | 69.93              | 89.29              |
| NMP    | 67.40              | 88.01              |
2: The NeuroMorphic Processor (NMP) was developed by LG Electronics (LGE) as an accelerator device. The key idea behind the NMP architecture is to use RISC-V ISA extensions to design relevant CNN instructions such as Conv layers, FC layers, pooling layers, element-wise operations, etc. The NMP architecture is a multicore NPU that contains an ARM57 processor that works as a host for a set of Tile (TLE) processors, each containing a set of Tilelet (TLT) cores. Each TLT has one RISC-V core and three on-chip (scratchpad) memories which respectively store the input, weight, and output tiles of the operation data maps. Besides that, each TLT is also equipped with a MAC acceleration unit to execute CNN operations. The MAC unit execution is triggered by the RISC-V core and is capable of executing 8- and 16-bit fixed-point operations with the memory layout organized in NCHW format.↩
This section describes the target-independent optimizations performed by Glow.
Glow has two different optimizers: the graph optimizer and the IR optimizer. The graph optimizer performs optimizations on the graph representation of a neural network model. The nodes of the graph usually represent coarser-grained operations than those represented by the IR instructions, and these operations do not explicitly represent memory allocations and buffers. The IR optimizer performs a number of optimizations on the IR representation of a neural network model; we describe only a subset of them here, as they are more related to the compiler area.
The optimizations have two major objectives. One is to improve the performance of training and inference steps. The other one is to reduce the memory consumption during the execution of neural network models. It is worth mentioning that performing optimizations to reduce memory consumption is easier at the IR level, because memory allocations and deallocations are explicitly represented in the IR, whereas they are not explicit in the graph representation.
Below you can see a non-exhaustive list of graph optimizations that are supported by Glow:
- Dead code elimination (DCE): removes computations whose results or side effects are not used.
- Optimization of transpose nodes: combines multiple consecutive transpose nodes into a single node, eliminates identity transpose nodes, and optimizes transpose nodes into reshape nodes when they actually move no data.
- Sinking of transpose operations below other operations: sinks transposes below operations such as batch normalization, RELU, sigmoid, ChannelShuffle, etc. By doing this, many transpose operations are brought closer to each other, which creates more opportunities for the elimination of transpose operations.
- Pool operations optimization: swaps the order of Relu->MaxPool so that the RELU operation is performed on a smaller tensor. This optimization is not a major performance win: the RELU operation takes a small fraction of the time, and reordering the nodes does not provide much speedup. However, reordering the buffers allows us to reuse the memory buffer of the pool operation and potentially save memory.
- Optimization of concat nodes: merges multiple consecutive concat nodes into a single concat node.
- Common sub-expression elimination (CSE): performs a classic CSE with the goal of avoiding recomputation of any results that were already computed.
- Optimization of ReduceMean nodes: substitutes a ReduceMean node with an AvgPool node if the reduce parameters are suitable: the input is 4D with the last two dimensions to be reduced.
The majority of the common optimizations above can be used on a quantized graph, but in addition to those there are quantization-specific optimizations:
- Quantize(Dequantize(X)) -> RescaleQuantized(X): if the Quantize-Dequantize sequence does not change the type, then this sequence is simply dropped without adding a nop RescaleQuantized node. If the Dequantize node has an input type that is different from the Quantize node's output type, then a RescaleQuantized node replaces Quantize-Dequantize.
- Dequantize(Quantize(X)): a sequence of Dequantize(Quantize(X)) is a nop transformation and can be completely removed.
- Constants optimization: constants that have a single use can be quantized at the optimization phase. This optimization replaces Quantize(Constant) with just a Constant with updated quantized weights based on the quantization parameters from the Quantize node.
- RescaleQuantized(Max(X,Y)) -> Max(RescaleQuantized(X), RescaleQuantized(Y)): it is OK to rescale the operands because even if the output range is smaller, the truncation would have happened during the rescaling anyway. For values that are outside of the range, we just move the truncation to a different location.
- Combine RescaleQuantized operator up into the operation: a number of operations can operate on varying quantized parameters for the output type, so it is safe to merge the RescaleQuantized node into the operator itself if the operator supports this. This optimization can be applied to:
  - Add
  - Sub
  - Mul
  - Div
  - Min
  - Max
  - Convolution
  - Splat
- Combine arithmetic operations into batch normalization: when a chain of arithmetic nodes (each operating on a constant on one side) is right below a BatchNorm node, the chain is folded into the BatchNorm node. When a chain of arithmetic nodes (each operating on a constant on one side) is right below a Convolution node, the chain is folded into a BatchNorm.
- Combine RescaleQuantized operator down into the operation: this optimization allows for eliminating redundant rescale operations when the next operation supports quantized inputs of different scales and offsets, e.g., normal arithmetic operations: Add, Sub, Mul, Div, Min, Max.
- Sinking RescaleQuantized operator below other operators: this optimization sinks RescaleQuantized nodes below operations such as slice, reshape, transpose, etc. By doing this, many RescaleQuantized operators are brought closer to each other, which creates more opportunities for the elimination of RescaleQuantized operations.
- RescaleQuantized(Quantize(X)) -> Quantize(X): a Quantize operation followed by a RescaleQuantized operation is replaced by a single Quantize operation with the proper quantization parameters based on the RescaleQuantized operation.
- Eliminate Max operations in Max(Splat(X), someOperand) or Max(someOperand, Splat(X)): the Splat and Max operations can be completely eliminated if the Splat value cannot impact the result of the Max operation. For example, Max and Splat are removed if the Splat value is smaller than the smallest possible value of the other operand. The smallest possible value of the operand can be calculated based on the quantization parameters, which represent the quantization range [min, max] in fp32.
The graph optimizations listed above are each formulated as a FunctionPass, which is run on a Function (graph). A series of FunctionPasses, along with how to configure each one (via a `FunctionPassConfig`), constitutes a pipeline, which is passed into a PassManager that executes them. A pipeline is simply a vector of `FunctionPassConfig`s. `FunctionPassConfig` is a class made up of (a minimal pipeline sketch follows this list):
- `FunctionPassID`: an ID corresponding to a specific FunctionPass, for example `FunctionPassID::OptimizeArithmeticNodes`. Default: `EmptyPass` (no-op pass).
- `ConvergenceMode`: an enum specifying whether this FunctionPass should be run a single time (`OnePass`) or repeatedly until a fixed point is reached (`UntilFixedPoint`). Default: `OnePass`.
- `DCERequiredMode`: an enum representing whether DCE is required before a pass is run (`BeforePass`), or not at all (`None`). Running DCE is often required in order to make sure the number of users of each Node is up to date. Default: `BeforePass`.
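As a rough illustration, a pipeline could be assembled as shown below. Only the names introduced above are used; the exact constructor arguments and the way the pipeline is handed to the PassManager vary across Glow versions, so treat this as a conceptual sketch rather than a drop-in snippet.

```cpp
#include <vector>

// Assumes the Glow headers declaring FunctionPassConfig, FunctionPassID, and
// ConvergenceMode are available; the constructor argument order is an
// assumption based on the fields described above.
std::vector<FunctionPassConfig> pipeline = {
    // Run once; DCE runs before the pass (both are the defaults listed above).
    FunctionPassConfig(FunctionPassID::OptimizeArithmeticNodes),
    // The same pass, but repeated until a fixed point is reached.
    FunctionPassConfig(FunctionPassID::OptimizeArithmeticNodes,
                       ConvergenceMode::UntilFixedPoint),
};
```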
Below you can see the list of IR optimizations currently supported by the IR optimizer:
- Peephole optimizations: small, local optimizations that look for specific sequences of instructions and replace them with more efficient sequences of instructions.
- Dead store elimination (DSE): removes stores into weights or allocations if it can be proven that the results of these stores are never going to be used.
- Deallocations hoisting: tries to place the buffer deallocation instructions right after the last use of a buffer. Doing so reduces the lifetime of the buffer and makes the freed memory available for the allocation of other buffers. It improves memory consumption.
- Allocations sinking: tries to place the buffer allocation instructions right before the first use of a buffer. Doing so reduces the lifetime of the buffer and makes the unused memory available for the allocation of other buffers. It improves memory consumption.
- Dead allocations removal: finds and removes allocations that are just allocated and deallocated but never used. Such situations may happen, for example, after other optimizations are performed. Performing this optimization improves memory consumption.
- Making weights constant: marks weights that are never mutated as constant. This may allow such weights to be placed in a read-only memory segment and shared between simultaneous executions of the same neural network model.
- Sharing of buffers: the purpose of this optimization is to reduce memory usage by reusing memory buffers as much as possible. The overall idea is that it is fine to combine storage for two live intervals if they do not overlap. Typically, two live intervals are considered candidates for sharing if they occur in the same instruction.
- Stacking of data-parallel operations: stacking tries to combine multiple data-parallel (i.e., element-wise) operations that work on the same shape of tensors into a single kernel. Executing such a kernel should in theory be more efficient than executing those operations sequentially one after the other, because the combined kernel exposes better cache locality. The stacked kernels should provide even more advantages on GPUs, because they reduce the number of kernel thread launches, which are rather expensive operations.
When compiling the model, you can visualize the transformations and optimizations performed by Glow using the `-print-graph-passes` and `-print-ir-passes` options like this:
model-compiler -model=lenet.onnx \
-print-graph-passes \
-print-ir-passes \
...
We also have the option to view the IR after the listed passes (comma-separated pass names). For example:
model-compiler -model=lenet.onnx \
... \
-dump-ir-after-passes=DeleteDeadAllocs > lenet.lir
Project P2 requires you to implement a naive convolution in the CPU backend. Then, using the ONNX models `resnet18`, `mobilenet`, or `squeezenet` from the work repository, compare the correctness of your implementation as well as the performance of your convolution against the standard convolution used by the CPU backend. The steps to do this experiment are:
- It is useful to define a compilation flag that allows turning the code generation for your convolution on/off. Let's call this flag, for example, `MO436-features`, belonging to the CPU Backend category. Here are the files you should modify:
  - include/glow/LLVMIRCodeGen/CommandLine.h
  - lib/LLVMIRCodeGen/CommandLine.cpp
  Don't forget to include `CommandLine.h` in the CPUBackend header file so that this flag is visible to the CPU backend modules:
  - lib/Backends/CPU/CPUBackend.h
- Create a new backend-specific Node and a new backend-specific Instruction for the convolution that you will implement (see the ClassGen documentation for more details):
  - lib/Backends/CPU/ClassGen/CPUSpecificNodes.h
  - lib/Backends/CPU/ClassGen/CPUSpecificNodesVerification.h
  - lib/Backends/CPU/ClassGen/CPUSpecificInstrs.h
  - lib/Backends/CPU/ClassGen/CPUSpecificInstrsVerification.h
- Disable support for fusing activations for the MO436 convolution:
  - lib/Backends/CPU/CPUBackend.cpp
- Generate code to replace the generic convolution with the new one you will create in the `CPUBackend::transformPostLowering` function:
  - lib/Backends/CPU/Transforms.cpp
  For the sake of completeness and correctness, we will need to bypass depthwise convolutions. A depthwise convolution is a type of convolution where we apply a single convolutional filter to each input channel. In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. In Glow, regular 2D convolutions have `group=1`, so you should discard depthwise convolutions like this: `if (CN->getGroup() != 1) { return nullptr; }`
- Using as a reference the code generation for the convolution instruction in the CPULLVMIRGen.cpp module (`CPUConvDKKC8InstKind`) and its implementation (`libjit_convDKKC8_f`) in the libjit_cpu_conv.cpp module, design and implement your convolution and the code generation that invokes it (pay attention to the prefix and suffix conventions used in the function names). A naive reference kernel is sketched after this list:
  - lib/Backends/CPU/CPULLVMIRGen.cpp
  - lib/Backends/CPU/libjit_cpu/libjit_cpu_conv.cpp
- Finally, tell Glow how it should represent this new convolution in the ONNX model:
  - lib/Backends/CPU/ONNX/CPUONNXModelWriter.cpp
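For reference, a naive direct convolution kernel (NHWC activations, DKKC filters) could look like the sketch below. The function name and parameter list are hypothetical and must match whatever call your code generation in CPULLVMIRGen.cpp emits; it also ignores dilation and grouping, consistent with the group=1 restriction above.

```cpp
#include <cstddef>
#include <cstdint>

// Naive direct 2D convolution: out[N][OH][OW][D], in[N][IH][IW][C],
// filter[D][K][K][C] (DKKC), bias[D]. Zero padding, square kernel/stride.
extern "C" void libjit_mo436_conv2d_f(
    float *out, const float *in, const float *filter, const float *bias,
    size_t N, size_t IH, size_t IW, size_t C, size_t OH, size_t OW, size_t D,
    size_t K, size_t stride, size_t pad) {
  for (size_t n = 0; n < N; n++) {
    for (size_t oh = 0; oh < OH; oh++) {
      for (size_t ow = 0; ow < OW; ow++) {
        for (size_t d = 0; d < D; d++) {
          float sum = bias[d];
          for (size_t kh = 0; kh < K; kh++) {
            for (size_t kw = 0; kw < K; kw++) {
              // Input coordinates, taking stride and padding into account.
              ptrdiff_t ih = (ptrdiff_t)(oh * stride + kh) - (ptrdiff_t)pad;
              ptrdiff_t iw = (ptrdiff_t)(ow * stride + kw) - (ptrdiff_t)pad;
              if (ih < 0 || iw < 0 || ih >= (ptrdiff_t)IH ||
                  iw >= (ptrdiff_t)IW)
                continue; // Zero padding: skip out-of-bounds inputs.
              for (size_t c = 0; c < C; c++) {
                sum += in[((n * IH + ih) * IW + iw) * C + c] *
                       filter[((d * K + kh) * K + kw) * C + c];
              }
            }
          }
          out[((n * OH + oh) * OW + ow) * D + d] = sum;
        }
      }
    }
  }
}
```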
The delivery must include one ONNX model after being optimized by Glow, in PDF format, and a report showing the accuracy (top-1 & top-5) and the speedup (or slowdown) of your implementation compared with the default implementation in Glow, for the 1000 PNG images contained in `datasets/imagenet`, for one of the three models available in the `work` repository, namely `resnet18`, `mobilenet`, & `squeezenet`.