Discussion on Hardware and Software

Introduction

This section of the Wiki aims to provide information on the hardware which can be used when working on edge AI, and to discuss some of the factors that may influence a developer's choice of hardware. On a technical level, the idea behind edge hardware is that it is placed physically close to the data source. In principle, any desktop computer or server could be used as an edge device; in practice, the environmental conditions tend not to allow these machines to be used, with power consumption being the major issue. This leads us to what we call edge devices. These edge devices come in a number of flavours, one of which is devices designed specifically for AI applications.

This section will also discuss the software used for ML at the edge - or, to be more precise, the software frameworks used (as opposed to the languages).

Hardware

The typical edge device architecture is shown below:

We can see from this image that the processor is at the centre of the whole system. This is the element that runs the various applications and the algorithms. Depending on the device, the processor may be supported by what are called co-processors: extra pieces of hardware that carry out specific tasks, also known as accelerators. One such accelerator is a floating-point unit (FPU), a hardware element designed to carry out floating-point arithmetic. Other common accelerator blocks which could be useful in an edge-AI environment are digital signal processing blocks and linear algebra blocks [1].

We have discussed a number of times how an MCU is a constrained device. This presents a number of challenges when developers look to implement ML at the edge. One of the biggest hurdles facing developers is the reduced memory capacity of a constrained device relative to, for example, a cloud solution.

General Hardware Considerations

Typically, a model has to store information on various weights and biases, as well as data points and so on, and storing this information can take up large amounts of memory. This puts a particular onus on developers to reduce the memory resources required on the constrained device, while still ensuring the system can perform as expected. An MCU can have various types of memory, including Flash, EEPROM, SRAM, and DRAM, each with its own pros and cons. The most obvious difference is how fast the system can carry out read and write requests. Some memory is also volatile in nature, which means that it will lose the data stored in it if the system loses power. Model compression, discussed later, will also help with constrained memory.
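As a rough illustration of the numbers involved, the sketch below estimates the parameter storage for a small, hypothetical fully connected network at two precisions. The layer sizes are assumptions chosen purely for illustration, not taken from any model in this project:

```python
# Rough storage estimate for a small dense network's parameters.
# The layer shapes below are illustrative assumptions only.

def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

# A hypothetical 3-layer network: 128 -> 64 -> 16 -> 4
params = dense_params(128, 64) + dense_params(64, 16) + dense_params(16, 4)

for name, bytes_per_param in [("float32", 4), ("int8", 1)]:
    kib = params * bytes_per_param / 1024
    print(f"{params} parameters at {name}: {kib:.1f} KiB")
```

For context, a board like the Arduino Nano 33 BLE Sense discussed below has 1 MB of flash and 256 KB of SRAM, so even a modest model can occupy a noticeable share of the available memory.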

Hardware Options

There are many edge AI devices that have been designed with a specific use case, or area of use, in mind. For example, there is Coral [2], a platform developed by Google for AI applications in a production environment. Its System on Module (SoM) is a fully integrated system for accelerated ML applications (including CPU, GPU, Edge TPU, Wi-Fi, Bluetooth, and Secure Element).

Then there is the Jetson Nano from NVIDIA [3], which is an AI computer for makers, learners, and developers [4]. As discussed in [4], the Jetson Nano provides libraries and APIs for TensorRT and cuDNN for high-performance deep learning applications, CUDA for GPU-accelerated applications, NVIDIA Container Runtime for containerised GPU-accelerated applications, as well as APIs for sensor development and for working with OpenCV. This, along with NVIDIA's other AI-focused development kits, is aimed at prototyping before moving to production.

A lot of the systems mentioned so far fall into the mid- to high-end range. There are also lower-end systems which are still capable of carrying out AI applications. One example is the BeagleBone AI [5]. This system runs Linux and is based around the Texas Instruments (TI) AM5729, with a TI C66x floating-point DSP and TI embedded vision engines (EVEs) [6]. Then there is the Arduino Nano 33 BLE Sense, a small, AI-enabled development board. It contains a number of onboard sensors, such as a 9-axis inertial sensor, a humidity sensor, a microphone, and a proximity sensor. The board also has GPIO pins which allow extra sensors to be added if needed or required. The Arduino development kit allows EML applications to be run: models created using TensorFlow Lite can be uploaded to the board via the Arduino IDE [7].

Another type of hardware is the field programmable gate array (FPGA). An FPGA is an IC which allows the user to configure the hardware to meet the requirements of the project, so a developer can build a processor specifically tailored to running ML models. The flexibility that FPGAs allow means they are an excellent platform for developing AI accelerators [6]. Intel has a range of FPGAs, such as the Cyclone 10 GX, which allow real-time, low-latency, and low-power deep learning inference [8]. Other manufacturers are producing AI-focused silicon as well, such as Qualcomm, which has, among other things, the Snapdragon, a System on Chip (SoC) containing a CPU and GPU. As with similar systems, the Qualcomm kits offer power-efficient systems with excellent performance [9].

Software

This section will detail some of the major software libraries and projects that are aimed at the EML field.

Frameworks

The main tools or frameworks used in deep learning are TensorFlow, with its associated Keras API, and PyTorch. Others, such as Theano, Caffe, and Microsoft CNTK, are also available. TensorFlow, developed by Google Brain, is a powerful open-source library aimed at deep neural networks. These tools are typically optimised to run on GPUs and other specialised hardware in order to accelerate training, because once a model has been trained, using that model to make inferences is less computationally expensive, and this is what matters for edge devices. It is important to note that while inference is less computationally expensive, the MCU does not get a free ride - a reasonable amount of memory and processing power is still required to carry out the inference. To reduce the requirements on the device, frameworks such as TensorFlow and PyTorch have implemented model compression techniques which have little to no effect on the computational accuracy [6]. There will be more on model compression later.

While TensorFlow (with TensorFlow Lite) and PyTorch have both been mentioned here, TensorFlow is the main focus: PyTorch's mobile offering, PyTorch Mobile, is aimed at mobile devices running Android or iOS, whereas TensorFlow Lite can also be used on embedded systems [6][10].

TensorFlow Lite

TensorFlow Lite (TFL) is a set of tools that have been designed and built to allow on-device machine learning. It has been designed to help developers run models on embedded systems, as well as on mobile devices [10].
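The basic workflow is to train a model with standard TensorFlow/Keras and then convert it to TFL's flatbuffer format for deployment. A minimal sketch, using a small placeholder model rather than a real trained one:

```python
import tensorflow as tf

# A small placeholder Keras model; in practice this would be trained first.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Convert the model to the TensorFlow Lite flatbuffer format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# The resulting .tflite file is what gets deployed to the device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```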

Some of the key features of TensorFlow Lite include:

  • Optimisation
  • Multi-platform
  • Multi-language options
  • High performance

As discussed elsewhere, optimisation is very important when developing EML systems. TFL supports optimisation by addressing a number of key constraints for embedded systems: latency, privacy, connectivity, size, and power consumption [10]. The reduced latency comes from the fact that no information has to travel to a server for processing and then back to the device. TFL removes many of the privacy issues traditionally present in cloud-based ML tasks, because no data actually leaves the device. Connectivity issues are side-stepped because the device does not need to connect to a network; some systems may still send information back to a cloud-based server, but this becomes a non-critical task in terms of timing. TFL can also reduce the size of the associated model files, making them more suitable for embedded applications. Power consumption is reduced through efficiencies in inference, as well as the lack of a network connection, since transmitting data is typically a large drain on power.

On the topic of multi-platform, TFL can be used to develop models that will work on Android and iOS devices, as well as on microcontrollers and embedded Linux systems. In terms of languages, TFL has support for Python, Objective-C, C++, Java and Swift.
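As an example of the Python support, the sketch below loads the model.tflite file produced earlier and runs a single inference through tf.lite.Interpreter, using random dummy data in place of real sensor input:

```python
import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one dummy input of the expected shape and dtype.
input_data = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```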

The ability to use multiple languages is a great advantage, but the ability to optimise the models is essential to the success of TFL. This is discussed in the following section.

Optimisation

Various optimisations can be applied to models to allow efficient operation within the limited computational power and memory available on a constrained device. In addition, some optimisation techniques allow the use of specific hardware for accelerated inference [11]. TFL provides a number of tools for this in the TensorFlow Model Optimization Toolkit. Models should be optimised for a number of reasons [12]. One is size reduction. There is no single method for reducing a model's size - there are several ways this can be achieved - but whatever the process, reducing the size of the model has a number of advantages:

  • Smaller storage size: A smaller model will take up less memory on the end device. If we take the Arduino Nano 33 BLE Sense as our target device, this MCU does not come with a lot of on-board memory, so reducing the model size is critical for deployment.
  • Reduced memory usage: A smaller model will use less RAM when it is run, leaving more memory available for other elements of the application, which typically results in better performance overall.
  • Reduced download size: A smaller model needs less time to download onto the end device. This is a benefit when a system may only have a small window of time in which to receive a newly trained model.

Another reason is latency reduction. In the context of EML, latency can be defined as the amount of time taken for the system to process a piece of data and make a decision based on that data. Optimisation can reduce the amount of computation needed to make this decision (or, to put it another way, reduce latency). Lower latency will also have the benefit of reducing power consumption.

Then we have accelerators. Hardware such as the Edge TPU can run inference extremely fast when given correctly optimised models. The Edge TPU is offered by Google and is a purpose-built ASIC designed specifically to run AI on edge devices; it allows high-quality machine learning inferencing at the edge and underpins the Coral development platform discussed above [13]. One consequence of model optimisation is that it can impact model accuracy, and this should be kept in mind during development.

As discussed, optimisation has the potential to affect the accuracy of the model, and the developer must keep this in mind during the development process. The potential inaccuracies cannot be predicted in advance, so no general method of dealing with them can be built into the TFL suite of tools.

Depending on what works best for a given application, TFL offers a number of optimisation paths: quantisation, pruning, and clustering. The quantisation approach works by reducing the precision of the numbers used to represent the model parameters [12]; by default these are 32-bit floating-point numbers. Reducing the precision of the model's parameters means that less memory is required, which in turn means a smaller model overall and faster computation.
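As a concrete example, post-training dynamic range quantisation (see the table below) needs only one extra line in the conversion step. A minimal sketch, again using a placeholder model in place of a real trained one:

```python
import tensorflow as tf

# Placeholder model, standing in for a trained one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# This single line enables post-training dynamic range quantisation:
# weights are stored as 8-bit integers rather than 32-bit floats.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_quant_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```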

From [12], the quantisation techniques are summarised in the table below (a sketch of the integer variant follows the table):

| Technique | Data requirements | Size reduction | Accuracy | Supported hardware |
| --- | --- | --- | --- | --- |
| Post-training float16 quantisation | None | Up to 50% | Insignificant accuracy loss | CPU, GPU |
| Post-training dynamic range quantisation | None | Up to 75% | Smallest accuracy loss | CPU, GPU (Android) |
| Post-training integer quantisation | Unlabelled representative sample | Up to 75% | Small accuracy loss | CPU, GPU (Android), Edge TPU, Hexagon DSP |
| Quantisation-aware training | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), Edge TPU, Hexagon DSP |
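The integer variants need a little more setup: post-training integer quantisation requires a representative dataset so the converter can calibrate the ranges of the activations. A minimal sketch, again with a placeholder model and placeholder data:

```python
import numpy as np
import tensorflow as tf

# Placeholder model, standing in for a trained one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data_gen():
    # A few unlabelled samples that should resemble real inputs;
    # random data is used here purely as a placeholder.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

# Force full integer quantisation, as required by accelerators
# such as the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
```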

Next is pruning. Pruning is an approach which removes parameters from the model that have only a minor impact on its predictions. A pruned model takes up the same amount of memory and has the same latency, but it can be compressed much more efficiently, as the sketch below illustrates.
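A minimal pruning sketch using the TensorFlow Model Optimization Toolkit; the 50% sparsity target and the placeholder model are illustrative assumptions:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model, standing in for a trained one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Wrap the model so that low-magnitude weights are progressively
# zeroed out during fine-tuning.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0)
}
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned.compile(optimizer="adam", loss="categorical_crossentropy")

# Fine-tuning would go here, with the pruning callback attached:
# pruned.fit(x_train, y_train,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before conversion; the zeroed weights do not
# shrink the .tflite file by themselves, but they compress very well.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```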

Then finally there is clustering. Clustering works by grouping the weights of each layer into a predefined number of clusters; the weights in each cluster then share that cluster's centroid value. This reduces the number of unique weights in the model, which also reduces its complexity.
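A minimal clustering sketch, again using the TensorFlow Model Optimization Toolkit with illustrative parameters (16 clusters, placeholder model):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model, standing in for a trained one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Group each layer's weights into 16 shared centroid values.
clustered = tfmot.clustering.keras.cluster_weights(
    model,
    number_of_clusters=16,
    cluster_centroids_init=tfmot.clustering.keras.CentroidInitialization.LINEAR,
)
clustered.compile(optimizer="adam", loss="categorical_crossentropy")

# A short fine-tuning run would go here to recover any lost accuracy.

# Strip the clustering wrappers before converting to TensorFlow Lite.
final_model = tfmot.clustering.keras.strip_clustering(clustered)
```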

Sources

[1] D. Situnayake and J. Plunkett, AI at the Edge, O'Reilly Media, Inc., 2023.
[2] Google, "Coral," Google, January 2021. [Online]. Available: https://coral.ai/docs/som/datasheet/. [Accessed 13 4 2023].
[3] NVIDIA, "Jetson Nano Developer Kit," NVIDIA, [Online]. Available: https://developer.nvidia.com/embedded/jetson-nano-developer-kit. [Accessed 13 4 2023].
[4] NVIDIA, "Jetson Nano Developer Kit User Guide," NVIDIA, 15 1 2020. [Online]. Available: https://developer.nvidia.com/embedded/dlc/Jetson_Nano_Developer_Kit_User_Guide. [Accessed 13 4 2023].
[5] BeagleBoard.org, "BeagleBone AI," [Online]. Available: https://beagleboard.org/AI. [Accessed 13 4 2023].
[6] T. Sipola, J. Alatalo, T. Kokkonen and M. Rantonen, "Artificial Intelligence in the IoT Era: A Review of Edge AI Hardware and Software," 2022 31st Conference of Open Innovations Association (FRUCT), Helsinki, Finland, 2022, pp. 320-331, doi: 10.23919/FRUCT54823.2022.9770931.
[7] Arduino, "Arduino Nano 33 BLE Sense," Arduino, 2021. [Online]. Available: https://store-usa.arduino.cc/products/arduino-nano-33-ble-sense. [Accessed 13 4 2023].
[8] Intel Corporation, "Intel FPGA AI Suite," [Online]. Available: https://www.intel.co.uk/content/www/uk/en/software/programmable/fpga-ai-suite/overview.html. [Accessed 14 4 2023].
[9] Qualcomm, "AI is transforming everything. We are making AI ubiquitous," [Online]. Available: https://www.qualcomm.com/research/artificial-intelligence. [Accessed 14 4 2023].
[10] TensorFlow, "TensorFlow Lite," 26 5 2022. [Online]. Available: https://www.tensorflow.org/lite/guide. [Accessed 14 4 2023].
[11] TensorFlow, "Model Optimization," 20 10 2021. [Online]. Available: https://www.tensorflow.org/lite/performance/model_optimization. [Accessed 15 3 2023].
[12] TensorFlow, "TensorFlow Model Optimization Toolkit," [Online]. Available: https://www.tensorflow.org/model_optimization. [Accessed 9 4 2023].
[13] Google, "Internet of Things - Edge TPU," [Online]. Available: https://cloud.google.com/edge-tpu/. [Accessed 16 4 2023].