Literature Review

Introduction

This section will comprise the literature review for this project, which has a focus on embedded machine learning. It is not difficult to see a future where embedded machine learning or deep learning plays a central role in a number of systems and industries. This literature review aims to provide an overview of the state of the art, as well as context on where embedded ML/DL could be headed.

As we have seen in [1], the growth of connected devices is increasing, and as the computing power of these devices increases, the potential of using these devices for ML or DL applications grows.

This review will cover a number of areas, starting with a review of edge computing, and then moving on to the topic of how machine learning can be carried out at the edge.

Embedded System and the Edge

Embedded System

Embedded is quite a common term in the field of electronic engineering. An embedded device, or embedded system, is a computer that controls the electronics of many of today's modern devices. Embedded systems can be found in everything from mobile phones to modern cars to satellites orbiting the Earth. These embedded systems run software which controls the functions and capabilities of the wider system.

Embedded systems are in more places than one may imagine, or it may be more accurate to say that there are more embedded chips in a single device than one may imagine. In 2020, more than 28 billion microcontrollers were shipped globally, and the trend is predicted to grow, with a focus on automation and AI devices [9]. Given how common these devices are, it is quite obvious that researching how best to transfer ML models to these systems is an important step.

A microcontroller (MCU) is a small computer on an integrated circuit (IC), sometimes referred to as a very-large-scale integration (VLSI) IC. A microcontroller will contain:

  • One or more central processing units (CPU)
  • Typically multiple types of onboard memory (Flash, EEPROM, RAM)
  • Input/Output peripherals

Microcontrollers are typically embedded within a larger system, hence the term embedded systems. As an example, an entry-level motorcar will typically have 15-20 microcontrollers, while high-end, luxury motorcars can have over 100 microcontrollers [10].

As technology progressed, microcontrollers came to be classified by bit size as a way to distinguish a system's performance. MCUs are typically described as 8-, 16- or 32-bit systems, where the number of bits describes the width of the registers. It also indicates the size of the available address space: an 8-bit system has $2^{8} = 256$ available addresses, a 16-bit system has $2^{16} = 65536$ addresses, and a 32-bit system has $2^{32} = 4,294,967,296$ addresses, which is over 4 GB of addressable memory. Another common classification for MCUs is their operating speed. While values vary between manufacturers, 8-bit systems typically run at speeds in the low tens of megahertz, going all the way up to hundreds of megahertz for 32-bit systems. The deciding factor on which MCU is used for a particular job is the application: a simple application where cost needs to be kept down could be covered by an 8-bit MCU, whereas a large, high-performance system requiring intensive mathematical operations would be better served by a 32-bit MCU.
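As a quick sanity check on the arithmetic above, a couple of lines of Python reproduce the address counts quoted for each register width (assuming, for illustration, one byte per address).

```python
# Number of addressable locations for common MCU register widths.
# Assuming one byte per address, a 32-bit MCU can address 4 GiB.
for bits in (8, 16, 32):
    addresses = 2 ** bits
    print(f"{bits}-bit: {addresses:,} addresses")
```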

Edge Computing

The term 'edge' may seem slightly unusual at first: how does the edge relate to ML? Intuitively, an edge is something located far away from the centre, and that intuition carries over to networks.

When discussing the internet, computers, or IT systems, most people will picture the PC they have at home or the computer they use at work. But there are far more devices connected to the internet than PCs. In fact, it is estimated that as of 2021 there were 12.2 billion active IoT connections [2]. These IoT devices cover almost any aspect of our lives one cares to think about: smart watches, smart kitchen appliances, baby monitors connected to the internet so that parents can check in from anywhere in the world, shipping containers, industrial sensors used to monitor the health of machinery. The list goes on.

On the face of it, all these devices are connected to the "internet". But most people tend to think of the internet as something "that comes out of the box in the corner" (i.e. the router) (direct quote from my mother, Irene). So how are all these billions of devices connecting and communicating? They connect to servers, and it is these servers that are often referred to as the "cloud".

All of these devices are connected to a network; they take readings from their sensors and send that information across the network to a location where it can be stored and processed. From this perspective, these devices sit at the 'edge' of the network, and this is where the term edge device comes from.

For a long time, IoT devices were seen simply as a way to collect data via their onboard sensors. They would collect the data and then transmit it back to a hub for processing. This approach is expensive in a number of ways. First, it costs a lot of money to transmit large amounts of data, due mainly to connectivity and storage costs, and for any battery-powered IoT device, transmitting data is an extremely power-hungry task. Second, it is expensive in human time too, because people need to evaluate and process the data, potentially making decisions based on that analysis.

So, sending information back to a central location to be processed, and then returning the result of that processing, takes time. Not a lot of time in human terms, but on the timescales a computer works to, the lag between sending the information and getting a response can be long. This lag, more correctly called latency, can be detrimental when the system requires a fast response. For example, it is not difficult to imagine a situation where a self-driving car needs to decide whether the 'object' it has detected is a person crossing the road in front of it, and whether it needs to apply the brakes in an emergency stop. It would not do to have a system that needed to send the collected data off to a server, wait for the server to determine whether the object was a human, and then send the result back to the computer on the vehicle, which then needs to react. By the time all of that took place, it may be too late. And that is before considering what could happen if there were a momentary loss of connection to the network.

Issues like this have pushed the need to remove that latency. This could be achieved by having faster connections and by increasing the computing speed on the server side. Another way is to remove the need for the server side to do the processing. In this situation, the system that collects the data also processes the data to get a result. This is where we enter the world of Edge AI.

Embedded machine learning is a complex field that requires a variety of tools, both hardware and software. This section will provide a brief overview of the different types of tools available to TinyML developers.

Hardware

TinyML hardware is produced by a number of manufacturers. Examples include the SparkFun Edge Development Board (Apollo3 Blue), a board designed in collaboration with Google and Ambiq [12]; ST Microelectronics, which provides a set of tools that allow a user to map a pre-trained neural network onto one of its purposely designed STM32 MCUs and run inferences [13]; and the Raspberry Pi Pico, a low-cost microcontroller board that can be connected to a range of sensors and supports common communication protocols [14]. There is also the Arduino Nano 33 BLE Sense [15], which has a range of onboard sensors, including an IMU and a proximity sensor, as well as Bluetooth capabilities. This is a very small subset of the hardware available to anyone who wants to get a TinyML system going.

There are many edge AI devices that have been designed with a specific use case, or area of use, in mind. For example, there is Coral, a platform developed by Google for AI applications in a production environment. Its System on Module (SoM) device is a fully integrated system for accelerated machine learning applications (including CPU, GPU, Edge TPU, Wi-Fi, and Bluetooth). There is also the Jetson Nano from NVIDIA, an AI computer for makers, learners, and developers. The Jetson Nano allows for the use of libraries and APIs such as TensorRT and cuDNN for deep learning applications that require higher performance, CUDA for GPU-accelerated applications, NVIDIA Container Runtime for containerised GPU-accelerated applications, as well as APIs for sensor development and working with OpenCV. This, along with NVIDIA's other AI-focused development kits, is aimed at prototyping before being put into production.

A lot of these systems have sensors onboard, but they are not limited to those onboard sensors: external sensors can be added via the pins provided by the MCU, which can be configured as general-purpose input/output (GPIO) pins.
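As a concrete illustration, the MicroPython sketch below (for the Raspberry Pi Pico mentioned above) reads an external analogue sensor wired to one of the Pico's ADC-capable GPIO pins. The pin choice and voltage scaling are assumptions for illustration only, not a prescription for any particular sensor.

```python
# MicroPython sketch (Raspberry Pi Pico): read an external analogue sensor
# wired to GPIO26 / ADC0. Pin choice and scaling are illustrative only.
from machine import ADC
import time

sensor = ADC(26)                        # GPIO26 is ADC channel 0 on the Pico

while True:
    raw = sensor.read_u16()             # 16-bit reading (0-65535)
    voltage = raw * 3.3 / 65535         # convert to volts (3.3 V reference)
    print("ADC raw:", raw, "voltage:", round(voltage, 3))
    time.sleep(1)
```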

Edge AI is built on constrained devices, such as microcontrollers (MCUs), which are the primary platform for edge AI applications. Limited processing power, available memory and power consumption are the main factors constraining these devices.

The processor is the centre of the whole system, being the element that runs applications and executes the algorithm for an AI application. Depending on the system, there may also be co-processors or accelerators. An accelerator is a piece of hardware designed to carry out a specific task. For example, a typical accelerator is the floating-point unit (FPU), added to a system to carry out floating-point arithmetic as quickly as possible. Other accelerators common in edge AI applications are digital signal processing (DSP) blocks and linear algebra blocks.

Another type of hardware configuration is the field programmable gate array (FPGA). An FPGA is an IC which allows the user to configure the hardware to meet the requirements of the project. Using FPGAs allows the developer to build a processor that is specifically tailored to running machine learning models. This flexibility makes FPGAs an excellent platform for developing AI accelerators. Intel has a range of FPGAs, such as the Intel Cyclone family, which allow real-time, low-latency, and low-power deep learning inference. Other manufacturers are producing AI-focused hardware: Qualcomm, for example, has, among other systems, the Snapdragon, a System on Chip (SoC) which contains a CPU and GPU and is very power efficient.

Software

There are two main languages typically involved in TinyML: Python and C. Python is typically used for data analysis and model training, and C, the standard language of embedded systems, is used to program the MCU.

While Python and C may be the main languages, there are other options available, such as Julia and R, as well as Weka (the Waikato Environment for Knowledge Analysis), which works with Java.

When we look at Python, there are a number of frameworks we can use to implement our model on a constrained device. These include:

  • TensorFlow Lite (TFL)
  • PyTorch
  • Edge Impulse
  • uTensor
  • STM32Cube.AI

Libraries like TFL and PyTorch require the designer to set the system parameters in code, while platforms such as Edge Impulse provide the designer with more of a "low-code" or "no-code" interface for setting those parameters. Such platforms provide a level of abstraction which may not give the designer the level of control they want.

TFL can be used to develop models that will work on Android and iOS devices, as well as on MCUs and embedded Linux systems. While Python and C are the main languages, TFL has support for Objective-C, C++, Java and Swift. The ability to use multiple languages is a great advantage, but the ability to optimise the models is essential to the success of TFL. This will be discussed in the following section.
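To make the optimisation point concrete, the sketch below shows the typical TFL workflow: a small Keras model is trained in Python and then converted to a quantised TensorFlow Lite flat buffer. The model shape and data here are placeholders chosen purely for illustration; only the conversion calls reflect the standard TFL API.

```python
# Train a tiny Keras model and convert it to a quantised TensorFlow Lite
# model suitable for a constrained device. Model and data are placeholders;
# the conversion workflow is the point of the example.
import numpy as np
import tensorflow as tf

# Toy dataset: 3-axis accelerometer windows -> 2 classes (made up).
x_train = np.random.rand(256, 128, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(256,))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 3)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=2, verbose=0)

# Convert with default optimisations (dynamic-range quantisation).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

print(f"TFLite model size: {len(tflite_model)} bytes")
```

On the MCU side, the resulting model.tflite bytes would typically be converted into a C array and executed with the TensorFlow Lite for Microcontrollers interpreter.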

Previous Research

An early piece of work was done by Nicholas Lane et al. in Squeezing Deep Learning into Mobile and Embedded Devices [3]. This work points out that the move towards deep learning in embedded systems has been slow, in part due to the large resources required to process the many layers of interconnected nodes, added to the fact that carrying out a single classification using sensor data can require calculations over thousands, if not millions, of parameters. It is clear that when these models were designed, no thought was given to running them on a constrained device. This is not unusual: a system is designed and implemented, and subsequently optimised. This optimisation did take place, and [11] points to work on an efficient speech recognition engine using a sparse LSTM on an FPGA [4], where the authors proposed a method that compresses the LSTM model by a factor of 20, increasing prediction speed with negligible loss in accuracy. The authors also proposed a scheduler to allow for parallelism as well as scheduling of data flow, and the final step was designing a hardware architecture that works directly on the sparse LSTM model. As this last item suggests, the work is aimed at hardware improvements. Other work, such as [5], also has a focus on hardware. In this case, the authors focus on a method to improve the efficiency of continuous mobile vision by designing an analogue computational image sensor. Once the sensor has captured an image, it processes that image through a deep convolutional network, all in the analogue domain. From there, the system exports digital vision features. The proposed method greatly reduces the workload applied to the analogue-to-digital converter and the subsequent digital electronics, which allows for efficient continuous mobile vision applications.
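The compression in [4] combines pruning, quantisation and hardware co-design; the short NumPy sketch below illustrates only the simplest ingredient of that family of techniques, magnitude-based pruning of a weight matrix, and is not a reproduction of the authors' method.

```python
# Illustrative magnitude pruning: zero out the smallest-magnitude weights
# of a layer so that only a given fraction survive. A toy sketch of one
# ingredient of model compression, not the ESE method from [4].
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest `sparsity` fraction zeroed."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype("float32")
w_pruned = prune_by_magnitude(w, sparsity=0.95)    # keep roughly 5% of weights
print("non-zero fraction:", np.count_nonzero(w_pruned) / w_pruned.size)
```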

Another paper of note is [6], in which the authors discuss the design and implementation of DeepX, a software system designed to accelerate the implementation of deep learning algorithms by overcoming a number of issues: run-time optimisation, resource availability and processor selection, to name a few. The DeepX approach sets out to reduce the resources (memory, computation power, energy) used by making good use of network-based computation as well as the range of available local processors. The first proposed technique is called Runtime Layer Compression (RLC). Up to the time of this work, the common approaches to optimising resources in embedded devices, such as compression, focused on the training phase of the workflow. RLC extends compression principles to provide runtime control of the computation and memory management (with the side effect of better energy efficiency) used during the inference phase. As well as RLC, the authors propose a second technique as part of DeepX: Deep Architecture Decomposition (DAD). Where a standard deep learning architecture has many layers and thousands of units, DAD identifies "unit-blocks" of the architecture and generates a "decomposition plan" that assigns the blocks to a local or remote processor. There are, of course, existing algorithms for assigning computing tasks to the cloud, but they are unable to identify optimised approaches, due mainly to their lack of understanding of learning algorithms. DAD works to overcome these issues, among others.

The approaches above have looked at implementing machine learning or deep learning techniques from either a hardware or a systems point of view. This leaves the final area of investigation: algorithms. If we think of some type of embedded device which requires a deep learning algorithm as part of its function, that work would typically be offloaded to the cloud. But that comes with problems, as the inference implementation becomes susceptible to unpredictable latency and throughput issues. It can also expose the user to privacy concerns, as the data is being processed away from the system, by a third party. Allowing inference predictions or classifications to be carried out on the device itself removes these issues and concerns. The work detailed in [7] looks at the development of algorithms which allow processing tasks to be carried out on the device, removing the need for processing in the cloud. The authors developed a sparse\footnote{In scientific domains, typically numerical analysis or scientific computing, the term sparse refers to a matrix or array in which most of the elements are zero. There is no strict definition of what proportion of the elements should be zero for a matrix to be called sparse, though there are some rules of thumb; conversely, if most of the elements are non-zero, the matrix is considered dense.} coding- and convolution kernel-based system to optimise deep learning model layers through the development of a Layer Compression Compiler (LCC), which optimises the model passed to it, and a runtime framework, the Sparse Inference Runtime (SIR), that can use this optimised model - using SIR results in a reduction in computation, energy usage and required memory. The final item is the Convolution Separation Runtime (CSR), which aims to significantly reduce the operations needed for the convolution process. The authors state that these techniques, which fall under the umbrella of SparseSep, allow a developer to adopt existing, commonly used deep models for their work.

The main idea behind this work is that the computational and space complexity of deep learning models can be improved through sparse representation of key layers and separation of convolution layers. One of the building blocks for the work is sparse dictionary learning, which is used in areas such as compressed sensing and signal recovery. The aim is to find a sparse representation of the input data in the form of a linear combination of basic elements. These elements are called atoms, and they make up the dictionary.
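A minimal sketch of the dictionary-learning idea is shown below, using scikit-learn rather than the authors' own SparseSep tooling: a stand-in for a layer's weight matrix is approximated as a set of sparse codes multiplied by a dictionary of atoms, so most coefficients are zero. The matrix sizes and sparsity settings are arbitrary illustrative choices.

```python
# Toy illustration of sparse dictionary learning: approximate a matrix as
# (sparse codes) x (dictionary of atoms). Uses scikit-learn; this is the
# underlying idea only, not the SparseSep implementation from [7].
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))           # stand-in for a layer's weights

dl = DictionaryLearning(
    n_components=16,                        # number of atoms in the dictionary
    transform_algorithm="omp",              # orthogonal matching pursuit
    transform_n_nonzero_coefs=4,            # at most 4 non-zero codes per row
    random_state=0,
)
codes = dl.fit_transform(W)                 # sparse codes, shape (64, 16)
dictionary = dl.components_                 # atoms, shape (16, 32)

W_hat = codes @ dictionary                  # sparse approximation of W
sparsity = 1.0 - np.count_nonzero(codes) / codes.size
error = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"code sparsity: {sparsity:.2f}, relative reconstruction error: {error:.2f}")
```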

The authors tested their system on various processor platforms, and their testing showed that SparseSep allowed inference for various deep models to execute more efficiently, requiring, on average, just over 11 times less memory and running just over 13 times faster.

This covers only a small section of the early work done on embedded machine learning systems, but it gives an idea of where efficiencies were sought. It has to be said that, even so, there is still a lot of work to be done. As is stated in [8], when looking at power consumption, the power required to run a system can be impacted by elements such as the components used in the system, the algorithm selected and how it is distributed across the elements of the system, and by data management, data transmission and data storage. The authors also make the point that none of these elements should be optimised as an individual component; rather, the system as a whole should be optimised. The authors of this work are looking at the world of wearable devices in healthcare, but the statement holds true for all constrained devices.

Review of Embedded AI

This section of the document aims to present a review of embedded AI, or TinyML. We will discuss how a wider Embedded AI (EAI) architecture could be implemented, and then move on to look at the challenges faced by EAI developers.

Within the field of AI, embedded systems are most closely associated with the Internet of Things (IoT). While not all IoT systems require AI, it can be argued that all embedded AI systems fall under the IoT umbrella. Some people have started to refer to IoT systems that use AI as AIoT, or the Artificial Intelligence of Things. On a personal note, this seems like a rather forced term, and not one I intend to use again; it is noted here for completeness. Instead, I prefer the term AI-enabled IoT.

AI is employed on IoT devices for a number of reasons. As the development and deployment of these systems has grown, our world has become more connected. But for a long time, that is all these devices were capable of: connections. As the hardware capabilities of IoT systems have increased, researchers have started to investigate the potential of deploying AI models on IoT devices, to allow the device to carry out some level of analysis and either report back on that analysis or take an action based on it.

AI-Enabled IoT Architectures

A tri-tier architecture is discussed in [16]. The authors say this architecture is similar to that used for IoT systems which have four layers [17]:

  • Perception layer: sensor inputs
  • Transport layer: Wi-Fi, routing, Bluetooth
  • Processing layer: data centres, web services, cloud
  • Application layer: IoT devices, smart cars, etc.

The authors in [16] propose a three-layer system comprising an end layer, an edge layer and a cloud layer. This architecture is sometimes referred to as fog computing and represents a holistic set-up in which some devices are powered from the grid while others are battery-powered. The authors of [18] note that because the end device has to complete tasks to a deadline to meet the goals of a real-time application, the device will either run as a bare-metal application or run a real-time operating system (RTOS) instead of a more standard operating system (OS). The advantage of such a set-up on the end device is that the system can allow for multi-tasking and task prioritisation in an effort to meet deadlines, something a standard OS (such as Linux) cannot guarantee.

In this system, the end layer can be thought of as being similar to the perception layer in the IoT model. The perception layer can have a wide variety of sensors attached to it. In an AI-enabled architecture, the end layer will also interact with the physical world via a range of sensors and actuators; however, it will also carry out computational tasks, such as data preprocessing [16]. This capability of performing computational tasks at the end layer (i.e. on the device) can reduce the latency, bandwidth and cost involved in transmitting data, all of which would be higher in a more traditional IoT system, and all of which improves system performance.

The next layer in this architecture is the edge layer. The edge layer can be thought of as having a series of nodes at its edges. The nodes on one side are responsible for receiving data from the perception layer; the data is then passed to the other side of the edge layer, where more involved computational tasks are carried out. This layer again offers the benefit of reduced latency, as well as providing an always-ready, or always-on, service without the need for a reliable internet connection. TinyML systems offer improved data security and privacy, as no data is uploaded to a cloud service for processing; however, they also have lower computational power than cloud-based AI systems. The advantage of neural networks, and deep neural networks, is their suitability for use within such embedded systems.

Finally, there is the cloud layer, which coordinates what is happening on the end layer and the edge layer. The cloud layer allows the device to use resources through the internet where this is an option, and it enables the so-called "smart" elements of IoT devices (smart homes, smart fridges, smart cities and so on). Any required data can also be transmitted back to the cloud for storage and further processing to better train the model.
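To make the end-layer role concrete, the sketch below shows the kind of lightweight preprocessing described above: windowing raw sensor samples and extracting a few summary features on the device, so that only a compact feature vector (or a final classification) needs to travel up to the edge or cloud layers. The window length and feature choices are illustrative assumptions, not a prescription from [16].

```python
# Illustrative end-layer preprocessing: turn a window of raw accelerometer
# samples into a small feature vector so only the features (or a final
# classification) are transmitted, not the raw stream. Values are made up.
import numpy as np

WINDOW = 128                                        # samples per window (assumed)

def extract_features(window: np.ndarray) -> np.ndarray:
    """Per-axis mean, standard deviation and peak magnitude."""
    return np.concatenate([
        window.mean(axis=0),
        window.std(axis=0),
        np.abs(window).max(axis=0),
    ])

raw = np.random.rand(WINDOW, 3).astype("float32")   # stand-in for 3-axis IMU data
features = extract_features(raw)

print("raw bytes:", raw.nbytes, "-> feature bytes:", features.nbytes)
```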

Sources

[1] S.R. Department. “Internet of things (iot) connected devices installed base worldwide from 2015 to 2025.” (2016), [Online]. Available: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/ (accessed: 22.10.2022)
[2] https://iot-analytics.com/number-connected-iot-devices
[3] N.D. Lane, S. Bhattacharya, A. Mathur, P. Georgiev, C. Forlivesi, and F. Kawsar, “Squeezing deep learning into mobile and embedded devices,” Pervasive Computing, pp. 82–88, Jul. 2017. DOI: 10.1109/MPRV.2017.2940968. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7994570.
[4] S. Han, J. Kang, H. Mao, et al., “Ese: Efficient speech recognition engine with sparse lstm on fpga,” Association for Computing Machinery, Inc, Feb. 2017, pp. 75–84, ISBN: 9781450343541. DOI: 10.1145/3020078.3021745. [Online]. Available: https://arxiv.org/pdf/1612.00694.pdf
[5] R. Likamwa, Y. Hou, Y. Gao, M. Polansky, and L. Zhong, “Redeye: Analog convnet image sensor architecture for continuous mobile vision,” Institute of Electrical and Electronics Engineers Inc., Aug. 2016, pp. 255–266, ISBN: 9781467389471. DOI: 10.1109/ISCA.2016.31
[6] N.D. Lane, S. Bhattacharya, P. Georgiev, et al., “Deepx: A software accelerator for low-power deep learning inference on mobile devices,” Institute of Electrical and Electronics Engineers Inc., Apr. 2016, ISBN: 9781509008025. DOI: 10.1109/IPSN.2016.7460664.
[7] S. Bhattacharya and N.D. Lane, “Sparsification and separation of deep learning layers for constrained resource inference on wearables,” Association for Computing Machinery, Inc, Nov. 2016, pp. 176–189, ISBN:9781450342636. DOI: 10.1145/2994551.2994564
[8] M.S. Diab and E. Rodriguez-Villegas, “Embedded machine learning using microcontrollers in wearable and ambulatory systems for health and care applications: A review,” IEEE Access, vol. 10, pp. 98 450–98 474, 2022, ISSN: 21693536. DOI: 10.1109/ACCESS.2022.3206782.
[9] https://www.businesswire.com/news/home/20211013005793/en/Global-Microcontroller-Market-Size-Share-Trends-Analysis-Report-2021-2028---ResearchAndMarkets.com
[10] https://www.financialexpress.com/auto/industry/semiconductors-your-car-is-a-computer-on-wheels-maruti-suzuki-cv-raman-electric-cars/2261989/
[11] N.D. Lane, S. Bhattacharya, A. Mathur, P. Georgiev, C. Forlivesi, and F. Kawsar, “Squeezing deep learning into mobile and embedded devices,” Pervasive Computing, pp. 82–88, Jul. 2017. DOI: 10.1109/MPRV.2017.2940968. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7994570
[12] https://www.sparkfun.com/products/15170
[13] https://www.st.com/content/st_com/en/about/innovation---technology/artificial-intelligence.html
[14] https://www.hackster.io/mjrobot/tinyml-motion-recognition-using-raspberry-pi-pico-6b6071
[15] https://docs.arduino.cc/hardware/nano-33-ble-sense
[16] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar and A. Y. Zomaya, "Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence," in IEEE Internet of Things Journal, vol. 7, no. 8, pp. 7457-7469, Aug. 2020, doi: 10.1109/JIOT.2020.2984887.
[17] A. Simmins, "Dgtl Infra," 13 11 2022. [Online]. Available: https://dgtlinfra.com/internet-of-things-iot-architecture/. [Accessed 11 4 2023]
[18] S. Branco, A. G. Ferreira, and J. Cabral, “Machine Learning in Resource-Scarce Embedded Systems, FPGAs, and End-Devices: A Survey,” Electronics, vol. 8, no. 11, p. 1289, Nov. 2019, doi: 10.3390/electronics8111289.