
Linux Training

Cross development toolchains

Pre-compiled toolchains

Toolchain generation tools

  • Crosstool-NG: Crosstool-NG is a versatile (cross-)toolchain generator. It supports many architectures and components and has a simple yet powerful menuconfig-style interface. Crosstool-NG is targeted at building toolchains, and only toolchains; it is then up to you to use them the way you want.
  • Buildroot

Installing a pre-compiled toolchain

The Linux Boot

  • Booting Linux using u-boot
  • ROM boot loader initialises essential hardware
  • It locates the First Stage Boot Loader (FSBL), copies it to RAM and then passes control to it.
  • The Second Stage Boot Loader (SSBL) is copied to RAM and the FSBL then passes control to it. (By 'passes control' we mean that the previous stage ceases to execute; it is not a call from the first stage.)
  • The second stage boot loader (normally U-Boot) decompresses the kernel and device tree, loads them into RAM, then starts the kernel and passes it the boot arguments.
  • The kernel parses the passed parameters, locates the root file system, mounts it and then launches the init process (PID 1).

Linux kernel parameters

See here for details

The Linux startup sequence

The SystemV initialization system

Booting Linux through the network using TFTP and NFS

  • Set up a TFTP server
    • sudo apt-get update && sudo apt-get install xinetd tftpd tftp
    • Create the file `/etc/xinetd.d/tftp` with the following contents:


service tftp
{
protocol = udp
port = 69
socket_type = dgram
wait = yes
user = nobody
server = /usr/sbin/in.tftpd
server_args = /tftpboot
disable = no
}

  • Create the directory `/tftpboot/` (this matches the `server_args` entry above) and set its permissions:
  • $ sudo mkdir /tftpboot/
  • $ sudo chmod -R 777 /tftpboot/
  • $ sudo chown -R nobody /tftpboot/
  • Restart the `xinetd` service:
  • $ sudo service xinetd restart
  • In uboot
    • setenv serverip <host_pc_ip_address>
    • saveenv
    • Next modify U-Boot's boot command to boot via TFTP:
    • setenv origbootcmd "$bootcmd"
    • setenv bootcmd "dhcp; tftp ${kernel_addr} ${serverip}:Image; tftp ${fdt_addr} ${serverip}:yourdevicetree.dtb; booti ${kernel_addr} - ${fdt_addr}"
    • saveenv

Booting from NFS

  • Beaglebone example
  • setenv ipaddr 192.168.0.250
  • setenv serverip 192.168.0.251
  • tftpboot 0x80F80000 am335x-boneblack.dtb
  • tftpboot 0x80007FC0 uImage-BBB
  • setenv bootargs console=ttyO0,115200n8 root=/dev/nfs rw nfsroot=192.168.0.251:/home/cpeacock/export/rootfs ip=192.168.0.250:::::eth0
  • bootm 0x80007FC0 - 0x80F80000

Other examples

  • bootargs = "earlycon console=ttyPS0,115200 initcall_debug ignore_loglevel debug debug_locks_verbose=1 root=/dev/nfs nfsroot=192.168.10.xxx:/home/xxxxxx/NFS_MOUNT_POINT,tcp ip=192.168.10.xx rw vt.global_cursor_default=0 modprobe.blacklist=bad_driver sched_debug mminit_loglevel=4 loglevel=8 clk_ignore_unused"; stdout-path = "serial0:115200n8";

Creating the embedded Linux kernel

  • Downloading stable source code
  • Getting a tarball
  • Using GIT
  • Configuring the kernel
  • Compiling the kernel and its modules
  • Modules delivered in-tree
  • Out-of-tree modules

Installing the kernel and the modules

  • Configuring and compiling a kernel for a target board: Raspberry Pi example
    • cp arch/arm/configs/bcmrpi_cutdown_defconfig .config
    • Then configure the kernel build
    • make ARCH=arm CROSS_COMPILE=/usr/bin/arm-linux-gnueabi- oldconfig
    • Optional: Customise the build using menuconfig
    • make ARCH=arm CROSS_COMPILE=/usr/bin/arm-linux-gnueabi- menuconfig
    • Then run the compilation
    • make ARCH=arm CROSS_COMPILE=/usr/bin/arm-linux-gnueabi- -k
  • Question: how are modules named?

Linux kernel programming and debugging

Development in the Linux kernel

Memory allocation

    • The buddy allocation system is an algorithm in which a larger memory block is repeatedly divided into smaller blocks to satisfy a request, giving a best-fit allocation. The two halves of a split block are of equal size and are called buddies. One of the two buddies is divided further in the same manner until a block of the requested size is obtained. The benefit of this technique is that two free buddies can later be combined into a larger block when memory is released.
      • Coalescing refers to how quickly adjacent free buddies can be combined to form larger segments. For example, when the kernel releases the unit C1 it was allocated, C1 and its buddy C2 can be coalesced into a 64 KB segment B1; B1 can in turn be coalesced with its buddy B2 to form a 128 KB segment, and ultimately the original 256 KB block can be recovered.
    • Slab allocation is a second strategy for allocating kernel memory. It eliminates the fragmentation caused by repeated allocations and deallocations: memory chunks suitable for data objects of a certain type or size are pre-allocated, and memory that held an object of a given type is retained for reuse by subsequent allocations of objects of the same type. The cache does not free the space immediately after use; it keeps frequently required objects at hand so that later requests can be satisfied quickly. Two terms are associated with this method (a usage sketch follows this list):
      • Slab – a slab is made up of one or more physically contiguous pages. The slab is the actual container of the data associated with objects of the kind held by the containing cache.
      • Cache – a cache consists of one or more slabs and holds objects of a single kind; there is one cache for each unique kernel data structure (for example, one cache for task_struct objects and another for inodes).
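As a rough illustration, the sketch below shows how a driver might use the slab allocator through the kmem_cache API. The object type struct foo, the cache name "foo_cache" and the helper function names are made up for this example.

#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical object type managed by its own slab cache. */
struct foo {
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int foo_cache_setup(void)
{
    /* One cache per object type; objects are exactly sizeof(struct foo). */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static void foo_cache_use(void)
{
    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

    if (f) {
        f->id = 1;
        kmem_cache_free(foo_cache, f);   /* the object goes back to its slab */
    }
}

static void foo_cache_teardown(void)
{
    kmem_cache_destroy(foo_cache);
}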

Free page management

If a module needs to allocate big chunks of memory, it is usually better to use a page-oriented technique. Requesting whole pages also has other advantages. To allocate pages, the following functions are available (a short sketch follows the list):

  • get_zeroed_page(unsigned int flags); Returns a pointer to a new page and fills the page with zeros.
  • __get_free_page(unsigned int flags); Similar to get_zeroed_page, but doesn’t clear the page.
  • __get_free_pages(unsigned int flags, unsigned int order); Allocates and returns a pointer to the first byte of a memory area that is potentially several (physically contiguous) pages long but doesn’t zero the area. The flags argument works in the same way as with kmalloc; usually either GFP_KERNEL or GFP_ATOMIC is used, perhaps with the addition of the __GFP_DMA flag (for memory that can be used for ISA direct-memory-access operations)
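A minimal sketch of page-oriented allocation and release using the functions above; the function name page_alloc_demo and the order of 3 (eight pages) are arbitrary choices for illustration.

#include <linux/gfp.h>
#include <linux/mm.h>

static void page_alloc_demo(void)
{
    unsigned long addr;

    /* One zero-filled page. */
    addr = get_zeroed_page(GFP_KERNEL);
    if (addr)
        free_page(addr);

    /* 2^3 = 8 physically contiguous pages, contents left uninitialised. */
    addr = __get_free_pages(GFP_KERNEL, 3);
    if (addr)
        free_pages(addr, 3);    /* the order must match the allocation */
}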

Normal memory allocation

Malloc

A process's memory is usually classified as either static, where the size is predetermined at compile time, or dynamic, where space is allocated as needed at runtime. The latter, in turn, is divided into heap space, where malloc()'d memory comes from, and stack, where functions' temporary work space is placed. For very large requests, malloc() uses the mmap() system call to find addressable memory space. This helps reduce the negative effects of memory fragmentation, where large freed blocks remain locked behind smaller, more recently allocated blocks lying between them and the end of the allocated space.
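A small user-space sketch of the distinction: the 128 KB figure is glibc's default M_MMAP_THRESHOLD, so the exact cut-over point is an assumption about the C library in use.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Small request: typically served from the brk/sbrk-managed heap arena. */
    char *small = malloc(64);

    /* Large request: glibc normally switches to mmap() above its
     * M_MMAP_THRESHOLD (128 KB by default), so this block is likely
     * to be backed by a private anonymous mapping. */
    char *large = malloc(1 << 20);   /* 1 MiB */

    if (!small || !large) {
        perror("malloc");
        return 1;
    }

    free(large);   /* an mmap-backed block is returned with munmap() */
    free(small);
    return 0;
}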

kmalloc

The first argument to kmalloc is the size of the block to be allocated. The second argument, the allocation flags, is much more interesting, because it controls the behavior of kmalloc in a number of ways.

The most commonly used flag, GFP_KERNEL, means that the allocation (internally performed by calling, eventually, __get_free_pages, which is the source of the GFP prefix) is performed on behalf of a process running in kernel space. In other words, this means that the calling function is executing a system call on behalf of a process. Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for a page when called in low-memory situations. A function that allocates memory using GFP_KERNEL must, therefore, be reentrant and cannot be running in atomic context. While the current process sleeps, the kernel takes proper action to locate some free memory, either by flushing buffers to disk or by swapping out memory from a user process.

GFP_KERNEL isn’t always the right allocation flag to use; sometimes kmalloc is called from outside a process’s context. This type of call can happen, for instance, in interrupt handlers, tasklets, and kernel timers. In this case, the current process should not be put to sleep, and the driver should use a flag of GFP_ATOMIC instead. The kernel normally tries to keep some free pages around in order to fulfill atomic allocation. When GFP_ATOMIC is used, kmalloc can use even the last free page. If that last page does not exist, however, the allocation fails.

The allocation flags listed above can be augmented by ORing in additional flags, which change how the allocation is carried out, for example __GFP_DMA, __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY.

The kernel manages the system’s physical memory, which is available only in page-sized chunks. As a result, kmalloc looks rather different from a typical user-space malloc implementation. A simple, heap-oriented allocation technique would quickly run into trouble; it would have a hard time working around the page boundaries. Thus, the kernel uses a special page-oriented allocation technique to get the best use from the system’s RAM.

Linux handles memory allocation by creating a set of pools of memory objects of fixed sizes. Allocation requests are handled by going to a pool that holds sufficiently large objects and handing an entire memory chunk back to the requester. The memory management scheme is quite complex, and the details of it are not normally all that interesting to device driver writers.

The one thing driver developers should keep in mind, though, is that the kernel can allocate only certain predefined, fixed-size byte arrays. If you ask for an arbitrary amount of memory, you’re likely to get slightly more than you asked for, up to twice as much. Also, programmers should remember that the smallest allocation that kmalloc can handle is as big as 32 or 64 bytes, depending on the page size used by the system’s architecture.

There is an upper limit to the size of memory chunks that can be allocated by kmalloc. That limit varies depending on architecture and kernel configuration options. If your code is to be completely portable, it cannot count on being able to allocate anything larger than 128 KB. If you need more than a few kilobytes, however, there are better ways than kmalloc to obtain memory.
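The sketch below illustrates the typical choice between GFP_KERNEL and GFP_ATOMIC described above; the function names are invented for the example.

#include <linux/slab.h>
#include <linux/kernel.h>

/* Process context (e.g. a driver's open() handler): sleeping is allowed,
 * so GFP_KERNEL is the normal choice. */
static void *alloc_in_process_context(size_t len)
{
    return kmalloc(len, GFP_KERNEL);
}

/* Interrupt handler, tasklet or kernel timer: the code must not sleep,
 * so use GFP_ATOMIC and be prepared for the allocation to fail. */
static void *alloc_in_atomic_context(size_t len)
{
    void *buf = kmalloc(len, GFP_ATOMIC);

    if (!buf)
        pr_warn("atomic allocation of %zu bytes failed\n", len);
    return buf;
}

/* Either way, the buffer is released with kfree(). */
static void release_buffer(void *buf)
{
    kfree(buf);
}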

Virtual memory allocation

  • The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
  • The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.
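A minimal sketch of vmalloc()/vfree() usage, assuming a hypothetical driver that needs a large bookkeeping buffer; the 4 MiB size and the function names are arbitrary.

#include <linux/vmalloc.h>
#include <linux/errno.h>

static char *big_buffer;

static int vmalloc_demo(void)
{
    /* Virtually contiguous only; unsuitable for most DMA but fine for
     * large in-kernel bookkeeping. vmalloc() may sleep. */
    big_buffer = vmalloc(4 * 1024 * 1024);   /* 4 MiB */
    if (!big_buffer)
        return -ENOMEM;
    return 0;
}

static void vmalloc_demo_cleanup(void)
{
    vfree(big_buffer);    /* vmalloc'd memory is freed with vfree(), not kfree() */
}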

Huge allocations

  • Most current CPU architectures support pages bigger than the base page size (so the CPU/OS have fewer entries to look up); on Linux these are known as huge pages. This support is built on top of the multiple page sizes provided by most modern architectures. For example, x86 CPUs normally support 4K and 2M page sizes (1G if architecturally supported), the ia64 architecture supports 4K, 8K, 64K, 256K, 1M, 4M, 16M and 256M, and ppc64 supports 4K and 16M. A user-space sketch follows this list.
  • CMA (contiguous memory allocator) is intended to provide large, contiguous DMA buffers to drivers without requiring that memory be set aside for that exclusive purpose.
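A user-space sketch of requesting a huge page directly via mmap() with MAP_HUGETLB. It assumes a 2 MB huge page size and that huge pages have already been reserved (for example through /proc/sys/vm/nr_hugepages); both are configuration-dependent.

#define _GNU_SOURCE              /* exposes MAP_HUGETLB on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)  /* assume one 2 MB huge page (typical on x86_64) */

int main(void)
{
    /* Requires reserved huge pages, e.g.: echo 16 > /proc/sys/vm/nr_hugepages */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    memset(p, 0, LEN);           /* touch the mapping so the huge page is faulted in */
    munmap(p, LEN);
    return 0;
}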

Linked lists

Suppose we want to implement a linked list of the data structure Person. A common implementation is shown below:

struct Person {
    char name[30];
    unsigned int weight;
    unsigned char gender;
    struct list_head list; /* kernel's list structure */
};

The list_head structure is defined in list.h,

struct list_head {
    struct list_head *next, *prev;
};

struct Person personList;
INIT_LIST_HEAD(&personList.list);

There are two ways to create a new node of the data structure Person we defined above. To create a node dynamically:

struct Person* aNewPersonPointer;
aNewPersonPointer = kmalloc(sizeof(*aNewPersonPointer), GFP_KERNEL);
strcpy(aNewPersonPointer->name, "roman10");
aNewPersonPointer->weight = 130;
aNewPersonPointer->gender = 1;
INIT_LIST_HEAD(&aNewPersonPointer->list);

Or you can create a node statically:

struct Person aNewPerson = {
    .name = "roman10",
    .weight = 130,
    .gender = 1,
    .list = LIST_HEAD_INIT(aNewPerson.list),
};

The Linux kernel provides the following function to add a node to a linked list:

void list_add(struct list_head *new, struct list_head *head);

You can pass any list node as head; the new node will be inserted immediately after it. If you pass the last element as head, this function can be used to implement a stack. Following the previous example:

list_add(&aNewPerson->list, &personList.list);

There’s another function to add a linked list node,

void list_add_tail(struct list_head *new, struct list_head *head);

Similarly, you can pass any node as head; the new node will be inserted right before it. If you pass the first node as head, the new node will be inserted before the first node. As the Linux implementation of linked lists is circular, this is equivalent to inserting the new node after the last node. Therefore, this function can be used to implement a queue.

Following the previous example:

list_add_tail(&aNewPerson->list, &personList.list);

Delete a Node from the List

The Linux kernel provides a function to delete a node from the list:

void list_del(struct list_head *entry);

It deletes the entry node from the list. This function removes the entry node by disconnecting its prev and next pointers from the list, but it doesn't free any memory allocated for the entry node. Following the example above:

list_del(&aNewPerson->list);

Linked List Traversal

The list.h header defines a macro list_for_each(pos, head) for linked-list traversal. Both pos and head are pointers to the list_head. pos is used as a loop cursor, and head is the head node of the linked list. In our example,

struct list_head *node;
struct Person *aPerson;

list_for_each(node, &personList.list) {
    // access the list node through node
    aPerson = list_entry(node, struct Person, list);
}

list_entry is a macro that takes three arguments:

list_entry(ptr, type, member)

where ptr is a pointer to the list_head, type is the type of the structure the list_head is embedded in (struct Person in our case), and member is the name of the list_head field within that structure (list in our case).

An easier way to loop through the linked list is also provided, which combines the loop and list_entry function,

list_for_each_entry(pos, head, member)

In our example,

struct Person *aPerson;

list_for_each_entry(aPerson, &personList.list, list) {
    // access the member from aPerson
}

There is a third way to loop through the linked list, used when nodes are being removed from the list as it is traversed:

list_for_each_entry_safe(pos, n, head, member)

where n is another pointer to the user-defined structure, used as temporary storage. In our example:

struct Person *aPerson, *tempPerson;

list_for_each_entry_safe(aPerson, tempPerson, &personList.list, list) {
    // access the member from aPerson
    list_del(&aPerson->list);
    kfree(aPerson);
}

Debug tools

Exercise: Writing the "hello world" kernel module

Exercise: Adding a driver to kernel sources and configuration menu

Using module parameters

There is a macro, MODULE_PARM_DESC(), that is used to document arguments that the module can take. It takes two parameters: a variable name and a free-form string describing that variable.

module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
MODULE_PARM_DESC(myshort, "A short integer");

  • module_param_array(name, type, num, perm);
  • The first param is the parameter's (in this case the array's) name
  • The second param is the data type of the elements of the array
  • The third argument is a pointer to the variable that will store the number of elements of the array initialized by the user at module loading time
  • The fourth argument is the permission bits
  • Example: module_param_array(myintArray, int, &arr_argc, 0000);
  • Passing the arguments: insmod hello-5.ko mystring="bebop" mybyte=255 myintArray=-1
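Pulling the pieces together, here is a minimal sketch of a module that declares both a scalar parameter and an array parameter. The parameter names follow the examples above; the module name and log messages are invented.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/stat.h>

static short myshort = 1;
static int myintArray[4];
static int arr_argc;

module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
MODULE_PARM_DESC(myshort, "A short integer");

/* arr_argc receives the number of elements the user actually supplied. */
module_param_array(myintArray, int, &arr_argc, 0000);
MODULE_PARM_DESC(myintArray, "An array of up to 4 integers");

static int __init params_init(void)
{
    int i;

    pr_info("myshort = %hd\n", myshort);
    for (i = 0; i < arr_argc; i++)
        pr_info("myintArray[%d] = %d\n", i, myintArray[i]);
    return 0;
}

static void __exit params_exit(void)
{
    pr_info("params module unloaded\n");
}

module_init(params_init);
module_exit(params_exit);
MODULE_LICENSE("GPL");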

Reference Counting

Reference counting tracks the number of references to a given object in order to prevent that object from being prematurely freed. Although this is a conceptually simple technique, there are a great many variants adapted to different situations. A key reason for the variety of reference-counting techniques is the wide variety of mechanisms used to protect objects from concurrent access. Furthermore, the same object might be protected by different mechanisms at different times, which further increases the required number of styles of reference counting. In the Linux kernel, the main categories of synchronization mechanisms are

  • Locking, including semaphores and mutexes
  • Reference counts
    • Simple counting, with neither atomic operations nor memory barriers, can be used when the reference counter acquisition and release are both protected by the same lock.
    • Simple atomic counting may be used in cases where any CPU acquiring a reference must already hold a reference. This style is used when a single CPU creates an object for its own private use, but must allow other CPUs, tasks, timer handlers, or I/O completion handlers that it later spawns to also access this object. Atomic counting is required because locking is not used to protect all reference-count operations, which means that it is possible for two different CPUs to concurrently manipulate the reference count. If normal increment and decrement were used, a pair of CPUs might both fetch the reference count concurrently, perhaps both obtaining the value "3"; both would then write back "4", and one of the two increments would be lost.
  • RCU: Read-Copy-Update (RCU) is a technique for manual memory management that is growing ever more popular in the Linux kernel. Linux primitives supporting reference counting include:
    • void atomic_dec(atomic_t *var); Atomically decrements the referenced variable without necessarily issuing a memory barrier or disabling compiler optimizations.
    • int atomic_dec_and_test(atomic_t *var); Atomically decrements the referenced variable and returns true if the result is zero.
    • void atomic_set(atomic_t *var, int val); Sets the referenced variable to val.
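A sketch of simple atomic reference counting around a made-up object type (struct my_object); the kernel's kref API packages the same pattern, but plain atomic_t operations are shown here to match the primitives listed above.

#include <linux/atomic.h>
#include <linux/slab.h>

struct my_object {
    atomic_t refcount;
    int payload;
};

static struct my_object *my_object_create(void)
{
    struct my_object *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

    if (obj)
        atomic_set(&obj->refcount, 1);   /* the creator holds the first reference */
    return obj;
}

/* A caller must already hold a reference when it takes another one. */
static void my_object_get(struct my_object *obj)
{
    atomic_inc(&obj->refcount);
}

static void my_object_put(struct my_object *obj)
{
    /* atomic_dec_and_test() returns true only for the final reference,
     * so exactly one caller performs the free. */
    if (atomic_dec_and_test(&obj->refcount))
        kfree(obj);
}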

Kernel multi-tasking

Locking in the Kernel

There are two Main Types of Kernel Locks: Spinlocks and Mutexes

  • Spinlocks: The fundamental type is the spinlock (include/asm/spinlock.h), which is a very simple single-holder lock: if you can't get the spinlock, you keep trying (spinning) until you can. Spinlocks are very small and fast, and can be used anywhere. If a softirq shares data with user context, you have two problems: firstly, the current user context can be interrupted by a softirq, and secondly, the critical region could be entered from another CPU. This is where spin_lock_bh() (include/linux/spinlock.h) is used. It disables softirqs on that CPU, then grabs the lock; spin_unlock_bh() does the reverse. (The '_bh' suffix is a historical reference to "bottom halves", the old name for software interrupts; it should really be called spin_lock_softirq().)

Note that you can also use spin_lock_irq() or spin_lock_irqsave() here, which stop hardware interrupts as well

  • Mutex: The second type is the mutex. It is like a spinlock, but you may block while holding a mutex. If you can't lock a mutex, your task will suspend itself and be woken up when the mutex is released, which means the CPU can do something else while you are waiting. There are many cases when you simply can't sleep and so have to use a spinlock instead.
    • Calls like mutex_lock_interruptible() allow you to grab the mutex, and mutex_unlock() releases it. There is also mutex_lock(), which should be avoided because it will not return if a signal is received. A combined sketch of both lock types follows.
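The sketch below shows one typical use of each lock type: a spinlock taken with spin_lock_bh() around data shared with softirq context, and a mutex around configuration state that is only touched from process context. The variable and function names are invented for the example.

#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/errno.h>

static DEFINE_SPINLOCK(stats_lock);
static unsigned long packet_count;     /* also updated from a softirq/tasklet */

static DEFINE_MUTEX(config_mutex);
static int config_value;               /* only touched from process context */

/* Process context, data shared with a softirq: disable softirqs locally. */
static void bump_packet_count(void)
{
    spin_lock_bh(&stats_lock);
    packet_count++;
    spin_unlock_bh(&stats_lock);
}

/* Process context only; the caller may sleep while waiting for the mutex. */
static int update_config(int new_value)
{
    if (mutex_lock_interruptible(&config_mutex))
        return -ERESTARTSYS;           /* the wait was interrupted by a signal */
    config_value = new_value;
    mutex_unlock(&config_mutex);
    return 0;
}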

Task handling

Tasklet

In short, a tasklet is something like a very small thread that has neither a stack nor a context of its own. Such "threads" run quickly and to completion. The main features of tasklets are the following (a short scheduling sketch follows the list):

  • Tasklets are atomic, so we cannot call sleep() or use such synchronization primitives as mutexes, semaphores, etc. from them. We can, however, use spinlocks;
  • They are called in a "softer" context than an ISR. In this context hardware interrupts are allowed; they preempt a tasklet for the duration of the ISR. This context is called softirq in the Linux kernel and, in addition to running tasklets, it is also used by several other subsystems;
  • A tasklet runs on the same core that scheduled it, or rather on the core that was first to schedule it by raising the softirq, whose handlers are always bound to the raising core;
  • Different tasklets can run in parallel, but a tasklet cannot run concurrently with itself, as it runs on one core only: the core that scheduled its execution;
  • Tasklets are executed by the principle of non-preemptive scheduling, one by one, in turn. We can schedule them with two different priorities: normal and high.
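A minimal sketch of declaring and scheduling a tasklet from an interrupt handler. It uses the classic three-argument DECLARE_TASKLET(name, func, data) form mentioned below; kernels from 5.9 onward changed the macro to take a callback that receives a struct tasklet_struct * instead. The handler and tasklet names are invented.

#include <linux/interrupt.h>
#include <linux/kernel.h>

/* Classic (pre-5.9) form: the callback receives the 'data' value. */
static void my_tasklet_fn(unsigned long data)
{
    /* Softirq context: no sleeping; spinlocks only. */
    pr_info("tasklet ran with data %lu\n", data);
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 42);

/* Top half: would be registered with request_irq(); it only acknowledges
 * the device and defers the heavy work to the tasklet. */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    tasklet_schedule(&my_tasklet);
    return IRQ_HANDLED;
}

/* On module unload, tasklet_kill(&my_tasklet) waits for a scheduled
 * tasklet to finish before the code is removed. */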

Concurrent programming

Key differences between softirq and tasklet are:

  • Allocation: Softirqs are statically allocated at compile-time. Unlike tasklets, you cannot dynamically register and destroy softirqs. Tasklets can be statically allocated using DECLARE_TASKLET(name, func, data) or can also be allocated dynamically and initialized at runtime using tasklet_init(name, func, data)
  • Concurrency: Softirqs can run concurrently on several CPUs, even if they are of the same type because softirqs are reentrant functions and must explicitly protect their data structures with spinlocks. Tasklets are non-reentrant and tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs.
  • Processing: Softirqs are activated by means of raise_softirq(). The pending softirqs are processed by do_softirq() and the ksoftirqd kernel thread after being enabled by local_bh_enable() or by spin_unlock_bh(). Tasklets are a bottom-half mechanism built on top of softirqs, i.e. tasklets are represented by two softirqs: HI_SOFTIRQ and TASKLET_SOFTIRQ. Tasklets are actually run from a softirq. The only real difference between these types is that HI_SOFTIRQ-based tasklets run prior to the TASKLET_SOFTIRQ tasklets. So tasklet_schedule() basically calls raise_softirq(TASKLET_SOFTIRQ).
  • Note that softirqs (and hence tasklets and timers) are run on return from hardware interrupts, or on return from a system call. Also, as soon as the thread that raised the softirq ends, that single softirq (and no other) is run, to minimize softirq latency.

Kernel threads

A kernel thread is a kernel task running only in kernel mode; it usually has not been created by the fork() or clone() system calls. Examples are kworker and kswapd. It is a task_struct with no user-space components. A kernel thread is derived from the kthreadd kernel thread instead of the init process and is created by a kernel-only API. Kernel threads emerged from the need to run kernel code in process context, and they are the basis of the workqueue mechanism. Essentially, a kernel thread is a thread that only runs in kernel mode and has no user address space or other user attributes.
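A minimal sketch of creating and stopping a kernel thread with the kthread API; the thread name "my_worker" and the five-second period are arbitrary choices for illustration.

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kernel.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    /* Loop until kthread_stop() is called on this thread. */
    while (!kthread_should_stop()) {
        pr_info("worker thread doing periodic work\n");
        ssleep(5);      /* kernel threads run in process context and may sleep */
    }
    return 0;
}

static int start_worker(void)
{
    worker = kthread_run(worker_fn, NULL, "my_worker");
    if (IS_ERR(worker))
        return PTR_ERR(worker);
    return 0;
}

static void stop_worker(void)
{
    kthread_stop(worker);   /* blocks until worker_fn returns */
}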

Fixing race conditions with mutexes

  • Race conditions come about as a result of shared access to resources. When two threads of execution have a reason to work with the same data structures (or hardware resources), the potential for mix-ups always exists. So the first rule of thumb to keep in mind as you design your driver is to avoid shared resources whenever possible. If there is no concurrent access, there can be no race conditions, so carefully-written kernel code should have a minimum of sharing.
  • The usual technique for access management is called locking or mutual exclusion—making sure that only one thread of execution can manipulate a shared resource at any time.
  • Semaphores: A semaphore is a single integer value combined with a pair of functions that are typically called P and V. A process wishing to enter a critical section will call P on the relevant semaphore; if the semaphore's value is greater than zero, that value is decremented by one and the process continues. If, instead, the semaphore's value is 0 (or less), the process must wait until somebody else releases the semaphore. Unlocking a semaphore is accomplished by calling V; this function increments the value of the semaphore and, if necessary, wakes up processes that are waiting. When semaphores are used for mutual exclusion—keeping multiple processes from running within a critical section simultaneously—their value will be initially set to 1. Such a semaphore can be held only by a single process or thread at any given time. A semaphore used in this mode is sometimes called a mutex, which is, of course, an abbreviation for "mutual exclusion." Almost all semaphores found in the Linux kernel are used for mutual exclusion.
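A sketch of the P/V pattern using a kernel semaphore initialised to 1 so that it behaves as a mutex; the function names are invented, and sema_init() is used rather than a static initializer to stay version-neutral.

#include <linux/semaphore.h>
#include <linux/errno.h>

static struct semaphore my_sem;

static void my_sem_setup(void)
{
    sema_init(&my_sem, 1);      /* count of 1: the semaphore acts as a mutex */
}

static int critical_section(void)
{
    /* "P": decrement the count, or sleep until it becomes positive. */
    if (down_interruptible(&my_sem))
        return -ERESTARTSYS;    /* the sleep was interrupted by a signal */

    /* ... manipulate the shared resource ... */

    up(&my_sem);                /* "V": increment the count and wake a waiter */
    return 0;
}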

High Resolution Timers

Linux Kernel Tracing and Profiling (2.5 hours)

The kernel tracing infrastructure

  • Tracepoints
  • The ftrace function tracer
  • Kprobes
  • Event tracers

Debugging the kernel using traces

Exercise: Analyze kernel behavior using static and dynamic traces

Performance monitoring in the Linux kernel

  • Perfcounters
  • Perf events

Exercise: Use perf to analyse kernel and program performance