linux development - animeshtrivedi/notes GitHub Wiki
- https://kernelnewbies.org/LinuxVersions
- https://github.com/axboe/liburing/wiki/What%27s-new-with-io_uring-in-6.11-and-6.12
- Faster asynchronous Direct I/O using io_uring, https://kernelnewbies.org/Linux_6.6#Faster_asynchronous_Direct_I.2FO_using_io_uring
- There are two optimizations that have an effect on the performance (see the link for details)
- There are also some cache optimizations that might affect the performance
- User xattrs and direct IO, https://kernelnewbies.org/Linux_6.6#TMPFS
- Kernel Concurrency References, https://hackmd.io/@0xff07/linux-concurrency/%2F%400xff07%2FSk-G0xhY6
- Crash white paper: https://crash-utility.github.io/crash_whitepaper.html
- kernel boot params: https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
- Elixir: https://elixir.bootlin.com/linux/v6.9/source
cat /proc/kallsyms | grep 'memset'
0000000000000000 t __pfx_text_poke_memset
0000000000000000 t text_poke_memset
0000000000000000 T __pfx_memset_io
0000000000000000 T memset_io
0000000000000000 T __pfx___memset
0000000000000000 T __pfx_memset
0000000000000000 T memset
0000000000000000 T __memset
0000000000000000 t __pfx_memset_orig
0000000000000000 t memset_orig
0000000000000000 r __ksymtab___memset
0000000000000000 r __ksymtab_memset
0000000000000000 r __ksymtab_memset_io
0000000000000000 t time_nsec_memset_show [null_ablk]
0000000000000000 b time_nsec_memset [null_ablk]
0000000000000000 t num_fcount_memset_show [null_ablk]
0000000000000000 b num_fcount_memset [null_ablk]
0000000000000000 d kobj_num_fcount_memset [null_ablk]
0000000000000000 d kobj_time_nsec_memset [null_ablk]
0000000000000000 t __pfx_time_nsec_memset_show [null_ablk]
0000000000000000 t __pfx_num_fcount_memset_show [null_ablk]
0000000000000000 t __pfx_memset_probe2 [null_ablk]
0000000000000000 t memset_probe2 [null_ablk]
0000000000000000 t memset_extent_buffer [btrfs]
0000000000000000 t memset_extent_buffer.cold [btrfs]
0000000000000000 t __pfx_memset_extent_buffer [btrfs]
https://man7.org/linux/man-pages/man1/nm.1.html: If lowercase, the symbol is usually local; if uppercase, the symbol is global (external). There are however a few lowercase symbols that are shown for special global symbols ("u", "v" and "w").
"B"/"b": The symbol is in the BSS data section. This section typically contains zero-initialized or uninitialized data, although the exact behavior is system dependent.
"D"/"d": The symbol is in the initialized data section.
"T"/"t": The symbol is in the text (code) section.
"R"/"r": The symbol is in a read-only data section.
T means that the symbol is globally visible and can be used by other kernel code. https://stackoverflow.com/questions/39120818/what-is-the-difference-between-t-and-t-in-proc-kallsyms
https://stackoverflow.com/questions/44326565/perf-kernel-module-symbols-not-showing-up-in-profiling
Make sure to add `-g -fno-omit-frame-pointer` to the gcc flags. Then `make install` so that the module shows up in `/lib/modules/$(uname -r)/[extra|updates]/`. Also do not forget to run `sudo depmod -a` afterwards.
See the OOT-nullblk Makefile/Kbuild files.
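A minimal out-of-tree module Makefile along these lines (a sketch; `mymod.o` is a hypothetical object name, and `modules_install` for external modules installs under `extra/` by default):

```make
# Sketch: out-of-tree module build that keeps debug symbols for perf/crash
obj-m := mymod.o
ccflags-y += -g -fno-omit-frame-pointer

KDIR := /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KDIR) M=$(CURDIR) modules

install:
	$(MAKE) -C $(KDIR) M=$(CURDIR) modules_install
	depmod -a
```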
Boot-time parameters are set in /etc/default/grub, followed by update-grub2:
BOOT_IMAGE=/boot/vmlinuz-6.9.0-atr root=UUID=615a8273-ed80-47b3-87ff-d967f08e23af ro amd_pstate=disable amd_prefcore=disable cpuidle.off=1 cpufreq.off=1 processor.max_cstate=0 idle=halt nosmt=force iommu=off crashkernel=512M-:192M
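After a reboot, the parameters the running kernel actually booted with can be double-checked (a quick sketch):

```shell
# show the live kernel command line
cat /proc/cmdline
# list just the idle/frequency related parameters, if any
tr ' ' '\n' < /proc/cmdline | grep -E 'idle|cstate|pstate' || true
```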
checking : https://stackoverflow.com/questions/44286683/check-for-iommu-support-on-linux
Not enabled:
$ sudo find /sys | grep dmar
$
Enabled:
$ sudo find /sys | grep dmar
/sys/class/iommu/dmar2
/sys/class/iommu/dmar0
/sys/class/iommu/dmar3
/sys/class/iommu/dmar1
[...]
- Compile with `-fno-omit-frame-pointer -g`
- Check with `objdump --syms` or `file`
animesh.trivedi@flex20:~/fio$ objdump --syms /usr/bin/fio
/usr/bin/fio: file format elf64-x86-64
SYMBOL TABLE:
no symbols
animesh.trivedi@flex20:~/fio$ file /usr/bin/fio
/usr/bin/fio: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=cc5e0dd0e9922054dbdc229d347e67e887eff56f, for GNU/Linux 3.2.0, stripped
animesh.trivedi@flex20:~/fio$ file `which fio`
/home/animesh.trivedi/local/bin//fio: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=0f4edf72789f267f27cc426deb210a916cd47bc0, for GNU/Linux 3.2.0, with debug_info, not stripped
animesh.trivedi@flex20:~/fio$ objdump --syms ./fio
./fio: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 Scrt1.o
00000000000003c4 l O .note.ABI-tag 0000000000000020 __abi_tag
0000000000000000 l df *ABS* 0000000000000000 gettime.c
000000000002c450 l F .text 0000000000000038 clock_cmp
000000000002c490 l F .text 00000000000001e0 clock_thread_fn
000000000002c670 l F .text 0000000000000025 fio_get_mono_time.part.0
00000000000bbfe0 l O .rodata 0000000000000012 __PRETTY_FUNCTION__.2
000000000002c6a0 l F .text 00000000000001e8 __fio_gettime
00000000001d0190 l O .bss 0000000000000004 cycles_wrap
00000000001d01b8 l O .bss 0000000000000008 cycles_start
00000000001d0194 l O .bss 0000000000000004 max_cycles_shift
[..]
sudo btrfs filesystem mkswapfile --size 4g --uuid clear ~/swapfile
sudo swapon -p 101 ~/swapfile
swapon
https://forum.garudalinux.org/t/create-a-swapfile-afterwards-problems-with-btrfs/34326
The `cpuid` command is useful to extract the microarchitectural features of a CPU: https://linux.die.net/man/1/cpuid
sudo apt-get install cpuid
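If cpuid is not installed, /proc/cpuinfo already carries the raw feature flags on x86 (a quick sketch):

```shell
# model name and the first CPU's feature-flag line (x86)
grep -m1 '^model name' /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo
```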
# reserve some pages
echo 512 > /proc/sys/vm/nr_hugepages
# then mount the file system
mount -t hugetlbfs -o uid=$USER,mode=700,pagesize=2M,size=2G none ~/mnt/hugetlbfs/
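A quick sanity check of the reservation and the mount (sketch; the mount line only appears if the mount above succeeded):

```shell
# how many huge pages are reserved and how many are still free
grep -E '^HugePages_(Total|Free)' /proc/meminfo
# is a hugetlbfs instance mounted?
mount | grep hugetlbfs || true
```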
Excellent introduction: https://sysprog21.github.io/lkmpg/ (https://github.com/sysprog21)
When compiling a kernel module, it will inherit all the defined CONFIG_* macros from /usr/src/linux-headers-xxx/include/generated/autoconf.h
https://wiki.ubuntu.com/KernelTeam/GitKernelBuild
make -j $(getconf _NPROCESSORS_ONLN) deb-pkg LOCALVERSION=-custom
-> Kernel hacking
-> Compile-time checks and compiler options
-> Generate BTF typeinfo (DEBUG_INFO_BTF [=n])
https://github.com/akopytov/sysbench
- https://upcloud.com/blog/evaluating-cloud-server-performance-with-sysbench
- https://wiki.gentoo.org/wiki/Sysbench
- https://towshif.github.io/site/tutorials/Linux%20Shell/benchmark-linux/
- https://blog.cloud-mercato.com/why-i-love-sysbench/
sysbench --threads=128 --time=10 memory --memory-total-size=1T --memory-block-size=$((128*1024)) --memory-scope=global --memory-access-mode=rnd --memory-oper=read run
- Excellent write up with experiments: https://wiki.linuxfoundation.org/realtime/documentation/howto/applications/cpuidle
- https://docs.kernel.org/admin-guide/pm/cpuidle.html
- https://vstinner.github.io/intel-cpus.html
- https://wiki.archlinux.org/title/CPU_frequency_scaling
AMD: amd_pstate=disable amd_prefcore=disable cpuidle.off=1 cpufreq.off=1 processor.max_cstate=0 idle=halt
Intel: intel_pstate=disable cpuidle.off=1 cpufreq.off=1 processor.max_cstate=0 intel_idle.max_cstate=0 idle=halt
There are four CPUIdle governors available, menu, TEO, ladder and haltpoll. Which of them is used by default depends on the configuration of the kernel and in particular on whether or not the scheduler tick can be stopped by the idle loop. Available governors can be read from the available_governors file, and the governor can be changed at runtime. The name of the CPUIdle governor currently used by the kernel can be read from the current_governor_ro or current_governor file under /sys/devices/system/cpu/cpuidle/ in sysfs.

Which CPUIdle driver is used, on the other hand, usually depends on the platform the kernel is running on, but there are platforms with more than one matching driver. For example, there are two drivers that can work with the majority of Intel platforms, intel_idle and acpi_idle, one with hardcoded idle states information and the other able to read that information from the system’s ACPI tables, respectively. Still, even in those cases, the driver chosen at the system initialization time cannot be replaced later, so the decision on which one of them to use has to be made early (on Intel platforms the acpi_idle driver will be used if intel_idle is disabled for some reason or if it does not recognize the processor). The name of the CPUIdle driver currently used by the kernel can be read from the current_driver file under /sys/devices/system/cpu/cpuidle/ in sysfs.
tickless vs ticked system with idle state management
The kernel can be configured to disable stopping the scheduler tick in the idle loop altogether. That can be done through the build-time configuration of it (by unsetting the CONFIG_NO_HZ_IDLE configuration option) or by passing nohz=off to it in the command line. In both cases, as the stopping of the scheduler tick is disabled, the governor’s decisions regarding it are simply ignored by the idle loop code and the tick is never stopped.
If the given system is tickless, it will use the menu governor by default and if it is not tickless, the default CPUIdle governor on it will be ladder.
flex19:~$ sudo cat /sys/devices/system/cpu/cpuidle/current_governor
menu
flex19:~$ sudo cat /sys/devices/system/cpu/cpuidle/current_governor_ro
menu
flex19:~$ sudo ls /sys/devices/system/cpu/cpuidle/
available_governors current_driver current_governor current_governor_ro
flex19:~$ sudo cat /sys/devices/system/cpu/cpuidle/available_governors
ladder menu teo
flex19:~$ sudo cat /sys/devices/system/cpu/cpuidle/current_driver
acpi_idle
animesh.trivedi@flex19:~$ cat /boot/config-`uname -r` | grep CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ_IDLE=y
Kernel/AMD modules:
amd-uncore
amd-pstate
amd_freq_sensitivity
#
sudo modprobe -v amd_pstate
# Does not do anything?
For each CPU in the system, there is a /sys/devices/system/cpu/cpu&lt;N&gt;/cpuidle/ directory in sysfs, where the number &lt;N&gt; is assigned to the given CPU at the initialization time. That directory contains a set of subdirectories called state0, state1 and so on, up to the number of idle state objects defined for the given CPU minus one. Each of these directories corresponds to one idle state object and the larger the number in its name, the deeper the (effective) idle state represented by it.
/sys/devices/system/cpu/cpu0/cpuidle/
idle=???
What does it mean?
The x86 architecture support code recognizes three kernel command line options related to CPU idle time management: idle=poll, idle=halt, and idle=nomwait. The first two of them disable the acpi_idle and intel_idle drivers altogether, which effectively causes the entire CPUIdle subsystem to be disabled and makes the idle loop invoke the architecture support code to deal with idle CPUs. How it does that depends on which of the two parameters is added to the kernel command line. In the idle=halt case, the architecture support code will use the HLT instruction of the CPUs (which, as a rule, suspends the execution of the program and causes the hardware to attempt to enter the shallowest available idle state) for this purpose, and if idle=poll is used, idle CPUs will execute a more or less “lightweight” sequence of instructions in a tight loop.

[Note that using idle=poll is somewhat drastic in many cases, as preventing idle CPUs from saving almost any energy at all may not be the only effect of it. For example, on Intel hardware it effectively prevents CPUs from using P-states (see CPU Performance Scaling) that require any number of CPUs in a package to be idle, so it very well may hurt single-thread computations performance as well as energy-efficiency. Thus using it for performance reasons may not be a good idea at all.]

The idle=nomwait option prevents the use of the MWAIT instruction of the CPU to enter idle states. When this option is used, the acpi_idle driver will use the HLT instruction instead of MWAIT. On systems running Intel processors, this option disables the intel_idle driver and forces the use of the acpi_idle driver instead. Note that in either case, the acpi_idle driver will function only if all the information needed by it is in the system’s ACPI tables.
How can I boot with an older kernel version? What does GRUB_DEFAULT="1>2" mean?
ubuntu:~$ sudo grub-mkconfig | grep -iE "menuentry 'Ubuntu, with Linux" | awk '{print i++ " : "$1, $2, $3, $4, $5, $6, $7}'
0 : menuentry 'Ubuntu, with Linux 5.4.0-80-generic' --class ubuntu
1 : menuentry 'Ubuntu, with Linux 5.4.0-80-generic (recovery mode)'
2 : menuentry 'Ubuntu, with Linux 4.15.0-159-generic' --class ubuntu
3 : menuentry 'Ubuntu, with Linux 4.15.0-159-generic (recovery mode)'
4 : menuentry 'Ubuntu, with Linux 4.15.0-45-generic' --class ubuntu
5 : menuentry 'Ubuntu, with Linux 4.15.0-45-generic (recovery mode)'
Modify the `GRUB_DEFAULT=0` value as per your need. Currently my server booted with 5.4.0-80-generic:
ubuntu:~# uname -srn
Linux ubuntu 5.4.0-80-generic
So I want to boot my system with 4.15.0-45-generic, which is menu entry 4. I modified the value in /etc/default/grub to GRUB_DEFAULT="1>4" and executed the commands below to regenerate the grub config file with the modified GRUB_DEFAULT setting.
sudo update-grub
sudo systemctl reboot
Post reboot, my Ubuntu server booted with the old kernel 4.15.0-45-generic:
ubuntu:~# uname -srn
Linux ubuntu 4.15.0-45-generic
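The "1>N" format is zero-based nested indexing: top-level entry 1 is usually the "Advanced options" submenu, and N is the entry index inside it. As a config sketch:

```shell
# /etc/default/grub (fragment)
GRUB_DEFAULT="1>4"   # submenu 1 ("Advanced options"), entry index 4 within it
```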
sudo mount -t tmpfs -o size=32G,noswap,uid=$USER,mpol=prefer:0,huge=never animesh.trivedi ~/mnt/tmpfs/
https://docs.kernel.org/block/null_blk.html
sudo modprobe null_blk queue_mode=2 home_node=0 gb=32 bs=4096 nr_devices=1 irqmode=1 hw_queue_depth=8 use_per_node_hctx=1 memory_backed=1 cache_size=0 mbps=0 no_sched=1 blocking=0
- queue_mode=2 (multi-queue)
- home_node=0 (NUMA node 0)
- irqmode=1 (uses IPI; only with irqmode=2, i.e. Timer, will it simulate latency injection via completion_nsec, e.g. completion_nsec=1000)
- use_per_node_hctx=1 (use one queue per NUMA node; otherwise set this to 0 and specify the number of queues with submit_queues)
  - parm: use_per_node_hctx: Use per-node allocation for hardware context queues. Default: false (bool)
- hw_queue_depth=8 (the hardware queue depth of the device)
- memory_backed=1 (yes, actual work is done: data is stored in memory)
- no_sched=1 (bypass the I/O scheduler; 0 uses the default MQ scheduler)
- blocking=? Register as a blocking blk-mq driver device; null_blk will set the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always needs to block in its ->queue_rq() function.
- ctrl+shift+- (forward) / ctrl+alt+- (backward)
Reload window: https://stackoverflow.com/questions/60714159/is-there-a-way-to-reconnect-to-a-disconnected-vs-code-remote-ssh-connection
ctrl + shift + P and then "reload window"
VS Code needs the following include paths (expanded here) in order to index and compile the kernel module source:
${workspaceFolder}/**
/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/
/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/generated/
/usr/src/linux-headers-6.9.0-atr-2024-07-05/include/
/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/uapi/
/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/generated/uapi/
/usr/src/linux-headers-6.9.0-atr-2024-07-05/include/linux/
and also the following defines:
__GNUC__
__KERNEL__
MODULE
`make -n` (dry run) shows the exact compile commands and flags used.
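Put together as a .vscode/c_cpp_properties.json sketch (the header version string must match your installed kernel headers; the "name" field is arbitrary):

```json
{
  "configurations": [
    {
      "name": "linux-kernel-module",
      "includePath": [
        "${workspaceFolder}/**",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/generated",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/include",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/uapi",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/arch/x86/include/generated/uapi",
        "/usr/src/linux-headers-6.9.0-atr-2024-07-05/include/linux"
      ],
      "defines": ["__GNUC__", "__KERNEL__", "MODULE"]
    }
  ],
  "version": 4
}
```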
Crash commands are documented here: https://crash-utility.github.io/help_pages/mod.html (or `crash> help mod`)
There are some options for changing the dump format; see the makedumpfile options: `vim /etc/default/kdump-tools`
https://hackmd.io/@0xff07/S1ASmzgun#Optional-Set-dump-file-format-in-etc
The reserved memory size is set in /etc/default/grub.d/kdump-tools.cfg, or directly in the grub file, and then run update-grub2.
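For example, on Ubuntu the makedumpfile flags live in the MAKEDUMP_ARGS variable (a sketch; check the exact variable name in your /etc/default/kdump-tools):

```shell
# /etc/default/kdump-tools (fragment)
MAKEDUMP_ARGS="-c -d 31"   # -c: compress pages; -d 31: exclude zero, cache, user and free pages
```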
OK, the key problem is: what is causing the kernel crash with the nvmev kernel module? (inconclusive)
Step-1: Can we find the precise step where the fault is happening?
crash> bt
PID: 1164 TASK: ffffa01f82b02f40 CPU: 2 COMMAND: "insmod"
#0 [ffffbe4e40e7f6f0] machine_kexec at ffffffff9d095df0
#1 [ffffbe4e40e7f748] __crash_kexec at ffffffff9d2335df
#2 [ffffbe4e40e7f808] crash_kexec at ffffffff9d233b84
#3 [ffffbe4e40e7f810] oops_end at ffffffff9d043d44
#4 [ffffbe4e40e7f830] page_fault_oops.cold at ffffffff9e0d1913
#5 [ffffbe4e40e7f8b8] exc_page_fault at ffffffff9e18d20e
#6 [ffffbe4e40e7f8e0] asm_exc_page_fault at ffffffff9e2012a6
[exception RIP: NVMEV_PCI_INIT+346]
RIP: ffffffffc0c913fa RSP: ffffbe4e40e7f990 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffa01f84064000 RCX: 00000000ffffffff
RDX: ffffffffc0c2d880 RSI: ffffffffc0c2d8c0 RDI: 0000000000000010
RBP: 0000000000032040 R8: 0000000000000000 R9: ffffbe4e40e7f8e8
R10: ffffffff9eaffe10 R11: 0000000000000000 R12: ffffffffffffffff
R13: ffffa01f84064000 R14: ffffa01f90ae5740 R15: ffffa01f85e621c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffbe4e40e7f9c0] init_module at ffffffffc0c97310 [nvmev]
#8 [ffffbe4e40e7f9f0] do_one_initcall at ffffffff9d002a88
#9 [ffffbe4e40e7fa60] do_init_module at ffffffff9d1fb7a0
#10 [ffffbe4e40e7fa80] init_module_from_file at ffffffff9d1fe626
#11 [ffffbe4e40e7fb30] idempotent_init_module at ffffffff9d1fe791
#12 [ffffbe4e40e7fbb8] __x64_sys_finit_module at ffffffff9d1fea1e
#13 [ffffbe4e40e7fbe8] do_syscall_64 at ffffffff9e1864b2
#14 [ffffbe4e40e7ff50] entry_SYSCALL_64_after_hwframe at ffffffff9e20012f
RIP: 00007f987af2725d RSP: 00007ffda87abb18 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000561c82191760 RCX: 00007f987af2725d
RDX: 0000000000000000 RSI: 0000561c821912a0 RDI: 0000000000000003
RBP: 00007ffda87abbd0 R8: 0000000000000040 R9: 0000000000000000
R10: 00007f987b003b20 R11: 0000000000000246 R12: 0000561c821912a0
R13: 0000000000000000 R14: 0000561c82191730 R15: 0000561c821912a0
ORIG_RAX: 0000000000000139 CS: 0033 SS: 002b
crash>
With this bt, we get the RIP at ffffffffc0c913fa, so if we look it up:
crash> sym ffffffffc0c913fa
ffffffffc0c913fa (T) NVMEV_PCI_INIT+346 [nvmev] /usr/src/linux-headers-6.9.0-atr-2024-07-05/./include/linux/topology.h: 96
crash>
Step-2: how did we get to this offending instruction? `-r` is for:
-r (reverse) displays all instructions from the start of the routine up to and including the designated address.
So with that we get:
crash> dis -lr NVMEV_PCI_INIT+346
/home/atr/src/nvmevirt/pci.c: 616
0xffffffffc0c912a0 <NVMEV_PCI_INIT>: nopw (%rax)
0xffffffffc0c912a4 <NVMEV_PCI_INIT+4>: nopl 0x0(%rax,%rax,1)
/home/atr/src/nvmevirt/pci.c: 617
0xffffffffc0c912a9 <NVMEV_PCI_INIT+9>: movabs $0x10000201100c51,%rsi
/home/atr/src/nvmevirt/pci.c: 616
0xffffffffc0c912b3 <NVMEV_PCI_INIT+19>: push %r15
/home/atr/src/nvmevirt/pci.c: 484
0xffffffffc0c912b5 <NVMEV_PCI_INIT+21>: movabs $0xffeffffd00000000,%rdx
/home/atr/src/nvmevirt/pci.c: 616
0xffffffffc0c912bf <NVMEV_PCI_INIT+31>: push %r14
0xffffffffc0c912c1 <NVMEV_PCI_INIT+33>: push %r13
0xffffffffc0c912c3 <NVMEV_PCI_INIT+35>: push %r12
0xffffffffc0c912c5 <NVMEV_PCI_INIT+37>: push %rbp
/usr/src/linux-headers-6.9.0-atr-2024-07-05/./include/linux/topology.h: 96
0xffffffffc0c912c6 <NVMEV_PCI_INIT+38>: mov $0x32040,%rbp
/home/atr/src/nvmevirt/pci.c: 616
0xffffffffc0c912cd <NVMEV_PCI_INIT+45>: push %rbx
/home/atr/src/nvmevirt/pci.c: 617
0xffffffffc0c912ce <NVMEV_PCI_INIT+46>: mov 0x10(%rdi),%rax
/home/atr/src/nvmevirt/pci.c: 616
0xffffffffc0c912d2 <NVMEV_PCI_INIT+50>: mov %rdi,%rbx
/home/atr/src/nvmevirt/pci.c: 617
0xffffffffc0c912d5 <NVMEV_PCI_INIT+53>: mov 0x40(%rdi),%rcx
/home/atr/src/nvmevirt/pci.c: 493
0xffffffffc0c912d9 <NVMEV_PCI_INIT+57>: and (%rax),%rdx
0xffffffffc0c912dc <NVMEV_PCI_INIT+60>: movb $0x0,0xe(%rax)
/home/atr/src/nvmevirt/pci.c: 499
0xffffffffc0c912e0 <NVMEV_PCI_INIT+64>: or %rsi,%rdx
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c912e3 <NVMEV_PCI_INIT+67>: mov 0x10(%rax),%esi
/home/atr/src/nvmevirt/pci.c: 495
0xffffffffc0c912e6 <NVMEV_PCI_INIT+70>: movl $0x1080201,0x8(%rax)
/home/atr/src/nvmevirt/pci.c: 502
0xffffffffc0c912ed <NVMEV_PCI_INIT+77>: mov %rdx,(%rax)
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c912f0 <NVMEV_PCI_INIT+80>: mov %rcx,%rdx
/home/atr/src/nvmevirt/pci.c: 504
0xffffffffc0c912f3 <NVMEV_PCI_INIT+83>: shr $0x20,%rcx
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c912f7 <NVMEV_PCI_INIT+87>: shr $0xe,%rdx
0xffffffffc0c912fb <NVMEV_PCI_INIT+91>: and $0x3ff9,%esi
/home/atr/src/nvmevirt/pci.c: 504
0xffffffffc0c91301 <NVMEV_PCI_INIT+97>: mov %ecx,0x14(%rax)
/home/atr/src/nvmevirt/pci.c: 529
0xffffffffc0c91304 <NVMEV_PCI_INIT+100>: movabs $0x2000807f6011,%rcx
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c9130e <NVMEV_PCI_INIT+110>: shl $0xe,%edx
/home/atr/src/nvmevirt/pci.c: 507
0xffffffffc0c91311 <NVMEV_PCI_INIT+113>: movq $0x370d0c51,0x2c(%rax)
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c91319 <NVMEV_PCI_INIT+121>: or $0x4,%edx
/home/atr/src/nvmevirt/pci.c: 511
0xffffffffc0c9131c <NVMEV_PCI_INIT+124>: movb $0x40,0x34(%rax)
/home/atr/src/nvmevirt/pci.c: 501
0xffffffffc0c91320 <NVMEV_PCI_INIT+128>: or %esi,%edx
0xffffffffc0c91322 <NVMEV_PCI_INIT+130>: mov %edx,0x10(%rax)
/home/atr/src/nvmevirt/pci.c: 514
0xffffffffc0c91325 <NVMEV_PCI_INIT+133>: mov $0xf,%edx
0xffffffffc0c9132a <NVMEV_PCI_INIT+138>: mov %dx,0x3c(%rax)
/home/atr/src/nvmevirt/pci.c: 618
0xffffffffc0c9132e <NVMEV_PCI_INIT+142>: mov 0x18(%rdi),%rdx
/home/atr/src/nvmevirt/pci.c: 524
0xffffffffc0c91332 <NVMEV_PCI_INIT+146>: mov (%rdx),%eax
0xffffffffc0c91334 <NVMEV_PCI_INIT+148>: and $0xfff80000,%eax
0xffffffffc0c91339 <NVMEV_PCI_INIT+153>: or $0x35001,%eax
0xffffffffc0c9133e <NVMEV_PCI_INIT+158>: mov %eax,(%rdx)
0xffffffffc0c91340 <NVMEV_PCI_INIT+160>: movzbl 0x4(%rdx),%eax
0xffffffffc0c91344 <NVMEV_PCI_INIT+164>: and $0xfffffff4,%eax
0xffffffffc0c91347 <NVMEV_PCI_INIT+167>: or $0x8,%eax
0xffffffffc0c9134a <NVMEV_PCI_INIT+170>: mov %al,0x4(%rdx)
/home/atr/src/nvmevirt/pci.c: 619
0xffffffffc0c9134d <NVMEV_PCI_INIT+173>: mov 0x20(%rdi),%rdx
/home/atr/src/nvmevirt/pci.c: 539
0xffffffffc0c91351 <NVMEV_PCI_INIT+177>: mov (%rdx),%rax
0xffffffffc0c91354 <NVMEV_PCI_INIT+180>: and $0x78000000,%eax
0xffffffffc0c91359 <NVMEV_PCI_INIT+185>: or %rcx,%rax
/home/atr/src/nvmevirt/pci.c: 544
0xffffffffc0c9135c <NVMEV_PCI_INIT+188>: movabs $0x100085a100020010,%rcx
/home/atr/src/nvmevirt/pci.c: 529
0xffffffffc0c91366 <NVMEV_PCI_INIT+198>: mov %rax,(%rdx)
/home/atr/src/nvmevirt/pci.c: 544
0xffffffffc0c91369 <NVMEV_PCI_INIT+201>: movabs $0xe0037000c1000000,%rax
/home/atr/src/nvmevirt/pci.c: 539
0xffffffffc0c91373 <NVMEV_PCI_INIT+211>: movl $0x8000,0x8(%rdx)
/home/atr/src/nvmevirt/pci.c: 620
0xffffffffc0c9137a <NVMEV_PCI_INIT+218>: mov 0x28(%rdi),%rdx
/home/atr/src/nvmevirt/pci.c: 559
0xffffffffc0c9137e <NVMEV_PCI_INIT+222>: and (%rdx),%rax
0xffffffffc0c91381 <NVMEV_PCI_INIT+225>: or %rcx,%rax
0xffffffffc0c91384 <NVMEV_PCI_INIT+228>: mov %rax,(%rdx)
/home/atr/src/nvmevirt/pci.c: 621
0xffffffffc0c91387 <NVMEV_PCI_INIT+231>: mov 0x30(%rdi),%rax
/home/atr/src/nvmevirt/pci.c: 570
0xffffffffc0c9138b <NVMEV_PCI_INIT+235>: movl $0x15010001,(%rax)
/home/atr/src/nvmevirt/pci.c: 575
0xffffffffc0c91391 <NVMEV_PCI_INIT+241>: movl $0x18010002,0x50(%rax)
/home/atr/src/nvmevirt/pci.c: 580
0xffffffffc0c91398 <NVMEV_PCI_INIT+248>: movl $0x19010004,0x80(%rax)
/home/atr/src/nvmevirt/pci.c: 585
0xffffffffc0c913a2 <NVMEV_PCI_INIT+258>: movl $0x2701000e,0x90(%rax)
/home/atr/src/nvmevirt/pci.c: 590
0xffffffffc0c913ac <NVMEV_PCI_INIT+268>: movl $0x2a010003,0x170(%rax)
/home/atr/src/nvmevirt/pci.c: 595
0xffffffffc0c913b6 <NVMEV_PCI_INIT+278>: movl $0x10019,0x1a0(%rax)
/home/atr/src/nvmevirt/pci.c: 626
0xffffffffc0c913c0 <NVMEV_PCI_INIT+288>: mov -0x628bf(%rip),%rax # 0xffffffffc0c2eb08 <nvmev_vdev>
0xffffffffc0c913c7 <NVMEV_PCI_INIT+295>: movb $0x0,0x130(%rdi)
/usr/src/linux-headers-6.9.0-atr-2024-07-05/./include/linux/topology.h: 96
0xffffffffc0c913ce <NVMEV_PCI_INIT+302>: movslq 0x60(%rax),%r12
0xffffffffc0c913d2 <NVMEV_PCI_INIT+306>: cmp $0x2000,%r12
0xffffffffc0c913d9 <NVMEV_PCI_INIT+313>: jae 0xffffffffc0c91672 <NVMEV_PCI_INIT+978>
0xffffffffc0c913df <NVMEV_PCI_INIT+319>: mov -0x613c72e0(,%r12,8),%rax
/home/atr/src/nvmevirt/pci.c: 395
0xffffffffc0c913e7 <NVMEV_PCI_INIT+327>: mov $0xffffffffc0c2d880,%rdx
0xffffffffc0c913ee <NVMEV_PCI_INIT+334>: mov $0x10,%edi
0xffffffffc0c913f3 <NVMEV_PCI_INIT+339>: mov $0xffffffffc0c2d8c0,%rsi
/usr/src/linux-headers-6.9.0-atr-2024-07-05/./include/linux/topology.h: 96
0xffffffffc0c913fa <NVMEV_PCI_INIT+346>: mov (%rax,%rbp,1),%eax
crash>
Not entirely clear why topology.h: 96 is the offending line. The crash utility shows the key reason being:
PANIC: "Oops: 0000 [#1] PREEMPT SMP NOPTI" (check log for details)
Step-3: I am trying to print local variable values but it does not work, so leaving it for now. issue-1
OK, here are some more examples and modes of navigating to the address: `gdb list FUNC+OFF` shows the faulting location directly. It is challenging with function inlining.
crash> gdb list *NVMEV_PCI_INIT+346
0xffffffffc0c913fa is in NVMEV_PCI_INIT (./include/linux/topology.h:96).
91 #endif
92
93 #ifndef cpu_to_node
94 static inline int cpu_to_node(int cpu)
95 {
96 return per_cpu(numa_node, cpu);
97 }
98 #endif
99
100 #ifndef set_numa_node
crash>
`help bt` has lots of help:
# Display the stack trace of the active task(s) when the kernel panicked:
crash> bt -a
# Display the stack trace of the active task(s) when the kernel panicked, and filter out the stack of the idle tasks:
crash> bt -a -n idle
# Display the stack trace of the active task on CPU 0 and 1:
crash> bt -c 0,1
# Display the stack traces of task f2814000 and PID 1592:
crash> bt f2814000 1592
# Dump the text symbols found in the current context's stack:
crash> bt -t
# Search the current stack for possible exception frames:
crash> bt -e
# Display stack frame contents using -f, -F, or -FF:
crash> bt -f | -F | -FF
# Check the kernel stack of all tasks for evidence of a stack overflow:
crash> bt -v
See the dmesg log for this crash, which also contains useful details:
crash> log
[...]
[ 20.989762] NVMeVirt: FTL physical space: 4293918720, logical space: 4013008149 (physical/logical * 100 = 107)
[ 20.989763] NVMeVirt: ns 0/1: size 3827 MiB
[ 20.989764] ------------[ cut here ]------------
[ 20.989767] UBSAN: array-index-out-of-bounds in ./include/linux/topology.h:96:9
[ 20.989782] index -1 is out of range for type 'long unsigned int [8192]'
[ 20.989790] CPU: 2 PID: 1164 Comm: insmod Kdump: loaded Tainted: G OE 6.9.0-atr-2024-07-05 #13
[ 20.989794] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 20.989795] Call Trace:
[ 20.989800] <TASK>
[ 20.989801] dump_stack_lvl+0x5d/0x80
[ 20.989819] ubsan_epilogue+0x5/0x30
[ 20.989832] __ubsan_handle_out_of_bounds.cold+0x46/0x4b
[ 20.989834] NVMEV_PCI_INIT+0x3e1/0x3f0 [nvmev]
[ 20.989842] NVMeV_init+0x4c0/0x547 [nvmev]
[ 20.989846] ? __pfx_NVMeV_init+0x10/0x10 [nvmev]
[...]
- Follow the installation setup https://ubuntu.com/server/docs/kernel-crash-dump
- Man page: https://man7.org/linux/man-pages/man8/crash.8.html
- Github: https://github.com/crash-utility/ and https://github.com/crash-utility/crash
post installation configuration command options:
sudo dpkg-reconfigure kdump-tools
# check status
kdump-config show (or status)
Where is the dump file after a crash? Once completed, the system will reboot to its normal operational mode. You will then find the kernel crash dump file, and related subdirectories, in the /var/crash directory, e.g. `ls /var/crash` produces the following:
atr@u24clean:~$ ll /var/crash/
total 48K
drwxrwxrwt 4 root root 4.0K Jul 16 11:51 ./
drwxr-xr-x 13 root root 4.0K Jul 4 10:51 ../
drwxr-xr-x 2 root root 4.0K Jul 16 11:41 202407161141/
drwxr-xr-x 2 root root 4.0K Jul 16 11:51 202407161151/
-rw-r--r-- 1 root root 0 Jul 16 11:51 kdump_lock
-rw-r--r-- 1 root root 283 Jul 17 13:44 kexec_cmd
-rw-r--r-- 1 root root 25K Jul 16 11:41 linux-image-6.9.0-atr-2024-07-05-202407161141.crash
atr@u24clean:~$
How to use the `crash` utility, and other tabs:
- old tutorial: https://www.dedoimedo.com/computers/crash-analyze.html
- kdump: https://www.kernel.org/doc/Documentation/kdump/kdump.txt
- Nick's page: https://github.com/nicktehrany/notes/wiki/Kernel-Hacking
- https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide#sect-crash-running-the-utility
- Setup kdump on Ubuntu 22.04 (Excellent, July 2023): https://hackmd.io/@0xff07/S1ASmzgun
- A CRASH COURSE ON DEBUGGING KERNEL CRASHES USING THE CRASH UTILITY https://walac.github.io/kernel-crashes/
- https://walac.github.io/kernel-tracing/
- Debugging the Linux kernel using the GDB, https://wiki.st.com/stm32mpu/wiki/Debugging_the_Linux_kernel_using_the_GDB
- https://github.com/crash-utility/crash/issues/47
crash> mod -s nvmev /home/atr/src/nvmevirt/nvmev.ko
https://stackoverflow.com/questions/32069887/not-able-to-load-my-module-symbols-in-crash-utility
The command with which I am trying to locate the out-of-tree build of a kernel module:
sudo crash --src /usr/src/linux-6.9.0-atr-2024-07-05/ --src /home/atr/src/nvmevirt/ /lib/debug/boot/vmlinux-6.9.0-atr-2024-07-05 /var/crash/202407161151/dump.202407161151
So this one finds the kernel symbols, but not the out-of-tree build sources.
If I pass an invalid path name then crash complains that the path is invalid, so at least it is registering it:
crash: invalid --src argument: /home/atr/src/nvmevirtxxx/
Reading the man page, crash can be extended with extension modules: https://crash-utility.github.io/extensions.html (`-x dir`)
OK, passing --mod /home/atr/src/nvmevirt/ --mod /lib/modules/$(uname -r)/ still does not help, so for now we need to do it manually.
crash> mod -s nvmev /home/atr/src/nvmevirt/nvmev.ko
MODULE NAME TEXT_BASE SIZE OBJECT FILE
ffffffffc0c2e600 nvmev ffffffffc0c90000 90112 /home/atr/src/nvmevirt/nvmev.ko
crash> mod -S
MODULE NAME TEXT_BASE SIZE OBJECT FILE
ffffffffc05cfec0 floppy ffffffffc05bc000 159744 /lib/modules/6.9.0-atr-2024-07-05/kernel/drivers/block/floppy.ko
[...]
ffffffffc0c2e600 nvmev ffffffffc0c90000 90112 /home/atr/src/nvmevirt/nvmev.ko
ffffffffc0c9b1c0 intel_uncore_frequency_common ffffffffc0c99000 16384 /lib/modules/6.9.0-atr-2024-07-05/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency-common.ko
crash>
Still the problem is how to attach it to the source code. OK, found the fix (unsure why): instead of loading the out-of-tree kernel module specifically with `mod -s nvmev /home/atr/src/nvmevirt/nvmev.ko` (that just loads the symbols), load the source directory with `mod -S /home/atr/src/nvmevirt/`, and then call `mod -S` (without a path) to load the rest of the kernel module symbols from the standard path location. So, here is a successful sequencing:
sudo crash --src /usr/src/linux-6.9.0-atr-2024-07-05/ \
--src /home/atr/src/nvmevirt/ \
--mod /home/atr/src/nvmevirt/ \
--mod /lib/modules/`uname -r`/ \
/lib/debug/boot/vmlinux-6.9.0-atr-2024-07-05 \
/var/crash/202407161151/dump.202407161151
crash 8.0.4
Copyright (C) 2002-2022 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
Copyright (C) 2015, 2021 VMware, Inc.
[...]
For help, type "help".
Type "apropos word" to search for commands related to "word"...
KERNEL: /lib/debug/boot/vmlinux-6.9.0-atr-2024-07-05 [TAINTED]
DUMPFILE: /var/crash/202407161151/dump.202407161151 [PARTIAL DUMP]
CPUS: 8
DATE: Tue Jul 16 11:50:57 UTC 2024
UPTIME: 00:00:20
LOAD AVERAGE: 0.12, 0.03, 0.01
TASKS: 205
NODENAME: u24clean
RELEASE: 6.9.0-atr-2024-07-05
VERSION: #13 SMP PREEMPT_DYNAMIC Tue Jul 9 11:37:22 CEST 2024
MACHINE: x86_64 (2995 Mhz)
MEMORY: 4 GB
PANIC: "Oops: 0000 [#1] PREEMPT SMP NOPTI" (check log for details)
PID: 1164
COMMAND: "insmod"
TASK: ffffa01f82b02f40 [THREAD_INFO: ffffa01f82b02f40]
CPU: 2
STATE: TASK_RUNNING (PANIC)
crash> mod -S /home/atr/src/nvmevirt/
mod: cannot find or load object file for floppy module
[...]
mod: cannot find or load object file for intel_pmc_core module
?? Section *UND* not found for symbol __this_module
MODULE NAME TEXT_BASE SIZE OBJECT FILE
ffffffffc0c2e600 nvmev ffffffffc0c90000 90112 /home/atr/src/nvmevirt/nvmev.o
mod: cannot find or load object file for intel_uncore_frequency_common module
crash> mod -S
MODULE NAME TEXT_BASE SIZE OBJECT FILE
ffffffffc05cfec0 floppy ffffffffc05bc000 159744 /lib/modules/6.9.0-atr-2024-07-05/kernel/drivers/block/floppy.ko
[...]
ffffffffc0c74180 intel_pmc_core ffffffffc0c6d000 126976 /lib/modules/6.9.0-atr-2024-07-05/kernel/drivers/platform/x86/intel/pmc/intel_pmc_core.ko
ffffffffc0c2e600 nvmev ffffffffc0c90000 90112 /home/atr/src/nvmevirt/nvmev.o
ffffffffc0c9b1c0 intel_uncore_frequency_common ffffffffc0c99000 16384 /lib/modules/6.9.0-atr-2024-07-05/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency-common.ko
crash> bt -s
PID: 1164 TASK: ffffa01f82b02f40 CPU: 2 COMMAND: "insmod"
#0 [ffffbe4e40e7f6f0] machine_kexec+464 at ffffffff9d095df0
#1 [ffffbe4e40e7f748] __crash_kexec+127 at ffffffff9d2335df
#2 [ffffbe4e40e7f808] crash_kexec+36 at ffffffff9d233b84
#3 [ffffbe4e40e7f810] oops_end+164 at ffffffff9d043d44
#4 [ffffbe4e40e7f830] page_fault_oops.cold+624 at ffffffff9e0d1913
#5 [ffffbe4e40e7f8b8] exc_page_fault+126 at ffffffff9e18d20e
#6 [ffffbe4e40e7f8e0] asm_exc_page_fault+38 at ffffffff9e2012a6
[exception RIP: NVMEV_PCI_INIT+346]
RIP: ffffffffc0c913fa RSP: ffffbe4e40e7f990 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffa01f84064000 RCX: 00000000ffffffff
RDX: ffffffffc0c2d880 RSI: ffffffffc0c2d8c0 RDI: 0000000000000010
RBP: 0000000000032040 R8: 0000000000000000 R9: ffffbe4e40e7f8e8
R10: ffffffff9eaffe10 R11: 0000000000000000 R12: ffffffffffffffff
R13: ffffa01f84064000 R14: ffffa01f90ae5740 R15: ffffa01f85e621c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffbe4e40e7f9c0] init_module+1216 at ffffffffc0c97310 [nvmev]
#8 [ffffbe4e40e7f9f0] do_one_initcall+88 at ffffffff9d002a88
#9 [ffffbe4e40e7fa60] do_init_module+144 at ffffffff9d1fb7a0
#10 [ffffbe4e40e7fa80] init_module_from_file+134 at ffffffff9d1fe626
#11 [ffffbe4e40e7fb30] idempotent_init_module+289 at ffffffff9d1fe791
#12 [ffffbe4e40e7fbb8] __x64_sys_finit_module+94 at ffffffff9d1fea1e
#13 [ffffbe4e40e7fbe8] do_syscall_64+130 at ffffffff9e1864b2
#14 [ffffbe4e40e7ff50] entry_SYSCALL_64_after_hwframe+118 at ffffffff9e20012f
RIP: 00007f987af2725d RSP: 00007ffda87abb18 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000561c82191760 RCX: 00007f987af2725d
RDX: 0000000000000000 RSI: 0000561c821912a0 RDI: 0000000000000003
RBP: 00007ffda87abbd0 R8: 0000000000000040 R9: 0000000000000000
R10: 00007f987b003b20 R11: 0000000000000246 R12: 0000561c821912a0
R13: 0000000000000000 R14: 0000561c82191730 R15: 0000561c821912a0
ORIG_RAX: 0000000000000139 CS: 0033 SS: 002b
crash> dis -l init_module
/home/atr/src/nvmevirt/main.c: 604
0xffffffffc0c96e50 <NVMeV_init>: endbr64
[...]
0xffffffffc0c96f21 <init_module+209>: je 0xffffffffc0c96f3e <init_module+238>
/home/atr/src/nvmevirt/main.c: 218
crash> dis -s init_module
FILE: /home/atr/src/nvmevirt/main.c
LINE: 604
599 NVMEV_INFO("Version %x.%x for >> %s <<\n",
600 (NVMEV_VERSION & 0xff00) >> 8, (NVMEV_VERSION & 0x00ff), type);
601 }
[...]
644 VDEV_FINALIZE(nvmev_vdev);
645 return -EIO;
646 }
crash>
# will list boots
journalctl --list-boots
# will show the last point of the logs
journalctl -e
# follow the log at the bottom
journalctl -e -f
# select a priority-level between 0/"emerg" and 7/"debug"
journalctl -p ###