BTRFS
BTRFS is an open-source alternative to ZFS that originally came from Oracle. (ZFS was open-sourced under the CDDL license by Sun, but Oracle later closed it again, so there is friction and doubt about whether it is legally "safe" to use.)
Benefits:
- officially open source, from Oracle
- included in the standard kernel (no license issues)
- very easy management and mounting of sub-volumes (unlike ZFS, where you have to export/import and use complex commands just to temporarily mount a ZFS volume elsewhere)
Problems:
- MySQL performance is very slow (see "MySQL on BTRFS in depth" below)
- free space accounting is problematic (you need to enable "group quota", but enabling group quota slows down the whole filesystem once you have more than a few snapshots...)
  - in contrast, ZFS always properly reports both sub-volume usage and fragmentation level
- may corrupt the filesystem when the disk is full
- reported free space may not be available due to fragmentation, so the filesystem may become full even when free space exists
MySQL on BTRFS in depth
Main article that shows that there is a problem:
I decided to repeat my trivial MySQL benchmarks with test-ATIS, this time under Ubuntu 22.04 LTS (VM inside Proxmox VE 8.1.3, lvm-thin on Seagate IronWolf 4TB, cache=unsafe, discard=on).
- for details on how to get test-ATIS, please read my original wiki page here: https://github.com/hpaluch/hpaluch.github.io/wiki/Simple-MySQL-benchmarks
Here are brief results for test-ATIS and package mariadb-server version 10.6.12-0ubuntu0.22.04.1.
Kernel: 5.15.0-52-generic
Filesystem | Options | Time (seconds, less is better) |
---|---|---|
ext4 | defaults | 23 |
btrfs | defaults | 32 |
btrfs | nocow | 28 |
btrfs | nobarrier | 30 |
btrfs | nocow,nobarrier | 30 |
There are some things that really puzzle me (why is nobarrier slower?).
However, I found this thread:
It recommends:
innodb_doublewrite = 0
innodb_flush_method = O_DSYNC
Here is my /etc/mysql/mariadb.conf.d/99-local.cnf
[mysqld]
datadir = /mnt/btrfs/data1
#datadir = /mnt/btrfs/data2nocow
innodb_doublewrite = 0
innodb_flush_method = O_DSYNC
Remember to always verify such settings with the SQL commands:
show variables like 'innodb_doublewrite';
show variables like 'innodb_flush_method';
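The check above can also be scripted. Here is a minimal sketch, assuming the mysql client is run with `-N -B` (no column names, tab-separated batch output). The helper name `check_setting` is hypothetical, and the displayed values (for example `OFF` for `innodb_doublewrite = 0`) are an assumption about how MariaDB prints them:

```bash
#!/bin/bash
# check_setting NAME EXPECTED   (hypothetical helper)
# Reads tab-separated "name<TAB>value" lines on stdin, the format produced by
#   mysql -N -B -e "show variables like '...'"
# and fails if NAME is missing or its value differs from EXPECTED.
check_setting() {
    local name=$1 expected=$2 actual
    actual=$(awk -F'\t' -v n="$name" '$1 == n { print $2 }')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $name = $actual"
    else
        echo "MISMATCH: $name = '$actual' (expected '$expected')" >&2
        return 1
    fi
}

# Example with canned input instead of a live server:
printf 'innodb_doublewrite\tOFF\n' | check_setting innodb_doublewrite OFF
```

Against a running server it would be used like `mysql -N -B -e "show variables like 'innodb_doublewrite'" | check_setting innodb_doublewrite OFF`.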
Here are the results:
Filesystem | Options | Time (seconds, less is better) |
---|---|---|
ext4 | defaults | 23 |
btrfs | defaults | 32 |
btrfs | defaults + innodb tuning | 34 |
Hmm, worse.
But here is something else I tried. Keep in mind that in case of an unclean shutdown you will lose data!!!
Here is my /etc/mysql/mariadb.conf.d/99-local.cnf
[mysqld]
datadir = /mnt/btrfs/data1
innodb_doublewrite = 0
# NEVER USE nosync IN PRODUCTION!
innodb_flush_method = nosync
And now it beats ext4 (with sync, which is unfair of course):
Filesystem | Options | Time (seconds, less is better) |
---|---|---|
ext4 | defaults | 23 |
ext4 | defaults + innodb nosync | 15 |
btrfs | defaults | 32 |
btrfs | defaults + innodb tuning | 34 |
btrfs | defaults + innodb nosync | 14 |
The difference is so striking that I also tried the same settings under ext4. Of course: NEVER USE IT FOR PRODUCTION!
So the summary is actually optimistic: btrfs is not slow by design, but fsync(2) and friends are suboptimal so far.
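To put the last table in relative terms, the times can be compared against the ext4 default of 23 seconds. This is only a small arithmetic sketch using awk; the helper name `rel_diff` is hypothetical:

```bash
#!/bin/bash
# rel_diff BASE TIME: print how much TIME differs from BASE in percent
# (positive = slower than the baseline, negative = faster).
rel_diff() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%+.1f%%\n", (t - base) / base * 100 }'
}

base=23   # ext4, defaults
for entry in "ext4-nosync:15" "btrfs-defaults:32" "btrfs-tuning:34" "btrfs-nosync:14"; do
    name=${entry%%:*}   # part before the colon
    secs=${entry##*:}   # part after the colon
    echo "$name: $(rel_diff "$base" "$secs")"
done
```

So btrfs with defaults is about 39% slower than ext4 with defaults in this test, while with nosync it is about 39% faster than that same baseline.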
Cloning openSUSE Tumbleweed on BTRFS to another computer
Yes - it is possible :-)
My environment:
- transferred OS: openSUSE Tumbleweed 20240627 with default BTRFS layout (lots of subvolumes/snapshots)
- Source Host OS: Proxmox 8.2.4 (Debian 12)
- Target Host OS: Arch Linux (I also have openSUSE LEAP 15.5 there, but its btrfs-progs was too old!)
On the Source computer I did this to mount the Tumbleweed BTRFS. Notice subvolid=5, which is required to get the REAL BTRFS root mounted:
$ mount -o subvolid=5 /dev/sdb2 /mnt/source/
$ cd /mnt/source
$ btrfs su get-default .
ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot
$ btrfs su li -pt .
ID gen parent top level path
-- --- ------ --------- ----
256 580 5 5 @
257 552 256 256 @/var
258 554 256 256 @/usr/local
259 556 256 256 @/srv
260 558 256 256 @/root
261 560 256 256 @/opt
262 562 256 256 @/home
263 580 256 256 @/boot/grub2/x86_64-efi
264 580 256 256 @/boot/grub2/i386-pc
265 568 256 256 @/.snapshots
266 581 265 265 @/.snapshots/1/snapshot
Here is the script /mnt/source/backup_subvol.sh that I used to back up each BTRFS volume using btrfs send, which means doing the following for each volume:
- create a read-only snapshot of the volume (I always call it backup-snapshot)
- back up the snapshot with btrfs send
- delete the read-only snapshot backup-snapshot
Contents of script /mnt/source/backup_subvol.sh:
#!/bin/bash
set -xeuo pipefail
cd "$(dirname "$0")"
snap=backup-snapshot
# iterate over all subvolume paths (last column of the listing)
for subvol in $(btrfs su li . | awk '{print $NF}')
do
    # derive a flat file name from the subvolume path
    safe_name=$(echo "$subvol" | tr '@/.' '_-_')
    t=/root/backups/tw-plasma-btrfs/dump/$safe_name.bin
    echo "Target: '$t'"
    # btrfs send requires a read-only snapshot
    btrfs su sn -r "$subvol" "$snap"
    btrfs send -v --compressed-data -f "$t" "$snap"
    ls -lh "$t"
    btrfs su del -c "$snap"
done
exit 0
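To see which dump file names the script produces, the name mapping can be tried on its own, without touching any BTRFS volume. This is just a sketch; the function name `safe_name` is my own wrapper around the same tr call. I deliberately give no expected output: the `-` in the middle of `_-_` may be read by tr as a range rather than a literal dash, so run it to see the real names. What matters is that the restore side uses exactly the same command, so the names match.

```bash
#!/bin/bash
# Same tr call as in backup_subvol.sh, wrapped in a (hypothetical) function:
# it replaces '@', '/' and '.' so the result is usable as a flat file name.
safe_name() {
    echo "$1" | tr '@/.' '_-_'
}

# Print the dump file name for a few subvolume paths from the listing above
for subvol in '@' '@/var' '@/usr/local' '@/.snapshots/1/snapshot'; do
    echo "$subvol -> $(safe_name "$subvol").bin"
done
```

Note that the mapping is not reversible (both '@' and '.' become '_'), which is why the restore script works from the original subvolume list and not from the file names.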
On the Source system you also need to prepare the list of volumes for the restore script with /mnt/source/list-volumes-single-line.sh:
#!/bin/bash
set -xeuo pipefail
cd `dirname $0`
t=/root/backups/tw-plasma-btrfs/dump/volumes.lst
t2=/root/backups/tw-plasma-btrfs/dump/volumes-single-line.lst
btrfs su li . | awk '{print $NF}' > $t
btrfs su li . | awk '{print $NF}' | tr '\n' ' ' > $t2
exit 0
After running both scripts you have to transfer all data (in my case in /root/backups/tw-plasma-btrfs/dump/) to the Target computer.
WARNING! On the Target computer I had to use Arch Linux instead of openSUSE LEAP 15.5 to restore BTRFS, because btrfs receive there was too old and did not recognize the send stream protocol version (2)...
On the Target you have to create and mount an empty BTRFS filesystem:
$ mkfs.btrfs -L my-target-label /dev/sdaX
$ mount -o subvolid=5 /dev/sdaX /mnt/target
NOTE: I did not enable BTRFS quota on the Target, to save a lot of CPU cycles (BTRFS is known to spend a lot of CPU time when quota is enabled and volumes are modified; the more files/directories and volumes there are, the worse the CPU usage is). However, not enabling quota will likely break the snapper list command: it will not know how big each volume/snapshot is... On MicroOS, SUSE went as far as using 2 BTRFS partitions:
- one for the system with quota enabled, which is modified on updates only (so not much CPU time is spent in typical usage)
- another BTRFS partition mounted under /var with quota disabled, holding all folders that are write-intensive and/or heavily updated
When ready, we can restore the data with the script /mnt/target/restore_volumes.sh:
#!/bin/bash
set -xeuo pipefail
cd "$(dirname "$0")"
snap=backup-snapshot
for subvol in @ @/var @/usr/local @/srv @/root @/opt @/home @/boot/grub2/x86_64-efi @/boot/grub2/i386-pc @/.snapshots @/.snapshots/1/snapshot
do
    # same file name mapping as in backup_subvol.sh
    safe_name=$(echo "$subvol" | tr '@/.' '_-_')
    s=$(pwd)/00backups/tw-plasma-btrfs/dump/$safe_name.bin
    echo "Source: '$s'"
    [ -f "$s" ] || {
        echo "ERROR: Backup file '$s' for volume '$subvol' not found." >&2
        exit 1
    }
    # receive recreates the read-only snapshot named $snap
    btrfs -v receive -f "$s" "$(pwd)"
    ls -lh "$s"
    [ -d "$snap" ] || {
        echo "ERROR: Snapshot: $snap is not directory" >&2
        exit 1
    }
    # create a writable subvolume at the original path, then drop the received snapshot
    btrfs -v su sn "$snap" "$subvol"
    btrfs -v su del -c "$snap"
done
exit 0
WARNING! Before running the above script you will have to:
- replace the list of volumes on the line for subvol in ... with the content of YOUR_BACKUP/volumes-single-line.lst
- replace the path to the backup s=.... with the path to your backup files
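Instead of pasting the list by hand, it could also be read from the file at run time. This is a sketch only, not what I actually used: the function name `read_subvols` is hypothetical, and it assumes that no subvolume path contains spaces (the file is a single space-separated line).

```bash
#!/bin/bash
# read_subvols FILE: load the space-separated subvolume paths from FILE
# into the global array "subvols".
read_subvols() {
    # The file has no trailing newline, so read returns non-zero at EOF;
    # the array is filled anyway, hence the "|| true".
    read -r -a subvols < "$1" || true
}

# Demo with a temporary file standing in for volumes-single-line.lst
lst=$(mktemp)
printf '%s ' '@' '@/var' '@/home' > "$lst"
read_subvols "$lst"
for subvol in "${subvols[@]}"; do
    echo "would restore: $subvol"
done
rm -f "$lst"
```

In the restore script the loop header would then become `for subvol in "${subvols[@]}"`. The order of the list matters, because parent volumes such as @ must be restored before their children.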
After the restore you need to do the following on the Target:
- set the default volume to your latest snapshot
  - on the Source you can query the default volume with:
$ btrfs su get-default .
ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot
  - on the Target you can set it with commands like:
$ btrfs su set-default @/.snapshots/1/snapshot
$ btrfs su get-default .
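If you want to carry the default volume over without retyping it, the path can be cut out of the get-default output. A minimal sketch, using the output line from my Source system as canned input; the function name `default_subvol_path` is hypothetical:

```bash
#!/bin/bash
# default_subvol_path: read the output of "btrfs su get-default" on stdin
# and print only the subvolume path (everything after " path ").
default_subvol_path() {
    sed -n 's/^.* path //p'
}

# Canned line from the Source system above:
echo 'ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot' | default_subvol_path
```

This prints `@/.snapshots/1/snapshot`, which is the argument passed to btrfs su set-default on the Target.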
Last but not least, you need to:
- query new UUID of your target BTRFS filesystem, in my case:
$ lsblk -f /dev/sda
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
...
└─sda8 btrfs tw-plasma-zotsamtb a37e216d-fd68-4c72-acba-730b91d940f6 70.6G 14% /mnt/target
- and you have to replace the old UUID with the new one (in this example a37e216d-fd68-4c72-acba-730b91d940f6) in the PROPER SNAPSHOT directory, in my case:
/mnt/target/@/.snapshots/1/snapshot/etc/fstab
/mnt/target/@/.snapshots/1/snapshot/boot/grub2/grub.cfg
- I used the vim editor with the command ':%s/OLD-UUID/NEW-UUID/gc'
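The same replacement can be sketched non-interactively with sed. This is an alternative to what I did in vim, not the method I used: it assumes GNU sed (`-i` with a backup suffix), the helper name `replace_uuid` is hypothetical, and the old UUID in the demo is made up. Unlike the `c` flag in vim, it does not ask for confirmation, so it keeps a .bak copy of each file.

```bash
#!/bin/bash
# replace_uuid OLD NEW FILE...: replace every occurrence of OLD with NEW
# in place, keeping a .bak copy of each file (GNU sed assumed).
replace_uuid() {
    local old=$1 new=$2
    shift 2
    sed -i.bak "s/$old/$new/g" "$@"
}

# Demo on a temporary file; the old UUID here is made up for illustration.
old=11111111-2222-3333-4444-555555555555
new=a37e216d-fd68-4c72-acba-730b91d940f6
f=$(mktemp)
echo "UUID=$old / btrfs defaults 0 0" > "$f"
replace_uuid "$old" "$new" "$f"
cat "$f"
rm -f "$f" "$f.bak"
```

On the real system the files to pass would be the fstab and grub.cfg paths inside the snapshot directory listed above.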
Then reboot into your main Linux system (in my case LEAP 15.5) and update the grub configuration (preferably with os-prober enabled, so it will automatically add Tumbleweed to the menu).