BTRFS - hpaluch/hpaluch.github.io GitHub Wiki

BTRFS

BTRFS is an open-source alternative to ZFS that comes from Oracle. ZFS was open-sourced by Sun under the CDDL license, but Oracle later closed it again, so there is friction and doubt about whether it is legally "safe" to use.

Benefits:

  • officially open source, from Oracle
  • included in the standard kernel (no license issues)
  • very easy management and mounting of sub-volumes (unlike ZFS, where you have to export/import and use complex commands when you just want to temporarily mount a ZFS volume elsewhere).

Problems:

  • MySQL performance is very slow (see "MySQL on BTRFS in depth" below)
  • free space accounting is problematic (you need to enable "group quota", but enabling group quota slows down the whole filesystem when you have more than a few snapshots...)
    • in contrast, ZFS always properly reports both sub-volume usage and fragmentation level
  • may corrupt the filesystem when the disk is full
  • reported free space may not be available due to fragmentation, so the filesystem may become full even when free space exists.

MySQL on BTRFS in depth

Main article that shows that there is a problem:

I decided to repeat my trivial MySQL benchmarks with test-ATIS, this time under Ubuntu 22.04 LTS (VM inside Proxmox VE 8.1.3, lvm-thin on Seagate IronWolf 4TB, cache=unsafe, discard=on).

Here are brief results for test-ATIS with package mariadb-server version 10.6.12-0ubuntu0.22.04.1 and kernel 5.15.0-52-generic:

| Filesystem | Options | Time (seconds, less is better) |
|---|---|---|
| ext4 | defaults | 23 |
| btrfs | defaults | 32 |
| btrfs | nocow | 28 |
| btrfs | nobarrier | 30 |
| btrfs | nocow,nobarrier | 30 |

Some things really puzzle me (why is nobarrier slower?).

However, I found this thread:

where the following settings are recommended:

innodb_doublewrite = 0
innodb_flush_method = O_DSYNC

Here is my /etc/mysql/mariadb.conf.d/99-local.cnf

[mysqld]
datadir             = /mnt/btrfs/data1
#datadir            = /mnt/btrfs/data2nocow
innodb_doublewrite  = 0
innodb_flush_method = O_DSYNC

Remember to always verify such settings with the SQL commands:

show variables like 'innodb_doublewrite';
show variables like 'innodb_flush_method';
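
The same check can be scripted from the shell. This is only a minimal sketch: the helper name `show_vars_sql` is hypothetical, and it assumes the `mariadb` command-line client is installed and can connect as the current user.

```shell
#!/bin/bash
# Build one SHOW VARIABLES statement per variable name given as an argument.
# show_vars_sql is a hypothetical helper, not part of MariaDB.
show_vars_sql() {
    local name
    for name in "$@"; do
        printf "SHOW VARIABLES LIKE '%s';\n" "$name"
    done
}

# Print the statements that would be sent to the server.
show_vars_sql innodb_doublewrite innodb_flush_method

# Usage against a running server (assumption: client can log in without options):
#   show_vars_sql innodb_doublewrite innodb_flush_method | mariadb --table
```

Running it after every restart of the server makes it harder to benchmark with a setting that silently did not apply.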

Here are the results:

| Filesystem | Options | Time (seconds, less is better) |
|---|---|---|
| ext4 | defaults | 23 |
| btrfs | defaults | 32 |
| btrfs | defaults + innodb tuning | 34 |

Hmm, worse.

But here is something else I tried. Keep in mind that in case of an unclean shutdown you will lose data!!!

Here is my /etc/mysql/mariadb.conf.d/99-local.cnf

[mysqld]
datadir                 = /mnt/btrfs/data1
innodb_doublewrite = 0
# NEVER USE nosync IN PRODUCTION!
innodb_flush_method = nosync

And now it beats ext4 (with sync, which is unfair of course):

| Filesystem | Options | Time (seconds, less is better) |
|---|---|---|
| ext4 | defaults | 23 |
| ext4 | defaults + innodb nosync | 15 |
| btrfs | defaults | 32 |
| btrfs | defaults + innodb tuning | 34 |
| btrfs | defaults + innodb nosync | 14 |

The difference is so striking that I also tried the same settings under ext4. Of course: NEVER USE IT IN PRODUCTION!

So the summary is actually optimistic: btrfs is not slow by design, but fsync(2) and friends are suboptimal so far.
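
To put the numbers into proportion, the times can be expressed relative to the ext4 defaults run (23 seconds). The sketch below is illustration only: `relative_times` is a hypothetical helper, and the labels are just shorthand for the table rows.

```shell
#!/bin/bash
# Print each result relative to a baseline time, as a signed percentage.
# Arguments: baseline seconds, then "label=seconds" pairs.
relative_times() {
    local base=$1 pair
    shift
    for pair in "$@"; do
        # ${pair%=*} is the label, ${pair##*=} is the time in seconds
        awk -v base="$base" -v label="${pair%=*}" -v t="${pair##*=}" \
            'BEGIN { printf "%s: %+.0f%%\n", label, (t - base) / base * 100 }'
    done
}

relative_times 23 "ext4 nosync=15" "btrfs defaults=32" "btrfs tuning=34" "btrfs nosync=14"
# ext4 nosync: -35%
# btrfs defaults: +39%
# btrfs tuning: +48%
# btrfs nosync: -39%
```

So with syncing in place btrfs needs roughly 40% more time than ext4, while without syncing both filesystems end up at about the same level.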

Cloning openSUSE Tumbleweed on BTRFS to another computer

Yes - it is possible :-)

My environment:

  • transferred OS: openSUSE Tumbleweed 20240627 with default BTRFS layout (lots of subvolumes/snapshots)
  • Source Host OS: Proxmox 8.2.4 (Debian 12)
  • Target Host OS: Arch Linux (I also have openSUSE LEAP 15.5 there, but its btrfs-progs was too old!)

On the Source computer I mounted the Tumbleweed BTRFS as follows. Note subvolid=5, which is required to get the REAL BTRFS root mounted:

$ mount -o subvolid=5 /dev/sdb2 /mnt/source/
$ cd /mnt/source
$ btrfs su get-default .

ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot

$ btrfs su li -pt .

ID	gen	parent	top level	path	
--	---	------	---------	----	
256	580	5	5		@
257	552	256	256		@/var
258	554	256	256		@/usr/local
259	556	256	256		@/srv
260	558	256	256		@/root
261	560	256	256		@/opt
262	562	256	256		@/home
263	580	256	256		@/boot/grub2/x86_64-efi
264	580	256	256		@/boot/grub2/i386-pc
265	568	256	256		@/.snapshots
266	581	265	265		@/.snapshots/1/snapshot

Here is the script /mnt/source/backup_subvol.sh that I used to back up each BTRFS volume using btrfs send. For each volume it does the following:

  • create a read-only snapshot of the volume (I always call it backup-snapshot)
  • back up the volume with btrfs send
  • delete the read-only snapshot backup-snapshot

Contents of the script /mnt/source/backup_subvol.sh:

#!/bin/bash
set -xeuo pipefail

cd "$(dirname "$0")"
snap=backup-snapshot

for subvol in $(btrfs su li . | awk '{print $NF}')
do
    # flatten the subvolume path into a file name
    safe_name=$(echo "$subvol" | tr '@/.' '_-_')
    t=/root/backups/tw-plasma-btrfs/dump/$safe_name.bin
    echo "Target: '$t'"
    # btrfs send needs a read-only snapshot as its source
    btrfs su sn -r "$subvol" "$snap"
    btrfs send -v --compressed-data -f "$t" "$snap"
    ls -lh "$t"
    btrfs su del -c "$snap"
done

exit 0
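
It helps to know which dump file belongs to which subvolume. The sketch below applies the same `tr` call to a few sample paths (the function name `safe_name` is only for illustration). Note that `tr` reads `_-_` in the second set as a range from `_` to `_`, so with GNU tr all three characters `@`, `/` and `.` end up as `_`.

```shell
#!/bin/bash
# Map a subvolume path to a flat file name with the same tr call
# as the backup script. With GNU tr, '_-_' is a one-character range,
# so '@', '/' and '.' are all replaced by '_'.
safe_name() {
    printf '%s\n' "$1" | tr '@/.' '_-_'
}

for subvol in '@' '@/usr/local' '@/.snapshots/1/snapshot'; do
    printf '%s -> %s.bin\n' "$subvol" "$(safe_name "$subvol")"
done
```

For example `@/usr/local` is stored as `__usr_local.bin`. Two different paths could in principle map to the same file name (say `@/a.b` and `@/a/b`), which does not happen with the subvolume list shown above.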

On the Source system you also need to prepare the list of volumes for the restore script, using /mnt/source/list-volumes-single-line.sh:

#!/bin/bash
set -xeuo pipefail

cd `dirname $0`

t=/root/backups/tw-plasma-btrfs/dump/volumes.lst
t2=/root/backups/tw-plasma-btrfs/dump/volumes-single-line.lst
btrfs su li . | awk '{print $NF}' > $t
btrfs su li . | awk '{print $NF}' | tr  '\n' ' '  > $t2
exit 0

After running both scripts you have to transfer all data (in my case in /root/backups/tw-plasma-btrfs/dump/) to the Target computer.

WARNING! On the Target computer I had to use Arch Linux instead of openSUSE LEAP 15.5 to restore BTRFS, because its btrfs receive was too old and did not recognize the dump protocol version (2)...

On the Target you have to create and mount an empty BTRFS filesystem:

$ mkfs.btrfs -L my-target-label /dev/sdaX
$ mount -o subvolid=5 /dev/sdaX /mnt/target

NOTE: I did not enable BTRFS quota on the Target, to save a lot of CPU cycles. BTRFS is known to spend a lot of CPU time when quota is enabled and volumes are modified: the more files, directories and volumes there are, the worse the CPU usage is. However, not enabling quota will likely break the snapper list command, because it will not know how big each volume/snapshot is...

On MicroOS, SUSE went as far as using 2 BTRFS partitions:

  1. one for the system, with quota enabled, which is modified on updates only (so not much CPU time is spent in typical usage)
  2. another BTRFS partition mounted under /var with quota disabled, which holds all folders that are write intensive and/or heavily updated

When ready, we can restore the data with the script /mnt/target/restore_volumes.sh:

#!/bin/bash
set -xeuo pipefail

cd "$(dirname "$0")"
snap=backup-snapshot

for subvol in @ @/var @/usr/local @/srv @/root @/opt @/home @/boot/grub2/x86_64-efi @/boot/grub2/i386-pc @/.snapshots @/.snapshots/1/snapshot
do
    # same file name mapping as in backup_subvol.sh
    safe_name=$(echo "$subvol" | tr '@/.' '_-_')
    s=$(pwd)/00backups/tw-plasma-btrfs/dump/$safe_name.bin
    echo "Source: '$s'"
    [ -f "$s" ] || {
        echo "ERROR: Backup file '$s' for volume '$subvol' not found." >&2
        exit 1
    }
    # receive recreates the read-only snapshot named backup-snapshot
    btrfs -v receive -f "$s" "$(pwd)"
    ls -lh "$s"
    [ -d "$snap" ] || {
        echo "ERROR: Snapshot: $snap is not directory" >&2
        exit 1
    }
    # create a writable subvolume at the original path, then drop the snapshot
    btrfs -v su sn "$snap" "$subvol"
    btrfs -v su del -c "$snap"
done

exit 0

WARNING! Before running the above script you have to:

  • replace the list of volumes on the line for subvol in ... with the content of YOUR_BACKUP/volumes-single-line.lst
  • replace the backup path in s=.... with the path to your backup files
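
Before starting the restore, it can be useful to print which dump file is expected for each subvolume. The following is a sketch only: `list_dumps` is a hypothetical helper, and it assumes that volumes.lst (one subvolume path per line, as written by list-volumes-single-line.sh) sits in the same directory as the .bin files.

```shell
#!/bin/bash
# Print "subvolume<TAB>dump file" for every subvolume listed in volumes.lst.
# Argument: directory that holds volumes.lst and the *.bin dumps.
list_dumps() {
    local dump_dir=$1 subvol safe
    while IFS= read -r subvol; do
        [ -n "$subvol" ] || continue    # skip blank lines
        # same file name mapping as in the backup script
        safe=$(printf '%s\n' "$subvol" | tr '@/.' '_-_')
        printf '%s\t%s\n' "$subvol" "$dump_dir/$safe.bin"
    done < "$dump_dir/volumes.lst"
}

# Usage (the path is the one from the restore script above):
#   list_dumps /mnt/target/00backups/tw-plasma-btrfs/dump
```

Reading volumes.lst line by line also keeps working for subvolume paths that contain spaces, which the single-line variant cannot represent.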

After the restore you need to do the following on the Target:

  • set the default volume to your latest snapshot
  • on the Source you can query the default volume with:
    $ btrfs su get-default .
    
    ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot
    
  • on the Target you can set it with a command like:
    $ btrfs su set-default @/.snapshots/1/snapshot
    $ btrfs su get-default .
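
To avoid retyping the path, it can be cut out of the get-default output. A minimal sketch, assuming the output has the shape shown above (`default_path` is a hypothetical helper name):

```shell
#!/bin/bash
# Read `btrfs subvolume get-default` output on stdin and print only the
# subvolume path, i.e. everything after the "path" keyword.
# Lines that do not have the expected shape produce no output.
default_path() {
    sed -n 's/^ID [0-9]* gen [0-9]* top level [0-9]* path //p'
}

echo 'ID 266 gen 581 top level 265 path @/.snapshots/1/snapshot' | default_path
# @/.snapshots/1/snapshot

# Usage on the Source (needs root and a mounted filesystem):
#   btrfs su get-default /mnt/source | default_path
```

The printed path is relative to the real BTRFS root, so on the Target it has to be used from the directory where the filesystem was mounted with subvolid=5.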
    

Last but not least, you need to:

  • query the new UUID of your target BTRFS filesystem, in my case:
    $ lsblk -f /dev/sda
    
    NAME   FSTYPE FSVER LABEL              UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
    sda
    ...
    └─sda8 btrfs        tw-plasma-zotsamtb a37e216d-fd68-4c72-acba-730b91d940f6   70.6G    14% /mnt/target
    
  • and you have to replace the old UUID with the new one (a37e216d-fd68-4c72-acba-730b91d940f6 in this example) in the PROPER SNAPSHOT directory, in my case in these files:
    • /mnt/target/@/.snapshots/1/snapshot/etc/fstab
    • /mnt/target/@/.snapshots/1/snapshot/boot/grub2/grub.cfg
  • I used the vim editor with the command ':%s/OLD-UUID/NEW-UUID/gc'
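
The same replacement can be done non-interactively with sed. This is a sketch under assumptions: `replace_uuid` is a hypothetical helper, OLD-UUID stands for the UUID of the source filesystem, and unlike the vim command with the `c` flag there is no confirmation for each match.

```shell
#!/bin/bash
# Replace every occurrence of the old UUID with the new one in the given
# files, keeping a ".bak" copy of each original file next to it.
# UUIDs consist only of hex digits and dashes, so they are safe in a sed pattern.
replace_uuid() {
    local old=$1 new=$2
    shift 2
    sed -i.bak "s/$old/$new/g" "$@"
}

# Usage (OLD-UUID is a placeholder for the UUID of the source filesystem):
#   replace_uuid OLD-UUID a37e216d-fd68-4c72-acba-730b91d940f6 \
#       /mnt/target/@/.snapshots/1/snapshot/etc/fstab \
#       /mnt/target/@/.snapshots/1/snapshot/boot/grub2/grub.cfg
```

Comparing each file with its .bak copy afterwards (for example with diff) shows exactly which lines were changed.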

Then reboot into your main Linux system (in my case LEAP 15.5) and update the grub configuration (preferably with os-prober enabled, so that it automatically adds Tumbleweed to the menu).