RocksDB Timeline and F2FS File Allocation on ZNS - nicktehrany/notes GitHub Wiki
This post analyzes the file allocation scheme of RocksDB and how F2FS maps these files onto ZNS. To do so, we use a custom RocksDB POSIX file system wrapper to observe file allocations, and we iteratively exit the program early (to stop it and see where each file ended up on the ZNS device). The RocksDB file system wrapper comes from the Storage Systems course at the VU Amsterdam (see the course homepage for more info).
Full Timeline
// TODO make a summary timeline for the iterations (short and link below for the detailed steps)
Iterations
We use an iterative approach with increasing workloads in order to see exactly what happens at each step. First we write very few key-value pairs, in order to generate a single file, and then locate this file. With each iteration we increase the workload slightly, such that 2 files are created, then 3, and so on (or possibly flushes and compactions are triggered). At every step we want to know exactly what happens when, and how F2FS handles these files. From this we create a timeline and a mapping of the F2FS allocation scheme for these files. To find the mappings of files we use our filemap tool.
The setup of RocksDB is the same for all workloads (except for the number of key-value pairs being written and the point at which we exit the code). The full setup, including later changes, is given at the end in the Setup Section.
// In our RocksDB context
ctx_test->options.wal_dir = "/mnt/f2fs/wal/";
ctx_test->options.use_direct_io_for_flush_and_compaction = true;
ctx_test->options.max_bytes_for_level_base = 4*1024; // 4KiB (F2FS block - minimum allocation unit)
In DummyFSForward we simply put an exit call after NewWritableFile() returns, with a temp so the compiler does not complain.
std::cout << get_seq_id() << " func: " << __FUNCTION__ << " line: " << __LINE__ << " " << std::endl;
IOStatus temp = this->_private_fs->NewWritableFile(fname, file_opts, result, dbg);
exit(1);
return temp;
F2FS Setup
Formatting and mounting F2FS already writes some data, so it is important to see which LBAs have been written in order to understand the offsets in the following iterations. We run the setup_f2fs script to format and mount F2FS with our devices and then check which zones have been written (only the first 4, since file allocations during the RocksDB runs start at zone 4).
user@stosys:~/src/f2fs-bench/file-map/build$ ../../setup_f2fs nvme0n2 nvme0n1
Success formatting namespace:1
F2FS-tools: mkfs.f2fs Ver: 1.15.0 (2022-05-25)
Info: Disable heap-based policy
Info: Debug level = 0
Info: Trim is enabled
Info: Host-managed zoned block device:
100 zones, 2147483648 zone size(bytes), 0 randomly writeable zones
524288 blocks per zone
Info: Segments per section = 1024
Info: Sections per zone = 1
Info: sector size = 512
Info: total sectors = 427819008 (208896 MB)
Info: zone aligned segment0 blkaddr: 524288
Info: format version with
"Linux version 5.19.0-051900-generic (kernel@sita) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.3.0-5ubuntu1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38.90.20220713) #202207312230 SMP PREEMPT_DYNAMIC Sun Jul 31 22:34:11 UTC 2022"
Info: [/dev/nvme0n1] Discarding device
Info: This device doesn't support BLKSECDISCARD
Info: Discarded 4096 MB
Info: [/dev/nvme0n2] Discarding device
Info: Discarded 204800 MB
Info: Overprovision ratio = 10.000%
Info: Overprovision segments = 18972 (GC reserved = 15092)
Info: format successful
user@stosys:~/src/f2fs-bench/file-map/build$ sudo nvme zns report-zones /dev/nvme0n2 -d 4
nr_zones: 100
SLBA: 0 WP: 0x8 Cap: 0x21a800 State: 0x20 Type: 0x2 Attrs: 0 AttrsInfo: 0
SLBA: 0x400000 WP: 0x400000 Cap: 0x21a800 State: 0x10 Type: 0x2 Attrs: 0 AttrsInfo: 0
SLBA: 0x800000 WP: 0x800000 Cap: 0x21a800 State: 0x10 Type: 0x2 Attrs: 0 AttrsInfo: 0
SLBA: 0xc00000 WP: 0xc00008 Cap: 0x21a800 State: 0x20 Type: 0x2 Attrs: 0 AttrsInfo: 0
We can see zone 4 has been written up to 0xc00008, which is important for the first iteration (interestingly, zone 1 has been written by the same amount).
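The zone numbers used throughout this post follow directly from the start LBAs in the report. The following is a minimal sketch, assuming 512 B LBAs and the 2147483648 B zone size printed by mkfs.f2fs, and counting zones from 1 as the text does. The helper names `zone_number` and `zone_offset` are ours and not part of nvme-cli or filemap.

```cpp
#include <cstdint>

// Values taken from the mkfs.f2fs output above.
constexpr uint64_t kSectorSize = 512;                             // bytes per LBA
constexpr uint64_t kZoneSizeBytes = 2147483648ULL;                // zone size in bytes
constexpr uint64_t kZoneSizeLbas = kZoneSizeBytes / kSectorSize;  // 0x400000 LBAs

// Hypothetical helper: 1-based number of the zone containing the given LBA
// (zone 1 starts at LBA 0, matching the numbering used in the text).
constexpr uint64_t zone_number(uint64_t lba) {
    return lba / kZoneSizeLbas + 1;
}

// Hypothetical helper: offset of an LBA from the start of its zone.
constexpr uint64_t zone_offset(uint64_t lba) {
    return lba % kZoneSizeLbas;
}

static_assert(kZoneSizeLbas == 0x400000, "a zone spans 0x400000 LBAs");
```

For the write pointer 0xc00008 this gives zone 4 with an offset of 0x8 LBAs, i.e. one 4 KiB block has been written at the start of the zone.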
IMPORTANT in between runs we always need to rerun the setup_f2fs script to reformat the file system and start over.
Iteration 1
RocksDB generates the following files (and a directory for the WAL, which is empty for now).
user@stosys:~/src/f2fs-bench$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 0] 000000.dbtmp
├── [ 0] LOCK
├── [8.8K] LOG
└── [3.4K] wal
1 directory, 3 files
Locating the LOG file, we see that it is at the beginning of zone 4.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x0000e8 SIZE: 0x0000e8
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00020 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00008 PBAE: 0xc00020 SIZE: 0x18
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x18 AES: 0x18 EAES: 24.000000 NOZ: 1
We can see the LOG file is fully written in zone 4 (it appears F2FS uses zones 1-3 internally, and zone 4 up to offset 0x8). We can also see that part of the file is reported on the conventional namespace (possibly inline data).
Iteration 2
Now we modify the source code of the file system wrapper to allow 2 file creations.
// Global counter of NewWritableFile() calls
uint64_t ctr = 1;
IOStatus DummyFSForward::NewWritableFile(const std::string &fname,
const FileOptions &file_opts,
std::unique_ptr<FSWritableFile> *result,
IODebugContext *dbg) {
std::cout << get_seq_id() << " func: " << __FUNCTION__ << " line: " << __LINE__ << " " << std::endl;
IOStatus temp = this->_private_fs->NewWritableFile(fname, file_opts, result, dbg);
if (ctr == 2)
exit(1);
ctr++;
return temp;
}
Then we run it with sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 20000 -k 1000 -v 16 -S (it will exit early, so the number of keys does not really matter).
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [8.8K] LOG
├── [ 0] MANIFEST-000001
└── [3.4K] wal
1 directory, 4 files
It now created an IDENTITY file and wrote 36 B to it. However, the file is located in zone 2, which we previously assumed to hold only F2FS metadata; since we only write 36 B, the file could be stored as inline data.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/IDENTITY
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 2 ****
LBAS: 0x400000 LBAE: 0x61a800 CAP: 0x21a800 WP: 0x400028 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0x400018 PBAE: 0x400018 SIZE: 0
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0 AES: 0 EAES: 0.000000 NOZ: 1
As we can see, we retrieve a valid LBA for the extent from ioctl(); however, its size is 0, because data has to be written to the ZNS device in units of 512 B (although the file should have been fsync()ed).
Iteration 3
Again we increase the allowed number of files by 1 to 3.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 0] 000001.dbtmp
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [8.8K] LOG
├── [ 13] MANIFEST-000001
└── [3.4K] wal
1 directory, 5 files
Now the MANIFEST-000001 file has been written (13 B), and there is an additional empty 000001.dbtmp file. Locating the MANIFEST, we see it is placed after the LOG file in zone 4.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000001
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00018 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00010 PBAE: 0xc00018 SIZE: 0x8
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x8 AES: 0x8 EAES: 8.000000 NOZ: 1
Iteration 4
Again we increase the number of created files by 1, to 4. This creates another MANIFEST file (MANIFEST-000004), which is empty this time.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 19K] LOG
├── [ 13] MANIFEST-000001
├── [ 0] MANIFEST-000004
└── [3.4K] wal
1 directory, 6 files
We can also see that the LOG file has grown. Mapping it again, we can see that its original extent in zone 4, from 0xc00008 to 0xc00020, has been partially invalidated, and the file is now mapped from 0xc00018 to 0xc00040.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x0000d8 SIZE: 0x0000d8
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00040 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00018 PBAE: 0xc00040 SIZE: 0x28
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x28 AES: 0x28 EAES: 40.000000 NOZ: 1
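To make "partially invalidated" concrete, the sketch below compares the two LOG mappings quoted above. It assumes that the Iteration 1 mapping (0xc00008 to 0xc00020) was also the initial mapping in this run, and that PBAE is exclusive (which matches SIZE = PBAE - PBAS in the filemap output). The `Extent` struct and the helper names are illustrative only and not part of filemap.

```cpp
#include <algorithm>
#include <cstdint>

// Half-open LBA range [start, end), as printed by filemap (PBAS, PBAE).
struct Extent {
    uint64_t start;
    uint64_t end;
};

// Number of LBAs covered by both ranges (0 if they do not overlap).
constexpr uint64_t overlap_lbas(Extent a, Extent b) {
    const uint64_t lo = std::max(a.start, b.start);
    const uint64_t hi = std::min(a.end, b.end);
    return hi > lo ? hi - lo : 0;
}

// Number of LBAs of the old range that the new mapping no longer covers.
constexpr uint64_t invalidated_lbas(Extent old_ext, Extent new_ext) {
    return (old_ext.end - old_ext.start) - overlap_lbas(old_ext, new_ext);
}

constexpr Extent kLogIter1{0xc00008, 0xc00020};  // LOG mapping in Iteration 1
constexpr Extent kLogIter4{0xc00018, 0xc00040};  // LOG mapping in Iteration 4
```

With these inputs, 0x10 LBAs (two 4 KiB blocks) of the old range are no longer referenced by the file, while 0x8 LBAs (one block) are still part of the new mapping.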
Iteration 5
Increase files to 5. Also note that we will check when file deletions happen, of which we have had none so far. Now it creates an empty .dbtmp file, and MANIFEST-000004 grows in size.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 0] 000004.dbtmp
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 19K] LOG
├── [ 13] MANIFEST-000001
├── [ 57] MANIFEST-000004
└── [3.4K] wal
1 directory, 7 files
Locating both the MANIFEST files
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000001
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00028 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00020 PBAE: 0xc00028 SIZE: 0x8
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x8 AES: 0x8 EAES: 8.000000 NOZ: 1
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000004
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x001ff8 SIZE: 0x001ff8
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00028 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00018 PBAE: 0xc00020 SIZE: 0x8
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x8 AES: 0x8 EAES: 8.000000 NOZ: 1
Iteration 6
Increase files to 6. Now an empty WAL file is created and the .dbtmp file is deleted (but it was empty anyway).
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 19K] LOG
├── [ 13] MANIFEST-000001
├── [ 57] MANIFEST-000004
└── [3.4K] wal
└── [ 0] 000005.log
1 directory, 7 files
Iteration 7
Increase files to 7. Another empty .dbtmp file is created.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 19K] LOG
├── [ 57] MANIFEST-000004
├── [ 0] OPTIONS-000006.dbtmp
└── [3.4K] wal
└── [ 0] 000005.log
1 directory, 7 files
Iteration 8
Increase files to 8. Now all the keys have been written to the WAL, but there has been no flush yet, hence no new files are created. The LOG file and the OPTIONS file (previously an empty .dbtmp file) have both grown. Looking at the layout (see below), the LOG file now starts at 0xc00048, which is 0x8 behind its prior starting point. Seemingly some other, invalid data resides in that region. It also appears that the LOG file is constantly being invalidated and appended to (as we have not seen any delete operation on it).
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 22K] LOG
├── [ 57] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
└── [ 20M] 000005.log
1 directory, 7 files
LOG File
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x0000d0 SIZE: 0x0000d0
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc09fc8 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00048 PBAE: 0xc00078 SIZE: 0x30
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x30 AES: 0x30 EAES: 48.000000 NOZ: 1
OPTIONS-000007 File
Locating the file we get.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/OPTIONS-000007
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc09fc8 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00030 PBAE: 0xc00040 SIZE: 0x10
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x10 AES: 0x10 EAES: 16.000000 NOZ: 1
WAL/000005.log File
Locating this file we get.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000005.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc09fc8 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00078 PBAE: 0xc09fc0 SIZE: 0x9f48
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x9f48 AES: 0x9f48 EAES: 40776.000000 NOZ: 1
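The extent sizes reported by filemap can be cross-checked against the file sizes printed by tree. This is a small sketch, assuming SIZE is given in 512 B LBAs (the sector size reported by mkfs.f2fs) and that F2FS allocates in 4 KiB blocks; the helper names are ours.

```cpp
#include <cstdint>

constexpr uint64_t kSectorSize = 512;      // bytes per LBA (mkfs.f2fs: "sector size = 512")
constexpr uint64_t kF2fsBlockSize = 4096;  // F2FS block size in bytes

// Hypothetical helper: size of an extent in bytes, given its SIZE in LBAs.
constexpr uint64_t extent_bytes(uint64_t size_lbas) {
    return size_lbas * kSectorSize;
}

// Hypothetical helper: number of 4 KiB F2FS blocks covered by the extent.
constexpr uint64_t extent_blocks(uint64_t size_lbas) {
    return extent_bytes(size_lbas) / kF2fsBlockSize;
}
```

For this iteration, the LOG extent (0x30 LBAs) covers 6 blocks, or 24 KiB, for a 22K file; the OPTIONS-000007 extent (0x10 LBAs) covers 2 blocks, or 8 KiB, for a 6.2K file; and the WAL extent (0x9f48 LBAs) is 20,877,312 B, roughly 19.9 MiB, which is consistent with the 20M shown by tree.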
Iteration 9
We increase the number of writes in the command to try to trigger a flush: sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 2000000 -k 1000 -v 16 -S
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 22K] LOG
├── [ 57] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
├── [ 63M] 000005.log
└── [ 0] 000008.log
1 directory, 8 files
We can see a new 000008.log file and an increase in 000005.log file size.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000005.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x003aa8 SIZE: 0x003aa8
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc1f908 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00078 PBAE: 0xc1f908 SIZE: 0x1f890
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x1f890 AES: 0x1f890 EAES: 129168.000000 NOZ: 1
But we have not triggered any flush yet, so we tried to hardcode a memtable size in the configuration [this doesn't work like that, or I can't find the flush flag; I don't have a way to force a flush without writing more data].
Iteration 10
Therefore we simply write more data: sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 5000000 -k 1000 -v 16 -S
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 0] 000009.sst
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 22K] LOG
├── [ 57] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
├── [ 63M] 000005.log
└── [324K] 000008.log
1 directory, 9 files
Locating the new wal log file.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000008.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x0230b0 SIZE: 0x0230b0
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc1fbc0 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc12078 PBAE: 0xc12300 SIZE: 0x288
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x288 AES: 0x288 EAES: 648.000000 NOZ: 1
Note: the benchmark crashed, but it created a new SST file. In the next run we'll check again with another file. (Possibly this was a concurrency issue with our exit statement, i.e. messed-up thread coordination.)
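If the crash really came from thread coordination, one thing worth ruling out is the plain `uint64_t ctr`: once flushes start, NewWritableFile() may be called from RocksDB background threads as well as from the writer thread, and unsynchronized increments are a data race. The following is a minimal sketch of an atomic variant, not a confirmed fix for the crash above; `g_created_files` and `reached_limit` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Shared counter of file creations; std::atomic makes concurrent increments safe.
static std::atomic<uint64_t> g_created_files{0};

// Hypothetical helper: counts one file creation and returns true exactly once,
// namely for the call that brings the total to `limit`.
bool reached_limit(uint64_t limit) {
    assert(limit > 0);
    // fetch_add returns the value before the increment.
    return g_created_files.fetch_add(1, std::memory_order_relaxed) + 1 == limit;
}
```

Inside the wrapper this would replace the `ctr` check, e.g. `if (reached_limit(10)) exit(1);`. Note that exit() also runs static destructors while other threads may still be running, which is a separate possible source of crashes that this sketch does not address.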
Iteration 11
Increase files to 10. Now one old log file has been deleted, an additional one has been created, and we have the first SST. We end up with two more files because the SST triggers a NewRandomAccessFile call and we only had our exit in NewWritableFile.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 63M] 000009.sst
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 29K] LOG
├── [ 14K] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
├── [ 63M] 000008.log
└── [ 0] 000010.log
1 directory, 9 files
The SST is mapped to zone 5, even though there is enough space in zone 4 to fit it [Correction: there might not have been enough space, we'll check this again in later iterations]. Does NewRandomAccessFile carry some extra flags that indicate a different placement? [TODO: figure this out]
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000009.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 5 ****
LBAS: 0x1000000 LBAE: 0x121a800 CAP: 0x21a800 WP: 0x101f548 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0x1000000 PBAE: 0x101f548 SIZE: 0x1f548
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x1f548 AES: 0x1f548 EAES: 128328.000000 NOZ: 1
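Whether the SST would have fit into zone 4 can be estimated from the numbers filemap prints. The sketch below uses the zone 4 write pointer reported in the LOG mapping of this same run (0xc1f998, shown further down). That value is a snapshot taken after the run ended, so it only gives a lower bound on the free space at the time the SST was placed. The `Zone` struct and helper names are ours.

```cpp
#include <cstdint>

// Zone descriptor using the fields printed by filemap (all in 512 B LBAs).
struct Zone {
    uint64_t start;     // LBAS
    uint64_t capacity;  // CAP
    uint64_t wp;        // WP (write pointer)
};

// Hypothetical helper: LBAs that can still be appended to the zone.
constexpr uint64_t remaining_lbas(Zone z) {
    return z.start + z.capacity - z.wp;
}

// Hypothetical helper: true if an extent of the given size still fits.
constexpr bool fits(Zone z, uint64_t size_lbas) {
    return size_lbas <= remaining_lbas(z);
}

constexpr Zone kZone4{0xc00000, 0x21a800, 0xc1f998};  // zone 4 at the end of this run
constexpr uint64_t kSstLbas = 0x1f548;                // extent size of 000009.sst
```

By this estimate zone 4 still had 0x1fae68 LBAs free, well above the 0x1f548 LBAs of the SST, so capacity alone does not seem to explain the placement in zone 5. What actually drives it remains open.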
We can also see that the LOG file is now beginning to get fragmented.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x0000c0 SIZE: 0x0000c0
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc1f998 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00040 PBAE: 0xc00068 SIZE: 0x28
EXTID: 2 PBAS: 0xc000d0 PBAE: 0xc000e8 SIZE: 0x18
====================================================================
STATS SUMMARY
====================================================================
NOE: 2 TES: 0x40 AES: 0x20 EAES: 32.000000 NOZ: 1
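With two extents the STATS SUMMARY line becomes more informative. The sketch below recomputes it from the extent list, under our reading of the abbreviations, which the tool output itself does not spell out: NOE as number of extents, TES as total extent size, AES as average extent size, and NOZ as number of zones. The struct and function names are illustrative and not taken from filemap.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Half-open LBA range [start, end), i.e. PBAS and PBAE.
struct Extent {
    uint64_t start;
    uint64_t end;
};

struct Summary {
    uint64_t noe;  // number of extents (assumed meaning)
    uint64_t tes;  // total extent size in LBAs (assumed meaning)
    uint64_t aes;  // average extent size in LBAs (assumed meaning)
    uint64_t noz;  // number of distinct zones (assumed meaning)
};

// Recomputes the summary; zone_size_lbas defaults to the 0x400000 LBA zone size.
Summary summarize(const std::vector<Extent>& extents,
                  uint64_t zone_size_lbas = 0x400000) {
    Summary s{extents.size(), 0, 0, 0};
    std::set<uint64_t> zones;
    for (const Extent& e : extents) {
        assert(e.end >= e.start);
        s.tes += e.end - e.start;
        zones.insert(e.start / zone_size_lbas);
    }
    s.aes = extents.empty() ? 0 : s.tes / extents.size();
    s.noz = zones.size();
    return s;
}
```

For the two LOG extents above (0x28 and 0x18 LBAs) this yields a total of 0x40 and an average of 0x20 in a single zone, matching the reported line.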
Iteration 11
We'll decrease the memtable size so that we write less and see what happens with the files. We also increase the number of files to 12 before exiting to flush 2 times, and create 2 different SST files.
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 59K] 000009.sst
├── [ 59K] 000011.sst
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 32K] LOG
├── [ 16K] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
├── [ 57K] 000010.log
└── [ 0] 000012.log
1 directory, 10 files
Locating the SST files, we can see they both end up in the same zone, zone 4 (as opposed to the previous run with larger files, where the SST went to zone 5).
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000009.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc001a0 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00040 PBAE: 0xc000b8 SIZE: 0x78
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x78 AES: 0x78 EAES: 120.000000 NOZ: 1
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000011.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00260 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc000f0 PBAE: 0xc00168 SIZE: 0x78
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x78 AES: 0x78 EAES: 120.000000 NOZ: 1
Locating the WAL we can also see it ends up in zone 4, hence no smart allocation so far, everything is just being dumped in zone 4.
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000010.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x000018 SIZE: 0x000018
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00260 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc001e8 PBAE: 0xc00260 SIZE: 0x78
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x78 AES: 0x78 EAES: 120.000000 NOZ: 1
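To see how the three files sit next to each other in zone 4, the extents can be sorted by start LBA and the gaps between them listed. This is a sketch with illustrative names, assuming one extent per file and no overlap between extents (as is the case in the outputs above).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct FileExtent {
    std::string name;
    uint64_t start;  // PBAS
    uint64_t end;    // PBAE (exclusive)
};

// Sorts the extents by start LBA (in place) and returns the gaps between
// neighbours: gaps[i] is the number of LBAs between extent i and extent i + 1.
// Assumes the extents do not overlap.
std::vector<uint64_t> gaps_between(std::vector<FileExtent>& files) {
    std::sort(files.begin(), files.end(),
              [](const FileExtent& a, const FileExtent& b) { return a.start < b.start; });
    std::vector<uint64_t> gaps;
    for (std::size_t i = 1; i < files.size(); ++i) {
        gaps.push_back(files[i].start - files[i - 1].end);
    }
    return gaps;
}
```

Fed with 000009.sst (0xc00040 to 0xc000b8), 000011.sst (0xc000f0 to 0xc00168) and wal/000010.log (0xc001e8 to 0xc00260), the gaps are 0x38 and 0x80 LBAs. These regions presumably hold the other files (LOG, MANIFEST, OPTIONS) or already invalidated blocks, which we have not mapped here.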
Iteration 12
Now we want to increase the files to have a single compaction happen. We'll set the files to 13 (so that enough files for the compaction can be created)
user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K] /mnt/f2fs
├── [ 59K] 000009.sst
├── [ 59K] 000011.sst
├── [7.0K] 000012.sst
├── [ 0] 000014.sst
├── [ 0] 000015.sst
├── [ 16] CURRENT
├── [ 36] IDENTITY
├── [ 0] LOCK
├── [ 30K] LOG
├── [4.1K] MANIFEST-000004
├── [6.2K] OPTIONS-000007
└── [3.4K] wal
├── [ 57K] 000010.log
├── [ 57K] 000013.log
└── [ 0] 000016.log
1 directory, 14 files
Checking the LOG, we see that the first compaction happens at:
2022/09/15-13:16:50.601432 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #8. Immutable memtables: 0.
2022/09/15-13:16:50.602078 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.601865) [/db_impl/db_impl_compaction_flush.cc:2623] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2022/09/15-13:16:50.602110 7f7bb30fe640 [/flush_job.cc:816] [default] [JOB 2] Flushing memtable with next log file: 8
2022/09/15-13:16:50.602331 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810602272, "job": 2, "event": "flush_started", "num_memtables": 1, "num_entries": 56, "num_deletes": 0, "total_data_size": 57675, "memory_usage": 65784, "flush_reason": "Write Buffer Full"}
2022/09/15-13:16:50.602350 7f7bb30fe640 [/flush_job.cc:845] [default] [JOB 2] Level-0 flush table #9: started
2022/09/15-13:16:50.603241 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #10. Immutable memtables: 1.
2022/09/15-13:16:50.603261 7f7bb87f5bc0 [WARN] [/column_family.cc:903] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2022/09/15-13:16:50.610233 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810609925, "cf_name": "default", "job": 2, "event": "table_file_creation", "file_number": 9, "file_size": 59999, "file_checksum": "", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 57873, "index_size": 1190, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 0, "raw_key_size": 56611, "raw_average_key_size": 1010, "raw_value_size": 896, "raw_average_value_size": 16, "num_data_blocks": 14, "num_entries": 56, "num_filter_entries": 0, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "Snappy", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ", "creation_time": 1663247810, "oldest_key_time": 1663247810, "file_creation_time": 1663247810, "slow_compression_estimated_data_size": 0, "fast_compression_estimated_data_size": 0, "db_id": "c57de8cb-cdae-43c2-998e-8af612e28754", "db_session_id": "HV81XR7RGRG5F3FZ0F7B", "orig_file_number": 9}}
2022/09/15-13:16:50.610349 7f7bb30fe640 [/flush_job.cc:930] [default] [JOB 2] Level-0 flush table #9: 59999 bytes OK
2022/09/15-13:16:50.611045 7f7bb30fe640 [/flush_job.cc:979] [default] [JOB 2] Flush lasted 9011 microseconds, and 8278 cpu microseconds.
2022/09/15-13:16:50.612321 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611061) [/memtable_list.cc:469] [default] Level-0 commit table #9 started
2022/09/15-13:16:50.612328 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611706) [/memtable_list.cc:672] [default] Level-0 commit table #9: memtable #1 done
2022/09/15-13:16:50.612332 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611836) EVENT_LOG_v1 {"time_micros": 1663247810611805, "job": 2, "event": "flush_finished", "output_compression": "Snappy", "lsm_state": [1, 0, 0, 0, 0, 0, 0], "immutable_memtables": 1}
2022/09/15-13:16:50.612336 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611954) [/db_impl/db_impl_compaction_flush.cc:241] [default] Level summary: files[1 0 0 0 0 0 0] max score 14.65
2022/09/15-13:16:50.612346 7f7bb30fe640 [/db_impl/db_impl_files.cc:432] [JOB 2] Try to delete WAL files size 57955, prev total WAL file size 115909, number of live WAL files 3.
2022/09/15-13:16:50.613054 7f7bb38ff640 [/compaction/compaction_job.cc:2214] [default] [JOB 3] Compacting 1@0 files to L1, score 14.65
2022/09/15-13:16:50.613085 7f7bb38ff640 [/compaction/compaction_job.cc:2220] [default] Compaction start summary: Base version 3 Base level 0, inputs: [9(58KB)]
2022/09/15-13:16:50.613284 7f7bb38ff640 EVENT_LOG_v1 {"time_micros": 1663247810613188, "job": 3, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [9], "score": 14.6482, "input_data_size": 59999}
2022/09/15-13:16:50.613331 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.613298) [/db_impl/db_impl_compaction_flush.cc:2623] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 1
2022/09/15-13:16:50.613341 7f7bb30fe640 [/flush_job.cc:816] [default] [JOB 4] Flushing memtable with next log file: 10
2022/09/15-13:16:50.613393 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810613385, "job": 4, "event": "flush_started", "num_memtables": 1, "num_entries": 56, "num_deletes": 0, "total_data_size": 57674, "memory_usage": 65776, "flush_reason": "Write Buffer Full"}
2022/09/15-13:16:50.613400 7f7bb30fe640 [/flush_job.cc:845] [default] [JOB 4] Level-0 flush table #11: started
2022/09/15-13:16:50.616275 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #13. Immutable memtables: 1.
2022/09/15-13:16:50.616317 7f7bb87f5bc0 [WARN] [/column_family.cc:903] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2022/09/15-13:16:50.617230 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810617121, "cf_name": "default", "job": 4, "event": "table_file_creation", "file_number": 11, "file_size": 59989, "file_checksum": "", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 57869, "index_size": 1184, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 0, "raw_key_size": 56610, "raw_average_key_size": 1010, "raw_value_size": 896, "raw_average_value_size": 16, "num_data_blocks": 14, "num_entries": 56, "num_filter_entries": 0, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "Snappy", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ", "creation_time": 1663247810, "oldest_key_time": 1663247810, "file_creation_time": 1663247810, "slow_compression_estimated_data_size": 0, "fast_compression_estimated_data_size": 0, "db_id": "c57de8cb-cdae-43c2-998e-8af612e28754", "db_session_id": "HV81XR7RGRG5F3FZ0F7B", "orig_file_number": 11}}
2022/09/15-13:16:50.617269 7f7bb30fe640 [/flush_job.cc:930] [default] [JOB 4] Level-0 flush table #11: 59989 bytes OK
2022/09/15-13:16:50.617816 7f7bb30fe640 [/flush_job.cc:979] [default] [JOB 4] Flush lasted 4489 microseconds, and 3732 cpu microseconds.
2022/09/15-13:16:50.618226 7f7bb38ff640 [/compaction/compaction_job.cc:1829] [default] [JOB 3] Generated table #12: 5 keys, 7140 bytes
2022/09/15-13:16:50.618326 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.617824) [/memtable_list.cc:469] [default] Level-0 commit table #11 started
2022/09/15-13:16:50.618333 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618172) [/memtable_list.cc:672] [default] Level-0 commit table #11: memtable #1 done
2022/09/15-13:16:50.618336 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618235) EVENT_LOG_v1 {"time_micros": 1663247810618217, "job": 4, "event": "flush_finished", "output_compression": "Snappy", "lsm_state": [2, 0, 0, 0, 0, 0, 0], "immutable_memtables": 1}
2022/09/15-13:16:50.618340 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618255) [/db_impl/db_impl_compaction_flush.cc:241] [default] Level summary: files[2 0 0 0 0 0 0] max score 14.65
2022/09/15-13:16:50.618349 7f7bb30fe640 [/db_impl/db_impl_files.cc:432] [JOB 4] Try to delete WAL files size 57954, prev total WAL file size 115909, number of live WAL files 3.
Note the "Generated table #12" line for job 3 (the compaction job): the new SST produced by the compaction is table 12. Locating this file shows that it is also written to zone 4, but it is already fragmented into two extents:
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000012.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc00170 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00140 PBAE: 0xc00148 SIZE: 0x8
EXTID: 2 PBAS: 0xc00168 PBAE: 0xc00170 SIZE: 0x8
====================================================================
STATS SUMMARY
====================================================================
NOE: 2 TES: 0x10 AES: 0x8 EAES: 8.000000 NOZ: 1
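The summary line can be reproduced from the extent list by hand. The sketch below assumes that NOE, TES, and AES stand for the number of extents, the total extent size, and the average extent size (this is inferred from the values, not taken from the filemap source), and that PBAE is an exclusive end address, which matches SIZE = PBAE - PBAS in the output. `Extent`, `ExtentStats`, `SummarizeExtents`, and `Sst12Extents` are hypothetical names, not part of filemap.

```cpp
#include <cstdint>
#include <vector>

// One extent as printed by the tool: [pbas, pbae) in device LBAs.
struct Extent {
    uint64_t pbas;  // start address (PBAS)
    uint64_t pbae;  // end address (PBAE), treated as exclusive
};

// Assumed meaning of the summary fields (inferred from the output).
struct ExtentStats {
    uint64_t count;       // NOE
    uint64_t total_size;  // TES
    uint64_t avg_size;    // AES, integer division
};

// Hypothetical helper: recompute the summary from a list of extents.
ExtentStats SummarizeExtents(const std::vector<Extent>& extents) {
    ExtentStats stats{static_cast<uint64_t>(extents.size()), 0, 0};
    for (const Extent& e : extents) {
        stats.total_size += e.pbae - e.pbas;
    }
    if (stats.count > 0) {
        stats.avg_size = stats.total_size / stats.count;
    }
    return stats;
}

// The two extents of 000012.sst, copied from the output above.
std::vector<Extent> Sst12Extents() {
    return {{0xc00140, 0xc00148}, {0xc00168, 0xc00170}};
}
```

Calling `SummarizeExtents(Sst12Extents())` yields 2 extents with a total size of 0x10 and an average of 0x8, which agrees with the reported NOE, TES, and AES.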
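We can see that the MANIFEST file lies between the two SST 12 extents: its single extent (0xc00158 to 0xc00168) ends exactly where the second SST 12 extent (0xc00168) begins.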
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000004
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2
Warning: Extent Reported on nvme0n1 PBAS: 0x000000 PBAE: 0x001ff0 SIZE: 0x001ff0
====================================================================
EXTENT MAPPINGS
====================================================================
**** ZONE 4 ****
LBAS: 0xc00000 LBAE: 0xe1a800 CAP: 0x21a800 WP: 0xc002a8 SIZE: 0x400000 STATE: 0x20 MASK: 0xffc00000
EXTID: 1 PBAS: 0xc00158 PBAE: 0xc00168 SIZE: 0x10
====================================================================
STATS SUMMARY
====================================================================
NOE: 1 TES: 0x10 AES: 0x10 EAES: 16.000000 NOZ: 1
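To make the interleaving explicit, the extents of both files can be merged and ordered by start address. This is a minimal sketch using only the addresses printed above; `NamedExtent`, `ZoneStart`, and `ZoneLayout` are hypothetical names, and the zone start is derived by applying the MASK value from the output to an address.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// An extent tagged with the file it belongs to; [pbas, pbae) in device LBAs.
struct NamedExtent {
    std::string file;
    uint64_t pbas;
    uint64_t pbae;
};

// Start address of the zone containing lba: clear the in-zone offset bits
// using the MASK value reported in the zone header line.
uint64_t ZoneStart(uint64_t lba, uint64_t zone_mask) {
    return lba & zone_mask;
}

// Extents of SST 12 and the MANIFEST from the outputs above, sorted by PBAS.
std::vector<NamedExtent> ZoneLayout() {
    std::vector<NamedExtent> extents = {
        {"000012.sst", 0xc00140, 0xc00148},
        {"000012.sst", 0xc00168, 0xc00170},
        {"MANIFEST-000004", 0xc00158, 0xc00168},
    };
    std::sort(extents.begin(), extents.end(),
              [](const NamedExtent& a, const NamedExtent& b) {
                  return a.pbas < b.pbas;
              });
    return extents;
}
```

In the sorted layout the MANIFEST extent is the middle entry and is directly followed by the second SST 12 extent. The 0x10 LBAs between 0xc00148 and 0xc00158 are not attributed to either file in these two outputs.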
Retesting with Larger File Sizes
We want to confirm that all files are still written to the same zone when the files are larger.
Notes
File type hints
RocksDB passes the type of data being written to its file system implementation, which can optionally handle each type differently; see file_system.h:
enum class IOType : uint8_t {
kData,
kFilter,
kIndex,
kMetadata,
kWAL,
kManifest,
kLog,
kUnknown,
kInvalid,
};
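As an illustration of how a file system wrapper could act on these types, the sketch below groups them into separate write streams. The grouping itself is purely an assumption for illustration (neither RocksDB nor F2FS prescribes it), `Stream` and `StreamFor` are hypothetical names, and the enum is copied locally so the sketch compiles without the RocksDB headers.

```cpp
#include <cstdint>

// Local copy of the enum quoted above, so no RocksDB headers are needed.
enum class IOType : uint8_t {
    kData, kFilter, kIndex, kMetadata, kWAL, kManifest, kLog, kUnknown, kInvalid,
};

// Hypothetical write streams a wrapper might want to keep apart.
enum class Stream { kSst, kWal, kMeta, kDefault };

// Hypothetical policy: SST contents together, the WAL on its own,
// bookkeeping files together, and everything else in a default stream.
Stream StreamFor(IOType type) {
    switch (type) {
        case IOType::kData:
        case IOType::kFilter:
        case IOType::kIndex:
            return Stream::kSst;
        case IOType::kWAL:
            return Stream::kWal;
        case IOType::kMetadata:
        case IOType::kManifest:
        case IOType::kLog:
            return Stream::kMeta;
        case IOType::kUnknown:
        case IOType::kInvalid:
            return Stream::kDefault;
    }
    return Stream::kDefault;  // unreachable for valid enum values
}
```

A wrapper such as the DummyFSForward used in these notes could call something like `StreamFor` on the type it receives and log or route the write accordingly. Whether the type is actually populated for a given call depends on the RocksDB version and code path.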
Access pattern hints
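RocksDB can also pass an access pattern hint for a file; the default implementation of Hint is a no-op, so a file system is free to ignore it. See file_system.h: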
enum AccessPattern { kNormal, kRandom, kSequential, kWillNeed, kWontNeed };
virtual void Hint(AccessPattern /*pattern*/) {}
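On a POSIX backend, one plausible way to honor such a hint is to translate it into an advice value for `posix_fadvise(2)`. The sketch below shows only that translation; `ToFadviseAdvice` is a hypothetical helper, the enum is copied locally so the snippet compiles without RocksDB, and it assumes a Linux system where `<fcntl.h>` provides the `POSIX_FADV_*` constants.

```cpp
#include <fcntl.h>

// Local copy of the enum quoted above, so no RocksDB headers are needed.
enum AccessPattern { kNormal, kRandom, kSequential, kWillNeed, kWontNeed };

// Hypothetical helper: map a RocksDB access pattern to a posix_fadvise advice.
int ToFadviseAdvice(AccessPattern pattern) {
    switch (pattern) {
        case kNormal:
            return POSIX_FADV_NORMAL;
        case kRandom:
            return POSIX_FADV_RANDOM;
        case kSequential:
            return POSIX_FADV_SEQUENTIAL;
        case kWillNeed:
            return POSIX_FADV_WILLNEED;
        case kWontNeed:
            return POSIX_FADV_DONTNEED;
    }
    return POSIX_FADV_NORMAL;  // fallback for out-of-range values
}
```

An overridden `Hint` could then call `posix_fadvise(fd, 0, 0, ToFadviseAdvice(pattern))` on the file descriptor it owns, where a length of 0 covers the file from the offset to its end.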
Setup
With this setup, the log shows that a flush is triggered at 56 entries. The benchmark is run with the following command, followed by the RocksDB options used:
sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 1000 -k 1000 -v 16 -S
ctx_test->options.wal_dir ="/mnt/f2fs/wal/";
ctx_test->options.use_direct_io_for_flush_and_compaction = true;
ctx_test->options.max_bytes_for_level_base = 4*1024; // 4KiB (F2FS block - minimum allocation unit)
ctx_test->options.write_buffer_size=1024;
ctx_test->options.target_file_size_base=1024;
ctx_test->options.max_bytes_for_level_multiplier=2;
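These options make the level size targets very small, which is why compactions trigger almost immediately. The sketch below computes the targets for L1 and deeper, assuming static level sizing (that is, `level_compaction_dynamic_level_bytes` is off, which these notes do not state) and no additional per-level multipliers. It also computes a size-based score as level bytes divided by the target; for the 59999-byte L0 input against the 4 KiB base this gives about 14.648, which appears consistent with the score 14.6482 in the compaction log above. `LevelTargets` and `SizeScore` are hypothetical helpers, not RocksDB functions.

```cpp
#include <cstdint>
#include <vector>

// Target sizes in bytes for L1..L<num_levels>, assuming static level sizing:
// target(L1) = base, target(Ln) = target(Ln-1) * multiplier.
std::vector<uint64_t> LevelTargets(uint64_t base, double multiplier, int num_levels) {
    std::vector<uint64_t> targets;
    double target = static_cast<double>(base);
    for (int level = 1; level <= num_levels; ++level) {
        targets.push_back(static_cast<uint64_t>(target));
        target *= multiplier;
    }
    return targets;
}

// Size-based score: how many times the data in a level exceeds its target.
double SizeScore(uint64_t level_bytes, uint64_t target_bytes) {
    return static_cast<double>(level_bytes) / static_cast<double>(target_bytes);
}
```

With `max_bytes_for_level_base = 4*1024` and `max_bytes_for_level_multiplier = 2`, `LevelTargets(4096, 2.0, 6)` gives 4 KiB, 8 KiB, 16 KiB, 32 KiB, 64 KiB, and 128 KiB for L1 through L6, so a single flushed SST of roughly 58 KB already exceeds the targets of the first four levels.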