RocksDB Timeline and F2FS File Allocation on ZNS - nicktehrany/notes GitHub Wiki

This post analyzes the file allocation scheme of RocksDB and how F2FS maps these files onto ZNS. To achieve this, we use a custom RocksDB POSIX file system wrapper implementation to observe file allocations, and we exit the program at successive points (stopping the program lets us see where each file ended up on the ZNS device). The RocksDB file system wrapper comes from the Storage Systems course at the VU Amsterdam (see the course homepage for more info).

Full Timeline

// TODO make a summary timeline for the iterations (short and link below for the detailed steps)

Iterations

We use an iterative approach with increasing workloads in order to see exactly what happens. First we write very few key-value pairs, in order to generate a single file, and then locate this file. With each iteration we increase the workload slightly, such that 2 files are created (or possibly flushes and compactions are triggered). At every step we want to know exactly what happens when, and how F2FS handles these files. From this we build a timeline and a mapping of the F2FS allocation scheme for these files. To find the mappings of files we use our filemap tool.

The setup of RocksDB is the same for all workloads (except for the number of key-value pairs being written and the point at which we exit the code). See the Setup section at the end for the full setup, including later changes.

// In our RocksDB context
ctx_test->options.wal_dir = "/mnt/f2fs/wal/";
ctx_test->options.use_direct_io_for_flush_and_compaction = true;
ctx_test->options.max_bytes_for_level_base = 4*1024; // 4KiB (F2FS block - minimum allocation unit)

In DummyFSForward we simply place an exit() call right after the wrapped NewWritableFile() call returns, storing the result in a temporary so that the function still has a return statement and the compiler does not complain.

std::cout << get_seq_id() << " func: " << __FUNCTION__ << " line: " << __LINE__ << " " << std::endl;
IOStatus temp = this->_private_fs->NewWritableFile(fname, file_opts, result, dbg);
exit(1);
return temp;

F2FS Setup

Formatting and mounting F2FS already writes some data to the device. It is important to see which LBAs have been written at this point in order to understand the offsets in the following iterations. We run the setup_f2fs script to format and mount F2FS with our devices and then check which zones have been written (only the first 4, since file allocations during the RocksDB runs start at zone 4).

user@stosys:~/src/f2fs-bench/file-map/build$ ../../setup_f2fs nvme0n2 nvme0n1
Success formatting namespace:1

    F2FS-tools: mkfs.f2fs Ver: 1.15.0 (2022-05-25)

Info: Disable heap-based policy
Info: Debug level = 0
Info: Trim is enabled
Info: Host-managed zoned block device:
      100 zones, 2147483648 zone size(bytes), 0 randomly writeable zones
      524288 blocks per zone
Info: Segments per section = 1024
Info: Sections per zone = 1
Info: sector size = 512
Info: total sectors = 427819008 (208896 MB)
Info: zone aligned segment0 blkaddr: 524288
Info: format version with
  "Linux version 5.19.0-051900-generic (kernel@sita) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.3.0-5ubuntu1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38.90.20220713) #202207312230 SMP PREEMPT_DYNAMIC Sun Jul 31 22:34:11 UTC 2022"
Info: [/dev/nvme0n1] Discarding device
Info: This device doesn't support BLKSECDISCARD
Info: Discarded 4096 MB
Info: [/dev/nvme0n2] Discarding device
Info: Discarded 204800 MB
Info: Overprovision ratio = 10.000%
Info: Overprovision segments = 18972 (GC reserved = 15092)
Info: format successful
user@stosys:~/src/f2fs-bench/file-map/build$ sudo nvme zns report-zones /dev/nvme0n2 -d 4
nr_zones: 100
SLBA: 0          WP: 0x8        Cap: 0x21a800   State: 0x20 Type: 0x2  Attrs: 0    AttrsInfo: 0
SLBA: 0x400000   WP: 0x400000   Cap: 0x21a800   State: 0x10 Type: 0x2  Attrs: 0    AttrsInfo: 0
SLBA: 0x800000   WP: 0x800000   Cap: 0x21a800   State: 0x10 Type: 0x2  Attrs: 0    AttrsInfo: 0
SLBA: 0xc00000   WP: 0xc00008   Cap: 0x21a800   State: 0x20 Type: 0x2  Attrs: 0    AttrsInfo: 0

We can see that zone 4 has been written up to 0xc00008, which is important for the first iteration. (It is also interesting that zone 1 has been written by the same amount, 0x8.)

IMPORTANT: between runs we always need to rerun the setup_f2fs script to reformat the file system and start over.

Iteration 1

RocksDB generates the following files (and the directory for the WAL, which is empty for now).

user@stosys:~/src/f2fs-bench$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [   0]  000000.dbtmp
├── [   0]  LOCK
├── [8.8K]  LOG
└── [3.4K]  wal

1 directory, 3 files

Locating the LOG file, we see that it is at the beginning of zone 4.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x0000e8  SIZE: 0x0000e8

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00020  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00008    PBAE: 0xc00020    SIZE: 0x18

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x18        AES: 0x18        EAES: 24.000000   NOZ: 1

We can see the LOG file is fully written in zone 4 (it appears F2FS uses zones 1-3 internally, plus the first 0x8 LBAs of zone 4). We can also see that part of the file is reported on the conventional namespace (possibly inline data).

Iteration 2

Now we modify the source code of the file system wrapper to allow 2 file creations.

// Global counter of created files
uint64_t ctr = 1;

IOStatus DummyFSForward::NewWritableFile(const std::string &fname,
        const FileOptions &file_opts,
        std::unique_ptr<FSWritableFile> *result,
        IODebugContext *dbg) {
    std::cout << get_seq_id() << " func: " << __FUNCTION__ << " line: " << __LINE__ << " " << std::endl;
    IOStatus temp = this->_private_fs->NewWritableFile(fname, file_opts, result, dbg);
    if (ctr == 2)
        exit(1);
    ctr++;
    return temp;
}

We then run it with sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 20000 -k 1000 -v 16 -S (it exits early, so the number of keys does not really matter).

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [8.8K]  LOG
├── [   0]  MANIFEST-000001
└── [3.4K]  wal

1 directory, 4 files

RocksDB has now created an IDENTITY file and written 36B to it. However, the file is reported in zone 2, which we previously assumed to hold F2FS metadata; since only 36B are written, this could be inline data.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/IDENTITY
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 2 ****
LBAS: 0x400000  LBAE: 0x61a800  CAP: 0x21a800  WP: 0x400028  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0x400018    PBAE: 0x400018    SIZE: 0

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0           AES: 0           EAES: 0.000000    NOZ: 1

As we can see, we retrieve a valid LBA for the extent from ioctl(); however, its size is 0, because data can only be written to the ZNS device in units of 512B (although it should have been flushed by fsync()).

Iteration 3

Again we increase the allowed number of files by 1 to 3.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [   0]  000001.dbtmp
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [8.8K]  LOG
├── [  13]  MANIFEST-000001
└── [3.4K]  wal

1 directory, 5 files

The MANIFEST-000001 file now contains data (13B). Locating it, we see it is placed after the LOG file in zone 4.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000001
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00018  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00010    PBAE: 0xc00018    SIZE: 0x8

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x8         AES: 0x8         EAES: 8.000000    NOZ: 1

Iteration 4

Again we increase the number of created files by 1, to 4. This creates another MANIFEST file, which is empty this time.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 19K]  LOG
├── [  13]  MANIFEST-000001
├── [   0]  MANIFEST-000004
└── [3.4K]  wal

1 directory, 6 files

We can also see an increase in the size of the LOG file. Mapping it again, we can see that its original location in zone 4, from 0xc00008 to 0xc00020, has been partially invalidated and the extent is now mapped from 0xc00018 to 0xc00040.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x0000d8  SIZE: 0x0000d8

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00040  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00018    PBAE: 0xc00040    SIZE: 0x28

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x28        AES: 0x28        EAES: 40.000000   NOZ: 1

Iteration 5

Increase files to 5. Note that we also check when file deletions happen, of which there have been none so far. RocksDB now creates an empty .dbtmp file, and MANIFEST-000004 grows in size.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [   0]  000004.dbtmp
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 19K]  LOG
├── [  13]  MANIFEST-000001
├── [  57]  MANIFEST-000004
└── [3.4K]  wal

1 directory, 7 files

Locating both MANIFEST files:

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000001
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00028  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00020    PBAE: 0xc00028    SIZE: 0x8

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x8         AES: 0x8         EAES: 8.000000    NOZ: 1
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000004
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x001ff8  SIZE: 0x001ff8

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00028  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00018    PBAE: 0xc00020    SIZE: 0x8

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x8         AES: 0x8         EAES: 8.000000    NOZ: 1

Iteration 6

Increase files to 6. Now an empty WAL file is created and the .dbtmp file is deleted (but it was empty anyway).

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 19K]  LOG
├── [  13]  MANIFEST-000001
├── [  57]  MANIFEST-000004
└── [3.4K]  wal
    └── [   0]  000005.log

1 directory, 7 files

Iteration 7

Increase files to 7. Another empty .dbtmp file is created (OPTIONS-000006.dbtmp), and MANIFEST-000001 no longer appears in the listing.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 19K]  LOG
├── [  57]  MANIFEST-000004
├── [   0]  OPTIONS-000006.dbtmp
└── [3.4K]  wal
    └── [   0]  000005.log

1 directory, 7 files

Iteration 8

Increase files to 8. Now all the keys have been written to the WAL, but no flush has happened yet, so no new files are created. The options file is no longer an empty .dbtmp file (it is now OPTIONS-000007 with 6.2K), and the LOG file has grown as well. Looking at the layout (see below), the LOG file now starts at 0xc00048, which is 0x8 past its prior end point (0xc00040). Seemingly some other, invalid data resides in that region. It also appears that the LOG file is constantly being invalidated and appended to (as we have not seen any delete operation on it).

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 22K]  LOG
├── [  57]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    └── [ 20M]  000005.log

1 directory, 7 files

LOG File

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x0000d0  SIZE: 0x0000d0

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc09fc8  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00048    PBAE: 0xc00078    SIZE: 0x30

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x30        AES: 0x30        EAES: 48.000000   NOZ: 1

OPTIONS-000007 File

Locating the file we get.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/OPTIONS-000007
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc09fc8  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00030    PBAE: 0xc00040    SIZE: 0x10

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x10        AES: 0x10        EAES: 16.000000   NOZ: 1

WAL/000005.log File

Locating this file we get.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000005.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc09fc8  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00078    PBAE: 0xc09fc0    SIZE: 0x9f48

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x9f48      AES: 0x9f48      EAES: 40776.000000  NOZ: 1

Iteration 9

We increase the number of writes in the command to trigger a flush: sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 2000000 -k 1000 -v 16 -S

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 22K]  LOG
├── [  57]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    ├── [ 63M]  000005.log
    └── [   0]  000008.log

1 directory, 8 files

We can see a new 000008.log file and an increase in 000005.log file size.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000005.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x003aa8  SIZE: 0x003aa8

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc1f908  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00078    PBAE: 0xc1f908    SIZE: 0x1f890

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x1f890     AES: 0x1f890     EAES: 129168.000000  NOZ: 1

However, we have not triggered any flush yet, so we'll hardcode a memtable size in the configuration [this does not work like that, or I can't find the flush flag; I don't have a way to force a flush without writing more data].

Iteration 10

Therefore we simply write more data: sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 5000000 -k 1000 -v 16 -S

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [   0]  000009.sst
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 22K]  LOG
├── [  57]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    ├── [ 63M]  000005.log
    └── [324K]  000008.log

1 directory, 9 files

Locating the new wal log file.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000008.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x0230b0  SIZE: 0x0230b0

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc1fbc0  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc12078    PBAE: 0xc12300    SIZE: 0x288

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x288       AES: 0x288       EAES: 648.000000  NOZ: 1

Note: the benchmark crashed but still created a new SST file. In the next run we'll check again with another file. (This seems to have been some concurrency issue around our exit statement, i.e. messed-up thread coordination.)

Iteration 11

Increase files to 10. Now one old log file has been deleted, an additional one has been created, and we have the first SST. We get two more files since the SST triggers a NewRandomAccessFile call and we only have our exit in NewWritableFile.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [ 63M]  000009.sst
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 29K]  LOG
├── [ 14K]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    ├── [ 63M]  000008.log
    └── [   0]  000010.log

1 directory, 9 files

The SST is mapped to zone 5, even though there is enough space in zone 4 to fit it [Correction: there might not have been enough space, we'll check this again in later iterations]. Perhaps NewRandomAccessFile carries some extra flags that lead to a different placement? [TODO: figure this out]

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000009.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 5 ****
LBAS: 0x1000000  LBAE: 0x121a800  CAP: 0x21a800  WP: 0x101f548  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0x1000000   PBAE: 0x101f548   SIZE: 0x1f548

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x1f548     AES: 0x1f548     EAES: 128328.000000  NOZ: 1

We can also see that the LOG is now beginning to get fragmented:

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/LOG
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x0000c0  SIZE: 0x0000c0

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc1f998  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00040    PBAE: 0xc00068    SIZE: 0x28
EXTID: 2     PBAS: 0xc000d0    PBAE: 0xc000e8    SIZE: 0x18

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 2     TES: 0x40        AES: 0x20        EAES: 32.000000   NOZ: 1

Iteration 11 (smaller memtable)

We'll decrease the memtable size so that we need to write less, and see what happens with the files. We also increase the number of files to 12 before exiting, in order to flush twice and create 2 different SST files.

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [ 59K]  000009.sst
├── [ 59K]  000011.sst
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 32K]  LOG
├── [ 16K]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    ├── [ 57K]  000010.log
    └── [   0]  000012.log

1 directory, 10 files

Locating the different SST files, we can see they all end up in the same zone (as opposed to the previous run, where the larger SST file went to zone 5).

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000009.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc001a0  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00040    PBAE: 0xc000b8    SIZE: 0x78

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x78        AES: 0x78        EAES: 120.000000  NOZ: 1
user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000011.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00260  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc000f0    PBAE: 0xc00168    SIZE: 0x78

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x78        AES: 0x78        EAES: 120.000000  NOZ: 1

Locating the WAL, we can see it also ends up in zone 4. Hence there is no smart allocation so far; everything is just being dumped into zone 4.

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/wal/000010.log
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x000018  SIZE: 0x000018

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00260  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc001e8    PBAE: 0xc00260    SIZE: 0x78

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x78        AES: 0x78        EAES: 120.000000  NOZ: 1

Iteration 12

Now we want to increase the number of files so that a single compaction happens. We set the number of files to 13 (so that enough files for the compaction can be created).

user@stosys:~/src/f2fs-bench/file-map/build$ tree /mnt/f2fs -h -L 2
[4.0K]  /mnt/f2fs
├── [ 59K]  000009.sst
├── [ 59K]  000011.sst
├── [7.0K]  000012.sst
├── [   0]  000014.sst
├── [   0]  000015.sst
├── [  16]  CURRENT
├── [  36]  IDENTITY
├── [   0]  LOCK
├── [ 30K]  LOG
├── [4.1K]  MANIFEST-000004
├── [6.2K]  OPTIONS-000007
└── [3.4K]  wal
    ├── [ 57K]  000010.log
    ├── [ 57K]  000013.log
    └── [   0]  000016.log

1 directory, 14 files

Checking the log, we see that the first compaction happens at:

2022/09/15-13:16:50.601432 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #8. Immutable memtables: 0.
2022/09/15-13:16:50.602078 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.601865) [/db_impl/db_impl_compaction_flush.cc:2623] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2022/09/15-13:16:50.602110 7f7bb30fe640 [/flush_job.cc:816] [default] [JOB 2] Flushing memtable with next log file: 8
2022/09/15-13:16:50.602331 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810602272, "job": 2, "event": "flush_started", "num_memtables": 1, "num_entries": 56, "num_deletes": 0, "total_data_size": 57675, "memory_usage": 65784, "flush_reason": "Write Buffer Full"}
2022/09/15-13:16:50.602350 7f7bb30fe640 [/flush_job.cc:845] [default] [JOB 2] Level-0 flush table #9: started
2022/09/15-13:16:50.603241 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #10. Immutable memtables: 1.
2022/09/15-13:16:50.603261 7f7bb87f5bc0 [WARN] [/column_family.cc:903] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2022/09/15-13:16:50.610233 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810609925, "cf_name": "default", "job": 2, "event": "table_file_creation", "file_number": 9, "file_size": 59999, "file_checksum": "", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 57873, "index_size": 1190, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 0, "raw_key_size": 56611, "raw_average_key_size": 1010, "raw_value_size": 896, "raw_average_value_size": 16, "num_data_blocks": 14, "num_entries": 56, "num_filter_entries": 0, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "Snappy", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ", "creation_time": 1663247810, "oldest_key_time": 1663247810, "file_creation_time": 1663247810, "slow_compression_estimated_data_size": 0, "fast_compression_estimated_data_size": 0, "db_id": "c57de8cb-cdae-43c2-998e-8af612e28754", "db_session_id": "HV81XR7RGRG5F3FZ0F7B", "orig_file_number": 9}}
2022/09/15-13:16:50.610349 7f7bb30fe640 [/flush_job.cc:930] [default] [JOB 2] Level-0 flush table #9: 59999 bytes OK
2022/09/15-13:16:50.611045 7f7bb30fe640 [/flush_job.cc:979] [default] [JOB 2] Flush lasted 9011 microseconds, and 8278 cpu microseconds.
2022/09/15-13:16:50.612321 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611061) [/memtable_list.cc:469] [default] Level-0 commit table #9 started
2022/09/15-13:16:50.612328 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611706) [/memtable_list.cc:672] [default] Level-0 commit table #9: memtable #1 done
2022/09/15-13:16:50.612332 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611836) EVENT_LOG_v1 {"time_micros": 1663247810611805, "job": 2, "event": "flush_finished", "output_compression": "Snappy", "lsm_state": [1, 0, 0, 0, 0, 0, 0], "immutable_memtables": 1}
2022/09/15-13:16:50.612336 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.611954) [/db_impl/db_impl_compaction_flush.cc:241] [default] Level summary: files[1 0 0 0 0 0 0] max score 14.65
2022/09/15-13:16:50.612346 7f7bb30fe640 [/db_impl/db_impl_files.cc:432] [JOB 2] Try to delete WAL files size 57955, prev total WAL file size 115909, number of live WAL files 3.
2022/09/15-13:16:50.613054 7f7bb38ff640 [/compaction/compaction_job.cc:2214] [default] [JOB 3] Compacting 1@0 files to L1, score 14.65
2022/09/15-13:16:50.613085 7f7bb38ff640 [/compaction/compaction_job.cc:2220] [default] Compaction start summary: Base version 3 Base level 0, inputs: [9(58KB)]
2022/09/15-13:16:50.613284 7f7bb38ff640 EVENT_LOG_v1 {"time_micros": 1663247810613188, "job": 3, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [9], "score": 14.6482, "input_data_size": 59999}
2022/09/15-13:16:50.613331 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.613298) [/db_impl/db_impl_compaction_flush.cc:2623] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 1
2022/09/15-13:16:50.613341 7f7bb30fe640 [/flush_job.cc:816] [default] [JOB 4] Flushing memtable with next log file: 10
2022/09/15-13:16:50.613393 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810613385, "job": 4, "event": "flush_started", "num_memtables": 1, "num_entries": 56, "num_deletes": 0, "total_data_size": 57674, "memory_usage": 65776, "flush_reason": "Write Buffer Full"}
2022/09/15-13:16:50.613400 7f7bb30fe640 [/flush_job.cc:845] [default] [JOB 4] Level-0 flush table #11: started
2022/09/15-13:16:50.616275 7f7bb87f5bc0 [/db_impl/db_impl_write.cc:1824] [default] New memtable created with log file: #13. Immutable memtables: 1.
2022/09/15-13:16:50.616317 7f7bb87f5bc0 [WARN] [/column_family.cc:903] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2022/09/15-13:16:50.617230 7f7bb30fe640 EVENT_LOG_v1 {"time_micros": 1663247810617121, "cf_name": "default", "job": 4, "event": "table_file_creation", "file_number": 11, "file_size": 59989, "file_checksum": "", "file_checksum_func_name": "Unknown", "table_properties": {"data_size": 57869, "index_size": 1184, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 1, "index_value_is_delta_encoded": 1, "filter_size": 0, "raw_key_size": 56610, "raw_average_key_size": 1010, "raw_value_size": 896, "raw_average_value_size": 16, "num_data_blocks": 14, "num_entries": 56, "num_filter_entries": 0, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "Snappy", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ", "creation_time": 1663247810, "oldest_key_time": 1663247810, "file_creation_time": 1663247810, "slow_compression_estimated_data_size": 0, "fast_compression_estimated_data_size": 0, "db_id": "c57de8cb-cdae-43c2-998e-8af612e28754", "db_session_id": "HV81XR7RGRG5F3FZ0F7B", "orig_file_number": 11}}
2022/09/15-13:16:50.617269 7f7bb30fe640 [/flush_job.cc:930] [default] [JOB 4] Level-0 flush table #11: 59989 bytes OK
2022/09/15-13:16:50.617816 7f7bb30fe640 [/flush_job.cc:979] [default] [JOB 4] Flush lasted 4489 microseconds, and 3732 cpu microseconds.
2022/09/15-13:16:50.618226 7f7bb38ff640 [/compaction/compaction_job.cc:1829] [default] [JOB 3] Generated table #12: 5 keys, 7140 bytes
2022/09/15-13:16:50.618326 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.617824) [/memtable_list.cc:469] [default] Level-0 commit table #11 started
2022/09/15-13:16:50.618333 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618172) [/memtable_list.cc:672] [default] Level-0 commit table #11: memtable #1 done
2022/09/15-13:16:50.618336 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618235) EVENT_LOG_v1 {"time_micros": 1663247810618217, "job": 4, "event": "flush_finished", "output_compression": "Snappy", "lsm_state": [2, 0, 0, 0, 0, 0, 0], "immutable_memtables": 1}
2022/09/15-13:16:50.618340 7f7bb30fe640 (Original Log Time 2022/09/15-13:16:50.618255) [/db_impl/db_impl_compaction_flush.cc:241] [default] Level summary: files[2 0 0 0 0 0 0] max score 14.65
2022/09/15-13:16:50.618349 7f7bb30fe640 [/db_impl/db_impl_files.cc:432] [JOB 4] Try to delete WAL files size 57954, prev total WAL file size 115909, number of live WAL files 3.

Note the "Generated table #12" line for job 3 (the compaction job): the new SST produced by the compaction is table 12. Locating this file shows that it is also written to zone 4, but it is already fragmented into two extents.
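
The score of 14.65 that the log reports for this compaction can be reproduced from the size of the flushed file. The sketch below assumes that, for level-style compaction, RocksDB scores L0 as the larger of the file-count ratio and the size ratio against max_bytes_for_level_base. That is our understanding of the behaviour, not something stated in the log. L0Score is a hypothetical helper, and the file-count trigger of 4 is an assumed default, since the setup does not set it.

```cpp
#include <cstdint>

// Hypothetical helper, not a RocksDB API. Assumes the L0 score is the larger
// of (number of L0 files / file-count trigger) and
// (total L0 bytes / max_bytes_for_level_base).
constexpr double L0Score(int num_files, std::uint64_t total_bytes,
                         int file_num_trigger,
                         std::uint64_t max_bytes_for_level_base) {
  const double by_count = static_cast<double>(num_files) / file_num_trigger;
  const double by_size =
      static_cast<double>(total_bytes) / max_bytes_for_level_base;
  return by_count > by_size ? by_count : by_size;
}

// Inputs taken from the log and the setup: table #9 is 59999 bytes and
// max_bytes_for_level_base is 4 KiB. The trigger of 4 is an assumption.
constexpr double kScoreJob3 = L0Score(1, 59999, 4, 4 * 1024);
```

With these inputs the size ratio dominates: 59999 / 4096 is about 14.648, which agrees with the score of 14.6482 in the compaction_started event.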

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/000012.sst
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc00170  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00140    PBAE: 0xc00148    SIZE: 0x8
EXTID: 2     PBAS: 0xc00168    PBAE: 0xc00170    SIZE: 0x8

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 2     TES: 0x10        AES: 0x8         EAES: 8.000000    NOZ: 1

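Mapping the MANIFEST file shows that its single extent (0xc00158 to 0xc00168) lies between the two extents of SST 12 and ends exactly where the second SST extent begins.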

user@stosys:~/src/f2fs-bench/file-map/build$ sudo ./filemap -f /mnt/f2fs/MANIFEST-000004
Warning: nvme0n1 is registered as containing this file, however it is not a ZNS.
If it is used with F2FS as the conventional device, enter the assocaited ZNS device name: nvme0n2

Warning: Extent Reported on nvme0n1  PBAS: 0x000000  PBAE: 0x001ff0  SIZE: 0x001ff0

====================================================================
                        EXTENT MAPPINGS
====================================================================

**** ZONE 4 ****
LBAS: 0xc00000  LBAE: 0xe1a800  CAP: 0x21a800  WP: 0xc002a8  SIZE: 0x400000  STATE: 0x20  MASK: 0xffc00000

EXTID: 1     PBAS: 0xc00158    PBAE: 0xc00168    SIZE: 0x10

====================================================================
                        STATS SUMMARY
====================================================================

NOE: 1     TES: 0x10        AES: 0x10        EAES: 16.000000   NOZ: 1
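
The two filemap outputs can be cross-checked by placing their extents on one address axis. The following is a minimal sketch with the extent values copied from the outputs above. The Extent struct and both helpers are hypothetical names. PBAE is treated as exclusive, which is consistent with SIZE being PBAE minus PBAS in the output. The 512-byte LBA size is an assumption, since the output does not state its unit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One filemap extent as a half-open LBA range [pbas, pbae).
struct Extent {
  std::uint64_t pbas;
  std::uint64_t pbae;
};

// Extents copied from the filemap outputs for 000012.sst and MANIFEST-000004.
constexpr std::array<Extent, 2> kSst12 = {
    {{0xc00140, 0xc00148}, {0xc00168, 0xc00170}}};
constexpr Extent kManifest = {0xc00158, 0xc00168};

// Assumption: one LBA is 512 bytes (not stated in the filemap output).
constexpr std::uint64_t kAssumedLbaBytes = 512;

// Sums the extent sizes in LBAs (the TES value in the stats summary).
template <std::size_t N>
constexpr std::uint64_t TotalLbas(const std::array<Extent, N>& extents) {
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < N; ++i) {
    total += extents[i].pbae - extents[i].pbas;
  }
  return total;
}

// True if `inner` fits entirely in the gap between `before` and `after`.
constexpr bool LiesBetween(const Extent& inner, const Extent& before,
                           const Extent& after) {
  return before.pbae <= inner.pbas && inner.pbae <= after.pbas;
}
```

TotalLbas(kSst12) is 0x10, matching the TES value reported for SST 12, and LiesBetween(kManifest, kSst12[0], kSst12[1]) holds. Under the 512-byte assumption, 0x10 LBAs are 8 KiB, or two 4 KiB F2FS blocks, which would be enough for the 7140-byte SST. The gap between the two SST extents is 0x20 LBAs. The MANIFEST extent covers the upper 0x10, and these two outputs do not show what occupies 0xc00148 to 0xc00158.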

Retesting with Larger File Sizes

We want to confirm that all files are still written to the same zone when the files are larger.

Notes

File type hints

RocksDB passes the type of the data being written to its file system implementation, which can optionally handle each type differently (see file_system.h):

enum class IOType : uint8_t {
  kData,
  kFilter,
  kIndex,
  kMetadata,
  kWAL,
  kManifest,
  kLog,
  kUnknown,
  kInvalid,
};
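
As an illustration of what handling these types differently could look like, the sketch below maps each IOType to a placement class that a file system wrapper might use to separate data. PlacementClass, ClassifyIOType, and the grouping itself are hypothetical and purely illustrative. They do not describe what RocksDB or F2FS actually do. The enum is copied locally so that the sketch compiles without the RocksDB headers.

```cpp
#include <cstdint>

// Local copy of the enum shown above (the real one lives in file_system.h).
enum class IOType : std::uint8_t {
  kData, kFilter, kIndex, kMetadata, kWAL, kManifest, kLog, kUnknown, kInvalid,
};

// Hypothetical placement classes for a wrapper that separates data by type.
enum class PlacementClass { kHot, kWarm, kCold };

// Illustrative grouping only: frequently appended log-like files are treated
// as hot, SST data as cold, and everything else as warm.
constexpr PlacementClass ClassifyIOType(IOType type) {
  switch (type) {
    case IOType::kWAL:
    case IOType::kManifest:
    case IOType::kLog:
      return PlacementClass::kHot;
    case IOType::kData:
      return PlacementClass::kCold;
    case IOType::kFilter:
    case IOType::kIndex:
    case IOType::kMetadata:
    case IOType::kUnknown:
    case IOType::kInvalid:
      break;
  }
  return PlacementClass::kWarm;
}
```

A wrapper like the DummyFSForward used in this post could call such a function wherever it receives an IOType and choose a different target based on the result.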

Access pattern hints

See file_system.h. The default Hint() implementation is a no-op:

enum AccessPattern { kNormal, kRandom, kSequential, kWillNeed, kWontNeed };

virtual void Hint(AccessPattern /*pattern*/) {}
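
A file implementation that overrides Hint() could forward the pattern to the kernel. The sketch below shows one plausible mapping onto posix_fadvise advice values on Linux. ToFadvise and ApplyHint are hypothetical helper names, and the enum is copied locally so that the sketch compiles without the RocksDB headers.

```cpp
#include <fcntl.h>

// Local copy of the enum shown above (the real one lives in file_system.h).
enum AccessPattern { kNormal, kRandom, kSequential, kWillNeed, kWontNeed };

// Hypothetical helper: picks the posix_fadvise advice for an access pattern.
constexpr int ToFadvise(AccessPattern pattern) {
  switch (pattern) {
    case kRandom:
      return POSIX_FADV_RANDOM;
    case kSequential:
      return POSIX_FADV_SEQUENTIAL;
    case kWillNeed:
      return POSIX_FADV_WILLNEED;
    case kWontNeed:
      return POSIX_FADV_DONTNEED;
    case kNormal:
      break;
  }
  return POSIX_FADV_NORMAL;
}

// Hypothetical helper: applies the hint to the whole file. An offset of 0 with
// a length of 0 covers the file up to its end. Returns 0 on success or an
// error number, as posix_fadvise does.
inline int ApplyHint(int fd, AccessPattern pattern) {
  return posix_fadvise(fd, 0, 0, ToFadvise(pattern));
}
```

Note that the advice only affects page cache behaviour, so it is unlikely to matter for files written with direct I/O as configured in this setup.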

Setup

With this setup, the log shows that each flush contains 56 entries. The benchmark is run with:

sudo ../bin/m45 -p s2fs:nvme0n2:///mnt/f2fs/ -e 1000 -k 1000 -v 16 -S

ctx_test->options.wal_dir = "/mnt/f2fs/wal/";
ctx_test->options.use_direct_io_for_flush_and_compaction = true;
ctx_test->options.max_bytes_for_level_base = 4 * 1024; // 4KiB (F2FS block - minimum allocation unit)
ctx_test->options.write_buffer_size = 1024;
ctx_test->options.target_file_size_base = 1024;
ctx_test->options.max_bytes_for_level_multiplier = 2;
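
The 56 entries per flush can be sanity-checked against the numbers in the log. The sketch below recomputes the per-entry averages from the flush_started and table_file_creation events of job 4 and compares the reported memtable memory usage with a 64 KiB threshold. The threshold is an assumption: to our understanding RocksDB raises write_buffer_size values below 64 KiB to that minimum, which would explain why the flush triggers far above the configured 1024 bytes. The helper names are hypothetical.

```cpp
#include <cstdint>

// Values copied from the job 4 events in the log.
constexpr std::uint64_t kEntries = 56;           // num_entries
constexpr std::uint64_t kTotalDataSize = 57674;  // total_data_size
constexpr std::uint64_t kMemoryUsage = 65776;    // memory_usage
constexpr std::uint64_t kRawKeySize = 56610;     // raw_key_size (table #11)
constexpr std::uint64_t kRawValueSize = 896;     // raw_value_size (table #11)

// Assumption: the effective write buffer is 64 KiB despite the configured
// write_buffer_size of 1024.
constexpr std::uint64_t kAssumedWriteBuffer = 64 * 1024;

// Integer average, as used for raw_average_key_size in the log.
constexpr std::uint64_t AveragePerEntry(std::uint64_t total,
                                        std::uint64_t entries) {
  return entries == 0 ? 0 : total / entries;
}

// Bytes by which the reported memory usage exceeds the assumed buffer size.
constexpr std::uint64_t Overshoot(std::uint64_t memory_usage,
                                  std::uint64_t buffer) {
  return memory_usage > buffer ? memory_usage - buffer : 0;
}
```

The averages come out as 1010 bytes per key and 16 bytes per value, matching raw_average_key_size and raw_average_value_size in the log, and the memtable is flushed when its memory usage is 240 bytes above 64 KiB.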