Databases - nimrody/knowledgebase GitHub Wiki

Other

Specialized

Postgres

Streaming

Papers

Distributed databases

General notes

Time series

Relational

NoSQL

  • Elasticsearch

  • Redis persistence

  • Data written to kernel buffers with the write(2) system call (or equivalent) is safe against process failure. Data committed to disk with the fsync(2) system call (or equivalent) is, for practical purposes, also safe against complete system failure such as a power outage.
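The two durability levels described above can be sketched in Python. This is only an illustration, not how Redis itself is implemented; the helper name `append_record` and its `durable` flag are hypothetical.

```python
import os


def append_record(path: str, payload: bytes, durable: bool = True) -> None:
    """Append payload to path (hypothetical helper, not a Redis API)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # write(2): the data reaches kernel buffers, so it survives a crash
        # of this process, but not necessarily a power outage.
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if durable:
            # fsync(2): ask the kernel to push the buffered data to the
            # storage device before returning.
            os.fsync(fd)
    finally:
        os.close(fd)
```

Calling `append_record(path, data, durable=False)` stops at the first level (process-crash safety); the default also pays for the fsync.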

  • fsync

    creat(/dir/log);
    write(/dir/log, "2, 3, [checksum], foo");
    fsync(/dir/log);
    fsync(/dir); // fsync parent directory of log file
    pwrite(/dir/orig, 2, "bar");
    fsync(/dir/orig);
    unlink(/dir/log);

    That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end: one more fsync of the parent directory, so that the unlink of the log is itself durable.

    creat(/dir/log);
    write(/dir/log, "2, 3, [checksum], foo");
    fsync(/dir/log);
    fsync(/dir);
    pwrite(/dir/orig, 2, "bar");
    fsync(/dir/orig);
    unlink(/dir/log);
    fsync(/dir);

    That results in consistent behavior and guarantees that our operation actually modifies the file after it’s completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn’t really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3, as an optimization, only flush to disk if the inode changed (which happens at most once a second on writes to the same file, since the inode mtime has one-second granularity).
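A minimal Python sketch of the full sequence, for Unix-like systems only. Several things here are assumptions rather than part of the notes above: the function names are hypothetical, the log record is read as "offset, length, checksum, original bytes", and the checksum is taken to be a CRC32 of the original bytes. Where the platform defines `F_FULLFSYNC`, the sketch tries it first and falls back to a plain fsync.

```python
import fcntl
import os
import zlib


def flush_to_disk(fd: int) -> None:
    """fsync, preferring F_FULLFSYNC where the platform defines it."""
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return
        except OSError:
            pass  # not supported for this descriptor; fall back to fsync
    os.fsync(fd)


def fsync_dir(dirpath: str) -> None:
    """Flush a directory so that creates and unlinks inside it are durable."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        flush_to_disk(fd)
    finally:
        os.close(fd)


def logged_overwrite(dirpath: str, name: str, offset: int, new: bytes) -> None:
    """Overwrite bytes in dirpath/name, guarded by a log file (hypothetical)."""
    orig = os.path.join(dirpath, name)
    log = os.path.join(dirpath, "log")
    fd = os.open(orig, os.O_RDWR)
    try:
        old = os.pread(fd, len(new), offset)
        # Assumed record layout: offset, length, checksum, original bytes.
        record = b"%d, %d, %d, " % (offset, len(old), zlib.crc32(old)) + old
        log_fd = os.open(log, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.write(log_fd, record)  # short record; partial writes ignored
            flush_to_disk(log_fd)     # fsync(/dir/log)
        finally:
            os.close(log_fd)
        fsync_dir(dirpath)            # fsync(/dir): the log's entry is durable
        os.pwrite(fd, new, offset)    # pwrite(/dir/orig, ...)
        flush_to_disk(fd)             # fsync(/dir/orig)
    finally:
        os.close(fd)
    os.unlink(log)                    # unlink(/dir/log)
    fsync_dir(dirpath)                # fsync(/dir): the unlink is durable
```

For example, `logged_overwrite("/dir", "orig", 2, b"bar")` mirrors the pseudocode. Crash recovery (reading the log back and verifying the checksum) is not shown.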

    Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 – to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN ‘11 paper and related work cover that issue.