Definition: "Batch-atomic write" mean the ability to write an arbitrary number of bytes into arbitrary locations in a single file, and to do so atomically. Following a system crash or power failure, either all the writes are preserved in the file, or none of them are.
Batch-atomic writes are supported by the F2FS filesystem on Linux.
This document explores changes to SQLite to provide a more general framework to make use of batch-atomic writes, not just in F2FS but in any other filesystem that happens to support batch-atomic writes.
Express the proposed design changes in words, since we find that writing things down, in detail, often brings to light subtle bugs that would be missed otherwise.
Provide a simple and accessible explanation of the proposed changes to encourage review by as many different people as possible.
Provide a baseline for discussion of changes or issues with the design.
2.0 VFS Interface Enhancements
The sqlite3_io_methods.xDeviceCharateristics() method adds a new bit SQLITE_IOCAP_BATCH_ATOMIC. The new bit is set only if the underlying filesystem supports batch-atomic writes.
Three new sqlite3_file_control() operations are added:
- SQLITE_FCNTL_COMMIT_ATOMIC_WRITE, and
An batch-atomic write consists of all write operations that are bracketed by calls to the first two new file controls, SQLITE_FCNTL_BEGIN_ATOMIC_WRITE and SQLITE_FCNTL_COMMIT_ATOMIC_WRITE.
The third file control, SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE, will cancel all writes back to the previous SQLITE_FCNTL_BEGIN_ATOMIC_WRITE.
If batch-atomic writes cannot be supported, then the SQLITE_FCNTL_BEGIN_ATOMIC_WRITE file-control should return a non-zero error code such as SQLITE_NOTFOUND or SQLITE_IOERR.
If a particular batch-write operation could not be supported (perhaps due to the write being too larger or exhaustion of available space on the storage device) then SQLITE_FCNTL_COMMIT_ATOMIC_WRITE must return an error code and all of the write operations going back to the previous SQLITE_FCNTL_BEGIN_ATOMIC_WRITE must be rolled back.
If a write operation that follows SQLITE_FCNTL_BEGIN_ATOMIC_WRITE fails, then SQLite will still invoke SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE to close the transaction and force the rollback.
If a call to SQLITE_FCNTL_COMMIT_ATOMIC_WRITE fails, then SQLite assumes that the write operation is rolled back automatically. The SQLITE_FCNTL_ROLLBACK_ATOMIC_COMMIT file control is not invoked in this case.
If SQLITE_FCNTL_COMMIT_ATOMIC_WRITE returns successfully, that means that all content back to the previous SQLITE_FCNTL_BEGIN_ATOMIC_WRITE has been committed to persistent storage successfully. No additional fsync() calls are required to ensure durability.
The "os_unix.c" VFS will be enhanced to support the interfaced defined above. F2FS support will likely be enclosed within "#ifdef __linux__".
No changes are needed to VFS modules that do not support batch-atomic writes. The SQLITE_IOCAP_BATCH_ATOMIC bit will not be set on those modules and the SQLITE_FCNTL_BEGIN_ATOMIC_WRITE, SQLITE_FCNTL_COMMIT_ATOMIC_WRITE, and SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE file-control calls will return SQLITE_NOTFOUND since they have not been implemented. The pager layer will use those characteristics of legacy VFSes to recognize that the legacy VFSes do not support batch-atomic writes.
3.0 Pager Enhancements
Changes to the pager are designed so that applications do not need to do anything to take advantage of batch-atomic write capabilities. If the underlying filesystem supports batch-atomic write, then transactions will be more efficient and use less I/O without any modifications to the application. In particular, the application is not required to set "PRAGMA journal_mode=MEMORY".
3.1 Pager Enhancements for Rollback Mode
Definition: "Rollback Mode" means any journal mode that uses a rollback journal: DELETE, TRUNCATE, or PERSIST.
Definition: A "batch-atomic write transaction" as any transaction that is capable of using the batch-atomic write capabilities of the filesystem.
Definition: A "normal transaction" is one in which the rollback journal must be flushed to disk and synced before writing to the database and that commits by unlinking, truncating, or overwriting the header of the rollback journal, depending on whether the journal_mode is DELETE, TRUNCATE, or PERSIST, respectively. In other words, a normal transaction is a transaction that continues to work like SQLite always has worked for the past decade.
If the SQLITE_IOCAP_BATCH_ATOMIC bit is set, then SQLite starts out by assuming that every transaction will be a batch-atomic write transaction. In that case, SQLite will always use an in-memory rollback journal. However, a transaction might need to be converted into a normal transaction, so there must be the capability to spill the in-memory rollback journal to disk. The code in the memjournal.c source file has most of these capabilities, but might need some minor enhancements.
If the SQLITE_IOCAP_BATCH_ATOMIC bit is clear, then SQLite starts out in normal transaction mode.
If a pager-cache overflow occurs, then the transaction converts to a normal transaction.
If multiple database files (other than TEMP) are modified, then the transaction converts to a normal transaction. This is necessary to ensure that a multi-database change is atomic across all databases. The batch-atomic write capability is only able to ensure atomic commits on a single file, not across multiple files.
The rollback journal is held in memory during a batch-atomic write transaction. There is no disk I/O associated the rollback journal for a batch-atomic write transaction. However, when a batch-atomic write transaction converts into a normal transaction, at that point the rollback journal must be written to disk.
To commit a normal transaction, SQLite first writes and syncs the rollback journal, then makes all changes to the database file, and then if the previous two steps were successful, it unlinks, truncates, or zeros the header of the rollback journal (depending on whether the journal_mode is DELETE, TRUNCATE, or PERSIST, respectively) in order to commit the transaction. For a multi-database normal transaction, additional steps for dealing with the master-journal are involved. Everything in this bullet-item is the standard behavior that SQLite has exhibited for over a decade. Nothing about this bullet-item represents a change. This bullet exists as background information only.
To commit a batch-atomic write transaction, SQLite first invokes SQLITE_FCNTL_BEGIN_ATOMIC_WRITE, then makes all changes to the database file, and then invokes SQLITE_FCNTL_COMMIT_ATOMIC_WRITE. If any of those steps fail for any reason, then the transaction fails.
Future versions of SQLite might retry a failed batch-atomic transaction as a normal transaction. But for now, a failed batch-atomic transaction is a hard failure which is not retried.
3.2 Pager Enhancements for WAL Mode
Batch-atomic write capability does not change the operation of WAL mode.