fix: durable header-rooted free-list (closes #1, #2)#5
Merged
Conversation
…e header chain The persistent free-list was stored as individual catalog rows keyed `[0x04]||page_id_be` with deferred-free pairs under key `[0x05]`. Every commit that freed pages had to copy-on-write the catalog tree to update these rows, adding per-commit overhead proportional to the number of freed pages and causing unbounded catalog churn under sustained writes. Replace this with a chain of `PageKind::Free` pages rooted in the A/B header's `free_list_root` slot. Each chain page stores `(commit_id, page_id)` pairs in a compact binary layout; the chain is rewritten atomically with the header swap, so it survives unclean shutdown without catalog involvement. Benefits: - Freed pages drain to the allocator cache on commit without touching the catalog tree, eliminating per-commit catalog CoW overhead. - The reclamation floor (oldest reader pin or oldest retained history commit, whichever is lower) is computed once at `begin_write`; pages below it are safe to recycle immediately within the same txn. - Compaction no longer needs to partition catalog rows into free-list vs non-free-list groups: the dense repack simply resets `free_list_root` to 0 and the catalog stays clean. Remove `compaction::freelist` (the catalog-based helpers), strip the `FreeList` / `DeferredFree` catalog row kinds and their codec, and drop the `skip_freelist_persistence_when_no_readers` open option (the new path always recycles eligible pages without a catalog write, making the flag's trade-off moot). Add `pager::freelist` with `read_chain` and `rewrite_chain` — the two primitives used by the commit and compaction paths.
…overy paths Track `free_list_root_page_id` in `WriterState` and decode it from the A/B header on open. Every header-writing code path (commit, rekey, segment link/unlink, reader-pin cleanup, compaction) now forwards the current chain head through `HeaderFieldsParams` so the slot is preserved across header swaps instead of being zero-filled. Key additions: `Db::free_page_consumed` — a per-txn sink (`Arc<Mutex<Vec<u64>>>`) that records page ids drawn from `free_page_cache`. The commit path uses this to remove consumed pages from the durable chain (they now hold live data). Cleared at `begin_write` so each txn starts with a clean slate. `BTree::set_free_page_consumed` / `free_page_consumed` field — mirrors the existing `free_page_cache` hook; every page popped from the cache during `allocate_page` is pushed here so the commit path knows which chain entries to retire. `BTree::first_key` — leftmost-spine descent returning the smallest key in O(height). Used by `oldest_retained_history_commit` to find the oldest entry in the commit-history tree without a full scan. `Db::oldest_retained_history_commit` — computes the older of the in-memory reader floor and the oldest retained history commit; together these are the reclamation floor gating which free-list entries are safe to recycle at the start of a new write txn. `cleanup_stale_reader_pins` is strengthened to remove *all* durable pin rows on writer open (not just own-PID or wall-clock-expired ones). While the exclusive writer lock is held at open, no live reader can own a pin row yet, so every row present is a prior-incarnation leftover. The old heuristic could leave a crashed foreign-PID pin in place until its lease expired, permanently blocking the reclamation floor. `recovery::deep_walk` now walks and validates the free-list chain: verifies each chain page's AEAD, checks entries are in range, unique, and not simultaneously live, then accounts chain pages and tracked free pages as reachable so they are not reported as orphans. `DbMode` is threaded through `open_existing_inner` so open-time recovery (pin cleanup, backlog drain) runs only for Standalone / Follower handles that hold the exclusive writer lock.
The single `src/txn/write.rs` (896 lines) grew to cover four distinct
concerns. Replace it with `src/txn/write/` containing:
- `txn.rs` — `WriteTxn` struct, `begin`/`abort`, and all mutation
methods (`put`, `delete`, quota checks, segment ops)
- `commit.rs` — the commit path: drain freed pages, rewrite the durable
free-list chain, update the catalog/history trees, and
swap the A/B header
- `spill.rs` — `SpillScope` and `SpillSegmentMeta` (AEAD scratch file)
- `counter.rs` — `CounterRef` (durable monotonic named counters)
- `mod.rs` — re-exports only; no logic
Public surface is unchanged (`WriteTxn`, `SpillScope`, `ScratchOffset`,
`CounterRef` all re-export from `crate::txn::write`).
The commit path in `commit.rs` is updated to use the durable free-list
chain: at `begin_write` it reads the current chain to populate the
in-memory allocator cache with floor-safe pages; at commit it rewrites
the chain with newly freed pages added, consumed pages removed, and
chain host pages recycled — all atomically with the header swap.
… stall policy
deferred_free_reclaim (new):
- `sustained_overwrite_commits_drain_deferred_free` — rewrites a
constant key set across 200 commits with no reader pinned; asserts
the free-list backlog stays bounded (pages drain to the allocator on
commit rather than accumulating).
- `sustained_overwrite_does_not_grow_file` — same workload, asserts
the file's page high-water mark stays bounded (freed pages are
recycled, not orphaned).
- `bounded_history_reuses_pages_after_window` — verifies reclamation
under a capped commit-history window: pages freed below the oldest
retained history root are recycled once the window slides past them.
compaction_incremental:
- `compact_step_reopen_history_consistent` — verifies that after an
incremental compaction with a dense repack, the A/B header's
commit-history root is reset to 0 (the repack discards it); a
subsequent write and read on the reopened store must succeed.
- `compact_step_preserves_free_list` — verifies that an intermediate
`compact_step` (budget too small for a full repack) preserves the
durable free-list chain in the header so pages are not lost.
fsck_deep:
- `free_list_pages_are_accounted_not_orphans` — after populating the
durable free-list by deleting keys, `run_deep_walk` must report no
page issues and no orphans (chain pages and tracked free pages are
correctly accounted as reachable).
reader_stall_policy:
- `reopen_with_drainable_backlog_does_not_brick_readers` — reproduces
a regression where a store reopened with a large inherited
deferred-free backlog tripped `AbortOldest` on the first post-open
commit, permanently bricking readers. The backlog's entries are all
older than the fresh reader's pin, so the commit must drain them and
evaluate the stall policy against the post-drain depth.
benches/comparison: remove the now-deleted
`with_skip_freelist_persistence_when_no_readers` call; the comment is
updated to explain why write latency is comparable to redb without that
flag.
Member
Author
|
Implemented at e0e4da0. Both issues share one root cause — the catalog-stored deferred-free list that was never drained on commit — so this replaces it with a durable header-rooted free-list chain that recycles freed pages below the reclamation floor and is committed atomically with the header swap. #1's open-time brick and #2's unbounded growth both fall out of that single change. Full test suite is green. #3 (compaction corrupting large payloads) is a separate bug in the |
…ch compiles Bench targets now carry `test = false`, which prevents `cargo test` and nextest from compiling `harness = false` benches and running their `main` as test cases. Without that flag, building the comparison bench dragged in rocksdb → librocksdb-sys → bindgen → clang-sys → libclang on every platform and also executed the benchmark binary during the test run. The two Ubuntu jobs that explicitly compile benches (`clippy --all-targets` and `cargo check --benches`) still need libclang, so both now install clang and libclang-dev before the Rust cache step.
On Windows, opening a directory as a file handle fails with
PermissionDenied ("Access is denied") before sync_all is ever reached.
sync_dir's existing logic already treated PermissionDenied and Unsupported
from sync_all as best-effort no-ops (directory fsync is advisory there),
but the open failure propagated as an error instead.
Extend the same no-op treatment to the open call itself so that
SegmentWriter::create, which calls sync_dir on "seg/.staging", does not
brick on Windows.
The dispatch_io handler RcBlock and the raw block pointer derived from it are !Send. When they were constructed inline inside read_at/write_at, they lived in the async fn's state machine across the oneshot .await, making the returned futures !Send — breaking cargo check on every Apple target because VfsFile requires Send futures. Extracted the block construction and channel submission into synchronous free functions submit_read/submit_write. The !Send values are created, used, and dropped entirely within those functions before any await point is reached, so the calling futures remain Send. Verified with: cargo check --target aarch64-apple-darwin --lib cargo check --target x86_64-apple-darwin --lib cargo check --target aarch64-apple-ios --lib
nextest (cargo test --no-run) and cargo bench both compile the dev-dependency graph which pulls in rocksdb → librocksdb-sys → bindgen → libclang. Windows runner images already ship LLVM; Linux and macOS do not. Added an LLVM/Clang install step to the test job (apt on Linux, brew + LIBCLANG_PATH on macOS) and to the bench job (apt on Linux), so bindgen can locate libclang on every platform and the jobs no longer fail at the compile step.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the catalog-stored deferred-free machinery with a durable, header-rooted free-list chain, fixing the two structural problems behind #1 and #2 with one design.
A page freed by a commit is recorded as a
(commit_id, page_id)entry in a chain ofPageKind::Freepages rooted at the A/B header'sfree_list_rootslot — outside the catalog B+ tree. The chain is committed atomically with the header swap, and ordinarybegin_writerecycles any entry below the reclamation floor (observable by no live reader and no retained-history root).Closes #1 — open() no longer bricks on a drainable backlog
cleanup_stale_reader_pinsnow deletes every durable reader-pin row left by a prior incarnation (a Standalone/Follower handle holds the exclusive writer lock before handing out any reader, so every pin present at open is a dead-incarnation leftover). This removes a crashed foreign-PID pin immediately instead of waiting out its 30s lease.Closes #2 — bounded disk, no super-linear time
Tests
New coverage for free-list survival across unclean reopen, drainable-backlog reopen, free-list reclamation, fsck free-list accounting, and compaction free-list/history consistency.