Skip to content

fix: durable header-rooted free-list (closes #1, #2)#5

Merged
farhan-syah merged 8 commits into
mainfrom
fix/durable-freelist-reclamation
Jun 14, 2026
Merged

fix: durable header-rooted free-list (closes #1, #2)#5
farhan-syah merged 8 commits into
mainfrom
fix/durable-freelist-reclamation

Conversation

@farhan-syah

Copy link
Copy Markdown
Member

Summary

Replaces the catalog-stored deferred-free machinery with a durable, header-rooted free-list chain, fixing the two structural problems behind #1 and #2 with one design.

A page freed by a commit is recorded as a (commit_id, page_id) entry in a chain of PageKind::Free pages rooted at the A/B header's free_list_root slot — outside the catalog B+ tree. The chain is committed atomically with the header swap, and ordinary begin_write recycles any entry below the reclamation floor (observable by no live reader and no retained-history root).

Closes #1 — open() no longer bricks on a drainable backlog

  • Open-time cleanup_stale_reader_pins now deletes every durable reader-pin row left by a prior incarnation (a Standalone/Follower handle holds the exclusive writer lock before handing out any reader, so every pin present at open is a dead-incarnation leftover). This removes a crashed foreign-PID pin immediately instead of waiting out its 30s lease.
  • The reader-stall policy fires only on entries genuinely stuck behind a pin (at/above the reclamation floor); an inherited but drainable backlog no longer aborts the open-time reader.

Closes #2 — bounded disk, no super-linear time

  • Freed pages are reused mid-run by the allocator, so the file stops growing monotonically under sustained per-entry commits.
  • The free-list is maintained incrementally on the header chain — no per-commit full decode/re-sort/re-CoW of a catalog row — removing the O(n²) write term.

Note: the two consumer-side asks in #2 (exposing compact/compact_step through nodedb-lite's StorageEngine trait, and a batched multi-document commit path) live in the consumer and are out of scope for this PR. PR #4 is the Send-ness enabler for the former.

Tests

New coverage for free-list survival across unclean reopen, drainable-backlog reopen, free-list reclamation, fsck free-list accounting, and compaction free-list/history consistency.

…e header chain

The persistent free-list was stored as individual catalog rows keyed
`[0x04]||page_id_be` with deferred-free pairs under key `[0x05]`. Every
commit that freed pages had to copy-on-write the catalog tree to update
these rows, adding per-commit overhead proportional to the number of
freed pages and causing unbounded catalog churn under sustained writes.

Replace this with a chain of `PageKind::Free` pages rooted in the A/B
header's `free_list_root` slot. Each chain page stores `(commit_id,
page_id)` pairs in a compact binary layout; the chain is rewritten
atomically with the header swap, so it survives unclean shutdown without
catalog involvement.

Benefits:
- Freed pages drain to the allocator cache on commit without touching
  the catalog tree, eliminating per-commit catalog CoW overhead.
- The reclamation floor (oldest reader pin or oldest retained history
  commit, whichever is lower) is computed once at `begin_write`; pages
  below it are safe to recycle immediately within the same txn.
- Compaction no longer needs to partition catalog rows into free-list vs
  non-free-list groups: the dense repack simply resets `free_list_root`
  to 0 and the catalog stays clean.

Remove `compaction::freelist` (the catalog-based helpers), strip the
`FreeList` / `DeferredFree` catalog row kinds and their codec, and drop
the `skip_freelist_persistence_when_no_readers` open option (the new
path always recycles eligible pages without a catalog write, making the
flag's trade-off moot).

Add `pager::freelist` with `read_chain` and `rewrite_chain` — the two
primitives used by the commit and compaction paths.
…overy paths

Track `free_list_root_page_id` in `WriterState` and decode it from the
A/B header on open. Every header-writing code path (commit, rekey,
segment link/unlink, reader-pin cleanup, compaction) now forwards the
current chain head through `HeaderFieldsParams` so the slot is preserved
across header swaps instead of being zero-filled.

Key additions:

`Db::free_page_consumed` — a per-txn sink (`Arc<Mutex<Vec<u64>>>`) that
records page ids drawn from `free_page_cache`. The commit path uses this
to remove consumed pages from the durable chain (they now hold live
data). Cleared at `begin_write` so each txn starts with a clean slate.

`BTree::set_free_page_consumed` / `free_page_consumed` field — mirrors
the existing `free_page_cache` hook; every page popped from the cache
during `allocate_page` is pushed here so the commit path knows which
chain entries to retire.

`BTree::first_key` — leftmost-spine descent returning the smallest key
in O(height). Used by `oldest_retained_history_commit` to find the
oldest entry in the commit-history tree without a full scan.

`Db::oldest_retained_history_commit` — computes the older of the
in-memory reader floor and the oldest retained history commit; together
these are the reclamation floor gating which free-list entries are safe
to recycle at the start of a new write txn.

`cleanup_stale_reader_pins` is strengthened to remove *all* durable pin
rows on writer open (not just own-PID or wall-clock-expired ones). While
the exclusive writer lock is held at open, no live reader can own a pin
row yet, so every row present is a prior-incarnation leftover. The old
heuristic could leave a crashed foreign-PID pin in place until its lease
expired, permanently blocking the reclamation floor.

`recovery::deep_walk` now walks and validates the free-list chain:
verifies each chain page's AEAD, checks entries are in range, unique,
and not simultaneously live, then accounts chain pages and tracked free
pages as reachable so they are not reported as orphans.

`DbMode` is threaded through `open_existing_inner` so open-time recovery
(pin cleanup, backlog drain) runs only for Standalone / Follower handles
that hold the exclusive writer lock.
The single `src/txn/write.rs` (896 lines) grew to cover four distinct
concerns. Replace it with `src/txn/write/` containing:

- `txn.rs`     — `WriteTxn` struct, `begin`/`abort`, and all mutation
                  methods (`put`, `delete`, quota checks, segment ops)
- `commit.rs`  — the commit path: drain freed pages, rewrite the durable
                  free-list chain, update the catalog/history trees, and
                  swap the A/B header
- `spill.rs`   — `SpillScope` and `SpillSegmentMeta` (AEAD scratch file)
- `counter.rs` — `CounterRef` (durable monotonic named counters)
- `mod.rs`     — re-exports only; no logic

Public surface is unchanged (`WriteTxn`, `SpillScope`, `ScratchOffset`,
`CounterRef` all re-export from `crate::txn::write`).

The commit path in `commit.rs` is updated to use the durable free-list
chain: at `begin_write` it reads the current chain to populate the
in-memory allocator cache with floor-safe pages; at commit it rewrites
the chain with newly freed pages added, consumed pages removed, and
chain host pages recycled — all atomically with the header swap.
… stall policy

deferred_free_reclaim (new):
  - `sustained_overwrite_commits_drain_deferred_free` — rewrites a
    constant key set across 200 commits with no reader pinned; asserts
    the free-list backlog stays bounded (pages drain to the allocator on
    commit rather than accumulating).
  - `sustained_overwrite_does_not_grow_file` — same workload, asserts
    the file's page high-water mark stays bounded (freed pages are
    recycled, not orphaned).
  - `bounded_history_reuses_pages_after_window` — verifies reclamation
    under a capped commit-history window: pages freed below the oldest
    retained history root are recycled once the window slides past them.

compaction_incremental:
  - `compact_step_reopen_history_consistent` — verifies that after an
    incremental compaction with a dense repack, the A/B header's
    commit-history root is reset to 0 (the repack discards it); a
    subsequent write and read on the reopened store must succeed.
  - `compact_step_preserves_free_list` — verifies that an intermediate
    `compact_step` (budget too small for a full repack) preserves the
    durable free-list chain in the header so pages are not lost.

fsck_deep:
  - `free_list_pages_are_accounted_not_orphans` — after populating the
    durable free-list by deleting keys, `run_deep_walk` must report no
    page issues and no orphans (chain pages and tracked free pages are
    correctly accounted as reachable).

reader_stall_policy:
  - `reopen_with_drainable_backlog_does_not_brick_readers` — reproduces
    a regression where a store reopened with a large inherited
    deferred-free backlog tripped `AbortOldest` on the first post-open
    commit, permanently bricking readers. The backlog's entries are all
    older than the fresh reader's pin, so the commit must drain them and
    evaluate the stall policy against the post-drain depth.

benches/comparison: remove the now-deleted
`with_skip_freelist_persistence_when_no_readers` call; the comment is
updated to explain why write latency is comparable to redb without that
flag.
@farhan-syah

Copy link
Copy Markdown
Member Author

Implemented at e0e4da0.

Both issues share one root cause — the catalog-stored deferred-free list that was never drained on commit — so this replaces it with a durable header-rooted free-list chain that recycles freed pages below the reclamation floor and is committed atomically with the header swap. #1's open-time brick and #2's unbounded growth both fall out of that single change. Full test suite is green.

#3 (compaction corrupting large payloads) is a separate bug in the bulk_load repack path and is intentionally not part of this PR.

…ch compiles

Bench targets now carry `test = false`, which prevents `cargo test` and
nextest from compiling `harness = false` benches and running their `main`
as test cases. Without that flag, building the comparison bench dragged in
rocksdb → librocksdb-sys → bindgen → clang-sys → libclang on every
platform and also executed the benchmark binary during the test run.

The two Ubuntu jobs that explicitly compile benches (`clippy --all-targets`
and `cargo check --benches`) still need libclang, so both now install
clang and libclang-dev before the Rust cache step.
On Windows, opening a directory as a file handle fails with
PermissionDenied ("Access is denied") before sync_all is ever reached.
sync_dir's existing logic already treated PermissionDenied and Unsupported
from sync_all as best-effort no-ops (directory fsync is advisory there),
but the open failure propagated as an error instead.

Extend the same no-op treatment to the open call itself so that
SegmentWriter::create, which calls sync_dir on "seg/.staging", does not
brick on Windows.
The dispatch_io handler RcBlock and the raw block pointer derived from
it are !Send. When they were constructed inline inside read_at/write_at,
they lived in the async fn's state machine across the oneshot .await,
making the returned futures !Send — breaking cargo check on every Apple
target because VfsFile requires Send futures.

Extracted the block construction and channel submission into synchronous
free functions submit_read/submit_write. The !Send values are created,
used, and dropped entirely within those functions before any await point
is reached, so the calling futures remain Send.

Verified with:
  cargo check --target aarch64-apple-darwin --lib
  cargo check --target x86_64-apple-darwin --lib
  cargo check --target aarch64-apple-ios --lib
nextest (cargo test --no-run) and cargo bench both compile the
dev-dependency graph which pulls in rocksdb → librocksdb-sys →
bindgen → libclang. Windows runner images already ship LLVM; Linux
and macOS do not.

Added an LLVM/Clang install step to the test job (apt on Linux,
brew + LIBCLANG_PATH on macOS) and to the bench job (apt on Linux),
so bindgen can locate libclang on every platform and the jobs no
longer fail at the compile step.
@farhan-syah farhan-syah merged commit 42789f7 into main Jun 14, 2026
18 checks passed
@farhan-syah farhan-syah deleted the fix/durable-freelist-reclamation branch June 15, 2026 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant