Fused delta(for(bitpacking)) decode (unstable_encodings) by joseph-isaacs · Pull Request #8224 · vortex-data/vortex

joseph-isaacs · 2026-06-02T18:28:25Z

Summary

Wires the new fastlanes::Delta::unfor_undelta_pack fused kernel into delta decompression, behind a new default-off unstable_encodings feature.

When a DeltaArray's deltas child is a FoR array (unsigned reference) wrapping a BitPacked array stored as full, zero-offset chunks with no patches, delta_decompress takes a fully fused fast path (try_fused_for_bitpacking → decompress_fused): each chunk is unpacked, FoR-decoded, and un-delta'd in a single pass before untransposing. Every other shape (signed reference, patches, sliced bit-packing) falls back to the existing generic path unchanged.
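To illustrate what "a single pass" buys, here is a minimal scalar sketch, not the fastlanes kernel. It assumes a plain sequential little-endian bit layout rather than the transposed 1024-element FastLanes layout, and the names `read_bits`, `decode_unfused`, and `decode_fused` are hypothetical. Both functions produce the same values; the unfused one materializes two intermediate vectors, while the fused one keeps a single running accumulator.

```rust
/// Read a `width`-bit value (1..=64) at bit offset `bit` from little-endian
/// packed words. Assumption: sequential layout, not the FastLanes layout.
fn read_bits(packed: &[u64], bit: usize, width: usize) -> u64 {
    let (word, shift) = (bit / 64, bit % 64);
    let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
    let mut v = packed[word] >> shift;
    if shift + width > 64 {
        // The value straddles two words; take the high bits from the next one.
        v |= packed[word + 1] << (64 - shift);
    }
    v & mask
}

/// Unfused: three passes, each materializing a full-length vector.
fn decode_unfused(packed: &[u64], len: usize, width: usize, reference: u64, base: u64) -> Vec<u64> {
    // Pass 1: bit-unpack the residuals.
    let unpacked: Vec<u64> = (0..len).map(|i| read_bits(packed, i * width, width)).collect();
    // Pass 2: frame-of-reference decode (add the unsigned reference).
    let deltas: Vec<u64> = unpacked.iter().map(|&r| r.wrapping_add(reference)).collect();
    // Pass 3: undo the delta encoding with a running sum.
    let mut acc = base;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Fused: unpack, add the reference, and accumulate in one loop,
/// with no intermediate vectors.
fn decode_fused(packed: &[u64], len: usize, width: usize, reference: u64, base: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(len);
    let mut acc = base;
    for i in 0..len {
        let delta = read_bits(packed, i * width, width).wrapping_add(reference);
        acc = acc.wrapping_add(delta);
        out.push(acc);
    }
    out
}
```

For example, 7-bit residuals `[98, 1, 0, 3, 0]` with reference `2` and base `0` decode to deltas `[100, 3, 2, 5, 2]` and then to values `[100, 103, 105, 110, 112]` on either path.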

Note

Depends on spiraldb/fastlanes#140 (the delta_for_bitpacking kernel). This branch carries a temporary [patch.crates-io] pinning fastlanes to rev 267717cd72e8b6f0ed0e5321ae3fc785fa433058. It must be replaced by a published fastlanes version bump before merge — until then, Rust publish dry-run and Rust build (all-features) are expected red because crates.io fastlanes 0.5.0 has no delta_for_bitpacking feature (this is a standard stacked cross-repo PR: merge + release fastlanes first).

Feature flag

vortex-fastlanes: new unstable_encodings = ["fastlanes/delta_for_bitpacking"]. The fused path, its imports, the round-trip test, and the bench are all #[cfg(feature = "unstable_encodings")].
vortex-btrblocks's existing unstable_encodings feature propagates vortex-fastlanes/unstable_encodings.

With the feature off (default) the kernel and fast path are compiled out — no behavior or code-size change on the default build.
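The gating pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `decode`, `try_fused`, and `decode_generic` are hypothetical stand-ins operating on already-decoded deltas. The point is that with the feature off, the fast-path function and its call site do not exist in the compiled output.

```rust
/// Generic fallback: a plain running sum over the deltas.
fn decode_generic(deltas: &[u32], base: u32) -> Vec<u32> {
    let mut acc = base;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Hypothetical fast path, compiled only when the feature is enabled.
/// A real implementation would inspect the array shape and call the
/// fused kernel; this placeholder only models the `Option` contract.
#[cfg(feature = "unstable_encodings")]
fn try_fused(deltas: &[u32], base: u32) -> Option<Vec<u32>> {
    if deltas.is_empty() {
        return None;
    }
    Some(decode_generic(deltas, base))
}

/// Entry point: try the fast path when it is compiled in, else fall back.
fn decode(deltas: &[u32], base: u32) -> Vec<u32> {
    // The whole block is removed at compile time when the feature is off.
    #[cfg(feature = "unstable_encodings")]
    {
        if let Some(out) = try_fused(deltas, base) {
            return out;
        }
    }
    decode_generic(deltas, base)
}
```

The `#[cfg]` attribute is placed on a block rather than directly on the `if let`, since attributes on `if` expressions are not accepted by stable Rust.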

Tests

fused_for_bitpacking_roundtrip builds the stack from non-strictly-increasing u32/u64 columns, asserts the fused path is actually taken (not a silent fallback), and round-trips. cargo test -p vortex-fastlanes --lib delta:: (61 tests) passes; cargo clippy --all-targets --all-features, the default lib build, and nightly fmt --check are clean. The compat suite passes 35/35.
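The value of asserting the path, rather than only the round-tripped values, can be shown with a simplified model. Everything here is hypothetical: `DeltasShape`, `DecodePath`, and `choose_path` are illustrative names, and treating "full chunks" as a length that is a multiple of 1024 (the FastLanes vector size) is an assumption about the eligibility rule. A round-trip check alone would still pass if the decoder silently fell back, so the sketch checks the selected path separately.

```rust
#[derive(Debug, PartialEq)]
enum DecodePath {
    Fused,
    Generic,
}

/// Hypothetical, simplified description of the `deltas` child.
struct DeltasShape {
    unsigned_reference: bool,
    has_patches: bool,
    offset: usize,
    len: usize,
}

/// FastLanes operates on 1024-element vectors.
const CHUNK: usize = 1024;

/// Assumed eligibility rule: unsigned reference, no patches,
/// zero offset, and only full chunks.
fn choose_path(shape: &DeltasShape) -> DecodePath {
    let eligible = shape.unsigned_reference
        && !shape.has_patches
        && shape.offset == 0
        && shape.len % CHUNK == 0;
    if eligible {
        DecodePath::Fused
    } else {
        DecodePath::Generic
    }
}

/// Delta-encode against an implicit starting value of 0.
fn delta_encode(values: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Invert `delta_encode` with a running sum.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0u64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

A test in this style would build a non-decreasing column with repeated values, assert `choose_path` returns `DecodePath::Fused`, and only then compare the decoded output with the input.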

Performance — fused vs the real current Vortex decode

benches/delta_for_bitpack.rs A/Bs the real decode entry points on the same array: fused = delta_decompress (fast path) vs current = delta_decompress_generic (the pre-fusion path Vortex uses today). Cold each iteration, fastest time:

case	current Vortex	fused	speedup
u32, 64 Ki	146 µs	32.0 µs	4.6×
u32, 1 Mi	3.34 ms	600 µs	5.6×
u64, 64 Ki	81.9 µs	44.3 µs	1.85×
u64, 1 Mi	6.88 ms	1.00 ms	6.9×

The win is eliminating the intermediate FoR-decoded PrimitiveArray materialization (+ its validity mask + a second allocation/pass), not the kernel itself: the kernel is ~0.16 ns/elem while current spends ~3.3 ns/elem, i.e. ~95% of the current path is array machinery.
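The per-element figures can be reproduced with a short calculation. This sketch assumes the ~3.3 ns/elem number comes from the `u32, 1 Mi` row and that 1 Mi means 2^20 elements; the helper names are hypothetical, and the 0.16 ns/elem kernel cost is taken from the text as given.

```rust
/// Average cost per element, in nanoseconds.
fn ns_per_elem(total_ns: f64, elems: u64) -> f64 {
    total_ns / elems as f64
}

/// Fraction of the total per-element cost not explained by the kernel.
fn overhead_share(kernel_ns: f64, total_ns: f64) -> f64 {
    1.0 - kernel_ns / total_ns
}

/// Worked example using the `u32, 1 Mi` row: 3.34 ms for 2^20 elements.
fn current_path_breakdown() -> (f64, f64) {
    let current = ns_per_elem(3.34e6, 1 << 20); // roughly 3.2 ns/elem
    let share = overhead_share(0.16, current); // roughly 0.95
    (current, share)
}
```

Under those assumptions the current path costs about 3.2 ns per element, of which roughly 95% is outside the kernel, in line with the figures quoted above.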

Is the kernel itself optimal? (asm)

Yes — measured locally. The fused kernel is at parity with the shipped unfor_pack/undelta_pack (within ~3%), and wider SIMD regresses realistic widths (AVX2/AVX-512 ~10% slower than SSE2; asm is clean %ymm, zero shuffles — it's port-throughput/frequency-bound, not codegen). Details in spiraldb/fastlanes#140.

Code-size analysis

The kernel is monomorphized per (type × bit-width). Release libfastlanes rlib:
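As a rough illustration of why per-width monomorphization adds symbols, the sketch below uses a const generic for the bit width. It is not the fastlanes implementation: `unpack` and `unpack_dyn` are hypothetical, the layout is sequential, and only widths 1 to 4 are dispatched to keep the example short. Each `unpack::<W>` instantiation is compiled as its own function, so covering every width of every integer type multiplies the emitted code.

```rust
/// Unpack `len` values of `W` bits each from little-endian packed words.
/// Assumption: 1 <= W < 32 and a sequential (non-transposed) layout.
fn unpack<const W: usize>(packed: &[u32], len: usize) -> Vec<u32> {
    let mask: u32 = (1u32 << W) - 1;
    (0..len)
        .map(|i| {
            let bit = i * W;
            let (word, shift) = (bit / 32, bit % 32);
            let mut v = packed[word] >> shift;
            if shift + W > 32 {
                // Value straddles two words; take the high bits from the next.
                v |= packed[word + 1] << (32 - shift);
            }
            v & mask
        })
        .collect()
}

/// Runtime dispatch to the compile-time specialized versions.
/// Every arm forces a separate monomorphized copy of `unpack`.
fn unpack_dyn(width: usize, packed: &[u32], len: usize) -> Option<Vec<u32>> {
    Some(match width {
        1 => unpack::<1>(packed, len),
        2 => unpack::<2>(packed, len),
        3 => unpack::<3>(packed, len),
        4 => unpack::<4>(packed, len),
        // Widths not instantiated in this sketch.
        _ => return None,
    })
}
```

Because `W` is known at compile time, the mask and shift arithmetic can be constant-folded in each copy, which is the usual motivation for paying the code-size cost.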

`unstable_encodings`	new symbols	new `.text`
off (default)	0	0 B
on	128	~254 KiB

Fully opt-in via the feature.

🤖 Generated with Claude Code

Wire the new `fastlanes::Delta::unfor_undelta_pack` kernel into delta decompression. When a DeltaArray's `deltas` child is a FoR array (unsigned reference) wrapping a BitPacked array stored as full, zero-offset chunks with no patches, `delta_decompress` now takes a fully fused fast path (`try_fused_for_bitpacking` -> `decompress_fused`) that unpacks, applies the frame-of-reference, and inverts the delta encoding in a single pass per chunk before untransposing. All other shapes fall back to the existing generic path.

A round-trip test builds the stack from non-strictly-increasing (monotone non-decreasing) u32/u64 columns and asserts the fused path is actually taken. The `delta_for_bitpack` divan bench compares the fused decode against an unfused baseline (materialize the FoR(bitpacked) deltas, then generic delta decode). On non-decreasing columns the fused path is ~1.3-2.0x faster, with the gap widening at larger sizes and for u64.

A local-dev `[patch.crates-io]` points fastlanes at the sibling checkout that carries the kernel; it would be replaced by a published version bump. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Put the fused decode fast path (`try_fused_for_bitpacking` / `decompress_fused`), its imports, the round-trip test, and the bench behind a new `unstable_encodings` feature on vortex-fastlanes that enables `fastlanes/unstable`. With the feature off (the default) the kernel is compiled out entirely, so there is no `.text` cost; vortex-btrblocks' existing `unstable_encodings` feature now propagates it. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout, which does not exist in CI and broke workspace resolution for every job. Point it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace resolves and both default and all-features builds compile. To be replaced by a published fastlanes version bump once that PR merges. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Split the combined `use` statements into one item per line and regroup, matching the repo's nightly rustfmt config (imports_granularity = "Item", group_imports = "StdExternalCrate"). No functional change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq · 2026-06-02T18:40:37Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 1275 untouched benchmarks
🆕 8 new benchmarks

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
🆕	Simulation	`current_u64[65536]`	N/A	654 µs	N/A
🆕	Simulation	`fused_u32[65536]`	N/A	235.3 µs	N/A
🆕	Simulation	`fused_u64[65536]`	N/A	378 µs	N/A
🆕	Simulation	`fused_u64[1048576]`	N/A	5.7 ms	N/A
🆕	Simulation	`fused_u32[1048576]`	N/A	3.5 ms	N/A
🆕	Simulation	`current_u32[1048576]`	N/A	5.6 ms	N/A
🆕	Simulation	`current_u64[1048576]`	N/A	13.5 ms	N/A
🆕	Simulation	`current_u32[65536]`	N/A	379.2 µs	N/A

Comparing claude/delta-bitpacking-fastlanes-V6mTZ (1565f71) with develop (81046d7)

Point `unstable_encodings` at `fastlanes/delta_for_bitpacking` and bump the patched fastlanes git revision accordingly. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Expose `delta_decompress` / `delta_decompress_generic` under the `_test-harness` feature and rewrite the bench so both arms call the real decode entry points on the identical delta(for(bitpacking)) array: `fused` (the unfor_undelta_pack fast path) vs `current` (the pre-fusion generic decode). The previous baseline reused a cached intermediate and understated the gap; the cold-vs-cold comparison shows ~4.6x (u32 64Ki) to ~6.9x (u64 1Mi), dominated by avoiding the intermediate FoR-decoded PrimitiveArray materialization rather than kernel speed. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the branch for reproducibility. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

claude added 2 commits June 2, 2026 18:07

joseph-isaacs added the changelog/performance label ("A performance improvement") Jun 2, 2026, with Claude

claude added 2 commits June 2, 2026 18:33

claude added 3 commits June 2, 2026 20:05

Track fastlanes feature rename to delta_for_bitpacking

1d5c569

Pin fastlanes patch to rev 267717c

1565f71

joseph-isaacs wants to merge 7 commits into develop from claude/delta-bitpacking-fastlanes-V6mTZ
