Fused delta(for(bitpacking)) decode (unstable_encodings) #8224

Draft
joseph-isaacs wants to merge 7 commits into develop from claude/delta-bitpacking-fastlanes-V6mTZ

@joseph-isaacs joseph-isaacs commented Jun 2, 2026

Summary

Wires the new fastlanes::Delta::unfor_undelta_pack fused kernel into delta decompression, behind a new default-off unstable_encodings feature.

When a DeltaArray's deltas child is a FoR array (unsigned reference) wrapping a BitPacked array stored as full, zero-offset chunks with no patches, delta_decompress takes a fully fused fast path (try_fused_for_bitpacking -> decompress_fused): each chunk is unpacked, FoR-decoded, and delta-decoded in a single pass before untransposing. Every other shape (signed reference, patches, sliced bit-packing) falls back to the existing generic path unchanged.
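The eligibility check described above can be sketched as a function that returns `Some` only for the one supported shape and lets the caller fall back otherwise. This is a minimal model, not Vortex's code: the `Deltas` enum, its field names, `Path`, `try_fused`, and `choose_path` are all hypothetical, and treating "full chunks" as a length that is a multiple of 1024 elements is an assumption.

```rust
// Hypothetical, simplified model of the array shapes named in the summary.
// These are NOT Vortex's real array types.
#[derive(Debug, Clone)]
enum Deltas {
    /// A FoR array wrapping bit-packed data.
    ForBitPacked {
        reference_is_unsigned: bool,
        has_patches: bool,
        offset: usize,
        len: usize,
    },
    /// Any other encoding of the deltas child.
    Other,
}

// Assumption: a "full chunk" is one 1024-element FastLanes vector.
const CHUNK: usize = 1024;

#[derive(Debug, PartialEq)]
enum Path {
    Fused,
    Generic,
}

/// Returns `Some` only when every precondition of the fused path holds.
fn try_fused(deltas: &Deltas) -> Option<Path> {
    match deltas {
        Deltas::ForBitPacked {
            reference_is_unsigned: true,
            has_patches: false,
            offset: 0,
            len,
        } if len % CHUNK == 0 => Some(Path::Fused),
        // Signed reference, patches, sliced data, or another encoding.
        _ => None,
    }
}

/// Every shape the fused path rejects goes through the generic path.
fn choose_path(deltas: &Deltas) -> Path {
    try_fused(deltas).unwrap_or(Path::Generic)
}
```

Returning `Option` from the fast-path probe keeps the fallback decision in one place and lets a test assert that the fused path was really taken rather than silently skipped.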

Note

Depends on spiraldb/fastlanes#140 (the delta_for_bitpacking kernel). This branch carries a temporary [patch.crates-io] pinning fastlanes to rev 267717cd72e8b6f0ed0e5321ae3fc785fa433058. It must be replaced by a published fastlanes version bump before merge. Until then, the Rust publish dry-run and Rust build (all-features) jobs are expected to be red, because crates.io fastlanes 0.5.0 has no delta_for_bitpacking feature. This is a standard stacked cross-repo PR: merge and release fastlanes first.

Feature flag

  • vortex-fastlanes: new unstable_encodings = ["fastlanes/delta_for_bitpacking"]. The fused path, its imports, the round-trip test, and the bench are all #[cfg(feature = "unstable_encodings")].
  • vortex-btrblocks's existing unstable_encodings feature propagates vortex-fastlanes/unstable_encodings.

With the feature off (default) the kernel and fast path are compiled out, so the default build sees no behavior or code-size change.
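A minimal sketch of how such gating typically looks in Rust, assuming a crate feature named `unstable_encodings`. The function signatures here are simplified stand-ins working on plain slices, not the real Vortex entry points, and the body of `try_fused` is only a placeholder for the fused kernel.

```rust
/// Generic path: always compiled. Decodes deltas by running a prefix sum.
fn delta_decompress_generic(deltas: &[u32], base: u32) -> Vec<u32> {
    let mut acc = base;
    deltas
        .iter()
        .map(|d| {
            acc = acc.wrapping_add(*d);
            acc
        })
        .collect()
}

/// Fast path: this item only exists when the feature is enabled, so with the
/// feature off it contributes no code to the build.
#[cfg(feature = "unstable_encodings")]
fn try_fused(deltas: &[u32], base: u32) -> Option<Vec<u32>> {
    // Placeholder: a real implementation would call the fused kernel here.
    Some(delta_decompress_generic(deltas, base))
}

/// Entry point: tries the fast path when it is compiled in, then falls back.
fn delta_decompress(deltas: &[u32], base: u32) -> Vec<u32> {
    // The attribute sits on a block because attributes on `if` expressions
    // are not accepted by stable Rust.
    #[cfg(feature = "unstable_encodings")]
    {
        if let Some(out) = try_fused(deltas, base) {
            return out;
        }
    }
    delta_decompress_generic(deltas, base)
}
```

Built without the feature, `delta_decompress` reduces to a direct call to the generic path, which matches the "no behavior change on the default build" claim.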

Tests

fused_for_bitpacking_roundtrip builds the stack from non-strictly-increasing u32/u64 columns, asserts the fused path is actually taken (not a silent fallback), and round-trips. cargo test -p vortex-fastlanes --lib delta:: (61 tests) passes; cargo clippy --all-targets --all-features, the default lib build, and nightly fmt --check are clean. The compat suite passes 35/35.

Performance — fused vs the real current Vortex decode

benches/delta_for_bitpack.rs A/Bs the real decode entry points on the same array: fused = delta_decompress (fast path) vs current = delta_decompress_generic (the pre-fusion path Vortex uses today). Cold each iteration, fastest time:

| case | current Vortex | fused | speedup |
|---|---|---|---|
| u32, 64 Ki | 146 µs | 32.0 µs | 4.6× |
| u32, 1 Mi | 3.34 ms | 600 µs | 5.6× |
| u64, 64 Ki | 81.9 µs | 44.3 µs | 1.85× |
| u64, 1 Mi | 6.88 ms | 1.00 ms | 6.9× |

The win is eliminating the intermediate FoR-decoded PrimitiveArray materialization (+ its validity mask + a second allocation/pass), not the kernel itself: the kernel is ~0.16 ns/elem while current spends ~3.3 ns/elem, i.e. ~95% of the current path is array machinery.
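The difference between the two paths can be illustrated with a scalar sketch: the two-pass version materializes the FoR-decoded deltas in an intermediate buffer before the prefix sum, while the fused version produces the output in one loop. This is a simplification under stated assumptions: it uses a plain little-endian bit stream with widths 1 to 64 rather than the transposed 1024-element FastLanes layout, it ignores validity, and all function names are hypothetical.

```rust
/// Packs `values` (each assumed to fit in `width` bits, 1..=64) into u64 words.
fn pack(values: &[u64], width: u32) -> Vec<u64> {
    let mut out = vec![0u64; (values.len() * width as usize + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * width as usize;
        let (word, shift) = (bit / 64, (bit % 64) as u32);
        out[word] |= v << shift;
        if shift + width > 64 {
            // The value straddles a word boundary: spill the high bits.
            out[word + 1] |= v >> (64 - shift);
        }
    }
    out
}

/// Reads the i-th `width`-bit value from the packed stream.
fn get_packed(packed: &[u64], width: u32, i: usize) -> u64 {
    let bit = i * width as usize;
    let (word, shift) = (bit / 64, (bit % 64) as u32);
    let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
    let lo = packed[word] >> shift;
    let hi = if shift + width > 64 { packed[word + 1] << (64 - shift) } else { 0 };
    (lo | hi) & mask
}

/// Two passes: unpack + FoR into an intermediate Vec, then prefix-sum it.
fn decode_two_pass(packed: &[u64], width: u32, n: usize, reference: u64, base: u64) -> Vec<u64> {
    let deltas: Vec<u64> = (0..n)
        .map(|i| get_packed(packed, width, i).wrapping_add(reference))
        .collect(); // intermediate allocation and extra pass
    let mut acc = base;
    deltas
        .iter()
        .map(|d| {
            acc = acc.wrapping_add(*d);
            acc
        })
        .collect()
}

/// One pass: unpack, add the FoR reference, and accumulate in the same loop.
fn decode_fused(packed: &[u64], width: u32, n: usize, reference: u64, base: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    let mut acc = base;
    for i in 0..n {
        acc = acc.wrapping_add(get_packed(packed, width, i).wrapping_add(reference));
        out.push(acc);
    }
    out
}
```

Both functions return the same values; the fused one simply never allocates or re-reads the `deltas` buffer, which is the cost the paragraph above attributes most of the current path's time to.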

Is the kernel itself optimal? (asm)

Yes — measured locally. The fused kernel is at parity with the shipped unfor_pack/undelta_pack (within ~3%), and wider SIMD regresses realistic widths (AVX2/AVX-512 ~10% slower than SSE2; asm is clean %ymm, zero shuffles — it's port-throughput/frequency-bound, not codegen). Details in spiraldb/fastlanes#140.

Code-size analysis

The kernel is monomorphized per (type × bit-width). Release libfastlanes rlib:
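As a rough illustration of why code size scales with the number of (type, bit-width) combinations, the sketch below uses a const generic width. Each distinct `W` that is instantiated becomes its own compiled function, and a runtime `match` selects among them. The names `unpack_one` and `unpack_dyn` are hypothetical, and the sketch covers only four widths of a single `u64` word where values do not straddle the word.

```rust
/// Extracts the i-th W-bit field from `word`.
/// W is a compile-time constant (1..=64), so the mask and shift stride are
/// baked into each instantiation. Requires i * W + W <= 64.
fn unpack_one<const W: u32>(word: u64, i: u32) -> u64 {
    let mask = u64::MAX >> (64 - W);
    (word >> (i * W)) & mask
}

/// Runtime dispatch onto the monomorphized instances.
/// Every arm below forces a separate copy of `unpack_one` into the binary.
fn unpack_dyn(width: u32, word: u64, i: u32) -> u64 {
    match width {
        1 => unpack_one::<1>(word, i),
        2 => unpack_one::<2>(word, i),
        4 => unpack_one::<4>(word, i),
        8 => unpack_one::<8>(word, i),
        _ => panic!("width not covered by this sketch"),
    }
}
```

A full kernel set repeats this for every supported width and every integer type, which is where a symbol count in the hundreds comes from.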

| unstable_encodings | new symbols | new .text |
|---|---|---|
| off (default) | 0 | 0 B |
| on | 128 | ~254 KiB |

Fully opt-in via the feature.

🤖 Generated with Claude Code

claude added 2 commits June 2, 2026 18:07
Wire the new `fastlanes::Delta::unfor_undelta_pack` kernel into delta
decompression. When a DeltaArray's `deltas` child is a FoR array (unsigned
reference) wrapping a BitPacked array stored as full, zero-offset chunks with
no patches, `delta_decompress` now takes a fully fused fast path
(`try_fused_for_bitpacking` -> `decompress_fused`) that unpacks, applies the
frame-of-reference, and inverts the delta encoding in a single pass per chunk
before untransposing. All other shapes fall back to the existing generic path.

A round-trip test builds the stack from non-strictly-increasing (monotone
non-decreasing) u32/u64 columns and asserts the fused path is actually taken.

The `delta_for_bitpack` divan bench compares the fused decode against an
unfused baseline (materialize the FoR(bitpacked) deltas, then generic delta
decode). On non-decreasing columns the fused path is ~1.3-2.0x faster, with the
gap widening at larger sizes and for u64.

A local-dev `[patch.crates-io]` points fastlanes at the sibling checkout that
carries the kernel; it would be replaced by a published version bump.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Put the fused decode fast path (`try_fused_for_bitpacking` /
`decompress_fused`), its imports, the round-trip test, and the bench behind a
new `unstable_encodings` feature on vortex-fastlanes that enables
`fastlanes/unstable`. With the feature off (the default) the kernel is compiled
out entirely, so there is no `.text` cost; vortex-btrblocks' existing
`unstable_encodings` feature now propagates it.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
joseph-isaacs added the changelog/performance label (A performance improvement) on Jun 2, 2026, with Claude
claude added 2 commits June 2, 2026 18:33
The `[patch.crates-io]` previously pointed at a sibling `../fastlanes` checkout,
which does not exist in CI and broke workspace resolution for every job. Point
it at the pushed fastlanes branch (spiraldb/fastlanes#140) so the workspace
resolves and both default and all-features builds compile. To be replaced by a
published fastlanes version bump once that PR merges.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Split the combined `use` statements into one item per line and regroup, matching
the repo's nightly rustfmt config (imports_granularity = "Item",
group_imports = "StdExternalCrate"). No functional change.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq Bot commented Jun 2, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 1275 untouched benchmarks
🆕 8 new benchmarks

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | current_u64[65536] | N/A | 654 µs | N/A |
| 🆕 | Simulation | fused_u32[65536] | N/A | 235.3 µs | N/A |
| 🆕 | Simulation | fused_u64[65536] | N/A | 378 µs | N/A |
| 🆕 | Simulation | fused_u64[1048576] | N/A | 5.7 ms | N/A |
| 🆕 | Simulation | fused_u32[1048576] | N/A | 3.5 ms | N/A |
| 🆕 | Simulation | current_u32[1048576] | N/A | 5.6 ms | N/A |
| 🆕 | Simulation | current_u64[1048576] | N/A | 13.5 ms | N/A |
| 🆕 | Simulation | current_u32[65536] | N/A | 379.2 µs | N/A |

Comparing claude/delta-bitpacking-fastlanes-V6mTZ (1565f71) with develop (81046d7)

claude added 3 commits June 2, 2026 20:05
Point `unstable_encodings` at `fastlanes/delta_for_bitpacking` and bump the
patched fastlanes git revision accordingly.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Expose `delta_decompress` / `delta_decompress_generic` under the `_test-harness`
feature and rewrite the bench so both arms call the real decode entry points on
the identical delta(for(bitpacking)) array: `fused` (the unfor_undelta_pack fast
path) vs `current` (the pre-fusion generic decode). The previous baseline reused
a cached intermediate and understated the gap; the cold-vs-cold comparison shows
~4.6x (u32 64Ki) to ~6.9x (u64 1Mi), dominated by avoiding the intermediate
FoR-decoded PrimitiveArray materialization rather than kernel speed.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Reference the exact fastlanes revision (spiraldb/fastlanes#140) instead of the
branch for reproducibility.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>