
perf(fastlanes): fuse bit-packed compare into a transposed mask + untranspose#8239

Open
joseph-isaacs wants to merge 8 commits into develop from claude/confident-hamilton-mZIEo


@joseph-isaacs joseph-isaacs commented Jun 3, 2026

Summary

Stacked on #8238 (the benchmark) so the change lands as a CodSpeed diff.

Replaces the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused unpack_cmp:

  • compare each value as it is unpacked, accumulating results straight into a transposed 1024-bit mask ([u64; 16], one register-resident word per lane — no [bool; 1024]/[T; 1024] scratch),
  • a single SIMD untranspose_bits per block rotates the mask into logical row order, copied directly into the output bit buffer,
  • inline patches are spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar streaming predicate.
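
The first two steps above can be sketched in scalar Rust. This is only an illustration under stated assumptions: the values are already unpacked (the real kernel compares while unpacking), the lane-major bit layout used here is a simplification rather than the actual FastLanes transposed order, and the function names `compare_lt_transposed` and `untranspose_bits_scalar` are hypothetical stand-ins for the fused `unpack_cmp` and the SIMD `untranspose_bits`.

```rust
/// Values per block and number of 64-bit mask words (one word per lane).
const BLOCK: usize = 1024;
const LANES: usize = 16;

/// Compare every value against `rhs` and accumulate the results into a
/// lane-major ("transposed") mask: element `i` sets bit `i / LANES` of
/// word `i % LANES`.
/// ASSUMPTION: this layout is a simplification, not the real FastLanes order.
fn compare_lt_transposed(values: &[u32; BLOCK], rhs: u32) -> [u64; LANES] {
    let mut mask = [0u64; LANES];
    for row in 0..BLOCK / LANES {
        for lane in 0..LANES {
            // Each lane only ever touches its own mask word.
            let hit = (values[row * LANES + lane] < rhs) as u64;
            mask[lane] |= hit << row;
        }
    }
    mask
}

/// Rotate the lane-major mask into logical row order: afterwards, bit
/// `i % 64` of word `i / 64` holds the result for element `i`.
fn untranspose_bits_scalar(mask: &[u64; LANES]) -> [u64; LANES] {
    let mut out = [0u64; LANES];
    for i in 0..BLOCK {
        let bit = (mask[i % LANES] >> (i / LANES)) & 1;
        out[i / 64] |= bit << (i % 64);
    }
    out
}
```

The point of the layout is that the inner loop keeps the whole 1024-bit result in sixteen words and never materialises a `[bool; 1024]` or `[T; 1024]` buffer; the reordering cost is paid once per block in the untranspose step.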

Add `bitpack_compare_sweep`, which exercises the public `array.binary(rhs, op)` compare-against-constant path over all eight integer types and every valid bit width (64Ki in-range elements per case, no patches). It isolates the `<BitPacked as CompareKernel>` unpack + per-element compare kernel so a kernel change shows up as a CodSpeed diff.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 3, 2026 — with Claude

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 2 regressed benchmarks
✅ 1504 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 269.9 µs | 304.9 µs | -11.47% |
| Simulation | baseline_lt[16, 65536] | 216.1 µs | 244.1 µs | -11.44% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.1 µs | 160.7 µs | +22.59% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/confident-hamilton-mZIEo (211903c) with develop (4e6e9ed)


…ranspose

Replace the unpack-then-compare streaming kernel for compare-against-constant
with the FastLanes fused `unpack_cmp`: compare each value as it is unpacked,
accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`,
one register-resident word per lane - no `[bool; 1024]`/`[T; 1024]` scratch),
then a single SIMD `untranspose_bits` per block rotates the mask into logical
row order, copied directly into the output bit buffer. Inline patches are
spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar
streaming predicate.
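
A minimal sketch of the patch-splicing step, assuming patches arrive as parallel slices of logical indices and their true (unpacked) values, and that the output mask is already in logical row order. The name `splice_patches_lt` and the slice-based patch representation are hypothetical, chosen for illustration only:

```rust
/// Overwrite the mask bits at patched positions with the predicate evaluated
/// on the patch values. The packed slot at a patched index does not hold the
/// real value, so the bit produced by the fused kernel there is discarded.
/// ASSUMPTION: `indices` and `values` are hypothetical parallel slices.
fn splice_patches_lt(bits: &mut [u64], indices: &[usize], values: &[u32], rhs: u32) {
    for (&idx, &value) in indices.iter().zip(values) {
        let word = idx / 64;
        let bit = 1u64 << (idx % 64);
        if value < rhs {
            bits[word] |= bit; // patched value satisfies the predicate
        } else {
            bits[word] &= !bit; // clear whatever the packed slot produced
        }
    }
}
```

Running this after the block kernel keeps the hot loop free of per-element patch checks, at the cost of one extra pass over the (typically short) patch list.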

This requires the in-development FastLanes (PR #141 fused mask + PR #145
width-generic BMI2/VBMI untranspose), pinned via a git patch until released.

Benchmarked end-to-end through the public compare path (`bitpack_compare_sweep`,
64Ki elements, all integer types and bit widths): fused beats the streaming
baseline for every type and width -

  i8/u8   ~6.2-7.7x
  i16/u16 ~4.5-6.0x
  i32/u32 ~1.9-4.3x
  i64/u64 ~1.2-1.9x

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the claude/confident-hamilton-mZIEo branch from e27f5f4 to 48da899 on June 3, 2026 17:00
Base automatically changed from claude/confident-hamilton-mZIEo-benches to develop June 4, 2026 10:07
Comment thread Cargo.toml Outdated
Comment on lines +418 to +419
[patch.crates-io]
fastlanes = { git = "https://github.com/spiraldb/fastlanes", rev = "6c10ea72cf693a17e994aa6401604ebedbeda453" }
Contributor Author


We will remove this before we merge this PR

@joseph-isaacs joseph-isaacs added the do not merge Pull requests that are not intended to merge label Jun 4, 2026
claude and others added 3 commits June 4, 2026 10:31
…space

wasm-test is excluded from the workspace, so it does not inherit the root
[patch.crates-io] and was building vortex-fastlanes against published fastlanes
0.5.0 (old `[bool;1024]` unpack_cmp, no `untranspose_bits`) -> compile error in
compare_fused.rs. Add the matching git `rev` pin here. Temporary, like the root
pin: both are removed when a FastLanes release is cut and the version is bumped.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs removed the do not merge Pull requests that are not intended to merge label Jun 4, 2026
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Labels

changelog/performance A performance improvement

2 participants