Add per-variant CPU feature gating to bit_transpose benchmarks #8227

Draft
joseph-isaacs wants to merge 2 commits into develop from claude/bituntranspose-bench-variants-ArFmf

@joseph-isaacs
Contributor

What

Trial of explicit, per-benchmark CPU feature-set / architecture gating, applied to the bit_transpose benchmarks. Each benchmark declares inline which variants it should be measured under in CI, while a plain cargo bench ignores all gating and runs everything once on the host.

Single source of truth = the compile-time BENCH_VARIANT env var, which drives both the name prefix (so arch-neutral scalar benches don't collide in CodSpeed) and the gate (run vs skip).

Macros (encodings/fastlanes/benches/shared/mod.rs)

  • variant!("name") — prefixes the bench name with the active variant.
  • variant_tag!(ident) — maps a known variant identifier to its string tag; an unknown identifier fails to compile (typo safety).
  • ignore_unless_variant!(...) — expands to divan's ignore boolean: skip unless BENCH_VARIANT=local (default) or the active variant is one of the listed feature sets.
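A minimal sketch of how three such macros could fit together is shown below. It is not the code in encodings/fastlanes/benches/shared/mod.rs: the `should_ignore` helper, the use of `option_env!` for the default, the `::` separator built with `format!`, and the exact macro bodies are all assumptions made so the example compiles on its own without divan.

```rust
/// Active variant, fixed at compile time. Assumption: fall back to "local"
/// via option_env! so this sketch builds even when BENCH_VARIANT is unset.
pub const BENCH_VARIANT: &str = match option_env!("BENCH_VARIANT") {
    Some(v) => v,
    None => "local",
};

/// Maps a known variant identifier to its string tag. Any other identifier
/// matches no rule, so a typo is a compile error.
macro_rules! variant_tag {
    (simulation) => { "simulation" };
    (x86_64) => { "x86_64" };
    (aarch64) => { "aarch64" };
}

/// Hypothetical helper: true when a benchmark should be skipped.
/// "local" never skips; otherwise the active variant must be listed.
pub fn should_ignore(active: &str, allowed: &[&str]) -> bool {
    active != "local" && !allowed.contains(&active)
}

/// Expands to a boolean suitable for an `ignore`-style flag.
macro_rules! ignore_unless_variant {
    ($($tag:ident),+ $(,)?) => {
        should_ignore(BENCH_VARIANT, &[$(variant_tag!($tag)),+])
    };
}

/// Prefixes a benchmark name with the active variant, e.g. "local::name".
macro_rules! variant {
    ($name:expr) => { format!("{}::{}", BENCH_VARIANT, $name) };
}
```

Under these assumptions, `ignore_unless_variant!(simulation, x86_64)` evaluates to `false` in a default local build and to `true` when the crate is compiled with `BENCH_VARIANT=aarch64`, while `variant_tag!(aarch65)` does not compile.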

Per-benchmark tags (bit_transpose.rs)

| Benchmarks        | Tags                        |
|-------------------|-----------------------------|
| scalar (baseline) | simulation, x86_64, aarch64 |
| bmi2 / vbmi       | simulation, x86_64          |
| neon              | aarch64                     |

CI (.github/workflows/codspeed.yml)

  • The existing bench-codspeed job now builds with BENCH_VARIANT=simulation, so the simulation-tagged variants run there in simulation mode (x86_64 + avx2) — no local:: rename, no duplication.
  • New bench-codspeed-bittranspose job: walltime legs on real silicon, one per architecture, each building only --bench bit_transpose with its own target features + BENCH_VARIANT:
    • x86_64: amd64-medium / ubuntu24-full-x64-pre-v2, -C target-feature=+avx2
    • aarch64: arm64-medium / ubuntu24-full-arm64-pre-v2, -C target-feature=+neon

Behavior

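| Context                  | BENCH_VARIANT | What runs            | Names              | Mode           |
|--------------------------|---------------|----------------------|--------------------|----------------|
| Local cargo bench        | local         | all benches once     | `local::<fn>`      | divan walltime |
| bench-codspeed           | simulation    | scalar + bmi2 + vbmi | `simulation::<fn>` | simulation     |
| bittranspose x86_64 leg  | x86_64        | scalar + bmi2 + vbmi | `x86_64::<fn>`     | walltime       |
| bittranspose aarch64 leg | aarch64       | scalar + neon        | `aarch64::<fn>`    | walltime       |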
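
The "What runs" column follows mechanically from the per-benchmark tags. The sketch below transcribes those tags into a table and applies the gating rule, as a way to check the two tables against each other. The `TAGS` constant and the `runs_under` function are hypothetical names used only for this illustration.

```rust
/// Per-benchmark tags, transcribed from the "Per-benchmark tags" table.
const TAGS: &[(&str, &[&str])] = &[
    ("scalar", &["simulation", "x86_64", "aarch64"]),
    ("bmi2", &["simulation", "x86_64"]),
    ("vbmi", &["simulation", "x86_64"]),
    ("neon", &["aarch64"]),
];

/// Benchmark groups that are not ignored when built with `variant`.
/// "local" bypasses gating entirely; any other variant must be listed.
pub fn runs_under(variant: &str) -> Vec<&'static str> {
    TAGS.iter()
        .filter(|(_, tags)| variant == "local" || tags.contains(&variant))
        .map(|(name, _)| *name)
        .collect()
}
```

This covers gating only. Whether a non-ignored benchmark actually registers can still depend on the host, as with vbmi on a machine without AVX512-VBMI.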

Checks

  • cargo build / cargo clippy --all-features / cargo +nightly fmt --check on the bench — clean.
  • yamllint --strict -c .yamllint.yaml on the workflow — clean.
  • Runtime gating verified on an x86 host: BENCH_VARIANT=aarch64 skips bmi2/vbmi (reported as "(ignored)") and runs only the scalar baselines; BENCH_VARIANT=x86_64 runs scalar + bmi2 (vbmi shows the pre-existing "no function registered" warning because the dev host lacks AVX512-VBMI).

Notes:

  • The vbmi path shares the x86_64 build rather than forcing global +avx512vbmi (which risks SIGILL in surrounding code on non-AVX512 runners); the #[target_feature] intrinsics + has_vbmi() runtime guard handle it safely.
  • Variant tags are arch-level (x86_64/aarch64) rather than avx2/neon, because bit_transpose's x86 paths (BMI2/VBMI) are runtime-selected within a single x86 build.
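
The general pattern behind the first note, a `#[target_feature]` function reached only through a runtime check, can be sketched as follows. This is an illustration using BMI2 and a bit-extract operation, not the bit_transpose kernels or the actual `has_vbmi()` guard. The function names `pext`, `pext_scalar`, and `pext_bmi2` are hypothetical.

```rust
/// Portable reference: gather the bits of `value` selected by `mask`
/// into the low bits of the result.
pub fn pext_scalar(value: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut out_bit = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set bit
        if value & lowest != 0 {
            out |= 1u64 << out_bit;
        }
        out_bit += 1;
        mask &= mask - 1; // clear lowest set bit
    }
    out
}

/// BMI2 path: only this function is compiled with the feature enabled,
/// so the rest of the binary stays safe on CPUs without it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn pext_bmi2(value: u64, mask: u64) -> u64 {
    // The unsafe block may be flagged as redundant on newer compilers.
    unsafe { core::arch::x86_64::_pext_u64(value, mask) }
}

/// Runtime-dispatched entry point with a scalar fallback.
pub fn pext(value: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            // SAFETY: the runtime check above confirms BMI2 is available.
            return unsafe { pext_bmi2(value, mask) };
        }
    }
    pext_scalar(value, mask)
}
```

Because the feature is enabled per function rather than with a global `-C target-feature` flag, the same binary runs on hosts with and without the instruction set, which is the property the note relies on for vbmi.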

https://claude.ai/code/session_01MkzByEJLta4WN2vLqRyvZ1


Generated by Claude Code

Each bit_transpose benchmark now declares, inline, which CPU feature
sets / architectures it should be measured under in CI, via a small set
of macros driven by the compile-time BENCH_VARIANT environment variable:

- variant! prefixes the benchmark name with the active variant so the
  architecture-neutral scalar benchmarks (which run on every leg) do not
  collide in CodSpeed.
- variant_tag! maps a known variant identifier to its string tag; an
  unknown identifier fails to compile, giving typo-safe tags.
- ignore_unless_variant! expands to divan's `ignore` boolean, skipping a
  benchmark unless we run locally (BENCH_VARIANT=local, the default) or
  the active variant is one of the listed feature sets.

A plain `cargo bench` leaves BENCH_VARIANT at its `local` default (set in
.cargo/config.toml) and runs every benchmark once on the host. CI sets
BENCH_VARIANT per leg:

- the existing bench-codspeed job builds with BENCH_VARIANT=simulation, so
  the simulation-tagged scalar/bmi2/vbmi variants run there in simulation
  mode on x86_64+avx2;
- a new bench-codspeed-bittranspose job adds walltime legs on real
  silicon, one per architecture (x86_64 with +avx2, aarch64 with +neon),
  each building only the bit_transpose bench with its own target features.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs added the changelog/skip label (Do not list PR in the changelog) on Jun 2, 2026, with Claude
The bit_transpose aarch64 CodSpeed leg failed in the system-info step:
`grep -m1 "model name" /proc/cpuinfo` returns no match on ARM (no such
line; the model is shown by lscpu), and under GitHub's `bash -e` the
failing grep aborts the otherwise-diagnostic step. ARM also exposes CPU
features as "Features" rather than "flags".

Make both cpuinfo greps non-fatal and match the aarch64 "Features" line
so the diagnostic step never fails the build on either architecture.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
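
The actual fix lives in the workflow's shell step and is not reproduced here. As an illustration of the same tolerant matching, the sketch below looks up a cpuinfo field under any of several key names and returns `None` instead of failing when no line matches. The function name `cpuinfo_field` is hypothetical.

```rust
/// Returns the first value whose key is one of `keys` in
/// /proc/cpuinfo-style "key : value" text, or None if nothing matches.
pub fn cpuinfo_field<'a>(cpuinfo: &'a str, keys: &[&str]) -> Option<&'a str> {
    cpuinfo.lines().find_map(|line| {
        // Lines without a ':' separator are skipped rather than treated as errors.
        let (key, value) = line.split_once(':')?;
        keys.contains(&key.trim()).then(|| value.trim())
    })
}
```

Calling it with `&["flags", "Features"]` covers both spellings described above, and a missing "model name" line yields `None` rather than aborting the caller.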