Skip to content

Parametric Sectorized Bloom filter policy#808

Open
sleeepyjack wants to merge 87 commits into
NVIDIA:devfrom
sleeepyjack:bloom-filter-release
Open

Parametric Sectorized Bloom filter policy#808
sleeepyjack wants to merge 87 commits into
NVIDIA:devfrom
sleeepyjack:bloom-filter-release

Conversation

@sleeepyjack

@sleeepyjack sleeepyjack commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator

Lands the GPU bloom filter optimizations from arXiv:2512.15595.

sleeepyjack and others added 30 commits September 9, 2025 08:53
…mpilation of example is choking on #pragma unroll.
@sleeepyjack

Copy link
Copy Markdown
Collaborator Author

/ok to test 20be4e3

@sleeepyjack sleeepyjack left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review

Comment thread benchmarks/bloom_filter/add_bench.cu Outdated
Comment thread benchmarks/bloom_filter/contains_bench.cu Outdated
Comment thread include/cuco/detail/bloom_filter/parametric_filter_policy.cuh Outdated
Comment thread include/cuco/detail/bloom_filter/bloom_filter_impl.cuh Outdated
Comment thread include/cuco/detail/bloom_filter/bloom_filter_impl.cuh
Comment thread include/cuco/detail/bloom_filter/bloom_filter_impl.cuh Outdated
@PointKernel

Copy link
Copy Markdown
Member

@srinivasyadav18 could use your help reviewing this PR as well

@sleeepyjack sleeepyjack self-assigned this Jun 17, 2026
@sleeepyjack sleeepyjack added Needs Review Awaiting reviews before merging and removed In Progress Currently a work in progress labels Jun 17, 2026
@sleeepyjack sleeepyjack changed the title Bloom filter overhaul Parametric Sectorized Bloom filter policy Jun 17, 2026
@sleeepyjack sleeepyjack marked this pull request as ready for review June 17, 2026 23:41
@sleeepyjack sleeepyjack requested a review from PointKernel as a code owner June 17, 2026 23:41
@sleeepyjack

Copy link
Copy Markdown
Collaborator Author

/ok to test 8b1e995

@sleeepyjack sleeepyjack added the topic: performance Performance related issue label Jun 17, 2026
@sleeepyjack

Copy link
Copy Markdown
Collaborator Author

/ok to test eae3049

@PointKernel PointKernel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual code changes are not that big, but the use of work stealing definitely caught my attention.

@sleeepyjack, could you please review all files touched by this PR and make sure the copyright years are updated where necessary?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleeepyjack Could you please share a performance comparison between the baseline and the current implementation? It would be helpful to have those numbers documented for future reference.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on a small ablation study testing all those different tuning knobs. This will also help answer some of your other comments.

Comment on lines +150 to +165
// Exhaustive sweep across block sizes and vectorization layouts. Uncomment for performance
// tuning / paper-style characterization; not run by default because the matrix is large.
// NVBENCH_BENCH_TYPES(
// bloom_filter_contains,
// NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
// nvbench::type_list<nvbench::uint64_t, nvbench::uint32_t>, ///< Word
// nvbench::enum_type_list<64, 128, 256, 512, 1024>, ///< BlockBits
// nvbench::enum_type_list<8, 16>, ///< PatternBits
// nvbench::enum_type_list<1, 2, 4, 8, 16>, ///<
// HorizontalLayout nvbench::enum_type_list<1, 2, 4, 8, 16> ///< VerticalLayout
// ))
// .set_name("bloom_filter_contains_full_sweep_u64")
// .set_type_axes_names(
// {"Key", "Word", "BlockBits", "PatternBits", "HorizontalLayout", "VerticalLayout"})
// .add_int64_axis("NumInputs", {defaults::BF_N})
// .add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE); No newline at end of file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we remove it since unused?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about adding a flag to enable more extensive benchmarks, since compile- and runtime for these setups can be quite long. Maybe in a follow-up PR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice cleanup


/**
* @brief A GPU-accelerated Blocked Bloom Filter.
* @brief A GPU-accelerated Bloom filter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief A GPU-accelerated Bloom filter.
* @brief A GPU-accelerated Bloom Filter.


/**
* @brief A GPU-accelerated Blocked Bloom Filter.
* @brief A GPU-accelerated Bloom filter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to add a brief section describing the underlying algorithm, along with a reference to the original paper. I noticed the paper is referenced in the policy document, but it doesn't appear to be mentioned here.

Comment thread include/cuco/detail/bloom_filter/bloom_filter_impl.cuh
constexpr auto num_threads = tile_size_v<CG>;
auto num_keys = cuco::detail::distance(first, last);
if constexpr (tile_size_v<CG> == add_horizontal_layout && add_horizontal_layout > 1) {
auto constexpr num_threads = static_cast<decltype(num_keys)>(tile_size_v<CG>);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use an explicit type instead of decltype throughout this func?

Comment on lines +632 to +638
// TODO
// [[nodiscard]] __host__ double occupancy() const;
// [[nodiscard]] __host__ double expected_false_positive_rate(size_t unique_keys) const
// [[nodiscard]] __host__ __device__ static uint32_t optimal_pattern_bits(size_t num_blocks)
// template <typename CG, cuda::thread_scope NewScope = thread_scope>
// [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const
// memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still relevant?

Comment thread include/cuco/detail/bloom_filter/bloom_filter_impl.cuh
// [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const
// memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept;
template <bool ConditionalAtomic>
__device__ constexpr void atomic_or(word_type* word_ptr, word_type pattern) const

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sleeepyjack could you elaborate a bit on why we need this custom atomic_or?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During development we found that cuda::atomic_ref::fetch_or sometimes leads to suboptimal codegen so we added a tuning flag to switch between the CCCL atomics and the plain CUDA atomicOr. This function is the wrapper around that tuning knob.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any CCCL issue we could track this down?

@sleeepyjack

Copy link
Copy Markdown
Collaborator Author

Regarding the tuning knobs, I (or better Codex) did an ablation study:

Bloom filter tuning sweep summary

I ran a tuning sweep on sleeepyjack/bloom-filter-release / PR #808 head
c571c33f34631fcfa05b0807188a5c12bbfda617 in the default cuCollections
devcontainer (cuda13.1-gcc14, CTK 13.1) on:

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • L2 cache: 128 MiB
  • Characterization sizes: {1, 16, 32, 64, 128, 256, 512} MiB
  • Metrics collected: throughput, time, DRAM throughput, L2 hit rate, FPR
  • FPR stayed stable at ~0.00131 for contains variants, so the differences below are performance-only.

Overall recommendation

The only clear default change suggested by this run is:

  • For vertical contains, prefer the custom kernel path over the CUB path:
    • use_cub_kernels = false

Everything else should stay at the current default unless we want a very small size-specific horizontal-contains optimization.

Per-knob findings

use_warp_cooperative_add_kernel

Recommendation: keep enabled.

This is strongly beneficial for add.

Screen-stage main effect:

  • on vs off: +77.1%

When disabled, add throughput dropped substantially at the screen sizes. This knob should remain enabled.

use_cuda_atomic_ref

Recommendation: keep disabled.

Enabling cuda::atomic_ref hurt add.

Characterization, compared to the matching non-cuda_atomic_ref path:

  • Average: -3.44%
  • <= 128 MiB: about -3.6%
  • 256–512 MiB: about -0.16%

So the current non-cuda::atomic_ref path is better.

use_invoke_one

Recommendation: neutral; current default is fine.

For the useful add/horizontal-contains configurations, toggling invoke_one was essentially noise-level.

Observed deltas:

  • Add with warp-coop on and cuda_atomic_ref=0: -0.02%
  • Horizontal contains with warp-coop on: about -0.02%

It does not look like a meaningful tuning lever for this workload.

use_cub_kernels

Recommendation: disable for vertical contains.

This was the largest actionable win.

Vertical contains, cub_kernels=false vs baseline cub_kernels=true:

  • Average: +14.5%
  • <= 128 MiB: +15–16%
  • 256–512 MiB: +1.2–1.5%

Average throughput:

  • Baseline vertical contains: 61.29 G elem/s
  • cub_kernels=false, early_exit=false: 70.19 G elem/s

This suggests the custom vertical contains kernel should be preferred over the CUB transform path on this setup.

use_early_exit

Recommendation: keep disabled.

Early exit did not help in this sweep.

With cub_kernels=false:

  • early_exit=true was slightly worse overall: about -0.1%

With cub_kernels=true:

  • Difference was noise-level, about +0.02%

No reason to enable it by default from these data.

use_warp_cooperative_contains_kernel

Recommendation: keep enabled.

This is critical for horizontal contains.

Horizontal contains with warp-coop disabled:

  • Average: -42.2%
  • <= 128 MiB: about -47.2%
  • 256–512 MiB: about -4.2%

Disabling this increased L2 hit rate somewhat but throughput collapsed, so it should remain enabled.

use_work_stealing_add_kernel

Recommendation: keep disabled.

Work stealing hurt add throughput.

work_stealing_add=true vs baseline add:

  • Average: -1.76%
  • <= 128 MiB: -1.82%
  • 256–512 MiB: -0.30%

Per-size deltas for {1,16,32,64,128,256,512} MiB:

  • -1.97%, -2.74%, -2.84%, -0.19%, -0.92%, -0.35%, -0.24%

No evidence this helps add on the tested workload.

use_work_stealing_contains_kernel

Recommendation: generally keep disabled; possibly consider only for small horizontal contains.

Vertical contains with cub_kernels=false:

  • work_stealing=true, early_exit=false: -0.69% average
  • work_stealing=true, early_exit=true: -0.44% average

So work stealing does not help vertical contains.

Horizontal contains with warp-coop enabled:

  • Average with invoke_one=true: +0.78%
  • <= 128 MiB: +0.90%
  • 256–512 MiB: -0.13%

Per-size deltas for {1,16,32,64,128,256,512} MiB:

  • +2.02%, +1.15%, +0.63%, +0.54%, -0.28%, -0.16%, -0.11%

So work stealing has a small niche benefit for small/cache-resident horizontal contains, but it turns slightly negative for larger filters. I would not enable it globally.

Suggested default policy from this run

For the tested Blackwell + CTK 13.1 setup:

use_invoke_one                       = true;  // neutral / current default OK
use_early_exit                       = false;
use_cub_kernels                      = false; // for vertical contains; biggest win
use_warp_cooperative_add_kernel      = true;
use_warp_cooperative_contains_kernel = true;
use_work_stealing_add_kernel         = false;
use_work_stealing_contains_kernel    = false; // unless specializing small horizontal contains
use_cuda_atomic_ref                  = false;

Code paths that could be removed if we want to simplify

If the goal is to keep the implementation lean rather than preserve all experimental tuning paths, the sweep suggests the following removal candidates.

Good removal candidates

  • use_cuda_atomic_ref alternate path:

    • The cuda::atomic_ref path was consistently slower for add.
    • Suggested simplification: remove the use_cuda_atomic_ref knob and keep the current faster non-cuda::atomic_ref atomic OR path only.
  • use_early_exit path:

    • Early exit did not improve vertical contains and was slightly negative with the recommended cub_kernels=false path.
    • Suggested simplification: remove the use_early_exit knob and simplify the compare recursion to always evaluate the full pattern.
  • use_work_stealing_add_kernel path:

    • Work stealing was slower for add at every characterized size.
    • Suggested simplification: remove add_work_stealing_n_impl, add_work_stealing_n, and the host-side launch branch guarded by use_work_stealing_add_kernel.
  • Vertical-contains use of use_work_stealing_contains_kernel:

    • Work stealing was slower with the recommended vertical contains path (cub_kernels=false).
    • Suggested simplification: do not route vertical contains through the work-stealing kernel.
  • use_cub_kernels contains path:

    • The CUB DeviceTransform path for vertical contains was substantially slower than the custom contains kernel.
    • Suggested simplification: remove the use_cub_kernels knob for contains and always use the custom contains kernel.
    • Note: this does not imply removing CUB from the bloom filter implementation entirely, because other operations still use CUB utilities.

Probably keep, or remove only with care

  • use_warp_cooperative_add_kernel:

    • Keep this path. It is strongly beneficial.
    • If simplifying, remove the non-warp-cooperative add path instead, not the warp-cooperative path.
  • use_warp_cooperative_contains_kernel:

    • Keep this path. It is critical for horizontal contains.
    • If simplifying, remove the non-warp-cooperative horizontal contains path instead, not the warp-cooperative path.
  • use_invoke_one:

    • Performance impact was effectively neutral.
    • I would keep this as a compatibility/implementation knob unless we want to simplify aggressively.
    • If removing it, keep the current default behavior (invoke_one enabled when available) and remove the fallback branch only where toolkit support guarantees it.
  • Horizontal use_work_stealing_contains_kernel:

    • This has a small benefit for small/cache-resident horizontal contains (+0.5–2.0% up to 64 MiB, +0.9% through 128 MiB), but turns slightly negative for larger filters.
    • I would not enable it by default. Removal is reasonable if we do not want a size-specialized path; otherwise keep it only behind a size/architecture heuristic.

Caveat

These results are from a uniform benchmark workload on one Blackwell GPU. The clearest and most robust conclusions are:

  1. keep warp-cooperative add/contains enabled,
  2. keep cuda_atomic_ref disabled,
  3. disable CUB for vertical contains,
  4. keep work stealing disabled by default.

@sleeepyjack

sleeepyjack commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

tl;dr here is a summary and my suggestion on what we should do with each tuning knob/ code path:

  • use_invoke_one: No performance benefit in using it so we can safely remove any use of invoke_one as well as the tuning knob.
  • use_early_exit: Although it didn't show any perf impact in the tested scenarios, it might become relevant for when the match rate of the filter is very low. Kevin had a usecase that was showing significant improvements from this knob. We should even expose this as a tparam of the policy instead of hiding it inside the implementation.
  • use_cub_kernels: I'd suggest we keep this knob for now, set it to false for contains and true for add.
  • use_warp_cooperative*: Keep this code path. Remove the tuning knobs and make warp cooperative kernels the default.
  • use_work_stealing*: Not much benefit with the tested configuration but in the paper there were some scenarios on B200 where it was slightly better. I'd say it's not worth the complexity keeping it in our codebase right now. Instead, I would rather wait for a neat CCCL abstraction and then reintroduce it.
  • use_atomic_ref: atomicOr is consistently faster than cuda::atomic_ref::fetch_or so I suggest to remove this knob and use atomicOr as default. This is fine since we control the type of the underlying atomic aka word_type.

What do you think?

@PointKernel

Copy link
Copy Markdown
Member

tl;dr here is a summary and my suggestion on what we should do with each tuning knob/ code path:

Too late. I've already gone through the whole lengthy AI-generated report. 😉

What do you think?

All looks valid to me. Several points:

  • use_cub_kernels: this is surprising. Worth bringing this up to CCCL?
  • use_atomic_ref: this aligns with my observations duing the new hash table design as well, should report it to CCCL
  • use_work_stealing: Bloom filter operations tend to have fairly uniform costs compared to hash table operations, so I don't expect work stealing to provide much benefit here. Still, it was great to explore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

helps: rapids Helps or needed by RAPIDS Needs Review Awaiting reviews before merging topic: bloom_filter Issues related to bloom_filter topic: performance Performance related issue type: improvement Improvement / enhancement to an existing function

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants