Parametric Sectorized Bloom filter policy by sleeepyjack · Pull Request #808 · NVIDIA/cuCollections

sleeepyjack · 2026-04-30T01:25:33Z

Lands the GPU bloom filter optimizations from arXiv:2512.15595.

sleeepyjack · 2026-04-30T01:41:54Z

/ok to test 20be4e3

sleeepyjack

Self review

PointKernel · 2026-06-11T22:51:35Z

@srinivasyadav18 could use your help reviewing this PR as well

sleeepyjack · 2026-06-17T23:43:59Z

/ok to test 8b1e995

sleeepyjack · 2026-06-17T23:52:15Z

/ok to test eae3049

PointKernel

The actual code changes are not that big, but the use of work stealing definitely caught my attention.

@sleeepyjack, could you please review all files touched by this PR and make sure the copyright years are updated where necessary?

PointKernel · 2026-06-22T23:14:52Z

@sleeepyjack Could you please share a performance comparison between the baseline and the current implementation? It would be helpful to have those numbers documented for future reference.

I'm working on a small ablation study testing all those different tuning knobs. This will also help answer some of your other comments.

PointKernel · 2026-06-22T23:15:48Z

+// Exhaustive sweep across block sizes and vectorization layouts. Uncomment for performance
+// tuning / paper-style characterization; not run by default because the matrix is large.
+// NVBENCH_BENCH_TYPES(
+//   bloom_filter_contains,
+//   NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
+//                     nvbench::type_list<nvbench::uint64_t, nvbench::uint32_t>, ///< Word
+//                     nvbench::enum_type_list<64, 128, 256, 512, 1024>,         ///< BlockBits
+//                     nvbench::enum_type_list<8, 16>,                           ///< PatternBits
+//                     nvbench::enum_type_list<1, 2, 4, 8, 16>,                  ///< HorizontalLayout
+//                     nvbench::enum_type_list<1, 2, 4, 8, 16>                   ///< VerticalLayout
+//                     ))
+//   .set_name("bloom_filter_contains_full_sweep_u64")
+//   .set_type_axes_names(
+//     {"Key", "Word", "BlockBits", "PatternBits", "HorizontalLayout", "VerticalLayout"})
+//   .add_int64_axis("NumInputs", {defaults::BF_N})
+//   .add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE);


Shall we remove it since it is unused?

I'm thinking about adding a flag to enable more extensive benchmarks, since compile- and runtime for these setups can be quite long. Maybe in a follow-up PR?

PointKernel · 2026-06-22T23:18:38Z

nice cleanup

PointKernel · 2026-06-22T23:24:10Z


 /**
- * @brief A GPU-accelerated Blocked Bloom Filter.
+ * @brief A GPU-accelerated Bloom filter.


Suggested change

- * @brief A GPU-accelerated Bloom filter.
+ * @brief A GPU-accelerated Bloom Filter.

PointKernel · 2026-06-22T23:31:27Z


 /**
- * @brief A GPU-accelerated Blocked Bloom Filter.
+ * @brief A GPU-accelerated Bloom filter.


It would be helpful to add a brief section describing the underlying algorithm, along with a reference to the original paper. I noticed the paper is referenced in the policy document, but it doesn't appear to be mentioned here.
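For readers unfamiliar with the data structure, the general idea of a blocked Bloom filter can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the cuco implementation or the algorithm from the referenced paper: the class name `blocked_bloom_filter`, the `mix` hash (a splitmix64-style finalizer), and the fixed block and pattern sizes are all hypothetical choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (splitmix64 finalizer); not the hash used by cuco.
inline std::uint64_t mix(std::uint64_t x)
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Hypothetical host-side blocked Bloom filter: each key maps to exactly one
// block, and all of its pattern bits are set inside that block.
class blocked_bloom_filter {
 public:
  static constexpr std::size_t words_per_block = 4;  // 4 x 64-bit words = 256-bit block
  static constexpr std::size_t block_bits      = words_per_block * 64;
  static constexpr int pattern_bits            = 8;  // bit positions drawn per key (may collide)

  explicit blocked_bloom_filter(std::size_t num_blocks)
    : num_blocks_{num_blocks}, words_(num_blocks * words_per_block, 0)
  {
    assert(num_blocks > 0);
  }

  // Sets the key's pattern bits inside its block.
  void add(std::uint64_t key)
  {
    std::uint64_t h      = mix(key);
    std::uint64_t* block = &words_[(h % num_blocks_) * words_per_block];
    for (int i = 0; i < pattern_bits; ++i) {
      h                    = mix(h);
      std::uint64_t bit    = h % block_bits;
      block[bit / 64] |= std::uint64_t{1} << (bit % 64);
    }
  }

  // Returns false if any pattern bit is missing (key definitely absent),
  // true otherwise (key present or a false positive).
  bool contains(std::uint64_t key) const
  {
    std::uint64_t h            = mix(key);
    std::uint64_t const* block = &words_[(h % num_blocks_) * words_per_block];
    for (int i = 0; i < pattern_bits; ++i) {
      h                 = mix(h);
      std::uint64_t bit = h % block_bits;
      if ((block[bit / 64] & (std::uint64_t{1} << (bit % 64))) == 0) { return false; }
    }
    return true;
  }

 private:
  std::size_t num_blocks_;
  std::vector<std::uint64_t> words_;
};
```

Because every lookup touches a single block, the memory traffic per key is bounded by the block size, which is the property GPU implementations typically exploit.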

PointKernel · 2026-06-23T01:58:57Z

-    constexpr auto num_threads = tile_size_v<CG>;
+    auto num_keys = cuco::detail::distance(first, last);
+    if constexpr (tile_size_v<CG> == add_horizontal_layout && add_horizontal_layout > 1) {
+      auto constexpr num_threads = static_cast<decltype(num_keys)>(tile_size_v<CG>);


could we use an explicit type instead of decltype throughout this func?
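One way to act on this suggestion is sketched below, assuming the key count is a signed 64-bit value. The alias `size_type` and the helper `num_iterations` are hypothetical names for illustration and do not come from the PR.

```cpp
#include <cstdint>

// Hypothetical explicit index type, used instead of decltype(num_keys).
using size_type = std::int64_t;

// Number of loop iterations when TileSize threads stride over num_keys keys.
template <int TileSize>
constexpr size_type num_iterations(size_type num_keys)
{
  constexpr auto num_threads = static_cast<size_type>(TileSize);
  return (num_keys + num_threads - 1) / num_threads;  // ceiling division
}
```

Naming the type once makes every cast in the function readable without looking up what the distance helper returns.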

PointKernel · 2026-06-23T01:59:56Z

+  // TODO
+  // [[nodiscard]] __host__ double occupancy() const;
+  // [[nodiscard]] __host__ double expected_false_positive_rate(size_t unique_keys) const
+  // [[nodiscard]] __host__ __device__ static uint32_t optimal_pattern_bits(size_t num_blocks)
+  // template <typename CG, cuda::thread_scope NewScope = thread_scope>
+  // [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const
+  // memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept;


still relevant?

PointKernel · 2026-06-23T02:07:31Z

-  // [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const
-  // memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept;
+  template <bool ConditionalAtomic>
+  __device__ constexpr void atomic_or(word_type* word_ptr, word_type pattern) const


@sleeepyjack could you elaborate a bit on why we need this custom atomic_or?

During development we found that cuda::atomic_ref::fetch_or sometimes leads to suboptimal codegen so we added a tuning flag to switch between the CCCL atomics and the plain CUDA atomicOr. This function is the wrapper around that tuning knob.

Is there a CCCL issue where we could track this down?
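The wrapper pattern described in this thread can be sketched on the host as a compile-time switch between two atomic OR code paths. This is only an analogy: `std::atomic::fetch_or` and a compare-and-swap loop stand in for `cuda::atomic_ref::fetch_or` and `atomicOr`, the names `atomic_or_word` and `UseFetchOr` are hypothetical, and the sketch does not reproduce the `ConditionalAtomic` parameter from the diff.

```cpp
#include <atomic>
#include <cstdint>

using word_type = std::uint64_t;

// Hypothetical tuning knob: select one of two equivalent atomic OR code paths
// at compile time so the unused path generates no code.
template <bool UseFetchOr>
inline void atomic_or_word(std::atomic<word_type>& word, word_type pattern)
{
  if constexpr (UseFetchOr) {
    // Path A: library-provided read-modify-write.
    word.fetch_or(pattern, std::memory_order_relaxed);
  } else {
    // Path B: hand-written compare-and-swap loop with the same effect.
    word_type expected = word.load(std::memory_order_relaxed);
    while (!word.compare_exchange_weak(
      expected, expected | pattern, std::memory_order_relaxed)) {
      // On failure `expected` holds the current value; retry with it.
    }
  }
}
```

Both instantiations leave the same bits set, so a benchmark can toggle the template argument and compare the generated code without touching the callers.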

sleeepyjack · 2026-06-23T22:44:43Z

Regarding the tuning knobs, I (or rather, Codex) ran an ablation study:

Bloom filter tuning sweep summary

I ran a tuning sweep on sleeepyjack/bloom-filter-release / PR #808 head
c571c33f34631fcfa05b0807188a5c12bbfda617 in the default cuCollections
devcontainer (cuda13.1-gcc14, CTK 13.1) on:

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
L2 cache: 128 MiB
Characterization sizes: {1, 16, 32, 64, 128, 256, 512} MiB
Metrics collected: throughput, time, DRAM throughput, L2 hit rate, FPR
FPR stayed stable at ~0.00131 for contains variants, so the differences below are performance-only.

Overall recommendation

The only clear default change suggested by this run is:

For vertical contains, prefer the custom kernel path over the CUB path:
- use_cub_kernels = false

Everything else should stay at the current default unless we want a very small size-specific horizontal-contains optimization.

Per-knob findings

`use_warp_cooperative_add_kernel`

Recommendation: keep enabled.

This is strongly beneficial for add.

Screen-stage main effect:

on vs off: +77.1%

When disabled, add throughput dropped substantially at the screen sizes. This knob should remain enabled.

`use_cuda_atomic_ref`

Recommendation: keep disabled.

Enabling cuda::atomic_ref hurt add.

Characterization, compared to the matching non-cuda_atomic_ref path:

Average: -3.44%
<= 128 MiB: about -3.6%
256–512 MiB: about -0.16%

So the current non-cuda::atomic_ref path is better.

`use_invoke_one`

Recommendation: neutral; current default is fine.

For the useful add/horizontal-contains configurations, toggling invoke_one was essentially noise-level.

Observed deltas:

Add with warp-coop on and cuda_atomic_ref=0: -0.02%
Horizontal contains with warp-coop on: about -0.02%

It does not look like a meaningful tuning lever for this workload.

`use_cub_kernels`

Recommendation: disable for vertical contains.

This was the largest actionable win.

Vertical contains, cub_kernels=false vs baseline cub_kernels=true:

Average: +14.5%
<= 128 MiB: +15–16%
256–512 MiB: +1.2–1.5%

Average throughput:

Baseline vertical contains: 61.29 G elem/s
cub_kernels=false, early_exit=false: 70.19 G elem/s

This suggests the custom vertical contains kernel should be preferred over the CUB transform path on this setup.
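The reported average gain can be checked against the two throughput figures with a small helper. The function name `relative_change_percent` is hypothetical; the inputs are the averages quoted in the report.

```cpp
// Relative change of `candidate` over `baseline`, in percent.
inline double relative_change_percent(double baseline, double candidate)
{
  return (candidate / baseline - 1.0) * 100.0;
}
```

Calling `relative_change_percent(61.29, 70.19)` gives roughly 14.5, which agrees with the reported average of +14.5%.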

`use_early_exit`

Recommendation: keep disabled.

Early exit did not help in this sweep.

With cub_kernels=false:

early_exit=true was slightly worse overall: about -0.1%

With cub_kernels=true:

Difference was noise-level, about +0.02%

No reason to enable it by default from these data.
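The trade-off behind this knob can be illustrated with a host-side sketch of the two compare strategies. It assumes a block is a small array of words and a key's pattern is one mask per word; `block_contains`, `block_type`, and `words_per_block` are hypothetical names, and the code is not the PR's compare recursion.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t words_per_block = 4;
using block_type = std::array<std::uint64_t, words_per_block>;

// Checks whether every bit of `pattern` is set in `block`.
// EarlyExit = true  : return at the first word that fails (data-dependent branch).
// EarlyExit = false : always evaluate all words (branch-free accumulation).
template <bool EarlyExit>
inline bool block_contains(block_type const& block, block_type const& pattern)
{
  bool found = true;
  for (std::size_t i = 0; i < words_per_block; ++i) {
    bool const word_matches = (block[i] & pattern[i]) == pattern[i];
    if constexpr (EarlyExit) {
      if (!word_matches) { return false; }  // skip the remaining words
    } else {
      found = found && word_matches;
    }
  }
  return found;
}
```

Both variants return the same answer; they differ only in how much work is done for keys that are absent, which is why the benefit depends on the match rate of the workload.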

`use_warp_cooperative_contains_kernel`

Recommendation: keep enabled.

This is critical for horizontal contains.

Horizontal contains with warp-coop disabled:

Average: -42.2%
<= 128 MiB: about -47.2%
256–512 MiB: about -4.2%

Disabling this increased L2 hit rate somewhat but throughput collapsed, so it should remain enabled.

`use_work_stealing_add_kernel`

Recommendation: keep disabled.

Work stealing hurt add throughput.

work_stealing_add=true vs baseline add:

Average: -1.76%
<= 128 MiB: -1.82%
256–512 MiB: -0.30%

Per-size deltas for {1,16,32,64,128,256,512} MiB:

-1.97%, -2.74%, -2.84%, -0.19%, -0.92%, -0.35%, -0.24%

No evidence this helps add on the tested workload.

`use_work_stealing_contains_kernel`

Recommendation: generally keep disabled; possibly consider only for small horizontal contains.

Vertical contains with cub_kernels=false:

work_stealing=true, early_exit=false: -0.69% average
work_stealing=true, early_exit=true: -0.44% average

So work stealing does not help vertical contains.

Horizontal contains with warp-coop enabled:

Average with invoke_one=true: +0.78%
<= 128 MiB: +0.90%
256–512 MiB: -0.13%

Per-size deltas for {1,16,32,64,128,256,512} MiB:

+2.02%, +1.15%, +0.63%, +0.54%, -0.28%, -0.16%, -0.11%

So work stealing has a small niche benefit for small/cache-resident horizontal contains, but it turns slightly negative for larger filters. I would not enable it globally.
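As a sanity check on the per-size deltas quoted for the two work-stealing knobs, the plain arithmetic means can be computed directly. The helper `mean` and the array names are hypothetical; the numbers are copied from the report.

```cpp
#include <array>
#include <cstddef>

// Per-size deltas in percent for {1,16,32,64,128,256,512} MiB, copied from the report.
constexpr std::array<double, 7> add_deltas{-1.97, -2.74, -2.84, -0.19, -0.92, -0.35, -0.24};
constexpr std::array<double, 7> horizontal_contains_deltas{
  2.02, 1.15, 0.63, 0.54, -0.28, -0.16, -0.11};

// Plain arithmetic mean over the index range [first, last).
inline double mean(std::array<double, 7> const& v, std::size_t first, std::size_t last)
{
  double sum = 0.0;
  for (std::size_t i = first; i < last; ++i) { sum += v[i]; }
  return sum / static_cast<double>(last - first);
}
```

For the two largest sizes, `mean(add_deltas, 5, 7)` is about -0.295 and `mean(horizontal_contains_deltas, 5, 7)` is about -0.135, close to the reported 256–512 MiB figures. Over the first five sizes the plain means are about -1.73 and +0.81, which differ from the reported "<= 128 MiB" values of -1.82% and +0.90%. The report does not state how its averages were aggregated; computing them from averaged throughput rather than from per-size deltas is one possible explanation, but that is an assumption.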

Suggested default policy from this run

For the tested Blackwell + CTK 13.1 setup:

use_invoke_one                       = true;  // neutral / current default OK
use_early_exit                       = false;
use_cub_kernels                      = false; // for vertical contains; biggest win
use_warp_cooperative_add_kernel      = true;
use_warp_cooperative_contains_kernel = true;
use_work_stealing_add_kernel         = false;
use_work_stealing_contains_kernel    = false; // unless specializing small horizontal contains
use_cuda_atomic_ref                  = false;
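A sketch of how such knobs could be grouped and consumed at compile time is shown below. The knob names and values come from the list above; the struct names `tuning_defaults` and `baseline_tuning`, the enum `contains_path`, and the function `select_vertical_contains_path` are hypothetical and do not describe how the PR organizes its flags.

```cpp
// Hypothetical container for the suggested defaults from this run.
struct tuning_defaults {
  static constexpr bool use_invoke_one                       = true;
  static constexpr bool use_early_exit                       = false;
  static constexpr bool use_cub_kernels                      = false;
  static constexpr bool use_warp_cooperative_add_kernel      = true;
  static constexpr bool use_warp_cooperative_contains_kernel = true;
  static constexpr bool use_work_stealing_add_kernel         = false;
  static constexpr bool use_work_stealing_contains_kernel    = false;
  static constexpr bool use_cuda_atomic_ref                  = false;
};

// The sweep's baseline used the CUB path; only that knob is overridden here.
struct baseline_tuning : tuning_defaults {
  static constexpr bool use_cub_kernels = true;
};

enum class contains_path { cub_transform, custom_kernel };

// Compile-time dispatch on a single knob; the unselected branch is discarded.
template <typename Tuning>
constexpr contains_path select_vertical_contains_path()
{
  if constexpr (Tuning::use_cub_kernels) {
    return contains_path::cub_transform;
  } else {
    return contains_path::custom_kernel;
  }
}
```

Keeping the knobs as `static constexpr` members means that removing a knob later is a mechanical change: delete the member and the `if constexpr` branch that reads it.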

Code paths that could be removed if we want to simplify

If the goal is to keep the implementation lean rather than preserve all experimental tuning paths, the sweep suggests the following removal candidates.

Good removal candidates

use_cuda_atomic_ref alternate path:
- The cuda::atomic_ref path was consistently slower for add.
- Suggested simplification: remove the use_cuda_atomic_ref knob and keep the current faster non-cuda::atomic_ref atomic OR path only.
use_early_exit path:
- Early exit did not improve vertical contains and was slightly negative with the recommended cub_kernels=false path.
- Suggested simplification: remove the use_early_exit knob and simplify the compare recursion to always evaluate the full pattern.
use_work_stealing_add_kernel path:
- Work stealing was slower for add at every characterized size.
- Suggested simplification: remove add_work_stealing_n_impl, add_work_stealing_n, and the host-side launch branch guarded by use_work_stealing_add_kernel.
Vertical-contains use of use_work_stealing_contains_kernel:
- Work stealing was slower with the recommended vertical contains path (cub_kernels=false).
- Suggested simplification: do not route vertical contains through the work-stealing kernel.
use_cub_kernels contains path:
- The CUB DeviceTransform path for vertical contains was substantially slower than the custom contains kernel.
- Suggested simplification: remove the use_cub_kernels knob for contains and always use the custom contains kernel.
- Note: this does not imply removing CUB from the bloom filter implementation entirely, because other operations still use CUB utilities.

Probably keep, or remove only with care

use_warp_cooperative_add_kernel:
- Keep this path. It is strongly beneficial.
- If simplifying, remove the non-warp-cooperative add path instead, not the warp-cooperative path.
use_warp_cooperative_contains_kernel:
- Keep this path. It is critical for horizontal contains.
- If simplifying, remove the non-warp-cooperative horizontal contains path instead, not the warp-cooperative path.
use_invoke_one:
- Performance impact was effectively neutral.
- I would keep this as a compatibility/implementation knob unless we want to simplify aggressively.
- If removing it, keep the current default behavior (invoke_one enabled when available) and remove the fallback branch only where toolkit support guarantees it.
Horizontal use_work_stealing_contains_kernel:
- This has a small benefit for small/cache-resident horizontal contains (+0.5–2.0% up to 64 MiB, +0.9% through 128 MiB), but turns slightly negative for larger filters.
- I would not enable it by default. Removal is reasonable if we do not want a size-specialized path; otherwise keep it only behind a size/architecture heuristic.

Caveat

These results are from a uniform benchmark workload on one Blackwell GPU. The clearest and most robust conclusions are:

keep warp-cooperative add/contains enabled,
keep cuda_atomic_ref disabled,
disable CUB for vertical contains,
keep work stealing disabled by default.

sleeepyjack · 2026-06-23T23:06:44Z

tl;dr here is a summary and my suggestion on what we should do with each tuning knob/code path:

use_invoke_one: No performance benefit from using it, so we can safely remove all uses of invoke_one along with the tuning knob.
use_early_exit: Although it showed no perf impact in the tested scenarios, it might become relevant when the match rate of the filter is very low. Kevin had a use case that showed significant improvements from this knob. We should even expose this as a tparam of the policy instead of hiding it inside the implementation.
use_cub_kernels: I'd suggest we keep this knob for now and set it to false for contains and true for add.
use_warp_cooperative*: Keep this code path. Remove the tuning knobs and make warp cooperative kernels the default.
use_work_stealing*: Not much benefit with the tested configuration, but in the paper there were some scenarios on B200 where it was slightly better. I'd say it's not worth the complexity of keeping it in our codebase right now. Instead, I would rather wait for a neat CCCL abstraction and then reintroduce it.
use_atomic_ref: atomicOr is consistently faster than cuda::atomic_ref::fetch_or, so I suggest removing this knob and using atomicOr as the default. This is fine since we control the type of the underlying atomic, aka word_type.

What do you think?

PointKernel · 2026-06-23T23:16:22Z

> tl;dr here is a summary and my suggestion on what we should do with each tuning knob/code path:

Too late. I've already gone through the whole lengthy AI-generated report. 😉

> What do you think?

All looks valid to me. Several points:

use_cub_kernels: this is surprising. Worth bringing this up to CCCL?
use_atomic_ref: this aligns with my observations during the new hash table design as well; we should report it to CCCL
use_work_stealing: Bloom filter operations tend to have fairly uniform costs compared to hash table operations, so I don't expect work stealing to provide much benefit here. Still, it was great to explore.
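The reasoning about uniform costs can be made concrete with a host-side sketch of dynamic chunk claiming, a simple form of dynamic load balancing in which workers grab the next chunk from a shared counter. This is not a deque-based work-stealing scheduler and not the mechanism used by the PR's kernels; `claim_chunk` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <utility>

// Claims the next chunk of work from a shared counter.
// Returns the half-open range [begin, end); the range is empty when no work is left.
inline std::pair<std::size_t, std::size_t> claim_chunk(std::atomic<std::size_t>& next,
                                                       std::size_t chunk_size,
                                                       std::size_t num_items)
{
  std::size_t const begin = next.fetch_add(chunk_size, std::memory_order_relaxed);
  if (begin >= num_items) { return {num_items, num_items}; }
  return {begin, std::min(begin + chunk_size, num_items)};
}
```

Every claim costs one atomic operation on shared state. When each item takes about the same time, a static partition already balances the load, so the dynamic scheme adds that overhead without shortening the slowest worker's run.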

sleeepyjack and others added 30 commits September 9, 2025 08:53

Add support for horizontal/vertical vectorization parameter

d67ae07

Restructure policies

cb7a78d

Fix indexing bug

21ff88c

Coalesced output write

41e217a

Add unit test for adaptive contains kernel

2b8ecde

Add parametric filter policy (dummy)

c322825

Merge remote-tracking branch 'upstream' into exp-filter-policy

17f9c19

Multiplicative hashing implemented in policy. Some changes needed to bloom_filter_impl.

97693a3

Finalized proposed policy interface.

4285b39

Fixed a mistake in thread_dispatch(). Removed some dead static variables.

0934cba

Multiplicative hashing calling code infrastructure.

cf43c8f

New example script for sanity checking. Still need to connect the host interface to the new APIs.

52d7f17

host and device APIs are connected for multiplicative hashing, but compilation of example is choking on #pragma unroll.

91714b0

Debugging done. End-to-end filters working properly.

fa7d9a7

Tests updated.

13918c2

Good performance against arrow FP when early exit is turned off. Will try with early exit next.

7008773

Updated bloom filter nvbench script.

9252f69

Changing exp kernels from if to while for grid-striding.

12b4847

Bug fix in filter size in PFP_EVALUATION_EXAMPLE

d3fcce2

Bug fix in while loop in exp kernels.

a30d1b1

Small PR review fixes.

18e9c34

group-cooperative parametric filter policy code paths implemented.

a018896

Benchmark scripts updated.

8020b72

Notebook with theoretical FPR calculators.

b655183

Remove static checks on hash result type that are blocking NVBench.

e558f1c

Enum type lists for the add benchmark added.

c83912c

Added salt generation script. Updated the total number of salts to 64.

46cf45f

Updated block index selection in PFP to match Arrow policy.

892e4a9

Merge remote-tracking branch 'upstream' into exp-filter-policy

e9f8ac9

Enable magic modulo

1a4b5e0

sleeepyjack added 2 commits April 29, 2026 18:36

Merge branch 'bloom-filter-release' of github.com:sleeepyjack/cuCollections into bloom-filter-release

2a8de62

Address Doxygen

20be4e3

sleeepyjack commented May 8, 2026

sleeepyjack added 2 commits May 13, 2026 06:59

Fix CTK 12.0 build: gate CG invoke_one and cluster launch control

4ac613e

Merge remote-tracking branch 'upstream' into bloom-filter-release

0a2509f

sleeepyjack mentioned this pull request Jun 11, 2026

[FEA]: [CUCO] Migrate cuco::bloom_filter to cudax NVIDIA/cccl#9414

Open

sleeepyjack added 7 commits June 17, 2026 07:53

Review fixes

739079f

Merge remote-tracking branch 'upstream/dev' into bloom-filter-release

46fbaf0

Remove CSBF

4dfb811

Remove IO-less benchmarks

97635fe

More tests

dbe61ec

Remove range scalar device functions as they were ambiguous overloads and are not needed; polish benchmarks

10f09d6

Cleanups

8b1e995

sleeepyjack self-assigned this Jun 17, 2026

sleeepyjack added the Needs Review label and removed the In Progress label Jun 17, 2026

sleeepyjack changed the title ~~Bloom filter overhaul~~ Parametric Sectorized Bloom filter policy Jun 17, 2026

sleeepyjack marked this pull request as ready for review June 17, 2026 23:41

sleeepyjack requested a review from PointKernel as a code owner June 17, 2026 23:41

sleeepyjack added the topic: performance label Jun 17, 2026

Fix CUCO_HAS_CG_INVOKE_ONE usage

eae3049

PointKernel reviewed Jun 23, 2026

Merge branch 'dev' into bloom-filter-release

c571c33
