Parametric Sectorized Bloom filter policy#808
Conversation
…bloom_filter_impl.
…t interface to the new APIs.
…mpilation of example is choking on #pragma unroll.
…try with early exit next.
…ctions into bloom-filter-release
|
/ok to test 20be4e3 |
|
@srinivasyadav18 could use your help reviewing this PR as well |
… and are not needed; polish benchmarks
|
/ok to test 8b1e995 |
|
/ok to test eae3049 |
PointKernel
left a comment
There was a problem hiding this comment.
The actual code changes are not that big, but the use of work stealing definitely caught my attention.
@sleeepyjack, could you please review all files touched by this PR and make sure the copyright years are updated where necessary?
There was a problem hiding this comment.
@sleeepyjack Could you please share a performance comparison between the baseline and the current implementation? It would be helpful to have those numbers documented for future reference.
There was a problem hiding this comment.
I'm working on a small ablation study testing all those different tuning knobs. This will also help answer some of your other comments.
| // Exhaustive sweep across block sizes and vectorization layouts. Uncomment for performance | ||
| // tuning / paper-style characterization; not run by default because the matrix is large. | ||
| // NVBENCH_BENCH_TYPES( | ||
| // bloom_filter_contains, | ||
| // NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>, | ||
| // nvbench::type_list<nvbench::uint64_t, nvbench::uint32_t>, ///< Word | ||
| // nvbench::enum_type_list<64, 128, 256, 512, 1024>, ///< BlockBits | ||
| // nvbench::enum_type_list<8, 16>, ///< PatternBits | ||
| // nvbench::enum_type_list<1, 2, 4, 8, 16>, ///< | ||
| // HorizontalLayout nvbench::enum_type_list<1, 2, 4, 8, 16> ///< VerticalLayout | ||
| // )) | ||
| // .set_name("bloom_filter_contains_full_sweep_u64") | ||
| // .set_type_axes_names( | ||
| // {"Key", "Word", "BlockBits", "PatternBits", "HorizontalLayout", "VerticalLayout"}) | ||
| // .add_int64_axis("NumInputs", {defaults::BF_N}) | ||
| // .add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE); No newline at end of file |
There was a problem hiding this comment.
shall we remove it since unused?
There was a problem hiding this comment.
I'm thinking about adding a flag to enable more extensive benchmarks, since compile- and runtime for these setups can be quite long. Maybe in a follow-up PR?
|
|
||
| /** | ||
| * @brief A GPU-accelerated Blocked Bloom Filter. | ||
| * @brief A GPU-accelerated Bloom filter. |
There was a problem hiding this comment.
| * @brief A GPU-accelerated Bloom filter. | |
| * @brief A GPU-accelerated Bloom Filter. |
|
|
||
| /** | ||
| * @brief A GPU-accelerated Blocked Bloom Filter. | ||
| * @brief A GPU-accelerated Bloom filter. |
There was a problem hiding this comment.
It would be helpful to add a brief section describing the underlying algorithm, along with a reference to the original paper. I noticed the paper is referenced in the policy document, but it doesn't appear to be mentioned here.
| constexpr auto num_threads = tile_size_v<CG>; | ||
| auto num_keys = cuco::detail::distance(first, last); | ||
| if constexpr (tile_size_v<CG> == add_horizontal_layout && add_horizontal_layout > 1) { | ||
| auto constexpr num_threads = static_cast<decltype(num_keys)>(tile_size_v<CG>); |
There was a problem hiding this comment.
could we use an explicit type instead of decltype throughout this func?
| // TODO | ||
| // [[nodiscard]] __host__ double occupancy() const; | ||
| // [[nodiscard]] __host__ double expected_false_positive_rate(size_t unique_keys) const | ||
| // [[nodiscard]] __host__ __device__ static uint32_t optimal_pattern_bits(size_t num_blocks) | ||
| // template <typename CG, cuda::thread_scope NewScope = thread_scope> | ||
| // [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const | ||
| // memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept; |
| // [[nodiscard]] __device__ constexpr auto make_copy(CG group, word_type* const | ||
| // memory_to_use, cuda_thread_scope<NewScope> scope = {}) const noexcept; | ||
| template <bool ConditionalAtomic> | ||
| __device__ constexpr void atomic_or(word_type* word_ptr, word_type pattern) const |
There was a problem hiding this comment.
@sleeepyjack could you elaborate a bit on why we need this custom atomic_or?
There was a problem hiding this comment.
During development we found that cuda::atomic_ref::fetch_or sometimes leads to suboptimal codegen so we added a tuning flag to switch between the CCCL atomics and the plain CUDA atomicOr. This function is the wrapper around that tuning knob.
There was a problem hiding this comment.
Any CCCL issue we could track this down?
|
Regarding the tuning knobs, I (or better Codex) did an ablation study: Bloom filter tuning sweep summaryI ran a tuning sweep on
Overall recommendationThe only clear default change suggested by this run is:
Everything else should stay at the current default unless we want a very small size-specific horizontal-contains optimization. Per-knob findings
|
|
tl;dr here is a summary and my suggestion on what we should do with each tuning knob/ code path:
What do you think? |
Too late. I've already gone through the whole lengthy AI-generated report. 😉
All looks valid to me. Several points:
|
Lands the GPU bloom filter optimizations from arXiv:2512.15595.