Wilcoxon refactor#636
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #636 +/- ##
==========================================
- Coverage 90.29% 89.28% -1.01%
==========================================
Files 104 104
Lines 8964 9464 +500
==========================================
+ Hits 8094 8450 +356
- Misses 870 1014 +144
|
|
@coderabbitci review |
|
✅ Action performedReview finished.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR replaces Wilcoxon rank-sum GPU computation with new OVR/OVO tiered kernel implementations using CUB segmented radix sort, adds a _rank_stats_cuda module for group statistics, upgrades infrastructure to CUDA 12.9.1, refactors Python rank_genes_groups to support new sparse/dense streaming paths and optional U-value output, and expands test coverage for non-negative data and empty-group filtering. ChangesWilcoxon GPU Refactor and Rank Statistics
🎯 4 (Complex) | ⏱️ ~75 minutes Possibly Related PRs
Suggested Reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 20
🧹 Nitpick comments (1)
.gitignore (1)
57-57: ⚡ Quick winRemove duplicate
/benchmarks/entry.This pattern already appears on line 7.
🧹 Proposed fix
-/benchmarks/🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.gitignore at line 57, Duplicate .gitignore entry: remove the redundant "/benchmarks/" line so the pattern appears only once; locate the duplicate entry matching "/benchmarks/" and delete one occurrence (keep the first instance) to avoid repeated patterns in .gitignore.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pyproject.toml`:
- Line 176: Update the auditwheel repair command string (the line containing
"auditwheel repair --exclude libcublas.so.12 --exclude libcublas.so.13 ...") to
use wildcard SONAME exclusions and include the same libraries excluded in
publish.yml: replace hardcoded suffixes (e.g., libcublas.so.12/.13) with
wildcard patterns like libcublas.so.*, libcublasLt.so.*, libcudart.so.*, and add
libcusolver.so.*, libcusparse.so.*, libnvJitLink.so* (and keep librmm.so and
librapids_logger.so) so the CIBW_REPAIR_WHEEL_COMMAND behavior in pyproject.toml
matches the publish.yml exclusions and avoids bundling large CUDA-dependent
libraries.
In `@src/rapids_singlecell/_cuda/rank_genes/csr_tile_to_dense.cuh`:
- Around line 28-32: The kernel narrows templated IndexT by doing const int col
= static_cast<int>(indices[k]) which can truncate large indices; change the code
to keep 64-bit math by declaring col using the template type (e.g. IndexT or
int64_t), perform comparisons against col_lb/col_ub without narrowing, compute
the column offset as a 64-bit value (col - col_lb) and only cast to the integer
type required for indexing/atomicAdd at the final array access; update the
occurrence of static_cast<int>(indices[k]) and any uses of (col - col_lb) *
n_cells to use 64-bit arithmetic so writes into out and the atomicAdd call are
correct for large indices.
In `@src/rapids_singlecell/_cuda/rank_genes/rank_stats.cu`:
- Around line 73-85: Before launching csr_tile_to_dense_kernel, validate CSR and
output buffer contracts: ensure indptr.shape(0) >= 2 and equals n_cells+1,
indptr is monotonic non-decreasing and its last value equals the expected nnz
(so indices.size() and data.size() >= nnz), out.dim(0) == n_cells and out.dim(1)
>= (col_ub - col_lb) (or exactly equals the tile width used), and that
pointers/data buffers (indptr.data(), indices.data(), data.data(), out.data())
are non-null; also validate col_lb and col_ub are within [0, num_columns] (use
available column count) and col_lb < col_ub before computing grid and launching
csr_tile_to_dense_kernel; if any check fails, return or propagate an error
instead of launching the kernel (leave
CUDA_CHECK_LAST_ERROR(csr_tile_to_dense_kernel) after the launch).
- Around line 125-138: The kernel launch trusts caller-provided shapes and can
OOB; before computing grid and launching group_chunk_stats_kernel, validate that
block is 2D and n_rows*n_cols>0 (already checked), that group_codes.length ==
n_rows, and that group_sums.size(), group_sum_sq.size(), and group_nnz.size()
all equal n_groups (the value derived from group_sums.shape(0)), and that none
of the device pointers/arrays (block.data(), group_codes.data(),
group_sums.data(), group_sum_sq.data(), group_nnz.data()) are null; if any check
fails, return or raise an error instead of launching the kernel. Ensure these
checks are placed immediately before the call site that computes grid =
strided_grid(total, GROUP_STATS_BLOCK) and before invoking
group_chunk_stats_kernel (passing compute_nnz unchanged).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh`:
- Around line 158-161: The segmented radix sort calls to
cub::DeviceSegmentedRadixSort::SortKeys currently drop the returned cudaError_t;
capture the return value (e.g. cudaError_t err =
cub::DeviceSegmentedRadixSort::SortKeys(...)) for both occurrences (the call
using ref_cub_temp_buf/ref_temp/d_ref_dense/d_ref_sorted/d_ref_seg_offsets and
the second call around lines 357-360) and check err != cudaSuccess; on failure
log or propagate the error and abort/return early from the enclosing function so
downstream ranking code never consumes invalid d_ref_sorted data. Ensure the
error check uses the same stream/context as the call (ref_stream) and returns or
forwards the cudaError_t so callers can handle the failure.
- Line 17: The early-return combines the all-empty test-group case (n_all_grp ==
0) with other short-circuits, leaving rank_sums and tie_corr uninitialized;
instead, detect n_all_grp == 0 separately and write the expected defaults
(rank_sums = 0.0 and tie_corr = 1.0 for each relevant output entry) before
returning so downstream OVO kernels see the valid values. Update the conditional
in wilcoxon_ovo_device_sparse.cuh to only early-return for n_cols == 0 || n_ref
== 0, and add a branch that iterates the output slots and sets rank_sums and
tie_corr to the correct default when n_all_grp == 0; apply the same change at
the other occurrence noted (around the duplicated early-return).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuh`:
- Around line 226-229: The call to cub::DeviceSegmentedRadixSort::SortKeys is
currently invoked without checking its return status, so failures can leave
buf.ref_sorted invalid; update both occurrences (the SortKeys calls around the
ref_sorted usage) to capture the returned cudaError_t (or cub return code),
check it for success, and on failure propagate/return the error or log and abort
the kernel path before any ranking kernels run (and ensure the stream is
synchronized or the error forwarded to the caller) so downstream code never
operates on invalid ref_sorted.
- Line 19: The early return when (n_cols == 0 || n_ref == 0 || n_all_grp == 0)
leaves outputs uninitialized for the "all-empty test-group" case; before
returning when n_all_grp == 0 explicitly zero/initialize the output buffers used
by both CSC and CSR paths (e.g. d_rank_sums, d_tie_corr) and set the group-stat
output tensors to their deterministic values (zeros for sums/corrections and
appropriate zeros/ones for stats/p-values) so downstream consumers get defined
results; apply the same initialization at the other identical early-return site
referenced (around the other occurrence) so both exits behave consistently.
- Around line 209-214: The OVO host ranking path is incorrectly downcasting
double inputs to float (see launch_ovr_cast_and_accumulate_sparse using
d_sparse_data_f32 and the use of d_ref_data_f32/d_grp_data_f32 at the other
locations), which can collapse distinct doubles into ties; fix by making the OVO
ranking pipeline templated on the input value type (add and wire through double
variants like d_sparse_data_f64, d_ref_data_f64, d_grp_data_f64 and templated
kernels such as launch_ovr_cast_and_accumulate_sparse<InT=double, ...>) so no
double→float cast occurs for _f64 entrypoints, and add the numerical-stability
guards called out in the guidelines (epsilon checks before divisions, safe
accumulation/reduction for double, and NaN/Inf handling) in the corresponding
kernels referenced around lines 209, 535-541, and 706-713.
- Around line 141-154: The per-stream scratch buffers bufs[s].d_group_sums,
bufs[s].d_group_sq_sums, and bufs[s].d_group_nnz must be zeroed before each CSC
sub-batch to avoid leaking previous-batch stats into
launch_ovr_cast_and_accumulate_sparse(...); add a zeroing step just before
calling launch_ovr_cast_and_accumulate_sparse that issues asynchronous memsets
(e.g. cudaMemsetAsync) on each of those device pointers using the stream for
that stream, using the correct byte size (n_groups_stats * sub_batch_cols *
sizeof(double) for the first two and the same/count fallback used at allocation
for the third), and apply the same fix at the other allocation site referenced
(lines ~209-214) so all per-stream reused scratch buffers are cleared before
use.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`:
- Around line 268-279: The CUB segmented sort return value is ignored causing
the HUGE kernel (ovo_rank_huge_kernel) to run on potentially invalid
sc.grp_sorted; capture the return from cub::DeviceSegmentedRadixSort::SortKeys
into a cudaError_t (e.g., status) and check it before launching
ovo_rank_huge_kernel — on error use the existing error handling pattern
(CUDA_CHECK or equivalent logging/return) to abort/propagate the failure instead
of proceeding to launch the kernel that reads sc.grp_sorted.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr_kernels.cuh`:
- Around line 34-49: The kernel csr_scatter_to_csc_kernel wrongly assumes CSR
row indices are sorted; replace the binary-search + early-break logic with a
full linear scan of the CSR row range [rs,re) using indices[p] to test
membership in [col_start,col_stop) so unsorted rows are handled correctly:
iterate p from rs to re, for each int c = (int)indices[p] if (c >= col_start &&
c < col_stop) use atomicAdd(&write_pos[c - col_start],1) and write data[p] and
row into csc_vals/csc_row_idx. Keep references to indices, indptr bounds rs/re,
col_start/col_stop and write_pos/csc_vals/csc_row_idx as the places to modify.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cu`:
- Around line 22-41: The sparse binding macro RSC_OVR_SPARSE_DEVICE_BINDING
currently forwards the int parameter sub_batch_cols verbatim into IMPL, which
can be zero or negative and cause invalid launches/allocations; update the
lambda body so it validates and normalizes sub_batch_cols before calling IMPL
(e.g., if sub_batch_cols <= 0 then set it to SUB_BATCH_COLS or clamp to a
minimum of 1), then pass the sanitized value to IMPL; apply the same validation
change for every sparse binding instance referenced by this file so all calls
into the IMPL functions receive a safe, validated sub_batch_cols.
- Around line 196-200: The _f64 binding variants (RSC_OVO_CSC_HOST_BINDING
entries "ovo_streaming_csc_host_f64", "ovo_streaming_csc_host_f64_i64" and the
CSR equivalents) advertise double precision but the implementations
ovo_streaming_csc_host_impl and ovo_streaming_csr_host_impl currently downcast
InT=double to float32 for sorting/tie handling, which can change ordering and
ties; either remove these _f64 bindings or implement a true double-precision
path in those functions that preserves double inputs (avoid casting
double→float), add small epsilons before any divisions, handle accumulation
stability in reductions, and explicitly treat NaN/Inf inputs during ranking so
the exported _f64 entrypoints genuinely preserve double precision.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu`:
- Around line 91-94: The calls to cub::DeviceSegmentedRadixSort::SortPairs in
the dense launchers are currently ignoring the cudaError_t return value (e.g.,
the call passing buf.cub_temp, temp, keys_in, buf.keys_out, buf.vals_in,
buf.vals_out, sb_items, sb_cols, buf.seg_offsets, buf.seg_offsets + 1,
BEGIN_BIT, END_BIT, stream), which can hide failures and corrupt downstream rank
computations; update both occurrences (around the shown call and the one at
lines ~263-266) to capture the return value, check it against cudaSuccess, and
on error propagate/return an appropriate cudaError_t (or call CUDA
error-handling macro used in this module) so the caller sees the failure instead
of receiving corrupted rank_sums/tie_corr.
- Around line 138-139: The early return on n_all_grp == 0 leaves rank_sums and
tie_corr uninitialized; instead, before returning set the OVO outputs to the
defined empty-group contract: fill rank_sums with 0.0 and tie_corr with 1.0 for
all expected slots (use n_cols and n_groups / n_all_grp semantics used in this
file). Locate the checks around n_all_grp, n_groups, n_cols in wilcoxon.cu and,
when n_all_grp == 0, perform a host/device fill (or launch a small kernel) that
writes 0.0 into the rank_sums buffer for every column×group slot and 1.0 into
the tie_corr buffer, then return. Ensure you respect the same memory
layout/strides used elsewhere in this file so downstream code sees the
initialized values.
In `@src/rapids_singlecell/tools/_rank_genes_groups/_core.py`:
- Around line 430-440: Call _reject_complex(self.X) before branching on method
to ensure complex inputs are rejected for all statistic methods (not just
Wilcoxon); leave the existing Wilcoxon-only negative-sparse fallback logic
(_sparse_negative_fallback = _sparse_has_negative(self.X)) in place inside the
if block. In short: move or add a standalone _reject_complex(self.X) invocation
just prior to the method check so _compute_statistics_arrays and any
astype(self._score_dtype) conversions never silently drop imaginary parts, while
keeping the method-specific setting of self._sparse_negative_fallback for
"wilcoxon" and "wilcoxon_binned".
- Around line 520-526: The top-N selector _rank_indices_matrix mishandles
n_top==0 because np.argpartition(..., -0) returns all columns; add an explicit
check at the start of _rank_indices_matrix for n_top == 0 and return an empty 2D
indices array with shape (scores.shape[0], 0) (dtype int) so
rank_genes_groups(..., n_genes=0) yields no genes; otherwise keep the existing
logic and fall back to _RankGenes._argsort_desc_matrix when n_top >=
scores.shape[1].
In `@src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py`:
- Around line 367-388: The helper _device_sparse_arrays_f32 currently accepts
cp.float64 inputs but always downcasts X.data via X.data.astype(cp.float32,
copy=False), which silently loses precision; instead, detect cp.float64 in
_device_sparse_arrays_f32 and do not downcast—either raise a clear TypeError
stating device-sparse float64 is unsupported (and recommend converting to
float32 or using a host/dense float64 path) or explicitly dispatch to a float64
device kernel/fallback routine; update the code around the data_dtype check to
branch for np.float64 and either raise the error or call the corresponding
float64 handler rather than calling X.data.astype(cp.float32,...).
---
Nitpick comments:
In @.gitignore:
- Line 57: Duplicate .gitignore entry: remove the redundant "/benchmarks/" line
so the pattern appears only once; locate the duplicate entry matching
"/benchmarks/" and delete one occurrence (keep the first instance) to avoid
repeated patterns in .gitignore.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 598c312f-2884-4f0e-8a38-b33679306584
📒 Files selected for processing (31)
.github/workflows/docker.yml.github/workflows/publish.yml.gitignoreCMakeLists.txtconda/rsc_rapids_26.04_cuda12.ymldocker/Dockerfilepyproject.tomlsrc/rapids_singlecell/_cuda/__init__.pysrc/rapids_singlecell/_cuda/nb_types.hsrc/rapids_singlecell/_cuda/rank_genes/csr_tile_to_dense.cuhsrc/rapids_singlecell/_cuda/rank_genes/rank_stats.cusrc/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuhsrc/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cusrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_fast_common.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr_kernels.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr_sparse.cuhsrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_rmm.cusrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cusrc/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse_kernels.cuhsrc/rapids_singlecell/tools/_rank_genes_groups/__init__.pysrc/rapids_singlecell/tools/_rank_genes_groups/_core.pysrc/rapids_singlecell/tools/_rank_genes_groups/_utils.pysrc/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.pysrc/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon_binned.pytests/test_rank_genes_groups_ttest.pytests/test_rank_genes_groups_wilcoxon.pytests/test_rank_genes_groups_wilcoxon_binned.py
Implement a better GPU accelerated version of wilcoxon in rapids-singlecell inspired by the super fast illico verison.