compute_total_bounds: CUB BlockReduce and __syncthreads() inside divergent if (tid < num_items) cause partial-block UB

Describe the bug

In warp/native/bvh.cu, compute_total_bounds calls a CUB BlockReduce collective and __syncthreads() from inside a thread-divergent branch if (tid < num_items):

const int blockStart = blockDim.x * blockIdx.x;
const int numValid   = ::min(num_items - blockStart, blockDim.x);
const int tid        = blockStart + threadIdx.x;

if (tid < num_items) {
    vec3 lower = item_lowers[tid];
    vec3 upper = item_uppers[tid];

    vec3 block_upper = BlockReduce(temp_storage).Reduce(upper, Vec3Max, numValid);
    __syncthreads();
    vec3 block_lower = BlockReduce(temp_storage).Reduce(lower, Vec3Min, numValid);

    if (threadIdx.x == 0) {
        atomic_max(total_upper, block_upper);
        atomic_min(total_lower, block_lower);
    }
}

BlockReduce::Reduce and __syncthreads() are block-wide operations that require every thread in the block to reach them. For the last block, when num_items is not an exact multiple of blockDim.x, the threads with tid >= num_items take the else path and never reach the collective/barrier. That is undefined behavior — it can hang or produce a wrong reduction — and the result feeds the total scene bounds used to build the Morton-code LBVH, so a wrong reduction silently corrupts the acceleration structure.

Location

warp/native/bvh.cu, compute_total_bounds (≈ lines 462–478).

Reachability

The default GPU LBVH build path; triggered whenever the item count is not a multiple of the block size (i.e. most inputs).

Suggested fix

Hoist the loads and both BlockReduce(...).Reduce(...) calls (and the __syncthreads()) out of the if (tid < num_items) so all threads participate; feed neutral sentinels for out-of-range lanes (+inf for the min/lower reduce, -inf for the max/upper reduce) — or have all threads call Reduce(..., numValid) (the valid count is already computed) — and guard only the atomic_* writes with threadIdx.x == 0.

Note on verification

I found this by static analysis. I don't have a Warp build on my machine (single-GPU WSL2 without the native toolchain), so I have not reproduced it under a sanitizer. A maintainer could confirm with compute-sanitizer --tool synccheck on an LBVH built from an item count not divisible by the block size (placing one far-outlier AABB last makes a wrong reduction observable against a CPU reference). Filing in case it's useful; happy to open a PR if the suggested direction looks right.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

compute_total_bounds: CUB BlockReduce and __syncthreads() inside divergent if (tid < num_items) cause partial-block UB #1601

Describe the bug

Location

Reachability

Suggested fix

Note on verification

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

compute_total_bounds: CUB BlockReduce and __syncthreads() inside divergent if (tid < num_items) cause partial-block UB #1601

Description

Describe the bug

Location

Reachability

Suggested fix

Note on verification

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions