Custom updater fails after ~160 steps on GPU #2263

cslominski · 2026-03-25T05:55:08Z

cslominski
Mar 25, 2026

Hi everyone,

I’ve implemented a custom updater to remove drift in absolute coordinates from a subset of particles. The CPU version behaves as expected, but the GPU version only works for ~160 steps before particle positions become corrupted. No runtime error is thrown; my logged potential energy values simply become nonsensical.

I’m looking for suggestions on where to focus my debugging.

For context, I’ve included my reference position setter below. I exposed ParticleGroup.m_member_tags as a constant reference and pass it to the kernel as d_group_tags. On subsequent triggers:

1. The first kernel computes absolute displacements from the reference positions and stores them in a scratch GPUVector.
2. The second kernel computes the mean drift from the scratch vector using thrust::reduce with thrust::plus<Scalar3>.
3. The third kernel removes the drift using ParticleGroup.getIndexArray (ordering doesn’t matter for this step).
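
For comparing GPU output against a known-good result, the three steps can be sketched as a host-side reference in plain C++. This is a minimal sketch, not the updater's actual code: `Vec3`, `mean_drift`, and `remove_drift` are hypothetical names, `Vec3` only stands in for `Scalar3`, and positions are assumed to be absolute (unwrapped) and indexed by group member.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Scalar3; not the HOOMD type.
struct Vec3
    {
    double x, y, z;
    };

// Steps 1 and 2: accumulate each member's displacement from its reference
// position, then divide by the group size to get the mean drift.
Vec3 mean_drift(const std::vector<Vec3>& pos, const std::vector<Vec3>& ref)
    {
    assert(pos.size() == ref.size() && !pos.empty());
    Vec3 sum {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < pos.size(); ++i)
        {
        sum.x += pos[i].x - ref[i].x;
        sum.y += pos[i].y - ref[i].y;
        sum.z += pos[i].z - ref[i].z;
        }
    const double n = static_cast<double>(pos.size());
    return Vec3 {sum.x / n, sum.y / n, sum.z / n};
    }

// Step 3: subtract the mean drift from every group member.
void remove_drift(std::vector<Vec3>& pos, const Vec3& drift)
    {
    for (auto& p : pos)
        {
        p.x -= drift.x;
        p.y -= drift.y;
        p.z -= drift.z;
        }
    }
```

Running this on positions copied back from the device after each trigger, and comparing the result with the GPU output, gives one way to find the first step at which the two paths diverge.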

I tried checking if any of the rebuilds in ParticleGroup corresponded with the breakdown, but I didn't find any clear correlation.

Any ideas on what might be going wrong or what I should check next would be greatly appreciated. I am happy to provide more details if needed.

Best,
Charlie

__global__ void gpu_set_reference_positions_kernel(Scalar3* d_reference_pos,
                                                   const Scalar4* d_pos,
                                                   const int3* d_image,
                                                   const unsigned int* d_rtag,
                                                   const unsigned int* d_group_tags,
                                                   const unsigned int group_size,
                                                   const BoxDim& box)
    {
    const auto group_idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (group_idx < group_size)
        {
        const auto tag = d_group_tags[group_idx];
        const auto idx = d_rtag[tag];
        d_reference_pos[tag] = gpu_get_absolute_position(idx, d_pos, d_image, box);
        }
    }

joaander · 2026-03-25T16:05:03Z

joaander
Mar 25, 2026
Maintainer

Have you run with gpu_error_checking=True and verified that no errors are reported by the GPU runtime?
Does your other GPU code work when you use the CPU implementation of set_reference_positions?
Have you visualized a trajectory with period 1 to see which particles move unexpectedly?
Have you disabled the sorter to ensure that the problem is not related to changing indices?
Have you added printf to check that values you compute in your kernels are what you expect?
If using an NVIDIA GPU, have you run your code through compute-sanitizer to check for out of bounds or uninitialized memory accesses?

1 reply

cslominski Mar 27, 2026
Author

Thank you for your debugging suggestions -- they helped me track down the issue.

I tried the high-level checks first, but none revealed anything. After closer inspection, I found that the box data was being corrupted inside the kernel after ~160 steps. The fix was to pass the box by value to the kernels:

const BoxDim box)
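
The thread does not spell out why passing by value helps. One plausible reading, offered as an assumption rather than a confirmed diagnosis: a by-value argument is copied when the kernel is launched, while a reference parameter amounts to an address, and the object at that address lives in host memory and can change before the kernel reads it. The plain C++ analogy below (no CUDA involved; `Box`, `defer_by_value`, and `defer_by_address` are hypothetical names) shows the difference using deferred work in place of a kernel launch.

```cpp
#include <functional>

// Hypothetical stand-in for BoxDim with a single field.
struct Box
    {
    double Lx;
    };

// By value: the deferred work owns a snapshot taken at "launch" time.
std::function<double()> defer_by_value(const Box box)
    {
    return [box] { return box.Lx; };
    }

// By reference: only the address is kept, so the deferred work reads
// whatever is stored there when it finally runs.
std::function<double()> defer_by_address(const Box& box)
    {
    const Box* p = &box;
    return [p] { return p->Lx; };
    }
```

If the caller modifies its `Box` after both calls, the by-value closure still returns the original `Lx`, while the by-address closure returns the modified one.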

Answer selected by cslominski
