Copy States Optimization #297

Angeladadd · 2025-07-17T11:04:33Z

Description

This PR introduces a two-phase optimisation to address communication bottlenecks in the copy_states routine during distributed resampling:

Deduplication & Local Replication:

- Transmit each unique particle only once across ranks.
- Reconstruct duplicates locally with multi-threaded replication.

Communication-Efficient Redistribution:

- Reformulates resampling redistribution as a lightweight rank-level transportation problem.
- Minimises unnecessary cross-rank transfers caused by global index ordering.
- Solved efficiently with HiGHS, adding negligible overhead relative to communication savings.
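To make the redistribution idea concrete, here is a minimal sketch of one way to read it. It is not the code in this PR. It assumes particles are owned block-wise (rank r owns ids (r-1)*nprt_per_rank+1 to r*nprt_per_rank, with one-based rank labels), that every cross-rank transfer costs the same, and that a simple greedy matching can stand in for the HiGHS solve. Under that uniform-cost assumption, any plan that keeps as many copies as possible on their home rank already attains the minimum number of transfers. All function names are hypothetical.

```julia
# Count how many copies of each source particle the resampling step asks for.
function offspring_counts(resampling_indices::AbstractVector{Int}, n_particles::Int)
    counts = zeros(Int, n_particles)
    for id in resampling_indices
        counts[id] += 1
    end
    return counts
end

# Cross-rank transfers when slot k simply receives particle resampling_indices[k].
function naive_cross_rank_transfers(resampling_indices::AbstractVector{Int}, nprt_per_rank::Int)
    owner(i) = (i - 1) ÷ nprt_per_rank + 1
    return count(k -> owner(k) != owner(resampling_indices[k]), eachindex(resampling_indices))
end

# Greedy rank-level plan: transfers[src, dst] is the number of copies moved.
# A stand-in for the transportation problem, valid for uniform transfer costs.
function rank_transfer_plan(counts::AbstractVector{Int}, nprt_per_rank::Int)
    n_ranks = length(counts) ÷ nprt_per_rank
    surplus = zeros(Int, n_ranks)
    for r in 1:n_ranks
        lo = (r - 1) * nprt_per_rank + 1
        hi = r * nprt_per_rank
        # Copies produced from this rank's particles minus the slots it must fill.
        surplus[r] = sum(counts[lo:hi]) - nprt_per_rank
    end
    transfers = zeros(Int, n_ranks, n_ranks)
    src, dst = 1, 1
    while true
        while src <= n_ranks && surplus[src] <= 0
            src += 1
        end
        while dst <= n_ranks && surplus[dst] >= 0
            dst += 1
        end
        (src > n_ranks || dst > n_ranks) && break
        moved = min(surplus[src], -surplus[dst])
        transfers[src, dst] += moved
        surplus[src] -= moved
        surplus[dst] += moved
    end
    return transfers
end

# Two ranks with four particles each; the requested indices are interleaved.
shuffled = [6, 1, 7, 2, 1, 5, 3, 6]
naive_shuffled = naive_cross_rank_transfers(shuffled, 4)
plan_shuffled = rank_transfer_plan(offspring_counts(shuffled, 8), 4)

# Rank 1 produces two more copies than it can hold, so two must move.
skewed = [1, 1, 1, 1, 1, 2, 7, 8]
plan_skewed = rank_transfer_plan(offspring_counts(skewed, 8), 4)
```

In the first example each rank's particles produce exactly four copies, so the plan moves nothing, while assigning copies to slots in the order given would cross ranks four times. In the second example the imbalance forces two copies from rank 1 to rank 2. Non-uniform costs (for example, intra-node versus inter-node links) are where a real solver would be needed.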

Issue

#116

Testing

Added integration test 7, covering the cases that use the optimised copy-states and resampling functions
Added an MPI test for the optimised copy-states function
Added a unit test for the optimised resampling function
Added Slurm scripts for
- running the end-to-end run_particle.jl
- running the MPI test for the optimised copy-states function

codecov · 2025-09-01T10:47:58Z

Codecov Report

❌ Patch coverage is 97.29730% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.94%. Comparing base (e314fbd) to head (e475640).

Files with missing lines	Patch %	Lines
src/utils.jl	97.24%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #297      +/-   ##
==========================================
+ Coverage   94.64%   94.94%   +0.29%     
==========================================
  Files           9        9              
  Lines         654      752      +98     
==========================================
+ Hits          619      714      +95     
- Misses         35       38       +3

tkoskela

Hi Chenge, I hope you are doing well. Sorry it's taken so long to review your code after marking the report. I think this is very good work and should definitely get merged into the upstream repo.

I'd like to have a few clarifying comments in some places that I've highlighted in my review comments. I think overall the optimisations you made should be the default behaviour, rather than something the user has to switch on. This would especially simplify the copy_states! and copy_states_dedup! functions that duplicate some code at the moment.

If Matt and Mose could also take a look at this, that would be great!

tkoskela · 2025-10-17T09:53:04Z

extra/Plot_copy_states_julia.ipynb

Does this notebook reproduce the plots of your report? It's really nice to have it if it does! It could use some plain text description of what it's doing.

tkoskela · 2025-10-17T09:59:43Z

extra/weak_scaling/kathleen_slurm_copy_states.sh

+/home/ucabc46/.julia/bin/mpiexecjl -n $SLURM_NNODES\
+     julia --project=. \
+     /home/ucabc46/exp/ParticleDA.jl/test/mpi_optimized_copy_states.jl -t /home/ucabc46/exp/ParticleDA.jl/test/output/dedup_threading_optimize_resampling/all_timers_$SLURM_NNODES.h5 -o


Having absolute paths to your home directory here will force anyone else using this to manually edit all of them. To reduce the amount of manual editing a different user would have to do, I'd put the paths into variables and use an env variable for home.

Suggested change

-/home/ucabc46/.julia/bin/mpiexecjl -n $SLURM_NNODES\
-     julia --project=. \
-     /home/ucabc46/exp/ParticleDA.jl/test/mpi_optimized_copy_states.jl -t /home/ucabc46/exp/ParticleDA.jl/test/output/dedup_threading_optimize_resampling/all_timers_$SLURM_NNODES.h5 -o
+PARTICLEDA_TEST_DIR=$HOME/exp/ParticleDA.jl/test
+JULIA_DIR=$HOME/.julia
+$JULIA_DIR/bin/mpiexecjl -n $SLURM_NNODES\
+     julia --project=. \
+     $PARTICLEDA_TEST_DIR/mpi_optimized_copy_states.jl -t $PARTICLEDA_TEST_DIR/output/dedup_threading_optimize_resampling/all_timers_$SLURM_NNODES.h5 -o

tkoskela · 2025-10-17T10:00:00Z

extra/weak_scaling/kathleen_slurm_weak_scaling.sh

+/home/ucabc46/.julia/bin/mpiexecjl -n $SLURM_NNODES\
+     julia --project=. \
+     /home/ucabc46/exp/ParticleDA.jl/extra/weak_scaling/run_particleda.jl


See previous comment

tkoskela · 2025-10-17T10:00:43Z

extra/weak_scaling/parametersW1.yaml

-    station_filename: "stationsW1.txt"
-    obs_noise_std: [0.01]
-
+    station_filename: "/home/ucabc46/exp/ParticleDA.jl/extra/weak_scaling/stationsW1.txt"


Again, this should not point to your home dir. Can we use a relative path here?
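One way a relative station_filename could stay usable regardless of the working directory is to resolve it against the directory of the parameter file. The sketch below only illustrates that idea. resolve_station_path is a hypothetical helper, not an existing ParticleDA.jl function, and whether the package resolves paths this way is an assumption.

```julia
# Hypothetical helper (not part of ParticleDA.jl): resolve a station file
# relative to the directory of the parameter file that names it.
function resolve_station_path(parameter_file::AbstractString, station_filename::AbstractString)
    # Absolute paths are passed through unchanged.
    isabspath(station_filename) && return String(station_filename)
    # Relative paths are interpreted next to the parameter file.
    return joinpath(dirname(abspath(parameter_file)), station_filename)
end

resolved = resolve_station_path("/tmp/weak_scaling/parametersW1.yaml", "stationsW1.txt")
```

With this convention the YAML file can keep the original entry station_filename: "stationsW1.txt" and still work for any user who checks out the repository.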

tkoskela · 2025-10-17T10:08:17Z

extra/weak_scaling/run_particleda.jl

+# Verify BLAS implementation is OpenBLAS
+@assert occursin("openblas", string(BLAS.get_config()))
+
+# Set size of thread pool for BLAS operations to 1
+BLAS.set_num_threads(1)


This feels sensible. We don't want BLAS oversubscribing threads. Could you put a comment here to explain why we require OpenBLAS?

tkoskela · 2025-10-17T10:15:30Z

src/params.jl

    particle_save_time_indices::V = []
    seed::Union{Nothing, Int} = nothing
    n_tasks::Int = -1
+    optimize_copy_states::Bool = false


Since you demonstrated this works well, I would like to have it true by default.

tkoskela · 2025-10-17T10:18:57Z

test/Project.toml

@@ -1,5 +1,6 @@
 [deps]
 ChunkSplitters = "ae650224-84b6-46f8-82ea-d812ca08434e"
+Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"


I don't see this being used anywhere. Do we need the test for stale dependencies, @giordano?

tkoskela · 2025-10-17T11:03:53Z

src/utils.jl

+    dedup::Bool = false
 ) where T

+    if dedup
+        return copy_states_dedup!(particles, buffer, resampling_indices, my_rank, nprt_per_rank, to)
+    end


There's a bit of code duplication here that is not ideal, and I think is confusing GitHub in showing the diff. I would be happy with replacing the old copy_states! with the new deduplicating version entirely. I think you showed the overhead of removing the duplicates is small in all realistic cases. That would make the code easier to read and maintain in the future.

I'm not sure I understand why the timer object is in the arguments of this function. If I remember correctly how this works, in the main run_particle_filter function the timer is updated by the @timeit_debug macro in the calling function, so also passing it as an argument is redundant.

The timer object is in the arguments because we have a separate testing script for the copy_states! function.

tkoskela · 2025-10-17T11:04:37Z

src/utils.jl

+end

-    particles .= buffer
+function _categorize_wants(particles_want, my_rank::Int, nprt_per_rank::Int)


particles_want is missing its type here

tkoskela · 2025-10-17T11:07:34Z

src/utils.jl

+        if source_rank == my_rank
+            get!(() -> Int[], local_copies, id) |> v -> push!(v, k)
+        else
+            get!(() -> Int[], remote_copies, id) |> v -> push!(v, k)
+        end


This bit is quite difficult to understand. Can you add some comments to explain what it does? If I understand correctly, you are pushing id into an element of either local_copies or remote_copies based on the outcome of the if statement.
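For readers unfamiliar with the idiom, the following sketch spells out the same pattern in separate steps. It is a stand-in written for illustration, not the PR's _categorize_wants. It assumes zero-based ranks and block-wise ownership of one-based particle ids, and the name categorize_wants_sketch is hypothetical. Note that in the snippet as written the value pushed is the slot index k, stored under the key id.

```julia
# Hypothetical stand-in for the categorisation step quoted above.
function categorize_wants_sketch(particles_want::AbstractVector{Int}, my_rank::Int, nprt_per_rank::Int)
    local_copies = Dict{Int, Vector{Int}}()    # id => local slots, source on this rank
    remote_copies = Dict{Int, Vector{Int}}()   # id => local slots, source on another rank
    for (k, id) in enumerate(particles_want)
        # Assumption: rank r (zero-based) owns ids r*nprt_per_rank+1 : (r+1)*nprt_per_rank.
        source_rank = (id - 1) ÷ nprt_per_rank
        target = source_rank == my_rank ? local_copies : remote_copies
        # get! returns the vector stored under `id`, first inserting an empty
        # Int[] if the key is missing; the slot index k is then appended to it.
        slots = get!(() -> Int[], target, id)
        push!(slots, k)
    end
    return local_copies, remote_copies
end

# Rank 0 owns particles 1:4; slots 1, 2 and 4 all want particle 2.
local_copies, remote_copies = categorize_wants_sketch([2, 2, 7, 2, 5, 7], 0, 4)
```

Each dictionary therefore maps a unique particle id to every local slot that needs a copy of it, which is what allows a remote particle to be received once and then replicated locally.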

ucabc46 and others added 21 commits July 16, 2025 16:33

slurm script

94fbfb9

draft benchmarking scripts

5339204

viz

reduce duplicated message

9604bae

update viz

16435ed

add measurement of overall execution time & pass flag whether to dedup or not

df21e0e

add slurm script

42133cc

local copy/remote duplicate in parallel

d768e56

optimize indices

622fe22

stage viz

d80974c

draft optimized indices

b14678d

add optimized resample flag to test

6860148

update test scripts

79da49c

update viz

66ed782

update test script

9b350da

update viz

f44bed9

update timer naming

1820978

update code

596832d

update viz

ac1a126

remove serialization

87e287b

upgrade julia

d2d3df9

add TimerOutput

883045e

ucabc46 and others added 6 commits September 1, 2025 15:41

remove unused files

2d2d33d

update benchmark deps

2b10cee

try fixing benchmarking upgrade manifest add Test(try fixing benchmarking) let ci refresh manifest Revert "let ci refresh manifest" This reverts commit 5385785. try fixing benchmarking

Revert "update benchmark deps"

019bbf0

This reverts commit 2b10cee.

downgrade julia

a1c9ff8

downgrade to 1.7

273cde9

Angeladadd force-pushed the cgsun/copy_states_refine branch from 8ac5abb to 273cde9 Compare September 3, 2025 22:54

fix ci

879553f

Angeladadd and others added 11 commits September 4, 2025 00:11

add Test

78270ca

fix benchmark

d397e69

upgrade benchmark julia to 1.10

687bd42

fix test depts

12a6972

Merge branch 'main' into cgsun/copy_states_refine

6f4dbfb

add tests

49d3eae

reduce modified files and update viz

eed0ac0

update gitignore

1b9be26

update scripts

41e74ef

use filter paramter

ed4267d

update extra

7e2c788

Angeladadd marked this pull request as ready for review September 4, 2025 02:56

Angeladadd added 4 commits September 4, 2025 10:15

add comments

334d597

remove serialization pkg

2d45aa9

update comment

95b8e5e

fix comments

e475640

tkoskela requested changes Oct 17, 2025

View reviewed changes

tkoskela requested review from giordano and matt-graham October 17, 2025 11:28
