-
Notifications
You must be signed in to change notification settings - Fork 33
Description
Describe the problem you're observing
I was doing some testing using the write-static code and found a subtle bug in https://github.com/LLNL/UnifyFS/blob/58ece4441716678f5111a6dbff9baadd6188c2b6/examples/src/testutil.h#L1251
Describe how to reproduce the problem
When the test file already exists, setting the "remove and reuse test file" flag (-r) would easily trigger the bug.
mpirun -n 2 write-static -n 1 -r
Issue
At the end of the test_remove_file(), we do a barrier() to sync processes.
The problem is, there is a potential execution path where one or more processes may exit the function earlier, causing wrong barrier matching.
https://github.com/LLNL/UnifyFS/blob/58ece4441716678f5111a6dbff9baadd6188c2b6/examples/src/testutil.h#L1262-L1268
This part of code assumes all ranks check the existence of file, get rc = 0, and continue.
Problematic execution sequence:
Rank 0 calls stat(), get rc = 0, then go ahead delete the file. Then Rank 1 came, its stat() call will return -1 since Rank 0 has already deleted the file. Then Rank 1 exits the function without calling the barrier() at the end.
Fix
Add a barrier immediately after stat() to make sure everyone sees the same result.
Overall, we should be careful when using "return" if we will do a barrier later. We need to make sure all processes will always follow the same path: either all return, or all call barrier.