A data race detected by TPU Interpret Mode should never be a false positive. (There are no known issues here, but it is possible there is a bug that is permitting false positives.)

I suspect that some later stage/step's RDMA is signaling a semaphore while an earlier stage/step is still waiting for an earlier RDMA to signal the same semaphore, and this is leading to a real race.

If a second RDMA is started before an earlier RDMA using the same send/receive semaphores has completed, Pallas permits the second RDMA to complete and signal the semaphores before the first RDMA. But this will only happen in TPU Interpret Mode with dma_execution_mode="on_wait", which is why this kind of race is…
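
To make the suspected failure mode concrete, here is a minimal pure-Python toy model of it. It does not use the Pallas API or reproduce the interpreter's actual implementation: `Semaphore`, `Dma`, `wait`, and `run` are hypothetical names invented for illustration, and the "deferred completion" loop is only an assumption about how completing DMAs lazily at wait time could reorder signals. The sketch shows that when two in-flight copies share one semaphore, a wait intended for the first copy can be satisfied by the second copy's signal, so the first copy's destination buffer is read while still stale.

```python
from dataclasses import dataclass


@dataclass
class Semaphore:
    count: int = 0


@dataclass
class Dma:
    src: list
    dst: list
    sem: Semaphore
    done: bool = False

    def complete(self):
        # Perform the copy, then signal the semaphore.
        self.dst[:] = self.src
        self.sem.count += 1
        self.done = True


def wait(sem, completion_order):
    # Toy stand-in for deferred DMA execution: pending DMAs that signal
    # `sem` are completed lazily, in the given order, until the wait can
    # be satisfied.
    for dma in completion_order:
        if sem.count > 0:
            break
        if not dma.done and dma.sem is sem:
            dma.complete()
    assert sem.count > 0, "deadlock: nothing signals this semaphore"
    sem.count -= 1


def run(shared_semaphore, completion_order):
    buf_a, buf_b = [0], [0]  # destination buffers, 0 means "stale"
    sem_a = Semaphore()
    sem_b = sem_a if shared_semaphore else Semaphore()
    dmas = {
        "a": Dma(src=[1], dst=buf_a, sem=sem_a),  # earlier step's copy
        "b": Dma(src=[2], dst=buf_b, sem=sem_b),  # later step's copy
    }
    # The earlier step waits for "its" copy, then reads buf_a.
    wait(sem_a, [dmas[name] for name in completion_order])
    return buf_a[0]


if __name__ == "__main__":
    print("shared semaphore, b completes first:", run(True, ("b", "a")))
    print("separate semaphores, b completes first:", run(False, ("b", "a")))
    print("shared semaphore, a completes first:", run(True, ("a", "b")))
```

In this model the stale read only appears when the semaphore is shared and the later copy happens to complete first. Giving each in-flight copy its own semaphore, or waiting for the earlier copy before starting the next one on the same semaphore, removes it.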

Answer selected by tengyifei