Skip to content

Conversation

@mh0lt
Copy link
Contributor

@mh0lt mh0lt commented Oct 9, 2025

This PR changes the way SharedDomains and TemporalTx's are passed to stage components. Rather than wraping them inside of a TxContainer.

The reason for doing this is that the container removes type and content saftey when we're passing domains an transactions. For the most part now the tx passed is temporal - and all of the underlying code now handles temproal transactions, so I think for the stage loop thie wrapper has become redundant.

Why Now ? Becuase I have discovered an issue with unwinds when running parallel processing with an external tx. This is becuase in the case the shared domain is not shared, or its memory gets clear in unwind, and becuase the unwind flush is to the external RoTx of the loop, the RoTx's of the parallel workers can't see this update as its not been commited yet.

To fix this I need to trace the SD usage and use an SD which is shared between unwind and exec. This will make that easier to do.

@sudeepdino008
Copy link
Member

sudeepdino008 commented Oct 9, 2025

where exactly is the logic that causes sharing of the sd between unwind and exec?

@mh0lt
Copy link
Contributor Author

mh0lt commented Oct 9, 2025

where exactly is the logic that causes sharding of the sd between unwind and exec?

In the stage loop unwind and exec are seperate finctions and both operate on the shared domain, however the code here:

if err := sd.Flush(ctx, tx); err != nil {

flushes and clears the domain. i.e. writes the data to the underlying transaction and clears the SD. There are two cases here:

  1. The Sd & the tx is local to the unexec stage
  2. The tx is shared, becuase it is external, but either the sd is local and not visible, or the sd is shared and its RAM is cleared.

In 2 the unwind is not visible to the parallel executers until the bounding tx is commited. I think that the sd's local data and the bounding tx need to have thier updates co-ordinated. However in the code there are various places where local shared domains are created and accessed. I think that many of these are likely to break things.

@sudeepdino008
Copy link
Member

where exactly is the logic that causes sharding of the sd between unwind and exec?

In the stage loop unwind and exec are seperate finctions and both operate on the shared domain, however the code here:

if err := sd.Flush(ctx, tx); err != nil {

flushes and clears the domain. i.e. writes the data to the underlying transaction and clears the SD. There are two cases here:

  1. The Sd & the tx is local to the unexec stage
  2. The tx is shared, becuase it is external, but either the sd is local and not visible, or the sd is shared and its RAM is cleared.

In 2 the unwind is not visible to the parallel executers until the bounding tx is commited. I think that the sd's local data and the bounding tx need to have thier updates co-ordinated. However in the code there are various places where local shared domains are created and accessed. I think that many of these are likely to break things.

ok. So this is a preparatory PR to help make debugging easier for tracing the SD usage, correct?

@mh0lt
Copy link
Contributor Author

mh0lt commented Oct 10, 2025

where exactly is the logic that causes sharding of the sd between unwind and exec?

In the stage loop unwind and exec are seperate finctions and both operate on the shared domain, however the code here:

if err := sd.Flush(ctx, tx); err != nil {

flushes and clears the domain. i.e. writes the data to the underlying transaction and clears the SD. There are two cases here:

  1. The Sd & the tx is local to the unexec stage
  2. The tx is shared, becuase it is external, but either the sd is local and not visible, or the sd is shared and its RAM is cleared.

In 2 the unwind is not visible to the parallel executers until the bounding tx is commited. I think that the sd's local data and the bounding tx need to have thier updates co-ordinated. However in the code there are various places where local shared domains are created and accessed. I think that many of these are likely to break things.

ok. So this is a preparatory PR to help make debugging easier for tracing the SD usage, correct?

Yes. It also simplifies the inner code - becuase it can just use the temporal tx rather than checking both. It seems now its pretty much always a temporal tx, so I have forced this to be true.

As you can see from the various checkins above I wanted to do the fiddly changes to do with unwrapping here, rather than in the PR where I'm trying to trace the stage code.

The only place where this is not the case is in tests - so I've changed them to be conformant with the working code. Hence the addition of the test temporal db.

@mh0lt mh0lt merged commit d2b228e into main Oct 14, 2025
17 checks passed
@mh0lt mh0lt deleted the unwrap_tx_and-sd_params branch October 14, 2025 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants