Skip to content

Conversation

@hylkevds
Copy link
Collaborator

The client may have started a new session before the old one is purged, therefore the purge must check if the session being removed should actually be removed and is not a new session.

By basing the remove method on the session instead of the client id, it can ensure only the correct session is removed from the pool.

@hylkevds hylkevds requested a review from andsel July 27, 2025 19:58
@andsel
Copy link
Collaborator

andsel commented Aug 2, 2025

@hylkevds do you know when this happen in production? For example is this a valid case that manifest the problem?

  1. client_1 is connected with name session_1
  2. another instance of a client, name it client_2 connects with same name session_1
  3. client_1 is expected to disconnect and its session should be removed without interfering with new session object created for client_2.

If that's the case, is it possible to write an integration test to cover it?

@hylkevds
Copy link
Collaborator Author

hylkevds commented Aug 4, 2025

@hylkevds do you know when this happen in production? For example is this a valid case that manifest the problem?
1. client_1 is connected with name session_1
2. another instance of a client, name it client_2 connects with same name session_1
3. client_1 is expected to disconnect and its session should be removed without interfering with new session object created for client_2.

That specific case is covered. When client_1 is still connected, when client_2 connects with the same ClientId, client_1 will be disconnected, and depending on the clean_session setting, the old Session will be cleaned up before the new one is created.

Though the call to sessionsRepository.delete(session.getSessionData()); is currently missing here!
Looking through it again, I think this was the main reason I changed it. As sessionsRepository.delete(session.getSessionData()); requires a session, and we want to avoid multiple calls to pool.get(clientId).

The tricky issue would be when client_1 makes a session with cleanSession=false (or with a timeout in MQTT5) and the timed cleanup happens right as client_2 connects, and a race-condition makes the session cleaner execute after the new session was made.

The correct solution would be to make the cleanup happen on the SessionLoop, and add a simple check to see if the session is actually connected. A connected session should never be removed.

Copy link
Collaborator

@andsel andsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if it's a race condition then has to be managed with proper synchronization strategy.

@hylkevds
Copy link
Collaborator Author

Yes, hence why we should move the cleanup to the SessionLoop. Though the currently missing sessionsRepository.delete(session.getSessionData()); is the bigger issue.
I'll see what I can come up with.

The client may have started a new session before the old one is purged,
therefore the purge must check if the session being removed should
actually be removed and is not a new session.
@hylkevds hylkevds force-pushed the fix_SessionRegistryRemove branch from e9e6713 to 1de0990 Compare August 13, 2025 17:32
@hylkevds
Copy link
Collaborator Author

I've had another look and made two additional changes:

  • the session-timeout purge now happens on the SessionLoop, to make sure the purge can not happen at the same time (on a different thread) as a session re-open.
  • I added an extra check on the session state, in case the cleanup get scheduled after a re-open. This is important, since the cleanup also removes all subscriptions.

Regarding tests, unfortunately, race conditions are nearly impossible to reliably test, since they depend on the exact interleaving of threads...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants