@uncovsky commented Sep 24, 2025

I'm pretty sure the CQL OOD loss is not implemented correctly:

a) The softmax is not backpropagated through, so no penalty is actually applied to OOD actions.
b) The softmax reduces over the ensemble dimension rather than over the batch.
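
To make point a) concrete, here is a minimal sketch of how I'd expect the penalty to behave (the function name and tensor shapes are my assumptions, not the library's actual API):

```python
import torch

def cql_penalty(q_ood: torch.Tensor, q_data: torch.Tensor) -> torch.Tensor:
    # q_ood:  (batch, num_samples, num_critics) -- Q-values of sampled OOD actions
    # q_data: (batch, num_critics)              -- Q-values of dataset actions
    # The logsumexp has to stay in the autograd graph: detaching it (point a)
    # turns the penalty into a constant w.r.t. the critic parameters, so the
    # Q-values of OOD actions are never actually pushed down.
    lse = torch.logsumexp(q_ood, dim=1)   # reduce over the sampled-actions dim
    return (lse - q_data).mean()          # average over batch and ensemble
```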

Additionally, I'm not entirely sure whether next_pi_q should really be evaluated at the next observations. While that technically makes more sense, CORL evaluates the actions at batch.obs, and they seem to have been able to replicate CQL's performance (albeit with some difficulty, see tinkoff-ai/CORL#14).

I think using the next_actions when estimating the logsumexp for (s, a) is just something that helped during training. It's a little odd, sure, but I think we should keep batch.obs there.
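
In other words, something along these lines (a sketch only; the actor/critics signatures are my assumption):

```python
def next_pi_q_ood(actor, critics, obs, obs_next, num_samples=10):
    # CORL's convention: actions are sampled from pi(. | s') ...
    next_actions, _ = actor(obs_next, num_samples)
    # ... but their Q-values are evaluated at the *current* observations s,
    # i.e. batch.obs rather than batch.obs_next.
    return critics(obs, next_actions)  # (batch, num_samples, num_critics)
```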

To illustrate, here's a simple 1D bandit benchmark with the estimated Q-values before/after the fixes.

[Images: estimated Q-values with the broken CQL loss ("CQL broken") vs. with the fix ("CQL with 0/1 weight")]

Thank you for the great library!

@uncovsky (Author) commented

My bad, the logsumexp axis remark is obviously incorrect. What I meant is that IMO the logsumexp should be over the dimension of sampled actions, rather than over all the ensemble values. By summing over the ensemble dim afterwards, we recover the standard CQL loss for num_critics=2.
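
Concretely (shapes are my assumption; say a batch of 256 states, 10 sampled actions, 2 critics):

```python
import torch

q_ood = torch.randn(256, 10, 2, requires_grad=True)  # (batch, num_samples, num_critics)
q_data = torch.randn(256, 2)                          # (batch, num_critics)

lse = torch.logsumexp(q_ood, dim=1)          # over the sampled-actions dim -> (batch, num_critics)
penalty = (lse - q_data).sum(dim=-1).mean()  # then sum over the ensemble dim;
# for num_critics=2 this is exactly the sum of the two per-critic CQL penalties
```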

Nonetheless, the stop-gradient is still incorrect.
