@uncovsky commented Sep 24, 2025

I'm pretty sure the CQL OOD loss is not implemented correctly:

a) The softmax is not backpropagated through, so no penalty is actually applied to OOD actions.
b) The softmax reduces over the ensemble dimension rather than over the batch.
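
To make point a) concrete, here is a minimal sketch of how I'd expect the penalty to behave (the function name and tensor shapes are my assumptions, not the library's actual API):

```python
import torch

def cql_penalty(q_ood: torch.Tensor, q_data: torch.Tensor) -> torch.Tensor:
    # q_ood:  (batch, num_samples, num_critics) -- Q-values of sampled OOD actions
    # q_data: (batch, num_critics)              -- Q-values of dataset actions
    # The logsumexp has to stay in the autograd graph: detaching it (point a)
    # turns the penalty into a constant w.r.t. the critic parameters, so the
    # Q-values of OOD actions are never actually pushed down.
    lse = torch.logsumexp(q_ood, dim=1)   # reduce over the sampled-actions dim
    return (lse - q_data).mean()          # average over batch and ensemble
```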

Additionally, I'm not entirely sure whether next_pi_q should really be evaluated at the next observations. While that technically makes more sense, CORL evaluates the actions at batch.obs, and they seem to have been able to replicate CQL's performance (albeit with some difficulty, see tinkoff-ai/CORL#14).

I think using the next_actions when estimating the logsumexp for (s, a) is just something that helped during training. It's a little odd, sure, but I think we should keep batch.obs there.
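
In other words, something along these lines (a sketch only; the actor/critics signatures are my assumption):

```python
def next_pi_q_ood(actor, critics, obs, obs_next, num_samples=10):
    # CORL's convention: actions are sampled from pi(. | s') ...
    next_actions, _ = actor(obs_next, num_samples)
    # ... but their Q-values are evaluated at the *current* observations s,
    # i.e. batch.obs rather than batch.obs_next.
    return critics(obs, next_actions)  # (batch, num_samples, num_critics)
```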

To illustrate, here's a simple 1D bandit benchmark with the estimated Q-values before/after the fixes.

[Images: estimated Q-values with the broken CQL loss ("CQL broken") vs. with the fix ("CQL with 0/1 weight")]

Thank you for the great library!

@uncovsky (Author) commented

My bad, the logsumexp axis remark is obviously incorrect. What I meant is that IMO the logsumexp should be over the dimension of sampled actions, rather than over all the ensemble values. By summing over the ensemble dim afterwards, we recover the standard CQL loss for num_critics=2.
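
Concretely (shapes are my assumption; say a batch of 256 states, 10 sampled actions, 2 critics):

```python
import torch

q_ood = torch.randn(256, 10, 2, requires_grad=True)  # (batch, num_samples, num_critics)
q_data = torch.randn(256, 2)                          # (batch, num_critics)

lse = torch.logsumexp(q_ood, dim=1)          # over the sampled-actions dim -> (batch, num_critics)
penalty = (lse - q_data).sum(dim=-1).mean()  # then sum over the ensemble dim;
# for num_critics=2 this is exactly the sum of the two per-critic CQL penalties
```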

Nonetheless, the stop-gradient is still incorrect.
