Hi,
I have recently been reading your excellent continual-learning implementation, in particular the SI part. In the line of code below, you use p.grad, which is the gradient of the regularized loss. However, as I understand SI, the gradient should be computed on the data loss alone, so that it measures how much each weight contributes to the fitting error of the current task. Am I wrong about this, or have I missed an important factor in your implementation? Thanks in advance for your clarification.
    W[n].add_(-p.grad*(p.detach()-p_old[n]))
continual-learning/train.py, line 248 in d281967
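To make my reading of the paper concrete, here is a minimal sketch (not your repo's actual code) of how I would expect the running importance W to be accumulated: the gradient is taken on the data loss only, before the surrogate regularizer enters, so W measures each weight's contribution to the current task's fitting error. The names W and p_old mirror your snippet; everything else here is hypothetical.

```python
import torch

torch.manual_seed(0)
p = torch.nn.Parameter(torch.randn(3))
p_old = [p.detach().clone()]       # parameter values before the step
W = [torch.zeros_like(p)]          # SI running importance

x, y = torch.randn(3), torch.tensor(1.0)

# Gradient of the *data loss only* (the current task's fitting error),
# not of data_loss + c * surrogate_loss.
data_loss = ((p * x).sum() - y) ** 2
data_loss.backward()               # p.grad now holds d(data_loss)/dp
g = p.grad.detach().clone()

# The actual optimizer step would minimize the full regularized loss;
# a plain SGD step on the data loss stands in for it here.
with torch.no_grad():
    p -= 0.1 * g

n = 0
# Accumulate importance: minus the data-loss gradient times the step taken.
W[n].add_(-g * (p.detach() - p_old[n]))
# With a pure SGD step, -g * (p - p_old) = 0.1 * g**2, elementwise >= 0.
```

With this split, p.grad at the moment of the W update reflects only the present task's loss, which is the quantity the path-integral argument in the SI paper seems to call for.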