DVEmbAttributor #193
Conversation
Please address the failed tests.

We are still working on a performance test to reproduce a small experiment's result presented in the paper.

Please add some tests.

I've added some tests. This is still a work in progress as we're currently adding the ghost product part.
TheaperDeng
left a comment
Have not fully completed the review.
dattri/algorithm/dvemb.py
Outdated
    def __init__(
        self,
        model: nn.Module,
        loss_func: Callable,
We may want to use AttributionTask here to integrate model and loss_func.
fixed.
dattri/algorithm/dvemb.py
Outdated
        data_tensors: Tuple[Tensor, Tensor],
    ) -> Tensor:
        inputs = data_tensors[0].unsqueeze(0)
        targets = data_tensors[1].unsqueeze(0)
If we use AttributionTask (https://github.com/TRAIS-Lab/dattri/blob/main/dattri/task.py#L40), this will be a user-defined function, and we don't really need to assume the user's data is a tuple of two tensors.
fixed.
dattri/algorithm/dvemb.py
Outdated
        factorization_type: Type of gradient factorization to use. Options are
            "none" (default),
            "kronecker" (same as in the paper),
            or "elementwise" (more memory-efficient).
I guess this is not directly related to memory, since our behavior respects the proj_dim parameter as the main parameter to control memory usage.
fixed
dattri/algorithm/dvemb.py
Outdated
        loss_func: Callable,
        device: str = "cpu",
        proj_dim: Optional[int] = None,
        factorization_type: str = "none",
We may also support a layer_names parameter here so that users can define the layers where the gradient decomposition happens. Remember, not only nn.Linear is a linear layer :)
dattri/dattri/algorithm/logra/logra.py
Lines 42 to 44 in e5eb9e7
    layer_names: Optional[
        Union[str, List[str]]
    ] = None,  # Maybe support layer class as input?
added
dattri/algorithm/dvemb.py
Outdated
        project_input = self._generate_projector(
            input_dim,
            self.projection_dim,
        )
Let's use the projector once #206 is merged.
dattri/algorithm/dvemb.py
Outdated
        )
        projected_grads = per_sample_grads @ self.projector

        scaling_factor = 1.0 / math.sqrt(self.projection_dim)
We already have a normalization in line 134 right?
fixed. Now dvemb's score is essentially equal to the ground truth's without using factorization.
dattri/algorithm/dvemb.py
Outdated
        loss_func: Callable,
        device: str = "cpu",
        task: AttributionTask,
        criterion: nn.Module,
Since you have already integrated the criterion into the loss_func of AttributionTask, it may be better to use get_loss_func to get it instead of redefining it.
I think it's hard to completely remove criterion. While the task does hold all the components, we need to access the functional loss_func and the stateful criterion for two different paths.
The functional task.get_grad_loss_func() is used for the full gradient path when factorization_type="none". However, when factorization is enabled, we must use _calculate_gradient_factors, which relies on backward hooks to capture the intermediate gradient factors. These hooks are only triggered by loss.backward(), requiring us to use the nn.Module directly after a standard model.forward(). Using the functional task.get_loss_func() in this case would bypass the hook system entirely.
Thanks for your reply! You could use loss = task.get_loss_func(params, data) and then call loss.backward() to trigger the backward hook. (In some versions of PyTorch, torch.func.functional_call may bypass the forward hook, so you might want to be aware of that.)
By the way, you might consider optimizing calculate_gradient_factors — instead of performing an additional forward/backward pass, you could collect the factors during the parameter update step.
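As a minimal sketch of the suggestion above (not dattri's actual code): a backward hook registered on a layer fires whenever loss.backward() is called on a loss computed from that layer's standard forward pass, so a loss obtained functionally can still drive the hook machinery. The layer, hook, and cache names here are illustrative only.

```python
import torch
import torch.nn as nn

# Sketch only: a backward hook on a Linear layer fires when
# loss.backward() runs, regardless of how the loss value was obtained,
# as long as the graph goes through this layer's forward pass.
layer = nn.Linear(4, 2)
captured = {}

def bwd_hook(module, grad_input, grad_output):
    # grad_output[0] is dL/d(layer output): one row per sample.
    captured["B"] = grad_output[0].detach()

layer.register_full_backward_hook(bwd_hook)

x = torch.randn(3, 4)
y = torch.randint(0, 2, (3,))
loss = nn.functional.cross_entropy(layer(x), y)
loss.backward()  # triggers bwd_hook
```

Note that, as the comment above warns, torch.func.functional_call may bypass module hooks in some PyTorch versions, so the forward pass here is a plain module call.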
I second that we should avoid the criterion argument. Factorization is also used in the LoGraAttributor, which did not use a separate criterion. Maybe take a look there about how to avoid this argument?
I think the reason is that we are using vmap and torch.func to calculate the gradient for the non-factorization path, and autograd for the factorization path of dvemb (we support both of them). And the loss function (target function) has a different format between these two. E.g.,
dattri/examples/pretrained_benchmark/influence_function_lds.py
Lines 23 to 27 in f5fb64b
    def f(params, data_target_pair):
        image, label = data_target_pair
        loss = nn.CrossEntropyLoss()
        yhat = torch.func.functional_call(model_details["model"], params, image)
        return loss(yhat, label.long())
dattri/examples/pretrained_benchmark/logra_lds.py
Lines 23 to 28 in f5fb64b
    def f(model, batch, device):
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        return nn.functional.cross_entropy(outputs, targets)
One thing we can do is to use task and remove this criterion for now, and detect if a wrong format is given by the user, providing clear instructions in the documentation and error messages.
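A lightweight way to detect a wrongly formatted target function could be an arity check on its signature, matching the two formats shown above. check_loss_func_format and its error message below are hypothetical, not part of dattri's API:

```python
import inspect

def check_loss_func_format(f):
    """Hypothetical validator (not dattri's API): route a user-supplied
    target function by its number of parameters."""
    n = len(inspect.signature(f).parameters)
    if n == 2:
        return "functional"  # f(params, data_target_pair): vmap/torch.func path
    if n == 3:
        return "autograd"    # f(model, batch, device): hook/factorization path
    msg = (
        "loss_func must take (params, data_target_pair) for the "
        "non-factorization path or (model, batch, device) for the "
        "factorization path."
    )
    raise ValueError(msg)
```

This only catches arity mistakes, not semantic ones, but it lets the error message point the user at the two supported formats.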
    )

    def _run_dvemb_simulation(self, attributor: DVEmbAttributor):
        """A generic simulation runner for any configured DVEmbAttributor."""
It might be fine to collect gradients for the training dataset just to verify that the algorithm runs correctly without updating the model parameters in the unit test, but it would be good to add a comment clarifying that this isn’t the correct way to collect gradients.
OK, I've added the comment clarifying that.
    from dattri.benchmark.load import load_benchmark

    """This file define the MLP model."""
The docstring about this file should be put at the top. Please also elaborate a bit more: "This file defines the MLP model" is inaccurate about this file.
examples/dvemb/dvemb_mlp.py
Outdated
    @@ -0,0 +1,199 @@
    """Example code to compute Leave-One-Out (LOO) scores on MLP trained on MNIST dataset.
Update this docstring.
TheaperDeng
left a comment
I have updated the API structure and added descriptive in-line comments.
    DVEmbAttributor(
        task=self.task_eager,
        factorization_type="none",
    )
I have added a unit test to verify the validation of invalid loss function formats.
dattri/algorithm/dvemb.py
Outdated
        If None, uses all Linear layers.
        You can check the names using model.named_modules().
        Hooks will be registered on these layers to collect gradients.
        Only available when factorization_type is not "none".
"Will only be used when ..."
dattri/algorithm/dvemb.py
Outdated
        factorization_type: Type of gradient factorization to use. Options are
            "none" (default),
            "kronecker" (same as in the paper),
            or "elementwise" (better performance while
Maybe elaborate a little bit about what "elementwise" does?
dattri/algorithm/dvemb.py
Outdated
        ValueError: If embeddings for the specified `epoch` are not found.
        """
        if not self.embeddings:
            msg = "Embeddings not computed. Call compute_embeddings first."
Should we call compute_embeddings in self.cache() for consistency with other attributors? And in this error message we can ask the user to call cache()
    def fwd_hook(idx: int) -> Callable:
        def _hook(
            layer: nn.Module,
            inputs: Tuple[Tensor, ...],
            _output: Tensor,
        ) -> None:
            a = inputs[0].detach()

            if a.dim() > 2:  # noqa: PLR2004
                a = a.reshape(a.size(0), -1)
            caches[idx]["A"] = a
            caches[idx]["has_bias"] = layer.bias is not None

        return _hook
In mixed precision, this hook forces the retention of fp32 input activations. Since autograd saves a separate bf16 copy for the backward pass, the hook prevents the fp32 tensor from being freed, significantly increasing memory usage. To fix this, we need to use saved_tensors_hooks instead, which is non-trivial (see my recent revision to GhostSuite). For now, I suggest making a note here and in the README.
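For context, a minimal sketch of the saved_tensors_hooks mechanism the comment refers to (this is illustrative, not the GhostSuite revision; the bf16 downcast policy is an assumption): autograd passes every tensor it saves for backward through a pack hook and restores it through an unpack hook, so activations can be stored in a cheaper dtype instead of being pinned at fp32.

```python
import torch

# Sketch only: downcast floating-point activations when autograd packs
# them for backward, restoring the original dtype on unpack, so the
# graph does not retain an extra fp32 copy.
def pack(t):
    if t.is_floating_point():
        return (t.dtype, t.to(torch.bfloat16))
    return t

def unpack(packed):
    if isinstance(packed, tuple):
        dtype, t = packed
        return t.to(dtype)
    return packed

x = torch.randn(8, 4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()  # gradients computed from the downcast-then-restored copy
```

The trade-off is a small precision loss in the recomputed gradients in exchange for not keeping the fp32 activations alive, which is why integrating this correctly is non-trivial.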
Description
1. Motivation and Context
This pull request introduces the Data Value Embedding (DVEmb) attributor, a method for calculating trajectory-specific data influence. Unlike existing methods that often overlook the temporal dynamics of model training, DVEmb captures how the influence of a data point evolves over time by creating epoch-specific embeddings, which allows for a more accurate analysis of data value.
2. Summary of the change
github issue
3. What tests have been added/updated for the change?