
Conversation

@DanielNi868

Fix the transformers error by setting attn_implementation='eager' in from_pretrained (see the sketch below)

Modify checkpoints_load_func in score_logra.py and score_TRAK.py to fix the Hugging Face error

Update the README and document the transformers vmap error
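
For reference, a minimal sketch of the fix described above. The model name is an assumption (the examples build the model from their own config/args, as in the snippets quoted below); the point is only where the attn_implementation argument goes:

```python
from transformers import AutoModelForCausalLM

# Illustrative sketch: passing attn_implementation="eager" keeps the attention path
# compatible with torch.func / vmap-based attribution and avoids the
# AttentionMaskConverter error from the SDPA code path.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                       # assumption: the GPT-2 WikiText example
    attn_implementation="eager",  # avoid the SDPA path that breaks under vmap
)
model = model.cuda()
```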

--proj_dim 256 \
--proj_max_batch_size 8 \
--proj_type random_mask

Collaborator:

remove this file

--output_dir ../checkpoints \
--block_size 512 \
--seed ${SEED}

Collaborator:

remove this file

trust_remote_code=args.trust_remote_code,
attn_implementation="eager", # Use eager attention to avoid the vmap/SDPA error
)
model = model.cuda()
Collaborator:

No need to keep this troubleshooting information since it is no longer an issue.

Collaborator:

Just remove the troubleshooting message; there is no need to tell the user how the toolkit developers resolved the problem.

config=config,
low_cpu_mem_usage=args.low_cpu_mem_usage,
trust_remote_code=args.trust_remote_code,
attn_implementation="eager", # Use eager attention to avoid the vmap/SDPA error
Collaborator:

The only change needed in this PR is to add this line in score_TRAK and score_logra.

Collaborator:

If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed to fix the transformers error regarding vmap. Otherwise, we may keep them unchanged.

@TheaperDeng TheaperDeng changed the title Fix the transformers' error and update the score_logra and score_TRAK [WIP] Fix the transformers' error and update the score_logra and score_TRAK Nov 10, 2025
@TheaperDeng TheaperDeng added bug Something isn't working work-in-progress labels Nov 10, 2025
@DanielNi868 (Author):

I deleted those files and updated README.md.

trust_remote_code=args.trust_remote_code,
attn_implementation="eager", # Use eager attention to avoid the vmap/SDPA error
)
model = model.cuda()
Collaborator:

Just remove the troubleshooting message; there is no need to tell the user how the toolkit developers resolved the problem.

config=config,
low_cpu_mem_usage=args.low_cpu_mem_usage,
trust_remote_code=args.trust_remote_code,
attn_implementation="eager", # Use eager attention to avoid the vmap/SDPA error
Collaborator:

If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed to fix the transformers error regarding vmap. Otherwise, we may keep them unchanged.

@DanielNi868 (Author):

I updated README.md.

In TRAK, the function f in main should keep the unsqueeze(0) calls to avoid the dimension mismatch error; I think this is needed to fix the vmap error (see the sketch after the snippet).

```python
def f(params, batch):
    """
    Log-odds objective for TRAK.
    """
    input_ids, attention_mask, labels = batch

    # Re-add batch dimension removed by vmap
    input_ids = input_ids.unsqueeze(0).cuda()
    attention_mask = attention_mask.unsqueeze(0).cuda()
    labels = labels.unsqueeze(0).cuda()

    outputs = torch.func.functional_call(
        model,
        params,
        (input_ids,),  # Pass as tuple to avoid dimension issues
        kwargs={"attention_mask": attention_mask, "labels": labels},
    )
    logp = -outputs.loss
    return logp - torch.log(1 - torch.exp(logp))
```
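
For context, an illustrative toy (not the repo's code) of why the batch dimension disappears: torch.func.vmap maps the function over the batch axis, so inside the function each tensor arrives without its batch dimension, and unsqueeze(0) restores the (1, seq_len) shape the model expects:

```python
import torch

def per_sample(input_ids):
    # Inside vmap the mapped batch dimension is stripped, so shape is (seq_len,).
    restored = input_ids.unsqueeze(0)      # back to (1, seq_len), as the model expects
    return restored.sum(dim=1).squeeze(0)  # any per-sample reduction, returned as a tensor

batch = torch.randint(0, 50257, (4, 512))  # hypothetical (batch, seq_len) input_ids
out = torch.func.vmap(per_sample)(batch)   # each call sees one row of shape (512,)
print(out.shape)                           # torch.Size([4])
```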

try:
    from transformers.utils import send_example_telemetry
except ImportError:
    send_example_telemetry = None  # Not available in newer transformers versions
@DanielNi868 (Author) commented Nov 29, 2025:

With the original code I get ImportError: cannot import name 'send_example_telemetry' from 'transformers.utils'.

# Fix the import error in newer transformers versions
if send_example_telemetry is not None:
    send_example_telemetry("run_clm_no_trainer", args)

@DanielNi868 (Author) commented Nov 29, 2025:

I get the same ImportError: cannot import name 'send_example_telemetry' from 'transformers.utils'.

default="random_mask",
choices=["normal", "rademacher", "random_mask", "sjlt", "grass"],
help="Random projection type used for TRAK/TracIn (default: random_mask).",
)
@DanielNi868 (Author) commented Nov 29, 2025:

I got torch.OutOfMemoryError: CUDA out of memory when I did not have these three projection parameters.

input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()

@DanielNi868 (Author) commented Nov 29, 2025:

Without this I get IndexError: too many indices for tensor of dimension 2.

# Re-add batch dimension removed by vmap
input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()
@DanielNi868 (Author) commented Nov 29, 2025:

Without this I get IndexError: too many indices for tensor of dimension 2.

input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()

@DanielNi868 (Author) commented Nov 29, 2025:

Without this I get IndexError: too many indices for tensor of dimension 2.

if len(parts) == 2 and parts[1].isdigit():
    num_checkpoints = int(parts[1])
    requested_checkpoints = int(parts[1])
else:
@DanielNi868 (Author):

I ran this again and it works now; I think this modification can be deleted.


checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]

elif method in ["TracIn", "Grad-Dot", "Grad-Cos"]:
@DanielNi868 (Author):

I ran this again and it works now; I think this modification can be deleted.

method,
)
checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]
else:
@DanielNi868 (Author):

I ran this again and it works now; I think this modification can be deleted.

"proj_dim": 2048,
"proj_dim": args.proj_dim,
"proj_max_batch_size": args.proj_max_batch_size,
"proj_type": args.proj_type,
@DanielNi868 (Author) commented Nov 29, 2025:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting

"proj_dim": 2048,
"proj_dim": args.proj_dim,
"proj_max_batch_size": args.proj_max_batch_size,
"proj_type": args.proj_type,
@DanielNi868 (Author) commented Nov 29, 2025:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
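
As a rough, hypothetical back-of-envelope (the real allocation depends on the projector implementation and on how work is chunked by proj_max_batch_size), the projection cost scales with num_params × proj_dim, which is why lowering proj_dim and using a sparse projector such as random_mask helps:

```python
# Rough, order-of-magnitude arithmetic only; a dense fp32 projection matrix is
# assumed here, while random_mask is sparse and the projection is chunked.
num_params = 124_000_000          # assumption: GPT-2 small, ~124M parameters
bytes_per_float = 4               # fp32

for proj_dim in (2048, 256):
    dense_proj_gib = num_params * proj_dim * bytes_per_float / 2**30
    print(f"dense fp32 projection matrix, proj_dim={proj_dim}: ~{dense_proj_gib:,.0f} GiB")
# dense fp32 projection matrix, proj_dim=2048: ~946 GiB
# dense fp32 projection matrix, proj_dim=256: ~118 GiB
```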

else:
    raise e

new_model.eval()
@DanielNi868 (Author):

I think this modification can be deleted

    from transformers.utils import send_example_telemetry
except ImportError:
    send_example_telemetry = None  # Not available in newer transformers versions

@DanielNi868 (Author):

same import error as in TRAK

Collaborator:

Just pin transformers==4.46.0 in requirements.txt.

# Fix the import error in newer transformers versions
if send_example_telemetry is not None:
    send_example_telemetry("run_clm_no_trainer", args)

@DanielNi868 (Author):

same import error as in TRAK


model_id = -1
model_id = 0 # Use checkpoint 0 (final checkpoint)
checkpoint = f"{args.output_dir}/{model_id}"
@DanielNi868 (Author) commented Nov 29, 2025:

FileNotFoundError: Checkpoint directory not found: /dattri/experiments/gpt2_wikitext/checkpoints/-1. Please ensure the checkpoint exists at this path.

else:
    raise e

model.eval()
@DanielNi868 (Author) commented Nov 29, 2025:

huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '../checkpoints/-1'. Use repo_type argument if needed.
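
For context, the HFValidationError appears because '../checkpoints/-1' does not exist locally, so from_pretrained falls back to treating the string as a Hub repo id. A minimal sketch (not the toolkit's actual loader; the guard and return value are assumptions) that surfaces the missing-directory case directly:

```python
from pathlib import Path
from transformers import AutoModelForCausalLM

def checkpoints_load_func(model, checkpoint):
    # Illustrative guard: if the local directory is missing, from_pretrained would
    # otherwise interpret "checkpoint" as a Hub repo id and raise HFValidationError.
    ckpt_dir = Path(checkpoint)
    if not ckpt_dir.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {ckpt_dir.resolve()}")
    model = AutoModelForCausalLM.from_pretrained(
        ckpt_dir,
        attn_implementation="eager",  # keep the eager-attention fix from this PR
    ).cuda()
    return model
```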

@jiaqima (Contributor) commented Nov 29, 2025:

@DanielNi868 please don't paste the error message to the PR.

```bash
return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
```
The troubleshooting can be avoided by setting the attn_implementation parameter to 'eager' in the from_pretrained function.
Collaborator:

Just delete the troubleshooting section.

from transformers.pytorch_utils import Conv1D
from dattri.task import AttributionTask

model_id = -1
Collaborator:

Here we need the fully trained model at id=-1.

checkpoint = f"{args.output_dir}/{model_id}"

def checkpoints_load_func(model, checkpoint):
    model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()
Collaborator:

What error message did you get for this function at lines 596-597?

@DanielNi868 DanielNi868 force-pushed the main branch 2 times, most recently from 032c7ab to aa51bd0 on December 25, 2025 at 23:40