[WIP] Fix the transformers' error and update the score_logra and score_TRAK #214
base: main
Conversation
```bash
    --proj_dim 256 \
    --proj_max_batch_size 8 \
    --proj_type random_mask
```
remove this file
```bash
    --output_dir ../checkpoints \
    --block_size 512 \
    --seed ${SEED}
```
remove this file
```python
    trust_remote_code=args.trust_remote_code,
    attn_implementation="eager",  # eager attention avoids the vmap error in newer transformers
)
model = model.cuda()
```
No need to keep this troubleshooting information, since it is no longer an issue.
Just remove the troubleshooting message; there is no need to tell users how the toolkit developers resolved the problem.
```python
    config=config,
    low_cpu_mem_usage=args.low_cpu_mem_usage,
    trust_remote_code=args.trust_remote_code,
    attn_implementation="eager",  # eager attention avoids the vmap error in newer transformers
```
The only change this PR needs is to add this line in score_TRAK and score_logra.
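For reference, a minimal sketch of that one-line change (the model id and surrounding arguments here are illustrative, not the exact script):

```python
from transformers import AutoModelForCausalLM

# Illustrative sketch: pass attn_implementation="eager" at load time so the
# attention path stays compatible with per-sample gradients via torch.func.vmap.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                       # placeholder model id for this sketch
    attn_implementation="eager",  # the one-line fix discussed in this PR
)
model = model.cuda()
```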
If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed to fix the transformers error regarding vmap. Otherwise, we may keep those files unchanged.
I deleted those files and updated readme.md.
```python
    trust_remote_code=args.trust_remote_code,
    attn_implementation="eager",  # eager attention avoids the vmap error in newer transformers
)
model = model.cuda()
```
Just remove the troubleshooting message; there is no need to tell users how the toolkit developers resolved the problem.
```python
    config=config,
    low_cpu_mem_usage=args.low_cpu_mem_usage,
    trust_remote_code=args.trust_remote_code,
    attn_implementation="eager",  # eager attention avoids the vmap error in newer transformers
```
If there is anything else you need to change in score_logra and score_TRAK, please comment on why it is needed to fix the transformers error regarding vmap. Otherwise, we may keep those files unchanged.
I updated readme.md. In TRAK, the function f in main should keep the added unsqueeze(0) to avoid the dimension mismatch error in def f(params, batch).
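To make the vmap constraint concrete, here is a hedged sketch of such an f (the batch layout, functional_call usage, and loss choice are assumptions, not the PR's exact code):

```python
import torch

def f(params, batch):
    # Under torch.func.vmap, f is called once per sample, with the batch
    # dimension stripped; unsqueeze(0) re-adds it before the forward pass.
    input_ids, attention_mask, labels = batch
    input_ids = input_ids.unsqueeze(0).cuda()
    attention_mask = attention_mask.unsqueeze(0).cuda()
    labels = labels.unsqueeze(0).cuda()
    # `model` is assumed to be the loaded causal LM in the enclosing scope.
    outputs = torch.func.functional_call(
        model,
        params,
        args=(input_ids,),
        kwargs={"attention_mask": attention_mask, "labels": labels},
    )
    return outputs.loss
```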
```python
try:
    from transformers.utils import send_example_telemetry
except ImportError:
    send_example_telemetry = None  # not available in newer transformers versions
```
With the original code I get ImportError: cannot import name 'send_example_telemetry' from 'transformers.utils'.
```python
# Fix the import error in newer transformers versions.
if send_example_telemetry is not None:
    send_example_telemetry("run_clm_no_trainer", args)
```
I get ImportError: cannot import name 'send_example_telemetry' from 'transformers.utils'.
```python
    default="random_mask",
    choices=["normal", "rademacher", "random_mask", "sjlt", "grass"],
    help="Random projection type used for TRAK/TracIn (default: random_mask).",
)
```
Without these three parameters I get torch.OutOfMemoryError: CUDA out of memory.
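For context, a sketch of the three flags being referred to (defaults mirror the run command above; the exact help strings are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Expose the projector settings so the random projection fits in GPU memory
# instead of being hard-coded.
parser.add_argument("--proj_dim", type=int, default=256,
                    help="Dimension the per-sample gradients are projected to.")
parser.add_argument("--proj_max_batch_size", type=int, default=8,
                    help="Maximum batch size used by the projector.")
parser.add_argument("--proj_type", type=str, default="random_mask",
                    choices=["normal", "rademacher", "random_mask", "sjlt", "grass"],
                    help="Random projection type used for TRAK/TracIn.")
args = parser.parse_args()
```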
```python
input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()
```
Without this change I get IndexError: too many indices for tensor of dimension 2.
```python
# Re-add the batch dimension removed by vmap.
input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()
```
Without this change I get IndexError: too many indices for tensor of dimension 2.
```python
input_ids = input_ids.unsqueeze(0).cuda()
attention_mask = attention_mask.unsqueeze(0).cuda()
labels = labels.unsqueeze(0).cuda()
```
Without this change I get IndexError: too many indices for tensor of dimension 2.
```diff
 if len(parts) == 2 and parts[1].isdigit():
-    num_checkpoints = int(parts[1])
+    requested_checkpoints = int(parts[1])
 else:
```
I ran this again and it works now; I think this modification can be deleted.
```python
checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]

elif method in ["TracIn", "Grad-Dot", "Grad-Cos"]:
```
I ran this again and it works now; I think this modification can be deleted.
```python
    method,
)
checkpoints = [str(p) for p in available_checkpoint_dirs[:requested_checkpoints]]
else:
```
I ran this again and it works now; I think this modification can be deleted.
| "proj_dim": 2048, | ||
| "proj_dim": args.proj_dim, | ||
| "proj_max_batch_size": args.proj_max_batch_size, | ||
| "proj_type": args.proj_type, |
With the hard-coded proj_dim I hit torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated.
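A sketch of the intended change, assuming the surrounding dict is the attributor's projector configuration (the variable name is hypothetical):

```python
# Hypothetical sketch: take the projector settings from the CLI flags instead
# of hard-coding proj_dim=2048, which tried to allocate a projection buffer
# far larger than the 44 GiB GPU can hold.
projector_kwargs = {
    "proj_dim": args.proj_dim,                        # e.g. 256 instead of 2048
    "proj_max_batch_size": args.proj_max_batch_size,  # e.g. 8
    "proj_type": args.proj_type,                      # e.g. "random_mask"
}
```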
| "proj_dim": 2048, | ||
| "proj_dim": args.proj_dim, | ||
| "proj_max_batch_size": args.proj_max_batch_size, | ||
| "proj_type": args.proj_type, |
With the hard-coded proj_dim I hit torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacity of 44.35 GiB of which 41.56 GiB is free. Including non-PyTorch memory, this process has 2.79 GiB memory in use. Of the allocated memory 2.36 GiB is allocated by PyTorch, and 114.69 MiB is reserved by PyTorch but unallocated.
```python
else:
    raise e

new_model.eval()
```
I think this modification can be deleted
```python
try:
    from transformers.utils import send_example_telemetry
except ImportError:
    send_example_telemetry = None  # not available in newer transformers versions
```
Same import error as in TRAK.
Just pin transformers==4.46.0 in requirements.txt.
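That is, instead of guarding the import, pin the dependency; a minimal sketch (the exact requirements file location in the repo is assumed):

```bash
# Pin suggested by the reviewer: transformers 4.46.0 is a version where
# transformers.utils.send_example_telemetry is still importable, so no
# try/except guard is needed in the example scripts.
pip install "transformers==4.46.0"
```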
```python
# Fix the import error in newer transformers versions.
if send_example_telemetry is not None:
    send_example_telemetry("run_clm_no_trainer", args)
```
Same import error as in TRAK.
```diff
-model_id = -1
+model_id = 0  # Use checkpoint 0 (final checkpoint)
 checkpoint = f"{args.output_dir}/{model_id}"
```
With model_id = -1 I get FileNotFoundError: Checkpoint directory not found: /dattri/experiments/gpt2_wikitext/checkpoints/-1. Please ensure the checkpoint exists at this path.
```python
else:
    raise e

model.eval()
```
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '../checkpoints/-1'. Use repo_type argument if needed.
@DanielNi868 please don't paste raw error messages into the PR.
experiments/gpt2_wikitext/readme.md (outdated)
```python
return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
```
This troubleshooting can be avoided by setting the attn_implementation parameter to "eager" in the from_pretrained function.
Just delete the troubleshooting section.
```python
from transformers.pytorch_utils import Conv1D
from dattri.task import AttributionTask

model_id = -1
```
Here we need the fully trained model at id = -1.
```python
checkpoint = f"{args.output_dir}/{model_id}"

def checkpoints_load_func(model, checkpoint):
    model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()
```
What error message did you get for this function on lines 596-597?
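For comparison, a hedged sketch of a loader that sidesteps the repo-id validation error by resolving the local checkpoint directory before calling from_pretrained (the existence check and return value are assumptions, not the PR's code):

```python
from pathlib import Path
from transformers import AutoModelForCausalLM

def checkpoints_load_func(model, checkpoint):
    # Resolve the relative path so from_pretrained sees an existing local
    # directory instead of trying to parse '../checkpoints/-1' as a Hub repo id.
    ckpt_dir = Path(checkpoint).resolve()
    if not ckpt_dir.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {ckpt_dir}")
    model = AutoModelForCausalLM.from_pretrained(str(ckpt_dir)).cuda()
    model.eval()
    return model
```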
…nd unused ssh_config_template.txt
Force-pushed from 032c7ab to aa51bd0
Fix the transformers error (set attn_implementation="eager")
Modify checkpoints_load_func in score_logra.py and score_TRAK.py to fix the Hugging Face loading error
Update the readme and the comment about the vmap error in transformers