
Cpu_fallback for the files #181

Open
Varun-sai-500 wants to merge 2 commits into z-x-yang:main from Varun-sai-500:cpu

Conversation

@Varun-sai-500 (Contributor)

No description provided.

@z-x-yang (Owner) left a comment


Thanks for the CPU-fallback refactor — good direction. I can't approve this yet because there are still blocking issues:

  1. CPU path still breaks in eval.py / train.py
  • The CPU branch passes rank=None into Evaluator/Trainer.
  • But the manager code still assumes GPU-only semantics (cfg.* + rank, torch.cuda.set_device(...), many .cuda(...) usages).
  • This will fail before evaluation/training even starts on CPU.
  2. Single-GPU device-id regression
  • Single-device launch now calls main_worker(args.gpu_id, ...) / main_worker(args.start_gpu, ...).
  • The manager code also adds offsets (cfg.TEST_GPU_ID + rank, cfg.DIST_START_GPU + rank).
  • For non-zero GPU ids this can double-offset and select the wrong device.
  3. CUDA-only timing/memory calls are still unguarded in the evaluator
  • torch.cuda.Event, torch.cuda.synchronize, and torch.cuda.max_memory_allocated still run in code paths that need CPU compatibility.
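To illustrate point 3, here is a minimal sketch of how the CUDA-only timing and memory calls could be guarded so the same code path runs on CPU. The `Timer` and `peak_memory_mb` names are hypothetical helpers, not code from this PR:

```python
import time

# Guard once at import time; everything downstream branches on this flag.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # torch absent entirely; behave like the CPU path
    _HAS_CUDA = False


class Timer:
    """Times a block with torch.cuda.Event on GPU, perf_counter on CPU."""

    def __enter__(self):
        if _HAS_CUDA:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)
            self._start.record()
        else:
            self._t0 = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if _HAS_CUDA:
            self._end.record()
            torch.cuda.synchronize()  # CUDA-only call, now reached only on GPU
            self.elapsed_ms = self._start.elapsed_time(self._end)
        else:
            self.elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        return False


def peak_memory_mb():
    """Peak GPU memory in MB; 0.0 on CPU, where the counter is undefined."""
    if _HAS_CUDA:
        return torch.cuda.max_memory_allocated() / (1024 ** 2)
    return 0.0
```

With this pattern the evaluator can call `Timer()` and `peak_memory_mb()` unconditionally instead of sprinkling `torch.cuda.*` calls through the hot path.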

Please fix these before merge. Suggested direction:

  • Keep rank as an integer (e.g., 0) on CPU;
  • Guard all CUDA-only APIs with torch.cuda.is_available();
  • Keep device selection in one place to avoid double offsets.
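The three bullets above could be combined into a single helper that owns the rank-to-device mapping, so neither the launcher nor the manager adds an offset twice. A sketch with hypothetical names (`resolve_device`, `start_gpu`), assuming `start_gpu` plays the role of cfg.DIST_START_GPU / cfg.TEST_GPU_ID:

```python
def resolve_device(rank, start_gpu=0, use_cuda=True):
    """Map a worker rank to a device string exactly once.

    rank stays an integer (0 on CPU), so downstream code never has to
    special-case rank=None, and the start_gpu offset is applied here
    and nowhere else.
    """
    if not use_cuda:
        return "cpu"
    return f"cuda:{start_gpu + rank}"
```

The launcher would then pass the rank (not a pre-offset GPU id) into main_worker, and the manager would call `resolve_device` instead of computing `cfg.DIST_START_GPU + rank` itself.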
