AMD GPU (DirectML) Optimization for Live Mode (No README changes) #1726
ozp3 wants to merge 4 commits into hacksider:main
remove lnk and bat files as requested
Reviewer's Guide
Introduces AMD GPU / DirectML-friendly live mode optimizations by serializing DirectML calls with a global lock, adjusting live preview behavior, and simplifying the live webcam detection pipeline while keeping existing functionality intact.
Sequence diagram for live webcam processing with DirectML lock and inline detection
sequenceDiagram
actor User
participant UI as ui_webcam_preview
participant CaptureThread
participant ProcessingThread
participant FaceAnalyser
participant FaceSwapper
participant DMLLock as modules_globals_dml_lock
User->>UI: click_start_webcam_preview(camera_index)
UI->>FaceAnalyser: get_face_analyser()
UI->>FaceSwapper: get_face_swapper()
UI->>CaptureThread: start_capture_thread()
UI->>ProcessingThread: start_processing_thread()
Note over CaptureThread,ProcessingThread: Detection thread is not started
loop capture_frames
CaptureThread->>CaptureThread: read_frame_from_camera()
CaptureThread-->>ProcessingThread: push_frame_to_capture_queue(frame)
end
loop process_frames
ProcessingThread->>ProcessingThread: pop_frame_from_capture_queue()
alt every_third_frame
ProcessingThread->>FaceAnalyser: get_one_face_or_get_many_faces(temp_frame)
activate FaceAnalyser
FaceAnalyser->>DMLLock: acquire()
FaceAnalyser-->>FaceAnalyser: DirectML_inference()
FaceAnalyser->>DMLLock: release()
deactivate FaceAnalyser
ProcessingThread-->>ProcessingThread: update_detection_result_cache()
else reuse_cached_detection
ProcessingThread-->>ProcessingThread: read_detection_result_cache()
end
ProcessingThread-->>FaceSwapper: swap_face(source_face,target_face,temp_frame)
activate FaceSwapper
FaceSwapper->>DMLLock: acquire()
FaceSwapper-->>FaceSwapper: DirectML_inference()
FaceSwapper->>DMLLock: release()
deactivate FaceSwapper
ProcessingThread-->>UI: push_processed_frame_to_display_queue()
UI-->>User: show_live_preview_frame()
end
Class diagram for modules using DirectML lock and live processing changes
classDiagram
class modules_globals {
+dml_lock Lock
}
class face_analyser {
+get_face_analyser()
+get_one_face(frame)
+get_many_faces(frame)
}
class face_swapper {
+get_face_swapper()
+swap_face(source_face,target_face,temp_frame)
}
class ui_live_webcam {
+webcam_preview(root,camera_index)
+create_webcam_preview(camera_index)
+_processing_thread_func(capture_queue,processed_queue,stop_event,latest_frame_holder,detection_result,detection_lock)
+_detection_thread_func(latest_frame_holder,detection_result,detection_lock,stop_event)
}
modules_globals <.. face_analyser : uses_dml_lock
modules_globals <.. face_swapper : uses_dml_lock
face_analyser <.. ui_live_webcam : detection_calls
face_swapper <.. ui_live_webcam : swapping_calls
face_analyser : +get_one_face(frame) uses dml_lock
face_analyser : +get_many_faces(frame) uses dml_lock
face_swapper : +swap_face(source_face,target_face,temp_frame) uses dml_lock
ui_live_webcam : +webcam_preview preloads_face_analyser_and_face_swapper
ui_live_webcam : +_processing_thread_func inlines_detection_every_third_frame
ui_live_webcam : -_detection_thread_func disabled_in_live_mode
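The flow in the two diagrams can be sketched in a few lines of Python. This is a minimal illustration and not the PR's actual code: `dml_lock` here is a local stand-in for the lock in `modules/globals.py`, and `locked_call`, `process_frames`, `detect`, and `swap` are hypothetical names chosen for the example. Every inference call holds the shared lock, and detection runs on every third frame with the cached result reused in between.

```python
import threading

# Local stand-in for the dml_lock described in the diagrams (assumption).
dml_lock = threading.Lock()

DETECT_EVERY_N_FRAMES = 3  # cadence taken from the sequence diagram


def locked_call(fn, *args):
    """Run one inference call while holding the shared DirectML lock."""
    with dml_lock:
        return fn(*args)


def process_frames(frames, detect, swap):
    """Detect on every third frame and reuse the cached result otherwise.

    `detect` and `swap` are hypothetical callables standing in for the
    face analyser and face swapper.
    """
    cached_faces = None
    output = []
    for index, frame in enumerate(frames):
        # Refresh the cache on schedule, or when nothing is cached yet.
        if cached_faces is None or index % DETECT_EVERY_N_FRAMES == 0:
            cached_faces = locked_call(detect, frame)
        output.append(locked_call(swap, cached_faces, frame))
    return output
```

Because both calls go through the same lock, detection and swapping never overlap, which is the serialization the reviewer's guide describes.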
Hey - I've found 1 issue, and left some high level feedback:
- Now that the detection thread is effectively disabled (`det_thread.start()` commented out) but `_detection_thread_func` and `detection_lock` are still wired up, consider either removing or clearly gating this unused threading code to avoid confusion and accidental re‑activation later.
- Using a function attribute (`_processing_thread_func._det_count`) to track detection cadence hides mutable state on the function object; consider moving this counter into a small helper class or a closure-local dict to keep state management more explicit and testable.
- The new `dml_lock` is a global lock shared by both face analysis and swapping; if future work adds more DirectML consumers, it may be worth encapsulating this in a dedicated DML/ORT execution manager to avoid ad‑hoc locking scattered across modules.
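As a rough illustration of the second point, the counter could live in a small helper object instead of a function attribute. The class below is a hypothetical sketch: `DetectionCadence` and `should_detect` are names invented for this example and do not appear in the PR.

```python
class DetectionCadence:
    """Tracks when detection should run (hypothetical helper, not from the PR)."""

    def __init__(self, every_n: int = 3) -> None:
        if every_n < 1:
            raise ValueError("every_n must be at least 1")
        self.every_n = every_n
        self._count = 0

    def should_detect(self) -> bool:
        """Return True on the first call and then on every n-th call after it."""
        run_now = self._count % self.every_n == 0
        self._count += 1
        return run_now
```

The processing thread would create one instance per preview session and ask `should_detect()` once per frame, which keeps the state local and easy to reset or test in isolation.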
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Now that the detection thread is effectively disabled (`det_thread.start()` commented out) but `_detection_thread_func` and `detection_lock` are still wired up, consider either removing or clearly gating this unused threading code to avoid confusion and accidental re‑activation later.
- Using a function attribute (`_processing_thread_func._det_count`) to track detection cadence hides mutable state on the function object; consider moving this counter into a small helper class or a closure-local dict to keep state management more explicit and testable.
- The new `dml_lock` is a global lock shared by both face analysis and swapping; if future work adds more DirectML consumers, it may be worth encapsulating this in a dedicated DML/ORT execution manager to avoid ad‑hoc locking scattered across modules.
## Individual Comments
### Comment 1
<location path="modules/globals.py" line_range="75-76" />
<code_context>
# --- END OF FILE globals.py ---
+
+import threading
+dml_lock = threading.Lock()
</code_context>
<issue_to_address>
**suggestion (performance):** Using a single global DML lock for both analysis and swapping may cause unnecessary serialization
`dml_lock` is now taken for both `get_one_face`/`get_many_faces` and `swap_face`, which serializes all ONNX/DML work and removes parallelism between analysis and swapping. This may become a throughput bottleneck on capable hardware. If the root issue is driver/ORT instability under concurrency, consider narrowing the lock scope (e.g., per-session/per-ORT instance) or clearly documenting where/when this global lock must be used so future changes don’t over‑serialize work.
</issue_to_address>
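One way to read the "per-session" suggestion is to give each inference session its own lock, so calls into the same session are serialized while different sessions can still run in parallel. The wrapper below is a hypothetical sketch: `LockedSession` is not part of the PR, and it only assumes the wrapped object exposes a `run` method. Whether per-session locking is enough to avoid the DirectML instability that motivated the global lock is not established here.

```python
import threading


class LockedSession:
    """Wraps one inference session with its own lock (hypothetical helper)."""

    def __init__(self, session):
        self._session = session
        self._lock = threading.Lock()

    def run(self, *args, **kwargs):
        # Serializes calls into this session only; other sessions stay free.
        with self._lock:
            return self._session.run(*args, **kwargs)
```

If the driver turns out to need process-wide serialization, the same wrapper could be handed a shared lock instead of creating its own, which keeps the locking policy in one place.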
lmk if you want something else
  # single thread doubles cuda performance - needs to be set before torch import
  if any(arg.startswith('--execution-provider') for arg in sys.argv):
-     os.environ['OMP_NUM_THREADS'] = '1'
+     os.environ['OMP_NUM_THREADS'] = '6'
Why? The comment above this literally tells you why it was set at 1
Yes, I know. But I'm pretty sure that threading change affects every type of computer, doesn't it? There's no hardware-specific threading change.
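For reference, a provider-specific setting would be one way to keep the existing CUDA value while letting DirectML use a different one. The sketch below is hypothetical and not part of the PR: `pick_omp_threads` is an invented helper, the values `'1'` and `'6'` are simply the two values from the diff, and the provider spellings `directml` and `dml` are assumptions about how the flag is passed.

```python
import os
import sys


def pick_omp_threads(argv, default="1", directml="6"):
    """Return an OMP_NUM_THREADS value, or None to leave the variable unset.

    Hypothetical helper: keeps the single-thread setting for other providers
    and uses a separate value only when DirectML is requested.
    """
    if not any(arg.startswith("--execution-provider") for arg in argv):
        return None
    # Assumed spellings of the DirectML provider on the command line.
    wants_directml = any(arg.lower().endswith(("directml", "dml")) for arg in argv)
    return directml if wants_directml else default


# Must run before torch is imported, as the original comment notes.
threads = pick_omp_threads(sys.argv)
if threads is not None:
    os.environ["OMP_NUM_THREADS"] = threads
```

With this shape, the CUDA path is unchanged and the DirectML value can be tuned without affecting other hardware.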
As requested, this PR contains the exact same code optimizations from #1710, but excludes any modifications to the README.md file.
Summary by Sourcery
Optimize live webcam face swapping for DirectML/AMD GPUs and improve responsiveness and stability in live mode.
Bug Fixes:
Enhancements: