
AMD GPU (DirectML) Optimization for Live Mode (No README changes)#1726

Open
ozp3 wants to merge 4 commits into hacksider:main from ozp3:amd-dml-optimization-v2

Conversation

ozp3 (Contributor) commented Apr 1, 2026

As requested, this PR contains the exact same code optimizations from #1710, but excludes any modifications to the README.md file.

Summary by Sourcery

Optimize live webcam face swapping for DirectML/AMD GPUs and improve responsiveness and stability in live mode.

Bug Fixes:

  • Ensure face analysis and swapping operations are serialized via a global DirectML lock to avoid concurrent execution issues in live mode.

Enhancements:

  • Reduce default live preview resolution for improved performance, especially on constrained GPUs.
  • Warm up face analyser and face swapper models when starting webcam preview to reduce initial latency.
  • Move face detection from a dedicated thread into the processing loop with cached results updated every few frames to reduce overhead.
  • Increase the idle sleep interval when no frame is available and duplicate frames before GPU color conversion to improve stability and resource usage.
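The serialization described above can be sketched as follows. This is a minimal illustration of the PR's approach, not its exact code: `dml_lock` mirrors the lock the PR adds to `modules.globals`, while `_dml_inference` and the function bodies are stand-ins for the real analyser and swapper calls.

```python
import threading

# Mirrors the PR's modules.globals.dml_lock; bodies below are stand-ins.
dml_lock = threading.Lock()

def _dml_inference(x):
    # Placeholder for an ONNX Runtime / DirectML session call.
    return x * 2

def get_one_face(frame):
    # Analysis and swapping share the one lock, so DML calls never overlap.
    with dml_lock:
        return _dml_inference(frame)

def swap_face(source_face, target_face, temp_frame):
    with dml_lock:
        return _dml_inference(temp_frame)
```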

ozp3 and others added 3 commits April 1, 2026 18:21
remove lnk and bat files as requested
remove lnk and bat files as requested
sourcery-ai bot (Contributor) commented Apr 1, 2026

Reviewer's Guide

Introduces AMD GPU / DirectML-friendly live mode optimizations by serializing DirectML calls with a global lock, adjusting live preview behavior, and simplifying the live webcam detection pipeline while keeping existing functionality intact.

Sequence diagram for live webcam processing with DirectML lock and inline detection

sequenceDiagram
    actor User
    participant UI as ui_webcam_preview
    participant CaptureThread
    participant ProcessingThread
    participant FaceAnalyser
    participant FaceSwapper
    participant DMLLock as modules_globals_dml_lock

    User->>UI: click_start_webcam_preview(camera_index)
    UI->>FaceAnalyser: get_face_analyser()
    UI->>FaceSwapper: get_face_swapper()
    UI->>CaptureThread: start_capture_thread()
    UI->>ProcessingThread: start_processing_thread()
    Note over CaptureThread,ProcessingThread: Detection thread is not started

    loop capture_frames
        CaptureThread->>CaptureThread: read_frame_from_camera()
        CaptureThread-->>ProcessingThread: push_frame_to_capture_queue(frame)
    end

    loop process_frames
        ProcessingThread->>ProcessingThread: pop_frame_from_capture_queue()
        alt every_third_frame
            ProcessingThread->>FaceAnalyser: get_one_face_or_get_many_faces(temp_frame)
            activate FaceAnalyser
            FaceAnalyser->>DMLLock: acquire()
            FaceAnalyser-->>FaceAnalyser: DirectML_inference()
            FaceAnalyser->>DMLLock: release()
            deactivate FaceAnalyser
            ProcessingThread-->>ProcessingThread: update_detection_result_cache()
        else reuse_cached_detection
            ProcessingThread-->>ProcessingThread: read_detection_result_cache()
        end

        ProcessingThread-->>FaceSwapper: swap_face(source_face,target_face,temp_frame)
        activate FaceSwapper
        FaceSwapper->>DMLLock: acquire()
        FaceSwapper-->>FaceSwapper: DirectML_inference()
        FaceSwapper->>DMLLock: release()
        deactivate FaceSwapper

        ProcessingThread-->>UI: push_processed_frame_to_display_queue()
        UI-->>User: show_live_preview_frame()
    end

Class diagram for modules using DirectML lock and live processing changes

classDiagram

    class modules_globals {
        +dml_lock Lock
    }

    class face_analyser {
        +get_face_analyser()
        +get_one_face(frame)
        +get_many_faces(frame)
    }

    class face_swapper {
        +get_face_swapper()
        +swap_face(source_face,target_face,temp_frame)
    }

    class ui_live_webcam {
        +webcam_preview(root,camera_index)
        +create_webcam_preview(camera_index)
        +_processing_thread_func(capture_queue,processed_queue,stop_event,latest_frame_holder,detection_result,detection_lock)
        +_detection_thread_func(latest_frame_holder,detection_result,detection_lock,stop_event)
    }

    modules_globals <.. face_analyser : uses_dml_lock
    modules_globals <.. face_swapper : uses_dml_lock

    face_analyser <.. ui_live_webcam : detection_calls
    face_swapper <.. ui_live_webcam : swapping_calls

    face_analyser : +get_one_face(frame) uses dml_lock
    face_analyser : +get_many_faces(frame) uses dml_lock
    face_swapper : +swap_face(source_face,target_face,temp_frame) uses dml_lock

    ui_live_webcam : +webcam_preview preloads_face_analyser_and_face_swapper
    ui_live_webcam : +_processing_thread_func inlines_detection_every_third_frame
    ui_live_webcam : -_detection_thread_func disabled_in_live_mode

File-Level Changes

Change Details Files
Serialize DirectML-related face analysis and swapping operations to avoid concurrent access issues on AMD GPUs.
  • Wrap face analysis calls in get_one_face and get_many_faces with a global dml_lock to ensure exclusive access to the analyser
  • Wrap face_swapper.get in swap_face with the same global dml_lock to serialize DirectML inference execution
  • Introduce a global dml_lock in modules.globals backed by threading.Lock
modules/face_analyser.py
modules/processors/frame/face_swapper.py
modules/globals.py
Refactor live webcam detection/processing pipeline to run detection inline on the processing thread with frame-skipping and cached results for performance and stability.
  • Disable the separate detection thread and move detection logic into the processing thread
  • Run detection every 3 frames and reuse cached detection results on intermediate frames to reduce DirectML load
  • Adjust handling of target and many_faces detection results to work with the new inline detection flow
  • Increase the wait time when no frame is available to reduce busy-waiting
modules/ui.py
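The frame-skipping logic described in these bullets can be sketched as below. The PR stores its counter differently (as a function attribute on `_processing_thread_func`); here the cadence is shown with a plain loop, and `process_frames`/`DETECT_EVERY` are illustrative names, not the project's API.

```python
DETECT_EVERY = 3  # the PR runs detection on every third frame

def process_frames(frames, detect):
    """Run detect() every DETECT_EVERY frames; reuse the cached result otherwise."""
    cached = None
    out = []
    for i, frame in enumerate(frames):
        if i % DETECT_EVERY == 0 or cached is None:
            cached = detect(frame)   # expensive DirectML inference
        out.append((frame, cached))  # intermediate frames reuse the cache
    return out
```

On intermediate frames the swapper still runs per frame; only the detection result is reused, which is why faces can lag slightly during fast motion but overall DML load drops.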
Adjust live preview and initialization behavior for better stability and performance in live mode.
  • Reduce default preview resolution from 960x540 to 640x360 to lighten GPU/CPU load during live preview
  • Ensure frames are copied before color conversion in the display path to avoid side effects on shared arrays
  • Eagerly initialize face_analyser and face_swapper when starting webcam preview to avoid first-frame stalls
  • Leave a commented-out hook in core.run for preloading face_analyser on GUI startup (no behavioral change)
modules/ui.py
modules/core.py
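The warm-up and copy-before-conversion behaviors above can be illustrated with the following sketch. Function names are hypothetical; in the real code the copy happens before `cv2.cvtColor` on a NumPy frame, and the eager initialization calls the project's `get_face_analyser`/`get_face_swapper`.

```python
def warm_up(get_analyser, get_swapper):
    # Build both models eagerly so the first live frame doesn't stall on lazy init.
    return get_analyser(), get_swapper()

def to_display(frame):
    # Copy before converting so the shared capture buffer isn't mutated in place.
    safe = frame.copy()
    safe[0] = 255  # stand-in for an in-place color-conversion step
    return safe
```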


sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • Now that the detection thread is effectively disabled (det_thread.start() commented out) but _detection_thread_func and detection_lock are still wired up, consider either removing or clearly gating this unused threading code to avoid confusion and accidental re‑activation later.
  • Using a function attribute (_processing_thread_func._det_count) to track detection cadence hides mutable state on the function object; consider moving this counter into a small helper class or a closure-local dict to keep state management more explicit and testable.
  • The new dml_lock is a global lock shared by both face analysis and swapping; if future work adds more DirectML consumers, it may be worth encapsulating this in a dedicated DML/ORT execution manager to avoid ad‑hoc locking scattered across modules.
## Individual Comments

### Comment 1
<location path="modules/globals.py" line_range="75-76" />
<code_context>

 # --- END OF FILE globals.py ---
+
+import threading
+dml_lock = threading.Lock()
</code_context>
<issue_to_address>
**suggestion (performance):** Using a single global DML lock for both analysis and swapping may cause unnecessary serialization

`dml_lock` is now taken for both `get_one_face`/`get_many_faces` and `swap_face`, which serializes all ONNX/DML work and removes parallelism between analysis and swapping. This may become a throughput bottleneck on capable hardware. If the root issue is driver/ORT instability under concurrency, consider narrowing the lock scope (e.g., per-session/per-ORT instance) or clearly documenting where/when this global lock must be used so future changes don’t over‑serialize work.
</issue_to_address>
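The narrower, per-session locking the reviewer suggests could look like the sketch below: each ORT/DML session gets its own lock, so analysis and swapping can overlap while any single session is still used by one thread at a time. The `DMLSession` class and its stand-in inference callables are hypothetical, not part of the PR.

```python
import threading

class DMLSession:
    """Hypothetical wrapper: one lock per ORT/DML session instead of one
    global lock, so distinct sessions can run concurrently."""
    def __init__(self, run_fn):
        self._run = run_fn           # stand-in for session.run(...)
        self._lock = threading.Lock()

    def run(self, *args):
        with self._lock:
            return self._run(*args)

analyser_session = DMLSession(lambda frame: frame)
swapper_session = DMLSession(lambda src, frame: frame)
```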


ozp3 (Author) commented Apr 1, 2026

lmk if you want something else

# single thread doubles cuda performance - needs to be set before torch import
if any(arg.startswith('--execution-provider') for arg in sys.argv):
-    os.environ['OMP_NUM_THREADS'] = '1'
+    os.environ['OMP_NUM_THREADS'] = '6'
A reviewer commented:

Why? The comment above this literally tells you why it was set at 1

ozp3 (Author) replied:

No cuda for AMD gpus

The reviewer replied:

Yes I know. But I'm pretty sure that threading change affects every type of computer doesn't it? There's no hardware specific threading change
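One way to resolve this disagreement would be to gate the thread count on which execution provider was actually selected, keeping the single-thread CUDA speedup while relaxing the limit only for non-CUDA providers. The helper below is a hedged sketch of that idea; `configure_omp_threads` and the specific values are hypothetical, not code from the PR.

```python
import os

def configure_omp_threads(argv):
    # Parse --execution-provider from argv; supports "--flag value"
    # and "--flag=value" forms. Helper name and policy are hypothetical.
    provider = None
    for i, arg in enumerate(argv):
        if arg.startswith('--execution-provider'):
            if '=' in arg:
                provider = arg.split('=', 1)[1]
            elif i + 1 < len(argv):
                provider = argv[i + 1]
    if provider == 'cuda':
        os.environ['OMP_NUM_THREADS'] = '1'  # single thread doubles CUDA perf
    elif provider is not None:
        os.environ['OMP_NUM_THREADS'] = '6'  # looser limit for e.g. DirectML
```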
