Manually morphing between these two schedules requires significant changes to the kernel implementation. With the decoupled cores abstraction, switching between them requires only an **instruction-flow-level change**; all tasks remain composable, without sacrificing performance.
We tried both within 10 minutes with VDCores, and got a quick 7% performance gain in this operator.
## Deep Dive: Turning GPU SMs into Virtual Decoupled Cores
> We turn every SM on an H200 into a pair of memory/compute decoupled cores, connected by message queues, all running at GPU speed!
We built Virtual Decoupled Cores on top of the GPU's SM hardware.
Making these virtual components keep up with raw GPU speed poses a major performance challenge.
To reach PFLOPs of compute and multi-terabytes-per-second memory bandwidth, every SM cycle counts, and there is only limited headroom for virtual-core overheads.
Our main idea is to build {{< highlight-text >}}**virtual software memory cores and compute cores on top of warps**{{< /highlight-text >}}, and let them communicate through explicit queues and ports. VDCores assembles the warps within a single SM into two kinds of "cores" (memory cores and compute cores), implementing a small, software-defined superscalar processor. On the memory side, we expose (i) an **allocation & branch / control unit**, and (ii) **configurable load and store units**, all running asynchronously.
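As a rough illustration of the warp-level split (not the actual VDCores implementation; kernel name, queue depth, and tile size are all made up for this sketch), one warp in a block can play the memory core and another the compute core, with a shared-memory ring queue as the message channel between them:

```cuda
// Illustrative sketch only -- not the VDCores implementation. One warp acts
// as the "memory core", another as the "compute core"; a shared-memory ring
// queue is the message channel between them. Launch with blockDim.x == 64.
#define QUEUE_DEPTH 4
#define TILE 256  // floats per queue slot

__global__ void decoupled_scale(const float* __restrict__ x,
                                float* __restrict__ y,
                                float a, int n_tiles) {
    __shared__ float queue[QUEUE_DEPTH][TILE];
    __shared__ volatile int head, tail;  // tiles produced / consumed so far
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (threadIdx.x == 0) { head = 0; tail = 0; }
    __syncthreads();

    if (warp == 0) {  // virtual memory core: issue loads, fill the queue
        for (int t = 0; t < n_tiles; ++t) {
            while (head - tail == QUEUE_DEPTH) {}   // back-pressure: full
            for (int i = lane; i < TILE; i += 32)
                queue[head % QUEUE_DEPTH][i] = x[t * TILE + i];
            __syncwarp();
            __threadfence_block();                  // make the tile visible
            if (lane == 0) head = head + 1;         // publish the tile
        }
    } else {          // virtual compute core: drain the queue, do the math
        for (int t = 0; t < n_tiles; ++t) {
            while (tail == head) {}                 // wait for a tile
            for (int i = lane; i < TILE; i += 32)
                y[t * TILE + i] = a * queue[tail % QUEUE_DEPTH][i];
            __syncwarp();
            if (lane == 0) tail = tail + 1;         // retire the slot
        }
    }
}
```

The queue is what decouples the two roles: the memory warp can run several tiles ahead of the compute warp, so global-memory latency on tile `t+1` overlaps with arithmetic on tile `t`.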
Recall that VDCores aim to achieve pipelining without programmers explicitly defining it. Under this principle, several designs emerge that further optimize performance while preserving flexibility:
- Instruction issue is ordered but completion can be out-of-order. Control flow keeps program order when needed, while the load dispatch unit (LDU) can complete loads out of order (with compiler hints) to unlock overlap.
- Programmable dependencies with software-controlled virtual ports. Control logic routes instructions to load/store “engines” without baking scheduling policy into every kernel.
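The first design point, in-order issue with out-of-order completion, has a direct analogue in stock CUDA: `cp.async` commit groups let a thread issue loads in program order while the memory system completes them asynchronously. A minimal double-buffered sketch (again illustrative, not the VDCores code; the kernel name and tile sizing are assumptions):

```cuda
#include <cuda_pipeline.h>
#define STAGES 2    // in-flight load groups (double buffering)
#define TILE   256  // floats per tile

__global__ void pipelined_scale(const float* __restrict__ x,
                                float* __restrict__ y,
                                float a, int n_tiles) {
    __shared__ float buf[STAGES][TILE];
    // Issue loads for the first STAGES tiles in program order; each
    // __pipeline_commit() closes one group, but groups complete whenever
    // the memory system delivers them.
    for (int t = 0; t < STAGES && t < n_tiles; ++t) {
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            __pipeline_memcpy_async(&buf[t][i], &x[t * TILE + i],
                                    sizeof(float));
        __pipeline_commit();
    }
    for (int t = 0; t < n_tiles; ++t) {
        // Only wait for the *oldest* group; younger loads stay in flight
        // and overlap with the compute below.
        __pipeline_wait_prior(STAGES - 1);
        __syncthreads();
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            y[t * TILE + i] = a * buf[t % STAGES][i];
        __syncthreads();
        int next = t + STAGES;                      // refill the freed slot
        if (next < n_tiles) {
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                __pipeline_memcpy_async(&buf[next % STAGES][i],
                                        &x[next * TILE + i], sizeof(float));
            __pipeline_commit();
        }
    }
}
```

In hardware-pipeline terms, `__pipeline_commit()` is the in-order issue point and `__pipeline_wait_prior()` is the selective retirement check, which is the same contract the text describes for the load dispatch unit.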
## Decoupled Cores: In Live Action and in the Wild