
Commit 92e190d

committed
vdcore order change
1 parent 552205c commit 92e190d

File tree

1 file changed: +15 −14 lines


content/posts/vdcores.md

Lines changed: 15 additions & 14 deletions
@@ -89,6 +89,21 @@ Hard to tell which one is faster Huh?
Manually morphing between these two schedules requires significant changes to the kernel implementation. With the decoupled cores abstraction, switching between them requires only an **instruction-flow-level change**; all tasks remain composable, without sacrificing performance.

We tried both within 10 minutes with VDCores and got a quick 7% performance gain in this operator.

+## Deep Dive: Turning GPU SMs into Virtual Decoupled Cores
+
+> We turn every SM on the H200 into a pair of memory/compute decoupled cores, connected by message queues, all running at GPU speed!
+
+We built Virtual Decoupled Cores on top of the GPU's SM hardware.
+Making these virtual components keep up with raw GPU speed poses a major performance challenge.
+To reach PFLOPs of compute and multiple terabytes per second of memory bandwidth, every SM cycle counts, and there is only limited headroom for virtual-core overheads.
+
+Our main idea is to build {{< highlight-text >}}**virtual software memory cores and compute cores on top of warps**{{< /highlight-text >}} and let them communicate through explicit queues and ports. VDCores assembles the warps within a single SM into two kinds of "cores" (memory cores and compute cores), implementing a small, software-defined superscalar processor. On the memory side, we expose (i) an **allocation & branch / control unit** and (ii) **configurable load and store units**, all running asynchronously.
+
+<!-- {{< placeholder "VDCores overview with divided responsibility [programmer, runtime]" >}} -->
+
+Recall that VDCores aim to achieve pipelining without programmers explicitly defining it. Under this principle, several designs emerge that further optimize performance while keeping flexibility:
+
+- Instruction issue is ordered, but completion can be out of order. Control flow keeps program order when needed, while the load dispatch unit (LDU) can complete loads out of order (with compiler hints) to unlock overlap.
+
+- Programmable dependencies with software-controlled virtual ports. Control logic routes instructions to load/store "engines" without baking scheduling policy into every kernel.

## Decoupled Cores: In Live Action and in the Wild

@@ -105,18 +120,4 @@ We’ll cover these topics in future posts in this series. Before that, we’re

<hr style="border-top: 1px solid var(--text-color, #ccc); opacity: 0.3; margin: 2rem 0;">

-## Deep Dive: Turning GPU SMs into Virtual Decoupled Cores
-
-> We turn every SM on H200 into a pair of Memory/Compute decoupeld cores, connected by message queues, all run at the speed of GPU!
-
-We materialize the concept of decoupled cores on top of single GPU SM's hardware, and call them **Virtual** Decoupled Cores.
-Making these virtual components keep up with raw GPU speed remains a major performance-engineering challenge. To reach PFLOPs of compute and multi-terabytes-per-second memory bandwidth, every SM cycle counts, and there is only limited headroom for virtual-core overheads.
-
-The main idea is to build {{< highlight-text >}}**virtual software memory cores and compute cores on top of warps**{{< /highlight-text >}}, and let them communicate through explicit queues and ports. VDCores assembles the warps within a single SM into two kinds of "cores" (memory cores and compute cores), implementing a small, software-defined superscalar processor. On the memory side, we expose (i) an **allocation & branch / control unit**, and (ii) **configurable load and store units**, all running asynchronously.

-<!-- {{< placeholder "VDCores overview with divided responsibility [programmer, runtime]" >}} -->
-
-Recall that VDCores aim to achieve pipelining without programmers explictly defining them. Under this principle, some designs emerge to further optimize the performance while keeping the flexibility:
-
-- Instruction issue is ordered but completion can be out-of-order. Control flow keeps program order when needed, while the load dispatch unit (LDU) can complete loads out of order (with compiler hints) to unlock overlap.
-
-- Programmable dependencies with software-controlled virtual ports. Control logic routes instructions to load/store "engines" without baking scheduling policy into every kernel.
-
-- And so much more!
