Commit 0552df7
Merge remote-tracking branch 'oss/main' into HEAD
2 parents b1dc287 + 04bf2c3 commit 0552df7

File tree: 185 files changed, +8066 −2836 lines


.github/workflows/integration-tests-amd.yml

Lines changed: 6 additions & 1 deletion

@@ -116,13 +116,18 @@ jobs:
   echo "Could not find '${INSTRUMENTATION_LIB_DIR}'" ; exit -1
 fi
 
+# Install hip-python
+pip install -i https://test.pypi.org/simple/ hip-python
+
 # Test gluon
 pytest --capture=tee-sys -rfs -n 8 python/test/gluon/
 
 pytest --capture=tee-sys -rfs python/tutorials/06-fused-attention.py
 pytest --capture=tee-sys -rfs -n 8 third_party/amd/python/test/ \
   --ignore=third_party/amd/python/test/test_scalarize_packed_fops.py \
-  --ignore=third_party/amd/python/test/test_address_sanitizer.py
+  --ignore=third_party/amd/python/test/test_address_sanitizer.py \
+  --ignore=third_party/amd/python/test/test_gluon_gfx1250.py
+pytest --capture=tee-sys -rfs -n 8 third_party/amd/python/test/test_gluon_gfx1250.py -k "test_compile"
 TRITON_ALWAYS_COMPILE=1 pytest --capture=tee-sys -rfs third_party/amd/python/test/test_scalarize_packed_fops.py
 cd python/test/unit
 pytest --capture=tee-sys -rfs -n 12 \
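The step added above installs hip-python (AMD's Python bindings for the HIP runtime) from Test PyPI before the AMD suites run. As a hedged illustration of what those bindings look like, a minimal smoke test might read as follows; the error-code-first tuple convention follows hip-python's documented examples, and the helper itself is illustrative:

```python
# Minimal smoke-test sketch for hip-python, assuming the documented
# convention that every binding returns its hipError_t first.
from hip import hip

def hip_check(result):
    # hip-python calls return (err,) or (err, value, ...)
    err, *values = result if isinstance(result, tuple) else (result,)
    if err != hip.hipError_t.hipSuccess:
        raise RuntimeError(f"HIP call failed: {err}")
    return values[0] if len(values) == 1 else values

count = hip_check(hip.hipGetDeviceCount())
print(f"visible HIP devices: {count}")
```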

.github/workflows/runner-preparation.yml

Lines changed: 1 addition & 1 deletion

@@ -37,7 +37,7 @@ jobs:
 - name: Detect if build deps (e.g. LLVM hash) changed
   id: detect-change
   if: github.event_name == 'push'
-  uses: tj-actions/changed-files@v46
+  uses: tj-actions/changed-files@v47
   with:
     files: |
      cmake/*.txt

README.md

Lines changed: 24 additions & 4 deletions

@@ -1,15 +1,26 @@
-<div align="center">
-  <img src="https://lh5.googleusercontent.com/wzQKEsTFkrgNQO9JjhGH5wFvslJr1saLtLaJ_a6Fp_gNENpvt3VG7BmztwngU9hFJaU4CPwGiw1opQtDvTkLrxWRbO_a12Q-pdESWHgtmheIHcPbOL5ZMC4TSiJVe5ty1w=w3517" alt="Triton logo">
-</div>
 
 | **`Documentation`** | **`Nightly Wheels`** |
 |-------------------- | -------------------- |
 | [![Documentation](https://github.com/triton-lang/triton/actions/workflows/documentation.yml/badge.svg)](https://triton-lang.org/) | [![Wheels](https://github.com/triton-lang/triton/actions/workflows/wheels.yml/badge.svg)](https://github.com/triton-lang/triton/actions/workflows/wheels.yml) |
 
-# Conference Registration
+# Triton Conference 2025
+
+![Triton Registration Banner](https://github.com/user-attachments/assets/b4b6972a-857c-417f-bf2c-f16f38a358c0)
+
+### Registration
 
 The 3rd Triton conference is scheduled to take place on October 21, 2025. Click [here](https://tritonconference.eventbuilder.com/TritonDeveloperConference) to register!
 
+### Poster Submission
+
+We invite members of the Triton community who are attending the Triton Developer Conference to present posters about their Triton-related technical work.
+
+Please submit basic information about your poster, including author information and an abstract, using this [form](https://forms.gle/QfgTF8o1CWNENAnA7).
+
+**Important Dates**
+- Submission: 10/1/2025
+- Author notification: 10/7/2025
+- Final version (PDF): 10/14/2025
 
 # Triton
 
@@ -251,6 +262,15 @@ export TRITON_OVERRIDE_DIR=<override_dir>
 # Step 4: Run the kernel again to see the overridden result
 ```
 
+**Compiler Pipeline Inspection Steps**
+To introspect the pipeline `add_stages` before running your kernels, simply set
+the add_stages_inspection_hook like so:
+
+```python
+def inspect_stages(_self, stages, options, language, capability):
+    # inspect or modify add_stages here
+    ...
+triton.knobs.runtime.add_stages_inspection_hook = inspect_stages
+```
 
 # Changelog
 
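A hedged, runnable elaboration of the hook added above: the body just logs the registered stage names. The hook signature comes from the README snippet; the assumption that `stages` is a dict keyed by stage name (so `.keys()` reflects the lowering order) matches how Triton backends populate `add_stages`, but treat this as a sketch rather than a guaranteed API:

```python
import triton

def inspect_stages(_self, stages, options, language, capability):
    # `stages` maps pipeline stage names to the callables that build them;
    # logging the keys shows the lowering order without modifying anything.
    print("add_stages pipeline:", list(stages.keys()))

# Register the hook before any kernel runs, per the README above.
triton.knobs.runtime.add_stages_inspection_hook = inspect_stages
```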

docs/meetups/09-03-2025/notes.md

Lines changed: 108 additions & 0 deletions

@@ -0,0 +1,108 @@ (new file; all lines added)

# Agenda:
* Intros: Cicie Wang and Whitney Tsang (co-organizers).
* Multi-pass profiler (MPP) - a federated GPU tooling framework for orchestrated and LLM-agentic profiling applications (Kevin Fang, et al., Meta)
* Triton Developer Conference updates (Ofer Dekel, Microsoft)
* Q> Who is using tritonbench? How are you using it? OpenAI? (Cicie Wang, Meta)
* Q> Triton testing strategy - what do folks think? What are we missing? Where would you like to see additional coverage? (Bill Yoshimi, Meta)
* Q> Free-threaded Python - any plans for making Triton compatible with free threading? (Bill Yoshimi, Meta)
* Open mic for other topics.

# Notes:
* MPP
  * Lots of new DSLs (like Gluon and TLX) and profilers.
  * Working with Keren from OpenAI on profiling:
    * Integrated with the compiler
    * Supports the new DSLs
    * Structure-level profiling timelines
    * Operator-level latency
    * See the OSDI '25 paper (accepted)
  * Approach
    * Connect tools like profilers and LLM agents to different profiling backends (Proton, ncu, nvbit, etc.)
  * Requirements
    * Programmable interfaces
    * Eager execution (makes debugging easier)
    * Amenable to parallelization
    * Sandboxing - e.g., to give agents a clean environment for experiments
    * Debuggable
  * Prototype
    * Data structures - program IR, execution traces, performance reports
    * Abstractions - tasks and jobs (jobs can be nested)
  * System architecture
    * Job graph
    * MPP runtime - schedules tasks, eager execution
    * Backend - state caching, GPU/CPU pools, DB for error recovery
  * Case study 1: profiling async operations
    * Sometimes difficult because some resources are shared.
    * Run multiple passes and measure statistical metrics.
    * Statistical timeline view.
    * MPP shows the distribution of execution times (P20, P50, P80); a sketch of the multi-pass idea follows the Q&A below.
  * Case study 2: Triton PGO agent
    * Phases/agents: profiling, summary, optimizer
      * Profiling: gets profile results
      * Summary: compresses the context window, generates a TL;DR
      * Optimizer: rewrites the kernel to improve performance
    * Experimenting with TTGIR rewrites.
    * Examples: identifies sections with high execution variation; identifies critical paths and suggests how to shorten them.
    * Results: compared against no profiling and NCU, MPP gave a 7-12% improvement.
    * Failure modes:
      * Kernel results change
      * Deadlocks
  * Case study 3: fine-grained IPC
    * Timing from the Proton intra-kernel profiler
    * Instruction-type stats from nvbit or cutracer (developed by Meta)
    * Can identify register pressure.
  * Conclusion
    * Built on top of Proton, orchestrating profiling workflows
    * Soon to be open source

Q> How difficult would it be to add other GPU vendors, like AMD?

A> If your backend can give you the data, we can do it. We didn't do it because we were interested in warp specialization. It's general, and you can implement the interface API.

Q> Have you experimented with using the optimizer to rewrite assembly code?

A> The demo used TTGIR, but you can create an agent that rewrites PTX or assembly.

Q> Did you need to write a prompt for the agent?

A> Yes. It's a very simple prompt.
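The statistical timeline view described in case study 1 can be approximated outside MPP. A minimal sketch, assuming only a blocking `launch()` callable (this is not MPP's API, just an illustration of multi-pass percentile reporting):

```python
import statistics
import time

def profile_multipass(launch, passes=100):
    """Time `launch()` repeatedly and report percentile statistics,
    mimicking the multi-pass statistical view described in the notes."""
    samples_ms = []
    for _ in range(passes):
        start = time.perf_counter()
        launch()  # assumed to block until the kernel finishes
        samples_ms.append((time.perf_counter() - start) * 1e3)
    # statistics.quantiles with n=100 yields 99 cut points; index k-1 is Pk.
    qs = statistics.quantiles(samples_ms, n=100)
    return {"p20": qs[19], "p50": qs[49], "p80": qs[79]}

# Example (hypothetical kernel launch): profile_multipass(lambda: my_kernel[grid](args))
```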
* Triton conference updates (Ofer Dekel, MSFT)
  * [https://aka.ms/tritonconference2025](https://aka.ms/tritonconference2025)
  * Schedule
    * Please show up to the happy hour to mingle (probably the most important part).
    * Register - you'll need it for the live stream too. Sorry, you will not be able to register on the day of the conference.
    * When you register, your status is pending; approval can take up to a week. (Why? It goes through Microsoft security review.)
    * Please register with your institutional/professional email rather than a Yahoo/Gmail/generic email; generic emails take longer to approve. You can ping Ofer if you haven't seen your approval after 8+ days.
    * There will be buses to the venue from SF.
    * Visa letter? Register soon so we can get you an invitation letter.
  * Program
    * Phil & Thomas - Triton: today and beyond
    * Mark Saroufim - GPU MODE: the state of Triton
    * Jason Ansel - Helion: a higher-level DSL for kernel authoring
    * Keren Zhou (George Mason) & Kevin Fang - Proton: portable performance profiling
    * Lixun Zhang (AMD) - No warm-up needed: Triton day-one speed on AMD GPUs
    * Chris Sullivan (Nvidia) - Nvidia Blackwell GPU backend for Triton
    * Peter Bell (OpenAI) - Gluon: tile-based GPU programming with low-level control
    * Hongtao Y (Meta) - TLX
    * Wenlei Bao (ByteDance) - Triton-distributed: computation and communication overlapping
    * Yanming Chen (LinkedIn) - Evolution of Liger kernels to post-training
* Q> Who is using tritonbench? How are you using it? OpenAI?
  * Kernelize.ai - testing vLLM with tritonbench nightly. Built a visualization (noticed H100 and B200 regressions on Liger kernels and BF16).
  * OpenAI - not using tritonbench; using an internal benchmarking system. Low-tech stuff, OCaml (some of it is open-sourced in the repo). Simple benchmarking.
  * Q> No new kernels added?
    * A> We're continuously updating them and thinking of upstreaming more (attention, for example), but there's no timeline. We are keeping MoE updated.
* Q> Triton testing strategy - what do folks think? What are we missing? Where would you like to see additional coverage?
  * Ettore - wants to see more lit test coverage: it doesn't require a GPU and is easier and faster to run than testing operators end to end.
  * The 20K unit tests are good, but the way to get better coverage is to beef up the lit tests. GPU tests should live in the third-party directories.
  * Alex Baden: for important kernels, IR diffing! Cheaper to run (if the IR doesn't change, you shouldn't have a regression); use LLVM tooling to eliminate whitespace changes. **For important kernels, extract & compare IR changes** (see the sketch after these notes).
* Q> What is the free-threaded Python strategy?
  * Lots of things to fix in the front end (the backend is pretty thread-safe).
  * But it's not high on the list of work we're doing (OpenAI).
* Q> Flex attention: update comments/docs to use tensor descriptors instead of TMA (unless TMA is really being referenced).
  * PyTorch flex attention uses tensor descriptors, but comments/code reference TMA. Reaching out to the owners of the flex attention PyTorch inductor template kernels to update the comments and code; it is confusing for people who use GPUs that don't implement TMA.
  * Ettore: FlexAttention FWD uses tensor descriptors but BWD doesn't; can someone add tensor descriptor support?

# Minutes
* Recording link [here](https://youtu.be/Ji1rCo6qvXc)
* MPP presentation link [here](https://tinyurl.com/4r7cfzhu)
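A minimal sketch of the IR-diffing idea raised in the testing discussion, assuming Triton's compiled-kernel artifact exposes the TritonGPU IR under `asm["ttgir"]` (true for recent releases); the helper name and golden-file path are hypothetical:

```python
import difflib
import pathlib

def assert_ir_unchanged(compiled_kernel, golden_path):
    """Compare a compiled kernel's TTGIR against a checked-in golden file.

    `compiled_kernel.asm["ttgir"]` holds the TritonGPU IR in recent Triton
    releases; `golden_path` is a hypothetical location for the reference IR.
    If the IR matches, there should be no regression to chase.
    """
    current = compiled_kernel.asm["ttgir"].strip().splitlines()
    golden = pathlib.Path(golden_path).read_text().strip().splitlines()
    if current != golden:
        diff = "\n".join(difflib.unified_diff(golden, current, lineterm=""))
        raise AssertionError(f"TTGIR drifted from {golden_path}:\n{diff}")
```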

docs/meetups/for_moderators/README.md

Lines changed: 1 addition & 0 deletions

@@ -124,3 +124,4 @@ If this is your first time using Microsoft Teams, work with the meeting creator
 | ---- | ---------- | ------------ | --------- |
 | 2025-05-01 | [Link](https://tinyurl.com/mr397f6x) | Topic: what are plans for existing block pointer programming model? (Context: Intel GPU backend relies heavily on it and will need time to fully move to tensor descriptor programming model.) - Jianhui Li, Intel <br/> Topic: infrastructure for Triton performance tests - Sayce, Google <br/> Topic: what talks/tutorials/open discussions would you like to see at the 2025 Triton Developers' Summit? How can we help? - Adnan Aziz, Meta | https://www.youtube.com/watch?v=W16BrXc5BYE |
 | 2025-07-09 | [Link](https://tinyurl.com/mus5wyax) | Topic: Gluon update - Jeff Niu, OpenAI <br/> Topic: Interest and requirements for a nightly performance regression suite - Simon Waters, kernelize.ai <br/> Topic: Triton developer's summit update - Ofer Dekel, Microsoft | https://youtu.be/zoSY_WXHmF0 |
+| 2025-09-03 | [Link](https://tinyurl.com/4r7cfzhu) | Topic: Intros - Cicie Wang and Whitney Tsang (co-organizers) <br/> Topic: Multi-pass profiler - a federated GPU tooling framework for orchestrated and LLM-agentic profiling applications (Kevin Fang, et al., Meta) <br/> Topic: Triton Developer Conference updates (Ofer Dekel, Microsoft) <br/> Topic: Q> Who is using tritonbench? How are you using it? OpenAI? (Cicie Wang, Meta) <br/> Topic: Triton testing strategy - what do folks think? What are we missing? Where would you like to see additional coverage? (Bill Yoshimi, Meta) <br/> Topic: Q> Free-threaded Python - any plans for making it compatible with free threading? (Bill Yoshimi, Meta) | https://youtu.be/Ji1rCo6qvXc |

include/triton/Dialect/Triton/IR/TritonOpInterfaces.td

Lines changed: 5 additions & 0 deletions

@@ -106,6 +106,11 @@ def TT_DescriptorStoreLikeOpInterface : OpInterface<"DescriptorStoreLikeOpInterf
       /*retType=*/"::mlir::TypedValue<mlir::RankedTensorType>",
       /*methodName=*/"getSrc",
       /*args=*/(ins)>,
+    InterfaceMethod<
+      /*desc=*/"Get mutable source tensor",
+      /*retType=*/"::mlir::OpOperand&",
+      /*methodName=*/"getSrcMutable",
+      /*args=*/(ins)>,
   ];
 }

include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td

Lines changed: 20 additions & 16 deletions

@@ -428,7 +428,7 @@ and `pN` to mean padding:
 x1, x3, p2, p3
 ...]
 
-2. 2D single interval-padding with rearanged rows.
+2. 2D single interval-padding with rearranged rows.
 
 #ttg.padded_shared<[16:+1] {offset = [[0, 1], [0, 2], /*gap, stride by 2 rows*/[2, 0], [4, 0], [1, 0]]], block = []}>
 [
@@ -1202,15 +1202,16 @@ def AMDWmmaEncodingAttr : DistributedEncoding<"AMDWmmaEncoding", "amd_wmma_encod
 let description = [{
 An encoding for tensors that have been produced by WMMA matrix core instructions,
 available on AMD Radeon GPUs of RDNA architectures.
-- A `version` parameter specifies instruction version to lower in. The data
-distribution within one warp is also depends on it. Following architectures are
-supported:
-  - 1: gfx11
-  - 2: gfx12
-- A `warpsPerCTA` parameter characterizes data distribution between warps.
-An important limitation of WMMA for layout is a shape for tiles processed
-by a single warp. It is [16, 16].
-This encoding assumes specific access to matrix elements by threads.
+
+It is characterized by the following parameters:
+- `version` indicates the GPU architecture:
+  - 1: RDNA3; e.g., gfx1100, gfx1101
+  - 2: RDNA4; e.g., gfx1200, gfx1201
+  - 3: gfx1250
+- `warpsPerCTA` indicates the warp layout in the block.
+- `instrShape` indicates the shape in the form of (M, N, K) of the matrix
+  operation performed by a single WMMA instruction. Defaults to (16, 16, 16).
+- `isTransposed` indicates the layout of the result tensor is transposed.
 
 Example:
 Suppose we have a tensor with shape [32, 64], `warpsPerCTA` set to [2, 2].
@@ -1239,7 +1240,7 @@ Row | warp 0 warp 1
 30 |[0 1 2 ... 14 15] [0 1 2 ... 14 15] [0 1 2 ... 14 15] [0 1 2 ... 14 15]
 31 |[16 17 18 ... 30 31] [16 17 18 ... 30 31] [16 17 18 ... 30 31] [16 17 18 ... 30 31]
 
-// ------------------------ version = 2, isTransposed = false ------------------------ //
+// ------------------------ version = 2/3, isTransposed = false ------------------------ //
 
 Row | warp 0 warp 1
     |/--------^---------\ /---------^--------\
@@ -1267,7 +1268,7 @@ Row | warp 0 warp 1
 30 |[16 17 18 ... 30 31] [16 17 18 ... 30 31]
 31 |[16 17 18 ... 30 31] [16 17 18 ... 30 31]
 
-// ------------------------ version = 2, isTransposed = true ------------------------ //
+// ------------------------ version = 2/3, isTransposed = true ------------------------ //
 
     | warp 0 warp 1
     |/----------------^----------------\ /-------^-------\
@@ -1293,18 +1294,21 @@ Row |
 "unsigned": $version,
 "bool":$isTransposed,
 ArrayRefParameter<"unsigned">:$warpsPerCTA,
-"CTALayoutAttr":$CTALayout
+"CTALayoutAttr":$CTALayout,
+ArrayRefParameter<"unsigned">:$instrShape
 );
 
 let genVerifyDecl = 1;
 let hasCustomAssemblyFormat = 1;
 
 let extraClassDeclaration = extraDistributedDeclaration # [{
-SmallVector<int64_t> getElemsPerInstrForOperands(int kDim, int opIdx) const;
 SmallVector<int64_t> getRepForOperand(ArrayRef<int64_t> operandShape,
-                                      Type elemType, int kWidth, int kDim, int opIdx) const;
+                                      Type elemType, int opIdx) const;
 SmallVector<unsigned> getRepOrderForOperand(int opIdx) const;
-static SmallVector<unsigned> getMNKDimPerInstr();
+
+static SmallVector<unsigned, 3> getDefaultInstrShape() {
+  return {16, 16, 16};
+}
 
 // Returns a swizzled shared layout matching this WMMA layout for the
 // dot operand at the given |operandIdx| with |operandShape|.
