# Agenda:
* Intros: Cicie Wang and Whitney Tsang (co-organizers).
* Multi-pass profiler - a federated GPU tooling framework for orchestrated and LLM-agentic profiling applications (Kevin Fang, et al., Meta)
* Triton Developer Conference updates (Ofer Dekel, Microsoft)
* Q> Who is using tritonbench? How are you using it? OpenAI? (Cicie Wang, Meta)
* Q> Triton testing strategy - what do folks think? What are we missing? Where would you like to see additional coverage? (Bill Yoshimi, Meta)
* Q> Free-threaded Python. Any plans for making Triton compatible with free threading? (Bill Yoshimi, Meta)
* Open mic for other topics.

# Notes:
* MPP (Multi-Pass Profiler)
    * Lots of new DSLs (like Gluon and TLX) and profilers.
    * Working with Keren from OAI on profiling
    * Integrated with the compiler
    * Supports new DSLs
    * Structure-level profiling timelines
    * Operator-level latency
    * See the OSDI '25 paper (accepted)
    * Approach
        * Connecting tools (profilers, LLM agents, etc.) to different profiling backends (Proton, NCU, NVBit, etc.)
    * Requirements
        * Programmable interfaces
        * Eager execution (makes debugging easier)
        * Amenable to parallelization
        * Sandboxing - e.g. gives agents a clean environment in which to try experiments
        * Debuggable.
    * Prototype
        * Data structures - program IR, execution traces, performance reports
        * Abstractions - tasks and jobs (jobs can be nested); see the sketch after this list
        * System architecture
            * Job graph
            * MPP runtime - schedules tasks, executes eagerly
            * Backend - state caching, GPU/CPU pools, DB for error recovery
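A minimal sketch of the task/job abstraction described above. MPP is not yet open source, so every name here (`Task`, `Job`, `run`) is a hypothetical stand-in for the real API; the point is just that tasks execute eagerly and jobs nest into a job graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

@dataclass
class Task:
    """A single unit of profiling work; runs eagerly when scheduled."""
    name: str
    fn: Callable[[], dict]

    def run(self) -> Dict[str, dict]:
        return {self.name: self.fn()}

@dataclass
class Job:
    """A group of tasks; jobs can nest, forming the job graph."""
    name: str
    children: List[Union[Task, "Job"]] = field(default_factory=list)

    def run(self) -> Dict[str, dict]:
        results: Dict[str, dict] = {}
        for child in self.children:  # a real runtime could schedule these in parallel
            results.update(child.run())
        return results

# Example: a nested job wiring two hypothetical profiling passes together.
job = Job("profile_kernel", [
    Task("timing", lambda: {"p50_ms": 1.2}),
    Job("inner", [Task("ipc", lambda: {"ipc": 0.8})]),
])
print(job.run())
```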
    * Case study 1: profiling async operations
        * Sometimes difficult because some resources are shared.
        * We do multiple passes and measure statistical metrics.
        * Statistical timeline view.
        * MPP lets you see the distribution of execution times (P20, P50, P80); see the sketch below.
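To illustrate the multi-pass idea, here is a minimal, self-contained sketch (not MPP code) that times a kernel launch over many passes with CUDA events and reports the same percentiles:

```python
import statistics
import torch

def time_passes(launch, n_passes: int = 100, warmup: int = 10) -> dict:
    """Time `launch()` over many passes and report latency percentiles."""
    for _ in range(warmup):
        launch()
    times_ms = []
    for _ in range(n_passes):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        launch()
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    q = statistics.quantiles(times_ms, n=100)  # cut points for P1..P99
    return {"p20": q[19], "p50": q[49], "p80": q[79]}

# Example usage:
#   x = torch.randn(1 << 20, device="cuda")
#   print(time_passes(lambda: torch.relu(x)))
```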
    * Case study 2: Triton PGO agent
        * Phases/agents: profiling, summary, optimizer
            * Profiling: gets profile results
            * Summary: compresses the context window, generates a TL;DR
            * Optimizer: rewrites the kernel to improve performance
        * Experimenting with TTGIR rewrites.
        * Examples: identifies sections with high execution-time variation; identifies critical paths and suggests how to shorten them.
        * Results: compared against no profiling and against NCU, MPP-guided optimization gives a 7-12% improvement.
        * Failure modes (see the guard in the sketch below):
            * Kernel results change
            * Deadlocks
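A minimal sketch of one profile → summarize → optimize iteration, with a numerics guard against the "kernel results change" failure mode. All the callables here are hypothetical stand-ins; MPP's actual agent interface has not been published:

```python
from typing import Callable
import torch

def pgo_step(
    kernel_src: str,
    profile: Callable[[str], str],       # phase 1: run a profiling backend, return a report
    summarize: Callable[[str], str],     # phase 2: compress the report into a TL;DR
    rewrite: Callable[[str, str], str],  # phase 3: LLM proposes a rewritten kernel
    run: Callable[[str], torch.Tensor],  # compile + run a kernel, return its output
    baseline: torch.Tensor,              # known-good output of the original kernel
) -> str:
    report = profile(kernel_src)
    tldr = summarize(report)
    candidate = rewrite(kernel_src, tldr)
    # Guard against the "kernel results change" failure mode noted above.
    # (A real harness would also run candidates under a timeout to catch deadlocks.)
    if not torch.allclose(run(candidate), baseline, rtol=1e-3, atol=1e-3):
        return kernel_src  # reject the rewrite, keep the original
    return candidate
```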
    * Case study 3: fine-grained IPC
        * Timing from the Proton intra-kernel profiler
        * Instruction-type stats from NVBit or cutracer (developed by Meta)
        * Can identify register pressure; see the sketch below.
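For reference, IPC is just retired instructions over elapsed cycles; a back-of-envelope sketch (the instruction-count schema here is illustrative, not the real NVBit/cutracer output format):

```python
def ipc(inst_counts: dict, cycles: int) -> float:
    """Instructions per cycle from per-opcode retirement counts."""
    return sum(inst_counts.values()) / cycles

# A high share of local loads/stores (LDL/STL) relative to math ops hints
# at register spills, i.e. register pressure.
counts = {"FFMA": 1_200_000, "LDG": 300_000, "LDL": 90_000, "STL": 90_000}
print(f"IPC = {ipc(counts, cycles=900_000):.2f}")
print(f"spill fraction = {(counts['LDL'] + counts['STL']) / sum(counts.values()):.1%}")
```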
    * Conclusion
        * Built on top of Proton, orchestrating profiling workflows
        * Soon to be open-source

Q> How difficult would it be to add other GPU vendors like AMD?

A> If your backend can give you the data, we can do it. We didn't do it because we were focused on warp specialization. The design is general, and you can implement the interface API.

Q> Have you experimented with using the optimizer to rewrite assembly code?

A> The demo used TTGIR, but you could create an agent that rewrites PTX or assembly.

Q> Did you need to write a prompt for the agent?

A> Yes. It's a very simple prompt.
* Triton conference updates (Ofer Dekel, MSFT)
    * [https://aka.ms/tritonconference2025](https://aka.ms/tritonconference2025)
    * Schedule
        * Please show up to the happy hour to mingle (probably the most important part).
    * Register. You'll need it for the live stream too. Sorry, you will not be able to register on the day of the conference.
    * When you register, your status will be pending. Approval can take up to a week (it goes through a Microsoft security review).
    * Please register with your institutional/professional email rather than a yahoo/gmail/generic email; generic email takes longer to approve. Ping Ofer if you haven't seen your approval after 8+ days.
    * There will be buses to the venue from SF.
    * Need a visa letter? Register soon so we can get you an invitation letter.
    * Program
        * Phil & Thomas - Triton: today and beyond
        * Mark Saroufim - GPU MODE: the state of Triton
        * Jason Ansel - Helion: a higher-level DSL for kernel authoring
        * Keren Zhou (George Mason) & Kevin Fang - Proton: portable performance profiling
        * Lixun Zhang (AMD) - No warm-up needed: Triton day-one speed on AMD GPUs
        * Chris Sullivan (NVIDIA) - NVIDIA Blackwell GPU backend for Triton
        * Peter Bell (OpenAI) - Gluon: tile-based GPU programming with low-level control
        * Hongtao Y (Meta) - TLX
        * Wenlei Bao (ByteDance) - Triton-distributed: computation and communication overlapping
        * Yanming Chen (LinkedIn) - Evolution of Liger kernels to post-training
* Q> Who is using tritonbench? How are you using it? OpenAI?
    * [Kernelize.ai](https://kernelize.ai) - running tritonbench nightly for vLLM testing. Built a visualization (noticed H100 and B200 regressions on Liger kernels and BF16).
    * OpenAI - not using tritonbench; using an internal benchmarking system. Low-tech stuff in OCaml (some of it is open-sourced in the repo). Simple benchmarking.
    * Q> No new kernels have been added?
    * A> We're continuously updating them and thinking of upstreaming more (e.g. attention), but there's no timeline. We are keeping MoE updated.
* Q> Triton testing strategy - what do folks think? What are we missing? Where would you like to see additional coverage?
    * Ettore - wants to see more lit test coverage: it doesn't require a GPU and is easier and faster to run than testing operators end to end.
    * The ~20K unit tests are good, but the bigger win is beefing up the lit tests. GPU tests should be in the third-party directory.
    * Alex Baden - for important kernels, IR diffing! Cheaper to run (if the IR doesn't change, you shouldn't have a regression). Use LLVM tooling to eliminate whitespace changes. **For important kernels, extract & compare IR changes.** (See the sketch after this list.)
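A minimal sketch of the IR-diffing idea, assuming the compiled kernel exposes its TTGIR via `asm["ttgir"]` (present in recent Triton releases, though the exact API varies by version); whitespace is normalized so formatting-only changes don't trip the test:

```python
import re
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def normalize(ir: str) -> str:
    """Collapse whitespace so formatting-only IR changes don't diff."""
    return re.sub(r"\s+", " ", ir).strip()

def test_add_kernel_ttgir_unchanged():
    x = torch.randn(1024, device="cuda")
    out = torch.empty_like(x)
    compiled = add_kernel[(4,)](x, x, out, x.numel(), BLOCK=256)
    ttgir = normalize(compiled.asm["ttgir"])
    with open("add_kernel.ttgir.golden") as f:  # golden file checked into the repo
        assert ttgir == normalize(f.read())
```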
* Q> What is the free-threaded Python strategy?
    * Lots of things to fix in the frontend (the backend is pretty thread-safe).
    * But it's not high on the list of work we're doing (OAI).
* Q> Flex attention: update comments/docs to say tensor descriptors instead of TMA (unless TMA is really what's being referenced).
    * PyTorch flex attention uses tensor descriptors, but comments/code reference TMA. Reaching out to the owners of the flex attention PyTorch Inductor template kernels to update comments and code. It's confusing for people on GPUs that don't implement TMA.
    * Ettore: FlexAttention FWD uses tensor descriptors but BWD doesn't; can someone add tensor descriptor support? (A minimal tensor-descriptor example follows below.)
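For context, a minimal tensor-descriptor kernel using the `tl.make_tensor_descriptor` API from recent Triton releases (older releases used experimental TMA-specific names, so check your version). On Hopper+ this lowers to TMA; on other GPUs it lowers to ordinary loads/stores, which is exactly why "tensor descriptor" is the right term in comments:

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, out_ptr, M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Device-side descriptors describe a 2D tensor plus the tile shape to move.
    # (On NVIDIA hardware, creating descriptors on device may also require
    # triton.set_allocator(...) on the host to back descriptor workspace.)
    in_desc = tl.make_tensor_descriptor(
        x_ptr, shape=[M, N], strides=[N, 1], block_shape=[BLOCK_M, BLOCK_N]
    )
    out_desc = tl.make_tensor_descriptor(
        out_ptr, shape=[M, N], strides=[N, 1], block_shape=[BLOCK_M, BLOCK_N]
    )
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    tile = in_desc.load([pid_m * BLOCK_M, pid_n * BLOCK_N])
    out_desc.store([pid_m * BLOCK_M, pid_n * BLOCK_N], tile)
```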

# Minutes
* Recording link [here](https://youtu.be/Ji1rCo6qvXc)
* MPP presentation link [here](https://tinyurl.com/4r7cfzhu)