Updated:20250426 16:00
Because of vllm's version churn, I have pinned it to 0.8.3; I plan to bump triton and vllm together to the latest versions soon. Update to follow.
mlc is fast at first, but slows down as the text gets long: past roughly 10,000+ characters, group members report throughput dropping by about half, so vllm is still the recommended choice.
Once vllm is installed, gguf models are reasonably fast for single requests, but there is almost no concurrency; this can only be attributed to vllm's poor optimization for gguf.
After a period of experimentation, we found the quantization format with the highest concurrency efficiency for vllm on AMD GPUs: GPTQ.
In theory any GPTQ-quantized model can deliver very good concurrency on the MI50. You can download a model from ModelScope yourself and run the GPTQ quantization with vllm; this currently gives the best experience.
Note, however, that GPTQ models running on old GPUs tend to output strings of exclamation marks when the prompt is short; the official workaround was to pad the prompt with junk characters to break out of the garbled output. As of today's search, newer model builds appear to have already fixed the exclamation-mark issue.
pip install modelscope
modelscope download --model tclf90/qwq-32b-gptq-int8
modelscope download --model tclf90/qwq-32b-gptq-int4
https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4
https://modelscope.cn/models/tclf90/qwq-32b-gptq-int8
[Model update date]
2025-04-24
1. add group 128v3
2. Used a trick to fix the exclamation-mark issue that previously appeared when running the GPTQ format on Compute 7 GPUs
3. Removed the <think> token from the chat template so the model emits the complete <think> block when answering
4. Provided a no-thinking-mode chat template (tokenizer_config-无思考模式.json) for reference and experimentation.
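For reference, here is a minimal, hedged launch sketch for serving one of the GPTQ models above with vllm on MI50. The model path, port, served name and tensor-parallel size are placeholders; the environment variables mirror the gguf launch command further down, and --quantization gptq is usually optional because vllm detects GPTQ from the model config.
VLLM_USE_TRITON_FLASH_ATTN=1 ROCM_PATH=/opt/rocm TORCH_BLAS_PREFER_HIPBLASLT=0 PYTORCH_ROCM_ARCH=gfx906 \
  vllm serve /data1/qwq-32b-gptq-int4 --quantization gptq --port 8099 --max-model-len 4096 --tensor-parallel-size 2 --served-model-name qwq-gptq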
Updated:20250418 16:00
Since I bought a 10-GPU server for testing and found that vllm may have optimization problems on multi-GPU X99 machines, I looked into another tensor-parallel framework.
Enter mlc-llm: https://github.com/mlc-ai/mlc-llm
Happily, this framework is fairly simple to install and achieves very high GPU utilization on multi-GPU AMD systems; it is the best support for old AMD cards I have found so far. The one shortcoming is that it requires specially prepared models, so the only drawback of this framework is having to re-download models, and only a subset of models are available in this format. Converting existing models should be possible, but I have not tried it yet (a hedged conversion sketch follows the model download steps below).
Test results, mi50*8 (a rough benchmark sketch for reproducing the concurrency numbers follows these results):
qwq 32b q4f16 ----- 1 GPU: 16-18 token/s, concurrency not tested (measured on the 10-GPU machine)
qwq 32b q4f16 ----- 2 GPUs: 29-36 token/s, concurrent 60 token/s (the 36 token/s figure came from a special dual-socket EPYC Gen 4 setup)
qwq 32b q4f16 ----- 4 GPUs: 30-35 token/s, concurrent 150-180 token/s (measured on the 10-GPU machine)
qwq 32b q4f16 ----- 8 GPUs: 38-41 token/s, concurrent 280-300 token/s (measured on the 10-GPU machine)
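The concurrent-throughput figures above come from firing many requests at once. Below is a rough, hedged benchmark sketch (not the exact script used for the numbers above) that sends parallel requests to an OpenAI-compatible endpoint and reports aggregate completion tokens per second; the URL, model name and request count are placeholders to adjust for your own mlc_llm serve or vllm serve instance.
# Hypothetical concurrency benchmark against an OpenAI-compatible /v1/chat/completions endpoint
import time
from concurrent.futures import ThreadPoolExecutor
import requests  # pip install requests

BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"  # adjust host/port to your server
MODEL = "QwQ-32B-q4f16_1-MLC"                           # adjust to your served model name
N_PARALLEL = 8

def one_request(i):
    # Each worker sends one chat completion and returns its completion token count
    resp = requests.post(BASE_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a short poem about GPU #{i}."}],
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    total_tokens = sum(pool.map(one_request, range(N_PARALLEL)))
elapsed = time.time() - start
print(f"{total_tokens} completion tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} token/s aggregate")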
Users who already have conda installed can skip this step.
Conda is recommended as the virtual environment.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
eval "$(/root/miniconda3/bin/conda shell.bash hook)"
conda init
conda --version
conda config --set auto_activate_base true
conda create --name mlc python=3.10
conda activate mlc
Users who already have ROCm installed can skip this step.
Official AMD ROCm installation (not recommended, since the official default is now 6.4):
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
wget https://repo.radeon.com/amdgpu-install/6.2.2/ubuntu/jammy/amdgpu-install_6.2.60202-1_all.deb
sudo apt install ./amdgpu-install_6.2.60202-1_all.deb
sudo apt update
sudo apt install amdgpu-dkms rocm
# linker
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig
# path
export PATH=$PATH:/opt/rocm-6.2.2/bin
# verify drivers
dkms status
# You need to close all windows and reboot to make changes effective.
reboot
Now install mlc-llm.
Users who already created this conda environment can skip this step: conda create --name mlc python=3.10
conda activate mlc
conda install -c conda-forge libstdcxx-ng
This single command installs the entire environment and spares you the hassle of compiling (if you hit a timeout, you may need to enable a proxy):
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-rocm62 mlc-ai-nightly-rocm62
Download the qwq 32b q4f16 model (look for other models yourself at the link below). Only download the f16_1 variants; the f16_0 variants cannot run multi-GPU tensor parallelism.
https://huggingface.co/mlc-ai/QwQ-32B-q4f16_1-MLC/tree/main
Tip: git clone should not require a proxy.
Install git-lfs:
git lfs install
If the command above does not work, try this:
apt install git-lfs
git clone https://huggingface.co/mlc-ai/QwQ-32B-q4f16_1-MLC
After the download finishes, a QwQ-32B-q4f16_1-MLC directory is created locally.
You can open another window and repeatedly run the command below to check download progress, since git clone does not show progress for large files:
du -sh QwQ-32B-q4f16_1-MLC
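If you would rather convert a model you already have instead of downloading a prebuilt MLC build (as noted above, the author has not tested this), the documented MLC tooling looks roughly like the sketch below. The input checkpoint path, conversation template and output directory are placeholders; check the templates and quantization codes supported by your mlc-llm version before running.
# Hedged sketch: convert an existing HF checkpoint to MLC's q4f16_1 format (untested in this guide)
mlc_llm convert_weight /path/to/your-hf-model --quantization q4f16_1 -o ./your-model-q4f16_1-MLC
mlc_llm gen_config /path/to/your-hf-model --quantization q4f16_1 --conv-template qwen2 -o ./your-model-q4f16_1-MLC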
Run inference
mlc_llm serve /root/QwQ-32B-q4f16_1-MLC --mode server --overrides "tensor_parallel_shards=2" --host 0.0.0.0
Tips:
1. rocm-smi -b shows the real-time PCIe bandwidth of each card.
2. q0f16 denotes the unquantized float16 format.
3. To enable tensor parallelism across multiple GPUs, add the --tensor-parallel-shards $NGPU argument to the config generation command.
4. Full list of quantizations supported by MLC: https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/quantization/quantization.py#L29
5. Engine modes in MLC LLM. Three preset modes are provided: local, interactive and server. The default mode is local.
5.1 local targets local server deployment with low request concurrency, so the maximum batch size is set to 4, and the maximum total sequence length and prefill chunk size are set to the model's context window size (or sliding window size).
5.2 interactive means the server is used interactively and handles at most 1 concurrent request, so the maximum batch size is set to 1, and the maximum total sequence length and prefill chunk size are set to the model's context window size (or sliding window size).
5.3 server targets large-scale serving that may handle many concurrent requests and wants to use as much GPU memory as possible; in this mode the largest feasible maximum batch size and maximum total sequence length are inferred automatically.
6. For more, see https://llm.mlc.ai/docs/deploy/rest.html#rest-launch-server; a minimal request example follows below.
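Once the server is up, a quick hedged smoke test against the OpenAI-compatible endpoint (mlc_llm serve listens on port 8000 by default; adjust the host/port and the model field to match your launch command and model path):
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/root/QwQ-32B-q4f16_1-MLC", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'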
Replacing triton
It is unclear whether the triton that mlc installs is the patched build. If you are not satisfied with performance, you can install your own compiled triton 3.2.0, which may give a slight performance improvement.
See the separate triton build-and-install tutorial below, then use the following commands to replace triton with the version you compiled yourself:
cd triton-3.2.0-mi50/python/dist/
pip install triton-3.2.0-cp310-cp310-linux_x86_64.whl --force-reinstall
Updated:20250404 ---- Below is the vllm installation tutorial. mlc users do not need to read it; if you only need to build triton, read just the triton part.
Please use Ubuntu 22.04; several people have tested 24.04 and run into many errors.
Conda is recommended as the virtual environment.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
eval "$(/root/miniconda3/bin/conda shell.bash hook)"
conda init
conda --version
conda config --set auto_activate_base true
conda create --name vllmnew python=3.10
conda activate vllmnew
Installation guide for the patched latest Triton 3.2.0. Python 3.10 is recommended so that your setup matches my test environment exactly.
If you run into problems, leave a message in the issues section or join QQ group 1028429001 -> local deployment of deepseek-r1:671b, pure-CPU solution.
This guide is based on https://github.com/Said-Akbar/triton-gcn5/blob/gcn5-edits/README.md (if you need triton 3.1.0, get it there).
Note: do not attempt any ROCm version other than 6.2.2; triton and vllm have strict version dependencies, and 6.3.3 was tried during testing and produced nothing but errors.
This is a fork of triton for AMD MI25/50/60. I assume you already have rocm 6.2.2 installed with GPU drivers. If not, use these commands (assuming you have Ubuntu 22.04):
Install rocm 6.2.2
Official AMD ROCm installation instructions (not recommended, since the official default is now 6.3.3): https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
wget https://repo.radeon.com/amdgpu-install/6.2.2/ubuntu/jammy/amdgpu-install_6.2.60202-1_all.deb
sudo apt install ./amdgpu-install_6.2.60202-1_all.deb
sudo apt update
sudo apt install amdgpu-dkms rocm
# linker
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig
# path
export PATH=$PATH:/opt/rocm-6.2.2/bin
# verify drivers
dkms status
# You need to close all windows and reboot to make changes effective.
reboot
Now you have rocm 6.2.2.
Install triton 3.2.0.
# create venv
# If you don't want to use python venv, you can use conda instead of python3 -m venv vllmenv:
#conda create --name vllmnew python=3.10
#conda activate vllmnew
python3 -m venv vllmenv
source vllmenv/bin/activate
git clone https://github.com/gengchaogit/triton-3.2.0-mi50.git
cd triton-3.2.0-mi50
# This step is critical: these patches only target the latest 3.2.0 release
git checkout release/3.2.x
cd python
# install triton req’s
pip3 install ninja cmake wheel pybind11
# install torch first
pip3 install --no-cache-dir --pre "torch>=2.6" torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2
# build triton
# Note: you must build a whl package here. Installs can fail and later dependency installs can overwrite this package, so we may need to reinstall the finished whl at the very end: build once, install as many times as needed.
pip wheel . -w dist/
# The generated file will appear at triton-3.2.0-mi50/python/dist/triton-3.2.0-cp310-cp310-linux_x86_64.whl (cp310 because I am on Python 3.10; with other Python versions it may be cp311 or cp312)
cd dist
pip install triton-3.2.0-cp310-cp310-linux_x86_64.whl --force-reinstall
# To reinstall triton-3.2.0 later, just run: cd dist && pip install triton-3.2.0-cp310-cp310-linux_x86_64.whl --force-reinstall
# Verify the installation succeeded
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
Expected output:
CUDA available: True
End of installation.
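Optionally, confirm that the wheel you just built is the triton actually being imported (a small hedged sanity check, assuming the install went through cleanly):
python -c "import triton; print(triton.__version__, triton.__file__)"
# Expected: 3.2.0 followed by a path inside your vllmenv/conda environment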
That concludes the Triton 3.2.0 build-and-install tutorial; you can now build the latest vllm.
Below is the tutorial for installing the latest vllm, or you can consult the official docs yourself: https://github.com/vllm-project/vllm
# Clone git
git clone https://github.com/vllm-project/vllm.git
cd vllm
# vllm has since moved on to the next version, whose dependency changed to triton 3.3.0, so pin the version to v0.8.3 here
git checkout v0.8.3
# If you use conda, use the command below instead of source vllmenv/bin/activate
#conda activate vllmnew
source vllmenv/bin/activate
pip install --upgrade pip
# Build & install AMD SMI
pip install /opt/rocm/share/amd_smi
# Install dependencies
pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
pip install "numpy<2"
pip install -r requirements/rocm.txt
# Build vLLM for MI50.
export PYTORCH_ROCM_ARCH="gfx906"
python3 setup.py develop --verbose
# If you hit the cmake 3.5 problem, edit the 3.x version (below 3.5) near the top of that file to 3.5 and save; the sed below does exactly that on line 21 of hiprtc-config.cmake
sed -i '21s/3.3/3.5/' /opt/rocm/lib/cmake/hiprtc/hiprtc-config.cmake
# If some dependency installed along the way overwrote the patched triton, go back to dist and reinstall the triton wheel you built
#pip install triton-3.2.0-cp310-cp310-linux_x86_64.whl --force-reinstall
# vllm AMD dual-GPU launch command (just an example; adjust to taste)
# PS: you can hook this up to dify/openwebui or any other frontend that supports the OpenAI API for testing; a minimal curl example follows below
VLLM_USE_TRITON_FLASH_ATTN=1 ROCM_PATH=/opt/rocm TORCH_BLAS_PREFER_HIPBLASLT=0 PYTORCH_ROCM_ARCH=gfx906 vllm serve /data1/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --port 8099 --max-model-len 4096 --tensor-parallel-size 2 --served-model-name vllm
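# Quick hedged smoke test of the endpoint launched above (port and served model name come straight from the serve command):
curl http://127.0.0.1:8099/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "vllm", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'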
# If launching inside the conda environment hits a dynamic-library (GLIBCXX) problem:
conda install -c conda-forge libstdcxx-ng
# Expected output:
(vllmnew) root@epyc:~# strings /root/miniconda3/envs/vllmnew/lib/libstdc++.so.6|grep GLIBCXX_3.4.30
GLIBCXX_3.4.30
GLIBCXX_3.4.30
Everything below can be skipped; it is the original upstream README.
This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.
The foundations of this project are described in the following MAPL2019 publication: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Please consider citing this work if you use Triton!
The official documentation contains installation instructions and tutorials. See also these third-party Triton puzzles, which can all be run using the Triton interpreter -- no GPU required.
You can install the latest stable release of Triton from pip:
pip install triton
Binary wheels are available for CPython 3.9-3.13.
The main branch now features support for NVIDIA Blackwell GPUs using 5th generation tensor cores. To enable this, you will need two additional steps:
- Build a pre-release PyTorch from source with CUDA 12.8
- Build triton from the latest source
First, to build pytorch you need to have CUDA 12.8 installed locally. If not, follow the instructions for your platform
# Clone and checkout pytorch 2.6 release candidate
git clone https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0-rc9
git submodule sync
git submodule update --init --recursive -j 8
# Install build dependencies (assumes you already have a system compiler)
pip install -r requirements.txt
pip install mkl-static mkl-include wheel
# Build PyTorch (will take a long time)
export CUDA_HOME=/usr/local/cuda-12.8
export CUDA_PATH=$CUDA_HOME
export TORCH_CUDA_ARCH_LIST=Blackwell
python setup.py develop
# Optional, package build into a wheel to install on other machines.
python setup.py bdist_wheel
ls dist # Wheel should be output in this directory
Note that if you use the domain libraries (torchvision, torchtext, torchaudio, etc.) these will need to be built from source as well, otherwise their custom PyTorch extensions will not work.
Finally, follow the instructions below to install triton from source.
git clone https://github.com/triton-lang/triton.git
cd triton
pip install -r python/requirements.txt # build-time dependencies
pip install -e python
Or with a virtualenv:
git clone https://github.com/triton-lang/triton.git
cd triton
python -m venv .venv --prompt triton
source .venv/bin/activate
pip install -r python/requirements.txt # build-time dependencies
pip install -e python
Triton uses LLVM to generate code for GPUs and CPUs. Normally, the Triton build downloads a prebuilt LLVM, but you can also build LLVM from source and use that.
LLVM does not have a stable API, so the Triton build will not work at an arbitrary LLVM version.
- Find the version of LLVM that Triton builds against. Check cmake/llvm-hash.txt to see the current version. For example, if it says 49af6502c6dcb4a7f7520178bd14df396f78240c, this means that the version of Triton you have builds against LLVM 49af6502.
- git checkout LLVM at this revision. Optionally, make additional modifications to LLVM.
- Build LLVM. For example, you might run
  $ cd $HOME/llvm-project # your clone of LLVM.
  $ mkdir build
  $ cd build
  $ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ../llvm -DLLVM_ENABLE_PROJECTS="mlir;llvm;lld" -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
  $ ninja
- Grab a snack, this will take a while.
- Build Triton as above, but set the following environment variables.
  # Modify as appropriate to point to your LLVM build.
  $ export LLVM_BUILD_DIR=$HOME/llvm-project/build
  $ cd <triton install>
  $ LLVM_INCLUDE_DIRS=$LLVM_BUILD_DIR/include \
    LLVM_LIBRARY_DIR=$LLVM_BUILD_DIR/lib \
    LLVM_SYSPATH=$LLVM_BUILD_DIR \
    pip install -e python
- Set TRITON_BUILD_WITH_CLANG_LLD=true as an environment variable to use clang and lld. lld in particular results in faster builds.
- Set TRITON_BUILD_WITH_CCACHE=true to build with ccache.
- Set TRITON_HOME=/some/path to change the location of the .triton directory where Triton's cache is located and downloads are stored during the build. By default, this is the user's home directory. It can be changed anytime.
- If you're running out of memory when building Triton, specify the MAX_JOBS environment variable (to the pip install -e python command) to limit the number of jobs.
- Pass --no-build-isolation to pip install to make nop builds faster. Without this, every invocation of pip install uses a different symlink to cmake, and this forces ninja to rebuild most of the .a files.
- vscode intellisense has some difficulty figuring out how to build Triton's C++ (probably because, in our build, users don't invoke cmake directly, but instead use setup.py). Teach vscode how to compile Triton as follows.
  - Do a local build. Run command pip install -e python
  - Get the full path to the compile_commands.json file produced by the build: find python/build -name 'compile_commands.json' | xargs readlink -f. You might get a full path similar to /Users/{username}/triton/python/build/cmake.macosx-11.1-arm64-cpython-3.12/compile_commands.json
  - In vscode, install the C/C++ extension, then open the command palette (Shift + Command + P on Mac, or Shift + Ctrl + P on Windows/Linux) and open C/C++: Edit Configurations (UI).
  - Open "Advanced Settings" and paste the full path to compile_commands.json into the "Compile Commands" textbox.
There currently isn't a turnkey way to run all the Triton tests, but you can follow the following recipe.
# One-time setup. Note this will reinstall local Triton because torch
# overwrites it with the public version.
$ make dev-install
# To run all tests (requires a GPU)
$ make test
# Or, to run tests without a gpu
$ make test-nogpu
For detailed instructions on how to debug Triton's frontend, please refer to this tutorial. The following includes additional tips for hacking on Triton's backend.
Helpful environment variables
- MLIR_ENABLE_DUMP=1 dumps the IR before every MLIR pass Triton runs, for all kernels. Use MLIR_ENABLE_DUMP=kernelName to dump for a specific kernel only.
  - Triton cache can interfere with the dump. In cases where MLIR_ENABLE_DUMP=1 does not work, try cleaning your triton cache: rm -r ~/.triton/cache/*
- MLIR_DUMP_PATH specifies where MLIR_ENABLE_DUMP will dump to. If unset will dump to stderr.
- LLVM_IR_ENABLE_DUMP=1 dumps the IR before every pass run over the LLVM IR.
- TRITON_REPRODUCER_PATH=<reproducer_path> will generate an MLIR reproducer file at <reproducer_path> before each MLIR compiler stage. If any of the stages fail, <reproducer_path> will be a local MLIR reproducer captured right before the failing pass.
- TRITON_INTERPRET=1 uses the Triton interpreter instead of running on the GPU. You can insert Python breakpoints in your kernel code! (A minimal interpreter-mode example is sketched after this list.)
- TRITON_ENABLE_LLVM_DEBUG=1 passes -debug to LLVM, printing a lot of debugging information to stdout. If this is too noisy, run with just TRITON_LLVM_DEBUG_ONLY instead to limit the output. An alternative way to reduce output noisiness is running with LLVM_IR_ENABLE_DUMP=1, extract the IR before the LLVM pass of interest, and then run LLVM's opt standalone, perhaps passing -debug-only=foo on the command line.
- TRITON_LLVM_DEBUG_ONLY=<comma-separated> is the equivalent of LLVM's -debug-only command-line option. This limits the LLVM debug output to specific pass or component names (which are specified using #define DEBUG_TYPE throughout LLVM and Triton) in order to allow the debug output to be less noisy. TRITON_LLVM_DEBUG_ONLY allows for one or more comma separated values to be specified (eg TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions" or TRITON_LLVM_DEBUG_ONLY="tritongpu-remove-layout-conversions,regalloc").
- TRITON_ENABLE_ASAN=1 invokes the LLVM address sanitizer for memory leak and out of bounds access detection. Currently only supported on the AMD backend. This must be run using the ASAN libraries documented here. When enabling the address sanitizer it is recommended to disable various memory caching strategies both within the ROCm stack and PyTorch. This will give the address sanitizer the best chance at finding the memory fault where it originates. See this test for more details.
- USE_IR_LOC={ttir,ttgir} reparses the IR such that the location information will be the line number of the IR file with that particular extension, instead of line number of the python file. This can provide a direct mapping from the IR to llir/ptx. When used with performance tools, it can provide a breakdown on IR instructions.
- TRITON_PRINT_AUTOTUNING=1 prints out the best autotuning config and total time spent for each kernel after autotuning is complete.
- DISABLE_LLVM_OPT will disable llvm optimizations for make_llir and make_ptx if its value is true when parsing as Bool. Otherwise, it will be parsed as a list of flags to disable llvm optimizations. One usage case is DISABLE_LLVM_OPT="disable-lsr". Loop strength reduction is known to cause up to 10% performance changes for certain kernels with register pressure.
- TRITON_ALWAYS_COMPILE=1 forces to compile kernels regardless of cache hit.
- MLIR_ENABLE_TIMING dumps the timing information for each MLIR pass.
- LLVM_ENABLE_TIMING dumps the timing information for each LLVM pass.
- TRITON_DEFAULT_FP_FUSION overrides the default behavior of allowing fp fusion (mul+add->fma).
- MLIR_ENABLE_DIAGNOSTICS=<comma-separated> controls diagnostic emission in MLIR. Options are: warnings, remarks, stacktraces, operations. Use comma-separated values to customize output. For example, MLIR_ENABLE_DIAGNOSTICS=remarks,operations enables remarks and IR operations, while MLIR_ENABLE_DIAGNOSTICS=warnings,stacktraces enables warnings with stacktraces. By default, only errors are shown. Setting warnings includes errors and warnings; remarks includes errors, warnings, and remarks.
- MLIR_ENABLE_REMARK is deprecated. Please use MLIR_ENABLE_DIAGNOSTICS=remarks.
- TRITON_KERNEL_DUMP enables the dumping of the IR from each compilation stage and the final ptx/amdgcn.
- TRITON_DUMP_DIR specifies the directory to save the dumped IR and ptx/amdgcn when TRITON_KERNEL_DUMP is set to 1.
- TRITON_KERNEL_OVERRIDE enables the override of the compiled kernel with a user-specified IR/ptx/amdgcn at the beginning of each compilation stage.
- TRITON_OVERRIDE_DIR specifies the directory from which to load the IR/ptx/amdgcn files when TRITON_KERNEL_OVERRIDE is set to 1.
- TRITON_F32_DEFAULT sets the default input precision of tl.dot when using 32-bit floats, which can be either ieee, tf32, or tf32x3.
- TRITON_FRONT_END_DEBUGGING=1 disables exception wrapping when an error occurs in the compiler frontend, allowing the full stack trace to be seen.
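As an illustration of interpreter mode (referenced in the TRITON_INTERPRET item above), here is a minimal hedged example: the standard vector-add kernel run on CPU tensors with the interpreter enabled. Nothing here is specific to this fork; it is just the stock Triton API.
import os
os.environ["TRITON_INTERPRET"] = "1"  # must be set before triton is imported

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 1024
x, y = torch.randn(n), torch.randn(n)
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK_SIZE=256)
print("max abs error:", (out - (x + y)).abs().max().item())  # expect 0.0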
Kernel Override Steps
export TRITON_ALWAYS_COMPILE=1
export TRITON_KERNEL_DUMP=1
export TRITON_DUMP_DIR=<dump_dir>
export TRITON_KERNEL_OVERRIDE=1
export TRITON_OVERRIDE_DIR=<override_dir>
# Step 1: Run the kernel once to dump kernel's IRs and ptx/amdgcn in $TRITON_DUMP_DIR
# Step 2: Copy $TRITON_DUMP_DIR/<kernel_hash> to $TRITON_OVERRIDE_DIR
# Step 3: Delete the stages that you do not want to override and modify the stage you do want to override
# Step 4: Run the kernel again to see the overridden result
Version 2.0 is out! New features include:
- Many, many bug fixes
- Performance improvements
- Backend rewritten to use MLIR
- Support for kernels that contain back-to-back matmuls (e.g., flash attention)
Community contributions are more than welcome, whether it be to fix bugs or to add new features at github. For more detailed instructions, please visit our contributor's guide.
Supported Platforms:
- Linux
Supported Hardware:
- NVIDIA GPUs (Compute Capability 8.0+)
- AMD GPUs (ROCm 6.2+)
- Under development: CPUs
Dev Containers for the Triton project are available from the triton-dev-containers repository
- Consistency: All developers can work with the same development environment, ensuring uniform behavior across different systems.
- Isolation: The container prevents potential conflicts with software installed on your local machine.
- Portability: Easily share the development environment with team members, minimizing onboarding time and setup issues.
For detailed instructions on how to use the dev containers please see the dev container user guide