Last modified: Jun 11, 2026
Install OpenAI Triton in Python
What Is OpenAI Triton?
OpenAI Triton is a powerful language and compiler for writing custom deep learning kernels. It helps you write high-performance GPU code without needing CUDA expertise. Triton compiles your Python-like code into efficient GPU instructions.
Many developers use it to speed up operations like matrix multiplications, attention mechanisms, and custom neural network layers. It works well with PyTorch and other frameworks.
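Before you install, check that your system has a compatible NVIDIA GPU with a recent driver. This guide also installs the CUDA toolkit, version 11.4 or newer. The prebuilt Triton and PyTorch wheels bundle the CUDA components they need at runtime, so the toolkit mainly matters if you build from source or want nvcc available.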
System Requirements
Triton officially targets Linux, and the prebuilt wheels on PyPI are Linux builds. You also need a CUDA-capable NVIDIA GPU. This guide assumes compute capability 7.0 or higher, which covers Volta, Turing, Ampere, and later architectures. Newer Triton releases may raise that minimum, so check the project's README for the version you install.
Your Python version should be 3.8 or newer, with pip available. Recent Triton releases may require a newer Python. Plan for at least 8 GB of RAM and an internet connection for downloading packages.
If you use Windows, run the installation inside WSL2 with Ubuntu for better compatibility.
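The operating system and Python requirements above can be checked with the standard library alone. The following is a minimal sketch; the helper name check_environment and the MIN_PYTHON constant are hypothetical and simply encode the 3.8 minimum stated in this guide.

```python
import platform
import sys

# Minimum Python version stated in this guide; newer Triton releases may need more.
MIN_PYTHON = (3, 8)


def check_environment(min_python=MIN_PYTHON):
    """Return a small report on the OS and Python version (hypothetical helper)."""
    system = platform.system()  # "Linux" (also inside WSL2), "Windows", or "Darwin"
    return {
        "os": system,
        "is_linux": system == "Linux",
        "python": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If is_linux prints False on a Windows machine, that is the cue to move the installation into WSL2.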
Step 1: Install CUDA Toolkit
First, install the CUDA toolkit from NVIDIA's official website. Choose version 11.4 or later. Follow the installer instructions for your operating system.
After installation, verify CUDA is available by running this command in your terminal:
nvcc --version
You should see output similar to this:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
If the last line shows a release number, the toolkit is installed and nvcc is on your PATH. If the command is not found, add the toolkit's bin directory to your PATH environment variable or reinstall.
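The same check can be scripted. The sketch below runs nvcc when it is on PATH and pulls the release number out of its output. The function names are hypothetical, and the regular expression assumes the "release 12.3" wording shown in the sample output above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release():
    """Run nvcc if it is on PATH and return (major, minor), otherwise None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found on PATH, or its output was not recognized")
    else:
        # Tuples compare element by element, so (12, 3) >= (11, 4) is True.
        print("CUDA release:", release, "| meets 11.4+:", release >= (11, 4))
```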
Step 2: Install PyTorch
Triton works best with PyTorch. Install PyTorch with CUDA support using pip. Visit the official PyTorch website to get the correct command for your system.
For most users, this command works:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
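Replace cu118 with the tag for your CUDA version. The tag is "cu" followed by the major and minor version digits, so CUDA 12.1 becomes cu121. PyTorch publishes wheels only for certain CUDA versions, so confirm the tag on the PyTorch website.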
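The tag naming rule is simple enough to express in a few lines. This is an illustrative sketch: cuda_wheel_tag and pytorch_index_url are hypothetical helpers, and they only build the string. They do not check that PyTorch actually publishes wheels for that tag.

```python
def cuda_wheel_tag(cuda_version):
    """Turn a CUDA version such as "12.1" into a wheel tag such as "cu121"."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pytorch_index_url(cuda_version):
    """Build the index URL used in the pip command above (existence not verified)."""
    return f"https://download.pytorch.org/whl/{cuda_wheel_tag(cuda_version)}"


if __name__ == "__main__":
    print(cuda_wheel_tag("11.8"))     # cu118
    print(pytorch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```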
Verify PyTorch sees your GPU:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
Expected output:
True
NVIDIA GeForce RTX 3080
If torch.cuda.is_available() returns False, check your CUDA installation and GPU drivers.
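Once PyTorch sees the GPU, it can also report the compute capability, which is the number the system requirements refer to. A minimal sketch, assuming the 7.0 minimum from this guide; the helper names are hypothetical, and torch.cuda.get_device_capability is the PyTorch call doing the real work.

```python
def meets_min_capability(capability, minimum=(7, 0)):
    """Compare a (major, minor) compute capability against a minimum."""
    return tuple(capability) >= tuple(minimum)


def gpu_capability(device_index=0):
    """Return (major, minor) for a GPU, or None if PyTorch or a GPU is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = gpu_capability()
    if capability is None:
        print("No CUDA GPU visible to PyTorch")
    else:
        print("Compute capability:", capability,
              "| meets 7.0+:", meets_min_capability(capability))
```

For reference, an RTX 3080 reports (8, 6), while a Pascal card such as the GTX 1080 reports (6, 1) and falls below the minimum.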
Step 3: Install Triton from PyPI
The easiest way to install Triton is using pip. Run this command:
pip install triton
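This installs the latest stable version of Triton as a prebuilt wheel, so nothing is compiled on your machine on supported Linux platforms. Recent PyTorch wheels for Linux already list Triton as a dependency, so pip may report that it is already installed.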
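If you need a specific version, pin it. Each PyTorch release expects a particular Triton version, so pip may warn about a dependency conflict if you pin a different one: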
pip install triton==2.1.0
Check the installation by importing Triton in Python:
import triton
print(triton.__version__)
Expected output, if you pinned version 2.1.0 as shown above:
2.1.0
If the import succeeds and a version number prints, Triton is installed.
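Package metadata can be read without importing the packages, which helps when checking whether the installed Triton matches what PyTorch expects. The sketch below uses importlib.metadata from the standard library. The helper names are hypothetical, and the startswith filter is a rough heuristic, not a full requirement parser.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def triton_pins(requirements):
    """Pick out requirement strings that name triton (simple prefix match)."""
    return [r for r in (requirements or []) if r.lower().startswith("triton")]


if __name__ == "__main__":
    for name in ("torch", "triton"):
        print(name, "->", installed_version(name) or "not installed")
    try:
        # metadata.requires returns the dependency strings a package declares.
        print("torch declares:", triton_pins(metadata.requires("torch")))
    except metadata.PackageNotFoundError:
        print("torch is not installed, so there is nothing to compare against")
```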
Step 4: Verify with a Simple Kernel
Write a small Triton kernel to test the installation. This example adds two vectors element-wise:
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # The mask disables lanes that fall past the end of the data.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch enough program instances to cover every element.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output


# Test the kernel
x = torch.randn(10000, device='cuda')
y = torch.randn(10000, device='cuda')
result = add(x, y)
print(result[:5])
print(x[:5] + y[:5])
Example output (the values differ on every run because the inputs are random):
tensor([-0.1234, 0.5678, -1.2345, 2.3456, 0.9876], device='cuda:0')
tensor([-0.1234, 0.5678, -1.2345, 2.3456, 0.9876], device='cuda:0')
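The two printed tensors should be identical. To check every element and not just the first five, torch.allclose(result, x + y) should return True. This confirms Triton works correctly.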
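To see why the kernel needs both a launch grid and a mask, it can help to replay the same index arithmetic on the CPU. The following is a plain Python emulation, not Triton code: emulate_add is a hypothetical helper that loops over program instances one after another, whereas the GPU runs them in parallel.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic triton.cdiv performs."""
    return (a + b - 1) // b


def emulate_add(x, y, block_size):
    """Replay the kernel's block and mask logic sequentially on Python lists."""
    n_elements = len(x)
    output = [None] * n_elements
    num_programs = cdiv(n_elements, block_size)  # size of the launch grid
    for pid in range(num_programs):              # one iteration per program instance
        block_start = pid * block_size
        for offset in range(block_start, block_start + block_size):
            if offset < n_elements:              # the mask: skip out-of-range lanes
                output[offset] = x[offset] + y[offset]
    return output, num_programs


if __name__ == "__main__":
    x = list(range(10000))
    y = [2 * v for v in x]
    result, num_programs = emulate_add(x, y, block_size=1024)
    print(num_programs)                      # 10
    print(result == [3 * v for v in x])      # True
```

With 10000 elements and a block size of 1024, the last program instance covers offsets 9216 to 10239, so the mask is what keeps it from touching the 240 offsets past the end of the data.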
Common Installation Issues
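Sometimes you may see ModuleNotFoundError: No module named 'triton'. This usually means Triton is not installed in the Python environment you are running, often because pip belongs to a different interpreter or virtual environment. Reinstall with python -m pip install --upgrade triton so that pip and Python refer to the same environment.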
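A quick way to diagnose this kind of import error is to print which interpreter is running and where, if anywhere, it would load Triton from. A minimal sketch using only the standard library; locate_module is a hypothetical helper name.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Compare this path with the interpreter that your pip command belongs to.
    print("interpreter:", sys.executable)
    print("triton found at:", locate_module("triton"))
```

If the interpreter path is not the one your pip command belongs to, the package most likely went into a different environment.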
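Another common error is CUDA error: no kernel image is available for execution on the device. It means the installed binaries were not built for your GPU's architecture. This usually happens when the GPU is older than the supported minimum, although a very new GPU paired with an older PyTorch build can trigger it too. Check your GPU's compute capability against the 7.0+ requirement.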
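If you get RuntimeError: Triton Error [CUDA]: device-side assert triggered, your kernel most likely has a bug, typically an out-of-bounds memory access. Check that every tl.load and tl.store uses a mask covering the tail of the data. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes the error surface at the launch that caused it.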
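For Windows users, you might see OSError: [WinError 126] The specified module could not be found. This indicates a missing DLL dependency. Install Visual Studio Build Tools and the Windows SDK, then try installing Triton again. If the error persists, switch to WSL2, since native Windows is not an officially supported platform.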
Best Practices for Using Triton
Always use the @triton.jit decorator for your kernels. It tells Triton to compile the function into GPU code the first time it is launched.
Use tl.constexpr for compile-time constants like block sizes. These values are baked into the compiled kernel, which lets Triton optimize for them. Each distinct value triggers a separate compilation.
Time your kernels with triton.testing.do_bench(), which runs a function repeatedly and reports its runtime in milliseconds. Compare against the equivalent PyTorch implementation to see speedups.
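As a rough illustration, the sketch below times PyTorch's own x + y with do_bench and converts the result to effective memory bandwidth. The helper names are hypothetical, the three-transfers-per-element figure assumes an element-wise add (two reads, one write), and the timing is skipped when no GPU is available. To time a Triton kernel, pass a lambda that calls your own wrapper.

```python
def bandwidth_gbps(n_elements, element_size, ms):
    """Effective bandwidth in GB/s for an element-wise add: 2 reads + 1 write."""
    bytes_moved = 3 * n_elements * element_size
    return bytes_moved / (ms * 1e-3) / 1e9


def time_torch_add(n_elements=1 << 24):
    """Time x + y on the GPU; returns milliseconds, or None if it cannot run."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    try:
        import triton.testing
    except ImportError:
        return None
    x = torch.randn(n_elements, device="cuda")
    y = torch.randn(n_elements, device="cuda")
    # do_bench handles warmup and repetition, then returns a time in ms.
    return triton.testing.do_bench(lambda: x + y)


if __name__ == "__main__":
    n = 1 << 24
    ms = time_torch_add(n)
    if ms is None:
        print("Benchmark skipped: PyTorch, Triton, or a CUDA GPU is missing")
    else:
        # float32 elements are 4 bytes each.
        print(f"{ms:.3f} ms, about {bandwidth_gbps(n, 4, ms):.1f} GB/s")
```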
Keep your kernel functions simple. Complex control flow can slow down compilation. Break large kernels into smaller helper functions, which must also be decorated with @triton.jit to be callable from a kernel.
Conclusion
Installing OpenAI Triton in Python is straightforward when you follow the right steps. Start with CUDA and PyTorch, then use pip to install Triton. Test with a simple kernel to confirm everything works.
Triton lets you write custom GPU kernels with Python-like syntax. It is a valuable tool for deep learning researchers and engineers who need maximum performance.
If you encounter issues, check your GPU compatibility and CUDA version. The Triton community is active and helpful. Use the official GitHub repository for support.
Now you are ready to build fast custom operations for your neural networks. Happy coding with Triton!