Last modified: Jun 11, 2026

Install OpenAI Triton in Python

What Is OpenAI Triton?

OpenAI Triton is a language and compiler for writing custom deep learning kernels. It lets you write high-performance GPU code in Python without writing CUDA C++ by hand. Triton compiles your Python-like code into efficient GPU instructions.

Many developers use it to speed up operations like matrix multiplications, attention mechanisms, and custom neural network layers. It works well with PyTorch and other frameworks.

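Before you install, check that your system has a compatible NVIDIA GPU and an up-to-date NVIDIA driver. This guide also installs the CUDA toolkit, version 11.4 or newer. The prebuilt Triton wheels on PyPI bundle their own PTX assembler, so the toolkit is mainly useful for tools such as nvcc and for building other CUDA software.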

System Requirements

You need a Linux system with a CUDA-capable NVIDIA GPU. The official Triton wheels on PyPI are built for Linux, so Windows users should plan on WSL2. The GPU must have compute capability 7.0 or higher. This includes Volta, Turing, Ampere, and later architectures.

Your Python version should be 3.8 or newer. Install pip and ensure you have at least 8 GB of RAM. A stable internet connection is needed for downloading packages.

If you use Windows, run Triton inside WSL2 with Ubuntu for better compatibility. Triton is developed and tested primarily on Linux.
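The requirements above can be checked with a short script before installing anything. This is a minimal sketch using only the standard library; the function name `check_system` and the dictionary keys are hypothetical, and finding `nvidia-smi` on PATH only suggests that a driver is installed, it does not prove the GPU is supported.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum Python version stated in this guide


def check_system():
    """Return basic prerequisite checks; no GPU libraries are required."""
    os_name = platform.system()  # e.g. "Linux", "Windows", "Darwin"
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "os": os_name,
        "is_linux": os_name == "Linux",
        # nvidia-smi ships with the NVIDIA driver, so this is a rough driver check
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


for name, value in check_system().items():
    print(f"{name}: {value}")
```

If `is_linux` is False on a Windows machine, running the same script inside WSL2 should report Linux.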

Step 1: Install CUDA Toolkit

First, install the CUDA toolkit from NVIDIA's official website. Choose version 11.4 or later. Follow the installer instructions for your operating system.

After installation, verify CUDA is available by running this command in your terminal:


nvcc --version

You should see output similar to this:


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107

If you see a version number, CUDA is installed correctly. If not, check your PATH environment variable or reinstall.
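If you prefer to run this check from Python, the sketch below looks for nvcc on PATH and extracts the release number from its output. The helper names `parse_nvcc_release` and `installed_cuda_release` are hypothetical, and the parsing assumes the "release X.Y" wording shown in the sample output above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release():
    """Run nvcc if it is on PATH and return its release tuple, else None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None  # nvcc is not on PATH
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


# Parse the last line of the sample output shown above.
sample = "Cuda compilation tools, release 12.3, V12.3.107"
print(parse_nvcc_release(sample))             # (12, 3)
print(parse_nvcc_release(sample) >= (11, 4))  # True
```

A result of None from `installed_cuda_release()` points to the same PATH problem described above.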

Step 2: Install PyTorch

Triton works best with PyTorch. Install PyTorch with CUDA support using pip. Visit the official PyTorch website to get the correct command for your system.

For most users, this command works:


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

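Replace cu118 with the tag for the CUDA build you want. For CUDA 12.1, use cu121. PyTorch publishes wheels only for certain CUDA versions, so confirm the tag on the PyTorch website.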
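The tag follows a simple pattern: the CUDA major and minor version joined without the dot. As a small illustration, the hypothetical helper `pytorch_index_url` below builds the index URL from a version string. It only formats the string and does not check that PyTorch actually publishes wheels for that version.

```python
def pytorch_index_url(cuda_version):
    """Build the wheel index URL for a CUDA version string such as "11.8"."""
    major, minor = cuda_version.split(".")[:2]
    # Tag pattern taken from the examples above: 11.8 -> cu118, 12.1 -> cu121
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


print(pytorch_index_url("11.8"))  # https://download.pytorch.org/whl/cu118
print(pytorch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```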

Verify PyTorch sees your GPU:


import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

Expected output (the device name depends on your GPU):


True
NVIDIA GeForce RTX 3080

If torch.cuda.is_available() returns False, check your CUDA installation and GPU drivers.
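Once PyTorch sees the GPU, you can also confirm that it meets the compute capability 7.0 requirement from the System Requirements section. The sketch below uses `torch.cuda.get_device_capability`; the helper names and the `MIN_CAPABILITY` constant are hypothetical, and the script falls back to a message when PyTorch or a GPU is not available.

```python
MIN_CAPABILITY = (7, 0)  # minimum compute capability stated in this guide


def meets_min_capability(capability, minimum=MIN_CAPABILITY):
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def current_gpu_capability():
    """Return (major, minor) for GPU 0, or None without PyTorch or a GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = current_gpu_capability()
if capability is None:
    print("No CUDA device visible to PyTorch")
else:
    print(capability, "supported:", meets_min_capability(capability))
```

For example, an RTX 3080 reports (8, 6), which passes the check, while a Pascal card reporting (6, 1) does not.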

Step 3: Install Triton from PyPI

The easiest way to install Triton is using pip. Run this command:


pip install triton

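This installs the latest stable version of Triton as a prebuilt wheel, so nothing is compiled during installation. On Linux, recent PyTorch releases already pull in a pinned Triton version as a dependency, so Triton may already be present after Step 2.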

If you need a specific version, specify it:


pip install triton==2.1.0

Check the installation by importing Triton in Python:


import triton
print(triton.__version__)

Expected output if you installed version 2.1.0 (otherwise you will see your installed version):


2.1.0

If no error appears, Triton is installed successfully.
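You can also read the installed versions without importing the packages, which is handy when an import itself fails. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report both packages, since PyTorch and Triton versions need to be compatible.
for name in ("triton", "torch"):
    print(name, installed_version(name) or "not installed")
```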

Step 4: Verify with a Simple Kernel

Write a small Triton kernel to test the installation. This example adds two vectors element-wise:


import torch
import triton
import triton.language as tl

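@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask out lanes past the end of the vector (the last block may be partial).
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    # The kernel reads raw pointers, so both inputs must be on the GPU and match in shape.
    assert x.is_cuda and y.is_cuda and x.shape == y.shape
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch a 1D grid with one program per block, rounding up.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output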

# Test the kernel
x = torch.randn(10000, device='cuda')
y = torch.randn(10000, device='cuda')
result = add(x, y)
print(result[:5])
print(x[:5] + y[:5])

Expected output (the values differ on each run because the inputs are random):


tensor([-0.1234,  0.5678, -1.2345,  2.3456,  0.9876], device='cuda:0')
tensor([-0.1234,  0.5678, -1.2345,  2.3456,  0.9876], device='cuda:0')

Both printed tensors should match. This confirms that Triton compiled the kernel and ran it correctly on your GPU.
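To see why the kernel needs a mask, it helps to work through the launch arithmetic for this example: 10000 elements do not divide evenly into blocks of 1024. The sketch below reproduces that arithmetic in plain Python, with no GPU required. The function `block_layout` is hypothetical, `cdiv` mirrors the ceiling division that `triton.cdiv` performs, and the code assumes `n_elements` is greater than zero.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def block_layout(n_elements, block_size):
    """Return (number of programs, valid lanes in last block, masked lanes)."""
    num_programs = cdiv(n_elements, block_size)
    last_start = (num_programs - 1) * block_size  # first offset of the last block
    valid_in_last = n_elements - last_start
    return num_programs, valid_in_last, block_size - valid_in_last


print(block_layout(10000, 1024))  # (10, 784, 240)
```

The grid launches 10 programs. The last one starts at offset 9216, so only 784 of its 1024 lanes are in bounds, and the mask disables the remaining 240.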

Common Installation Issues

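Sometimes you may see errors like ModuleNotFoundError: No module named 'triton'. This usually means Triton was installed into a different Python environment than the one you are running. Activate the right environment and install with python -m pip install --upgrade triton so that pip targets the same interpreter.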

Another common error is CUDA error: no kernel image is available for execution on the device. This means the compiled code does not target your GPU's architecture, most often because the GPU is too old. Check your GPU's compute capability. Triton requires compute capability 7.0+.

If you get RuntimeError: Triton Error [CUDA]: device-side assert triggered, your kernel has a bug. Double-check your code for incorrect memory access.

For Windows users, you might see OSError: [WinError 126] The specified module could not be found. This indicates that a required DLL is missing. Install Visual Studio Build Tools and the Windows SDK, then try again. If the problem persists, use WSL2, since Triton works best on Linux.
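Many of the import errors above come down to which interpreter is running. The sketch below prints the interpreter path and where, if anywhere, it finds Triton. It uses only the standard library; the helper name `module_location` is hypothetical.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Compare this path with the one pip reports via `python -m pip --version`.
print("interpreter:", sys.executable)
print("triton:", module_location("triton") or "not importable from this interpreter")
```

If the interpreter path is not inside the environment where you ran pip, that mismatch is the likely cause of the import error.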

Best Practices for Using Triton

Always use the @triton.jit decorator for your kernels. This tells Triton to compile the function into GPU code.

Use tl.constexpr for compile-time constants like block sizes. This helps Triton optimize your kernel better.

Profile your kernels with triton.testing.do_bench() to measure performance. Compare against PyTorch implementations to see speedups.
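As a rough sketch of such a measurement, the code below times PyTorch's built-in addition with `triton.testing.do_bench` and converts the result to effective memory bandwidth. The function names are hypothetical. It assumes a Triton version in which `do_bench` returns a single time in milliseconds when called with default arguments (older releases returned several values), and `bench_torch_add` needs a CUDA GPU, so it is defined here but not called.

```python
def effective_bandwidth_gbps(n_elements, element_size, ms):
    """Convert a runtime in milliseconds to GB/s for an element-wise add."""
    # An add reads two input vectors and writes one output vector.
    bytes_moved = 3 * n_elements * element_size
    return bytes_moved * 1e-9 / (ms * 1e-3)


def bench_torch_add(n_elements=1 << 20):
    """Time x + y on the GPU; requires PyTorch, Triton, and a CUDA device."""
    import torch
    import triton.testing

    x = torch.randn(n_elements, device="cuda")
    y = torch.randn(n_elements, device="cuda")
    # Assumption: do_bench returns one float (milliseconds) with default arguments.
    ms = triton.testing.do_bench(lambda: x + y)
    return effective_bandwidth_gbps(n_elements, x.element_size(), ms)


# Hypothetical measurement: 1,000,000 float32 elements processed in 1 ms.
print(f"{effective_bandwidth_gbps(1_000_000, 4, 1.0):.1f} GB/s")  # 12.0 GB/s
```

To compare against your own kernel, replace the lambda with a call to your Triton wrapper and run both on the same inputs.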

Keep your kernel functions simple. Complex control flow can slow down compilation. Break large kernels into smaller helper functions, which must also be decorated with @triton.jit so that kernels can call them.

Conclusion

Installing OpenAI Triton in Python is straightforward when you follow the right steps. Start with CUDA and PyTorch, then use pip to install Triton. Test with a simple kernel to confirm everything works.

Triton lets you write custom GPU kernels with Python-like syntax. It is a valuable tool for deep learning researchers and engineers who need maximum performance.

If you encounter issues, check your GPU compatibility and CUDA version. The Triton community is active and helpful. Use the official GitHub repository for support.

Now you are ready to build fast custom operations for your neural networks. Happy coding with Triton!
