Last modified: Jun 14, 2026

Install TensorRT-LLM in Python

TensorRT-LLM is a powerful library for optimizing large language models on NVIDIA GPUs. It speeds up inference and reduces memory usage. This guide shows you how to install TensorRT-LLM in Python step by step.

What is TensorRT-LLM?

TensorRT-LLM is an open-source library. It combines TensorRT with LLM optimizations. You can run models like LLaMA, GPT, and Falcon faster. It supports in-flight batching and paged attention. These features make it ideal for production use.

System Requirements

Before you start, check your system. You need a Linux machine with an NVIDIA GPU. The GPU must have compute capability 7.0 or higher. Examples include V100, A100, or H100. You also need CUDA 12.1 or later and cuDNN 8.9.

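Python version 3.8 to 3.11 is required. The TensorRT wheel installed in Step 2 is built for CPython 3.10 (the cp310 tag in its file name), so Python 3.10 is the safest choice for this guide. Use a virtual environment to avoid conflicts. TensorRT-LLM works best with PyTorch 2.0 or later.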
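
You can check these requirements from Python before installing anything. This is a minimal sketch, not an official checker. The helper names python_version_ok and find_tool are made up for this guide, and the version bounds simply mirror the 3.8 to 3.11 range stated above.

```python
import shutil
import sys

# Bounds taken from the requirement stated in this guide (3.8 to 3.11).
MIN_PY = (3, 8)
MAX_PY = (3, 11)


def python_version_ok(version, low=MIN_PY, high=MAX_PY):
    """Return True if the (major, minor) part of `version` lies in [low, high]."""
    return low <= tuple(version[:2]) <= high


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


if __name__ == "__main__":
    current = sys.version.split()[0]
    print("Python", current, "supported:", python_version_ok(sys.version_info))
    # nvidia-smi ships with the NVIDIA driver; None means it is not on PATH.
    print("nvidia-smi:", find_tool("nvidia-smi"))
```

If nvidia-smi prints as None, the driver is probably missing or not on your PATH.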

Step 1: Install NVIDIA Drivers and CUDA

First, update your NVIDIA driver. Run nvidia-smi to check your driver version. If it is older than 525, update it.

sudo apt update
sudo apt install nvidia-driver-535
sudo reboot
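
After the reboot, you can confirm the driver version from a script instead of reading the nvidia-smi table by eye. The sketch below assumes nvidia-smi is on your PATH and uses its --query-gpu=driver_version option. The function names are illustrative, and 525 is the threshold from this guide.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 525  # threshold used in this guide


def driver_major(version):
    """Extract the major number from a driver string such as '535.104.05'."""
    return int(version.strip().split(".")[0])


def needs_update(version, minimum=MIN_DRIVER_MAJOR):
    """Return True if the driver's major version is below `minimum`."""
    return driver_major(version) < minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version; return None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()  # one line per GPU; the first is enough here


if __name__ == "__main__":
    version = query_driver_version()
    if version is None:
        print("nvidia-smi not found or no GPU reported")
    else:
        print(version, "-> update needed" if needs_update(version) else "-> OK")
```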

Then install CUDA 12.1. You can use your distribution's package manager, or download the runfile from NVIDIA's website and run it as shown here:

wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run

Add the two export lines below to the end of ~/.bashrc, then run the source command to reload it in your current shell:

export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
source ~/.bashrc
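
To confirm that the PATH change took effect, you can look for nvcc and read the toolkit release it reports. This is a small sketch with illustrative function names. It assumes the usual "release X.Y" wording in the output of nvcc --version.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run nvcc if it is on PATH and return its release tuple, else None."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found: check the PATH line in ~/.bashrc")
    elif release >= (12, 1):
        print("CUDA toolkit release", release, "meets the 12.1 requirement")
    else:
        print("CUDA toolkit release", release, "is older than 12.1")
```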

Step 2: Install cuDNN and TensorRT

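Download cuDNN 8.9 for CUDA 12 from NVIDIA's developer site. If you downloaded the Debian package, install it with dpkg, substituting the exact name of your file. The name below is a placeholder: NVIDIA's "-archive" file names normally belong to the .tar.xz tarball, which is unpacked with tar rather than installed with dpkg.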

sudo dpkg -i cudnn-linux-x86_64-8.9.0.131_cuda12-archive.deb

Next, install TensorRT 8.6. Use the tar package:

tar -xzvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.1.tar.gz
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/TensorRT-8.6.1.6/lib

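Install the Python wheel for TensorRT. The cp310 tag means the wheel targets Python 3.10. If you plan to use the virtual environment from Step 3, run this command after activating it so the wheel lands in that environment: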

pip install /path/to/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl

Step 3: Set Up a Python Virtual Environment

Create a new environment to isolate dependencies:

python3 -m venv trtllm_env
source trtllm_env/bin/activate

Upgrade pip and install PyTorch with CUDA support:

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
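
Once PyTorch is installed, you can check that it sees the GPU and that the GPU meets the compute capability 7.0 requirement. The sketch below uses standard torch.cuda calls. The helper names are illustrative, and the report is None when PyTorch is not importable in the current environment.

```python
MIN_CAPABILITY = (7, 0)  # minimum compute capability stated in this guide


def capability_ok(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= minimum


def torch_cuda_report():
    """Summarize PyTorch's view of the GPU, or return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    report = {"torch": torch.__version__, "cuda_available": torch.cuda.is_available()}
    if report["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)
        report["cuda_build"] = torch.version.cuda  # CUDA version torch was built with
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = capability
        report["capability_ok"] = capability_ok(capability)
    return report


if __name__ == "__main__":
    print(torch_cuda_report())
```

If cuda_available is False, the usual causes are a CPU-only PyTorch wheel or a driver problem.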

Step 4: Install TensorRT-LLM from Source

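Clone the TensorRT-LLM repository, then fetch its submodules and Git LFS files (git-lfs must be installed first):

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull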

Install Python dependencies:

pip install -r requirements.txt

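Build the library with pip. This step compiles CUDA kernels and may take 10-20 minutes. If the editable install does not compile the native library for the version you cloned, the repository also provides scripts/build_wheel.py, which builds a wheel that you then install with pip. Check the repository's build instructions for your version.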

pip install -e .

If you get a memory error, reduce build jobs:

MAX_JOBS=4 pip install -e .
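
If you are unsure which MAX_JOBS value to pick, you can derive a starting point from the RAM and CPU count of the build machine. This is a rough heuristic for illustration only. The 4 GiB per compile job is an assumption of this sketch, not a documented TensorRT-LLM figure, so lower the result if the build still runs out of memory.

```python
import os


def suggest_max_jobs(total_ram_gib, cpu_count, gib_per_job=4.0):
    """Pick a job count limited by both CPU count and RAM.

    gib_per_job is an assumed memory budget per compiler process.
    """
    by_memory = int(total_ram_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


def total_ram_gib():
    """Physical RAM in GiB via POSIX sysconf (Linux); None where unsupported."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (ValueError, OSError, AttributeError):
        return None


if __name__ == "__main__":
    ram = total_ram_gib()
    cpus = os.cpu_count() or 1
    if ram is None:
        print("Could not read RAM size; start with MAX_JOBS=2")
    else:
        print(f"MAX_JOBS={suggest_max_jobs(ram, cpus)} pip install -e .")
```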

Step 5: Verify the Installation

Test that TensorRT-LLM works. Run a simple Python script:

import tensorrt_llm
print("TensorRT-LLM version:", tensorrt_llm.__version__)

The output should show a version number such as 0.7.0. If you see an import error, confirm that the virtual environment from Step 3 is active and that the build in Step 4 finished without errors.
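
When the import fails, it helps to know which interpreter is running and whether it can find the package at all. The following sketch uses only the standard library. The function names are illustrative.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv."""
    return sys.prefix != sys.base_prefix


def diagnose(module_name):
    """Report where a top-level module would be imported from, if anywhere."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return f"{module_name}: not found by {sys.executable}"
    return f"{module_name}: found at {spec.origin}"


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
    for name in ("tensorrt", "torch", "tensorrt_llm"):
        print(diagnose(name))
```

A "not found" line for tensorrt_llm while the interpreter path points outside trtllm_env usually means the environment was not activated.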

Step 6: Build a Model Example

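Now build an engine for a small LLaMA model to test the installation. Use the build script in the examples directory. The examples change between releases, so if build.py is missing in your checkout, follow the README in examples/llama instead: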

cd examples/llama
python build.py --model_dir /path/to/llama-model --output_dir /tmp/llama-engine

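Replace /path/to/llama-model with the directory that holds your model weights. The script creates an optimized engine. To run inference, note that the runner works on token IDs, not raw text, so load the model's tokenizer as well. The runtime API has changed across releases. This snippet follows the ModelRunner interface used by the repository's examples/run.py, so check that script if a call fails:

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-model")
runner = ModelRunner.from_dir(engine_dir="/tmp/llama-engine")
input_ids = torch.tensor(tokenizer.encode("Hello, how are you?"), dtype=torch.int32)
output_ids = runner.generate([input_ids], max_new_tokens=32, end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))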

You should see a generated text response. This confirms your installation works.

Common Issues and Fixes

Error: CUDA not found. Ensure CUDA is in your PATH. Run nvcc --version. If missing, reinstall CUDA.

Error: Out of memory during build. Reduce MAX_JOBS to 2 or 1. Also close other GPU programs.

Error: GLIBC version mismatch. Update your system libc or use a Docker container. NVIDIA provides official Docker images.
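
For the GLIBC error, you can compare the version named in the error message with the one on your system. This sketch uses platform.libc_ver() from the standard library. The required version is something you supply from your own error message. The function names are illustrative.

```python
import platform


def version_tuple(text):
    """Turn a dotted version such as '2.35' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(required, found=None):
    """Compare the system glibc with `required`.

    Returns None when the C library is not glibc (for example on musl).
    Pass `found` explicitly to test a version other than the local one.
    """
    if found is None:
        lib, found = platform.libc_ver()
        if lib != "glibc":
            return None
    return version_tuple(found) >= version_tuple(required)


if __name__ == "__main__":
    print("C library:", platform.libc_ver())
    # Replace "2.31" with the version from your own error message.
    print("glibc >= 2.31:", glibc_at_least("2.31"))
```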

Using Docker (Alternative Method)

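Docker simplifies installation. The commands below pull NVIDIA's CUDA 12.1 development image, which ships the CUDA toolkit but not TensorRT-LLM itself. The --gpus all flag requires the NVIDIA Container Toolkit on the host: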

docker pull nvidia/cuda:12.1.0-devel-ubuntu22.04
docker run --gpus all -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash

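Inside the container, install Python, pip, and git, then follow steps 2 to 5. The image already provides CUDA, so Step 1 is not needed there. This avoids most system dependency issues.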

Performance Tips

Use fp16 precision or int4 quantization for faster inference. TensorRT-LLM supports both. Set --dtype float16 in the build script. Also enable paged attention for long sequences.

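Monitor GPU memory with nvidia-smi. TensorRT-LLM uses memory efficiently, but large models can need 24GB or more. In fp16 the weights alone take about 2 bytes per parameter, so a 13-billion-parameter model needs roughly 26GB before the KV cache is counted.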
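
You can estimate the memory taken by the weights before you build an engine. The sketch below multiplies the parameter count by the bytes per parameter. It covers weights only, so the KV cache, activations, and runtime overhead come on top. The model sizes are example values, not measurements.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Example parameter counts; substitute the size of your own model.
    for label, params in (("7B", 7e9), ("13B", 13e9), ("70B", 70e9)):
        fp16 = weight_memory_gib(params, 16)
        int4 = weight_memory_gib(params, 4)
        print(f"{label}: fp16 {fp16:.1f} GiB, int4 {int4:.1f} GiB")
```

The int4 figure is a quarter of the fp16 figure, which is why quantization lets larger models fit on a single GPU.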

Conclusion

Installing TensorRT-LLM in Python requires careful setup of NVIDIA tools. Follow the steps: install CUDA, cuDNN, and TensorRT, then build from source. Use a virtual environment to keep things clean. Test with a small model to confirm success. For production, consider Docker to avoid conflicts. With TensorRT-LLM, you can run LLMs up to 4x faster, though the actual gain depends on the model, the GPU, and the baseline you compare against. Start optimizing your models today.