Last modified: Jun 14, 2026
Install TensorRT-LLM in Python
TensorRT-LLM is a powerful library for optimizing large language models on NVIDIA GPUs. It speeds up inference and reduces memory usage. This guide shows you how to install TensorRT-LLM in Python step by step.
What is TensorRT-LLM?
TensorRT-LLM is an open-source library from NVIDIA. It builds on TensorRT and adds optimizations specific to large language models, including in-flight batching and paged attention. You can use it to run models such as LLaMA, GPT, and Falcon faster, which makes it a good fit for production serving.
System Requirements
Before you start, check your system. You need a Linux machine with an NVIDIA GPU. The GPU must have compute capability 7.0 or higher. Examples include V100, A100, or H100. You also need CUDA 12.1 or later and cuDNN 8.9.
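Python 3.8 to 3.11 is required. Note that the TensorRT wheel used in Step 2 is tagged cp310, which means it targets Python 3.10, so pick the wheel that matches your interpreter. Use a virtual environment to avoid conflicts. TensorRT-LLM works best with PyTorch 2.0 or later.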
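The requirements above can be checked with a short script. This is a minimal sketch, not an official tool: the function names are made up for illustration, and it assumes your nvidia-smi is recent enough to support the compute_cap query field (older drivers may not, in which case the GPU check is skipped).

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 8)   # lower bound stated in this guide
MAX_PYTHON = (3, 11)  # upper bound stated in this guide


def python_supported(version=None):
    """Return True if the (major, minor) version lies within 3.8 to 3.11."""
    if version is None:
        version = sys.version_info
    return MIN_PYTHON <= tuple(version[:2]) <= MAX_PYTHON


def capability_supported(capability_text):
    """Return True if a string such as '8.0' meets the 7.0 minimum."""
    try:
        major, minor = capability_text.strip().split(".")
        return (int(major), int(minor)) >= (7, 0)
    except ValueError:
        # Unparseable output is treated as "not supported"
        return False


def query_compute_capabilities():
    """Ask nvidia-smi for each GPU's compute capability; [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    for cap in query_compute_capabilities():
        print(f"GPU compute capability {cap} OK:", capability_supported(cap))
```

If the script prints nothing for the GPU, nvidia-smi is either missing or too old for this query; check the compute capability of your card on NVIDIA's website instead.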
Step 1: Install NVIDIA Drivers and CUDA
First, update your NVIDIA driver. Run nvidia-smi to check your driver version. If it is older than 525, update it.
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot
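After the reboot, you may want to confirm the driver version programmatically, for example in a setup script. The sketch below wraps the nvidia-smi driver_version query; the helper names are illustrative, and the 525 threshold is simply the one used in this guide.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 525  # threshold taken from this guide


def driver_major(version_text):
    """Extract the major number from a driver string such as '535.129.03'."""
    return int(version_text.strip().split(".")[0])


def needs_update(version_text, minimum=MIN_DRIVER_MAJOR):
    """Return True if the driver's major version is below the minimum."""
    return driver_major(version_text) < minimum


def installed_driver_version():
    """Return the first GPU's driver version, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.split()
    return lines[0] if result.returncode == 0 and lines else None


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("Could not read the driver version; is the driver installed?")
    elif needs_update(version):
        print(f"Driver {version} is older than {MIN_DRIVER_MAJOR}; update it.")
    else:
        print(f"Driver {version} is recent enough.")
```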
Then install CUDA 12.1. You can use your distribution's package manager, or download the runfile from NVIDIA's website and run it as shown here:
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
Add CUDA to your PATH in .bashrc:
export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
source ~/.bashrc
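To confirm that the PATH change took effect, you can check which CUDA release nvcc reports. The following is a small sketch with illustrative function names; it relies on the "release X.Y" text that nvcc --version prints.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Return (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release():
    """Run nvcc from PATH and return its release tuple, or None if not found."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found on PATH; check the export lines in .bashrc")
    elif release < (12, 1):
        print(f"Found CUDA {release[0]}.{release[1]}, but 12.1 or later is needed")
    else:
        print(f"Found CUDA {release[0]}.{release[1]}")
```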
Step 2: Install cuDNN and TensorRT
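Download cuDNN 8.9 from NVIDIA's developer site. The file whose name ends in -archive is a tarball, not a Debian package, so extract it and add its lib directory to the library path:
tar -xvf cudnn-linux-x86_64-8.9.0.131_cuda12-archive.tar.xz
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cudnn-linux-x86_64-8.9.0.131_cuda12-archive/lib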
Next, install TensorRT 8.6. Use the tar package:
tar -xzvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.1.tar.gz
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/TensorRT-8.6.1.6/lib
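Install the Python wheel for TensorRT. The cp310 tag in the file name means the wheel targets Python 3.10, so choose the wheel that matches your interpreter. If you use the virtual environment from Step 3, run this command after activating it, otherwise the package will not be visible inside the environment: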
pip install /path/to/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl
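A wheel built for a different Python version will be rejected by pip, so it can help to compare the wheel's tag with your interpreter before installing. This sketch parses the tag from the file name following the standard wheel naming scheme; the helper names are hypothetical.

```python
import sys


def wheel_python_tag(filename):
    """Return the Python tag (e.g. 'cp310') from a wheel file name."""
    name = filename.rsplit("/", 1)[-1]
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {name}")
    # Wheel names end in ...-<python tag>-<abi tag>-<platform tag>.whl,
    # so the Python tag is the third field from the end.
    return name[: -len(".whl")].split("-")[-3]


def interpreter_tag(version=None):
    """Return the CPython tag for a (major, minor) version, e.g. 'cp310'."""
    if version is None:
        version = sys.version_info
    return f"cp{version[0]}{version[1]}"


def wheel_matches(filename, version=None):
    """Return True if the wheel's Python tag matches the interpreter."""
    return wheel_python_tag(filename) == interpreter_tag(version)


if __name__ == "__main__":
    wheel = "tensorrt-8.6.1-cp310-none-linux_x86_64.whl"
    print("wheel tag:", wheel_python_tag(wheel))
    print("interpreter tag:", interpreter_tag())
    print("compatible:", wheel_matches(wheel))
```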
Step 3: Set Up a Python Virtual Environment
Create a new environment to isolate dependencies:
python3 -m venv trtllm_env
source trtllm_env/bin/activate
Upgrade pip and install PyTorch with CUDA support:
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
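Before moving on, it is worth confirming that the environment is active and that PyTorch can see the GPU. The sketch below uses only standard PyTorch calls (torch.cuda.is_available, torch.version.cuda, torch.cuda.get_device_name); the two helper functions are illustrative names.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when running inside a venv (sys.prefix differs from the base)."""
    return sys.prefix != sys.base_prefix


def torch_cuda_status():
    """Return a short status string without failing if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__} is installed, but CUDA is not available"
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"GPU: {torch.cuda.get_device_name(0)}"
    )


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print(torch_cuda_status())
```

If the first line prints False, activate trtllm_env again before installing anything else.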
Step 4: Install TensorRT-LLM from Source
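Clone the TensorRT-LLM repository. The project uses git submodules and Git LFS, so fetch those as well (this requires git-lfs to be installed):
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull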
Install Python dependencies:
pip install -r requirements.txt
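Build the library. This step compiles CUDA kernels and may take 10-20 minutes, or considerably longer on machines with few CPU cores. The upstream repository documents building a wheel with scripts/build_wheel.py and then installing it, rather than an editable pip install; check the installation docs for the release you cloned, since the procedure changes between versions:
python3 ./scripts/build_wheel.py --trt_root /path/to/TensorRT-8.6.1.6
pip install ./build/tensorrt_llm*.whl
If the build runs out of memory, reduce the number of parallel build jobs:
python3 ./scripts/build_wheel.py --trt_root /path/to/TensorRT-8.6.1.6 -j 4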
Step 5: Verify the Installation
Test that TensorRT-LLM works. Run a simple Python script:
import tensorrt_llm
print("TensorRT-LLM version:", tensorrt_llm.__version__)
Output should show a version number like 0.7.0. If you see an import error, check your environment.
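When the import fails, it helps to know which layer of the stack is broken. The sketch below tries each package in turn and reports the error instead of stopping at the first failure; check_import is an illustrative helper, not part of any of these libraries.

```python
import importlib


def check_import(module_name):
    """Try to import a module; return (ok, detail)."""
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        # OSError covers shared libraries that fail to load at import time
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(getattr(module, "__version__", "unknown version"))


if __name__ == "__main__":
    # Ordered from the lowest layer to the highest
    for name in ("torch", "tensorrt", "tensorrt_llm"):
        ok, detail = check_import(name)
        print(f"{name:14s} {'OK' if ok else 'FAILED'}  {detail}")
```

A failure in torch or tensorrt points at Steps 2 and 3, while a failure only in tensorrt_llm points at the build in Step 4.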
Step 6: Build a Model Example
Now build a small LLaMA model for testing. Use the provided script:
cd examples/llama
python build.py --model_dir /path/to/llama-model --output_dir /tmp/llama-engine
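Replace /path/to/llama-model with the directory that holds your model weights. The script creates an optimized engine. To run inference, tokenize the prompt yourself, because the runner works on token IDs rather than raw strings. The snippet below follows the pattern of the repository's example scripts and assumes the transformers package is installed; the runtime API changes between releases, so adjust it to the version you built:
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-model")
runner = ModelRunner.from_dir(engine_dir="/tmp/llama-engine")
# generate() expects a list of 1-D int32 token tensors, not raw strings
input_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids[0].int()
output_ids = runner.generate([input_ids], max_new_tokens=32, end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))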
You should see a generated text response. This confirms your installation works.
Common Issues and Fixes
Error: CUDA not found. Ensure CUDA is in your PATH. Run nvcc --version. If missing, reinstall CUDA.
Error: Out of memory during build. Reduce the number of parallel build jobs to 2 or 1. The compiler processes use host RAM, so also close other memory-heavy programs.
Error: GLIBC version mismatch. Update your system libc or use a Docker container. NVIDIA provides official Docker images.
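For the GLIBC error, you can compare the version your system provides with the one named in the error message. This is a minimal sketch using platform.libc_ver from the standard library; the helper names and the 2.29 value in the usage section are placeholders, so substitute the version from your own error message.

```python
import platform


def parse_version(text):
    """Turn a version string such as '2.35' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def system_glibc():
    """Return the system glibc version as a tuple, or None if not glibc."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    return parse_version(version)


def glibc_satisfies(required_text, installed=None):
    """Return True if the installed glibc is at least the required version."""
    if installed is None:
        installed = system_glibc()
    return installed is not None and installed >= parse_version(required_text)


if __name__ == "__main__":
    required = "2.29"  # placeholder: use the version from your error message
    print("system glibc:", system_glibc())
    print(f"satisfies {required}:", glibc_satisfies(required))
```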
Using Docker (Alternative Method)
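Docker simplifies installation. Pull NVIDIA's CUDA development image, which ships the CUDA 12.1 toolkit on Ubuntu 22.04 (it is a CUDA base image, not a prebuilt TensorRT-LLM image):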
docker pull nvidia/cuda:12.1.0-devel-ubuntu22.04
docker run --gpus all -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash
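Inside the container, install Python, pip, and git with apt, then follow steps 2 to 5. The image already contains the CUDA toolkit but not cuDNN, TensorRT, or TensorRT-LLM. This avoids most system dependency issues on the host.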
Performance Tips
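Use fp16 precision or int4 quantization for faster inference. TensorRT-LLM supports both. Set --dtype float16 in the build script to build the engine in fp16. Also enable paged attention for long sequences.
Monitor GPU memory with nvidia-smi. TensorRT-LLM uses memory efficiently, but weights dominate: in fp16 they take about 2 bytes per parameter, so a 7B-parameter model needs roughly 13 GiB for weights alone, before the KV cache and activations. Larger models need 24GB or more.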
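To get a rough feel for how much GPU memory a model's weights need at different precisions, you can multiply the parameter count by the bytes per parameter. The sketch below does only that; it ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound. The function name and the table of sizes are illustrative.

```python
# Storage size per parameter for common precisions
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params, dtype="float16"):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for params in (7e9, 13e9, 70e9):
        for dtype in ("float16", "int4"):
            gib = weight_memory_gib(params, dtype)
            print(f"{params / 1e9:.0f}B parameters in {dtype}: {gib:.1f} GiB")
```

Comparing the fp16 and int4 rows shows why quantization matters: int4 weights take a quarter of the fp16 footprint.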
Conclusion
Installing TensorRT-LLM in Python requires careful setup of NVIDIA tools. Follow the steps: install CUDA, cuDNN, and TensorRT, then build TensorRT-LLM from source. Use a virtual environment to keep things clean. Test with a small model to confirm success. For production, consider Docker to avoid conflicts. TensorRT-LLM can make LLM inference considerably faster, although the actual speedup depends on the model, the GPU, and the batch size. Start optimizing your models today.