Last modified: Jun 14, 2026

Install Llama-cpp-python Guide

Installing llama-cpp-python lets you run large language models locally on your own machine, using the CPU or a supported GPU. This guide walks through the installation step by step.

You do not need to be an expert. Just follow each section carefully. We cover Windows, Linux, and macOS with CPU and GPU support.

What is llama-cpp-python?

llama-cpp-python is a Python wrapper for llama.cpp. It loads and runs models in the GGUF format, including LLaMA, Mistral, and Gemma. The heavy lifting is done in C and C++, which keeps inference fast.

You can use it for text completion, chat, and embeddings. It is an inference library, so training and fine-tuning are done with other tools. It is a popular choice for developers who want local AI.

Prerequisites

Before you start, make sure you have these tools:

  • Python 3.8 or newer installed
  • pip package manager updated
  • A C++ compiler (like gcc or MSVC)
  • At least 4GB of free disk space

To check your Python version, run this command in your terminal (on some systems the command is python3 rather than python):


python --version

If Python is not installed, download it from python.org.
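The prerequisite list can also be checked from Python itself. The script below is a minimal sketch, not an official tool: the function name check_prereqs and the list of compiler executable names are assumptions made for illustration. It only looks for executables on your PATH, so a compiler such as MSVC that is available only inside a developer command prompt may be reported as missing.

```python
# check_prereqs.py
import shutil
import sys

MIN_PYTHON = (3, 8)   # minimum Python version stated in this guide
MIN_FREE_GB = 4       # free disk space stated in this guide
COMPILERS = ("gcc", "g++", "clang", "cl")  # assumed executable names to look for


def check_prereqs(path="."):
    """Return a dict describing whether each prerequisite looks satisfied."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    found = [name for name in COMPILERS if shutil.which(name)]
    pip_found = shutil.which("pip") is not None or shutil.which("pip3") is not None
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "pip_found": pip_found,
        "compilers_found": found,
        "disk_ok": free_gb >= MIN_FREE_GB,
    }


if __name__ == "__main__":
    for key, value in check_prereqs().items():
        print(f"{key}: {value}")
```

An empty compilers_found list is a hint to install a compiler before moving on to Step 1.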

Step 1: Install via pip (CPU only)

The easiest way is to use pip. Open your terminal and run:


pip install llama-cpp-python

This installs a CPU-only build. It works on Windows, Linux, and macOS. pip compiles llama.cpp from source during the install, which is why a C++ compiler is required and why the installation can take several minutes.

If you see errors, try upgrading pip first:


pip install --upgrade pip
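Before downloading any model, you can confirm that the package is visible to the interpreter you are running. The sketch below uses only the standard library; the function name installed_version is made up for this example.

```python
# check_install.py
from importlib import metadata, util


def installed_version(dist_name="llama-cpp-python", module_name="llama_cpp"):
    """Return the installed version string, or None if the package is missing."""
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version:
        print(f"llama-cpp-python {version} is installed")
    else:
        print("llama-cpp-python is not installed for this interpreter")
```

Run it with the same python command you plan to use for your own scripts, since each interpreter has its own set of installed packages.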

Step 2: Install with GPU support

For faster inference, install with GPU support. This uses CUDA on NVIDIA GPUs.

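First, install CUDA Toolkit 11.8 or newer from NVIDIA's site. Then run the command below. Recent releases of llama-cpp-python use the GGML_CUDA flag; older releases used LLAMA_CUBLAS, so check the README for the version you are installing:


CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python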

For AMD GPUs (ROCm), use the hipBLAS flag. It is GGML_HIPBLAS in recent releases and LLAMA_HIPBLAS in older ones:


CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python

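For Apple Silicon (M1 and later), llama.cpp enables Metal by default on macOS, so the standard pip command is usually enough. To request Metal explicitly, set CMAKE_ARGS="-DGGML_METAL=on" in the same way as above.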
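The VAR="value" command prefix used above is POSIX shell syntax and does not work in the Windows command prompt or PowerShell. One portable alternative is to set the variable from Python and call pip through the current interpreter. This is a sketch under assumptions: build_install and run_install are hypothetical helper names, and the flag string is passed through unchanged, so you still need to supply the flag that matches your GPU and package version.

```python
# gpu_install.py
import os
import subprocess
import sys


def build_install(cmake_flag, package="llama-cpp-python"):
    """Return (command, env) for a pip install that passes cmake_flag to the build."""
    env = dict(os.environ)
    env["CMAKE_ARGS"] = cmake_flag
    # --force-reinstall and --no-cache-dir make pip rebuild the package
    # instead of reusing a previously built CPU-only wheel.
    command = [
        sys.executable, "-m", "pip", "install",
        "--upgrade", "--force-reinstall", "--no-cache-dir",
        package,
    ]
    return command, env


def run_install(cmake_flag):
    """Run the install and raise CalledProcessError if pip fails."""
    command, env = build_install(cmake_flag)
    return subprocess.run(command, env=env, check=True)
```

Calling run_install with the CMake flag shown above for your GPU performs the same install as the shell one-liner, on any operating system.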

Step 3: Verify the installation

After installation, test it with a short script. The script needs a GGUF model file, so complete Step 4 first if you do not have one yet. Create a file called test_llama.py:


# test_llama.py
from llama_cpp import Llama

# Load a small model (you need to download it first)
model = Llama(model_path="path/to/model.gguf")
output = model("Hello, how are you?", max_tokens=50)
print(output["choices"][0]["text"])

Run the script:


python test_llama.py

If you see text output, the installation works. If there is an error, check the troubleshooting section below.
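A common cause of a failed test is a model path that points to a missing, incomplete, or non-GGUF file. GGUF files begin with the four ASCII bytes "GGUF", so a quick check is possible without loading the model. The helper name looks_like_gguf is an assumption for this sketch, and passing the check does not prove the rest of the file is intact.

```python
# check_gguf.py
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(model_path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    path = Path(model_path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    print(looks_like_gguf("path/to/model.gguf"))
```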

Step 4: Download a model

You need a GGUF model file. Many are available on Hugging Face. For example, download llama-2-7b-chat.gguf or a smaller one like phi-2.gguf.

Use the huggingface-hub command-line tool, or download the file manually from the model page:


pip install huggingface-hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models

Now update the path in your script to "./models/llama-2-7b-chat.Q4_K_M.gguf".
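The same download can be done from Python with the hf_hub_download function from huggingface-hub, which is convenient when a script should fetch its own model. This is a minimal sketch; download_model is a hypothetical wrapper name, and the import sits inside the function so the file can be imported even when huggingface-hub is not installed.

```python
# download_model.py
from pathlib import Path


def download_model(repo_id, filename, local_dir="./models"):
    """Download one file from a Hugging Face repository and return its local path."""
    # Imported lazily so this module loads without huggingface-hub installed.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Calling download_model("TheBloke/Llama-2-7B-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf") fetches the same file as the command above and returns the path to pass as model_path.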

Troubleshooting common issues

If you get a compilation error, install a C++ compiler first. On Windows, use Visual Studio Build Tools. On Linux, install build-essential. On macOS, run xcode-select --install.

If you see ModuleNotFoundError, your pip might be for a different Python version. Use python -m pip install llama-cpp-python instead.

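For GPU errors, make sure the CUDA Toolkit is installed and on your PATH, and that the package was built with the GPU flag from Step 2. A plain pip install produces a CPU-only build. Check the toolkit version with nvcc --version.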
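The nvcc check can be scripted so that the version is compared against the minimum this guide names. The sketch below assumes nvcc prints a line containing "release X.Y", which is its usual format; the function names are made up for illustration.

```python
# check_cuda.py
import re
import shutil
import subprocess

MIN_CUDA = (11, 8)  # minimum CUDA Toolkit version stated in this guide


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_toolkit_version():
    """Return the installed toolkit version, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    version = cuda_toolkit_version()
    if version is None:
        print("nvcc was not found on PATH")
    else:
        print(f"CUDA Toolkit {version[0]}.{version[1]}, new enough: {version >= MIN_CUDA}")
```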

Out-of-memory errors usually mean the model plus its context does not fit in RAM or VRAM. Use a smaller or more heavily quantized model such as phi-2, or reduce the context length with n_ctx=512.
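Reducing the context length can be automated by trying progressively smaller values until one loads. The following is a sketch under assumptions: load_with_fallback and default_loader are hypothetical names, and the set of exception types caught is a guess at how a failed load surfaces in Python. A severe out-of-memory condition may terminate the process instead of raising, in which case no fallback is possible.

```python
# load_with_fallback.py
def default_loader(model_path, n_ctx):
    """Load a model with llama-cpp-python at the given context length."""
    # Imported lazily so this file can be imported without llama-cpp-python.
    from llama_cpp import Llama
    return Llama(model_path=model_path, n_ctx=n_ctx, verbose=False)


def load_with_fallback(model_path, ctx_sizes=(2048, 1024, 512), loader=default_loader):
    """Try each context length in order; return (model, n_ctx) for the first success."""
    last_error = None
    for n_ctx in ctx_sizes:
        try:
            return loader(model_path, n_ctx), n_ctx
        except (ValueError, MemoryError, RuntimeError) as error:
            # Assumed failure types: remember the reason, then try a smaller size.
            last_error = error
    raise RuntimeError(
        f"could not load {model_path} with any of {ctx_sizes}"
    ) from last_error
```

The loader parameter exists so the retry logic can be exercised with a stand-in function, without a real model file.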

Using the library in your code

Here is a full example that loads a model and generates text:


# chat_example.py
from llama_cpp import Llama

# Load model with custom settings
llm = Llama(
    model_path="./models/phi-2.Q4_K_M.gguf",
    n_ctx=2048,        # Context length
    n_threads=4,       # CPU threads
    n_gpu_layers=35    # Layers to offload to the GPU (0 = CPU only, -1 = all layers)
)

# Generate a response
prompt = "What is the capital of France?"
response = llm(prompt, max_tokens=100, temperature=0.7)
print(response["choices"][0]["text"])

Example output (yours will differ, because sampling with temperature=0.7 is random):


The capital of France is Paris. It is known for the Eiffel Tower and its rich culture.
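For long answers it is often nicer to print text as it is generated rather than wait for the full response. Calling the model with stream=True returns an iterator of chunks instead of a single dictionary. The wrapper below is a small sketch; stream_text is a hypothetical helper name, and llm is expected to be a Llama instance created as in the example above.

```python
# stream_example.py
def stream_text(llm, prompt, max_tokens=100):
    """Yield text fragments as the model produces them."""
    for chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        # Each chunk has the same layout as a full response,
        # but carries only the newly generated piece of text.
        yield chunk["choices"][0]["text"]


# Usage with a loaded model (not executed here):
#   for piece in stream_text(llm, "What is the capital of France?"):
#       print(piece, end="", flush=True)
```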

You can also use the create_chat_completion method for chat interactions:


# chat_completion.py
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
response = llm.create_chat_completion(messages)
print(response["choices"][0]["message"]["content"])
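A single call like the one above has no memory of earlier turns. To hold a conversation, keep the message list and append each reply before sending the next question. The function below is a minimal sketch of that pattern; chat_turn is a name invented for this example, and the max_tokens default is arbitrary.

```python
# chat_loop.py
def chat_turn(llm, history, user_text, max_tokens=256):
    """Send one user message, record the reply in history, and return the reply."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, max_tokens=max_tokens)
    reply = response["choices"][0]["message"]["content"]
    # Store the assistant's answer so the next turn sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage with a loaded model (not executed here):
#   history = [{"role": "system", "content": "You are a helpful assistant."}]
#   print(chat_turn(llm, history, "Explain quantum computing in simple terms."))
#   print(chat_turn(llm, history, "Now give a one-sentence summary."))
```

The history grows with every turn, so long conversations eventually exceed n_ctx and older messages need to be trimmed.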

Tips for better performance

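Use quantized models like Q4_K_M to reduce memory usage. They also run faster on CPU than higher-precision variants, at some cost in output quality.

Set n_gpu_layers to offload layers to the GPU. Use -1 to offload every layer, or start with a value like 35 and lower it if you run out of VRAM.

Set n_threads to the number of physical CPU cores. Going beyond that rarely helps and can slow generation down.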

Limit max_tokens to 512 or 1024 for quicker responses.
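These tips can be collected into a helper that proposes starting values for the Llama constructor. This is a rough sketch, not a tuning guide: suggest_settings is a hypothetical name, and the thread count assumes two hardware threads per physical core, which is not true of every CPU.

```python
# perf_settings.py
import os


def suggest_settings(use_gpu=False, logical_cores=None):
    """Return keyword arguments for Llama() based on simple heuristics."""
    logical = logical_cores or os.cpu_count() or 1
    # Assumption: two hardware threads per physical core (SMT / Hyper-Threading).
    threads = max(1, logical // 2)
    return {
        "n_threads": threads,
        "n_gpu_layers": -1 if use_gpu else 0,  # -1 requests all layers on the GPU
        "n_ctx": 2048,
    }


if __name__ == "__main__":
    print(suggest_settings())
```

The result can be unpacked into the constructor, for example Llama(model_path="./models/phi-2.Q4_K_M.gguf", **suggest_settings(use_gpu=True)), and then adjusted after measuring real performance.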

Conclusion

Installing llama-cpp-python is straightforward. You can install it with pip for CPU or with extra flags for GPU. Download a GGUF model and start generating text in minutes.

This library is well suited to local AI tasks. It works offline, so your prompts and outputs stay on your machine. Try it today and experiment with different models.

If you hit any issues, revisit the troubleshooting section or check the official llama-cpp-python GitHub page.