
Run AI on Your Laptop
I am going to walk you through how we can spin up a real chatbot on your own machine using llama.cpp and its Python wrapper llama-cpp-python.
Everything stays local, it runs on your CPU, and we talk to it like we would talk to any other assistant.
By the end of this, we will have:
- Installed llama-cpp-python on a CPU-only setup
- Chosen a good starter chat model and downloaded it in GGUF format
- Wired up the create_chat_completion chat API
- Built a tiny command-line chatbot that remembers the conversation
What we are actually building
Think of this as a very minimal version of ChatGPT that lives on your laptop.
- It uses a local GGUF model file
- It runs through llama-cpp-python
- You type messages, it responds
- We keep a messages list so the model sees the full chat history
Nothing fancy, very hackable, and completely under your control.
Step 1: Requirements and mental checklist
Before we install anything, here is what I assume you have:
- Python 3.8 or newer
- At least 8 GB RAM
- A CPU with 4 or more cores
- Basic comfort with a terminal or command prompt
If your machine is modest, that is fine. We will choose smaller models to start with and use quantized GGUF files so things still feel responsive.
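If you want to sanity-check the basics from Python itself, a quick script like this covers the version and core count. The script is my own sketch, not part of any tool; RAM is left out because checking it portably needs a third-party package such as psutil.

```python
import os
import sys

# Minimal environment check for the requirements above.
assert sys.version_info >= (3, 8), "Python 3.8 or newer is required"

cores = os.cpu_count() or 1
print(f"Python {sys.version.split()[0]} with {cores} CPU core(s)")
if cores < 4:
    print("Fewer than 4 cores: expect slower generation, pick a small model")
```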
Step 2: Creating a project and installing llama-cpp-python
I like to isolate things in a virtual environment so the rest of the system stays clean.
So, I will first create a project folder, create a virtual environment inside it, and activate it. [See how we can create and activate a virtual environment]
Upgrade pip just to avoid older packaging quirks:
pip install --upgrade pip
Now install llama-cpp-python:
pip install llama-cpp-python
Step 3: Choosing a starter chat model
We want a model that:
- Is instruction tuned or explicitly labeled Chat or Instruct
- Comes in GGUF format
- Has or supports a proper chat format
Here are some starter options you can consider when you go browsing Hugging Face or other model hubs.
TinyLlama 1.1B Chat
- Very small, especially in 4 bit quantization
- Good for testing, quick experiments, and older laptops
- You will not get cutting edge reasoning, but it is great to verify your setup
Qwen2 0.5B or 1.5B Instruct GGUF
- Modern, instruction tuned, and surprisingly capable for the size
- Nice choice if you want something compact but still “smart enough”
Mistral 7B Instruct GGUF
- Bigger jump in quality compared to tiny models
- Needs more RAM so I would aim for at least 16 GB
- With 4 bit quantization it is still very usable on CPU
You do not have to pick exactly these three. The important thing is that you choose a model that is:
- Instruction tuned
- Compatible with llama.cpp in GGUF format
- Documented as a chat or assistant model
Performance
2 billion parameter models (2B)
- RAM Required: 2GB to 3GB
- Performance on CPU: Very fast. You will likely get text generated as quickly as you can read it.
7 or 8 billion parameter models (7B / 8B)
- RAM Required: 5GB to 7GB
- Performance on CPU: Slower. It might take a few seconds to generate a sentence, but the quality will be much higher.
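A rough back-of-the-envelope way to see where those RAM numbers come from: file size is roughly parameters times bits per weight. The helper below is my own sketch; Q4_K_M averages somewhere around 4.5 bits per weight once you include the higher-precision scales stored alongside the 4-bit values, and you should still add headroom for the KV cache and runtime.

```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters * bits per weight, converted to GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at ~4.5 bits/weight lands just under 4 GB on disk,
# which matches the 5-7 GB figure above once runtime overhead is added.
print(f"7B @ Q4_K_M ~ {approx_gguf_size_gb(7, 4.5):.1f} GB")
print(f"2B @ Q4_K_M ~ {approx_gguf_size_gb(2, 4.5):.1f} GB")
```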
Step 4: Downloading the GGUF model file
Once you pick a model, grab the GGUF file that matches your hardware. For example, a quantized file might look like:
TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf
Mistral-7B-Instruct-v0.2.Q4_K_M.gguf
Create a folder for models in your project and drop the file there:
local-llm-chatbot/
  .venv/
  models/
    mistral-7b-instruct.Q4_K_M.gguf
Step 5: First chat call with create_chat_completion
Create a file single_turn_chat.py:
```python
from llama_cpp import Llama

# Point this to your own GGUF file on disk
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

# Feel free to tweak n_ctx and n_threads for your machine
llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=8,
    verbose=False,
    # If the model docs say to use a specific chat format, you can set it:
    # chat_format="llama-2",
)

# This is the core idea of a chat based completion
messages = [
    {
        "role": "system",
        "content": "You are a friendly local chatbot running on my CPU.",
    },
    {
        "role": "user",
        "content": "In a few sentences, explain what a local LLM is.",
    },
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=200,
    temperature=0.7,
)

assistant_message = response["choices"][0]["message"]["content"]
print("\nAssistant:\n")
print(assistant_message)
```
Run it:
python single_turn_chat.py
The first token might take a moment, especially on slower CPUs, but you should see a coherent answer appear.
We just used the chat completion API that takes a list of messages and returns a response in an OpenAI style structure with choices, message, and content. [API reference]
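To make that shape concrete, here is roughly what the returned dictionary looks like. This is a hand-written mock following the OpenAI-style layout, not literal output from the library, and the field values are illustrative.

```python
# A mock of the OpenAI-style structure create_chat_completion returns.
response = {
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "A local LLM is a language model that runs entirely on your own machine.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 42, "completion_tokens": 18, "total_tokens": 60},
}

# The extraction pattern from the script above:
assistant_message = response["choices"][0]["message"]["content"]
print(assistant_message)
```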
Step 6: Turning this into a real chatbot loop
Now let us build an actual chatbot where you and I can talk in multiple turns. The core trick is very simple:
- We keep a shared messages list
- After every user input we append a {"role": "user", ...} message
- After every model reply we append a {"role": "assistant", ...} message
- We pass the full list back to create_chat_completion every time
Create chatbot.py:
```python
from llama_cpp import Llama

MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=8,
    verbose=False,
)

# This is our shared conversation state
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful local chatbot. "
            "Keep answers short by default, but add detail if I ask."
        ),
    }
]

print("Local CPU chatbot")
print("Type 'exit' to quit")
print()

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        print("Bot: Bye, talk soon")
        break

    # Add the user message to the chat history
    messages.append({"role": "user", "content": user_input})

    # Ask the model for the next assistant message
    response = llm.create_chat_completion(
        messages=messages,
        max_tokens=256,
        temperature=0.7,
    )

    assistant_message = response["choices"][0]["message"]["content"].strip()
    print(f"\nBot:\n{assistant_message}\n")

    # Add the assistant reply back into the history
    messages.append({"role": "assistant", "content": assistant_message})
```
Run it:
python chatbot.py
Now we have a tiny terminal chatbot that remembers what was said earlier in the conversation.
One thing to be aware of: as the messages list grows, the prompt gets longer which means more tokens to process. Over time that will slow responses down. A common next step is to keep only the last few exchanges or to summarize older parts of the conversation.
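One way to sketch that trimming is a sliding window: keep the system prompt plus only the most recent exchanges. The helper below is my own, not part of llama-cpp-python; summarizing the dropped turns instead would preserve more context at the cost of an extra model call.

```python
def trim_history(messages, max_turns=4):
    """Keep the system message plus the last max_turns user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

# Example: 10 exchanges collapse to the system prompt + last 4 exchanges.
history = [{"role": "system", "content": "You are a helpful local chatbot."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 1 system message + 8 recent messages = 9
```

In the chatbot loop you would call this right before create_chat_completion, so the prompt stays a bounded size no matter how long the conversation runs.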
Where you can take this next
Right now our chatbot is very simple. You type into the terminal and it responds. From here you could:
- Wrap this in a lightweight HTTP API with FastAPI or Flask
- Drop a simple web front end on top using a small React or Vue app
- Integrate a vector database to give your chatbot long term memory across documents and past chats. [See example using LlamaIndex]