
Run AI on Your Laptop
I am going to walk you through how we can spin up a real chatbot on your own machine using llama.cpp and its Python wrapper llama-cpp-python.
Everything stays local, it runs on your CPU, and we talk to it like we would talk to any other assistant.
By the end of this, we will have:
- Installed llama-cpp-python on a CPU-only setup
- Chosen a good starter chat model and downloaded it in GGUF format
- Wired up the create_chat_completion chat API
- Built a tiny command-line chatbot that remembers the conversation
What we are actually building
Think of this as a very minimal version of ChatGPT that lives on your laptop.
- It uses a local GGUF model file
- It runs through llama-cpp-python
- You type messages, it responds
- We keep a messages list so the model sees the full chat history
Nothing fancy, very hackable, and completely under your control.
Step 1: Requirements and mental checklist
Before we install anything, here is what I assume you have:
- Python 3.8 or newer
- At least 8 GB RAM
- A CPU with 4 or more cores
- Basic comfort with a terminal or command prompt
If your machine is modest, that is fine. We will choose smaller models to start with and use quantized GGUF files so things still feel responsive.
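If you want to sanity-check the basics from Python itself, a quick script like this covers the version and core count. The script is my own sketch, not part of any tool; RAM is left out because checking it portably needs a third-party package such as psutil.

```python
import os
import sys

# Minimal environment check for the requirements above.
assert sys.version_info >= (3, 8), "Python 3.8 or newer is required"

cores = os.cpu_count() or 1
print(f"Python {sys.version.split()[0]} with {cores} CPU core(s)")
if cores < 4:
    print("Fewer than 4 cores: expect slower generation, pick a small model")
```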
Step 2: Creating a project and installing llama-cpp-python
I like to isolate things in a virtual environment so the rest of the system stays clean.
So, I will first create a project folder, create a virtual environment inside it, and activate it. [See how we can create and activate a virtual environment]
Upgrade pip just to avoid older packaging quirks:
pip install --upgrade pip
Now install llama-cpp-python:
pip install llama-cpp-python
Step 3: Choosing a starter chat model
We want a model that:
- Is instruction tuned or explicitly labeled Chat or Instruct
- Comes in GGUF format
- Has or supports a proper chat format
Here are some starter options you can consider when you go browsing Hugging Face or other model hubs.
TinyLlama 1.1B Chat
- Very small, especially in 4 bit quantization
- Good for testing, quick experiments, and older laptops
- You will not get cutting edge reasoning, but it is great to verify your setup
Qwen2 0.5B or 1.5B Instruct GGUF
- Modern, instruction tuned, and surprisingly capable for the size
- Nice choice if you want something compact but still “smart enough”
Mistral 7B Instruct GGUF
- Bigger jump in quality compared to tiny models
- Needs more RAM so I would aim for at least 16 GB
- With 4 bit quantization it is still very usable on CPU
You do not have to pick exactly these three. The important thing is that you choose a model that is:
- Instruction tuned
- Compatible with llama.cpp in GGUF format
- Documented as a chat or assistant model
Performance
2 billion parameter models (2B)
- RAM Required: 2GB to 3GB
- Performance on CPU: Very fast. You will likely get text generated as quickly as you can read it.
7 or 8 billion parameter models (7B / 8B)
- RAM Required: 5GB to 7GB
- Performance on CPU: Slower. It might take a few seconds to generate a sentence, but the quality will be much higher.
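A rough back-of-the-envelope way to see where those RAM numbers come from: file size is roughly parameters times bits per weight. The helper below is my own sketch; Q4_K_M averages somewhere around 4.5 bits per weight once you include the higher-precision scales stored alongside the 4-bit values, and you should still add headroom for the KV cache and runtime.

```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters * bits per weight, converted to GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at ~4.5 bits/weight lands just under 4 GB on disk,
# which matches the 5-7 GB figure above once runtime overhead is added.
print(f"7B @ Q4_K_M ~ {approx_gguf_size_gb(7, 4.5):.1f} GB")
print(f"2B @ Q4_K_M ~ {approx_gguf_size_gb(2, 4.5):.1f} GB")
```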
Step 4: Downloading the GGUF model file
Once you pick a model, grab the GGUF file that matches your hardware. For example, a quantized file might look like:
TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf
Mistral-7B-Instruct-v0.2.Q4_K_M.gguf
Create a folder for models in your project and drop the file there:
local-llm-chatbot/
  .venv/
  models/
    mistral-7b-instruct.Q4_K_M.gguf
Step 5: First chat call with create_chat_completion
Create a file single_turn_chat.py:
```python
from llama_cpp import Llama

# Point this to your own GGUF file on disk
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

# Feel free to tweak n_ctx and n_threads for your machine
llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=8,
    verbose=False,
    # If the model docs say to use a specific chat format, you can set it:
    # chat_format="llama-2",
)

# This is the core idea of a chat based completion
messages = [
    {
        "role": "system",
        "content": "You are a friendly local chatbot running on my CPU.",
    },
    {
        "role": "user",
        "content": "In a few sentences, explain what a local LLM is.",
    },
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=200,
    temperature=0.7,
)

assistant_message = response["choices"][0]["message"]["content"]
print("\nAssistant:\n")
print(assistant_message)
```
Run it:
python single_turn_chat.py
The first token might take a moment, especially on slower CPUs, but you should see a coherent answer appear.
We just used the chat completion API that takes a list of messages and returns a response in an OpenAI style structure with choices, message, and content. [API reference]
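To make that shape concrete, here is roughly what the returned dictionary looks like. This is a hand-written mock following the OpenAI-style layout, not literal output from the library, and the field values are illustrative.

```python
# A mock of the OpenAI-style structure create_chat_completion returns.
response = {
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "A local LLM is a language model that runs entirely on your own machine.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 42, "completion_tokens": 18, "total_tokens": 60},
}

# The extraction pattern from the script above:
assistant_message = response["choices"][0]["message"]["content"]
print(assistant_message)
```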
Step 6: Turning this into a real chatbot loop
Now let us build an actual chatbot where you and I can talk in multiple turns. The core trick is very simple:
- We keep a shared messages list
- After every user input we append a {"role": "user", ...} message
- After every model reply we append a {"role": "assistant", ...} message
- We pass the full list back to create_chat_completion every time
Create chatbot.py:
```python
from llama_cpp import Llama

MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=8,
    verbose=False,
)

# This is our shared conversation state
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful local chatbot. "
            "Keep answers short by default, but add detail if I ask."
        ),
    }
]

print("Local CPU chatbot")
print("Type 'exit' to quit")
print()

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        print("Bot: Bye, talk soon")
        break

    # Add the user message to the chat history
    messages.append({"role": "user", "content": user_input})

    # Ask the model for the next assistant message
    response = llm.create_chat_completion(
        messages=messages,
        max_tokens=256,
        temperature=0.7,
    )

    assistant_message = response["choices"][0]["message"]["content"].strip()
    print(f"\nBot:\n{assistant_message}\n")

    # Add the assistant reply back into the history
    messages.append({"role": "assistant", "content": assistant_message})
```
Run it:
python chatbot.py
Now we have a tiny terminal chatbot that remembers what was said earlier in the conversation.
One thing to be aware of: as the messages list grows, the prompt gets longer which means more tokens to process. Over time that will slow responses down. A common next step is to keep only the last few exchanges or to summarize older parts of the conversation.
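One way to sketch that trimming is a sliding window: keep the system prompt plus only the most recent exchanges. The helper below is my own, not part of llama-cpp-python; summarizing the dropped turns instead would preserve more context at the cost of an extra model call.

```python
def trim_history(messages, max_turns=4):
    """Keep the system message plus the last max_turns user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

# Example: 10 exchanges collapse to the system prompt + last 4 exchanges.
history = [{"role": "system", "content": "You are a helpful local chatbot."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 1 system message + 8 recent messages = 9
```

In the chatbot loop you would call this right before create_chat_completion, so the prompt stays a bounded size no matter how long the conversation runs.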
Where you can take this next
Right now our chatbot is very simple. You type into the terminal and it responds. From here you could:
- Wrap this in a lightweight HTTP API with FastAPI or Flask
- Drop a simple web front end on top using a small React or Vue app
- Integrate a vector database to give your chatbot long term memory across documents and past chats. [See example using LlamaIndex]