Run Teapot LLM on Windows: Step-by-Step Installation Guide (2026)
Last updated April 2026 — refreshed for current model versions, PyTorch 2.7, CUDA 13.x, and the new TinyTeapot 77M model.
Teapot LLM is a compact, hallucination-resistant language model built for local deployment — no cloud subscription, no data leakage, runs on a laptop CPU. This guide walks through every installation path on Windows: pip, Docker, and Llamafile. Whether you want a sub-second RAG pipeline or a structured JSON extractor that runs offline, you'll be up in under 20 minutes.
What changed since the original 2025 guide
- TinyTeapot (77M) launched: a new ultra-lightweight model running ~40 tokens/second on a Colab CPU, ideal for edge devices and mobile prototypes. TeapotLLM (0.8B) remains the accuracy-first choice.
- Base model corrected: Teapot is fine-tuned from flan-t5-base, not flan-t5-large as previously stated.
- teapotai library v2.0.8 released April 2025, adding extract() structured output via Pydantic, an improved refusal classifier, and the rag() method.
- PyTorch 2.7.0 is the current stable release (requires Python ≥ 3.10). CUDA 11.8, 12.6, and 12.8 are all supported wheel targets.
- CUDA Toolkit 13.x: starting with CUDA 13.1, the Windows display driver is no longer bundled with the toolkit. Install the NVIDIA driver separately first.
- Python 3.14 is the latest stable release (April 2026), but PyTorch only guarantees support from 3.10 upward; use 3.10–3.12 for the most stable GPU stack today.
TL;DR: Which path should you take?
| Method | Best for | GPU needed? | Setup time |
|---|---|---|---|
| pip install teapotai | Python developers, RAG apps | No (CPU works) | ~5 min |
| Docker | Isolated environments, teams | Optional | ~10 min |
| Llamafile | Non-technical users, zero setup | No | ~2 min |
| Hugging Face Transformers (direct) | Custom fine-tuning, research | Recommended | ~10 min |
What is Teapot LLM?
Teapot is an open-source family of small language models from TeapotAI, designed for on-device, hallucination-resistant inference. Unlike general-purpose LLMs, Teapot is trained specifically to refuse answering when the provided context does not support the answer — making it well-suited for RAG pipelines, document Q&A, and structured data extraction where accuracy over the provided corpus matters more than generative creativity.
Two models are currently available:
| Model | Parameters | Speed (CPU) | Best use case | Hugging Face |
|---|---|---|---|---|
| TinyTeapot | 77M | ~40 tok/s | Edge, mobile, low-latency demos | teapotai/tinyteapot |
| TeapotLLM | 0.8B | ~5 tok/s | Production RAG, high-fidelity extraction | teapotai/teapotllm |
Both models are fine-tuned from flan-t5-base on a ~10MB synthetic dataset generated with DeepSeek-V3, trained for approximately 10 hours on an A100 GPU. The training data, model weights, and library are all MIT-licensed.
System Requirements
Minimum (CPU-only)
- Windows 10 64-bit or Windows 11
- Python 3.10–3.12 (recommended; 3.13+ works but has less PyTorch test coverage)
- 8 GB RAM (16 GB recommended for TeapotLLM; TinyTeapot runs fine on 4 GB)
- 2 GB free disk space
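The minimums above can be checked with a short preflight script before installing anything. This is a minimal sketch using only the standard library; the function names are hypothetical, the thresholds are the ones quoted in this guide, and RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys


def python_support(version=None):
    """Classify a (major, minor) Python version against this guide's advice."""
    major, minor = (version or sys.version_info)[:2]
    if (major, minor) < (3, 10):
        return "unsupported"   # below the PyTorch floor quoted in this guide
    if (major, minor) <= (3, 12):
        return "recommended"
    return "less tested"       # 3.13+ works but has less PyTorch test coverage


def has_free_disk(path=".", min_gb=2.0):
    """Return True if the drive holding `path` has at least `min_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_gb * 1024 ** 3


if __name__ == "__main__":
    print("Python:", python_support())
    print("2 GB free:", has_free_disk())
```

Run it with the interpreter you plan to use for the virtual environment, since several Python versions can coexist on one Windows machine.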
Recommended (GPU-accelerated)
- NVIDIA GPU with 6 GB+ VRAM (RTX 3060 or better)
- NVIDIA Driver ≥ 528.33 (supports CUDA 12.x); for CUDA 13.x, install driver separately — it is no longer bundled with the toolkit starting CUDA 13.1
- CUDA Toolkit 12.8 (most stable PyTorch target as of April 2026) or 13.x for cutting-edge features
- 16 GB RAM
- SSD with 10 GB+ free space
Note on CUDA versions: PyTorch 2.7.0 provides pre-built wheels for CUDA 11.8, 12.6, and 12.8. CUDA 13.x wheels are not yet in the stable PyTorch release as of April 2026 — check pytorch.org for the latest wheel availability before installing.
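The wheel targets in the note can be captured in a small lookup that builds the matching pip command. This is a sketch, not an official tool: the helper name is hypothetical, and the URL pattern is the one used by the PyTorch wheel index, so confirm it against the selector on pytorch.org before relying on it.

```python
# Base URL of the PyTorch wheel index; each CUDA target is a subdirectory.
PYTORCH_INDEX = "https://download.pytorch.org/whl/"

# CUDA versions this guide lists as having pre-built PyTorch 2.7 wheels.
SUPPORTED_CUDA = {"11.8": "cu118", "12.6": "cu126", "12.8": "cu128"}


def pip_command(cuda_version=None):
    """Build the pip command for a CPU-only or CUDA-enabled PyTorch install."""
    base = "pip install torch torchvision torchaudio"
    if cuda_version is None:
        return base  # CPU only: use the default package index
    tag = SUPPORTED_CUDA.get(cuda_version)
    if tag is None:
        raise ValueError(f"No pre-built wheel listed for CUDA {cuda_version}")
    return f"{base} --index-url {PYTORCH_INDEX}{tag}"


if __name__ == "__main__":
    print(pip_command("12.8"))
```

Raising an error for unlisted versions such as 13.x is deliberate: it is safer to stop than to guess an index URL that may not exist.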
Method 1: pip install (Recommended for Developers)
The teapotai library wraps model loading, document embedding, prompt formatting, and error handling. For most use cases, this is the fastest path.
Step 1: Install Python
Download Python 3.11 or 3.12 from python.org. During installation, check "Add Python to PATH". Verify in Command Prompt:
python --version
# Python 3.11.x or 3.12.x
Step 2: Create a virtual environment
python -m venv teapot-env
teapot-env\Scripts\activate
You should see (teapot-env) at the start of your prompt.
Step 3: Install PyTorch
Choose the correct command based on whether you have an NVIDIA GPU:
CPU only:
pip install torch torchvision torchaudio
NVIDIA GPU with CUDA 12.8 (recommended for GPU users):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
NVIDIA GPU with CUDA 12.6:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
Verify CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"
# True if GPU is correctly configured
Step 4: Install teapotai
pip install teapotai
This installs teapotai 2.0.8 (latest as of April 2026) and all dependencies: transformers, numpy, scikit-learn, pydantic, langsmith, and regex.
Step 5: Run your first query
from teapotai import TeapotAI
teapot = TeapotAI()
# Basic Q&A grounded in context
result = teapot.query(
    query="What is the boiling point of water?",
    context="Water boils at 100 degrees Celsius (212°F) at standard atmospheric pressure."
)
print(result)
# "Water boils at 100 degrees Celsius (212°F)."
On first run, the library downloads the teapotai/teapotllm model weights (~1.5 GB) from Hugging Face. Subsequent runs use the local cache.
Using TinyTeapot for faster inference
To use the 77M TinyTeapot model instead:
from transformers import pipeline
pipe = pipeline("text2text-generation", model="teapotai/tinyteapot")
result = pipe("question: What is the capital of France? context: France is a country in Western Europe. Its capital is Paris.")
print(result[0]['generated_text'])
# "Paris"
Method 2: Retrieval-Augmented Generation (RAG)
Teapot's primary strength is RAG — it selects the most relevant documents from a collection before answering, and refuses to answer when no context is relevant.
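To make the retrieve-then-refuse idea concrete, here is a toy sketch in plain Python. It is not teapotai's retriever: it swaps in bag-of-words cosine similarity for real embeddings, and the function names and the 0.4 threshold are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

DOCS = [
    "The Eiffel Tower is located in Paris, France. It was built in 1889.",
    "The Colosseum is an ancient amphitheater in Rome, Italy, completed in 80 AD.",
    "The Statue of Liberty was a gift from France to the United States, unveiled in 1886.",
]


def _vector(text):
    """Lowercased bag-of-words counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, threshold=0.4):
    """Return the best-matching document, or None when nothing is similar enough."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(d)), d) for d in documents]
    best_score, best_doc = max(scored, default=(0.0, None))
    return best_doc if best_score >= threshold else None


if __name__ == "__main__":
    print(retrieve("When was the Eiffel Tower built?", DOCS))   # the Eiffel Tower document
    print(retrieve("What is the population of Tokyo?", DOCS))   # None, so the caller should refuse
```

A `None` result is the signal to answer "I don't know" instead of passing weak context to the model. The library handles both steps for you, as the following example shows.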
from teapotai import TeapotAI
# Initialize with a document collection
teapot = TeapotAI(documents=[
    "The Eiffel Tower is located in Paris, France. It was built in 1889.",
    "The Colosseum is an ancient amphitheater in Rome, Italy, completed in 80 AD.",
    "The Statue of Liberty was a gift from France to the United States, unveiled in 1886."
])
# The model retrieves the relevant document automatically
answer = teapot.query("When was the Eiffel Tower built?")
print(answer)
# "The Eiffel Tower was built in 1889."
# Ask about something not in the documents
answer = teapot.query("What is the population of Tokyo?")
print(answer)
# "I don't know." (refusal: no relevant context)
# Multi-turn chat with history
history = [
    {"role": "user", "content": "Tell me about Rome."},
]
response = teapot.chat(history)
print(response)
Method 3: Structured Data Extraction
The extract() method, added in v2.0.0, returns a validated Pydantic model — useful for parsing documents into structured records.
from teapotai import TeapotAI
from pydantic import BaseModel
class Monument(BaseModel):
    name: str
    location: str
    year_built: int
teapot = TeapotAI()
monument = teapot.extract(
    Monument,
    context="The Eiffel Tower is located in Paris, France. It was built in 1889."
)
print(monument)
# Monument(name='Eiffel Tower', location='Paris, France', year_built=1889)
Method 4: Docker Installation
Docker is the cleanest option for teams and CI environments. It avoids Python version conflicts and keeps dependencies isolated.
Prerequisites
- Docker Desktop for Windows (latest stable) with WSL2 backend enabled
- For GPU support: NVIDIA Container Toolkit — install via NVIDIA's Container Toolkit guide
Basic container with the Streamlit chat UI
REM Pull and run the TeapotAI chat demo (Command Prompt syntax: ^ continues a line)
docker run -p 8501:8501 --rm ^
  -e TEAPOT_MODEL=teapotllm ^
  ghcr.io/zakerytclarke/teapot:latest
Then open http://localhost:8501 in your browser. The first run downloads model weights inside the container; add a volume mount to cache them between runs:
docker run -p 8501:8501 --rm ^
  -v %USERPROFILE%\.cache\huggingface:/root/.cache/huggingface ^
  ghcr.io/zakerytclarke/teapot:latest
With GPU passthrough
docker run -p 8501:8501 --rm --gpus all ^
  -v %USERPROFILE%\.cache\huggingface:/root/.cache/huggingface ^
  ghcr.io/zakerytclarke/teapot:latest
Note: Docker image paths above are illustrative based on the project's GitHub Container Registry pattern. Verify the current image name and tag at ghcr.io/zakerytclarke/teapot before running, as the project does not yet have tagged releases on GitHub.
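Line-continuation characters and variable expansion differ between Command Prompt, PowerShell, and WSL, so one option is to assemble the same command in Python and avoid shell quoting altogether. The sketch below only builds the argument list; the helper name is hypothetical and the image name is the illustrative one from this guide.

```python
from pathlib import Path

# Illustrative image name taken from this guide; verify it before use.
IMAGE = "ghcr.io/zakerytclarke/teapot:latest"


def docker_run_args(gpu=False, cache_dir=None, port=8501):
    """Build an argument list for `docker run` that mounts the Hugging Face cache."""
    cache = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "huggingface"
    args = ["docker", "run", "--rm", "-p", f"{port}:{port}"]
    if gpu:
        args += ["--gpus", "all"]  # requires the NVIDIA Container Toolkit
    args += ["-v", f"{cache}:/root/.cache/huggingface", IMAGE]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(docker_run_args(gpu=False)))
```

To launch the container, pass the list to `subprocess.run(args, check=True)`; because the arguments are never parsed by a shell, the same script behaves identically from any terminal.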
Method 5: Llamafile (Zero-setup)
Llamafile bundles the model weights and a minimal runtime into a single executable. No Python, no Docker, no CUDA setup required.
- Check the TeapotAI models page or the GitHub repository for a Llamafile download link.
- Download the .llamafile executable for Windows.
- Rename the file to add .exe if Windows does not recognize it, or run from PowerShell: .\teapot.llamafile.exe
- A local web server starts. Open the URL shown in the terminal (typically http://localhost:8080).
Llamafile inference is CPU-only and slower than the GPU-accelerated pip path, but it requires zero configuration — useful for quick demos or for users who do not have Python installed.
Method 6: Direct Hugging Face Transformers (Advanced)
Use this path when you want full control over generation parameters, or are building a custom fine-tune on top of Teapot.
from transformers import pipeline
# TeapotLLM (0.8B)
pipe = pipeline(
    "text2text-generation",
    model="teapotai/teapotllm",
    revision="699ab39cbf586674806354e92fbd6179f9a95f4a",  # pinned revision for reproducibility
    device=0  # 0 = first GPU; remove for CPU
)
prompt = "question: What year was the Eiffel Tower built? context: The Eiffel Tower was built in 1889 in Paris."
result = pipe(prompt, max_new_tokens=50)
print(result[0]['generated_text'])
# "1889"
The model uses a seq2seq architecture (T5-family). The input prompt format matters: prefix your query with question: and the supporting text with context:. The teapotai library handles this formatting automatically; use direct transformers only if you need low-level control.
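When calling the pipeline directly, a small helper keeps the question:/context: format consistent across calls. This is a minimal sketch with a hypothetical function name; it mirrors the format used in this guide's examples and is not taken from the teapotai source.

```python
def build_prompt(question, context):
    """Format a question and its supporting text the way this guide's examples do."""
    question = " ".join(question.split())  # collapse stray whitespace and newlines
    context = " ".join(context.split())
    return f"question: {question} context: {context}"


if __name__ == "__main__":
    print(build_prompt(
        "What year was the Eiffel Tower built?",
        "The Eiffel Tower was built in 1889 in Paris.",
    ))
```

Collapsing whitespace is a design choice for text pasted from PDFs or web pages, where line breaks inside the context would otherwise end up in the prompt.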
Performance and Benchmarks
Teapot is designed for in-context tasks, not general-knowledge reasoning. The developers benchmark it specifically on context-faithful Q&A rather than standard suites like MMLU or HellaSwag, which measure world-knowledge recall.
| Model | Hardware | Throughput | VRAM / RAM |
|---|---|---|---|
| TinyTeapot (77M) | CPU (Colab free tier) | ~40 tok/s | ~500 MB RAM |
| TeapotLLM (0.8B) | CPU (Colab free tier) | ~5 tok/s | ~3 GB RAM |
| TeapotLLM (0.8B) | NVIDIA RTX 3060 (12 GB) | ~80–120 tok/s (estimated) | ~2 GB VRAM |
The CPU throughput figures are from the TeapotAI project's own benchmarks on a Colab free-tier instance (verify at teapotai.com/models). GPU throughput for TeapotLLM on consumer hardware is an estimate based on the model's parameter count — the project has not published official GPU benchmarks as of April 2026. Run python -c "import time; ..." with your specific hardware to measure actual throughput.
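One way to take such a measurement is a small timing harness around whatever generation call you use. The sketch below is an assumption about how you might do it, not the project's benchmark script: `tokens_per_second` is a hypothetical helper, and the commented usage assumes a transformers pipeline named `pipe` and a `prompt` string already exist.

```python
import time


def tokens_per_second(generate, runs=3):
    """Average throughput of `generate`, a callable returning the number of tokens it produced."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


# Hypothetical usage with a transformers pipeline `pipe` and a `prompt` string:
#
#   def one_run():
#       text = pipe(prompt, max_new_tokens=50)[0]["generated_text"]
#       return len(pipe.tokenizer(text)["input_ids"])
#
#   print(f"{tokens_per_second(one_run):.1f} tok/s")
```

Discard the first run if you want steady-state numbers, since it can include model loading and cache warm-up.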
For comparison: models of similar scope like Phi-3-mini (3.8B) and Qwen3-0.6B (released 2025) offer broader general knowledge but do not have Teapot's trained refusal behavior. If your use case is strictly RAG or document extraction, Teapot's compact size and refusal training often make it the better fit over larger models that can hallucinate answers from training data.
Decision Tree: Which Model and Method?
- You have no NVIDIA GPU and want the fastest start → pip install teapotai + TinyTeapot (77M) on CPU
- You're building a production RAG pipeline → pip install teapotai + TeapotLLM (0.8B) + CUDA 12.8
- You need structured JSON output from documents → teapotai v2.0.8 with extract() and Pydantic models
- You're a team and want reproducible environments → Docker with volume-mounted HF cache
- You want to share a demo with non-technical stakeholders → Llamafile executable
- You're fine-tuning or doing research → Direct Hugging Face Transformers with a pinned revision
Common Pitfalls and Troubleshooting
torch.cuda.is_available() returns False
- Confirm your PyTorch wheel matches your CUDA version: run pip show torch and look for +cu128 or similar in the version string. A CPU-only wheel (no +cuXXX) will never see a GPU.
- Verify your NVIDIA driver is up to date: run nvidia-smi in Command Prompt. If the command is not found, the driver is not installed.
- Starting with CUDA Toolkit 13.1, the driver is not included in the toolkit installer. Install the driver from nvidia.com/drivers first, then install the toolkit separately.
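The version-string check from the first item can also be done programmatically. This is a small sketch with a hypothetical helper name; it only parses the string and does not query the driver.

```python
import re


def cuda_tag(torch_version):
    """Return the CUDA tag of a PyTorch version string, e.g. 'cu128', or None for CPU builds."""
    match = re.search(r"\+(cu\d+)$", torch_version)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(cuda_tag("2.7.0+cu128"))  # cu128
    print(cuda_tag("2.7.0+cpu"))    # None
```

In a real session you would pass `torch.__version__`; a `None` result means the installed wheel is CPU-only and needs to be reinstalled from a CUDA index.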
ModuleNotFoundError: No module named 'teapotai'
- Make sure your virtual environment is activated: teapot-env\Scripts\activate
- Confirm you installed into the active environment: pip show teapotai
Model download hangs or times out
- The TeapotLLM model is ~1.5 GB. On slow connections, set HF_HUB_DOWNLOAD_TIMEOUT=300 in your environment before running.
- If Hugging Face is blocked in your region, set HF_ENDPOINT=https://hf-mirror.com or use huggingface-cli download teapotai/teapotllm to pre-cache.
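If you prefer to set these variables from Python instead of the shell, do it before any Hugging Face library is imported, because the values are read at import time. The helper below is a hypothetical sketch; only the two environment variable names come from the list above.

```python
import os


def configure_hf(timeout_seconds=300, mirror=None):
    """Set Hugging Face download options; call before importing transformers or teapotai."""
    os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = str(timeout_seconds)
    if mirror:
        os.environ["HF_ENDPOINT"] = mirror  # e.g. "https://hf-mirror.com"


if __name__ == "__main__":
    configure_hf(timeout_seconds=300)
    # Import only after the variables are set, for example:
    # from teapotai import TeapotAI
    print(os.environ["HF_HUB_DOWNLOAD_TIMEOUT"])
```

Variables set this way apply only to the current process, which keeps the change from leaking into other projects on the same machine.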
Slow inference on CPU
- Switch to TinyTeapot (77M) for 8× faster throughput on CPU.
- Reduce max_new_tokens in direct transformers usage; the default can be higher than needed for short extraction tasks.
- Enable PyTorch inference mode: wrap your call in torch.inference_mode() to skip gradient tracking.
teapotai refuses all answers ("I don't know")
- This is by design when the query cannot be answered from context. Verify your context argument contains the relevant text.
- The model uses a logistic regression refusal classifier (teapot_refusal_classifier.joblib) that errs on the side of refusing rather than hallucinating. If refusals are too aggressive for your use case, use the query() method's system_prompt parameter to tune behavior, or switch to a general-purpose model.
Docker: Cannot connect GPU (Windows)
- WSL2 must be the Docker Desktop backend (not Hyper-V).
- Install NVIDIA Container Toolkit for WSL2: NVIDIA WSL User Guide.
- Run docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi to confirm GPU passthrough works before adding your application container.
What Was Removed and Why
The original 2025 version of this guide referenced flan-t5-large as the base model — this was incorrect. The actual base is flan-t5-base, a smaller architecture. The model size (0.8B parameters) is the fine-tuned TeapotLLM, not flan-t5-large. Corrected throughout.
The original also did not mention TinyTeapot (77M), which launched in early 2025 and is now the recommended starting point for CPU-only Windows setups due to its 8× throughput advantage over TeapotLLM on CPU.
FAQ
Does Teapot LLM work without a GPU on Windows?
Yes. Both TinyTeapot (77M) and TeapotLLM (0.8B) run on CPU. TinyTeapot reaches ~40 tokens/second on a standard CPU, making it practical for interactive use. TeapotLLM at ~5 tokens/second is slow for chat but fast enough for batch document extraction tasks.
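To put those throughput figures in wall-clock terms, the arithmetic is simply answer length divided by tokens per second. The helper name and the 100-token answer length below are illustrative assumptions; the rates are the CPU figures quoted above.

```python
def seconds_for(tokens, rate):
    """Estimated generation time in seconds for `tokens` at `rate` tokens per second."""
    return tokens / rate


if __name__ == "__main__":
    print(seconds_for(100, 40))  # TinyTeapot: 2.5
    print(seconds_for(100, 5))   # TeapotLLM: 20.0
```

A 20-second wait is noticeable in a chat window but acceptable for a batch job that processes documents unattended.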
What Python version should I use?
Python 3.11 or 3.12 is the current recommendation. PyTorch 2.7.0 requires Python ≥ 3.10. Python 3.14 (the latest stable as of April 2026) works but has less PyTorch pre-built wheel coverage — stay on 3.11 or 3.12 for the most predictable install experience.
Which CUDA version should I install?
CUDA 12.8 for PyTorch GPU support — it is the newest CUDA for which PyTorch 2.7 ships pre-built wheels. CUDA 13.x is available but PyTorch 2.7 stable wheels do not yet target it. Also note: starting with CUDA 13.1, install the NVIDIA display driver separately before the CUDA Toolkit.
Is Teapot LLM free to use commercially?
Yes. The model weights, training code, and teapotai library are all MIT-licensed. There are no usage fees or API costs for local inference.
How does Teapot compare to Ollama + Llama 4 for RAG?
Ollama running Llama 4 Scout (or similar) gives you a much larger general-purpose model (17B+ active params in MoE configuration) with broader knowledge, but it requires more RAM/VRAM and does not have Teapot's trained refusal behavior. Teapot's advantage is its tiny size (fits on a CPU with 4–8 GB RAM), purpose-built hallucination resistance, and the structured extract() API. For a local RAG pipeline where accuracy-over-context matters and hardware is limited, Teapot wins on efficiency. For open-ended chat or code tasks, use a larger model. See our comparison of local AI workflow tools for a broader breakdown.
Can I fine-tune Teapot on my own data?
The base architecture (flan-t5-base) is a standard Hugging Face seq2seq model — standard fine-tuning via the transformers Trainer or PEFT (LoRA) applies. The TeapotAI project does not currently publish a fine-tuning notebook, but any standard T5 fine-tuning tutorial applies. Contact the project via Discord for enterprise fine-tuning support.
Does Teapot support languages other than English?
The model has been evaluated primarily in English. The developers note it has not been evaluated for non-English performance — treat non-English results as experimental.
What is the difference between teapotai library v1.x and v2.x?
Version 2.0.0 (April 2025) introduced the extract() method for Pydantic-schema structured output, an improved logistic regression refusal classifier, and the rag() standalone retrieval method. If you installed teapotai before April 2025 and are on a v1.x release, upgrade with pip install --upgrade teapotai.
References and Further Reading
- TeapotAI GitHub Repository (zakerytclarke/teapot) — source code, README, and issue tracker
- teapotai on PyPI — latest version, dependencies, and release history
- TeapotLLM Model Card on Hugging Face — model weights (0.8B), revision history
- TinyTeapot Model Card on Hugging Face — 77M model weights
- PyTorch Installation Selector — official CUDA/Python/OS wheel picker for PyTorch 2.7
- CUDA Toolkit Release Notes — CUDA 13.x changes, driver bundling policy
- TeapotAI Official Models Page — benchmark numbers and model comparison
- Running DeepSeek V4 Flash Locally: Full 2026 Setup Guide — related Codersera guide for larger local models