Run Teapot LLM on Windows: Installation Guide 2026

Last updated April 2026 — refreshed for current model versions, PyTorch 2.7, CUDA 13.x, and the new TinyTeapot 77M model.

Teapot LLM is a compact, hallucination-resistant language model built for local deployment — no cloud subscription, no data leakage, runs on a laptop CPU. This guide walks through every installation path on Windows: pip, Docker, and Llamafile. Whether you want a sub-second RAG pipeline or a structured JSON extractor that runs offline, you'll be up in under 20 minutes.

What changed since the original 2025 guide

TinyTeapot (77M) launched: a new ultra-lightweight model running ~40 tokens/second on a Colab CPU, ideal for edge devices and mobile prototypes. TeapotLLM (0.8B) remains the accuracy-first choice.
Base model corrected: Teapot is fine-tuned from flan-t5-base, not flan-t5-large as previously stated.
teapotai library v2.x: version 2.0.0 (April 2025) added extract() structured output via Pydantic, an improved refusal classifier, and the rag() method. Version 2.0.8 is the latest release as of April 2026.
PyTorch 2.7.0 is the current stable release (requires Python ≥ 3.10). CUDA 11.8, 12.6, and 12.8 are all supported wheel targets.
CUDA Toolkit 13.x: starting with CUDA 13.1, the Windows display driver is no longer bundled with the toolkit. Install the NVIDIA driver separately first.
Python 3.14 is the latest stable release (April 2026), but PyTorch only guarantees support from 3.10 upward. Use 3.10–3.12 for the most stable GPU stack today.

TL;DR: Which path should you take?

Method	Best for	GPU needed?	Setup time
pip install teapotai	Python developers, RAG apps	No (CPU works)	~5 min
Docker	Isolated environments, teams	Optional	~10 min
Llamafile	Non-technical users, zero setup	No	~2 min
Hugging Face Transformers (direct)	Custom fine-tuning, research	Recommended	~10 min

What is Teapot LLM?

Teapot is an open-source family of small language models from TeapotAI, designed for on-device, hallucination-resistant inference. Unlike general-purpose LLMs, Teapot is trained specifically to refuse answering when the provided context does not support the answer — making it well-suited for RAG pipelines, document Q&A, and structured data extraction where accuracy over the provided corpus matters more than generative creativity.

Two models are currently available:

Model	Parameters	Speed (CPU)	Best use case	Hugging Face
TinyTeapot	77M	~40 tok/s	Edge, mobile, low-latency demos	teapotai/tinyteapot
TeapotLLM	0.8B	~5 tok/s	Production RAG, high-fidelity extraction	teapotai/teapotllm

Both models are fine-tuned from flan-t5-base on a ~10MB synthetic dataset generated with DeepSeek-V3, trained for approximately 10 hours on an A100 GPU. The training data, model weights, and library are all MIT-licensed.

System Requirements

Minimum (CPU-only)

Windows 10 64-bit or Windows 11
Python 3.10–3.12 (recommended; 3.13+ works but has less PyTorch test coverage)
8 GB RAM (16 GB recommended for TeapotLLM; TinyTeapot runs fine on 4 GB)
2 GB free disk space

Recommended (GPU-accelerated)

NVIDIA GPU with 6 GB+ VRAM (RTX 3060 or better)
NVIDIA Driver ≥ 528.33 (supports CUDA 12.x); for CUDA 13.x, install driver separately — it is no longer bundled with the toolkit starting CUDA 13.1
CUDA Toolkit 12.8 (most stable PyTorch target as of April 2026) or 13.x for cutting-edge features
16 GB RAM
SSD with 10 GB+ free space

Note on CUDA versions: PyTorch 2.7.0 provides pre-built wheels for CUDA 11.8, 12.6, and 12.8. CUDA 13.x wheels are not yet in the stable PyTorch release as of April 2026 — check pytorch.org for the latest wheel availability before installing.

Method 1: pip install (Recommended for Developers)

The teapotai library wraps model loading, document embedding, prompt formatting, and error handling. For most use cases, this is the fastest path.

Step 1: Install Python

Download Python 3.11 or 3.12 from python.org. During installation, check "Add Python to PATH". Verify in Command Prompt:

python --version
# Python 3.11.x or 3.12.x

Step 2: Create a virtual environment

python -m venv teapot-env
teapot-env\Scripts\activate

You should see (teapot-env) at the start of your prompt.

Step 3: Install PyTorch

Choose the correct command based on whether you have an NVIDIA GPU:

CPU only:

pip install torch torchvision torchaudio

NVIDIA GPU with CUDA 12.8 (recommended for GPU users):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

NVIDIA GPU with CUDA 12.6:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Verify CUDA is available:

python -c "import torch; print(torch.cuda.is_available())"
# True if GPU is correctly configured
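
If you prefer a single check over separate one-liners, the version and CUDA checks from Steps 1 and 3 can be combined into one small script. This is a minimal sketch: the function names are made up for illustration, and the 3.10 to 3.12 range is simply the recommendation from this guide, not a PyTorch constant.

```python
import importlib.util
import sys

# Python range recommended by this guide for the most stable GPU stack.
RECOMMENDED_MIN = (3, 10)
RECOMMENDED_MAX = (3, 12)


def python_in_recommended_range(version=None):
    """Return True if (major, minor) falls inside the guide's recommended range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return RECOMMENDED_MIN <= major_minor <= RECOMMENDED_MAX


def environment_report():
    """Collect interpreter and PyTorch facts without failing if torch is absent."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": python_in_recommended_range(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the script also runs before Step 3

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Save it as a file inside the activated virtual environment and run it with python. A cuda_available value of False on a GPU machine usually points at a CPU-only wheel or a missing driver.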

Step 4: Install teapotai

pip install teapotai

This installs teapotai 2.0.8 (latest as of April 2026) and all dependencies: transformers, numpy, scikit-learn, pydantic, langsmith, and regex.

Step 5: Run your first query

from teapotai import TeapotAI

teapot = TeapotAI()

# Basic Q&A grounded in context
result = teapot.query(
    query="What is the boiling point of water?",
    context="Water boils at 100 degrees Celsius (212°F) at standard atmospheric pressure."
)
print(result)
# "Water boils at 100 degrees Celsius (212°F)."

On first run, the library downloads the teapotai/teapotllm model weights (~1.5 GB) from Hugging Face. Subsequent runs use the local cache.
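
To confirm that the weights really landed in the local cache, you can inspect the cache directory directly. The sketch below assumes the Hugging Face hub's default layout as I understand it (a hub folder under ~/.cache/huggingface, overridable with HF_HOME or HF_HUB_CACHE, with one models--org--name folder per repository). The helper names are hypothetical and not part of teapotai.

```python
import os
from pathlib import Path


def hub_cache_dir():
    """Resolve the hub cache directory, honoring HF_HUB_CACHE and HF_HOME if set."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def cached_model_size_mb(repo_id, cache_dir=None):
    """Return the on-disk size of a cached model repo in MB, or None if absent."""
    base = Path(cache_dir) if cache_dir else hub_cache_dir()
    repo_dir = base / ("models--" + repo_id.replace("/", "--"))
    if not repo_dir.is_dir():
        return None
    # Skip symlinks so files linked from snapshots/ into blobs/ are counted once.
    total = sum(
        f.stat().st_size
        for f in repo_dir.rglob("*")
        if f.is_file() and not f.is_symlink()
    )
    return total / (1024 * 1024)


if __name__ == "__main__":
    print(cached_model_size_mb("teapotai/teapotllm"))
```

A result of None means nothing has been downloaded yet for that repository in the resolved cache location.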

Using TinyTeapot for faster inference

To use the 77M TinyTeapot model instead:

from transformers import pipeline

pipe = pipeline("text2text-generation", model="teapotai/tinyteapot")
result = pipe("question: What is the capital of France? context: France is a country in Western Europe. Its capital is Paris.")
print(result[0]['generated_text'])
# "Paris"

Method 2: Retrieval-Augmented Generation (RAG)

Teapot's primary strength is RAG — it selects the most relevant documents from a collection before answering, and refuses to answer when no context is relevant.

from teapotai import TeapotAI

# Initialize with a document collection
teapot = TeapotAI(documents=[
    "The Eiffel Tower is located in Paris, France. It was built in 1889.",
    "The Colosseum is an ancient amphitheater in Rome, Italy, completed in 80 AD.",
    "The Statue of Liberty was a gift from France to the United States, unveiled in 1886."
])

# The model retrieves the relevant document automatically
answer = teapot.query("When was the Eiffel Tower built?")
print(answer)
# "The Eiffel Tower was built in 1889."

# Ask about something not in the documents
answer = teapot.query("What is the population of Tokyo?")
print(answer)
# "I don't know." (refusal — no relevant context)

# Multi-turn chat with history
history = [
    {"role": "user", "content": "Tell me about Rome."},
]
response = teapot.chat(history)
print(response)
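
To make the retrieve-then-refuse idea from the top of this section concrete, here is a toy sketch that ranks documents by bag-of-words cosine similarity and returns nothing when no document clears a threshold. It is not how teapotai retrieves documents internally. The stopword list, the 0.2 threshold, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real retriever would use embeddings instead.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "in", "of", "to", "from", "what", "when"}


def _vector(text):
    """Lowercased bag-of-words term counts with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=1, min_score=0.2):
    """Return up to top_k documents scoring at least min_score, best first."""
    q = _vector(query)
    scored = sorted(((_cosine(q, _vector(d)), d) for d in documents), reverse=True)
    return [doc for score, doc in scored[:top_k] if score >= min_score]


if __name__ == "__main__":
    docs = [
        "The Eiffel Tower is located in Paris, France. It was built in 1889.",
        "The Colosseum is an ancient amphitheater in Rome, Italy, completed in 80 AD.",
    ]
    print(retrieve("When was the Eiffel Tower built?", docs))
    print(retrieve("What is the population of Tokyo?", docs))  # empty list: nothing relevant
```

An empty result is the point at which a grounded pipeline should refuse rather than pass an unrelated document to the model.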

Method 3: Structured Data Extraction

The extract() method, added in v2.0.0, returns a validated Pydantic model — useful for parsing documents into structured records.

from teapotai import TeapotAI
from pydantic import BaseModel

class Monument(BaseModel):
    name: str
    location: str
    year_built: int

teapot = TeapotAI()

monument = teapot.extract(
    Monument,
    context="The Eiffel Tower is located in Paris, France. It was built in 1889."
)
print(monument)
# Monument(name='Eiffel Tower', location='Paris, France', year_built=1889)

Method 4: Docker Installation

Docker is the cleanest option for teams and CI environments. It avoids Python version conflicts and keeps dependencies isolated.

Prerequisites

Docker Desktop for Windows (latest stable) with WSL2 backend enabled
For GPU support: NVIDIA Container Toolkit — install via NVIDIA's Container Toolkit guide

Basic container with the Streamlit chat UI

REM Pull and run the TeapotAI chat demo (Command Prompt syntax)
docker run -p 8501:8501 --rm ^
  -e TEAPOT_MODEL=teapotllm ^
  ghcr.io/zakerytclarke/teapot:latest

Then open http://localhost:8501 in your browser. The first run downloads model weights inside the container; add a volume mount to cache them between runs:

docker run -p 8501:8501 --rm ^
  -v %USERPROFILE%\.cache\huggingface:/root/.cache/huggingface ^
  ghcr.io/zakerytclarke/teapot:latest

With GPU passthrough

docker run -p 8501:8501 --rm --gpus all ^
  -v %USERPROFILE%\.cache\huggingface:/root/.cache/huggingface ^
  ghcr.io/zakerytclarke/teapot:latest

Note: Docker image paths above are illustrative based on the project's GitHub Container Registry pattern. Verify the current image name and tag at ghcr.io/zakerytclarke/teapot before running, as the project does not yet have tagged releases on GitHub.
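
The docker run variants in this section differ only in their flags, and line continuation and home-directory syntax vary between Command Prompt, PowerShell, and WSL. One way to avoid shell differences is to build the argument list in Python. This is a sketch under the same caveat as the note above: the image name is copied from this guide and may need adjusting, and docker_run_args and launch are hypothetical helper names.

```python
import subprocess
from pathlib import Path

# Image name as given in this guide; verify it before running.
IMAGE = "ghcr.io/zakerytclarke/teapot:latest"


def docker_run_args(gpu=False, cache_dir=None, host_port=8501):
    """Build the docker run argument list for the chat demo container."""
    cache = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "huggingface"
    args = ["docker", "run", "--rm", "-p", f"{host_port}:8501"]
    if gpu:
        args += ["--gpus", "all"]  # requires working GPU passthrough
    # Mount the host Hugging Face cache so weights persist between runs.
    args += ["-v", f"{cache}:/root/.cache/huggingface", IMAGE]
    return args


def launch(**kwargs):
    """Start the container in the foreground; raises if docker exits non-zero."""
    subprocess.run(docker_run_args(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(docker_run_args()))
```

Because the arguments are passed as a list, no shell quoting or continuation characters are involved. Call launch(gpu=True) once the printed command looks right.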

Method 5: Llamafile (Zero-setup)

Llamafile bundles the model weights and a minimal runtime into a single executable. No Python, no Docker, no CUDA setup required.

Check the TeapotAI models page or the GitHub repository for a Llamafile download link.
Download the .llamafile executable for Windows.
Rename the file to add .exe if Windows does not recognize it as an executable, then run it from PowerShell:
.\teapot.llamafile.exe
A local web server starts. Open the URL shown in the terminal (typically http://localhost:8080).

Llamafile inference is CPU-only and slower than the GPU-accelerated pip path, but it requires zero configuration — useful for quick demos or for users who do not have Python installed.

Method 6: Direct Hugging Face Transformers (Advanced)

Use this path when you want full control over generation parameters, or are building a custom fine-tune on top of Teapot.

from transformers import pipeline

# TeapotLLM (0.8B)
pipe = pipeline(
    "text2text-generation",
    model="teapotai/teapotllm",
    revision="699ab39cbf586674806354e92fbd6179f9a95f4a",  # pinned revision for reproducibility
    device=0  # 0 = first GPU; remove for CPU
)

prompt = "question: What year was the Eiffel Tower built? context: The Eiffel Tower was built in 1889 in Paris."
result = pipe(prompt, max_new_tokens=50)
print(result[0]['generated_text'])
# "1889"

The model uses a seq2seq architecture (T5-family). The input prompt format matters: prefix your query with question: and the supporting text with context:. The teapotai library handles this formatting automatically — use direct transformers only if you need low-level control.
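
When calling transformers directly, a small helper keeps the question: and context: layout consistent across calls. This is a minimal sketch of the format described in this section. build_prompt is a hypothetical name, and the whitespace normalization is my own addition rather than something the model requires.

```python
def build_prompt(question, context):
    """Format a query and its supporting text in the question:/context: layout."""
    # Collapse newlines and repeated spaces so the prompt stays on one line.
    question = " ".join(question.split())
    context = " ".join(context.split())
    if not question:
        raise ValueError("question must not be empty")
    return f"question: {question} context: {context}"


if __name__ == "__main__":
    print(build_prompt(
        "What year was the Eiffel Tower built?",
        "The Eiffel Tower was built in 1889 in Paris.",
    ))
```

The returned string can be passed straight to the pipeline call shown above.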

Performance and Benchmarks

Teapot is designed for in-context tasks, not general-knowledge reasoning. The developers benchmark it specifically on context-faithful Q&A rather than standard suites like MMLU or HellaSwag, which measure world-knowledge recall.

Model	Hardware	Throughput	VRAM / RAM
TinyTeapot (77M)	CPU (Colab free tier)	~40 tok/s	~500 MB RAM
TeapotLLM (0.8B)	CPU (Colab free tier)	~5 tok/s	~3 GB RAM
TeapotLLM (0.8B)	NVIDIA RTX 3060 (12 GB)	~80–120 tok/s (estimated)	~2 GB VRAM

The CPU throughput figures are from the TeapotAI project's own benchmarks on a Colab free-tier instance (verify at teapotai.com/models). GPU throughput for TeapotLLM on consumer hardware is an estimate based on the model's parameter count — the project has not published official GPU benchmarks as of April 2026. Run python -c "import time; ..." with your specific hardware to measure actual throughput.
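
For a measurement on your own hardware, a small timing harness is enough. The sketch below is deliberately model-agnostic: measure_throughput is a hypothetical helper that times any callable returning the number of tokens it generated, so how you count tokens (for example with the pipeline's tokenizer) is left to the wrapper you pass in.

```python
import time


def measure_throughput(generate, prompt, runs=3):
    """Return the best tokens-per-second over several timed calls.

    generate(prompt) must return the number of tokens it produced.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, tokens / elapsed)
    return best


if __name__ == "__main__":
    def fake_generate(prompt):
        """Stand-in for a real model call: pretend to emit 20 tokens in 50 ms."""
        time.sleep(0.05)
        return 20

    print(f"{measure_throughput(fake_generate, 'hello'):.1f} tok/s")
```

Taking the best of several runs reduces the effect of first-call warm-up, such as lazy weight loading. Report the median instead if you care about typical latency.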

For comparison: models of similar scope like Phi-3-mini (3.8B) and Qwen3-0.6B (released 2025) offer broader general knowledge but do not have Teapot's trained refusal behavior. If your use case is strictly RAG or document extraction, Teapot's compact size and refusal training often make it the better fit over larger models that can hallucinate answers from training data.

Decision Tree: Which Model and Method?

You have no NVIDIA GPU and want the fastest start → pip install teapotai + TinyTeapot (77M) on CPU
You're building a production RAG pipeline → pip install teapotai + TeapotLLM (0.8B) + CUDA 12.8
You need structured JSON output from documents → teapotai v2.0.8 with extract() and Pydantic models
You're a team and want reproducible environments → Docker with volume-mounted HF cache
You want to share a demo with non-technical stakeholders → Llamafile executable
You're fine-tuning or doing research → Direct Hugging Face Transformers with a pinned revision

Common Pitfalls and Troubleshooting

torch.cuda.is_available() returns False

Confirm your PyTorch wheel matches your CUDA version: pip show torch — look for +cu128 or similar in the version string. A CPU-only wheel (no +cuXXX) will never see a GPU.
Verify your NVIDIA driver is up to date: run nvidia-smi in Command Prompt. If the command is not found, the driver is not installed.
Starting with CUDA Toolkit 13.1, the driver is not included in the toolkit installer. Install the driver from nvidia.com/drivers first, then install the toolkit separately.
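
The wheel check from the first item can be automated by parsing the version string. This sketch follows the rule stated above for Windows wheels (a +cuXXX suffix marks a CUDA build). The function names are hypothetical, and the example version strings are illustrative.

```python
def wheel_flavor(version):
    """Return the local build tag of a torch version string, e.g. 'cu128' or 'cpu'."""
    _, _, local = version.partition("+")
    return local or None


def is_cuda_build(version):
    """True only when the version string carries a +cuXXX tag."""
    flavor = wheel_flavor(version)
    return bool(flavor) and flavor.startswith("cu")


if __name__ == "__main__":
    for v in ("2.7.0+cu128", "2.7.0+cpu", "2.7.0"):
        print(v, "->", "CUDA build" if is_cuda_build(v) else "no CUDA tag")
```

In a live environment you would pass torch.__version__ to is_cuda_build. As a more direct cross-check, torch.version.cuda is None on CPU-only builds.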

ModuleNotFoundError: No module named 'teapotai'

Make sure your virtual environment is activated: teapot-env\Scripts\activate
Confirm you installed into the active environment: pip show teapotai

Model download hangs or times out

The TeapotLLM model is ~1.5 GB. On slow connections, set HF_HUB_DOWNLOAD_TIMEOUT=300 in your environment before running.
If Hugging Face is blocked in your region, set HF_ENDPOINT=https://hf-mirror.com or use huggingface-cli download teapotai/teapotllm to pre-cache.
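
Both settings can also be applied from Python at the top of a script, which avoids changing system-wide environment variables. This is a minimal sketch: configure_hf_env is a hypothetical helper, and it assumes the variables are set before teapotai or transformers is imported, since the hub client reads them when it loads.

```python
import os


def configure_hf_env(timeout_seconds=300, endpoint=None):
    """Set Hugging Face download settings for the current process only."""
    os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = str(timeout_seconds)
    if endpoint:
        os.environ["HF_ENDPOINT"] = endpoint
    # Return what is now in effect so callers can log it.
    return {
        key: os.environ[key]
        for key in ("HF_HUB_DOWNLOAD_TIMEOUT", "HF_ENDPOINT")
        if key in os.environ
    }


if __name__ == "__main__":
    print(configure_hf_env(timeout_seconds=300))
    # Import teapotai only after this point so the settings are picked up.
```

Pass endpoint="https://hf-mirror.com" only if the default endpoint is unreachable from your network.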

Slow inference on CPU

Switch to TinyTeapot (77M) for 8× faster throughput on CPU.
Reduce max_new_tokens in direct transformers usage — the default can be higher than needed for short extraction tasks.
Enable PyTorch inference mode: wrap your call in torch.inference_mode() to skip gradient tracking.

teapotai refuses all answers ("I don't know")

This is by design when the query cannot be answered from context. Verify your context argument contains the relevant text.
The model uses a logistic regression refusal classifier (teapot_refusal_classifier.joblib) — it errs on the side of refusing rather than hallucinating. If refusals are too aggressive for your use case, use the query() method's system_prompt parameter to tune behavior, or switch to a general-purpose model.

Docker: Cannot connect GPU (Windows)

WSL2 must be the Docker Desktop backend (not Hyper-V).
Install NVIDIA Container Toolkit for WSL2: NVIDIA WSL User Guide.
Run docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi to confirm GPU passthrough works before adding your application container.

What Was Removed and Why

The original 2025 version of this guide referenced flan-t5-large as the base model — this was incorrect. The actual base is flan-t5-base, a smaller architecture. The model size (0.8B parameters) is the fine-tuned TeapotLLM, not flan-t5-large. Corrected throughout.

The original also did not mention TinyTeapot (77M), which launched in early 2025 and is now the recommended starting point for CPU-only Windows setups due to its 8× throughput advantage over TeapotLLM on CPU.

FAQ

Does Teapot LLM work without a GPU on Windows?

Yes. Both TinyTeapot (77M) and TeapotLLM (0.8B) run on CPU. TinyTeapot reaches ~40 tokens/second on a Colab free-tier CPU (the project's own benchmark), making it practical for interactive use. TeapotLLM at ~5 tokens/second is slow for chat but fast enough for batch document extraction tasks.

What Python version should I use?

Python 3.11 or 3.12 is the current recommendation. PyTorch 2.7.0 requires Python ≥ 3.10. Python 3.14 (the latest stable as of April 2026) works but has less PyTorch pre-built wheel coverage — stay on 3.11 or 3.12 for the most predictable install experience.

Which CUDA version should I install?

CUDA 12.8 for PyTorch GPU support — it is the newest CUDA for which PyTorch 2.7 ships pre-built wheels. CUDA 13.x is available but PyTorch 2.7 stable wheels do not yet target it. Also note: starting with CUDA 13.1, install the NVIDIA display driver separately before the CUDA Toolkit.

Is Teapot LLM free to use commercially?

Yes. The model weights, training code, and teapotai library are all MIT-licensed. There are no usage fees or API costs for local inference.

How does Teapot compare to Ollama + Llama 4 for RAG?

Ollama running Llama 4 Scout (or similar) gives you a much larger general-purpose model (17B+ active params in MoE configuration) with broader knowledge, but it requires more RAM/VRAM and does not have Teapot's trained refusal behavior. Teapot's advantage is its tiny size (fits on a CPU with 4–8 GB RAM), purpose-built hallucination resistance, and the structured extract() API. For a local RAG pipeline where accuracy-over-context matters and hardware is limited, Teapot wins on efficiency. For open-ended chat or code tasks, use a larger model.

Can I fine-tune Teapot on my own data?

The base architecture (flan-t5-base) is a standard Hugging Face seq2seq model, so the usual approaches work: full fine-tuning with the transformers Trainer, or parameter-efficient tuning with PEFT (LoRA). The TeapotAI project does not currently publish a fine-tuning notebook, but any T5 fine-tuning tutorial carries over. Contact the project via Discord for enterprise fine-tuning support.

Does Teapot support languages other than English?

The developers note that the model has been evaluated primarily in English and has not been evaluated for non-English performance, so treat non-English results as experimental.

What is the difference between teapotai library v1.x and v2.x?

Version 2.0.0 (April 2025) introduced the extract() method for Pydantic-schema structured output, an improved logistic regression refusal classifier, and the rag() standalone retrieval method. If you installed teapotai before April 2025 and are on a v1.x release, upgrade with pip install --upgrade teapotai.

References and Further Reading

TeapotAI GitHub Repository (zakerytclarke/teapot) — source code, README, and issue tracker
teapotai on PyPI — latest version, dependencies, and release history
TeapotLLM Model Card on Hugging Face — model weights (0.8B), revision history
TinyTeapot Model Card on Hugging Face — 77M model weights
PyTorch Installation Selector — official CUDA/Python/OS wheel picker for PyTorch 2.7
CUDA Toolkit Release Notes — CUDA 13.x changes, driver bundling policy
TeapotAI Official Models Page — benchmark numbers and model comparison
Running DeepSeek V4 Flash Locally: Full 2026 Setup Guide — related Codersera guide for larger local models