Gemma 4 is Google’s newest open AI model and the successor to Gemma 3 and Gemma 3n, Google's family of open models that run well on local hardware, from phones to PCs.
You can run it on your own machine, keep data on device, and avoid cloud latency.
This guide explains what Gemma 4 is, how to install it, and how to run it in practice.
Gemma 4 is an open family of language models from Google DeepMind, released under the Apache 2.0 license.
“Open weights” means you can download the model files, run them on your own hardware, and fine‑tune them for your use cases within the license terms.
The family has four main sizes: Gemma 4 E2B, E4B, 26B A4B (Mixture‑of‑Experts), and 31B (dense).
E2B and E4B target phones, Raspberry Pi, and small PCs, while 26B and 31B target laptops with GPUs, workstations, and servers. Gemma 4 supports text and images for input, and it outputs text.
The smaller models support context windows up to 128K tokens, and the larger ones reach up to 256K tokens, which helps with long documents and codebases.
The models handle more than 140 languages and focus on reasoning, coding, and general assistant tasks.
Gemma 4 uses both dense and Mixture‑of‑Experts architectures, which trade off quality and speed in different ways. Google designed Gemma 4 to work as a local “agent” stack, not only as a plain chat model.
You have three main paths for local use: Ollama, the Hugging Face Transformers library, and LiteRT‑LM.
Pick one path based on your skills and target device.
Ollama is a desktop app and CLI that downloads and runs models for you.
1. Run `ollama --version` to confirm that Ollama works.
2. Run `ollama pull gemma4` to download the model.
3. Run `ollama list` to see available Gemma 4 variants and tags.

Ollama exposes different Gemma 4 sizes as tags:
- `gemma4:e2b` for the small edge model.
- `gemma4:e4b` for the edge model with more capacity.
- `gemma4:26b` for the 26B Mixture‑of‑Experts model.
- `gemma4:31b` for the 31B dense model.

Choose E2B or E4B if you have a laptop with shared memory or a low‑end GPU.
Use 26B or 31B only if you have at least 24 GB of GPU memory or a strong workstation.
Use this path if you want Python control, custom prompts, or integration into your own app.
1. Install the libraries with `pip install -U transformers torch accelerate`.
2. Pick a model id such as `google/gemma-4-E2B` or `google/gemma-4-31B`.

The Gemma 4 E2B Hugging Face page includes ready example code for chat prompts and image inputs.
You pass a message list into a processor, create tensors, call model.generate, and then parse the response with the same processor.
LiteRT‑LM is Google’s open‑source inference framework for edge LLMs.
Its CLI makes it easy to run models from a terminal, with no extra code.
Download `litert-community/gemma-4-E2B-it-litert-lm` from Hugging Face, then run:

```bash
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"
```

LiteRT‑LM supports function calling and local tools through a preset file, which you can pass with `--preset=preset.py`.
That file defines tools and routing logic, which turns Gemma 4 into a local agent for tasks such as file search or web lookups.
After installation, you can start chatting with Gemma 4 using a single command.
Run `ollama run gemma4:e2b` (or another size tag). Ollama maintains a conversation context so you can send follow‑up questions.
You can stop generation with Ctrl+C if needed.
To use Gemma 4 as a local API, run:
```bash
ollama serve
```
Then call the http://localhost:11434/api/chat endpoint from your app, with model: "gemma4:e2b" in the JSON body. This lets you connect desktop apps, scripts, or browser extensions to Gemma 4 without cloud calls.
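That call can be sketched in a few lines of Python using only the standard library. The endpoint and body shape follow Ollama's chat API, with `stream: False` so the reply arrives as one JSON object (a sketch, assuming a local `ollama serve` is running and the `gemma4:e2b` tag has been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def build_payload(model: str, user_message: str) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # ask for one complete JSON reply instead of a stream
    }

def chat(model: str, user_message: str, url: str = OLLAMA_URL) -> str:
    """Send one chat turn to a locally running `ollama serve` instance."""
    data = json.dumps(build_payload(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    return reply["message"]["content"]

# With Ollama running, this would print a fully local answer:
#   print(chat("gemma4:e2b", "Summarize what a context window is."))
```

Because everything stays on `localhost`, no tokens ever leave the machine.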
A simple text‑only example with Gemma 4 E2B looks like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-4-E2B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain what a context window is in plain language."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For multimodal input, you use the Gemma 4 processor object, pass both image and text, and then decode the structured answer.
The Hugging Face page for Gemma 4 E2B shows complete examples for image questions and step‑by‑step response parsing.
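A minimal sketch of that message format follows, assuming the chat-template conventions used by recent multimodal models in Transformers; the repo id and the processor calls shown in comments are assumptions, so check the Gemma 4 E2B model card for the canonical version:

```python
def build_messages(image_url: str, question: str) -> list[dict]:
    """Chat-format message list that mixes one image with a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},  # image part
                {"type": "text", "text": question},   # text part
            ],
        }
    ]

messages = build_messages("https://example.com/photo.jpg", "What is in this image?")

# With the model downloaded, the processor flow described above would look
# roughly like this (hypothetical repo id, patterned on the Gemma 3 API):
#   from transformers import AutoProcessor, AutoModelForImageTextToText
#   processor = AutoProcessor.from_pretrained("google/gemma-4-E2B")
#   model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-E2B")
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt")
#   out = model.generate(**inputs, max_new_tokens=128)
#   print(processor.decode(out[0], skip_special_tokens=True))
```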
LiteRT‑LM is useful when you need function calling, multiple tools, or low‑resource devices.
1. Write a `preset.py` file that defines Python functions as tools, such as `get_current_time` or `search_files`.
2. Pass it with `--preset=preset.py`; Gemma 4 will return `tool_call` events as JSON, the CLI will run your function, and then the model will finish the answer.

Google reports that LiteRT‑LM can process 4,000 input tokens across two skills in under three seconds on supported edge hardware, which gives a sense of real‑world response times.
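The tool functions themselves are ordinary Python. A minimal sketch of the two tools named above follows; the exact registration and preset format is defined by LiteRT‑LM's documentation and is not shown here:

```python
import datetime
import pathlib

def get_current_time() -> str:
    """Tool: return the current local time as an ISO-8601 string."""
    return datetime.datetime.now().isoformat(timespec="seconds")

def search_files(root: str, pattern: str) -> list[str]:
    """Tool: return paths under `root` whose names match a glob pattern."""
    return sorted(str(p) for p in pathlib.Path(root).rglob(pattern))

# Example: search_files(".", "*.py") lists Python files under the current dir.
```

When the model emits a `tool_call` event naming one of these functions, the runtime invokes it and feeds the return value back into the conversation.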
The table below summarizes key quality benchmarks from the Gemma 4 model card and related sources.
These numbers show that even the small E2B and E4B models reach useful accuracy for many tasks, while 26B and 31B reach near‑frontier scores in reasoning and coding.
For local hardware, one published data point is Gemma 4 E2B on a Raspberry Pi 5 using LiteRT‑LM.
These numbers show that useful local agents are possible even on small, low‑power boards when you choose the right model and runtime.
The quality benchmarks above come from Google’s official Gemma 4 model cards, NVIDIA’s partner cards, and community summaries. They measure knowledge and reasoning (MMLU Pro, Tau2), math (AIME 2026), coding (LiveCodeBench v6), and science QA (GPQA Diamond).
Each score represents accuracy on a held‑out test set under standard evaluation setups, such as few‑shot prompts or chain‑of‑thought where defined.
The Raspberry Pi 5 performance numbers come from an analysis of Google’s local agent stack with Gemma 4 and LiteRT‑LM. The test used Gemma 4 E2B, quantized for edge use, and measured both prefill and decode rates under a local agent workload with two tools.
NVIDIA’s published model card for Gemma 4 31B IT with NVFP4 quantization shows that accuracy remains close to the full‑precision baseline, which supports practical GPU deployment at lower cost.
This table compares Gemma 4 E4B with three popular open local models as of April 2026.
*VRAM values are approximate community numbers and depend on quantization and runtime.
The table shows that Gemma 4 E4B offers long context and multimodal support at a memory cost close to other 7B‑class models, with a more permissive license than Llama 3.1.
Gemma 4 itself has no license fees.
You pay for hardware, power, and any paid platform or cloud service you choose.
Public pricing today mostly covers older Gemma 3 models and competitor APIs, but it offers a useful baseline.
For many users, the main “cost” of Gemma 4 is buying a GPU or using existing hardware. The open license makes it easier to keep per‑request cost low compared to cloud‑only models, especially at scale.
Gemma 4 stands out because it ships not only as an open model family, but as a complete local agent stack across phones, PCs, and edge devices.
Google released open weights under Apache 2.0 along with Android AICore access, AI Edge Gallery “agent skills,” LiteRT‑LM runtimes, and day‑one support from NVIDIA and Hugging Face.
This tight integration means the gap between “announcement” and “working local setup” is small, compared with many earlier open models.
This quick chart focuses on the four Gemma 4 variants and how they fit local setups.
You can treat E2B as the entry point, E4B as the balanced local default, 26B A4B as the performance step‑up, and 31B as the quality‑first choice.
This example builds a basic coding assistant using Gemma 4 E4B on a laptop with at least 16 GB RAM and a mid‑range GPU or fast CPU.
1. Confirm the install with `ollama --version`.
2. Download the model with `ollama pull gemma4:e4b`.
3. Chat in the terminal with `ollama run gemma4:e4b`.
4. Run `ollama serve` to expose a local HTTP API.
5. Call `http://localhost:11434/api/chat` with `model: "gemma4:e4b"`.

On a mid‑range GPU, Gemma 4 E4B should respond within a few seconds while keeping all code on your machine.
For slower laptops, you can switch to E2B by using the gemma4:e2b tag, which reduces memory use and speeds up responses.
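The assistant loop can be sketched against the same local endpoint. The system prompt and helper names below are illustrative, and the sketch assumes `ollama serve` is running with the `gemma4:e4b` tag already pulled:

```python
import json
import urllib.request

API_URL = "http://localhost:11434/api/chat"  # exposed by `ollama serve`

def new_session(system_prompt: str) -> list[dict]:
    """Start a conversation with a system prompt for coding help."""
    return [{"role": "system", "content": system_prompt}]

def ask(history: list[dict], question: str, model: str = "gemma4:e4b") -> str:
    """Append a question, send the full history, and record the answer."""
    history.append({"role": "user", "content": question})
    body = json.dumps({"model": model, "messages": history, "stream": False})
    req = urllib.request.Request(
        API_URL, data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

# Usage with a running server:
#   session = new_session("You are a concise coding assistant.")
#   print(ask(session, "Write a Python function that reverses a string."))
```

Because the full `history` list is resent on every call, the model keeps context across turns without any server-side session state.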
Gemma 4 brings strong open models to devices that many people already own, from phones and Raspberry Pi boards to laptops and workstations.
Its Apache 2.0 license, long context windows, and strong reasoning and coding scores make it a solid base for local assistants and agents.
No, you can run E2B or E4B on CPUs with runtimes like Ollama or LiteRT‑LM, although responses will be slower.
A GPU improves speed, especially for 26B and 31B.
Yes, Gemma 4 uses the Apache 2.0 license, which allows commercial use, modification, and redistribution within its terms.
Cloud platforms that host Gemma models may still charge per token or per hour.
E2B and E4B can run in around 3–4 GB of GPU memory or comparable system memory with quantization, while larger models need far more.
Always check your runtime’s docs for exact requirements for your quantization and batch size.
Yes, once you download the weights and any needed runtime, Gemma 4 can run without an internet connection.
This is one of the main benefits of local deployment.
Gemma 4 offers competitive or better reasoning and coding scores at similar or smaller parameter counts, plus Apache 2.0 licensing and strong edge support.
Llama 3.1 and Qwen2.5 still offer strong alternatives, especially in existing ecosystems, but focus less on a unified local agent stack.