Google AI Edge Gallery in 2026: Install, Benchmarks, and Real On-Device Gemma 4

Last updated April 2026 — refreshed for current model/tool versions.

Google AI Edge Gallery is the official open-source app from Google for running large language models, vision models, and tool-calling agents fully on-device on Android and iPhone. As of April 2026 it ships Gemma 4 (E2B and E4B), Agent Skills, Thinking Mode, and Snapdragon NPU acceleration — and it is no longer "experimental" or "Android-only." This guide is the practical install + benchmark + troubleshooting walkthrough we wish we had when we first tried it.

What changed in 2026 (read this if you've used the app before)

  • iOS shipped. The iPhone app is live on the App Store (developer: Google LLC, requires iOS 17+). The earlier "coming soon" status is no longer accurate.
  • Gemma 4 replaces Gemma 3 / 3n. The 1.0.11 release (April 2026) added Gemma 4 E2B and E4B, both multimodal (text + image + audio) with a 128K context window.
  • Agent Skills + Thinking Mode. The app now supports modular tools (maps, Wikipedia search, hash, etc.) and visualises step-by-step reasoning for supported models.
  • Snapdragon NPU acceleration. Release 1.0.12 (April 24, 2026) added NPU-accelerated Gemma 3 1B for Snapdragon 8 Gen 2/3, 8 Elite, and 8 Elite Gen 5 SoCs. NPU prefill is roughly 25–30× the CPU baseline on the same device.
  • No more Hugging Face login. Gemma model downloads are streamlined since 1.0.10; you no longer need a Hugging Face token for the default catalog.
  • New supported models. Qwen2.5 1.5B Instruct, Phi-4 Mini Instruct, DeepSeek-R1-Distill-Qwen 1.5B, and FunctionGemma 270M (Mobile Actions, Tiny Garden) all run via the LiteRT-LM runtime.

TL;DR — Should you install it?

| If you are… | Install? | Why |
| --- | --- | --- |
| An Android dev evaluating on-device GenAI | Yes | Reference Kotlin implementation, in-app benchmarks, NPU support |
| An iOS dev exploring local LLMs | Yes | First official Google app for local Gemma on iPhone (Swift, open-source) |
| A privacy-focused power user | Yes | All inference is local, no telemetry of prompts/images |
| Looking for a polished consumer assistant | Maybe | It's a demo app — no persistent chats, ephemeral history, occasional crashes |
| On an older mid-range phone (<6 GB RAM) | Skip for now | Even the E2B model needs ~3.2 GB free; older SoCs fall back to slow CPU paths |

What the Gallery actually is

It's an open-source (Apache 2.0) showcase app from the google-ai-edge GitHub org that demonstrates the LiteRT-LM Kotlin/Swift APIs. You download model files (`.task` / `.litertlm`) once over Wi-Fi, then run them locally. There is no proxy server, no API key, no cloud — the only network call after model download is the optional update check.

If you're picking between local-AI tools, the Gallery is closer to a reference implementation than a daily-driver chat app. For comparison: the same Gemma 4 weights run on desktop via Ollama, llama.cpp, or LiteRT-LM directly (see our Run Gemma 4 on your PC and devices guide); on phones, the Gallery is currently the most fully-featured single app that exercises the whole on-device stack including NPU offload and tool-calling.
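To make "reference implementation" concrete, here is a minimal Kotlin sketch of the kind of call the Gallery wraps: loading a downloaded `.task` file and generating text locally. It uses the MediaPipe LLM Inference API that the Gallery's source has historically built on; the model path and prompt are illustrative, and newer LiteRT-LM releases may expose a different surface.

import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: run one prompt against a locally stored model file. The path is
// illustrative -- the Gallery manages its own app-private download directory.
fun generateLocally(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma-4-e2b.task")
        .setMaxTokens(512)
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    return try {
        llm.generateResponse(prompt)   // fully on-device; no network call here
    } finally {
        llm.close()
    }
}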

Feature surface (April 2026)

  • AI Chat — multi-turn conversation with Thinking Mode toggle for Gemma 4.
  • Ask Image — multimodal Q&A on a photo (camera or gallery) using Gemma 4 vision.
  • Audio Scribe — local transcription and translation; up to ~30s clips on Gemma 4 E2B/E4B.
  • Prompt Lab — 20+ templates (summarise, rewrite, code, format conversion) for single-turn LLM use.
  • Agent Skills — modular tool-calling: open maps, search Wikipedia, compute hashes, plus community-contributed skills loaded from /skills.
  • Mobile Actions — fine-tuned FunctionGemma 270M that toggles flashlight, opens apps, controls device functions.
  • Benchmark — in-app TTFT, prefill speed, decode speed, peak memory per model on your specific hardware.
  • Bring Your Own Model — load any LiteRT `.task` or `.litertlm` you've built/quantised yourself.

How to install on Android

You have three options. Pick by experience level.

Search for "Google AI Edge Gallery" or open play.google.com/store/apps/details?id=com.google.ai.edge.gallery. Requires Android 12+. The Play build automatically ships the right NPU runtime libraries for your SoC, so Snapdragon 8 Gen 2 / Gen 3 / Elite / Elite Gen 5 users get NPU acceleration with zero manual setup.

Option 2 — APK from GitHub Releases

Download the latest APK from github.com/google-ai-edge/gallery/releases. At the time of writing, the current release is 1.0.12 (April 24, 2026). Pick the APK that matches your SoC if you want NPU support:

| APK | Snapdragon SoC | Example devices |
| --- | --- | --- |
| ai-edge-gallery-sm8550.apk | 8 Gen 2 | Galaxy S23 series, OnePlus 11, Xiaomi 13 Pro |
| ai-edge-gallery-sm8650.apk | 8 Gen 3 | Galaxy S24 Ultra, OnePlus 12, Xiaomi 14 Pro |
| ai-edge-gallery-sm8750.apk | 8 Elite | Galaxy S25 Ultra, Xiaomi 15 Pro, OnePlus 13 |
| ai-edge-gallery-sm8850.apk | 8 Elite Gen 5 | Galaxy S26 series, OnePlus 15R, iQOO 15R |
| ai-edge-gallery.apk (universal) | any | CPU/GPU only, no NPU |

To sideload:

1. Download the APK to your phone.
2. Settings → Apps → Special access → Install unknown apps → allow your browser/file manager.
3. Tap the APK in your file manager.
4. Open the app and grant storage permission so model downloads can persist.
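If you already have adb set up, a one-liner replaces steps 2 and 3: `adb install ai-edge-gallery-sm8650.apk` (substitute the APK that matches your SoC from the table above).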

Option 3 — build from source

If you want to fork or instrument it:

git clone https://github.com/google-ai-edge/gallery.git
cd gallery/Android/src
# open in Android Studio Hedgehog or newer, sync Gradle, run on a physical device
# (the emulator's CPU paths are usable but you'll never benchmark realistically there)
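If you prefer the command line to Android Studio, the standard Android Gradle plugin tasks apply: `./gradlew installDebug` from the project directory builds and installs the debug variant on a connected device in one step.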

How to install on iPhone

The iOS build ("Google AI Edge Gallery" by Google LLC) is on the App Store as of February 2026 and was last updated April 20, 2026 (v1.0.3). Requirements:

  • iOS 17.0 or later.
  • ~68 MB app download, plus 2.5 GB+ per model.
  • iPhone with sufficient RAM — A15 Bionic and newer (iPhone 13 series and up) is the practical floor; A19 Pro is the fastest tested chip (see Performance section).

Install path:

  1. Open the App Store → search "Google AI Edge Gallery" or use the listing at apps.apple.com/us/app/google-ai-edge-gallery/id6749645337.
  2. Install. There is also a public TestFlight beta if you want pre-release builds: testflight.apple.com/join/nAtSQKTF.
  3. Open the app, accept the on-device-only privacy notice, then tap a model card and download.

The iPhone build is open-source Swift; the source lives in the same repo at iOS/. iPad and macOS are not supported (universal binary not yet shipped).


Models, sizes and what to download

The default catalog as of April 2026:

| Model | Params (effective) | Download | Min free RAM (Q4_0) | Best for |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | ~2B | ~2.54 GB | 3.2 GB | Default. Multimodal, fast, fits mid-range flagships. |
| Gemma 4 E4B | ~4B | ~3.61 GB | 5.0 GB | Higher quality. Flagship phones (Pixel 10 Pro XL, Galaxy S26 Ultra, iPhone 17 Pro+). |
| Gemma 3 1B (NPU) | 1B | ~530 MB | ~1.5 GB | Lowest-latency text on Snapdragon NPUs. |
| FunctionGemma 270M (Mobile Actions) | 270M | ~280 MB | ~1 GB | Tool-calling: device controls, app launching. |
| Qwen2.5 1.5B Instruct | 1.5B | ~1.0 GB | ~2 GB | Alternative chat baseline; multilingual. |
| Phi-4 Mini Instruct | 3.8B | ~2.4 GB | ~4 GB | Reasoning-heavy text tasks. |
| DeepSeek-R1-Distill-Qwen 1.5B | 1.5B | ~1.0 GB | ~2 GB | Chain-of-thought style outputs in a tiny footprint. |

If you want a deeper Gemma family comparison (Gemma 4 vs Gemma 3 vs Gemma 3n), we already wrote that up: Gemma 4 vs Gemma 3: what changed and should you switch?. For the larger Gemma 4 dense and MoE checkpoints (31B, 26B A4B), the desktop/server route is still your best bet — see Run Gemma 4 on your PC and devices locally.


Performance — real 2026 numbers

Independent on-device benchmarks of Gemma 4 E2B through AI Edge Gallery (Beebom, April 2026):

| Device | SoC | Backend | Time-to-first-token | Decode speed |
| --- | --- | --- | --- | --- |
| iPhone Air | Apple A19 Pro | GPU (Metal) | 0.10 s | 51.28 tok/s |
| iPhone Air | Apple A19 Pro | CPU | 0.38 s | 36.99 tok/s |
| Galaxy S26 Ultra | Snapdragon 8 Elite Gen 5 | GPU (Adreno) | 0.13 s | 48.55 tok/s |
| Vivo X300 Pro | MediaTek Dimensity 9500 | GPU | 0.29 s | 16.45 tok/s |
| Pixel 10 Pro Fold | Tensor G5 | CPU only | 1.65 s | 10.42 tok/s |

Headline observations:

  • Apple A19 Pro and Snapdragon 8 Elite Gen 5 GPU paths are within ~6% of each other on decode.
  • Tensor G5 is dramatically slower because, ironically, Google's own TPU lacks a public LiteRT plugin — the app falls back to CPU on Pixel. NNAPI was deprecated in Android 15, and TPU support hasn't filled the gap yet for third-party LiteRT inference.
  • NPU paths (Snapdragon 8 Elite, Dragonwing IQ8) crush both — Google's own measurements on Dragonwing IQ8 show 3,700 tok/s prefill and 31 tok/s decode for Gemma 4 E2B.
  • FunctionGemma 270M on Pixel 7 Pro hits 1,916 tok/s prefill and 142 tok/s decode (Google Developers Blog, Feb 2026) — i.e. tiny tool-calling models are basically instant on any modern phone.

The historical "Gemma 3 1B at 2,585 tok/s" claim from the original 2025 post referred to prefill on Pixel 8 Pro and is still roughly correct for that specific config, but it was never a decode number. Treat any "tokens per second" claim that doesn't specify prefill vs decode as marketing.

How to benchmark on your own device

Since 1.0.10, the Gallery has an in-app benchmark screen. From the Models Management screen, open the top-left menu and pick "Benchmark". It runs a fixed prompt at a fixed token budget and reports prefill tok/s, decode tok/s, peak RSS, and energy proxies. Treat it as the source of truth for your device — third-party numbers vary wildly with thermal state, battery level, and background load.


How to choose: a quick decision tree

  • Have an iPhone 13 or newer? → install on iPhone, use Gemma 4 E2B (download is <3 GB, decode ~50 tok/s on A19 Pro, ~30+ on A17 Pro and up).
  • Have a Snapdragon 8 Gen 2 / Gen 3 / Elite Android phone? → install via Play Store (NPU runtime is bundled), then try the Gemma 3 1B NPU variant for lowest latency, plus Gemma 4 E2B or E4B for higher quality.
  • Have a Pixel? → expect CPU-bound performance until Google ships a Tensor TPU plugin for LiteRT. Stick to Gemma 4 E2B.
  • Have a mid-range Dimensity / sub-flagship? → Gemma 4 E2B at GPU backend; skip E4B unless you have 8+ GB RAM.
  • Want tool-calling / device control? → install FunctionGemma 270M (Mobile Actions). Pair it with Agent Skills.
  • Just want to play with reasoning? → Gemma 4 E4B with Thinking Mode on, on a flagship phone.
  • Need a daily-driver chat app? → wait. The Gallery has no persistent chat history (Simon Willison flagged this in his April 2026 review). Use a third-party MLC Chat / PocketPal-style app on top of the same Gemma 4 weights if you need continuity.

Agent Skills and tool-calling

Agent Skills shipped in 1.0.11 (April 2, 2026). A Skill is a small JSON manifest that declares one or more tools the model can call: a tool has a name, a JSON-schema'd argument list, and a Kotlin/Swift handler that runs locally on the phone. Built-in Skills include display_map, search_wikipedia, hash, and a handful more.

If you want to write your own, the README at github.com/google-ai-edge/gallery/tree/main/skills walks through it. Authoring a custom Skill takes about 30 minutes if you've ever shipped a Kotlin Android module before. You annotate a function with @Tool, schema your params with @ToolParam, register a performAction handler, and rebuild. The same pattern works under the Function Calling Guide for direct LiteRT-LM integration in your own app, no Gallery dependency required.
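As a rough illustration of that flow, here is a sketch of a tiny custom Skill. The @Tool and @ToolParam names come from the README description above, but their exact signatures are assumptions (stand-in definitions are included so the sketch compiles on its own); check the skills directory for the real contract, including the performAction registration step, which is omitted here.

import java.security.MessageDigest

// Stand-ins so this sketch compiles standalone; the real annotation
// definitions live in the Gallery's skills module and may differ.
annotation class Tool(val name: String, val description: String)
annotation class ToolParam(val name: String, val description: String)

// A tiny custom Skill in the shape the README describes: one annotated
// function the model can target with a JSON-shaped tool call.
class ChecksumSkill {

    @Tool(name = "sha256", description = "Compute the SHA-256 hex digest of a UTF-8 string.")
    fun sha256(
        @ToolParam(name = "text", description = "String to hash") text: String
    ): String {
        val bytes = MessageDigest.getInstance("SHA-256").digest(text.toByteArray())
        return bytes.joinToString("") { "%02x".format(it) }
    }
}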

The model side: Gemma 4 has native function-calling baked in (one of the biggest improvements over Gemma 3). The Gallery exposes "Thinking Mode" to surface the JSON-shaped tool plan before execution.


Privacy and security: what's actually local?

  • Inference — 100% on-device. Prompts, images, audio never leave the phone.
  • Model download — comes from Hugging Face / Google's CDN over HTTPS. Once downloaded, models live in app-private storage; no further network calls during inference.
  • Update check — the app checks for new model manifests / app updates. You can airplane-mode after first run and the app keeps working.
  • Telemetry — the open-source repo is auditable. As of April 2026 there is no prompt/response telemetry; standard Play/App Store crash and install metrics apply.

This is the most defensible setup for handling sensitive content (legal docs, medical photos, internal code snippets) on a phone today. If you're a Codersera client whose engineering team is evaluating on-device GenAI for a privacy-bound product, this app is the cheapest way to demo what's actually possible — and we can staff the integration with mobile/ML engineers who've shipped LiteRT-LM in production.


Common pitfalls and troubleshooting

  • "Out of memory" or app kill on first generation. The OS is reclaiming RAM under pressure. Close other apps; restart; downgrade from E4B to E2B; for sideloaded APKs, try the universal build instead of the SoC-specific one (NPU runtimes are RAM-hungry).
  • Crash after Android 17 Beta updates on Pixel 6a / 7 / 7a. Tracked at issue #701. The MemoryLimiter + native library restrictions on recent betas kill the inference process. Workaround: use Gemma 4 E2B instead of E4B and switch to CPU backend, or stay on the Android 16 stable channel until the patch lands.
  • Decode tok/s "feels slow." Check the Benchmark screen. Thermal throttling under sustained load can drop decode speed 30–50%; let the device cool down or switch to a smaller model.
  • Pixel benchmark looks bad. Expected: Tensor G5 has no LiteRT TPU plugin. The app falls back to CPU.
  • Conversation history vanishes when you close the app. Known limitation; chats are ephemeral by design. Copy out anything you want to keep.
  • Image-and-follow-up prompt freezes the app. Reported in Simon Willison's April 2026 hands-on; restart the chat after the first multimodal turn until the next patch.
  • Hugging Face download asks for a token. Custom models still require a token; the default Gemma catalog does not (since 1.0.10). Set the token in app settings if you want to load community LiteRT models.
  • "Why is my iPhone faster than my Pixel?" See above. It's not the silicon — it's the missing TPU plugin.

Cloud AI vs on-device AI (when to pick which)

| Factor | Cloud (Gemini, Claude, GPT-5.5) | On-device (AI Edge Gallery / Gemma 4) |
| --- | --- | --- |
| Internet required | Yes | No (after model download) |
| Cost per token | $0.05–$15 / 1M tokens | $0 (electricity) |
| Latency | Network RTT + queue + decode | 0.1–1.7 s TTFT, 10–50 tok/s decode on flagship |
| Quality (general reasoning) | Frontier (200B+ effective params) | Good for ~4B class; loses on hard reasoning |
| Multimodal | Mature (video, long audio) | Image + short audio + text on E2B/E4B |
| Privacy | Vendor-dependent | Maximum (data never leaves device) |
| Custom models | Limited / fine-tuning fees | Bring any LiteRT .task / .litertlm file |
| Battery | Negligible client-side | ~5–10% per hour of heavy decode on flagship |

The honest split as of April 2026: cloud wins on raw capability and long-context multimodal; on-device wins on privacy, latency, cost, and offline. A useful pattern is hybrid — route trivial / sensitive prompts to Gemma 4 locally, and escalate hard ones to a cloud frontier model. If you're building this for a customer, that routing logic is exactly the kind of thing our vetted mobile + ML engineers ship.
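A minimal Kotlin sketch of what that router could look like, assuming a local PII check and a simple length heuristic; every field name and threshold here is hypothetical, not something the Gallery ships:

// Hypothetical local-vs-cloud router. All fields and thresholds are illustrative.
enum class Route { ON_DEVICE, CLOUD }

data class PromptMeta(
    val text: String,
    val containsSensitiveData: Boolean, // e.g. flagged by a local PII detector
    val needsLongContext: Boolean,      // beyond the 128K on-device window
    val isOffline: Boolean,
)

fun route(p: PromptMeta): Route = when {
    p.isOffline -> Route.ON_DEVICE              // no network, no choice
    p.containsSensitiveData -> Route.ON_DEVICE  // privacy-bound prompts stay local
    p.needsLongContext -> Route.CLOUD           // escalate what the phone can't hold
    p.text.length < 2_000 -> Route.ON_DEVICE    // short tasks: local is faster and free
    else -> Route.CLOUD                         // hard or long tasks go to the frontier model
}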


What was removed and why

  • "iOS coming soon" — the iOS app shipped (App Store, Feb 2026). Kept the install instructions instead.
  • "Experimental alpha" framing — the project is now at 1.0.12 with 22K+ GitHub stars, an Apache 2.0 license, and active weekly issue triage. Still labelled "Beta" by Google, but no longer alpha-quality.
  • Gemma 3 / Gemma 3n as the headline model — superseded by Gemma 4 E2B/E4B (same effective sizes, better benchmarks, native function-calling, native audio). Gemma 3n is still selectable for backwards-compat experiments.
  • "Hugging Face login required" — no longer true for the default Gemma catalog since 1.0.10.

FAQ

Is it available on iPhone?

Yes. It shipped on the App Store in February 2026 (developer: Google LLC), latest build 1.0.3 (April 20, 2026), iOS 17+. There is also a public TestFlight beta.

Does it run on iPad or Mac?

No. The current iOS build is iPhone-only. iPad and Mac are not supported as of April 2026 — there is no universal Apple-silicon binary.

Can I run Gemma 4 E4B on a Pixel 8 / 9?

Technically yes if you have ~5 GB of free RAM, but you'll be on the CPU backend (no LiteRT TPU plugin for Tensor G3/G4/G5 yet), so decode will be ~10 tok/s. E2B is the better default on Pixel until that lands.

How big are the models?

Gemma 4 E2B is ~2.54 GB. E4B is ~3.61 GB. The Gemma 3 1B NPU variant is ~530 MB. FunctionGemma 270M is ~280 MB. Plan for at least 6–7 GB of free storage if you want both Gemma 4 sizes side by side (the two downloads alone total ~6.2 GB).

Does the app phone home with my prompts?

No. Inference is 100% local. The repo is open-source and auditable; standard Play/App Store crash metrics apply, but prompt content does not leave the device.

Can I bring my own quantised model?

Yes. The "Bring Your Own Model" flow accepts LiteRT .task and .litertlm files. Quantise with the LiteRT-LM converter (Q4_0 is the typical mobile target). Common pairings: a custom-fine-tuned Gemma 4 E2B for a domain task, a distilled Qwen 3.5 1.5B, or a tool-calling FunctionGemma derivative.

How does it compare to MLC Chat / PocketPal / Llamao?

The Gallery is Google's reference implementation — it has the deepest LiteRT-LM integration, official NPU support on Snapdragon, and Gemma 4 / FunctionGemma first. Third-party apps like PocketPal and MLC Chat tend to support a broader model zoo (Llama 4, Qwen 3.5/3.6, Mistral 3) and ship persistent chat history. Use the Gallery for the Google-stack story; use the others for breadth and UX.

What's the next major release likely to bring?

Watch the GitHub roadmap, but plausible near-term work: TPU plugin for Tensor G5, persistent chat history, iPad support, and Gemma 4 E4B with NPU acceleration. Don't quote me on dates — the project is open-source and ships when it ships.



References & further reading

  1. google-ai-edge/gallery — official GitHub repo, Apache 2.0 source.
  2. Gallery releases — 1.0.12 (April 24, 2026), 1.0.11 (April 2, 2026), full changelog.
  3. Bring agentic skills to the edge with Gemma 4 — Google Developers Blog, April 2026.
  4. On-device function calling in Google AI Edge Gallery — Google Developers Blog, Feb 2026 (FunctionGemma benchmarks).
  5. Gemma 4 model overview — Google AI for Developers (param counts, context windows, memory tables).
  6. App Store listing — iOS 17+, latest 1.0.3 (April 2026).
  7. Google Play listing — 1M+ installs, Android 12+.
  8. Simon Willison — Google AI Edge Gallery (April 6, 2026) — independent hands-on with E2B on iPhone.
  9. Beebom — I tested on-device AI on Android and iPhone — A19 Pro / Snapdragon 8 Elite Gen 5 / Dimensity 9500 / Tensor G5 benchmark numbers.
  10. Android Authority — Local Gemma 4 hits the Play Store, April 2026.