
Gemini 3.5 Live Translate: A Developer's Guide

Google's Gemini 3.5 Live Translate is a new audio model for continuous speech-to-speech translation in 70+ languages. Here's how it works, where it ships, and how to build with it.

Published 09 Jun 2026 • Updated 09 Jun 2026 • 7 min read

Quick answer. Gemini 3.5 Live Translate is Google's latest audio model for near real-time speech-to-speech translation across 70+ languages. It auto-detects the spoken language and generates continuous translated speech that preserves the speaker's intonation, pacing, and pitch, staying just a few seconds behind. Developers can build with it in public preview through the Gemini Live API and Google AI Studio; it also ships in Google Translate (Android/iOS) and is coming to Google Meet.

On June 9, 2026, Google announced Gemini 3.5 Live Translate, its latest audio model for live speech-to-speech translation. If you've ever wired up a real-time translation feature, you already know the hard part isn't the dictionary lookup — it's latency, turn-taking, and making synthesized speech sound like a human instead of a robocaller. This release targets exactly those problems, and it exposes a developer surface through the Gemini Live API. Here's a practical breakdown of what it does, where it runs, and how you'd actually build with it.

What Gemini 3.5 Live Translate Actually Does

At its core, the model takes streamed speech in one language and emits translated speech in another, across more than 70 languages. Two design decisions make it interesting for engineers.

First, language auto-detection. The model detects 70+ languages and handles multilingual input without requiring you to configure source/target settings up front. For a call app where you can't predict who speaks what, that removes a whole config step from your pipeline.

Second, prosody preservation. The output isn't flat TTS — it's generated to preserve the speaker's intonation, pacing, and pitch. That matters more than it sounds. A translation that keeps the rhythm of the original speaker reads as a conversation, not a captioning service reading aloud.

Continuous Generation vs. Turn-by-Turn

The headline architectural difference is streaming. Most translation systems are turn-by-turn: they wait for the speaker to finish a sentence, then translate the whole chunk. That's simple to build but it stacks latency — every pause becomes dead air for the listener.

Gemini 3.5 Live Translate instead processes speech as it streams and generates translated audio continuously. Google describes it as balancing a trade-off: wait a little for more context (better quality) versus translate immediately (stay in sync with the speaker). The result, per the announcement, is fluid audio without awkward pauses, staying just a few seconds behind the speaker for the whole session. The model is also built for noise robustness, so it's meant to hold up in loud, unpredictable environments rather than only quiet studio conditions.
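To make the latency difference concrete, here is a toy model of the two approaches. The 0.5-second processing time and the 2-second lag are illustrative assumptions (the announcement only says "a few seconds"), and the function names are hypothetical. It compares how long a listener waits, measured from the first word of each sentence to the start of its translation.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # second at which the speaker begins the sentence
    end: float    # second at which the speaker finishes it


def turn_by_turn_delays(sentences, processing=0.5):
    """Wait for the sentence to end, then translate the whole chunk."""
    # processing=0.5 is an assumed translation time, not a measured figure.
    return [(s.end + processing) - s.start for s in sentences]


def continuous_delays(sentences, lag=2.0):
    """Translate while the speaker talks, trailing by a fixed lag."""
    # lag=2.0 is an assumption standing in for "a few seconds".
    return [lag for _ in sentences]


talk = [Sentence(0.0, 4.0), Sentence(4.5, 12.5), Sentence(13.0, 16.0)]
print(turn_by_turn_delays(talk))  # [4.5, 8.5, 3.5]
print(continuous_delays(talk))    # [2.0, 2.0, 2.0]
```

In this simplified model the turn-by-turn delay grows with sentence length, while the continuous delay stays flat. Real systems vary their lag with context, so treat the flat 2.0 as a simplification.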

If you're a developer, the practical implication is that you stop thinking in discrete request/response turns and start thinking in audio streams — which is why the integration story leans on real-time media platforms (more on that below).

Where It Ships: Three Distribution Channels

Google is rolling the model out across three different surfaces, each with its own availability stage as of launch day:

Developers — public preview via the Gemini Live API and Google AI Studio.
Enterprises — private preview starting this month inside Google Meet, for select business Google Workspace customers.
Everyone — via the Google Translate app on Android and iOS.

That tiering tells you where Google expects the value: consumers get it free in Translate, enterprises get the meeting use case, and developers get the raw API to build whatever they want. If you're shipping a product, the API path is the one that matters.

Building With the Gemini Live API

The developer entry point is the Gemini Live API, which streams speech in and out so you get the continuous behavior described above rather than batch translation. Google's stated example use cases are live interpretation for multilingual calls, meetings, lessons, and broadcasts — anywhere two or more people don't share a language in real time.

Because the heavy lifting in any real-time voice app is the media transport (WebRTC, jitter buffers, echo cancellation, device handling), Google leans on partner platforms rather than asking you to build that layer. The announcement names Agora, Fishjam, LiveKit, Pipecat, and Vision Agents as platforms that integrate the Gemini Live API so you can deploy voice translation apps without owning the streaming infrastructure yourself. The example code lives in the Gemini Cookbook, which includes a demo of dubbing and simultaneous multi-language translation.

A reasonable build plan: pick one of those media SDKs for transport, wire the audio stream to the Live API, and let the model handle detection and synthesis. Your application code mostly becomes session management and UX — not signal processing. If you're evaluating how this sits next to your existing stack, our guide to AI coding agents and the broader Gemini 3.5 complete guide are useful companions.
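As a rough sketch of that build plan, the pipeline can be pictured as three concurrent stages joined by queues. Everything here is a stand-in: `microphone` and `speaker` represent whatever your media SDK provides, and `translate` is a hypothetical placeholder for a Live API session, not the real client. The point is the shape of the application code, not the API calls.

```python
import asyncio


async def microphone(frames, out_q):
    """Stand-in for the media SDK: pushes captured audio frames downstream."""
    for frame in frames:
        await out_q.put(frame)
    await out_q.put(None)  # end-of-stream sentinel


async def translate(in_q, out_q):
    """Hypothetical placeholder for the Live API session.

    A real version would forward each frame to the model and relay the
    translated audio it streams back; here we only tag the frame.
    """
    while (frame := await in_q.get()) is not None:
        await out_q.put(b"translated:" + frame)
    await out_q.put(None)


async def speaker(in_q, played):
    """Stand-in for playback: collects frames in arrival order."""
    while (frame := await in_q.get()) is not None:
        played.append(frame)


async def run_session(frames):
    captured, translated, played = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently, so audio flows as a stream
    # rather than as discrete request/response turns.
    await asyncio.gather(
        microphone(frames, captured),
        translate(captured, translated),
        speaker(translated, played),
    )
    return played


if __name__ == "__main__":
    print(asyncio.run(run_session([b"f1", b"f2"])))
```

Swapping the placeholder for a real session should leave `run_session` largely unchanged, which is the sense in which application code becomes session management rather than signal processing.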

Early Production Testing: Grab and Others

Real-world testing is the most credible signal in any model launch, and Google cites a concrete one. Ride-hailing platform Grab is testing the model to enable near real-time multilingual communication between drivers and travelers at pickups, a context where the two parties frequently don't share a language and the interaction is short and time-pressured. Google notes that Grab users make over 10 million voice calls per month, which gives a sense of the volume the model is being stress-tested against.

Beyond Grab, Google says companies including CJ ENM and LiveKit have shared positive feedback highlighting translation quality, accuracy, and low latency. As always with vendor-cited testimonials, treat these as directional rather than benchmarked — but a multilingual ride-pickup use case at that call volume is a meaningful proving ground.

Google Meet: From 5 Languages to 70+

The enterprise story is a big jump on paper. Speech translation in Google Meet will use Gemini 3.5 Live Translate, and the upgrade changes three things:

70+ languages, up from a previous limit of just five.
2000+ language combinations in a single meeting, where previously translation worked only to and from English.
An updated interface giving instant access to speech translation.

That second point is the structural one. Going from "English is always one side of every translation" to arbitrary pairwise combinations is what makes a genuinely multilingual meeting possible — three people in three languages, all hearing each other. It's launching in private preview for select business Workspace customers this month, with a broader rollout later in the year.
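A quick back-of-the-envelope check shows why the jump is structural. Google doesn't say how it counts combinations, so the numbers below are our own arithmetic, assuming exactly 70 languages (the lower bound of "70+").

```python
from math import comb

LANGUAGES = 70  # lower bound implied by "70+"; the exact count is an assumption

# English on one side of every translation: one pair per non-English language.
english_hub_pairs = LANGUAGES - 1

# Any two languages, ignoring direction (A<->B counted once).
unordered_pairs = comb(LANGUAGES, 2)

# Any two languages, counting each direction separately (A->B and B->A).
directed_pairs = LANGUAGES * (LANGUAGES - 1)

print(english_hub_pairs)  # 69
print(unordered_pairs)    # 2415
print(directed_pairs)     # 4830
```

The "2000+" figure is consistent with unordered pairs among 70 languages, though that reading is an inference rather than something the announcement states.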

In the Translate App: Headphones and Listening Mode

On the consumer side, the model rolls out globally in the Google Translate app on Android and iOS. In the Live translate feature you connect any pair of headphones and get translation that mirrors the speaker's tone across 70+ languages.

Android also gets a new listening mode: hold the phone to your ear like a regular call, and translated audio streams straight to the phone's earpiece — no headphones required. Google frames it for situations where you want to hear a translation discreetly and quickly, such as following a guided tour in another language. It's a small UX detail, but it's the kind of thing that decides whether a translation feature gets used in the wild or stays a demo.

Provenance: SynthID Watermarking

One detail developers building user-facing products should note: all audio generated by the model is watermarked with SynthID. The watermark is imperceptible — woven directly into the audio output — and is intended to keep AI-generated content detectable to help prevent misinformation. If you're shipping synthesized translated speech, this provenance signal travels with the audio whether or not you surface it. Google points to the model card for its full safety and responsibility approach.

Should You Build On It?

If your product has a real-time, cross-language voice moment — support calls, marketplaces, education, events, field operations — this is worth a prototype. The continuous-streaming design directly attacks the latency and dead-air problems that make most translation features feel clunky, and the partner-platform integrations mean you don't have to become a WebRTC expert to ship something. The honest caveats: it's a preview, so expect API surface and behavior to shift; quality and latency claims are currently vendor-stated rather than independently benchmarked; and any speech model needs testing against your actual accents, jargon, and noise conditions before you trust it in production.
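For the latency part of that validation, a minimal harness might summarize lag from paired timestamps. This sketch assumes you have already logged when each test utterance started playing and when its translation was first heard; how you capture those timestamps depends on your media stack, and the function name is hypothetical.

```python
from statistics import median, quantiles


def latency_report(sent_at, heard_at):
    """Summarize end-to-end lag (seconds) from paired timestamps.

    sent_at[i]  -- when test utterance i started playing
    heard_at[i] -- when its translated audio was first heard
    Needs at least two measurements.
    """
    lags = [heard - sent for sent, heard in zip(sent_at, heard_at)]
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the estimate within the observed range.
    p95 = quantiles(lags, n=20, method="inclusive")[-1]
    return {"p50": median(lags), "p95": p95, "max": max(lags)}


report = latency_report(
    sent_at=[0.0, 1.0, 2.0, 3.0, 4.0],
    heard_at=[2.0, 3.5, 4.0, 6.0, 6.5],
)
print(report["p50"], report["max"])  # 2.5 3.0
```

Running the same harness across accents, jargon-heavy scripts, and noisy recordings gives you your own numbers to set against the vendor-stated ones.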

For teams weighing whether to lean on a hosted model like this versus running translation infrastructure themselves, the trade-offs mirror the broader build-vs-host debate we cover in our self-hosting LLMs guide and the open-source LLM landscape. And if you're shipping this inside a mobile app, validating the audio experience across devices is its own discipline — see our mobile app testing guide.

FAQ

What is Gemini 3.5 Live Translate?

It's Google's latest audio model, announced June 9, 2026, for near real-time speech-to-speech translation. It auto-detects 70+ languages and generates continuous translated speech that preserves the speaker's intonation, pacing, and pitch.

How many languages does it support?

The model automatically detects and translates across more than 70 languages. In Google Meet specifically, it enables 70+ languages and 2000+ language combinations within a single meeting.

How is it different from turn-by-turn translation?

Turn-by-turn systems wait for the speaker to finish before translating, which adds latency. Gemini 3.5 Live Translate processes speech as it streams and generates translated audio continuously, staying just a few seconds behind the speaker without awkward pauses.

How do developers access it?

Developers can build with it in public preview through the Gemini Live API and Google AI Studio. Partner platforms including Agora, Fishjam, LiveKit, Pipecat, and Vision Agents integrate the Live API so you don't have to build the real-time media streaming layer yourself.

Where can regular users try it?

It's rolling out in the Google Translate app on Android and iOS. Connect headphones for the Live translate feature, or on Android use the new listening mode to hear translations through your phone's earpiece by holding it to your ear like a call.

Is the generated audio watermarked?

Yes. All audio generated by the model is watermarked with SynthID — an imperceptible watermark woven into the audio output — so AI-generated content remains detectable. Google documents its safety approach in the model card.

Is it production-ready?

It's in public preview for developers and private preview in Google Meet as of launch, so treat it as evolving. Partners like Grab (over 10 million voice calls per month) are testing it, but you should validate latency and quality against your own accents, noise conditions, and use case before relying on it in production.