AI Safety

Nemotron 3.5 Content Safety: A Developer's Guide to NVIDIA's Multimodal Guard Model

NVIDIA's Nemotron 3.5 Content Safety unifies multimodal input, 12-language coverage, custom policy enforcement, and auditable reasoning into one 4B guard model. Here's what it does and how to wire it into a production safety pipeline.

Published 07 Jun 2026 • Updated 07 Jun 2026 • 8 min read

Quick answer. Nemotron 3.5 Content Safety is NVIDIA's 4B-parameter open guard model, published June 4, 2026 and built on Google Gemma 3 4B IT. In a single inference call it evaluates a user prompt, an optional image, and an optional assistant response together, enforces custom natural-language policies, covers 12 explicitly-trained languages (with ~140-language zero-shot transfer), and can emit an auditable reasoning trace via an optional THINK mode. It runs on 8GB+ VRAM and ships under the NVIDIA Open Model License for research and commercial use.

If you ship anything that puts a language model in front of users, you eventually hit the same wall: you need a guardrail that classifies content as safe or unsafe before it reaches a model, a user, or a downstream service. The hard part isn't the binary decision — it's doing it across the languages your users actually speak, over images as well as text, against your policy rather than a generic one, and fast enough to run on every request without wrecking latency.

NVIDIA's Nemotron 3.5 Content Safety is the latest attempt to fold all of those into one compact model. This guide walks through what it does, how it's built, where it claims to perform well, and how you'd actually wire it into a production safety pipeline.

What Nemotron 3.5 Content Safety Actually Is

Nemotron 3.5 is a single 4B-parameter classifier that produces a safety verdict over combined multimodal input. According to NVIDIA, it consolidates four capabilities that previously lived in separate models or required separate fine-tuning runs: unified multimodal evaluation, multilingual reach, custom policy enforcement, and auditable reasoning — all reachable in one inference call.

It's the successor to Nemotron 3 Content Safety, released in March 2026, which was the first time NVIDIA combined multimodal and multilingual safety in a single 4B model. The 3.5 release adds two things that matter most to teams with real compliance requirements: the ability to reason over a custom policy you supply at inference time, and an optional reasoning trace that documents why a verdict was reached.

This sits squarely in the open-guard-model category alongside systems like LlamaGuard. If you're mapping out the broader field, our open-source LLM landscape guide covers where guard models fit relative to general-purpose open weights.

The Model Architecture

Nemotron 3.5 Content Safety is built on Google Gemma 3 4B IT — 4 billion parameters, a 128K context window, and the strong vision-language and multilingual coverage that the Gemma 3 base brings. NVIDIA fine-tunes that base with a LoRA adapter that installs the safety-classification behavior while keeping the footprint small enough to run in real time on 8GB+ VRAM GPUs.

That "build a specialist on top of a capable open base via LoRA" pattern is the same one most teams now use for domain adaptation. If you're considering training your own classifier on top of an open model, our fine-tuning LLMs guide covers the LoRA workflow end to end, and because the base here is Gemma 3, our Gemma guide is useful context on the family's capabilities.

The inference interface exposes three output modes:

Mode 1 — Low-latency binary verdict: returns just User Safety and Response Safety as safe/unsafe.
Mode 2 — Verdict with categories: adds the violated Safety Categories (e.g. Violence, Criminal Planning/Confessions).
Mode 3 — THINK mode: emits a step-by-step reasoning trace before the verdict and categories.

The taxonomy follows the Aegis 2.0 framework: 13 core categories aligned with the MLCommons safety taxonomy, plus 10 fine-grained subcategories. The practical benefit of that alignment is comparability — you can benchmark Nemotron 3.5 directly against other open and closed guard systems evaluated on Aegis-taxonomy datasets, instead of guessing across incompatible label sets.

Unified Multimodal Evaluation

The headline design decision in 3.5 is that it scores the combination of inputs, not each piece in isolation. The model takes a user prompt, an optional image, and an optional assistant response as a single context window and produces one coherent verdict over all three.

This closes a real gap in multimodal safety. Some policy violations only emerge from the interaction between a text request and an image, or between a request and the response it provokes — neither part is unsafe alone. Scoring each independently misses those cases; scoring them together in a single pass catches them. For any product that accepts image uploads alongside chat — a tutoring app, a marketplace, a support bot — that's the difference between a guardrail that works in production and one that only works on text benchmarks.

Global Language Coverage

Nemotron 3.5 maintains explicit training coverage across 12 languages: English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, and Italian. On top of that, it inherits strong zero-shot generalization across roughly 140 languages from the Gemma 3 base.

That second part is what makes it deployable globally. If your users speak Southeast Asian, Scandinavian, or less-resourced African languages where dedicated safety training data is sparse, the base model's multilingual transfer gives you a baseline without a separate fine-tuning run per market. NVIDIA reports the model averages 96.5% harmful-content classification accuracy on Multilingual Aegis and 88.8% on RTP-LX across the 12 trained languages — a combined 92.7% — which is the kind of consistency that lets you apply one safety posture across customer-, employee-, and partner-facing flows instead of bolting on regional moderators.

Custom Policy Enforcement — The Most Important Addition

This is the capability that separates 3.5 from a generic guard model. Production deployments almost never operate under one universal safety taxonomy. A healthcare platform, a financial-services chatbot, a developer-tools IDE, and a children's education app all have genuinely different risk profiles.

Nemotron 3.5 accepts a custom policy specification alongside the input and reasons over that policy when producing its verdict, rather than deferring entirely to the built-in taxonomy. Two patterns this unlocks:

Category suppression: stop a "violence" trigger firing when a DevOps tool legitimately handles the phrase "terminate a process." This kind of false positive is exactly what frustrates developers using AI inside their tooling — see our AI coding agents guide for where these guardrails sit in agent pipelines.
Custom category injection: define proprietary risk categories specific to your regulatory or product policy, expressed in natural language at inference time.

NVIDIA also ships a Claude- and Codex-compatible skill for generating custom policies, plus cookbooks, to lower the authoring burden.

THINK Mode and Auditable Reasoning

Every verdict can be accompanied by a reasoning trace via the optional THINK mode. When enabled, the model writes out its step-by-step logic — what the prompt asked, what the response provided, how the image factored in — before delivering the final safe/unsafe label and violated categories.

That documented justification serves three concrete purposes NVIDIA calls out: compliance and audit logging in regulated industries, human review (so a reviewer can see why a verdict landed and spot systematic errors), and policy iteration (the traces reveal how the model reads edge cases, so you can refine your policy language).

The obvious objection to reasoning is latency, and NVIDIA addresses it by condensing reasoning chains into concise summaries — most traces come in under three sentences. The training recipe for that is itself interesting: larger teacher models (Qwen 397B) generate the chain-of-thought, then a second large model (Qwen 80B) rewrites each trace to fit in no more than three sentences. Ground-truth labels are fed in during generation to keep misclassifications out of the traces. The result is a dual-mode operation: disable reasoning for minimal latency on generic checks, enable it for complex policies or audit pipelines.

Benchmarks and Latency

NVIDIA evaluated Nemotron 3.5 across a broad set of multilingual, multimodal, and custom-policy benchmarks — VLGuard, MM-SafetyBench, PolyGuard, RTP-LX, Aya Redteaming, XSafety, MultiJail, Aegis, Dynaguardrail, and CoSA. The reported headline is roughly 85% average accuracy across the evaluated set, while keeping the compact 4B footprint.

For context, the predecessor Nemotron 3 set the baseline at 84% average accuracy on multimodal harmful-content tests at roughly half the latency of LlamaGuard-4-12B. Nemotron 3.5 holds that efficiency while adding custom policy support and reasoning. On latency specifically, NVIDIA reports 3x lower end-to-end latency on a multimodal benchmark versus an alternative multimodal safety model, and up to 50% fewer tokens when reasoning is enabled compared with another reasoning safety model. The default (no-THINK) latency profile is unchanged from Nemotron 3.

A useful honesty note from NVIDIA: the benchmark gap in multimodal safety is real and unsolved. Most widely-cited safety benchmarks (WildGuard, XSTest, HarmBench) are text-only, and many multimodal sets lean on SDXL-generated images rather than real photographs — which understates how hard production content actually is. NVIDIA's answer on the training side is that 99% of its training images are real photographs, with synthetic data held to roughly 10% of total volume and used mainly to diversify jailbreak and rare-violation cases.

Deploying It: Self-Hosted or via an Inference Provider

Nemotron 3.5 Content Safety is on Hugging Face under the NVIDIA Open Model License for research and commercial use, and — notably for a safety model — NVIDIA released the training dataset alongside it, which most OSS safety models don't do.

You have two broad deployment paths:

Self-host. It supports transformers, vLLM, and SGLang, and is packaged as an NVIDIA NIM microservice on build.nvidia.com for a GPU-optimized, pre-built inference service. Because it fits on 8GB+ VRAM, self-hosting is realistic even on modest hardware. If you're running guard models alongside your main stack on your own boxes, our self-hosting LLMs guide covers the serving and capacity decisions.
Use a hosted endpoint. The model is available through Baseten, Eigen AI, DeepInfra, OpenRouter, and Vultr if you'd rather not manage GPUs.

Architecturally, the common pattern is to run Mode 1 (binary verdict) synchronously on every request in the hot path, and run THINK mode asynchronously as part of an audit pipeline — so real-time decisions stay fast while you still capture documented justifications for review.

Should You Use It?

If you operate a multilingual product, accept image input, or have domain-specific policies that a generic taxonomy can't express, Nemotron 3.5 is worth evaluating — it's one of the few open guard models that targets all three at once while staying small enough to run cheaply on every request. If you're English-only and text-only, a lighter binary classifier may be all you need, and the extra capabilities are overhead you won't use. As always, benchmark against your own traffic before trusting any reported accuracy number in production.

FAQ

What is Nemotron 3.5 Content Safety?

It's NVIDIA's open-weight content-safety guard model, published June 4, 2026. It classifies a user prompt, an optional image, and an optional assistant response together as safe or unsafe, enforces custom policies, and can output an auditable reasoning trace.

What model is it built on?

It's built on Google Gemma 3 4B IT (4 billion parameters, 128K context window) and fine-tuned with a LoRA adapter that adds the safety-classification behavior while keeping it compact.

What hardware does it need?

NVIDIA states it runs on GPUs with 8GB or more of VRAM, which keeps it viable for real-time deployment on relatively modest hardware.

Which languages does it cover?

Twelve languages are explicitly trained — English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, and Italian — plus zero-shot transfer across roughly 140 languages inherited from the Gemma 3 base.

What is THINK mode?

THINK mode is an optional output mode where the model writes a step-by-step reasoning trace before delivering its safe/unsafe verdict and violated categories. It's intended for audit logging, human review, and policy iteration, and can be disabled when low latency is the priority.

How do I access the model?

It's on Hugging Face under the NVIDIA Open Model License for research and commercial use. It supports transformers, vLLM, and SGLang, ships as an NVIDIA NIM microservice, and is also available through Baseten, Eigen AI, DeepInfra, OpenRouter, and Vultr.

Can it enforce my own custom safety policy?

Yes. You can supply a custom policy specification in natural language at inference time, and the model reasons over that policy — supporting both suppressing irrelevant built-in categories and injecting proprietary ones — instead of deferring entirely to its default taxonomy.

The Open-Source LLM Landscape (2026) — where guard models fit among open weights.
Self-Hosting LLMs: Complete Guide — serving and capacity planning for models like this.
Fine-Tuning LLMs: Complete Guide — the LoRA workflow used to build specialized classifiers.
The Gemma Guide — context on the base-model family Nemotron 3.5 is built on.
AI Coding Agents: Complete Guide — where safety guardrails sit in agent pipelines.