"Sanusflow" 2026

Last updated April 2026 — refreshed for current model/tool versions.

JanusFlow 1.3B is DeepSeek's open-source unified multimodal model that handles both image understanding and text-to-image generation in a single 1.3B-parameter architecture. Unlike earlier discrete-token diffusion hybrids, it combines an autoregressive language model with rectified flow, producing 384×384 images with a better MJHQ FID-30k score than SDXL (9.51 vs. roughly 12.7) despite being a fraction of the size. This guide covers the complete Linux installation from system prerequisites through verified image generation output, including practical troubleshooting drawn from real deployment issues.

What changed since the original 2025 post

Python requirement clarified: The official repo now specifies Python ≥ 3.8 (not 3.10 as originally stated). Python 3.10 and 3.11 both work well in practice; 3.12 has minor dependency friction with older diffusers builds.
Model download method: The old guide used a direct wget for a single .bin file, which is incorrect. JanusFlow requires downloading the full model repository (config, tokenizer, weights) via huggingface_hub or git lfs.
Correct import path: Code must import from janus.janusflow.models, not generic diffusers.StableDiffusionPipeline. The original test script in the old post was entirely wrong for this model.
bfloat16 required: The SDXL-VAE decoder does not work correctly with float16. Always load with torch.bfloat16.
Janus-Pro released January 2025: The Janus family now includes Janus-Pro-1B and Janus-Pro-7B, which outperform JanusFlow on most generation benchmarks. See the decision section below for which model to choose.
CVPR 2025: The JanusFlow paper was accepted at CVPR 2025, adding peer-reviewed benchmark credibility.

TL;DR — Quick Reference

Topic	Answer
Model size (disk)	~5.2 GB (weights + tokenizer)
Minimum GPU VRAM	8 GB (for inference at default batch size)
Recommended GPU	RTX 3080 / A10 or better
Python version	3.8+ (3.10 or 3.11 recommended)
CUDA version	CUDA 12.1+ (12.4 recommended)
Output resolution	384×384 pixels
License	MIT (code) + DeepSeek Model License (weights, commercial use allowed)
Hugging Face card	deepseek-ai/JanusFlow-1.3B

What Is JanusFlow 1.3B?

JanusFlow 1.3B, released by DeepSeek on November 12, 2024, is a unified multimodal model built on DeepSeek-LLM-1.3B-base. Its key design choice is decoupled visual encoding: one visual pathway (SigLIP-L) handles image understanding at 384×384 input resolution, while a separate pathway (ShallowUViT encoder/decoder + SDXL-VAE) handles image generation via rectified flow ODE solving.

This decoupling lets the model excel at both tasks without each pathway interfering with the other — a known failure mode in earlier unified architectures. The generation pathway adds only ~70M parameters on top of the ~300M vision encoder, keeping the full model compact.

The paper was accepted at CVPR 2025 (arXiv:2411.07975), making it one of the few open-source unified multimodal models with peer-reviewed benchmark validation.

Janus Model Family — Which One Should You Use?

Before installing, pick the right variant. DeepSeek's Janus family as of April 2026:

Model	Params	Generation method	GenEval	Best for
JanusFlow-1.3B	1.3B	Rectified flow (ODE)	0.63	Research, low VRAM, understanding + generation
Janus-1.3B	1.3B	Next-token prediction	~0.58	Understanding-heavy workloads
Janus-Pro-1B	1.5B	Next-token prediction	~0.75	Better generation quality, similar VRAM
Janus-Pro-7B	7B	Next-token prediction	0.80	Best quality, outperforms DALL-E 3; needs ~24 GB VRAM

Decision guide:

If you have ≥24 GB VRAM and want the best image quality → use Janus-Pro-7B.
If you have 8–16 GB VRAM and want a good balance → use Janus-Pro-1B.
If you are specifically studying rectified flow architectures or the CVPR 2025 paper → use JanusFlow-1.3B (this guide).
If you want both strong understanding and generation at low VRAM → JanusFlow-1.3B remains a solid choice.
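
The decision guide above can be written down as a small helper. This is only a sketch: the function name pick_janus_model and its parameters are hypothetical, the thresholds are copied from the bullets rather than measured, and treating the unlisted 16–24 GB range like the 8–16 GB range is an assumption.

```python
def pick_janus_model(vram_gb: float, studying_rectified_flow: bool = False) -> str:
    """Map the decision guide onto a model name (illustrative helper, not an official API)."""
    if studying_rectified_flow:
        # Rectified-flow research is the case this guide is written for.
        return "JanusFlow-1.3B"
    if vram_gb >= 24:
        return "Janus-Pro-7B"
    if vram_gb >= 8:
        # Assumption: 16-24 GB is handled the same way as the 8-16 GB bullet.
        return "Janus-Pro-1B"
    return "none (below the 8 GB minimum in this guide)"


print(pick_janus_model(12))
print(pick_janus_model(12, studying_rectified_flow=True))
```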

If you are building a broader local AI setup, the OpenClaw + Ollama setup guide for running local AI agents covers how to orchestrate multiple models including multimodal ones with a unified agent layer.

System Requirements

Hardware

GPU: NVIDIA GPU with at least 8 GB VRAM (the model runs in bfloat16). An RTX 3080, RTX 3090, RTX 4070, or A10 is comfortable. CPU-only inference is possible but extremely slow (~5–10 minutes per image).
RAM: 16 GB system RAM minimum; 32 GB recommended.
Disk: ~8 GB free for model weights, virtualenv, and generated outputs.

Software

Linux distro: Ubuntu 20.04+, Debian 11+, Fedora 38+. Any distro with kernel ≥5.15 works.
Python: 3.8 or higher. Python 3.10 and 3.11 are the most tested; 3.12 may require pip install setuptools first to resolve distutils warnings.
CUDA: 12.1 or higher. CUDA 12.4 is the recommended version as of April 2026 (matches PyTorch 2.3+ binaries). Verify with nvcc --version.
Git + Git LFS: Required for cloning the repo and optionally downloading model files via LFS.
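
A quick preflight script can check the two requirements that are easy to verify before installing anything: the Python version and free disk space. The sketch below uses only the standard library; the function names are hypothetical, and the thresholds are the ones listed above.

```python
import shutil
import sys


def python_ok(version=None, minimum=(3, 8)):
    """Return True when the interpreter meets the minimum version from this guide."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


print("Python OK:", python_ok())
print("Free disk: %.1f GiB (this guide suggests about 8 GB)" % free_disk_gb())
```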

Step-by-Step Installation on Linux

Step 1: Install System Dependencies

Update your package list and install Python, pip, and git:

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip python3-venv git git-lfs

Verify your Python version:

python3 --version
# Should output Python 3.10.x, 3.11.x, or similar (3.8+ required)

Initialize git-lfs (needed if you choose the git clone method for model weights):

git lfs install

Step 2: Clone the Janus Repository

git clone https://github.com/deepseek-ai/Janus.git
cd Janus

Step 3: Create and Activate a Virtual Environment

Always use a dedicated virtual environment to isolate JanusFlow dependencies from your system Python:

python3 -m venv venv
source venv/bin/activate

Upgrade pip first to avoid resolution issues with newer packages:

pip install --upgrade pip setuptools wheel

Step 4: Install PyTorch with CUDA Support

Install PyTorch matching your CUDA version before installing the Janus package. For CUDA 12.1+:

# For CUDA 12.1 (adjust cu121 → cu124 for CUDA 12.4)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Verify GPU access:

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.version.cuda)"
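
For a slightly more informative check than the one-liner, the sketch below collects the GPU name, total VRAM, and whether PyTorch reports bfloat16 support. The function name describe_gpu is hypothetical; it returns a reduced dictionary when PyTorch is not installed or no CUDA device is visible, so it is safe to run at any point in the setup.

```python
def describe_gpu():
    """Collect basic GPU facts; degrades gracefully when torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    info = {"torch_installed": True, "cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)  # first visible GPU
        info["name"] = props.name
        info["vram_gb"] = round(props.total_memory / 1024**3, 1)
        info["bf16_supported"] = torch.cuda.is_bf16_supported()
    return info


print(describe_gpu())
```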

Step 5: Install JanusFlow Dependencies

From within the cloned Janus/ directory:

pip install -e .
pip install "diffusers[torch]"

If you want the interactive Gradio demo UI:

pip install -e ".[gradio]"

Step 6: Download the JanusFlow-1.3B Model

The correct way to download the model is to pull the entire repository from Hugging Face, not a single weight file. The model requires config files, tokenizer vocabulary, and multiple sharded weight files.

Method A — Python (recommended, uses local cache):

from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="deepseek-ai/JanusFlow-1.3B")
print(f"Model downloaded to: {model_path}")

Method B — Hugging Face CLI:

pip install "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/JanusFlow-1.3B --local-dir ./model/JanusFlow-1.3B

The download is approximately 5.2 GB. Models are stored in BF16 (bfloat16) format.
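
After the download finishes, it can be worth confirming that the local directory really contains a config file and weight files, since an incomplete download is easy to miss. The helper below is a sketch: the name summarize_snapshot is hypothetical, and the assumption that weights end in .safetensors or .bin should be adjusted if the repository layout differs.

```python
from pathlib import Path


def summarize_snapshot(model_dir):
    """Report total size and whether config/weight files are present in a local snapshot."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    # Assumption: weights ship as .safetensors or .bin files.
    weight_files = [p.name for p in files if p.suffix in {".safetensors", ".bin"}]
    return {
        "has_config": (root / "config.json").is_file(),
        "weight_files": sorted(weight_files),
        "total_gb": total_bytes / 1e9,
    }


# Example call, using the --local-dir path from Method B:
# print(summarize_snapshot("./model/JanusFlow-1.3B"))
```

If total_gb comes out far below the expected download size, re-running the download command is a reasonable first step.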

Step 7: Run the Interactive Gradio Demo

The quickest way to verify your installation is to launch the bundled demo:

python demo/app_janusflow.py

This starts a local web UI at http://127.0.0.1:7860 where you can test both image generation and image understanding without writing any code.
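
On a headless server there is no browser to confirm the demo is up. A minimal sketch of a reachability check, using only the standard library (the function name port_open is hypothetical; the host and port are the defaults quoted above):

```python
import socket


def port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("Gradio demo reachable:", port_open())
```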

Using JanusFlow Programmatically

Image Understanding (Vision-Language QA)

import torch
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/JanusFlow-1.3B"

vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt = MultiModalityCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe what you see in this image.",
        "images": ["path/to/your/image.jpg"],
    },
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)

Text-to-Image Generation

import os
import torch
import torchvision
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from diffusers.models import AutoencoderKL

model_path = "deepseek-ai/JanusFlow-1.3B"

vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt = MultiModalityCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# SDXL-VAE is required for decoding; must use bfloat16
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae = vae.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "A misty mountain landscape at sunrise, photorealistic, ultra detailed",
    },
    {"role": "Assistant", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_gen_tag

@torch.inference_mode()
def generate(mmgpt, vl_chat_processor, prompt, cfg_weight=5.0,
             num_inference_steps=30, batchsize=1):
    input_ids = torch.LongTensor(
        vl_chat_processor.tokenizer.encode(prompt)
    )
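    # Rows [0, batchsize) keep the prompt (conditional branch); rows
    # [batchsize, 2 * batchsize) are padded out to form the unconditional
    # branch used by classifier-free guidance.
    tokens = torch.stack([input_ids] * 2 * batchsize).cuda()
    tokens[batchsize:, 1:] = vl_chat_processor.pad_id
    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
    # Drop the trailing image-generation tag; the time embedding takes its place
    inputs_embeds = inputs_embeds[:, :-1, :]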

    z = torch.randn((batchsize, 4, 48, 48), dtype=torch.bfloat16).cuda()
    dt = torch.zeros_like(z) + 1.0 / num_inference_steps

    attention_mask = torch.ones(
        (2 * batchsize, inputs_embeds.shape[1] + 577)
    ).to(mmgpt.device)
    attention_mask[batchsize:, 1:inputs_embeds.shape[1]] = 0
    attention_mask = attention_mask.int()

    # The full sequence (prompt + time embedding + latent tokens) is re-fed on
    # every step, so no KV cache is carried over; reusing one would grow the
    # sequence past the fixed-length attention mask built above.
    for step in range(num_inference_steps):
        # Duplicate the latent: first half conditional, second half unconditional
        z_input = torch.cat([z, z], dim=0)
        t = torch.tensor(
            [step / num_inference_steps * 1000.0] * z_input.shape[0]
        ).to(dt)
        z_enc = mmgpt.vision_gen_enc_model(z_input, t)
        z_emb = z_enc[0].view(z_enc[0].shape[0], z_enc[0].shape[1], -1).permute(0, 2, 1)
        z_emb = mmgpt.vision_gen_enc_aligner(z_emb)
        t_emb, hs = z_enc[1], z_enc[2]
        llm_emb = torch.cat([inputs_embeds, t_emb.unsqueeze(1), z_emb], dim=1)

        outputs = mmgpt.language_model.model(
            inputs_embeds=llm_emb,
            use_cache=False,
            attention_mask=attention_mask,
            past_key_values=None,
        )
        hidden_states = outputs.last_hidden_state

        hidden_states = mmgpt.vision_gen_dec_aligner(
            mmgpt.vision_gen_dec_aligner_norm(hidden_states[:, -576:, :])
        )
        hidden_states = hidden_states.reshape(
            z_emb.shape[0], 24, 24, 768
        ).permute(0, 3, 1, 2)
        v = mmgpt.vision_gen_dec_model(hidden_states, hs, t_emb)
        v_cond, v_uncond = torch.chunk(v, 2)
        v = cfg_weight * v_cond - (cfg_weight - 1.0) * v_uncond
        z = z + dt * v

    decoded = vae.decode(z / vae.config.scaling_factor).sample
    os.makedirs("generated_samples", exist_ok=True)
    torchvision.utils.save_image(
        decoded.clip_(-1.0, 1.0) * 0.5 + 0.5,
        "generated_samples/output.jpg"
    )
    print("Image saved to generated_samples/output.jpg")

generate(vl_gpt, vl_chat_processor, prompt, cfg_weight=5.0,
         num_inference_steps=30, batchsize=1)

Note on batch size: The original paper used batchsize=5 for demo purposes, which requires ~40 GB VRAM. For a single 8–16 GB consumer GPU, use batchsize=1. The output quality is identical per image.
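
The loop inside generate() is a fixed-step Euler solve of the rectified-flow ODE, with classifier-free guidance mixing a conditional and an unconditional velocity. The toy below shows the same two rules on plain floats so they can be run without a GPU. It is a sketch only: toy_velocity is a hand-written stand-in for the network, and all names here are hypothetical.

```python
def guided_velocity(v_cond, v_uncond, cfg_weight):
    """Same mixing rule as in generate(): w * v_cond - (w - 1) * v_uncond."""
    return cfg_weight * v_cond - (cfg_weight - 1.0) * v_uncond


def euler_flow(z0, velocity_fn, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = z0
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        z = z + dt * velocity_fn(z, t)  # same update as z = z + dt * v
    return z


TARGET = 2.0


def toy_velocity(z, t):
    """Straight-line flow toward TARGET: along z_t = (1 - t) * z0 + t * TARGET,
    the velocity equals (TARGET - z) / (1 - t)."""
    return (TARGET - z) / (1.0 - t)


print(euler_flow(-1.0, toy_velocity, num_steps=30))   # lands close to TARGET
print(guided_velocity(1.0, 0.5, cfg_weight=1.0))      # weight 1.0 returns v_cond
```

Because the toy path is a straight line, Euler reaches the target with any step count; a learned velocity field is only approximately straight, which is why the step count still matters for the real model.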

ComfyUI Integration

If you prefer a node-based visual workflow, the community-maintained ComfyUI-Janus-Pro extension supports the Janus model family including JanusFlow-1.3B. Install via ComfyUI Manager by searching "Janus-Pro". Three nodes are available: JanusModelLoader, JanusImageGeneration, and JanusImageUnderstanding.

Performance Benchmarks

The following numbers are from the JanusFlow paper (CVPR 2025, arXiv:2411.07975) and represent standardized evaluations — not marketing claims.

Image Generation

Model	Params	MJHQ FID-30k ↓	GenEval ↑	DPG-Bench ↑
JanusFlow-1.3B	1.3B	9.51	0.63	80.09%
Janus-1.3B (baseline)	1.3B	15.18	~0.58	—
SDXL (generation-only)	~6.6B	~12.7	~0.55	—
DALL-E 2	—	—	~0.52	—
Janus-Pro-7B (2025)	7B	—	0.80	84.19%

Lower FID is better; higher GenEval/DPG-Bench is better. JanusFlow-1.3B achieves best-in-class results for its parameter count: it surpasses SDXL (a generation-only specialist with 5× the parameters) on both metrics.

Image Understanding

Benchmark	JanusFlow-1.3B score
MMBench	74.9
SeedBench	70.5
GQA	60.3
VQAv2	79.8
POPE	88.0
TextVQA	55.5

These understanding scores are competitive with dedicated vision-language models of similar size.

Optimizing Performance

Use bfloat16, not float16: The SDXL-VAE decoder fails silently with fp16 precision, producing corrupted images. Always call .to(torch.bfloat16) on both the main model and the VAE.
Reduce batch size for low VRAM: Set batchsize=1 if you have less than 16 GB VRAM. Memory scales roughly linearly with batch size.
Classifier-free guidance weight: cfg_weight=5.0 is the default. Lower values (2.0–3.0) produce more creative but less prompt-faithful outputs. Higher values (7.0–10.0) enforce the prompt more strictly but can oversaturate colors.
Inference steps: 30 steps is standard. Reducing to 15–20 steps is faster with modest quality loss. The model uses ODE integration so it handles fewer steps more gracefully than DDPM-based models.
Monitor VRAM: Use nvidia-smi dmon -s mu -d 1 during generation to watch memory usage in real time.
xFormers: Install with pip install xformers for faster attention on compatible GPUs (Ampere and later: RTX 3000+, A100, etc.).
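
To log VRAM from a script instead of watching a terminal, nvidia-smi can emit machine-readable CSV. The sketch below separates parsing from the subprocess call so the parser can be tried without a GPU. The function names are hypothetical; query_gpu_memory needs an NVIDIA driver and nvidia-smi on the PATH.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB rows (one per GPU) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU (used_mib, total_mib) tuples."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


# Parsing a made-up sample line, as nvidia-smi would print it for one GPU:
print(parse_memory_csv("5120, 10240\n"))
```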

Common Pitfalls and Troubleshooting

Wrong import — `ModuleNotFoundError: No module named 'janus'`

You must run pip install -e . from inside the cloned Janus/ directory with your virtual environment active. The package is not published to PyPI — it installs in editable mode from the local source. Confirm with pip show janus.
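
A common cause of this error is running Python from outside the virtual environment. The short sketch below prints which interpreter is active and whether it can see the janus package; module_available is a hypothetical helper built on the standard library.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


print("interpreter:", sys.executable)  # should point inside venv/ when activated
print("janus importable:", module_available("janus"))
```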

CUDA out of memory (OOM)

Reduce batchsize to 1.
Close other GPU processes: run nvidia-smi to find their PIDs, then stop them with kill <PID>.
Enable gradient checkpointing if fine-tuning (inference doesn't need it).
If you still OOM at batchsize=1, the model may not fit your GPU. Try moving the VAE to CPU: vae = vae.cpu() and call vae.decode(z.cpu() / vae.config.scaling_factor) — slower but reduces GPU pressure.

CUDA errors — `nvcc --version` not found

PyTorch bundles its own CUDA runtime, so nvcc (the compiler) is not strictly required for inference. If CUDA is not detected at all, verify the NVIDIA driver is loaded: nvidia-smi should return GPU information. If it doesn't, reinstall NVIDIA drivers: sudo ubuntu-drivers install on Ubuntu, then reboot.

Corrupted or black images from the VAE

This almost always means float16 was used instead of bfloat16. Check that both vl_gpt and vae are converted with .to(torch.bfloat16). Note: native bfloat16 support on consumer hardware starts with Ampere GPUs (RTX 3000 series). Turing GPUs (RTX 2000 series) have no native bfloat16 hardware support, so behavior there depends on the PyTorch version; if bfloat16 is slow or fails, load both models in float32 instead, which works but uses roughly twice the VRAM.
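
One way to make the dtype choice explicit is to ask PyTorch what the GPU supports and never select float16. This is a sketch with hypothetical function names; it falls back to float32 whenever bfloat16 is not reported as supported, which is a conservative assumption rather than a requirement of the model.

```python
def choose_dtype_name(cuda_available, bf16_supported):
    """Pick a dtype name: bfloat16 when the GPU reports support, otherwise float32."""
    if cuda_available and bf16_supported:
        return "bfloat16"
    # float16 is deliberately never chosen, since this guide reports VAE corruption with it.
    return "float32"


def detect_dtype_name():
    """Query torch for CUDA/bf16 support; returns float32 if torch is missing."""
    try:
        import torch
    except ImportError:
        return "float32"
    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return choose_dtype_name(cuda, bf16)


print(detect_dtype_name())
```

With PyTorch installed, getattr(torch, detect_dtype_name()) gives the dtype object to pass to .to(...) for both vl_gpt and vae.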

Dependency conflicts with diffusers

The Janus repo was tested with diffusers==0.27.x. If you have a newer version that breaks the SDXL-VAE interface, pin the version:

pip install "diffusers[torch]==0.27.2"
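
Before pinning, it helps to see which diffusers version is actually installed in the active environment. A minimal sketch using the standard library (the helper names are hypothetical, and the 0.27 series is the one named above):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_series(version, series="0.27"):
    """True when `version` is exactly `series` or a patch release of it (e.g. 0.27.2)."""
    return version is not None and (version == series or version.startswith(series + "."))


version = installed_version("diffusers")
print("diffusers:", version, "| in 0.27.x:", matches_series(version))
```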

Git LFS download hangs or times out

Use the huggingface_hub Python download method (Method A in Step 6) or set HF_HUB_ENABLE_HF_TRANSFER=1 for faster parallel downloads:

pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download deepseek-ai/JanusFlow-1.3B --local-dir ./model/JanusFlow-1.3B

Gradio demo fails with package conflicts

The demo dependencies include specific gradio and numpy versions. If the demo fails, check the GitHub Issues page for the latest resolution. Some users have had success with the community fork maintained by kingabzpro on Hugging Face for a patched demo.

What Was Removed and Why

The original post contained a test script using diffusers.StableDiffusionPipeline to load JanusFlow. This is incorrect — JanusFlow does not expose a StableDiffusionPipeline interface. The SDXL-VAE is only used internally as the decoder; the model is loaded and called through janus.janusflow.models.MultiModalityCausalLM. Using the wrong loader produces a confusing error about incompatible config keys. That script has been replaced with the correct implementation from the official README.

The original post also suggested downloading a single diffusion_pytorch_model.bin via wget. A single weight file is not sufficient — the model needs its full config, tokenizer, and multiple shards. This section has been replaced with the correct huggingface_hub-based download instructions.

FAQ

Is JanusFlow 1.3B free to use commercially?

The code is MIT-licensed. The model weights use the DeepSeek Model License, which permits commercial use with some restrictions (you cannot use the model to harm DeepSeek or create competing foundation model services). Review the full license on Hugging Face before deploying in a commercial product.

Can I run JanusFlow 1.3B without a GPU?

Yes, but it is very slow. Remove the .cuda() calls and the model will run on CPU. Expect roughly 5–10 minutes per image on a modern CPU, in line with the estimate under System Requirements. For practical use, a GPU with bfloat16 support (RTX 3000+ or A-series) is necessary.

Does JanusFlow work with AMD GPUs (ROCm)?

Theoretically yes if you install PyTorch with ROCm support, but the Janus team only tests on NVIDIA. Community reports suggest it works on RX 6900 XT and RX 7900 XTX with PyTorch ROCm builds, but bfloat16 support on ROCm can be inconsistent. Treat AMD support as experimental.

What is the difference between JanusFlow and Janus-Pro?

The core difference is the image generation method: JanusFlow uses rectified flow (ODE-based), while Janus-Pro uses next-token prediction (discrete tokenization). Janus-Pro-7B achieves higher GenEval scores (0.80 vs 0.63) and surpasses DALL-E 3. JanusFlow is more architecturally interesting for research into flow-based unified models. See the model comparison table above.

What output resolution does JanusFlow produce?

JanusFlow generates 384×384 pixel images. This is fixed by the architecture (48×48 latent space decoded by SDXL-VAE at 8× upscaling). You can upscale the output with a separate super-resolution model such as Real-ESRGAN or LDSR if higher resolution is needed.
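
The fixed sizes that appear in the generation code all follow from this arithmetic. The snippet below only restates numbers already present in this guide (the 48×48 latent, the 8× VAE factor, and the 24×24 reshape in generate()); the constant names are hypothetical.

```python
LATENT_SIZE = 48   # latent height/width, from z = randn(batchsize, 4, 48, 48)
VAE_UPSCALE = 8    # SDXL-VAE spatial upscaling factor
TOKEN_GRID = 24    # grid used in reshape(batch, 24, 24, 768) inside generate()

image_size = LATENT_SIZE * VAE_UPSCALE   # pixels per side of the output image
image_tokens = TOKEN_GRID * TOKEN_GRID   # latent tokens read back from the LLM
mask_extra = image_tokens + 1            # latent tokens plus one time embedding

print(image_size, image_tokens, mask_extra)  # 384 576 577
```

This is why the attention mask is extended by 577 positions and the last 576 hidden states are sliced off after each LLM pass.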

Can I fine-tune JanusFlow on my own dataset?

Yes — the training code is included in the repository. Fine-tuning the generation pathway requires significant GPU resources (~1,600 A100 GPU-days for the original training run). LoRA fine-tuning of the LLM backbone for the understanding pathway is more accessible and follows the standard HuggingFace PEFT workflow.

Does JanusFlow support multi-GPU inference?

Not natively. The codebase uses single-GPU inference. You can use accelerate with model parallelism for multi-GPU, but there is no official guidance — this is a common GitHub Issue topic. For production multi-GPU deployments, consider using vLLM or TGI with a supported model.

Is there an API or cloud service for JanusFlow?

DeepSeek does not offer a cloud API for JanusFlow specifically. A demo Space is available on Hugging Face Spaces for quick testing without local installation. For production inference, self-hosting on a GPU instance is the path.
