The landscape of AI video generation has fundamentally shifted in January 2026. For the first time, a production-ready, open-source model capable of generating synchronized 4K video and audio—LTX-2—is freely available to anyone with adequate hardware.
This comprehensive guide walks you through installing, configuring, and mastering LTX-2 on ComfyUI—the leading node-based AI interface—so you can create professional-grade videos with perfectly synchronized audio entirely on your own machine.
LTX-2, developed by Lightricks and released as open-source on January 6, 2026, is fundamentally different from earlier video generation models. While predecessors like Sora and Runway Gen-3 generate silent video and require post-hoc audio synthesis (leading to sync issues), LTX-2 generates video and audio simultaneously through an asymmetric dual-stream transformer architecture.
You can fine-tune it with LoRA (Low-Rank Adaptation), customize it for specific use cases, and deploy it anywhere. There's no per-second billing, no API rate limits, and no cloud dependency.
ComfyUI is a node-based interface designed specifically for generative AI workflows. Unlike traditional command-line tools, ComfyUI's visual node-based approach makes complex AI pipelines intuitive: you build workflows by connecting nodes.
ComfyUI is particularly well-suited to LTX-2 for several reasons.
First, the custom LTXVideo nodes integrate seamlessly into ComfyUI's architecture, providing intuitive controls for resolution, frame rate, sampling steps, and guidance scales.
Second, ComfyUI's built-in support for VRAM management, model offloading, and multi-GPU inference means you can run LTX-2 even on consumer hardware with careful optimization.
Third, the community has created extensive example workflows—pre-built templates for text-to-video, image-to-video, depth-guided video generation, and more—so beginners can start creating without building workflows from scratch.
This is the question everyone asks first. According to official documentation and real-world testing from January 2026, here's what you need:
Minimum Hardware Requirements:
Recommended Configuration for Comfortable Use:
The critical insight is that LTX-2's VRAM requirements scale with output resolution and duration. A 4-second clip at 720p on an RTX 4090 uses approximately 20-21GB VRAM, leaving headroom for the full generation pipeline. Attempting native 4K generation at longer durations pushes even the 24GB 4090 to its limits. This is where quantization becomes essential.
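To build intuition for why resolution and duration dominate memory use, here is a back-of-envelope sketch of the video latent's footprint. The compression factors, channel count, and dtype size are illustrative assumptions for this guide, not published LTX-2 internals; the point is that the latent itself is tiny, so the 20-21GB is dominated by model weights and activations.

```python
# Back-of-envelope VRAM estimate for the video latent alone.
# spatial/temporal compression, channel count, and dtype size are
# illustrative assumptions, NOT official LTX-2 figures.
def latent_gigabytes(width, height, frames,
                     spatial_compression=32,   # assumed VAE downscale per axis
                     temporal_compression=8,   # assumed frame compression
                     channels=128,             # assumed latent channels
                     bytes_per_elem=2):        # fp16/bf16
    elems = ((width // spatial_compression)
             * (height // spatial_compression)
             * max(1, frames // temporal_compression)
             * channels)
    return elems * bytes_per_elem / 1024**3

# A ~4-second 720p clip (97 frames at 24 FPS): the latent is a few MB;
# the rest of the 20-21GB is weights and activations.
print(f"{latent_gigabytes(1280, 720, 97):.4f} GB")
```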
One of LTX-2's most powerful features is support for multiple quantization formats, which compress the model weights while maintaining quality. NVIDIA's integration of NVFP4 and NVFP8 formats into LTX-2—announced in early January 2026—is a game-changer for local generation.
FP8 Quantization (Recommended for Most Users):
NVFP4 Quantization (Maximum Speed):
For context, on an RTX 4090, NVFP4 can generate an 8-second clip at 720p in approximately 25 seconds, compared to 180+ seconds with the full precision model.
Distilled Model (8-Step Fast Generation):
The choice of quantization method directly impacts your usable generation resolution and duration.
Step 1: Install ComfyUI
Begin by cloning the ComfyUI repository and setting up a Python virtual environment:
```bash
# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

# Launch ComfyUI
python main.py
```
Open your browser to http://localhost:8188 to verify installation. You should see the ComfyUI interface with a blank workflow canvas.
Step 2: Install LTX-2 Custom Nodes
The recommended method is using ComfyUI Manager, which automates the installation:
Open the Manager, search for the LTXVideo node pack, click Install, and restart ComfyUI (Ctrl+M on Windows/Linux, Cmd+M on Mac). After the restart, right-click the workflow canvas and navigate to "Add Node" → "LTXVideo" to verify the nodes are available.
Step 3: Download Model Files
LTX-2 requires several model files (approximately 50GB total). Create the proper directory structure in ComfyUI:
```text
ComfyUI/
├── models/
│   ├── checkpoints/             # Main model
│   ├── text_encoders/           # Text encoder
│   └── latent_upscale_models/   # Upscalers (optional)
```
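If you prefer to script the setup, the directory layout above can be created in one step. This sketch uses Python's standard pathlib and assumes ComfyUI was cloned into the current working directory.

```python
from pathlib import Path

# Create the model directories LTX-2 expects inside a ComfyUI checkout.
base = Path("ComfyUI/models")
for sub in ("checkpoints", "text_encoders", "latent_upscale_models"):
    (base / sub).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in base.iterdir()))
```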
Download the FP8 quantized checkpoint (recommended):
```bash
pip install huggingface-hub
huggingface-cli download Lightricks/LTX-2 ltx-2-19b-dev-fp8 --local-dir ComfyUI/models/checkpoints/
```
Download the text encoder (Gemma 3 12B IT quantized):
```bash
huggingface-cli download Lightricks/LTX-2 gemma-3-12b-it-qat-q4_0-unquantized --local-dir ComfyUI/models/text_encoders/
```
Step 4: Load Example Workflows
The easiest way to start is using pre-built workflows. In ComfyUI, click "Load" → "Template Library" and select "LTX-2 Text to Video" or download example workflows from the official repository.
The main workflows include:
Step 5: Configure Your First Generation
Let's create a text-to-video clip using the loaded workflow. Key parameters:
Text Prompt (critical for audio-video quality):
```text
A serene morning in a Japanese garden during cherry blossom season.
Soft pink petals gently fall to the ground. Water in a stone fountain
creates subtle ripples. Birds sing softly in the background. Soft
natural light filters through the trees. Camera slowly pans left
to right. Ambient forest sounds with distant bird calls.
```
Notice how this prompt describes both visual AND audio elements. LTX-2 generates better synchronized audio when you explicitly describe sounds.
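Because explicit audio description matters so much, it can help to assemble prompts from labeled parts so the audio clause is never forgotten. This small helper is a convention of this guide, not part of any LTX-2 or ComfyUI API.

```python
def build_prompt(visual: str, audio: str, camera: str = "") -> str:
    """Compose an LTX-2 prompt that covers both modalities.

    Keeping the audio description in its own argument makes it
    hard to omit; the joined result is a plain text prompt.
    """
    parts = [visual.strip(), audio.strip()]
    if camera:
        parts.append(camera.strip())
    # Normalize each clause to end with exactly one period.
    return " ".join(p.rstrip(".") + "." for p in parts)

prompt = build_prompt(
    visual="A serene morning in a Japanese garden during cherry blossom season",
    audio="Birds sing softly; ambient forest sounds with distant bird calls",
    camera="Camera slowly pans left to right",
)
print(prompt)
```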
Key Generation Parameters:
With these settings on an RTX 4090 using FP8, generation takes approximately 3-5 minutes for a 4-second clip at 720p.
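Earlier LTX releases required frame counts of the form 8n+1 (e.g. 97 or 121 frames); treating that constraint as an assumption for LTX-2 as well, a quick helper can snap a desired duration to the nearest valid frame count before you queue a generation.

```python
def valid_frame_count(seconds: float, fps: int = 24, step: int = 8) -> int:
    # Assumption carried over from earlier LTX releases: frame counts
    # must be of the form step*n + 1. Snap to the nearest valid value.
    target = round(seconds * fps)
    n = max(1, round((target - 1) / step))
    return step * n + 1

print(valid_frame_count(4.0))  # 4 seconds at 24 FPS -> 97 frames
```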
Step 6: Generate and Review
Click "Queue Prompt" in the top-right corner. ComfyUI displays progress in the terminal and browser. The output video appears in the preview panel with both audio and video. Save the video by right-clicking the output node.
Once you've mastered basic text-to-video, explore these advanced capabilities:
Convert still images into dynamic videos while maintaining composition and style. Load any image and provide a prompt describing desired motion:
```text
Positive Prompt: "The person in the image begins to smile,
then turns to face the camera. Subtle lighting adjustments.
Soft background music begins. Natural facial expression."
```
Using lower CFG values (3.0-5.0) preserves image consistency. This is ideal for product demos, character animation, and photo-to-motion workflows.
Use depth maps to control spatial structure and camera perspective. This is particularly powerful for maintaining consistent 3D geometry across generations:
This creates cinematic camera movements while maintaining spatial coherence—useful for architectural walkthroughs and complex scene generation.
Control character movement with DWPose (DWPreprocessor). This enables frame-level control of human motion:
Dancers, action sequences, and performance captures become possible without professional motion capture equipment.
Use edge detection to preserve structural boundaries and architectural details:
Excellent for line art animation and maintaining precise object boundaries.
LTX-2 includes dedicated upscaler models to enhance quality post-generation:
Chain them together for a 2-step pipeline: generate at 768×512 @ 24 FPS, then upscale to 1536×1024 @ 48 FPS. This often produces better results than attempting direct high-resolution generation.
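The two-step pipeline doubles both spatial and temporal resolution; a quick sanity check of the numbers:

```python
# Two-step pipeline: generate small, then upscale 2x in space and time.
def upscaled(width: int, height: int, fps: int, factor: int = 2):
    return width * factor, height * factor, fps * factor

w, h, fps = upscaled(768, 512, 24)
print(w, h, fps)  # 1536 1024 48
```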
Train your own LoRA (Low-Rank Adaptation) weights to teach LTX-2 your specific artistic style or subject matter, using LTX-2 Trainer with just 10-50 video clips of your target style.
This enables consistent character appearances, branded visual styles, and subject-specific generations that would be impossible with the base model.
The synchronized audio generation sets LTX-2 apart from every other open-source model. Unlike chaining separate video and audio models (e.g., Kling for video + ElevenLabs for speech), LTX-2 generates both modalities simultaneously, ensuring perfect temporal alignment.
Audio quality depends significantly on prompt description. Explicit audio specifications yield the best results:
Excellent Prompt (with audio):
```text
A coffee shop at morning. Espresso machine hisses and steams.
Cups clink as the barista sets them on the counter. Soft jazz
music plays in the background. Customers have hushed conversations.
The door chimes as a new customer enters. Ambient sounds of
urban morning traffic outside the window.
```
Poor Prompt (no audio description):
```text
A coffee shop scene with people.
```
The model generates synchronized:
Audio quality is generally excellent for dialogue, good for foley and effects, and adequate for ambient sound. For music-heavy projects, you might still layer additional music post-generation, but the ambient audio typically doesn't require replacement.
To fully appreciate LTX-2's advantages, consider the total cost of ownership for generating videos with different platforms over one year:
LTX-2 Local Setup (One-Time Investment):
Sora 2 Cloud API (Per-Usage Model):
Runway Gen-3 (Credit-Based):
Pika 2.1 Subscription:
Breakeven Analysis:
The financial advantage is undeniable for any serious creator. Even casual hobbyists generating 100 videos annually benefit from LTX-2's zero-per-video cost structure.
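A breakeven point is simple to compute yourself. The hardware price, per-video cloud fee, and electricity cost below are placeholder assumptions for illustration; substitute your own quotes.

```python
import math

def breakeven_videos(hardware_cost: float,
                     cloud_cost_per_video: float,
                     electricity_per_video: float = 0.02) -> int:
    # Number of videos after which local generation is cheaper than
    # paying a per-video cloud fee. All inputs are your own estimates.
    per_video_saving = cloud_cost_per_video - electricity_per_video
    if per_video_saving <= 0:
        raise ValueError("cloud must cost more per video than electricity")
    return math.ceil(hardware_cost / per_video_saving)

# e.g. a hypothetical $2000 GPU vs. a hypothetical $0.50/video cloud fee
print(breakeven_videos(2000, 0.50))
```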
For professionals and studios, extracting maximum efficiency is essential:
Multi-GPU Parallelization:
With 2+ NVIDIA GPUs, you can distribute LTX-2's inference across devices:
```bash
python main.py --multi-gpu --gpu-ids 0,1
```
Expected improvements:
Workflow Optimization Patterns:
```text
Add "Tiled VAE Decode" node → Set tile size: 512×512, Overlap: 64px
Result: 50% VRAM reduction, 15-20% speed reduction
```
```text
Add "Save Text Encoding" node → Reuse encoded embeddings across generations
Result: Avoids re-encoding the 12B Gemma text encoder
```
"CUDA out of memory" Errors:
Solutions in priority order:
```bash
python main.py --reserve-vram 4   # reserves 4GB of VRAM for the OS
```

Root causes and fixes:
Diagnostics:
Run `nvidia-smi` during generation (GPU utilization should show 95%+).

```text
Prompt: "A sleek silver smartphone sits on a black glass table.
Soft studio lighting highlights the device edges. The phone's
screen illuminates with app icons. Camera slowly zooms in on
the device. Subtle ambient electronic sounds. Professional
product photography ambiance."

Settings:
- Resolution: 1024×576
- Duration: 27 frames (1.1 sec at 24 FPS)
- CFG: 7.0
- Steps: 35 (full model)
- Time: ~4 minutes on RTX 4090 FP8

Result: A professional product shot; demonstrating key features
typically needs 30-60 seconds of footage, i.e. multiple generations
stitched together.
```
```text
Prompt: "A trendy young woman dances in a bright modern apartment.
Natural sunlight streams through large windows. Upbeat lo-fi hip-hop
music plays. Camera captures dynamic movement with quick cuts.
Energy is fun and relatable. The woman smiles at the camera.
Urban contemporary aesthetic."

Settings:
- Resolution: 768×512 (no upscaling needed for mobile)
- Duration: 36 frames (1.5 sec at 24 FPS)
- CFG: 6.0
- Steps: 30 (full model, quality is important for audience engagement)
- Time: ~3.5 minutes on RTX 4090 FP8

Result: A 15-30 second short once combined with music and editing,
ready for immediate posting.
```
Create a 10-second cinematic scene by generating three 4-second clips:

```text
Scene 1 - Establishing:
"Vast desert landscape during golden hour. Sand dunes extend
to the horizon. Warm sunlight creates dramatic shadows. Wind
gently moves sand. Sparse vegetation scattered across the dunes.
Soft, contemplative instrumental music. Camera pans across the landscape."

Scene 2 - Character Introduction:
"A lone figure walks across the desert dunes. Weathered clothing.
Determined expression. Footsteps echo in the silence. Wind whistles.
Dramatic shadows cast by the setting sun. Camera follows the character
from a distance. Tension-building music swells."

Scene 3 - Climactic Moment:
"The character reaches the peak of a tall dune and gazes at the
vast landscape. Powerful orchestral music crescendos. Golden light
bathes the scene. Camera slowly zooms out to show the character's
smallness against the immense landscape. Emotional, awe-inspiring mood."

Settings: 20-25 frames per clip, 24 FPS, 1024×576
Timeline: 10+ minutes total generation time
Post-Production: Stitch clips together, color grade, add transitional
effects, potentially add voice-over narration

Result: Professional cinematic micro-film suitable for film festivals,
YouTube, or short-form narrative platforms.
```
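Stitching the clips can be done with ffmpeg's concat demuxer. This sketch only generates the file-list input it expects (the clip filenames are hypothetical), leaving the actual `ffmpeg -f concat -safe 0 -i list.txt -c copy film.mp4` invocation to you.

```python
def concat_list(clips: list[str]) -> str:
    # Build the file-list format expected by ffmpeg's concat demuxer:
    #   ffmpeg -f concat -safe 0 -i list.txt -c copy film.mp4
    return "".join(f"file '{c}'\n" for c in clips)

# Hypothetical output filenames from the three generations above.
listing = concat_list(["scene1.mp4", "scene2.mp4", "scene3.mp4"])
print(listing, end="")
```

Using `-c copy` avoids re-encoding, so the stitch is lossless and nearly instant; color grading afterwards will require a real encode pass.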
LTX-2's competitive advantages are not merely incremental:
The verdict: For creators who can invest in hardware and value cost efficiency, privacy, and technical control, LTX-2 is unquestionably superior. For those prioritizing speed, ease-of-use, and photorealism without hardware investment, proprietary cloud solutions remain competitive.
LTX-2 released as fully open-source on January 6, 2026—a date that will likely be remembered as pivotal in AI democratization. Within days, the community began:
By mid-January 2026, users reported successfully running LTX-2 on the RTX 3090 (24GB) with careful quantization, the RTX 4080 (16GB) with heavy optimization, and even the RTX 5070 Ti (16GB) in early testing. The community is systematically breaking hardware barriers, making LTX-2 accessible to those without enterprise-grade GPUs.
Running LTX-2 on ComfyUI is no longer a technical challenge reserved for machine learning experts. The installation process, while requiring some command-line comfort, is now straightforward and well-documented. The performance is production-grade: 4K video with synchronized audio, generated locally at costs that amount to mere pennies per video.