Run YuE-7B on a Mac (April 2026): Honest Guide to Open Lyrics-to-Song Generation
Last updated April 2026 — refreshed for current model/tool versions.
YuE is the open-source lyrics-to-song music generation model family released by HKUST and M-A-P. It is the closest open analogue to Suno or Udio, and it is heavily CUDA-bound. This guide is the honest, 2026-current account of how to run YuE-7B from a Mac — what works locally, what does not, and which routes are actually worth your time.
If you arrived expecting a one-line pip install that produces 30-second clips on an M2 in 10 seconds, that earlier write-up was wrong on multiple facts. The corrections are below.
What changed in 2026 (and corrections to the prior version of this post)Model identity correction. YuE-7B is from multimodal-art-projection (M-A-P) and HKUST, not DeCLaRe Lab. It is a lyrics-to-song model (vocals + accompaniment, up to ~5 minutes), not a generic text-to-audio sound-effects model. It does not use "Flow Matching" or "CRPO" — those terms came from a different model (Tango/Tango 2). The earlier post conflated the two.Apple Silicon support remains unofficial. The official YuE repo targets CUDA 11.8+ and FlashAttention 2. There is no first-party Mac/MPS path as of April 2026. Anyone telling you otherwise on Mac is using a community fork, CPU fallback, or cloud passthrough.Hardware reality. Per the official README, single-segment generation needs ~16 GB VRAM; full-song generation recommends 80 GB-class GPUs (H800 / A100 / multi-4090). The 8 GB / "M2 generates 30 seconds in under 10 seconds" claim from the previous version was fabricated.License moved to Apache 2.0 on 2025-01-30 (commercial use allowed with attribution: "YuE by HKUST/M-A-P"). The technical paper landed on arXiv 2025-03-12 (2503.08638).Python baseline bumped. Python 3.7 reached end-of-life in 2023. The repo recommends Python 3.8+; YuEGP recommends Python 3.10. We default to Python 3.10+ here.Easier paths exist. If you only have a Mac, the realistic options in 2026 are: Google Colab notebook (provided by the YuE team), fal.ai's hosted YuE endpoint, the YuEGP "GPU Poor" fork on a remote box, or a rented A100/H100 by the hour. Local M-series inference is possible only with heavy quantization, very slow, and not officially supported.
TL;DR — Can I actually run YuE-7B on a Mac?
| Path | Works on Mac? | Speed | Notes |
|---|---|---|---|
| Official repo, CUDA install | No | — | Requires NVIDIA GPU + CUDA 11.8/12.x. Will not run natively on Apple Silicon. |
| Colab notebook (T4 / A100) | Yes (browser) | Reasonable on A100 | Easiest practical path from a Mac. Free tier (T4) is tight; Colab Pro recommended. |
| fal.ai hosted YuE | Yes (API) | Fast | Pay-per-second; no setup. Good for one-off songs. |
| YuEGP fork on rented GPU | Yes (SSH) | Moderate | 16/12 GB VRAM profiles; runs on RTX 3090/4090/4070 Ti class. |
| Native M-series via PyTorch CPU/MPS | Partial, unofficial | Very slow | Community-only. Expect minutes-to-hours per 30 s of audio. No support. |
If you want a turnkey local Mac music generator instead of a YuE-on-Mac project, jump to the "What was removed and why" section — there are better-fit Mac-native options.
What YuE-7B actually is
YuE is a family of LLaMA-2-architecture foundation models trained for music generation. The flagship checkpoints are 7B-parameter Stage-1 models that take lyrics + a genre/style description as input and emit semantic audio tokens; a 1B Stage-2 model upsamples those tokens to a coarse waveform, and a separate upsampler stage produces the final 44.1 kHz output.
- Authors: M-A-P (Multimodal Art Projection) and HKUST.
- Architecture: Decoder-only LLaMA-2 backbone with a dual-token (vocal + instrumental) prediction scheme.
- Inputs: Lyrics text + genre/style tags. Optional: reference audio for in-context style transfer (ICL variants).
- Output: Up to ~5 minutes of music with aligned vocals and accompaniment.
- Languages: English, Mandarin, Japanese, Korean (separate checkpoints per language family).
- License: Apache 2.0 (since 2025-01-30); commercial use permitted with attribution.
The official model variants on Hugging Face:
m-a-p/YuE-s1-7B-anneal-en-cot— English, chain-of-thought.m-a-p/YuE-s1-7B-anneal-en-icl— English, in-context learning (style transfer via reference audio).m-a-p/YuE-s1-7B-anneal-zh-cot/-zh-icl— Mandarin variants.m-a-p/YuE-s1-7B-anneal-jp-kr-cot/-jp-kr-icl— Japanese / Korean variants.m-a-p/YuE-s2-1B-general— Stage-2 audio upsampler (used with all Stage-1 variants).
Hardware requirements (read this before installing anything)
From the official YuE README (verified April 2026):
- Up to 24 GB VRAM: max 2 concurrent inference sessions, short segments only.
- Full-song generation (4+ segments): 80 GB-class GPU recommended (H800, A100 80 GB, or multi-RTX 4090 setup).
- Indicative throughput: ~150 s of wall-clock to produce 30 s of audio on H800; ~360 s on RTX 4090.
- Quantized (YuEGP fork): 16 GB profile (default), 12 GB 8-bit profile, sub-10 GB profile with sequential offloading (very slow).
None of those targets are an Apple Silicon GPU. The M3 Max / M4 Max have unified memory pools that could hold the weights, but the official inference code paths (FlashAttention 2, CUDA kernels) are not portable to Metal Performance Shaders without significant rework. As of this refresh, no maintained MLX or Metal port of YuE exists.
Path A — Easiest from a Mac: Google Colab
The YuE team ships a Colab notebook (linked from the GitHub README under the 2025-02-17 update). This is the lowest-friction path if you're on a MacBook.
- Open the official Colab notebook from the YuE GitHub repo.
- Runtime → Change runtime type → A100 (Colab Pro) or T4 (free tier — tight on memory; expect short segments only).
- Run the install cell. It clones the repo, installs PyTorch + FlashAttention 2 against the Colab CUDA stack, and pulls weights from Hugging Face.
- Run the inference cell. On A100, expect ~3–5 minutes for a 30-second clip.
- Download the resulting
.mp3/.wavfrom the Colab file browser to your Mac.
Edit the lyrics and genre prompt cells:
genre.txt:
inspiring female uplifting pop airy vocal electronic bright vocal vocal
lyrics.txt:
[verse]
We danced through fields of gold beneath an endless sky
[chorus]
And I'll remember you for all the days gone by
This is genuinely the recommended path for individual experimentation when your only local hardware is a Mac.
Path B — Pay-per-call: fal.ai hosted YuE
fal.ai hosts an official YuE endpoint. From a Mac you only need curl or any HTTP client. No GPU, no install.
curl --request POST \
--url https://fal.run/fal-ai/yue \
--header "Authorization: Key $FAL_KEY" \
--header "Content-Type: application/json" \
--data '{
"lyrics": "[verse]\nWe danced through fields of gold...\n[chorus]\nAnd I will remember you...",
"genres": "inspiring female uplifting pop airy vocal electronic bright"
}'
Verify current pricing on the fal.ai model page before scripting bulk runs — it is per-second-of-GPU, and prices change.
Path C — Self-host on a rented GPU box: YuEGP
YuEGP ("YuE for the GPU Poor") is the actively maintained community fork that adds 8-bit quantization profiles, sequential CPU offloading, and a friendlier setup. As of v3.0 (2025-02-10) it supports multi-song generation and progress tracking. From a Mac you SSH into the rented box; the Mac itself only runs an editor and an SSH client.
# On a Linux box with an NVIDIA GPU (RTX 3090 / 4090 / 4070 Ti class)
git clone https://github.com/deepbeepmeep/YuEGP.git
cd YuEGP
python3.10 -m venv .venv && source .venv/bin/activate
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Optional: FlashAttention 2 for memory-efficient long generations
pip install flash-attn --no-build-isolation
# Run with the 12 GB 8-bit profile
python infer.py --profile 3 \
--stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
--stage2_model m-a-p/YuE-s2-1B-general \
--genre_txt prompts/genre.txt \
--lyrics_txt prompts/lyrics.txt \
--output_dir ./output
Profile cheat-sheet:
| Profile | VRAM | Notes |
|---|---|---|
| 1 (default) | 16 GB | Fastest. Fits on RTX 4070 Ti Super, 4080, 4090, 3090. |
| 3 | 12 GB | 8-bit quantized. Fits on RTX 4070, 3060 12 GB. |
| 4 | <10 GB | Sequential offload. Slow, but runs on 8 GB cards. |
| 5 | Minimal | Maximum offloading; very slow. |
Path D — Native Apple Silicon (community-only, slow)
If you genuinely insist on running YuE on the Mac itself, the only route in April 2026 is a hand-rolled CPU / MPS PyTorch install with quantized weights. Expect:
- No FlashAttention 2. It is CUDA-specific. You fall back to vanilla scaled-dot-product attention, which costs memory and time.
- MPS gaps. Some ops in the YuE inference path may fall back to CPU; in PyTorch 2.5+ the situation has improved but is not 100%.
- Wall-clock cost. Community reports for related 7B audio models on M3/M4 Max suggest 10–30× the H800 wall-clock time, i.e. tens of minutes to hours per 30 seconds of audio.
# Apple Silicon, unofficial best-effort
xcode-select --install
brew install python@3.11 git-lfs ffmpeg
git lfs install
python3.11 -m venv yue-env && source yue-env/bin/activate
pip install --upgrade pip
pip install torch torchaudio # PyTorch 2.5+ ships MPS support
pip install transformers accelerate sentencepiece soundfile
git clone https://github.com/multimodal-art-projection/YuE.git
cd YuE
git lfs pull
pip install -r requirements.txt # Skip flash-attn; it will fail on Mac
You will then need to patch the inference script to (a) replace any flash_attn import with a guard, (b) set device = "mps" instead of "cuda", and (c) load Stage-1 weights in 4-bit / 8-bit via bitsandbytes or AWQ — bitsandbytes itself has limited Mac support, so AWQ-quantized community checkpoints from Hugging Face are typically the more practical choice. This is genuinely not a supported configuration; treat it as a research exercise, not a workflow.
For broader local-AI orchestration on a Mac (model routing, agent loops, mixing local and cloud calls), the OpenClaw + Ollama setup guide for running local AI agents is a more productive starting point than fighting YuE's CUDA assumptions.
Prompt formatting (this matters more than people realize)
YuE's lyric prompt expects standard song-section markers. Use them — outputs degrade noticeably without them.
[verse]
First verse line.
Second verse line.
[chorus]
Chorus line one.
Chorus line two.
[verse]
Second verse line.
[bridge]
Bridge line.
[chorus]
Final chorus repeat.
The genre prompt is a free-text descriptor list, not a single label. Stack 5–10 descriptors covering instrumentation, vocal style, mood, and production:
inspiring female uplifting pop airy vocal electronic bright vocal vocal
Heavy or contradictory genre prompts ("aggressive ambient") produce incoherent results. Keep one dominant genre and add timbre/production qualifiers.
How to choose: a decision tree
- You have an NVIDIA GPU on the same network → install YuEGP, SSH in. Stop reading.
- You only have a Mac, occasional use → fal.ai hosted endpoint. Pay per song, skip setup.
- You only have a Mac, learning/experimenting → Colab Pro with the official notebook on A100.
- You only have a Mac, want it 100% local → reconsider. Run a Mac-native model instead (see "What was removed and why").
- You're building a product around lyrics-to-song generation → fal.ai or Replicate for prototyping; rent dedicated A100/H100 capacity (Lambda, RunPod, Vast.ai) for production. Budget for the 80 GB tier if you want full songs.
Performance / benchmarks (April 2026, official figures)
| Hardware | Wall-clock for 30 s of audio | Source |
|---|---|---|
| NVIDIA H800 (80 GB) | ~150 s | YuE GitHub README |
| NVIDIA RTX 4090 (24 GB) | ~360 s | YuE GitHub README |
| NVIDIA RTX 3090 (24 GB), 8-bit (YuEGP profile 3) | ~7–9 min | YuEGP issue tracker, anecdotal |
| NVIDIA RTX 4070 Ti, 8-bit, sequential offload | ~15–20 min | YuEGP issue tracker, anecdotal |
| Apple M-series (community CPU/MPS) | Tens of minutes to hours; not officially benchmarked | — |
YuEGP issue #22 is a useful real-world data point: a user reported ~1 hour for a full song on a 12 GB GPU using the quantized profile. Plan capacity accordingly.
Common pitfalls and troubleshooting
- FlashAttention build fails. The wheel must match your exact PyTorch + CUDA combination. On Linux, prefer the prebuilt wheel from
flash-attn's release page over a from-source build. On Mac, do not installflash-attnat all — patch the import. - OOM on a 24 GB card. You're trying to run more than 2 concurrent segments, or you forgot to set
--stage2_batch_size 4down to 1–2. Reducemax_new_tokensfirst, then batch size. - Garbled vocals. Genre prompt is too short or too contradictory. Aim for 6–10 descriptors that agree.
- Long silences mid-song. Lyrics include unmarked instrumental passages. Add explicit
[bridge]/[instrumental]markers. - Hugging Face download is slow / fails. The Stage-1 7B checkpoints are ~14 GB each in BF16. Use
HF_HUB_ENABLE_HF_TRANSFER=1andhuggingface-cli loginwith a token. - Apple Silicon:
RuntimeError: MPS does not support …SetPYTORCH_ENABLE_MPS_FALLBACK=1to fall back to CPU for unsupported ops. This is slower but unblocks generation. - Output is monophonic / lo-fi. You skipped Stage-2 / the upsampler. Both stages are required for the published 44.1 kHz output.
What was removed and why
The previous version of this post recommended a few things that have aged out or were never accurate. We removed them:
- "YuE is by DeCLaRe Lab" / Flow Matching / CRPO references — wrong attribution. Those concepts come from the Tango / Tango 2 text-to-audio family. YuE is from M-A-P + HKUST and uses a LLaMA-2-based dual-token approach.
- "Generates 30 s of audio in under 10 s on M2" — fabricated. Even the H800 takes ~150 s for 30 s of output.
- macOS 10.15 Catalina baseline — Apple security updates for Catalina ended in 2022. Bumped to macOS 12 Monterey or newer, with 14 Sonoma+ recommended for current PyTorch MPS support.
- Python 3.7 baseline — Python 3.7 reached end-of-life in June 2023. Bumped to Python 3.10+.
- "8 GB RAM minimum" — not viable for any local YuE path. Realistic Mac unified-memory minimum for community CPU/MPS attempts is 32 GB; 64 GB if you want any patience left.
If you wanted a Mac-native music generator and YuE turns out to be the wrong fit, two reasonable alternatives in April 2026 are:
- Stable Audio Open Small — text-to-audio (sound effects + short loops, not full songs). Has community MLX ports and runs locally on M-series.
- MusicGen (Meta) — older, but well-supported on Apple Silicon via the mlx-audio stack and standalone MusicGen-MLX ports. Generates instrumental music up to ~30 s; no vocals.
FAQ
Does YuE-7B run natively on an M3 Max or M4 Max?
Not officially. The repo targets CUDA. Community CPU/MPS runs are possible with patches and quantized weights but are not supported and are very slow. Use Colab or a rented NVIDIA GPU instead.
Is YuE actually open-source?
Yes. As of 2025-01-30 the weights are released under Apache 2.0. Commercial use is permitted with attribution: "YuE by HKUST/M-A-P".
How does YuE compare to Suno or Udio?
YuE is the strongest open lyrics-to-song model in early 2026 but still trails the closed Suno v4 / Udio v2 generations on production polish, vocal naturalness, and style coherence over multi-minute outputs. Its strength is being self-hostable, modifiable, and license-clean for commercial use.
Can I fine-tune YuE on my own music?
Yes. LoRA fine-tuning support landed in the official repo on 2025-06-04. Expect to need a multi-GPU 80 GB-class setup or significant gradient-checkpointing tricks; the Mac is not the right machine for training.
What's the cheapest cloud option for one-off songs?
For a few songs a month, fal.ai's hosted YuE is the cheapest in time and dollars combined. For dozens per week, rent an A100 hourly (RunPod / Lambda / Vast.ai) and run YuEGP yourself.
What about voice cloning with YuE?
The ICL ("in-context learning") variants accept a short reference audio clip and condition the generation on its vocal/instrumental style. This is style transfer, not strict voice cloning — and you should respect the consent and rights of any voice you use as a reference.
Why does the official guide insist on FlashAttention 2?
YuE generates long sequences (thousands of audio tokens). FlashAttention 2 cuts the attention memory footprint enough to make full-song generation tractable on 24–80 GB cards. Without it you run out of memory on long contexts.
Is there a smaller YuE I can run on a 16 GB Mac?
The Stage-1 model is 7B parameters across all language variants — there is no smaller official Stage-1 checkpoint as of April 2026. Quantized (4-bit/8-bit) community checkpoints reduce memory to ~6–10 GB but still need a GPU runtime that supports those quant kernels efficiently.
Where this fits in your stack
Open music generation is a hiring-heavy area: it sits at the intersection of audio DSP, large-model inference engineering, and distributed GPU ops. If you're building a product on top of YuE or a successor model and need engineers who have shipped this kind of work, Codersera maintains a vetted bench of remote AI/ML engineers with hands-on experience in PyTorch, CUDA, and audio model deployment. For broader local-AI architecture, the OpenClaw + Ollama setup guide covers how to wire local and cloud models behind a single agent layer — the same pattern you'd use to route YuE generations from a Mac frontend to GPU backends.
Related Codersera walkthroughs in this cluster:
- Install YuE-7B on Ubuntu — step-by-step guide (the Linux side of this same model).
- Run DeepSeek Janus-Pro 7B on Mac — multimodal 7B on Apple Silicon, similar memory class.
- Run DeepSeek Janus-Pro 7B on Mac with ComfyUI — graph-based local inference.
References & further reading
- multimodal-art-projection/YuE — official GitHub repo
- m-a-p/YuE-s1-7B-anneal-en-cot — Hugging Face model card
- YuE collection on Hugging Face (all variants)
- YuE: Scaling Open Foundation Models for Long-Form Music Generation (arXiv 2503.08638, March 2025)
- YuE demo site (audio samples)
- deepbeepmeep/YuEGP — community fork with quantized profiles
- fal.ai/yue — hosted inference endpoint
- mlx-audio — Apple Silicon audio model toolkit (alternative for native Mac)