10 agents · 4 GitHub deep-dives · May 21 2026 · Apple Silicon (M4 Pro, 64GB)
Per-engine: what it is, how it works, license, Apple Silicon viability, the invocation that produced its sample.
nanovllm-voxcpm (CUDA fork). MPS has explicit upstream-acknowledged limitations. Clone mode produces a "fluent-foreigner accent" on JP (HF discussion #14).
speaker_kv_scale tuning.
sequence_length=750 needs ~24GB unified memory. 16GB Macs use sequence_length≤400 + cfg_guidance_mode=alternating. Joe's 64GB has plenty of headroom.
ja-JP-Chirp3-HD-{Name} pattern. Stock voices tested here: Charon, Kore, Aoede, Zephyr, Achernar.
pyproject.toml pins SBV2 from tsukumijima's fork (AGPL-3.0), imported in-process, so AGPL applies to the whole stack for commercial network deployment. Confirmed via primary-source inspection.
pyproject.toml selects plain onnxruntime for darwin/arm64 (no CoreML). Maintainer: 「GPU 対応は積極的には行っておりません」 ("We are not actively working on GPU support", issue #21).
12 voices · letters shuffled per load · Pick the one you like best · Reveal labels when done.
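The memory guidance above (sequence_length=750 at roughly 24GB of unified memory, sequence_length≤400 plus cfg_guidance_mode=alternating on 16GB Macs) can be sketched as a small lookup. The function name and the returned dictionary shape are hypothetical. Treating anything under 24GB like a 16GB machine is an assumption, since the notes only give those two data points.

```python
def pick_generation_settings(unified_memory_gb: float) -> dict:
    """Return the settings quoted in the notes for a given amount of unified memory.

    Assumption: machines below the ~24GB figure fall back to the 16GB recipe.
    """
    if unified_memory_gb >= 24:
        # The notes state sequence_length=750 needs ~24GB unified memory.
        return {"sequence_length": 750}
    # The notes' 16GB recipe: a shorter sequence plus alternating CFG guidance.
    return {"sequence_length": 400, "cfg_guidance_mode": "alternating"}


if __name__ == "__main__":
    for gb in (16, 64):
        print(gb, pick_generation_settings(gb))
```

On the 64GB machine described here the first branch applies, so no reduction should be needed.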
speaker_kv_scale): first public test of the post-PR-#18 configuration next
HF discussion #14 + VoxCPM issue #222 maintainer-confirmed stack:
Voice Design mode (--mode default --instruct "..."), not clone
CFG 1.5 (--cfg-value 1.5)
Do not pass a reference pair (prompt_audio + prompt_text) in Voice Design mode: it produced 192s of hallucinated audio in our test
Tune speaker_kv_scale (default 5.0) to control reference adherence vs naturalness
Default: Google Chirp 3 HD Charon / Kore (you own the output, $1.50 for the whole catalog at scale).
Voice variety: mix in VoxCPM2 Voice Design with character-instructs for non-narrator lines.
Avoid: AivisSpeech / SBV2 (AGPL). Shipping a public commercial product with copyleft inheritance is the wrong fight.
Default: Irodori v3 caption mode with character-instructs ("迫力ある男性ナレーター、緊張感のある朗読" etc.). MIT license, JP-native voice, emoji for emphasis.
Backup: VoxCPM2 Voice Design with CFG 1.5 for variety.
For personal viewing, AGPL is acceptable if needed, but Irodori (MIT) is the cleaner default.
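The VoxCPM2 Voice Design settings referenced in these notes (the --mode default, --instruct and --cfg-value 1.5 flags) can be assembled as below. This is a sketch only: the function name is hypothetical, no executable name is assumed, and only the flags quoted in the notes are used. Because the notes report 192s of hallucinated audio when prompt_audio + prompt_text were involved in Voice Design mode, the sketch rejects them outright, which is one interpretation of that finding.

```python
from typing import List, Optional


def voice_design_args(
    instruct: str,
    cfg_value: float = 1.5,
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
) -> List[str]:
    """Build the Voice Design flag list quoted in the notes (hypothetical helper)."""
    if prompt_audio is not None or prompt_text is not None:
        # The notes observed hallucinated audio with a reference pair in this mode.
        raise ValueError("Voice Design mode: omit prompt_audio and prompt_text")
    return [
        "--mode", "default",
        "--instruct", instruct,
        "--cfg-value", str(cfg_value),
    ]


if __name__ == "__main__":
    print(voice_design_args("迫力ある男性ナレーター、緊張感のある朗読"))
```

The returned list is meant to be appended to whatever command actually launches the engine, for example via `subprocess.run([...] + args)`.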
Default: Google Chirp 3 HD Charon. Calm, polished, no fiddling. The 1M characters/mo free tier covers most realistic lengths. For personal use the license is irrelevant either way.
If offline matters: Irodori v3 with calm caption.
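For the Chirp 3 HD option, the voice identifiers follow the ja-JP-Chirp3-HD-{Name} pattern noted earlier. A minimal sketch that expands the pattern for the stock voices tested here follows. The helper name is hypothetical, and the snippet deliberately makes no API call.

```python
# Stock voices listed in these notes as tested.
STOCK_VOICES = ["Charon", "Kore", "Aoede", "Zephyr", "Achernar"]


def chirp3_hd_voice(name: str, locale: str = "ja-JP") -> str:
    """Expand the {locale}-Chirp3-HD-{Name} naming pattern."""
    return f"{locale}-Chirp3-HD-{name}"


if __name__ == "__main__":
    for voice in STOCK_VOICES:
        print(chirp3_hd_voice(voice))
```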
Default: Irodori v3 (offline, MIT) for JP; Qwen3-TTS Ono_Anna as backup.
Air-gapped Silo machine: NO cloud option. Local-only mandate is a hard constraint here.
Strategy: chunk screenplays sentence-by-sentence, render per-character with distinct captions.
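The chunking strategy above could look roughly like this. It is a sketch under stated assumptions: the screenplay is assumed to arrive as (character, dialogue) pairs, sentences are assumed to end at 。！？ (or their ASCII forms), and the job dictionary layout is hypothetical. Nothing here calls a TTS engine; it only produces the per-sentence render plan.

```python
import re
from typing import Dict, List, Tuple

# One sentence: text up to the end punctuation, plus any closing quote brackets.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*[」』）]*")


def split_sentences(text: str) -> List[str]:
    """Split Japanese text into sentences, keeping the end punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def plan_render(
    lines: List[Tuple[str, str]], captions: Dict[str, str]
) -> List[dict]:
    """Produce one render job per sentence, tagged with the character's caption."""
    jobs = []
    for character, dialogue in lines:
        caption = captions[character]  # KeyError if a character has no caption
        for sentence in split_sentences(dialogue):
            jobs.append(
                {"character": character, "caption": caption, "text": sentence}
            )
    return jobs


if __name__ == "__main__":
    captions = {"narrator": "迫力ある男性ナレーター、緊張感のある朗読"}
    for job in plan_render([("narrator", "夜が来た。誰もいない！")], captions):
        print(job)
```

Rendering one sentence per job keeps each generation short, and a failed sentence can be retried without regenerating the whole scene.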
Things we found that aren't going to change soon — adjust expectations, don't fight the platform.
nanovllm-voxcpm (CUDA fork). Apple's ml-explore/mlx team has zero in-flight TTS work across all 2026 issues. The accent artifact in clone mode is partly architectural.