VoxCPM2: TTS and voice cloning in 30 languages, free and open source

◷ 5 min read 5/31/2026

If you need a voice for a project (voice acting, a live speech bot, or a clone of your own voice for automation), VoxCPM2 from China's OpenBMB lab is now one of the strongest open options available. It has 22.9k stars on GitHub, is licensed under Apache-2.0, and installs with a single command.

Repository: github.com/OpenBMB/VoxCPM

What it is

VoxCPM2 is a TTS model with 2 billion parameters, trained on more than 2 million hours of speech. Its architecture is tokenizer-free: instead of converting speech into discrete audio tokens, the model generates continuous audio representations directly, using a diffusion-autoregressive approach. In practice, this gives more natural intonation and better preservation of voice details during cloning.

It is built on the MiniCPM-4 language model and outputs 48 kHz studio-quality audio.

Four modes of use

Voice Design: create a voice from a text description, without reference audio. Describe the character of the voice in parentheses at the start of the text, and the model generates a matching voice:

python
wav = model.generate(
    text="(young man, calm and confident voice)Hello, I am your assistant.",
    cfg_value=2.0,
    inference_timesteps=10,
)

Controllable Cloning: clone a voice from a short audio clip while controlling the style (tempo, emotion, expression). The timbre is preserved, and the delivery stays flexible:

python
wav = model.generate(
    text="(slightly faster, cheerful tone)Good afternoon!",
    reference_wav_path="speaker.wav",
)

Ultimate Cloning: maximum cloning accuracy. Pass the reference audio together with its transcript, and the model speaks as a continuation of the original, preserving rhythm, timbre, and emotion.

Basic TTS: plain text-to-speech synthesis with no reference audio, in any of the 30 supported languages.
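The four modes differ mainly in which optional arguments accompany `text`. The sketch below maps each mode to a set of keyword arguments. The helper `build_generate_kwargs` is hypothetical. `text` and `reference_wav_path` come from the examples above, but `prompt_wav_path` and `prompt_text` for Ultimate Cloning are assumed parameter names, so check them against the VoxCPM documentation before use.

```python
from typing import Optional


def build_generate_kwargs(
    text: str,
    voice_description: Optional[str] = None,
    reference_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
) -> dict:
    """Hypothetical helper: assemble keyword arguments for model.generate()."""
    if prompt_text is not None and reference_wav_path is None:
        raise ValueError("a transcript is only useful together with reference audio")

    # Voice Design / Controllable Cloning: the style or voice description
    # goes in parentheses at the very start of the text.
    if voice_description:
        text = f"({voice_description}){text}"

    kwargs = {"text": text}
    if reference_wav_path is not None and prompt_text is not None:
        # Ultimate Cloning: audio plus its transcript.
        # ASSUMPTION: these two parameter names are not confirmed by the article.
        kwargs["prompt_wav_path"] = reference_wav_path
        kwargs["prompt_text"] = prompt_text
    elif reference_wav_path is not None:
        # Controllable Cloning: reference audio only.
        kwargs["reference_wav_path"] = reference_wav_path
    # With no reference audio, this is Basic TTS (or Voice Design if a
    # description was supplied).
    return kwargs
```

With a loaded model, the result would be passed as `model.generate(**build_generate_kwargs("Good afternoon!", reference_wav_path="speaker.wav"))`.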

Supported languages

30 languages: Arabic, Burmese, Vietnamese, Greek, Danish, Hebrew, Indonesian, Spanish, Italian, Chinese (including 9 dialects: Sichuanese, Cantonese, Shanghainese, and others), Korean, Malay, Dutch, German, Norwegian, Polish, Portuguese, Russian, Swahili, Tagalog, Thai, Turkish, Finnish, French, Hindi, Swedish, Japanese, English, as well as Khmer and Lao.

No language tag is needed: the model detects the language automatically.

Installation and quick start

bash
pip install voxcpm

Requirements: Python 3.10–3.12, PyTorch ≥ 2.5.0, CUDA ≥ 12.0. VRAM: ~8 GB for VoxCPM2.
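A quick preflight check can catch a mismatched environment before the model weights are downloaded. The sketch below encodes the requirements listed above as plain comparisons. The function names are hypothetical, and treating "~8 GB" as a hard minimum of 8.0 is an assumption.

```python
import sys
from itertools import takewhile


def parse_version(version: str) -> tuple:
    """Turn '2.5.1+cu121' into (2, 5, 1); build suffixes are ignored."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def check_environment(python_version, torch_version, cuda_version, vram_gb) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not (3, 10) <= tuple(python_version[:2]) <= (3, 12):
        problems.append("Python must be 3.10 to 3.12")
    if parse_version(torch_version)[:2] < (2, 5):
        problems.append("PyTorch must be >= 2.5.0")
    if cuda_version is None or parse_version(cuda_version)[:2] < (12, 0):
        problems.append("CUDA must be >= 12.0")
    if vram_gb < 8.0:  # assumption: "~8 GB" read as a minimum
        problems.append("about 8 GB of VRAM is needed for VoxCPM2")
    return problems


def probe_environment() -> list:
    """Inspect the current interpreter and GPU (requires PyTorch to be installed)."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed"]
    vram_gb = 0.0
    if torch.cuda.is_available():
        # total_memory is in bytes; divide by 1e9 for decimal gigabytes.
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    return check_environment(
        sys.version_info[:2], torch.__version__, torch.version.cuda, vram_gb
    )
```

Calling `probe_environment()` on the target machine returns an empty list when every listed requirement is met.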

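python
from voxcpm import VoxCPM
import soundfile as sf

# Load the VoxCPM2 weights without the optional denoiser.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

# The sample text is Russian: "Hello! This is VoxCPM2, speech synthesis in Russian."
wav = model.generate(
    text="Привет! Это VoxCPM2 — синтез речи на русском языке.",
    cfg_value=2.0,
    inference_timesteps=10,
)
# Write the waveform at the model's own sample rate.
sf.write("output.wav", wav, model.tts_model.sample_rate)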

To run the web interface locally:

bash
python app.py --port 8808
# open in a browser: http://localhost:8808

Performance and production

On an NVIDIA RTX 4090, the real-time factor (RTF) is about 0.3, meaning one second of speech takes about 0.3 seconds to generate. With Nano-vLLM, the RTF improves to roughly 0.13, which makes real-time streaming practical.
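RTF is generation time divided by the duration of the audio produced, so values below 1.0 are faster than real time. A minimal sketch of the arithmetic, with hypothetical function names:

```python
def real_time_factor(generation_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("the waveform is empty")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """How long it should take to synthesize a clip of the given length."""
    return audio_seconds * rtf
```

At the figures quoted above, a 60-second clip would take about 18 seconds at RTF 0.3 and about 7.8 seconds at RTF 0.13. To measure RTF on your own hardware, wrap the `model.generate(...)` call in `time.perf_counter()` readings and pass the elapsed time, `len(wav)`, and the model's sample rate to `real_time_factor`.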

For production deployment, vLLM-Omni is supported and exposes an OpenAI-compatible /v1/audio/speech endpoint, so VoxCPM2 can stand in for a hosted TTS service such as ElevenLabs wherever the client speaks the OpenAI audio API:

bash
vllm serve openbmb/VoxCPM2 --omni --port 8000

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2!","voice":"default"}' \
  --output out.wav
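The same request can be sent from Python with only the standard library. The sketch below mirrors the curl call above. The helper names are hypothetical, and it assumes the server is running locally on port 8000 and returns raw audio bytes in the response body.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8000",
    model: str = "openbmb/VoxCPM2",
    voice: str = "default",
) -> urllib.request.Request:
    """Build the POST request for the /v1/audio/speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, timeout: float = 120.0, **kwargs) -> None:
    """Send the request and save the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio)
```

With the server up, `synthesize_to_file("Hello from VoxCPM2!", "out.wav")` should produce the same file as the curl command.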

Fine-tuning

The model supports both LoRA and full fine-tuning. Five to ten minutes of audio is enough to adapt it to a specific voice or domain, and there is a ready-made WebUI for this:

bash
python lora_ft_webui.py  # http://localhost:7860
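Before starting a fine-tuning run, it can help to confirm that the recordings add up to the 5 to 10 minutes mentioned above. This is a minimal sketch using the standard `wave` module. The helper names and the flat folder of `.wav` files are assumptions, not part of the VoxCPM tooling, and `wave` reads only PCM-encoded WAV files.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path) -> float:
    """Duration of one PCM WAV file: frame count divided by frame rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_minutes(folder) -> float:
    """Sum the durations of all .wav files directly inside the folder."""
    seconds = sum(wav_duration_seconds(p) for p in sorted(Path(folder).glob("*.wav")))
    return seconds / 60.0


def has_enough_audio(folder, min_minutes: float = 5.0) -> bool:
    """True if the folder holds at least min_minutes of audio."""
    return total_minutes(folder) >= min_minutes
```

For example, `has_enough_audio("my_voice_clips")` returns `False` if the clips in that (hypothetical) folder total less than five minutes.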

Which projects it suits

VoxCPM2 is a good fit if you need to create a voice for a Telegram bot or voice assistant, add voiceover to an application without paying for the ElevenLabs API, clone your own voice to automate content, or embed TTS in production behind an OpenAI-compatible API.

The Apache-2.0 license permits commercial use, so monetization is allowed as long as the license's notice and attribution terms are kept.

Repository: github.com/OpenBMB/VoxCPM (22.9k stars)
Demo: huggingface.co/spaces/OpenBMB/VoxCPM-Demo
Documentation: voxcpm.readthedocs.io