Most AI voice tools give you two options. Clone an existing voice or pick from a list of defaults. If neither works for what you need, you are stuck.
VoxCPM2 adds a third option. You describe what you want. A young woman, gentle tone, slightly slow pace. A deep male voice with a formal cadence. Whatever you can put into words, it generates from scratch, no recording needed.
That alone would make it interesting. But it also does voice cloning, supports 30 languages without needing a language tag, outputs 48kHz audio, runs on 8GB of VRAM, and ships under Apache 2.0. The whole thing is two billion parameters and installs with a single pip command.
I tried the audio samples and the results are genuinely good. Not fully human, but natural enough that you stop noticing the model and start paying attention to what it is saying. Mixed languages, different emotions, and you can steer all of it.
What VoxCPM2 actually is

VoxCPM2 is a text-to-speech model from the MiniCPM team, built on a tokenizer-free diffusion autoregressive architecture. Most TTS models convert text into tokens first, then generate audio from those tokens; VoxCPM2 skips the tokenization step entirely, which is part of why it handles 30 languages without you needing to specify which one you are using. You feed it text, it figures out the rest.
The model is 2 billion parameters, trained on over 2 million hours of multilingual speech data, and outputs 48kHz audio natively. It accepts 16kHz reference audio and upsamples internally through its AudioVAE V2 component, so you do not need to do anything special with your input files.
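The 16kHz-in, 48kHz-out relationship is an exact 3x ratio. As a rough illustration of what upsampling at that ratio means (purely a sketch: VoxCPM2's AudioVAE V2 learns this mapping internally rather than interpolating), here is naive 3x linear-interpolation upsampling in plain Python:

```python
def upsample_3x(samples):
    """Naive 3x linear-interpolation upsampling (16 kHz -> 48 kHz).

    Illustrative only: VoxCPM2's AudioVAE V2 learns this mapping
    rather than interpolating, but the sample-rate ratio is the same.
    """
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        # Insert two interpolated samples between each original pair.
        out.extend([a, a + (b - a) / 3, a + 2 * (b - a) / 3])
    out.append(samples[-1])
    return out

wave_16k = [0.0, 0.3, 0.6]        # three samples at 16 kHz
wave_48k = upsample_3x(wave_16k)  # seven samples at 48 kHz
```

The practical takeaway is just that the rates divide cleanly, which is why you can hand the model ordinary 16kHz reference clips without any preprocessing.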
It runs on around 8GB of VRAM, which puts it within reach of most people with a modern GPU. There is also a streaming mode that generates audio in real time rather than waiting for the full output, useful if you are building anything interactive.
The voice design feature
Describe a voice in plain text, wrap it in parentheses at the start of your prompt, and VoxCPM2 generates it. No reference audio, no sample recording needed.
Gender, age, tone, emotion, pace, you can specify any of it or combine them. Want a formal deep voice that sounds slightly tired? Describe it. Want something warm and conversational with a fast delivery? Describe that instead.
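The parenthesized-description format is easy to script. A minimal sketch: the `build_voice_prompt` helper is my own naming, not part of VoxCPM2's API; only the "(description) text" prompt shape comes from the feature described above.

```python
def build_voice_prompt(description: str, text: str) -> str:
    """Prepend a parenthesized voice description to the text to speak.

    The "(description) text" shape follows VoxCPM2's voice-design
    prompt format; the helper itself is just string handling.
    """
    return f"({description.strip()}) {text.strip()}"

prompt = build_voice_prompt(
    "young woman, gentle tone, slightly slow pace",
    "Welcome back. Let's pick up where we left off.",
)
# prompt == "(young woman, gentle tone, slightly slow pace) Welcome back. Let's pick up where we left off."
```

Keeping the description as data rather than hardcoding it into strings makes it trivial to sweep several voice descriptions over the same script.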
The honest limitation is that results vary between runs. The model card recommends generating one to three times to get what you want rather than expecting the first output to be perfect. That is a minor inconvenience, not a dealbreaker, especially when the alternative is hunting down a voice actor or recording your own sample.
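Since outputs vary between runs, the natural workflow pattern is best-of-N: generate a few candidates and keep the one you like. A hedged sketch, where `generate` and `score` are placeholders (my naming, not VoxCPM2's API) for the actual TTS call and whatever quality check you trust:

```python
def best_of_n(generate, score, n=3):
    """Run `generate` n times and keep the candidate `score` likes best.

    `generate` and `score` are stand-ins for the real TTS call and a
    quality metric, e.g. speaker similarity or a quick listening pass.
    """
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-in: pretend each "generation" is a quality score directly.
attempts = iter([0.2, 0.9, 0.5])
best = best_of_n(generate=lambda: next(attempts), score=lambda x: x)
# best == 0.9
```

In practice the hard part is the scoring function; if you have no automated metric, generating three candidates and picking by ear matches the model card's own advice.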
For content creators, indie game developers, app builders, or anyone who has ever needed a specific voice and had no way to get it, this feature alone is worth the install.
Three ways to clone
Once you have a voice you want to work with, VoxCPM2 gives you three distinct approaches depending on how much control you need.
The basic route is controllable cloning. You give it a short audio clip of the target voice and it clones it. Simple. But you can also layer style guidance on top, telling it to deliver the cloned voice with a cheerful tone, or slightly faster, or more subdued. The timbre stays the same, the delivery changes. That combination is useful when you want a consistent voice across content but need it to carry different emotional weight in different places.
The second approach is what they call ultimate cloning. You provide the reference audio plus the exact transcript of what is being said in that clip. The model uses both together to capture every nuance of the original, not just the general voice characteristics but the specific way that person breathes, pauses, and moves between sounds. The result is noticeably closer to the source than standard cloning.
The third is voice design, which we already covered. No audio needed at all.
Most people will start with controllable cloning and reach for ultimate cloning when they need the output to be as close to a specific person’s voice as possible. Voice design is its own use case entirely, more for creating something new than reproducing something existing.
Related: Open-Source TTS Models That Can Clone Voices and Actually Sound Human
How it performs
On an RTX 4090 the real-time factor sits around 0.30, meaning it generates audio roughly three times faster than the audio plays back. With Nano-vLLM acceleration that drops to around 0.13, which is fast enough for genuinely interactive applications.
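Real-time factor is simply generation time divided by the duration of the audio produced, so lower is better and 1/RTF is the speedup over playback. A quick sanity check of the numbers above:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced.

    RTF < 1 means faster than real time; 1/RTF is the speedup factor.
    """
    return generation_seconds / audio_seconds

# Generating 10 s of audio in 3 s gives RTF 0.30, i.e. ~3.3x real time.
rtf = real_time_factor(3.0, 10.0)
speedup = 1 / rtf
```

By the same arithmetic, an RTF of 0.13 corresponds to roughly 7.7x faster than playback.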
On more modest hardware it is slower but still usable. The streaming mode helps here because you start getting audio back before the full generation is complete, which matters if you are building something where latency is noticeable.
The 48kHz output is the other practical performance detail. Most TTS models top out at 22kHz or 24kHz. 48kHz is studio quality, the kind of output you can drop directly into a production workflow without upsampling it afterward.
I tested the audio samples across mixed languages and different emotional prompts. The results are natural enough that you focus on the content rather than the generation. Not perfect, there are moments where expressiveness feels slightly mechanical on longer inputs, which matches the model card’s own warning about instability with very long or highly expressive text. But for most practical use cases it holds up well.
Benchmark results across Seed-TTS-eval, CV3-eval, and other standard TTS evaluations are available in the GitHub repo. We are not going to reproduce the full tables here but the numbers are competitive with leading models in the space according to their own testing.
Who it is for
If you create content regularly, build voice interfaces, work in game development, or just need a flexible TTS tool you can run privately on your own hardware, VoxCPM2 is genuinely worth trying. The Apache 2.0 license means commercial use is completely fine with no strings attached.
Developers building multilingual applications will find the 30-language support without language tagging particularly useful. You do not need to build language detection into your pipeline; the model handles it.
For individuals who want a local voice assistant, a narration tool, or something to experiment with, the hardware requirements are reasonable enough that most people with a gaming GPU can run it without issues. Getting it running locally is straightforward; the GitHub repo has clear instructions to get you started.
Current limitations
Voice design and style control results vary between runs. The model itself recommends generating multiple times to get the output you want, which adds friction to any workflow that needs consistency at scale.
Performance across the 30 supported languages is uneven. Languages with less representation in the training data will produce noticeably weaker results than English, Chinese, or other high-resource languages.
Very long inputs and highly expressive prompts can produce unstable output. For short to medium length content it holds up well. For anything extended, splitting into shorter segments is the safer approach.
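Splitting long input at sentence boundaries before synthesis is easy to do yourself. A minimal plain-Python sketch (not part of VoxCPM2, just one reasonable way to chunk text under a length budget):

```python
import re

def split_for_tts(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-boundary chunks no longer than max_chars.

    A single sentence longer than max_chars becomes its own chunk;
    real code might split those further on commas. Illustrative only.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```

You would then synthesize each chunk separately and concatenate the audio, which sidesteps the instability the model card warns about on very long inputs.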
Small but powerful
VoxCPM2 is not going to replace a professional voice actor for high stakes production work. But for everything below that bar, which is most of what people actually need, it is one of the most flexible open source TTS options available right now. Try the demo first. If the voice quality works for your use case, the install takes about two minutes.