
VoxCPM2 lets you create voices just by describing them and it is open source


Most AI voice tools give you two options. Clone an existing voice or pick from a list of defaults. If neither works for what you need, you are stuck.

VoxCPM2 adds a third option. You describe what you want. A young woman, gentle tone, slightly slow pace. A deep male voice with a formal cadence. Whatever you can put into words, it generates from scratch, no recording needed.

That alone would make it interesting. But it also does voice cloning, supports 30 languages without needing a language tag, outputs 48kHz audio, runs on 8GB of VRAM, and ships under Apache 2.0. The whole thing is two billion parameters and installs with a single pip command.

I tried the audio samples and the results are genuinely good. Not fully human, but natural enough that you stop noticing the model and start paying attention to what it is saying. Mixed languages, different emotions, and you can steer all of it.

What VoxCPM2 actually is


VoxCPM2 is a text to speech model from the MiniCPM team, built on a tokenizer-free diffusion autoregressive architecture. Most TTS models convert text into discrete tokens first, then generate audio from those tokens; VoxCPM2 skips the tokenization step entirely, which is part of why it handles 30 languages without needing you to specify which one you are using. You feed it text, and it figures out the rest.

The model is 2 billion parameters, trained on over 2 million hours of multilingual speech data, and outputs 48kHz audio natively. It accepts 16kHz reference audio and upsamples internally through its AudioVAE V2 component, so you do not need to do anything special with your input files.

It runs on around 8GB of VRAM which puts it within reach of most people with a modern GPU. There is also a streaming mode that generates audio in real time rather than waiting for the full output, useful if you are building anything interactive.
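The value of streaming mode is the consumption pattern: you act on each chunk as it arrives instead of waiting for the full clip. A minimal sketch, using a stub generator in place of the real model (the function names here are ours, not part of the VoxCPM2 API):

```python
def fake_stream(text, chunk_words=3):
    """Stand-in for a streaming TTS generator: yields pieces of output
    as they become available instead of one finished clip. Here the
    "audio" chunks are just word groups, for illustration."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        yield " ".join(words[i:i + chunk_words])

def play_as_it_arrives(stream):
    """Consume each chunk immediately; in a real app this is where you
    would push samples to the audio output buffer, cutting perceived
    latency to the time of the first chunk."""
    received = []
    for chunk in stream:
        received.append(chunk)  # playback would happen here
    return received

chunks = play_as_it_arrives(fake_stream("one two three four five six seven"))
```

The point of the pattern is that latency becomes time-to-first-chunk rather than time-to-full-clip, which is what makes interactive use feasible.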

The voice design feature

Describe a voice in plain text, wrap it in parentheses at the start of your prompt, and VoxCPM2 generates it. No reference audio, no sample recording needed.

Gender, age, tone, emotion, pace, you can specify any of it or combine them. Want a formal deep voice that sounds slightly tired? Describe it. Want something warm and conversational with a fast delivery? Describe that instead.
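Since the convention is simply a parenthesized description at the start of the prompt, assembling one is trivial. A small sketch of that format as the article describes it (the helper name is ours, not part of the library):

```python
def design_prompt(description: str, text: str) -> str:
    """Prepend a parenthesized voice description to the text to speak,
    the prompt format described for VoxCPM2's voice design mode.
    This helper is illustrative, not part of the VoxCPM2 API."""
    return f"({description.strip()}) {text.strip()}"

prompt = design_prompt(
    "young woman, gentle tone, slightly slow pace",
    "Welcome back. Let's pick up where we left off.",
)
```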

The honest limitation is that results vary between runs. The documentation recommends generating one to three times to get what you want rather than expecting the first output to be perfect. That is a minor inconvenience, not a dealbreaker, especially when the alternative is hunting down a voice actor or recording your own sample.

For content creators, indie game developers, app builders, or anyone who has ever needed a specific voice and had no way to get it, this feature alone is worth the install.

Three ways to clone

Once you have a voice you want to work with, VoxCPM2 gives you three distinct approaches depending on how much control you need.

The basic route is controllable cloning. You give it a short audio clip of the target voice and it clones it. Simple. But you can also layer style guidance on top, telling it to deliver the cloned voice with a cheerful tone, or slightly faster, or more subdued. The timbre stays the same, the delivery changes. That combination is useful when you want a consistent voice across content but need it to carry different emotional weight in different places.

The second approach is what they call ultimate cloning. You provide the reference audio plus the exact transcript of what is being said in that clip. The model uses both together to capture every nuance of the original, not just the general voice characteristics but the specific way that person breathes, pauses, and moves between sounds. The result is noticeably closer to the source than standard cloning.

The third is voice design, which we already covered. No audio needed at all.

Most people will start with controllable cloning and reach for ultimate cloning when they need the output to be as close to a specific person’s voice as possible. Voice design is its own use case entirely, more for creating something new than reproducing something existing.
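The three modes differ only in which inputs you supply. A sketch of that decision logic, useful for wrapping the model behind a single entry point (field and mode names here are ours, not the real API):

```python
def build_request(text, ref_audio=None, ref_text=None, style=None):
    """Illustrative only: classify which of VoxCPM2's three modes a
    set of inputs corresponds to. Field names are our invention.
    - no reference audio              -> voice design
    - audio only (optional style)     -> controllable cloning
    - audio plus exact transcript     -> ultimate cloning
    """
    if ref_audio is None:
        mode = "voice_design"
    elif ref_text is None:
        mode = "controllable_clone"
    else:
        mode = "ultimate_clone"
    return {"mode": mode, "text": text, "ref_audio": ref_audio,
            "ref_text": ref_text, "style": style}
```

Note that style guidance is orthogonal: it can ride along with controllable cloning without changing which mode you are in.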

Related: Open-Source TTS Models That Can Clone Voices and Actually Sound Human

How it performs

On an RTX 4090 the real time factor sits around 0.30, meaning it generates audio roughly three times faster than the audio plays back. With Nano-vLLM acceleration that drops to around 0.13, which is fast enough for genuinely interactive applications.
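The arithmetic behind those numbers is worth making explicit, since real time factor is defined as generation time divided by audio duration:

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = generation time / audio duration,
    so generation time = audio duration * RTF."""
    return audio_seconds * rtf

# At RTF 0.30 (RTX 4090), a 60-second clip takes about 18 seconds to
# generate, i.e. roughly 1 / 0.30 ≈ 3.3x faster than playback.
# With Nano-vLLM at RTF 0.13, that becomes about 1 / 0.13 ≈ 7.7x.
sixty_second_clip = generation_time(60, 0.30)
```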

On more modest hardware it is slower but still usable. The streaming mode helps here because you start getting audio back before the full generation is complete, which matters if you are building something where latency is noticeable.

The 48kHz output is the other practical performance detail. Most TTS models top out at 22kHz or 24kHz. 48kHz is studio quality, the kind of output you can drop directly into a production workflow without upsampling it afterward.

I tested the audio samples across mixed languages and different emotional prompts. The results are natural enough that you focus on the content rather than the generation. Not perfect, there are moments where expressiveness feels slightly mechanical on longer inputs, which matches the model card’s own warning about instability with very long or highly expressive text. But for most practical use cases it holds up well.

Benchmark results across Seed-TTS-eval, CV3-eval, and other standard TTS evaluations are available in the GitHub repo. We are not going to reproduce the full tables here but the numbers are competitive with leading models in the space according to their own testing.

Who it is for

If you create content regularly, build voice interfaces, work in game development, or just need a flexible TTS tool you can run privately on your own hardware, VoxCPM2 is genuinely worth trying. The Apache 2.0 license means commercial use is completely fine with no strings attached.

Developers building multilingual applications will find the 30 language support without language tagging particularly useful. You do not need to build language detection into your pipeline; the model handles it.

For individuals who want a local voice assistant, a narration tool, or something to experiment with, the hardware requirements are reasonable enough that most people with a gaming GPU can run it without issues. Getting it running locally is straightforward; the GitHub repo has clear instructions to get you started.

Current limitations

Voice design and style control results vary between runs. The model itself recommends generating multiple times to get the output you want, which adds friction to any workflow that needs consistency at scale.

Performance across the 30 supported languages is uneven. Languages with less training data representation will produce noticeably weaker results than English, Chinese, or other high resource languages in the training set.

Very long inputs and highly expressive prompts can produce unstable output. For short to medium length content it holds up well. For anything extended, splitting into shorter segments is the safer approach.
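Splitting long text at sentence boundaries is easy to automate. A minimal sketch (the character threshold is our guess, not a documented limit; tune it for your content):

```python
import re

def split_for_tts(text: str, max_chars: int = 300) -> list[str]:
    """Split text at sentence boundaries into segments no longer than
    max_chars, a simple workaround for instability on very long
    inputs. The 300-character default is an assumption, not a
    documented VoxCPM2 limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments
```

Each segment can then be generated separately and the audio concatenated, at the cost of slightly less natural prosody across segment boundaries.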

Small but powerful

VoxCPM2 is not going to replace a professional voice actor for high stakes production work. But for everything below that bar, which is most of what people actually need, it is one of the most flexible open source TTS options available right now. Try the demo first. If the voice quality works for your use case, the install takes about two minutes.
