
4 Open-Source TTS Models That Can Clone Voices and Actually Sound Human

Voice cloning used to mean expensive studio software, proprietary APIs with per-character pricing, or models so heavy they needed server infrastructure just to run. That changed quietly over the last few months.

Four open source models available right now do something the previous generation struggled with. They do not just generate speech. They clone a voice from a short audio sample and produce output that is genuinely difficult to distinguish from the original speaker.

The gap between open source and commercial TTS has been closing for a while. These four models suggest it has effectively closed for voice cloning specifically. Here is what each one actually does and who it is for.

1. OmniVoice

OmniVoice Demo

OmniVoice supports over 600 languages, the broadest language coverage of any open source zero-shot TTS model released to date.

Built on a diffusion language model architecture with Qwen3-0.6B as the text encoder, OmniVoice is surprisingly small for what it does. Give it a short reference audio clip and it clones the voice and generates speech in whatever language you need. It also supports voice design, where you describe speaker attributes like gender, age, pitch, accent, or even whisper mode, and the model constructs a voice matching those characteristics without needing a reference clip at all.

Inference speed sits at an RTF of 0.025. That means it generates 40 seconds of audio for every one second of compute. For a model covering this many languages that number is genuinely impressive.
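The arithmetic behind that figure is easy to check. This is a minimal sketch assuming the usual definition of real-time factor (compute time divided by the duration of the audio produced). The function names are illustrative and not part of OmniVoice.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def audio_per_compute_second(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# The figure quoted above: an RTF of 0.025.
speedup = audio_per_compute_second(0.025)
print(f"{speedup:.0f} seconds of audio per second of compute")
```

An RTF below 1.0 means faster than real time, so lower is better.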

I would include this for any project touching multilingual voice generation. Nothing else at this size comes close to the language coverage.

Try it on HuggingFace Spaces

Limitations

  • Voice design feature requires familiarity with attribute prompting
  • Best results with clean reference audio, background noise affects cloning quality

2. LongCat-AudioDiT

LongCat-AudioDiT Demo

Most conventional TTS pipelines convert text to a spectrogram first, then convert that spectrogram to audio. Two steps, two places for errors to compound, two stages of quality loss baked into the architecture by design.

LongCat skips the spectrogram entirely. It works directly in the waveform latent space, which means what you hear is one generation step closer to what the model actually learned. The result shows in the benchmarks. On the Seed benchmark, the hardest voice cloning evaluation available, LongCat-AudioDiT-3.5B achieved a speaker similarity score of 0.818 on Seed-ZH and 0.797 on Seed-Hard. Both beat the previous state of the art.
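For context on what a score like 0.818 represents: speaker similarity in evaluations of this kind is typically the cosine similarity between a speaker embedding of the reference clip and one of the generated clip. The sketch below assumes that definition. The embedding values are made-up toy numbers, and a real evaluation would obtain them from a speaker verification model rather than hard-coding them.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings for illustration only (assumption: not produced by any model).
reference_embedding = [0.12, -0.48, 0.31, 0.77]
generated_embedding = [0.10, -0.52, 0.25, 0.80]

print(round(cosine_similarity(reference_embedding, generated_embedding), 3))
```

Scores closer to 1.0 mean the generated voice sits closer to the reference speaker in embedding space.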

It comes in two sizes. The 1B variant is fast and capable for most use cases. The 3.5B variant is where the benchmark numbers above come from. Voice cloning works by passing a reference audio clip alongside your target text. No fine tuning or training, just inference.

It is MIT licensed and both sizes are available on HuggingFace.

Limitations

  • 3.5B variant needs a capable GPU for comfortable inference
  • Currently stronger on Chinese than English based on benchmark scores

3. FireRedTTS-2

FireRedTTS-2 Demo

Every TTS model in this list can clone a single voice and generate a sentence. FireRedTTS-2 does something none of the others do. It generates multi-speaker conversations with natural speaker switching, context-aware prosody, and first-packet latency as low as 140ms on an L20 GPU.

That is a different use case entirely. If you are building a podcast generator, a chatbot with realistic dialogue, or a voice interface where two speakers need to interact naturally over minutes of audio, FireRedTTS-2 is one of the best open source options doing this reliably right now. It supports up to four speakers in a single generation run, up to three minutes of dialogue, and handles cross-lingual voice cloning so Speaker 1 can be in English and Speaker 2 in Japanese without breaking the output.
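Those limits are worth checking before a script ever reaches the model. The sketch below is one way to do that, under stated assumptions: the script format (a list of speaker and text pairs), the function name, and the 150 words-per-minute speaking rate are all hypothetical choices for illustration, not part of FireRedTTS-2. The duration is only a rough estimate, since real length depends on the generated audio.

```python
from typing import List, Tuple

MAX_SPEAKERS = 4                 # limit stated in the article
MAX_DURATION_SECONDS = 180       # three minutes, as stated in the article
ASSUMED_WORDS_PER_MINUTE = 150   # assumption: rough conversational speaking rate


def check_dialogue(turns: List[Tuple[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the script fits the limits."""
    problems = []

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        problems.append(f"{len(speakers)} speakers, limit is {MAX_SPEAKERS}")

    # Whitespace word counts only make sense for space-delimited languages.
    word_count = sum(len(text.split()) for _, text in turns)
    estimated_seconds = word_count / ASSUMED_WORDS_PER_MINUTE * 60
    if estimated_seconds > MAX_DURATION_SECONDS:
        problems.append(
            f"estimated {estimated_seconds:.0f}s, limit is {MAX_DURATION_SECONDS}s"
        )
    return problems


script = [
    ("host", "Welcome back to the show."),
    ("guest", "Thanks for having me."),
]
print(check_dialogue(script) or "script fits the stated limits")
```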

Language support covers English, Chinese, Japanese, Korean, French, German, and Russian. The streaming architecture means you do not wait for the full audio to generate before playback starts. It streams sentence by sentence.
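The idea of sentence-level streaming can be sketched with a generator. This is not FireRedTTS-2's implementation: `fake_synthesize` is a hypothetical stand-in for a real model call, and the regex-based sentence splitter is deliberately naive (it only handles `.`, `!`, and `?` followed by whitespace).

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def stream_speech(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence so playback can start early."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


def fake_synthesize(sentence: str) -> bytes:
    # Hypothetical stand-in for a model call; returns placeholder bytes, not audio.
    return sentence.encode("utf-8")


for chunk in stream_speech("First sentence. Second one! Third?", fake_synthesize):
    print(len(chunk), "bytes ready for playback")
```

Because the generator yields after each sentence, the caller can begin playing the first chunk while later sentences are still being produced.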

At 20.9GB it is the heaviest model in this list. Apache 2.0 licensed. Weights on HuggingFace under FireRedTeam.

Worth knowing

  • Strongest on Chinese and English, other languages less thoroughly evaluated
  • Voice cloning intended for academic research per the model’s own disclaimer, use responsibly
  • Dialogue generation currently capped at three minutes and four speakers

4. Fish Audio S2 Pro

Fish Audio S2 Pro Demo

On the Audio Turing Test, human listeners correctly identified S2 Pro as AI only 48.5% of the time. Essentially a coin flip. That single number tells you more about where this model sits than any other benchmark.

Fish Audio S2 Pro is a 4B parameter model trained on over 10 million hours of audio across 80 plus languages. Voice cloning works from a 10 to 30 second reference sample, no fine tuning required. But what separates S2 Pro from everything else in this list is granular emotional control. Using simple bracket tags you can embed emotional instructions anywhere in the text. Whisper, excited, laughing, angry, inhale, pause, emphasis. Over 15,000 unique tags supported, not fixed presets, free form descriptions the model actually understands.
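To make the tagging idea concrete, here is a small sketch that separates inline tags from spoken text. It assumes square brackets such as `[whisper]`; the article only says "bracket tags", so check the exact syntax against Fish Audio's own documentation. The parser is purely illustrative and is not how the model itself reads tags.

```python
import re
from typing import List, Tuple

# Assumption: tags are written in square brackets, e.g. [whisper].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ("tag", value) and ("text", value) segments, in order."""
    segments = []
    position = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[position:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


script = "[whisper] I should not be telling you this. [pause] [excited] But we won!"
for kind, value in split_tags(script):
    print(kind, value)
```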

On Seed-TTS Eval it achieved the lowest word error rate among all evaluated models including closed source systems like Seed-TTS and MiniMax Speech.
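For readers unfamiliar with the metric: in TTS evaluation, word error rate is usually measured by transcribing the generated audio with a speech recognizer and comparing the transcript to the input text. The sketch below implements the standard definition (word-level edit distance divided by the number of reference words). The example strings are made up, and the lowercase-and-split tokenization is a simplification.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: the input text versus a transcript of the generated audio.
target_text = "the cat sat on the mat"
transcript = "the cat sit on mat"
print(f"WER: {word_error_rate(target_text, transcript):.3f}")
```

A lower WER means the generated speech is more intelligible to the recognizer, which is why it is used as a proxy for pronunciation accuracy.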

Limitations

  • Fish Audio Research License, free for personal and research use, commercial use requires a separate paid license from Fish Audio
  • Requires HuggingFace access and local GPU for self hosting
  • SGLang recommended for best streaming performance

Audio that sounds natural

A year ago the gap between open source and commercial TTS was obvious the moment you hit play. Robotic cadence, clipped consonants, speaker similarity that fooled nobody. These four models do not sound like that.

OmniVoice covers more languages than any other model at its size. LongCat beats previous state of the art on speaker similarity by skipping the spectrogram entirely. FireRedTTS-2 handles multi-speaker conversations in a way nothing else in open source does. Fish Audio S2 Pro passed a human Turing test.

The hardware requirements vary and the licenses are not all equal. But the output quality across all four is at a level that would have seemed unrealistic in open source twelve months ago.
