5 Open Source TTS Models So Small and Capable You Can Run Local Voice AI on Almost Anything

If you are looking for a lightweight AI voice model that actually sounds good, I found five that genuinely impressed me. They are small, open source, and free to run locally. Some need no GPU at all. One is just 25MB.

For their size they sound closer to paid platforms like ElevenLabs than anyone would expect.

1. KittenTTS

I almost skipped this one because 15 million parameters sounds like too little for a TTS model. Then I listened to it.

It is not perfect, but for something that fits in 25MB and runs on a CPU with no GPU at all, it is genuinely surprising. It has eight built in voices, real time inference, and works on almost any device that runs Python.

There are three variants worth knowing. The nano at 15M parameters and 25MB is the one that breaks your expectations. The micro at 40M sits in the middle. The mini at 80M is the most capable of the three and still smaller than most apps on your phone.
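
A quick back-of-the-envelope check puts those sizes in context. The sketch below estimates weight storage from a parameter count. The bytes-per-parameter values are the standard widths of common numeric formats, not figures published for KittenTTS, and the function name is hypothetical.

```python
# Rough estimate of weight storage from a parameter count.
# Widths are standard numeric formats, not KittenTTS-specific figures.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    size = weight_size_mb(15_000_000, dtype)
    print(f"15M parameters as {dtype}: about {size:.0f} MB")
```

A 25MB file holding 15M parameters works out to under 2 bytes per parameter, which would be consistent with weights stored in a reduced precision format, though that is an inference from the numbers rather than something stated here.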

Features of KittenTTS

  • 15M parameters, 25MB in the smallest variant
  • Runs on CPU with zero GPU required
  • Eight built in voices including Bella, Jasper, Luna, Bruno and more
  • Real time inference optimized
  • Apache 2.0 licensed

VRAM requirements: Zero. Runs entirely on CPU.

Best for

  • Developers who need the lightest possible local TTS
  • Edge deployment, low resource environments, any device

2. Kokoro 82M

82 million parameters, trained for roughly $1,000 on a few hundred hours of audio, and it consistently ranks at the top of open source TTS arenas.

That cost detail is not just trivia. It tells you something about how efficient this architecture is. Most models at this quality level cost hundreds of thousands of dollars to train. Kokoro did it for less than a used car.

The voice quality is where it actually surprises you. It does not sound like a lightweight model. Naturalness, pacing, prosody, it handles all of it better than models three or four times its size. 54 voices across 8 languages give you enough variety for most real world use cases.

Apache 2.0 licensed, already deployed in many products, and available via API for under $1 per million characters if you do not want to run it locally.
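
To put the API price in perspective, here is a minimal sketch that turns a character count into a cost ceiling. It assumes a flat rate of $1 per million characters as the upper bound; actual provider pricing will vary, and the function name is hypothetical.

```python
def api_cost_upper_bound(num_chars: int, usd_per_million: float = 1.0) -> float:
    """Return the cost ceiling in USD for synthesizing num_chars characters.

    The default rate of $1 per million characters is an assumed upper bound,
    not a quote from any specific provider.
    """
    return num_chars / 1_000_000 * usd_per_million


# Example: a 500,000 character manuscript.
print(f"${api_cost_upper_bound(500_000):.2f} at most")
```

At that rate, even a full length book stays under a dollar, which is why running it locally is mostly about privacy and latency rather than cost.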

Features of Kokoro 82M

  • 82M parameters, significantly faster than comparable quality models
  • 54 voices across 8 languages
  • Apache 2.0 licensed
  • Under $1 per million characters via API
  • Runs on modest hardware

VRAM requirements: Low. Runs comfortably on modest GPUs and CPU inference is possible.

Best for

  • Production deployments that need quality without heavy compute
  • Developers who want commercial friendly licensing with no compromises on voice quality

3. LuxTTS

Most TTS models output at 24 kHz. LuxTTS outputs at 48 kHz. That difference is immediately noticeable: the audio sounds cleaner and more detailed than what you are used to hearing from local models.

The speed is the other thing that gets me: 150x realtime on a single GPU. That means a one minute audio clip generates in about 0.4 seconds. It also runs faster than realtime on CPU, which puts it in a different category from most voice cloning models.
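
The two numbers above are easy to sanity check. The sketch below converts a realtime factor into generation time, and a sample rate into the highest frequency it can represent (half the sample rate, per the Nyquist limit). Both function names are hypothetical helpers, not part of LuxTTS.

```python
def generation_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time needed to generate a clip at a given realtime factor."""
    return audio_seconds / realtime_factor


def max_frequency_khz(sample_rate_khz: float) -> float:
    """Highest representable frequency: half the sample rate (Nyquist limit)."""
    return sample_rate_khz / 2


print(f"60 s clip at 150x realtime: {generation_seconds(60, 150):.1f} s")
for rate in (24, 48):
    print(f"{rate} kHz output captures up to {max_frequency_khz(rate):.0f} kHz")
```

Doubling the sample rate doubles the frequency range the output can carry, which is one plausible reason the 48 kHz audio sounds more detailed.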

Voice cloning needs just a 3 second reference clip. The community has already built Gradio apps, a clean UI called OptiSpeech, and even ComfyUI nodes around it. For a model with 1.8K stars that is a healthy ecosystem.

Float16 inference is still in the works and should make it nearly 2x faster, though it is already fast enough that this barely matters.

Features of LuxTTS

  • 48 kHz audio output vs the standard 24 kHz
  • 150x realtime speed on GPU, faster than realtime on CPU
  • Voice cloning from a 3 second reference clip
  • Runs on GPU, CPU, and Apple Silicon MPS
  • Apache 2.0 licensed

VRAM requirements: Fits within 1GB VRAM. Works on any local GPU.

Best for

  • Creators who want the cleanest local audio quality
  • Voice cloning without heavy hardware requirements

4. CosyVoice 2

If you are building anything that needs real time voice, such as a voice bot, a streaming assistant, or anything else that has to respond fast, CosyVoice2 is the one to look at.

150ms latency. It supports both text input streaming and audio output streaming simultaneously which is what makes that latency possible. Most models process the full input before generating any audio. CosyVoice2 starts generating while text is still coming in.
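
The difference between the two approaches can be sketched with a toy simulation. This is not the CosyVoice2 API: the timing constants are made up for illustration, and the function names are hypothetical. It only shows why time to first audio drops when synthesis starts before the full text has arrived.

```python
from typing import Iterator, List, Tuple

# Hypothetical timings for illustration only; these are not CosyVoice2 measurements.
TEXT_CHUNK_INTERVAL = 0.2   # seconds between incoming text chunks
SYNTH_TIME_PER_CHUNK = 0.1  # seconds to synthesize audio for one chunk


def text_stream(chunks: List[str]) -> Iterator[Tuple[float, str]]:
    """Yield (arrival_time, chunk) pairs, like tokens arriving from an LLM."""
    for i, chunk in enumerate(chunks, start=1):
        yield i * TEXT_CHUNK_INTERVAL, chunk


def batch_first_audio(chunks: List[str]) -> float:
    """Wait for the full text, synthesize all of it, then emit audio."""
    arrivals = [t for t, _ in text_stream(chunks)]
    return arrivals[-1] + len(arrivals) * SYNTH_TIME_PER_CHUNK


def streaming_audio_times(chunks: List[str]) -> Iterator[float]:
    """Yield the time at which each audio chunk becomes available."""
    synth_free_at = 0.0
    for arrival, _ in text_stream(chunks):
        # Synthesis needs both the text chunk and a free synthesizer.
        start = max(arrival, synth_free_at)
        synth_free_at = start + SYNTH_TIME_PER_CHUNK
        yield synth_free_at


chunks = "CosyVoice2 starts generating audio while text is still coming in".split()
print(f"batch first audio:     {batch_first_audio(chunks):.1f} s")
print(f"streaming first audio: {next(streaming_audio_times(chunks)):.1f} s")
```

With these made-up timings, the batch pipeline produces its first audio after 3.0 seconds, while the streaming pipeline produces it after 0.3 seconds. The gap grows with the length of the input, which is why streaming matters most for long responses.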

The benchmark numbers back it up too. At 0.5B parameters it holds its own against models three times its size on speaker similarity and content consistency. The RL version of Fun-CosyVoice3 which builds on this architecture actually beats most 1.5B models on the hard test set.

Nine languages plus Chinese dialects, emotion control, speed and volume instructions, and zero shot voice cloning, all in a 0.5B package.

Features of CosyVoice2 0.5B

  • 150ms ultra low latency streaming
  • Bi-streaming: text in and audio out simultaneously
  • Zero shot voice cloning
  • 9 languages plus 18 Chinese dialects
  • Emotion, speed and volume control via instructions

VRAM requirements: Consumer GPU recommended for smooth streaming inference.

Best for

  • Voice bots and real time conversational AI
  • Streaming applications where latency matters

5. MeloTTS

Most lightweight TTS models pick one language and do it well. MeloTTS covers English, Spanish, French, Chinese, Japanese and Korean in a single small model, and its English support alone comes in four accents: American, British, Indian and Australian.

That accent variety is the part I did not expect. Most models treat English as one thing. MeloTTS treats it as four distinct speakers which matters if your product has a global audience.

MIT licensed, built by researchers from Tsinghua University and MIT. Fast enough for CPU real time inference which means no GPU required at all. Just install, pick your language and accent, and generate.

The Chinese speaker also handles mixed Chinese and English naturally which is genuinely useful for anyone building for that audience.

Features of MeloTTS

  • 6 languages including 4 English accents
  • CPU real time inference, no GPU needed
  • Mixed Chinese and English support
  • MIT licensed, free for commercial use
  • Simple API, minimal setup

VRAM requirements: Zero. Runs entirely on CPU in real time.

Best for

  • Multilingual products that need one lightweight model for multiple languages
  • Developers who want accent variety without running separate models

So which one is right for you?

  • KittenTTS: The one to grab if size and hardware constraints are the priority.
  • Kokoro 82M: Best quality to size ratio on this list. Nothing at this weight sounds this good right now.
  • LuxTTS: Voice cloning with the cleanest audio output. 48 kHz and 150x realtime speed in under 1GB VRAM is genuinely impressive.
  • CosyVoice2 0.5B: Built for real time. If latency matters for what you are building this is the one.
  • MeloTTS: Six languages, four English accents, runs on CPU. The obvious pick for multilingual products.
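
If you want that summary in a form you can drop into a script, here is a minimal sketch. The priority labels and the `pick_model` function are hypothetical names made up for this example; the mapping simply mirrors the list above.

```python
# Priority-to-model mapping that mirrors the summary list above.
# The priority labels are made up for this example.
PICKS = {
    "size": "KittenTTS",
    "quality": "Kokoro 82M",
    "voice cloning": "LuxTTS",
    "latency": "CosyVoice2 0.5B",
    "multilingual": "MeloTTS",
}


def pick_model(priority: str) -> str:
    """Return the recommended model for a given priority."""
    key = priority.strip().lower()
    if key not in PICKS:
        raise ValueError(f"unknown priority {priority!r}; choose from {sorted(PICKS)}")
    return PICKS[key]


print(pick_model("latency"))
```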

All five are free, open source, and need no subscription or API key. That is the real win here.
