
DramaBox: An Open-Weight TTS Model Built Around Stage Directions


Listen to this.

via Hugging Face / DramaBox demo

That’s not a voice actor. It’s a prompt. Someone wrote stage directions around dialogue, handed it to a model, and got back a villain catching his breath mid-monologue before dropping to a whisper.

DramaBox just landed on Hugging Face, and the demo space is live. Resemble AI built it on top of Lightricks’ LTX-2.3, and the thing that makes it different from every other TTS model is simpler than you’d expect: you don’t give it text to read. You write it a scene.

Write a scene, not a sentence

Most TTS models you’ve used before work the same way. You paste text, you get speech. The model decides tone, pacing, and delivery. You get what you get.

DramaBox works differently. You don’t hand it text to read; you hand it a script.

Stage directions go outside the quotes and work as performance cues the model never speaks aloud. Dialogue goes inside the quotes and gets spoken literally, including phonetic sounds. “Hahaha” is a laugh. “Hmm” is a pause. “Ugh” is exactly what it sounds like. The model reads the room around the dialogue and performs accordingly.

Take a line like: A regal woman speaks with cold fury. “I have told you a thousand times.” That produces something categorically different from feeding the model the quoted sentence alone. The direction shapes the delivery. The result sounds less like synthesis and more like someone actually inhabited the line before speaking it.
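To make the format concrete, here’s a minimal sketch in Python of how you might assemble a scene prompt. The helper function is mine, not part of any DramaBox tooling; the only rule it encodes is the one above: directions outside the quotes, dialogue inside.

```python
# Illustrative helper, not part of any DramaBox tooling. It just encodes
# the prompt rule: stage directions outside the quotes, dialogue inside.

def scene_prompt(direction: str, dialogue: str) -> str:
    """Combine a performance cue with a literally-spoken line."""
    # The direction is never spoken aloud; the quoted text is spoken
    # verbatim, including phonetic sounds like "Hahaha" or "Hmm".
    return f'{direction} "{dialogue}"'


print(scene_prompt(
    "A regal woman speaks with cold fury.",
    "I have told you a thousand times.",
))
# A regal woman speaks with cold fury. "I have told you a thousand times."
```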

It’s a strange way to think about a speech model, but the closest analogy is a screenwriting format that happens to double as a prompt.

What’s powering it

DramaBox is an IC-LoRA fine-tune of LTX-2.3, a 3.3B diffusion transformer from Lightricks that was originally built for video. Resemble AI took the audio branch, trained on top of it, and conditioned the whole thing on Gemma 3 12B text embeddings, which is what lets it actually parse and respond to natural language directions.

The architecture is why the prompt format works. Most TTS models are essentially glorified vocoders. This one has a large language model reading your stage directions before it ever produces audio.
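You can sketch the conditioning step yourself. The snippet below is a rough approximation, assuming the standard Hugging Face transformers API and the google/gemma-3-12b-it checkpoint as the stand-in for “Gemma 3 12B”; the commented-out final line is a hypothetical placeholder for DramaBox’s actual inference entry point, which isn’t reproduced here.

```python
# Rough sketch of the conditioning step, not the shipped inference code.
# Assumes the Hugging Face transformers API; the final line is a
# hypothetical stand-in for DramaBox's real entry point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-12b-it"  # assumed checkpoint for "Gemma 3 12B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = 'A regal woman speaks with cold fury. "I have told you a thousand times."'
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # The LLM reads the whole scene, stage directions included. Its final
    # hidden states are the text embeddings the diffusion transformer
    # gets conditioned on.
    outputs = llm(**tokens, output_hidden_states=True)
    text_embeddings = outputs.hidden_states[-1]

# audio = dramabox.generate(text_embeddings)  # hypothetical call
```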


Voice cloning, and what it actually needs

The voice cloning part is simple too. You can optionally give DramaBox a 10-second voice sample. If you don’t, the model just picks a voice that matches the scene you wrote. If you do, it tries to speak in that person’s voice instead.

Some of the voice cloning demos get weirdly convincing. The model isn’t just copying the voice; it’s copying the delivery style around it. Pauses. Breathing. The little changes in tone when someone starts laughing halfway through a sentence. It feels less like cloning a voice and more like directing a performance.

Reference Audio

via Hugging Face / DramaBox demo

Generated Audio

via Hugging Face / DramaBox demo
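The behavior is easy to pin down even without the real API. Here’s the call shape as a hypothetical stub: the function name and parameters are invented for illustration, and only the behavior described in the docstring, the optional ~10-second reference clip, comes from DramaBox itself.

```python
# Hypothetical call shape, not a documented API. Only the behavior in the
# docstring comes from DramaBox; the names and parameters are invented.

def generate_speech(prompt: str, reference_audio: str | None = None) -> bytes:
    """Stand-in for DramaBox inference.

    Without a reference clip, the model casts a voice that fits the scene
    you wrote. With a ~10-second clip, it tries to clone that speaker,
    delivery style and all: pauses, breathing, mid-sentence laughs.
    """
    raise NotImplementedError("stand-in for the real inference call")


# No sample: the model picks a voice to match the scene.
# generate_speech('A tired villain whispers. "Fine. You win."')

# With a sample: the model clones the speaker in villain.wav.
# generate_speech('A tired villain whispers. "Fine. You win."',
#                 reference_audio="villain.wav")
```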

Limitations

First, hardware. Peak VRAM sits around 24GB, and Gemma 3 12B downloads automatically on first run, a roughly 8GB pull. This isn’t a consumer-GPU situation. If you want to try it without committing to a local setup, the ZeroGPU demo space on Hugging Face runs it in the browser.
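If you’d rather script against the hosted demo than run it locally, Gradio spaces generally expose a client API. A sketch with gradio_client follows, with the caveat that the space id and endpoint name here are guesses, not the published interface:

```python
# Sketch of calling a hosted demo via gradio_client (a real library).
# The space id and api_name are assumptions; check the space's
# "Use via API" panel for the actual values.
from gradio_client import Client

client = Client("ResembleAI/DramaBox")  # hypothetical space id
result = client.predict(
    'A regal woman speaks with cold fury. "I have told you a thousand times."',
    api_name="/generate",  # assumed endpoint name
)
print(result)  # Gradio clients typically return a local filepath to the audio
```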

Second, the license. DramaBox ships under the LTX-2 Community License, which sounds open but carries a meaningful condition: companies with over $10M in annual revenue need a separate commercial agreement with Lightricks. It’s open weights, not open source, and worth reading before you build anything on top of it.

Who actually needs this

This is probably overkill for anyone who just wants a basic text-to-speech API. But for people building audio-native experiences, the model starts making a lot more sense.

Game studios generating expressive NPC dialogue. Audio drama pipelines. Dubbing workflows. Podcast tooling. Character prototyping. Interactive storytelling systems where delivery matters as much as the words themselves.

Most TTS systems optimize for intelligibility. DramaBox optimizes for performance.
