DramaBox: An Open-Weight TTS Model Built Around Stage Directions

Listen to this.

via Hugging Face / DramaBox demo

That’s not a voice being synthesized. It’s a prompt. Someone wrote stage directions around dialogue, handed it to a model, and got back a villain catching his breath mid-monologue before dropping to a whisper.

Dramabox just landed on Hugging Face and the demo space is live. Resemble AI built it on top of Lightricks’ LTX-2.3, and the thing that sets it apart from most other TTS models is simpler than you’d expect: you don’t give it text to read. You write it a scene.

Write a scene, not a sentence

Most TTS models you’ve used before work the same way. You paste text, you get speech. The model decides tone, pacing, delivery. You get what you get.

Dramabox works differently. Instead of pasting text for it to read, you write it a script.

Stage directions go outside the quotes and work as performance cues the model never speaks aloud. Dialogue goes inside the quotes and gets spoken literally, including phonetic sounds. “Hahaha” is a laugh. “Hmm” is a pause. “Ugh” is exactly what it sounds like. The model reads the room around the dialogue and performs accordingly.

A line like A regal woman speaks with cold fury. “I have told you a thousand times.” produces something categorically different from just feeding the model that sentence. The direction shapes the delivery. The result sounds less like synthesis and more like someone actually inhabited the line before speaking it.
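To make the format concrete, here is a minimal sketch of a prompt builder that keeps the direction outside the quotes and the dialogue inside them. The function names are hypothetical, the choice of plain ASCII double quotes is an assumption (the article does not say which quote characters the model expects), and chaining several beats into one prompt is also an assumption. It only builds the string; it does not call the model.

```python
def build_scene_prompt(direction: str, dialogue: str, quote: str = '"') -> str:
    """Join a stage direction and a spoken line into one prompt string.

    The direction stays outside the quotes (a performance cue, not spoken).
    The dialogue goes inside the quotes and is meant to be spoken literally.
    The quote character is an assumption; adjust it if the model needs another.
    """
    direction = direction.strip()
    # Drop any quotes the caller already wrapped around the line.
    dialogue = dialogue.strip().strip('"\u201c\u201d').strip()
    if not dialogue:
        raise ValueError("dialogue must not be empty")
    # End the direction with punctuation so it reads as its own sentence.
    if direction and direction[-1] not in ".!?":
        direction += "."
    quoted = f"{quote}{dialogue}{quote}"
    return f"{direction} {quoted}" if direction else quoted


def build_scene(beats: list[tuple[str, str]]) -> str:
    """Chain several (direction, dialogue) beats into a single scene prompt."""
    return " ".join(build_scene_prompt(d, line) for d, line in beats)
```

Calling `build_scene_prompt("A regal woman speaks with cold fury", "I have told you a thousand times.")` returns the direction followed by the quoted line, which is the shape of the example above.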

It’s a strange thing to call a speech model, but the closest analogy is a screenwriting format that happens to also be a prompt.

What’s powering it

Dramabox is an IC-LoRA fine-tune of LTX-2.3, a 3.3B diffusion transformer from Lightricks that was originally built for video. Resemble AI took the audio branch, trained on top of it, and conditioned the whole thing on Gemma 3 12B text embeddings, which is what lets it actually parse and respond to natural language directions.

The architecture is why the prompt format works. Most TTS models are essentially glorified vocoders. This one has a large language model reading your stage directions before it ever produces audio.

Voice cloning and what it actually needs

The voice cloning part is simple too. You can optionally give Dramabox a 10-second voice sample. If you don’t, the model just picks a voice that matches the scene you wrote. If you do, it tries to speak in that person’s voice instead.

Some of the voice cloning demos get weirdly convincing. The model is not only copying the voice, it’s copying the delivery style around it. Pauses. Breathing. The little changes in tone when someone starts laughing halfway through a sentence. This feels closer to directing a performance.
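If your reference recording runs longer than the 10 seconds mentioned above, you can cut it down before uploading. This is a rough sketch using only Python's standard `wave` module. It assumes an uncompressed PCM WAV file, and the function name and the decision to keep the first 10 seconds are illustrative choices, not something Dramabox requires.

```python
import wave


def trim_reference_clip(src_path: str, dst_path: str, max_seconds: float = 10.0) -> float:
    """Copy at most `max_seconds` from the start of a PCM WAV file to `dst_path`.

    Returns the duration, in seconds, of the clip that was written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Never ask for more frames than the file actually holds.
        n_frames = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(n_frames)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and sample rate
        dst.writeframes(frames)  # the header frame count is corrected on write

    return n_frames / rate
```

A shorter clip passes through unchanged, so it is safe to run on every sample.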

Reference Audio

via Hugging Face / DramaBox demo

Generated Audio

via Hugging Face / DramaBox demo

Limitations

First, hardware. Peak VRAM sits around 24GB and Gemma 3 12B downloads automatically on first run at around 8GB. This isn’t a consumer GPU situation. If you want to try it without committing to a local setup, the ZeroGPU demo space on Hugging Face runs it in the browser.
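Before committing to a local setup, a quick check against those two numbers can save a failed first run. The sketch below hard-codes the approximate figures quoted above. The function names are hypothetical, and you have to supply your GPU's VRAM yourself because the standard library cannot query it. It also only accounts for the Gemma download, not the Dramabox weights, whose size the article does not give.

```python
import shutil

PEAK_VRAM_GB = 24.0      # approximate peak VRAM quoted above
GEMMA_DOWNLOAD_GB = 8.0  # approximate first-run Gemma 3 12B download quoted above


def free_disk_gb(path: str = ".") -> float:
    """Free disk space at `path`, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(vram_gb: float, disk_gb: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means no obvious blocker."""
    problems = []
    if vram_gb < PEAK_VRAM_GB:
        problems.append(
            f"VRAM: have {vram_gb:g} GB, peak usage is around {PEAK_VRAM_GB:g} GB"
        )
    if disk_gb < GEMMA_DOWNLOAD_GB:
        problems.append(
            f"disk: have {disk_gb:.1f} GB free, Gemma download is around {GEMMA_DOWNLOAD_GB:g} GB"
        )
    return problems
```

For example, `preflight(16, free_disk_gb())` flags the VRAM shortfall on a 16 GB card, in which case the hosted demo is the easier route.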

Second, the license. Dramabox ships under the LTX-2 Community License, which sounds open but has a meaningful condition: companies with over $10M in annual revenue need a separate commercial agreement with Lightricks. It’s open weights, not open source, and worth reading before you build anything on top of it.

Who actually needs this

This is probably overkill for anyone who just wants a basic text-to-speech API. But for people building audio-native experiences, the model starts making a lot more sense.

Game studios generating expressive NPC dialogue. Audio drama pipelines. Dubbing workflows. Podcast tooling. Character prototyping. Interactive storytelling systems where delivery matters as much as the words themselves.

Most TTS systems optimize for intelligibility. Dramabox is optimizing for performance.
