DramaBox: An Open-Weight TTS Model Built Around Stage Directions

Listen to this.

via Hugging Face / DramaBox demo

That’s not a voice being synthesized. It’s a prompt. Someone wrote stage directions around dialogue, handed it to a model, and got back a villain catching his breath mid-monologue before dropping to a whisper.

Dramabox just landed on Hugging Face, and the demo space is live. Resemble AI built it on top of Lightricks’ LTX-2.3, and what sets it apart from most other TTS models is simpler than you’d expect: you don’t give it text to read. You write it a scene.

Write a scene, not a sentence

Most TTS models you’ve used before work the same way. You paste text, you get speech. The model decides tone, pacing, delivery. You get what you get.

Dramabox works differently. You don’t give it text to read. You write it a script.

Stage directions go outside the quotes and work as performance cues the model never speaks aloud. Dialogue goes inside the quotes and gets spoken literally, including phonetic sounds. “Hahaha” is a laugh. “Hmm” is a pause. “Ugh” is exactly what it sounds like. The model reads the room around the dialogue and performs accordingly.

Take a prompt like this: A regal woman speaks with cold fury. “I have told you a thousand times.” It produces something categorically different from feeding the model the quoted sentence on its own. The direction shapes the delivery. The result sounds less like synthesis and more like someone actually inhabited the line before speaking it.
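The format described above (directions outside the quotes, spoken text inside them) can be sketched as a small prompt builder. This is only an illustration of the convention, not Resemble AI's tooling: the `Beat` class and the `render_scene` and `spoken_lines` helpers are hypothetical names, the second beat is an invented example, and the use of straight double quotes rather than the curly quotes shown in the article is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Beat:
    """One unit of a scene: an unspoken direction plus an optional spoken line."""
    direction: str
    dialogue: str = ""


def render_scene(beats):
    """Join beats into one prompt: directions bare, dialogue in double quotes."""
    parts = []
    for beat in beats:
        direction = beat.direction.strip()
        if direction:
            parts.append(direction)
        line = beat.dialogue.strip()
        if line:
            # An inner double quote would end the spoken span early, so swap it out.
            parts.append('"' + line.replace('"', "'") + '"')
    return " ".join(parts)


def spoken_lines(prompt):
    """Return only the quoted spans, i.e. the text meant to be said aloud."""
    return re.findall(r'"([^"]*)"', prompt)


# The first beat is the article's example; the second is made up for illustration.
scene = render_scene([
    Beat("A regal woman speaks with cold fury.", "I have told you a thousand times."),
    Beat("She laughs quietly.", "Hahaha."),
])
print(scene)
print(spoken_lines(scene))
```

Keeping directions and dialogue as separate fields makes it easy to check, before sending anything to the model, which words will be spoken and which are only cues.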

It’s a strange way to describe a speech model, but the closest analogy is a screenwriting format that also happens to be a prompt.

What’s powering it

Dramabox is an IC-LoRA fine-tune of LTX-2.3, a 3.3B diffusion transformer from Lightricks that was originally built for video. Resemble AI took the audio branch, trained on top of it, and conditioned the whole thing on Gemma 3 12B text embeddings, which is what lets it actually parse and respond to natural language directions.

The architecture is why the prompt format works. Most TTS models are essentially glorified vocoders. This one has a large language model reading your stage directions before it ever produces audio.

Voice cloning and what it actually needs

The voice cloning part is simple too. You can optionally give Dramabox a 10-second voice sample. If you don’t, the model just picks a voice that matches the scene you wrote. If you do, it tries to speak in that person’s voice instead.
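Since the reference clip is meant to be about 10 seconds long, it can be worth checking a file's duration before using it. The sketch below uses only Python's standard `wave` module and assumes the sample is an uncompressed WAV file. The 2-second tolerance and the helper names are assumptions made for illustration; the article does not say how strict the model is about clip length.

```python
import os
import tempfile
import wave

TARGET_SECONDS = 10.0  # the article's figure for the optional voice sample


def clip_seconds(path):
    """Duration of a WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path, target=TARGET_SECONDS, tolerance=2.0):
    """True if the clip is within `tolerance` seconds of the target (assumed rule)."""
    return abs(clip_seconds(path) - target) <= tolerance


def write_silence(path, seconds, rate=16000):
    """Demo helper: write a mono 16-bit WAV of silence so the check can be tried."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
write_silence(demo_path, 10)
demo_seconds = clip_seconds(demo_path)
print(demo_seconds, is_usable_reference(demo_path))
```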

Some of the voice cloning demos get weirdly convincing. The model is not only copying the voice, it’s copying the delivery style around it. Pauses. Breathing. The little changes in tone when someone starts laughing halfway through a sentence. It feels closer to directing a performance than to cloning a voice.

Reference Audio

via Hugging Face / DramaBox demo

Generated Audio

via Hugging Face / DramaBox demo

Limitations

First, hardware. Peak VRAM sits around 24GB, and the Gemma 3 12B weights, roughly an 8GB download, are fetched automatically on first run. This isn’t a consumer GPU situation. If you want to try it without committing to a local setup, the ZeroGPU demo space on Hugging Face lets you run it from the browser.
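A quick pre-flight check can tell you whether a local run is realistic before the download starts. This is a rough sketch, assuming an NVIDIA GPU with `nvidia-smi` on the PATH. The 24GB threshold is the article's approximate figure, not an official requirement, and the function names are hypothetical. Cards that report right around 24GB are borderline, so treat the result as a hint rather than a verdict.

```python
import subprocess

PEAK_VRAM_GB = 24  # approximate peak usage quoted in the article


def parse_vram_mib(smi_output):
    """Parse one total-memory value (in MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib():
    """Ask nvidia-smi for total memory per GPU; return [] if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_vram_mib(out)


def can_run_locally(vram_mib, required_gb=PEAK_VRAM_GB):
    """True if any single GPU meets the assumed memory threshold."""
    return any(mib / 1024 >= required_gb for mib in vram_mib)
```

Calling `can_run_locally(query_vram_mib())` returns `False` on machines without a large enough GPU or without `nvidia-smi`, in which case the hosted demo is the practical option.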

Second, the license. Dramabox ships under the LTX-2 Community License, which sounds open but has a meaningful condition: companies with over $10M in annual revenue need a separate commercial agreement with Lightricks. It’s open weights, not open source, and worth reading before you build anything on top of it.

Who actually needs this

This is probably overkill for anyone who just wants a basic text-to-speech API. But for people building audio-native experiences, the model starts making a lot more sense.

Game studios generating expressive NPC dialogue. Audio drama pipelines. Dubbing workflows. Podcast tooling. Character prototyping. Interactive storytelling systems where delivery matters as much as the words themselves.

Most TTS systems optimize for intelligibility. Dramabox is optimizing for performance.
