Listen to this.
That’s not a voice being synthesized. It’s a prompt. Someone wrote stage directions around dialogue, handed it to a model, and got back a villain catching his breath mid-monologue before dropping to a whisper.
Dramabox just landed on Hugging Face, and the demo space is live. Resemble AI built it on top of Lightricks’ LTX-2.3, and the thing that makes it different from every other TTS model is simpler than you’d expect: you don’t give it text to read. You write it a scene.
Write a scene, not a sentence
Most TTS models you’ve used before work the same way. You paste text, you get speech. The model decides tone, pacing, delivery. You get what you get.
Dramabox works differently. Instead of text to read, you hand it a script.
Stage directions go outside the quotes and work as performance cues the model never speaks aloud. Dialogue goes inside the quotes and gets spoken literally, including phonetic sounds. “Hahaha” is a laugh. “Hmm” is a pause. “Ugh” is exactly what it sounds like. The model reads the room around the dialogue and performs accordingly.
A line like A regal woman speaks with cold fury. “I have told you a thousand times.” produces something categorically different from just feeding the model that sentence. The direction shapes the delivery. The result sounds less like synthesis and more like someone actually inhabited the line before speaking it.
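For a sense of how directions and dialogue interleave, here’s a slightly fuller script in the same format (the scene itself is made up; the convention follows the model card’s rules above):

```text
A regal woman speaks with cold fury, each word clipped and deliberate.
"I have told you a thousand times." She pauses, then drops to a near
whisper. "Hmm. Perhaps you need a demonstration."
```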
It’s a strange comparison for a speech model, but the closest analogy is a screenwriting format that happens to double as a prompt.
What’s powering it
Dramabox is an IC-LoRA fine-tune of LTX-2.3, a 3.3B diffusion transformer from Lightricks that was originally built for video. Resemble AI took the audio branch, trained on top of it, and conditioned the whole thing on Gemma 3 12B text embeddings, which is what lets it actually parse and respond to natural language directions.
The architecture is why the prompt format works. Most TTS models are essentially glorified vocoders. This one has a large language model reading your stage directions before it ever produces audio.
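That design is easier to see in code. Below is a minimal, illustrative PyTorch sketch of the general pattern: a transformer block whose audio latents cross-attend to text embeddings from a language model. The class, names, and dimensions here are hypothetical, not Dramabox’s actual internals.

```python
import torch
import torch.nn as nn

# Illustrative only: the general shape of conditioning a diffusion
# transformer on LLM text embeddings. Names and dims are hypothetical.
class ConditionedBlock(nn.Module):
    def __init__(self, dim: int = 768, text_dim: int = 3840, heads: int = 12):
        super().__init__()
        # Project the text encoder's embeddings into the transformer's
        # working dimension.
        self.text_proj = nn.Linear(text_dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_latents: torch.Tensor, text_emb: torch.Tensor):
        # Self-attention over the audio latent sequence.
        h = self.norm1(audio_latents)
        x = audio_latents + self.self_attn(h, h, h)[0]
        # Cross-attention: audio latents attend to the prompt embeddings.
        # This is the hook that lets stage directions steer the delivery.
        ctx = self.text_proj(text_emb)
        h = self.norm2(x)
        return x + self.cross_attn(h, ctx, ctx)[0]

# Shapes only, to show the call: batch of 1, 256 latent frames,
# 32 prompt tokens.
block = ConditionedBlock()
out = block(torch.randn(1, 256, 768), torch.randn(1, 32, 3840))
```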
Related: Open Source TTS Models So Small and Capable You Can Run Local Voice AI on Almost Anything
Voice cloning, and what it actually needs
Voice cloning is simple too. You can optionally give Dramabox a 10-second voice sample. If you don’t, the model picks a voice that matches the scene you wrote. If you do, it tries to speak in that person’s voice instead.
Some of the voice cloning demos get weirdly convincing. The model isn’t just copying the voice; it’s copying the delivery style around it. Pauses. Breathing. The little shifts in tone when someone starts laughing halfway through a sentence. It feels less like generating speech and more like directing a performance.
[Audio demo in the original post: a reference clip alongside the generated audio.]
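In interface terms, that’s just two inputs. The sketch below is a hypothetical call shape, not Dramabox’s actual API; generate and its parameters are placeholder names.

```python
from typing import Optional

def generate(script: str, reference_audio: Optional[str] = None) -> bytes:
    """Hypothetical Dramabox-style call (placeholder, not the real API).

    script: stage directions outside quotes, dialogue inside quotes.
    reference_audio: optional path to a ~10-second voice sample. If
    omitted, the model casts a voice that fits the scene; if given,
    it tries to speak in that voice.
    """
    raise NotImplementedError("stand-in for the real inference call")

# Let the model cast the voice:
# audio = generate('A tired detective mutters. "It was you all along."')
# Clone a specific speaker:
# audio = generate('A tired detective mutters. "It was you all along."',
#                  reference_audio="speaker_sample.wav")
```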
Limitations
First, hardware. Peak VRAM sits around 24 GB, and Gemma 3 12B (roughly an 8 GB download) is pulled automatically on first run. This isn’t a consumer-GPU situation. If you want to try it without committing to a local setup, the ZeroGPU demo space on Hugging Face runs it in the browser.
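If you’re weighing local against hosted, a quick check like this tells you where you stand (illustrative; the real ceiling depends on precision and batch size):

```python
import torch

# Rough sanity check before attempting a model that peaks near 24 GB.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
    if total < 24e9:
        print("Probably not enough VRAM; try the ZeroGPU demo instead.")
else:
    print("No CUDA GPU detected; use the hosted demo on Hugging Face.")
```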
Second, the license. Dramabox ships under the LTX-2 Community License, which sounds open but has a meaningful condition: companies with over $10M in annual revenue need a separate commercial agreement with Lightricks. It’s open weights, not open source, and worth reading before you build anything on top of it.
Who actually needs this
This is probably overkill for anyone who just wants a basic text-to-speech API. But for people building audio-native experiences, the model starts making a lot more sense.
Game studios generating expressive NPC dialogue. Audio drama pipelines. Dubbing workflows. Podcast tooling. Character prototyping. Interactive storytelling systems where delivery matters as much as the words themselves.
Most TTS systems optimize for intelligibility. Dramabox optimizes for performance.