Listen to this.
That’s not a voice being synthesized. It’s a prompt. Someone wrote stage directions around dialogue, handed it to a model, and got back a villain catching his breath mid-monologue before dropping to a whisper.
Dramabox just landed on Hugging Face, and the demo space is live. Resemble AI built it on top of Lightricks’ LTX-2.3, and the thing that makes it different from every other TTS model is simpler than you’d expect: you don’t give it text to read. You write it a scene.
Write a scene, not a sentence
Most TTS models you’ve used before work the same way. You paste text, you get speech. The model decides tone, pacing, delivery. You get what you get.
Dramabox works differently. Instead of text to read, you hand it a script.
Stage directions go outside the quotes and work as performance cues the model never speaks aloud. Dialogue goes inside the quotes and gets spoken literally, including phonetic sounds. “Hahaha” is a laugh. “Hmm” is a pause. “Ugh” is exactly what it sounds like. The model reads the room around the dialogue and performs accordingly.
A line like A regal woman speaks with cold fury. “I have told you a thousand times.” produces something categorically different from just feeding the model that sentence. The direction shapes the delivery. The result sounds less like synthesis and more like someone actually inhabited the line before speaking it.
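For a sense of how directions and dialogue interleave, here’s a slightly fuller script in the same format (the scene itself is made up; the convention follows the model card’s rules above):

```text
A regal woman speaks with cold fury, each word clipped and deliberate.
"I have told you a thousand times." She pauses, then drops to a near
whisper. "Hmm. Perhaps you need a demonstration."
```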
It’s a strange comparison for a speech model, but the closest analogy is a screenwriting format that happens to double as a prompt.
What’s powering it
Dramabox is an IC-LoRA fine-tune of LTX-2.3, a 3.3B diffusion transformer from Lightricks that was originally built for video. Resemble AI took the audio branch, trained on top of it, and conditioned the whole thing on Gemma 3 12B text embeddings, which is what lets it actually parse and respond to natural language directions.
The architecture is why the prompt format works. Most TTS models are essentially glorified vocoders. This one has a large language model reading your stage directions before it ever produces audio.
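That design is easier to see in code. Below is a minimal, illustrative PyTorch sketch of the general pattern: a transformer block whose audio latents cross-attend to text embeddings from a language model. The class, names, and dimensions here are hypothetical, not Dramabox’s actual internals.

```python
import torch
import torch.nn as nn

# Illustrative only: the general shape of conditioning a diffusion
# transformer on LLM text embeddings. Names and dims are hypothetical.
class ConditionedBlock(nn.Module):
    def __init__(self, dim: int = 768, text_dim: int = 3840, heads: int = 12):
        super().__init__()
        # Project the text encoder's embeddings into the transformer's
        # working dimension.
        self.text_proj = nn.Linear(text_dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_latents: torch.Tensor, text_emb: torch.Tensor):
        # Self-attention over the audio latent sequence.
        h = self.norm1(audio_latents)
        x = audio_latents + self.self_attn(h, h, h)[0]
        # Cross-attention: audio latents attend to the prompt embeddings.
        # This is the hook that lets stage directions steer the delivery.
        ctx = self.text_proj(text_emb)
        h = self.norm2(x)
        return x + self.cross_attn(h, ctx, ctx)[0]

# Shapes only, to show the call: batch of 1, 256 latent frames,
# 32 prompt tokens.
block = ConditionedBlock()
out = block(torch.randn(1, 256, 768), torch.randn(1, 32, 3840))
```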
Related: Open Source TTS Models So Small and Capable You Can Run Local Voice AI on Almost Anything
Voice cloning, and what it actually needs
Voice cloning is simple too. You can optionally give Dramabox a 10-second voice sample. If you don’t, the model picks a voice that matches the scene you wrote. If you do, it tries to speak in that person’s voice instead.
Some of the voice cloning demos get weirdly convincing. The model isn’t just copying the voice; it’s copying the delivery style around it. Pauses. Breathing. The little shifts in tone when someone starts laughing halfway through a sentence. It feels less like generating speech and more like directing a performance.
[Audio demo in the original post: a reference clip alongside the generated audio.]
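In interface terms, that’s just two inputs. The sketch below is a hypothetical call shape, not Dramabox’s actual API; generate and its parameters are placeholder names.

```python
from typing import Optional

def generate(script: str, reference_audio: Optional[str] = None) -> bytes:
    """Hypothetical Dramabox-style call (placeholder, not the real API).

    script: stage directions outside quotes, dialogue inside quotes.
    reference_audio: optional path to a ~10-second voice sample. If
    omitted, the model casts a voice that fits the scene; if given,
    it tries to speak in that voice.
    """
    raise NotImplementedError("stand-in for the real inference call")

# Let the model cast the voice:
# audio = generate('A tired detective mutters. "It was you all along."')
# Clone a specific speaker:
# audio = generate('A tired detective mutters. "It was you all along."',
#                  reference_audio="speaker_sample.wav")
```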
Limitations
First, hardware. Peak VRAM sits around 24 GB, and Gemma 3 12B (roughly an 8 GB download) is pulled automatically on first run. This isn’t a consumer-GPU situation. If you want to try it without committing to a local setup, the ZeroGPU demo space on Hugging Face runs it in the browser.
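If you’re weighing local against hosted, a quick check like this tells you where you stand (illustrative; the real ceiling depends on precision and batch size):

```python
import torch

# Rough sanity check before attempting a model that peaks near 24 GB.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
    if total < 24e9:
        print("Probably not enough VRAM; try the ZeroGPU demo instead.")
else:
    print("No CUDA GPU detected; use the hosted demo on Hugging Face.")
```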
Second, the license. Dramabox ships under the LTX-2 Community License, which sounds open but has a meaningful condition: companies with over $10M in annual revenue need a separate commercial agreement with Lightricks. It’s open weights, not open source, and worth reading before you build anything on top of it.
Who actually needs this
This is probably overkill for anyone who just wants a basic text-to-speech API. But for people building audio-native experiences, the model starts making a lot more sense.
Game studios generating expressive NPC dialogue. Audio drama pipelines. Dubbing workflows. Podcast tooling. Character prototyping. Interactive storytelling systems where delivery matters as much as the words themselves.
Most TTS systems optimize for intelligibility. Dramabox optimizes for performance.