Most open source model releases follow a predictable pattern. A lab drops weights, publishes benchmark numbers, and the community spends the next week figuring out if any of it holds up in real use. Sarvam’s 30B and 105B are different in one specific way — both are already in production.
The 105B is powering Indus, Sarvam’s reasoning and agentic assistant. The 30B is handling live multilingual voice calls on Samvaad, their conversational agent platform. These aren’t research models waiting to be tested. They shipped first and released the weights after.
What makes them technically interesting is the architecture. Both use Mixture of Experts, meaning that despite the headline parameter counts, each model activates only a fraction of its weights on any given token. The 105B activates 10.3B parameters per token; the 30B activates just 2.4B. That gap between total size and active compute is where the interesting performance story lives.
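To make the "fraction of weights per token" idea concrete, here is a minimal top-k MoE routing sketch in NumPy. This is an illustration of the general technique, not Sarvam's actual implementation; the dimensions, expert count, and gating details are all made up for the example.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route a token vector x to the top-k experts chosen by a gate.

    Only k of the n experts run for this token; the rest stay idle,
    which is why active parameters are far below total parameters.
    """
    scores = x @ gate_w                        # gate logits, one per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the chosen experts
    # Weighted sum of only the selected experts' outputs.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts, k = 8, 16, 2
# Each "expert" here is just a linear layer for simplicity.
experts = [
    (lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)))
    for _ in range(n_experts)
]
gate_w = rng.standard_normal((d, n_experts))

y = moe_forward(rng.standard_normal(d), experts, gate_w, k=k)
print(y.shape)             # each token still produces a full-width output
print(k / n_experts)       # but only this fraction of experts did any work
```

The same logic, scaled up, is how a 105B-parameter model can run with roughly 10B parameters of compute per token.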
Here is what they actually do.
What Are These Models Actually Built For?
Sarvam built two models for two very different jobs. The 30B is a deployment model. It was designed to run fast, stay cheap, and handle real-time interactions without breaking a sweat. If you need an AI that can take a phone call in Hindi, understand a tool request mid-conversation, and respond before the user notices a delay, that’s what the 30B was built for.
The 105B is a reasoning model. It was built for the tasks where you need the AI to think, plan multiple steps ahead, use web search, write code, and execute complex workflows. It powers Indus, Sarvam’s AI assistant for complex queries.
Think of it this way. The 30B is what you deploy. The 105B is what you use when the problem is hard.
What Can Sarvam 105B Do?

The 105B is built for tasks that require actual thinking. Complex math problems, multi-step reasoning, coding challenges, and agentic workflows where the model needs to plan and execute across several turns. On those fronts it holds up well against models significantly larger than itself.
Where it genuinely stands out is web search and agentic tasks. On BrowseComp, a benchmark that tests how well a model finds real answers through live web search, it scored 49.5 against GLM-4.5-Air’s 21.3.
On Beyond AIME, which tests deep mathematical reasoning, the 105B scored 69.1 against GPT-OSS-120B’s 51.0. On τ² Bench, which measures long horizon agentic task completion, it scored 68.3 against GPT-OSS-120B’s 65.8. A 105B model outperforming a 120B one on the benchmarks that actually matter for real work is worth paying attention to.
That said, GPT-OSS-120B still leads on LiveCodeBench, GPQA Diamond, and Arena Hard v2. Both are strong models, just in different areas.
Limitations
The honest limitation is writing and instruction following. If your primary use case is creative writing or highly structured outputs, stronger options exist in this class.
But for reasoning, tool use, and long horizon tasks it punches well above what a 105B model should realistically deliver.
What Can Sarvam 30B Do?

The 30B is built for real world deployment where speed and efficiency matter. It handles live multilingual voice calls, executes tool calls mid-conversation, and does all of this on resource constrained hardware without stuttering. On Samvaad (Sarvam’s conversational agent platform) it is already managing real phone conversations in Hindi and Tamil. The 2.4B active parameter design is not a compromise; it is the whole point.
Where it genuinely stands out is how it competes against much larger models on coding and math. It scores 97.0 on Math500 and 70.0 on LiveCodeBench, outperforming several models with significantly more active compute. For a deployment focused model those numbers are unexpected.
Limitations
SWE-Bench Verified at 34.0 is where the 30B shows its ceiling. Complex real world software engineering tasks remain challenging. If you are building something that requires deep code understanding across large repositories, the 30B will struggle. The 105B handles that better, and even then stronger options exist for pure coding workloads.
But for conversational deployment, voice, multilingual tool use, and real time applications it is genuinely hard to find an open source alternative at this size that performs as consistently.
How to Run Them Locally?
Not on Ollama. If that’s where you were heading, there’s nothing there yet except Sarvam-1, which is their older model and not what we’re talking about.
The official options are HuggingFace with Transformers, SGLang, and vLLM. The vLLM path is the messiest of the three right now. Native support isn’t merged yet so you’re either building from source or running a hotpatch script. It works, but it’s not a five minute setup.
SGLang is the cleanest path at the moment. HuggingFace Transformers works too if you just want to get something running quickly.
Both models are available for download on HuggingFace and AI Kosh. The HuggingFace model pages have the most up to date setup instructions since they get updated as support improves.
If you don’t have the hardware or just don’t want to deal with the setup, Sarvam has an official API for both models. It’s OpenAI-compatible, so their official API documentation is worth checking out if that’s the easier path for you.
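Because the API is OpenAI-compatible, calling it looks like calling any other chat-completions endpoint. Here is a hedged sketch using only the standard library; the base URL and model name below are placeholders, not Sarvam's real values, so check their API documentation for the actual endpoint and model ids.

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"   # placeholder -- use Sarvam's real base URL
MODEL = "sarvam-105b"                     # placeholder model id

def build_chat_request(prompt, model=MODEL, api_key="YOUR_KEY"):
    """Assemble a standard OpenAI-style chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarise Mixture of Experts in one sentence.")
print(req.full_url)
# Sending it is one urllib.request.urlopen(req) call once the URL
# and key are real; any OpenAI-compatible client library works too.
```

The practical upshot of OpenAI compatibility is that existing tooling built against that request shape should work by swapping the base URL, model name, and key.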
So where does this leave us?
Two production-ready open source models with Apache 2.0 licenses that you can download today. One handles real-time voice calls in Hindi and Tamil on constrained hardware. The other matches frontier closed models on agentic benchmarks. Both came out of the same lab, trained entirely in India on Indian compute.
Whether that impresses you or not probably depends on what you expected open source to look like in 2026. For me it’s getting harder to argue that you need a paid API for most workloads.