Meta’s Muse Spark: A Closed Bet on Multimodal, Multi-Agent AI

Meta has a new AI model, and for the first time in years it is not called Llama.

Muse Spark launched under Meta Superintelligence Labs, a new internal division Meta quietly formed by bringing together researchers from Google DeepMind and other frontier labs. It is natively multimodal, supports multi-agent reasoning, and is available right now at meta.ai. It is also not being released as open weights.

That last part is worth sitting with for a second. Meta built one of the most trusted brands in open source AI through Llama. Developers built on it, researchers published with it. Muse Spark continues none of that. No weights, no Hugging Face release, private API preview only.

What you get instead is a genuinely capable multimodal model with some benchmark numbers that are hard to ignore and a new reasoning mode called Contemplating that puts it in conversation with Gemini Deep Think and GPT Pro. Whether that trade is worth it depends entirely on what you were using Meta AI for in the first place.

Why build Muse Spark when Llama exists

Llama is a general purpose open weights model. It is good at text, decent at reasoning, and has a community built around it. What it was not built for is the kind of deeply multimodal, long horizon agentic work Meta is now chasing.

Muse Spark is a different kind of bet. Meta Superintelligence Labs rebuilt the pretraining stack from scratch, and the claimed efficiency gains are significant: Meta says Muse Spark reaches the same capability level as Llama 4 Maverick with less than a tenth of the compute. That is a different architecture pursuing a different goal.

The closed source decision follows from that. When you are rebuilding from the ground up to compete directly with GPT-5.4 and Gemini 3.1 Pro, open weights become a liability rather than a community asset. Meta is not abandoning open source as a philosophy, and Llama continues to exist, but Muse Spark is clearly not part of that story.

Whether that trade bothers you depends on what you were getting from Meta AI before. For developers who built on Llama, nothing changes today. For everyone else, Muse Spark is simply a closed model from a company that used to make open ones.

Visual and real world reasoning

On CharXiv Reasoning, which tests the ability to interpret and analyze scientific charts and figures, Muse Spark scores 86.4. Opus 4.6 scores 65.3. Gemini 3.1 Pro scores 80.2. That 86.4 is the highest number in the table, 6.2 points ahead of Gemini and 21.1 ahead of Opus.

ScreenSpot Pro, which tests UI understanding and interaction, comes in at 84.1. The field is tightly bunched there: GPT-5.4 leads slightly at 85.4, but the gap is small.

The health application is the most concrete real world use case Meta demonstrated. Muse Spark can take an image of food, identify items, overlay personalized health recommendations based on dietary restrictions, and display nutritional data interactively. That is a product feature built on genuine multimodal capability.

ERQA, which tests embodied reasoning over images of real-world scenes, comes in at 64.7. Gemini leads that one at 69.4, and Muse Spark sits just behind GPT-5.4’s 65.4 but comfortably above Opus at 51.6.

The pattern across multimodal tasks is consistent. This model was built with visual reasoning as a first class capability.

Contemplating mode

Most reasoning models think longer to get better answers. The problem is that thinking longer means slower responses, and slower responses at scale mean unhappy users.

Meta’s solution is Contemplating mode, which runs multiple agents reasoning in parallel rather than one agent thinking sequentially for longer. The idea is you get the benefit of extended reasoning without the latency penalty of a single long chain of thought.
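Meta has not published how Contemplating mode orchestrates its agents, so what follows is only a minimal sketch of the general idea, not Meta's method. It assumes simulated agents (a sleep stands in for thinking time) and a simple majority vote as the aggregation step. Every name here (`run_agent`, `contemplate`, `SIMULATED_AGENTS`) is hypothetical. The point it illustrates is the latency argument: agents run concurrently, so the wall-clock cost is roughly that of the slowest agent rather than the sum of all of them.

```python
import asyncio
from collections import Counter

# Hypothetical agents: (simulated thinking time in seconds, answer returned).
SIMULATED_AGENTS = [(0.03, "42"), (0.01, "41"), (0.02, "42")]


async def run_agent(delay_s: float, answer: str) -> str:
    """Stand-in for one reasoning agent; the sleep simulates thinking time."""
    await asyncio.sleep(delay_s)
    return answer


def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer (ties go to the earliest one seen)."""
    return Counter(answers).most_common(1)[0][0]


async def contemplate(agents: list[tuple[float, str]]) -> str:
    """Run every agent concurrently, then aggregate their answers."""
    answers = await asyncio.gather(*(run_agent(d, a) for d, a in agents))
    return majority_vote(list(answers))


def parallel_latency(agents: list[tuple[float, str]]) -> float:
    """Idealized wall-clock cost when all agents run at once: the slowest one."""
    return max(delay for delay, _ in agents)


def sequential_latency(agents: list[tuple[float, str]]) -> float:
    """Idealized cost if the same work ran back to back: the sum of all delays."""
    return sum(delay for delay, _ in agents)
```

Calling `asyncio.run(contemplate(SIMULATED_AGENTS))` returns `"42"`, the answer two of the three simulated agents agree on. A real system would need a much richer aggregation step than a vote, and the total compute spent is not reduced; only the waiting time is.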

On Humanity’s Last Exam with tools, Contemplating mode pushes Muse Spark to 50.4. That puts it in the same conversation as Opus 4.6 at 53.1 and GPT-5.4 at 52.1. Not leading, but genuinely competitive with the frontier.

The parallel agent approach also connects back to the long horizon agentic story. Multiple agents working simultaneously on different parts of a problem is a different architecture than one model thinking harder. Whether it produces meaningfully better results in real workflows versus controlled benchmarks is something independent testing will need to answer.

Where Muse Spark leads

The clearest lead is in health and science reasoning. HealthBench Hard is the most striking number in the entire benchmark table. Muse Spark scores 42.8. Opus 4.6 scores 14.8. Gemini 3.1 Pro scores 20.6. That is not a marginal win; it is a different category of performance on medical reasoning tasks. If Meta is serious about building a personal health assistant, this is the benchmark that backs that claim up most convincingly.

The multimodal science story holds up too. CharXiv Reasoning at 86.4 leads every model in the comparison by a meaningful margin. For anyone working with scientific literature, research papers, or data heavy documents, that number matters in practice.

DeepSearchQA at 74.8 leads the table as well, suggesting the model handles open ended research style queries better than its competitors right now.

Current limitations

ARC AGI 2 is the number that gives me pause. Muse Spark scores 42.5 while Gemini 3.1 Pro scores 76.5 and GPT-5.4 scores 76.1. That is not a small gap on abstract reasoning tasks; it is a roughly 34 point difference against the leading models.

Coding tells a similar story. LiveCodeBench Pro comes in at 80.0 against GPT-5.4’s 87.5. Competitive but not leading, and for developers choosing a model primarily for coding work that gap is relevant.

Terminal-Bench 2.0 at 59.0 trails Gemini at 68.5 and GPT-5.4 at 75.1 by a wider margin than most other categories.

The pattern is consistent. Muse Spark was built for multimodal perception, health, and research tasks. It was not built to be the best coding model or the strongest abstract reasoner. Knowing that going in saves you from expecting something the benchmarks do not support.
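To see the leads and deficits side by side, here is a small sketch that computes Muse Spark's gap to the best rival on each benchmark. It uses only the scores quoted in this article, so "best rival" means the best of the models mentioned here for that benchmark, not necessarily the best in Meta's full table. The function name `gap_to_best_rival` is made up for illustration.

```python
# Scores exactly as quoted in this article; nothing here is re-measured.
SCORES = {
    "CharXiv Reasoning": {"Muse Spark": 86.4, "Opus 4.6": 65.3, "Gemini 3.1 Pro": 80.2},
    "ScreenSpot Pro": {"Muse Spark": 84.1, "GPT-5.4": 85.4},
    "ERQA": {"Muse Spark": 64.7, "Gemini 3.1 Pro": 69.4, "GPT-5.4": 65.4, "Opus 4.6": 51.6},
    "HLE with tools": {"Muse Spark": 50.4, "Opus 4.6": 53.1, "GPT-5.4": 52.1},
    "HealthBench Hard": {"Muse Spark": 42.8, "Opus 4.6": 14.8, "Gemini 3.1 Pro": 20.6},
    "ARC AGI 2": {"Muse Spark": 42.5, "Gemini 3.1 Pro": 76.5, "GPT-5.4": 76.1},
    "LiveCodeBench Pro": {"Muse Spark": 80.0, "GPT-5.4": 87.5},
    "Terminal-Bench 2.0": {"Muse Spark": 59.0, "Gemini 3.1 Pro": 68.5, "GPT-5.4": 75.1},
}


def gap_to_best_rival(scores, model="Muse Spark"):
    """Map each benchmark to (best quoted rival, model score minus rival score)."""
    gaps = {}
    for benchmark, row in scores.items():
        rivals = {name: s for name, s in row.items() if name != model}
        best = max(rivals, key=rivals.get)
        # Round to one decimal to hide floating point noise in the subtraction.
        gaps[benchmark] = (best, round(row[model] - rivals[best], 1))
    return gaps


if __name__ == "__main__":
    # Print benchmarks from biggest lead to biggest deficit.
    ranked = sorted(gap_to_best_rival(SCORES).items(), key=lambda kv: -kv[1][1])
    for benchmark, (rival, gap) in ranked:
        print(f"{benchmark:20s} {gap:+6.1f} vs {rival}")
```

Sorted this way, HealthBench Hard and CharXiv Reasoning are the only two positive entries, while ARC AGI 2 and Terminal-Bench 2.0 sit at the bottom.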

A closed door with an open promise

Muse Spark is genuinely impressive in the areas Meta designed it for: health reasoning, multimodal perception, scientific analysis. Those are not just marketing claims. The benchmark gaps over competing models in those categories are real and in some cases surprisingly large.

But it comes with a trade. The community that trusted Meta because of Llama gets nothing here. No weights, no local deployment, no building on top of it. You get an API preview if Meta decides you qualify and a chat interface at meta.ai for everyone else.

That is not necessarily wrong. Closed models can be excellent tools. But it is a different relationship than what Meta spent years building with developers, and worth being clear eyed about before you start depending on it.

If your work involves health, scientific documents, or visual reasoning, Muse Spark is worth trying the moment the API opens up. If you need strong coding performance or abstract reasoning, the benchmarks say GPT-5.4 and Gemini still perform better.

Meta built something genuinely capable. They just decided not to share it this time.
