
Meta’s Muse Spark: A Closed Bet on Multimodal, Multi-Agent AI


Meta has a new AI model, and for the first time in years it is not called Llama.

Muse Spark launched under Meta Superintelligence Labs, a new internal division Meta quietly formed by bringing together researchers from Google DeepMind and other frontier labs. It is natively multimodal, supports multi-agent reasoning, and is available right now at meta.ai. It is also not being released as open weights.

That last part is worth sitting with for a second. Meta built one of the most trusted brands in open source AI through Llama. Developers built on it; researchers published with it. Muse Spark continues none of that. No weights, no HuggingFace release, only a private API preview.

What you get instead is a genuinely capable multimodal model with some benchmark numbers that are hard to ignore and a new reasoning mode called Contemplating that puts it in conversation with Gemini Deep Think and GPT Pro. Whether that trade is worth it depends entirely on what you were using Meta AI for in the first place.

Why build Muse Spark when Llama exists


Llama is a general purpose open weights model. It is good at text, decent at reasoning, and has a community built around it. What it was not built for is the kind of deeply multimodal, long horizon agentic work Meta is now chasing.

Muse Spark is a different kind of bet. Meta Superintelligence Labs rebuilt the pretraining stack from scratch, and the efficiency gains are significant. They claim Muse Spark reaches the same capability level as Llama 4 Maverick using less than a tenth of the compute. That is a different architecture pursuing a different goal.

The closed source decision follows from that. When you are rebuilding from the ground up to compete directly with GPT-5.4 and Gemini 3.1 Pro, open weights become a liability rather than a community asset. Meta is not abandoning open source as a philosophy; Llama continues to exist. But Muse Spark is clearly not part of that story.

Whether that trade bothers you depends on what you were getting from Meta AI before. For developers who built on Llama, nothing changes today. For everyone else, Muse Spark is simply a closed model from a company that used to make open ones.

Visual and real world reasoning

On CharXiv Reasoning, which tests the ability to interpret and analyze scientific charts and figures, Muse Spark scores 86.4. Opus 4.6 scores 65.3. Gemini 3.1 Pro scores 80.2. That 86.4 is the highest number in the table, and the gap over Opus is more than 21 points.

On ScreenSpot Pro, which tests UI understanding and interaction, Muse Spark comes in at 84.1. The field is competitive across the board there; GPT-5.4 leads slightly at 85.4, but the gap is small.

The health application is the most concrete real world use case Meta demonstrated. Muse Spark can take an image of food, identify items, overlay personalized health recommendations based on dietary restrictions, and display nutritional data interactively. That is a product feature built on genuine multimodal capability.
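Meta has not exposed an API for this feature, so the details are unknown, but the restriction-flagging step behind it is easy to picture. Here is a minimal sketch, assuming the model has already returned identified items with nutrition data; `FoodItem` and `annotate` are hypothetical names for illustration, not Meta's API:

```python
from dataclasses import dataclass, field

@dataclass
class FoodItem:
    # Hypothetical shape for one item the model identified in the image.
    name: str
    calories: int
    allergens: set = field(default_factory=set)

def annotate(items, restrictions):
    """Produce a per-item note, flagging conflicts with dietary restrictions."""
    notes = []
    for item in items:
        conflicts = item.allergens & restrictions
        if conflicts:
            notes.append(f"{item.name}: avoid ({', '.join(sorted(conflicts))})")
        else:
            notes.append(f"{item.name}: ok, {item.calories} kcal")
    return notes

plate = [
    FoodItem("peanut noodles", 540, {"peanut", "gluten"}),
    FoodItem("fruit salad", 120),
]
print(annotate(plate, {"peanut"}))
# ['peanut noodles: avoid (peanut)', 'fruit salad: ok, 120 kcal']
```

The hard part of the feature is the multimodal recognition, not this filtering step; the point is only that once the model produces structured items, the personalized overlay is straightforward application logic.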

On ERQA, which tests entity recognition across images, Muse Spark scores 64.7. Gemini leads that one at 69.4, but Muse Spark sits above GPT-5.4’s 65.4 and comfortably above Opus at 51.6.

The pattern across multimodal tasks is consistent. This model was built with visual reasoning as a first-class capability.


Contemplating mode

Thinking models have a latency problem. Most reasoning models think longer to get better answers. The trouble is that thinking longer means slower responses, and slower responses at scale mean unhappy users.

Meta’s answer is Contemplating mode, which runs multiple agents reasoning in parallel rather than one agent thinking sequentially for longer. The idea is that you get the benefit of extended reasoning without the latency penalty of a single long chain of thought.
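Meta has not published how Contemplating mode coordinates or aggregates its agents. As a rough sketch of the general pattern it describes, parallel sampling followed by answer selection, here is a toy version; `ask_model` is a hypothetical stand-in for a real model call, and majority voting is one plausible aggregation strategy, not Meta's confirmed method:

```python
import asyncio
from collections import Counter

async def ask_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a real model API call. Each seed
    # represents an independent reasoning chain; answers are faked
    # here so the example runs offline.
    await asyncio.sleep(0.01)  # stand-in for inference latency
    return "4" if seed % 4 != 0 else "5"  # one dissenting chain in four

async def contemplate(prompt: str, n_agents: int = 8) -> str:
    # Launch n_agents reasoning chains concurrently. Wall-clock time
    # is roughly that of one chain, not n_agents chains in sequence.
    answers = await asyncio.gather(
        *(ask_model(prompt, seed=i) for i in range(n_agents))
    )
    # Aggregate by majority vote (self-consistency style selection).
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

if __name__ == "__main__":
    print(asyncio.run(contemplate("What is 2 + 2?")))
```

Whether Meta's agents vote, split the problem into subtasks, or hand candidates to a judge model is unknown; the sketch only illustrates how concurrency can substitute for a longer sequential chain.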

On Humanity’s Last Exam with tools, Contemplating mode pushes Muse Spark to 50.4. That puts it in the same conversation as Opus 4.6 at 53.1 and GPT-5.4 at 52.1. Not leading, but genuinely competitive with the frontier.

The parallel agent approach also connects back to the long horizon agentic story. Multiple agents working simultaneously on different parts of a problem is a different architecture than one model thinking harder. Whether it produces meaningfully better results in real workflows versus controlled benchmarks is something independent testing will need to answer.

Where Muse Spark leads

The clearest lead is in health and science reasoning. HealthBench Hard is the most striking number in the entire benchmark table. Muse Spark scores 42.8. Opus 4.6 scores 14.8. Gemini 3.1 Pro scores 20.6. That is not a marginal win; it is a different category of performance on medical reasoning tasks. If Meta is serious about building a personal health assistant, this is the benchmark that backs that claim up most convincingly.

The multimodal science story holds up too. CharXiv Reasoning at 86.4 leads every model in the comparison by a meaningful margin. For anyone working with scientific literature, research papers, or data heavy documents, that number matters in practice.

DeepSearchQA at 74.8 leads the table as well, suggesting the model handles open ended research style queries better than its competitors right now.


Current limitations

ARC AGI 2 is the number that gives me pause. Muse Spark scores 42.5 while Gemini 3.1 Pro scores 76.5 and GPT-5.4 scores 76.1. That is not a small gap on abstract reasoning tasks; it is a 34-point deficit against the leading models.

Coding tells a similar story. LiveCodeBench Pro comes in at 80.0 against GPT-5.4’s 87.5. Competitive but not leading, and for developers choosing a model primarily for coding work that gap is relevant.

Terminal-Bench 2.0 at 59.0 trails Gemini at 68.5 and GPT-5.4 at 75.1 by a wider margin than most other categories.

The pattern is consistent. Muse Spark was built for multimodal perception, health, and research tasks. It was not built to be the best coding model or the strongest abstract reasoner. Knowing that going in saves you from expecting something the benchmarks do not support.

A closed door with an open promise

Muse Spark is genuinely impressive in the areas Meta designed it for. Health reasoning, multimodal perception, scientific analysis. Those are not marketing claims, the benchmark gaps over competing models in those categories are real and in some cases surprisingly large.

But it comes with a trade. The community that trusted Meta because of Llama gets nothing here. No weights, no local deployment, no building on top of it. You get an API preview if Meta decides you qualify and a chat interface at meta.ai for everyone else.

That is not necessarily wrong. Closed models can be excellent tools. But it is a different relationship than the one Meta spent years building with developers, and worth being clear-eyed about before you start depending on it.

If your work involves health, scientific documents, or visual reasoning, Muse Spark is worth trying the moment the API opens up. If you need strong coding performance or abstract reasoning, the benchmarks say GPT-5.4 and Gemini still perform better.

Meta built something genuinely capable. They just decided not to share it this time.
