Meta has a new AI model and for the first time in years it is not called Llama.
Muse Spark launched under Meta Superintelligence Labs, a new internal division Meta quietly formed by bringing together researchers from Google DeepMind and other frontier labs. It is natively multimodal, supports multi-agent reasoning, and is available right now at meta.ai. It is also not being released as open weights.
That last part is worth sitting with for a second. Meta built one of the most trusted brands in open source AI through Llama. Developers built on it, researchers published with it. Muse Spark continues none of that. No weights, no HuggingFace release, private API preview only.
What you get instead is a genuinely capable multimodal model with some benchmark numbers that are hard to ignore and a new reasoning mode called Contemplating that puts it in conversation with Gemini Deep Think and GPT Pro. Whether that trade is worth it depends entirely on what you were using Meta AI for in the first place.
Why build Muse Spark when Llama exists

Llama is a general purpose open weights model. It is good at text, decent at reasoning, and has a community built around it. What it was not built for is the kind of deeply multimodal, long horizon agentic work Meta is now chasing.
Muse Spark is a different kind of bet. Meta Superintelligence Labs rebuilt the pretraining stack from scratch, and the efficiency gains are significant. They claim Muse Spark matches the capability level of Llama 4 Maverick with less than a tenth of the compute. That is a different architecture pursuing a different goal.
The closed source decision follows from that. When you are rebuilding from the ground up to compete directly with GPT-5.4 and Gemini 3.1 Pro, open weights become a liability rather than a community asset. Meta is not abandoning open source as a philosophy (Llama continues to exist), but Muse Spark is clearly not part of that story.
Whether that trade bothers you depends on what you were getting from Meta AI before. For developers who built on Llama, nothing changes today. For everyone else, Muse Spark is simply a closed model from a company that used to make open ones.
Visual and real world reasoning
On CharXiv Reasoning, which tests the ability to interpret and analyze scientific charts and figures, Muse Spark scores 86.4. Opus 4.6 scores 65.3. Gemini 3.1 Pro scores 80.2. That 86.4 is the highest number in the table, and against Opus it is not even close.
ScreenSpot Pro, which tests UI understanding and interaction, comes in at 84.1. That is competitive across the board: GPT-5.4 leads slightly at 85.4, but the gap is small.
The health application is the most concrete real world use case Meta demonstrated. Muse Spark can take an image of food, identify items, overlay personalized health recommendations based on dietary restrictions, and display nutritional data interactively. That is a product feature built on genuine multimodal capability.
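Meta has not published an API for this feature, so the following is only a minimal sketch of what that pipeline might look like as a request, assuming a hypothetical multimodal message format where a base64 encoded photo and a dietary prompt travel together. The model identifier, payload shape, and field names are all assumptions for illustration, not a documented interface.

```python
import base64
import json

# Hypothetical request builder. Meta has not released a public SDK
# for Muse Spark, so the model name and payload shape here are
# assumptions for illustration only.
def build_meal_request(image_bytes: bytes, restrictions: list[str]) -> dict:
    prompt = (
        "Identify each food item in this photo and return JSON with "
        "fields: item, estimated_calories, and a warning if the item "
        f"conflicts with these dietary restrictions: {', '.join(restrictions)}."
    )
    return {
        "model": "muse-spark",  # assumed model identifier
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image", "data": base64.b64encode(image_bytes).decode()},
            ],
        }],
    }

# Demo with placeholder bytes; a real call would read an actual photo.
req = build_meal_request(b"<jpeg bytes>", ["low sodium", "no peanuts"])
print(json.dumps(req, indent=2)[:400])
```

The point of the sketch is the shape of the feature, not the wire format: one multimodal request carries both the image and the user's constraints, and the structured response is what drives the interactive nutrition overlay.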
On ERQA, which tests entity recognition across images, Muse Spark scores 64.7. Gemini leads that one at 69.4, but Muse Spark sits above GPT-5.4’s 65.4 and comfortably above Opus at 51.6.
The pattern across multimodal tasks is consistent. This model was built with visual reasoning as a first class capability.
Contemplating mode
This is Meta’s answer to the slow responses of thinking models. Most reasoning models get better answers by thinking longer, but thinking longer means slower responses, and slower responses at scale mean unhappy users.
Meta’s solution is Contemplating mode, which runs multiple agents reasoning in parallel rather than one agent thinking sequentially for longer. The idea is that you get the benefit of extended reasoning without the latency penalty of a single long chain of thought.
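Meta has not described how Contemplating mode is actually implemented, so here is a minimal sketch of the general parallel reasoning pattern it gestures at, assuming a hypothetical ask_model call standing in for one reasoning request and a simple majority vote as the aggregation step. Real systems might rank candidates with a verifier model instead.

```python
import asyncio
from collections import Counter

# Hypothetical stand-in for a real model API call; Meta has not
# published an SDK for Muse Spark, so this is illustrative only.
async def ask_model(prompt: str, seed: int) -> str:
    await asyncio.sleep(0.1)  # placeholder for network latency
    return f"answer-{seed % 2}"  # placeholder response

async def contemplate(prompt: str, n_agents: int = 4) -> str:
    # Launch n_agents independent reasoning passes concurrently.
    # Wall-clock time is roughly one pass, not n passes, which is
    # the latency argument for parallel over sequential reasoning.
    tasks = [ask_model(prompt, seed=i) for i in range(n_agents)]
    answers = await asyncio.gather(*tasks)
    # Aggregate by majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]

print(asyncio.run(contemplate("What is 17 * 23?")))
```

The design trade is compute for latency: you pay for n reasoning passes but wait roughly as long as one, which is exactly the pitch Meta is making.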
On Humanity’s Last Exam with tools, Contemplating mode pushes Muse Spark to 50.4. That puts it in the same conversation as Opus 4.6 at 53.1 and GPT-5.4 at 52.1. Not leading, but genuinely competitive with the frontier.
The parallel agent approach also connects back to the long horizon agentic story. Multiple agents working simultaneously on different parts of a problem is a different architecture than one model thinking harder. Whether it produces meaningfully better results in real workflows versus controlled benchmarks is something independent testing will need to answer.
Where Muse Spark leads
The clearest lead is in health and science reasoning. HealthBench Hard is the most striking number in the entire benchmark table: Muse Spark scores 42.8, Opus 4.6 scores 14.8, and Gemini 3.1 Pro scores 20.6. That is not a marginal win; that is a different category of performance on medical reasoning tasks. If Meta is serious about building a personal health assistant, this is the benchmark that backs that claim up most convincingly.
The multimodal science story holds up too. CharXiv Reasoning at 86.4 leads every model in the comparison by a meaningful margin. For anyone working with scientific literature, research papers, or data heavy documents, that number matters in practice.
DeepSearchQA at 74.8 leads the table as well, suggesting the model handles open ended research style queries better than its competitors right now.
Current limitations
ARC AGI 2 is the number that gives me pause. Muse Spark scores 42.5 while Gemini 3.1 Pro scores 76.5 and GPT-5.4 scores 76.1. That is not a small gap on abstract reasoning; it is a 34 point deficit against the leading models.
Coding tells a similar story. LiveCodeBench Pro comes in at 80.0 against GPT-5.4’s 87.5. Competitive but not leading, and for developers choosing a model primarily for coding work that gap is relevant.
Terminal-Bench 2.0 at 59.0 trails Gemini at 68.5 and GPT-5.4 at 75.1 by a wider margin than most other categories.
The pattern is consistent. Muse Spark was built for multimodal perception, health, and research tasks. It was not built to be the best coding model or the strongest abstract reasoner. Knowing that going in saves you from expecting something the benchmarks do not support.
A closed door with an open promise
Muse Spark is genuinely impressive in the areas Meta designed it for: health reasoning, multimodal perception, scientific analysis. Those are not marketing claims; the benchmark gaps over competing models in those categories are real and in some cases surprisingly large.
But it comes with a trade. The community that trusted Meta because of Llama gets nothing here. No weights, no local deployment, no building on top of it. You get an API preview if Meta decides you qualify and a chat interface at meta.ai for everyone else.
That is not necessarily wrong. Closed models can be excellent tools. But it is a different relationship than what Meta spent years building with developers, and worth being clear eyed about before you start depending on it.
If your work involves health, scientific documents, or visual reasoning, Muse Spark is worth trying the moment the API opens up. If you need strong coding performance or abstract reasoning, the benchmarks say GPT-5.4 and Gemini still perform better.
Meta built something genuinely capable. They just decided not to share it this time.