Meta’s Muse Spark: A Closed Bet on Multimodal, Multi-Agent AI

Meta has a new AI model, and for the first time in years it is not called Llama.

Muse Spark launched under Meta Superintelligence Labs, a new internal division Meta quietly formed by bringing together researchers from Google DeepMind and other frontier labs. It is natively multimodal, supports multi-agent reasoning, and is available right now at meta.ai. It is also not being released as open weights.

That last part is worth sitting with for a second. Meta built one of the most trusted brands in open source AI through Llama. Developers built on it; researchers published with it. Muse Spark continues none of that. No weights, no Hugging Face release, private API preview only.

What you get instead is a genuinely capable multimodal model with some benchmark numbers that are hard to ignore and a new reasoning mode called Contemplating that puts it in conversation with Gemini Deep Think and GPT Pro. Whether that trade is worth it depends entirely on what you were using Meta AI for in the first place.

Why build Muse Spark when Llama exists?

Llama is a general purpose open weights model. It is good at text, decent at reasoning, and has a community built around it. What it was not built for is the kind of deeply multimodal, long horizon agentic work Meta is now chasing.

Muse Spark is a different kind of bet. Meta Superintelligence Labs rebuilt the pretraining stack from scratch, and the efficiency gains are significant: they claim Muse Spark reaches the same capability level as Llama 4 Maverick using less than a tenth of the compute. That is a different architecture pursuing a different goal.

The closed source decision follows from that. When you are rebuilding from the ground up to compete directly with GPT-5.4 and Gemini 3.1 Pro, open weights become a liability rather than a community asset. Meta is not abandoning open source as a philosophy (Llama continues to exist), but Muse Spark is clearly not part of that story.

Whether that trade bothers you depends on what you were getting from Meta AI before. For developers who built on Llama, nothing changes today. For everyone else, Muse Spark is simply a closed model from a company that used to make open ones.

Visual and real world reasoning

On CharXiv Reasoning, which tests the ability to interpret and analyze scientific charts and figures, Muse Spark scores 86.4. Opus 4.6 scores 65.3. Gemini 3.1 Pro scores 80.2. That 86.4 is the highest number in the table, and the race with Opus is not close.

On ScreenSpot Pro, which tests UI understanding and interaction, Muse Spark comes in at 84.1. That is competitive across the board; GPT-5.4 leads slightly at 85.4, but the gap is small.

The health application is the most concrete real world use case Meta demonstrated. Muse Spark can take an image of food, identify items, overlay personalized health recommendations based on dietary restrictions, and display nutritional data interactively. That is a product feature built on genuine multimodal capability.
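Meta has not published API documentation for any of this, so the interface below is guesswork. Still, as a rough sketch of what that flow might look like from a developer's side, here is a minimal Python client; the endpoint, payload fields, and response shape are all hypothetical assumptions, not a real Meta API.

```python
# Hypothetical sketch of the food-analysis flow Meta demonstrated.
# Muse Spark is a private API preview; the endpoint, payload fields,
# and response shape below are assumptions, not a published interface.
import base64
import requests

API_URL = "https://api.meta.ai/v1/muse-spark/analyze"  # hypothetical endpoint

def analyze_meal(image_path: str, restrictions: list[str], api_key: str) -> dict:
    """Send a food photo plus dietary context, get structured analysis back."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "image": image_b64,
        "tasks": ["identify_items", "nutrition", "recommendations"],
        "user_context": {"dietary_restrictions": restrictions},
    }
    resp = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Imagined shape: {"items": [...], "nutrition": {...}, "recommendations": [...]}
    return resp.json()
```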

On ERQA, which tests entity recognition across images, Muse Spark scores 64.7. Gemini leads that one at 69.4, but Muse Spark sits above GPT-5.4’s 65.4 and comfortably above Opus at 51.6.

The pattern across multimodal tasks is consistent. This model was built with visual reasoning as a first class capability.

Contemplating mode

This is Meta’s answer to the latency problem of thinking models. Most reasoning models get better answers by thinking longer. The problem is that thinking longer means slower responses, and slower responses at scale mean unhappy users.

Meta’s solution is Contemplating mode, which runs multiple agents reasoning in parallel rather than one agent thinking sequentially for longer. The idea is that you get the benefit of extended reasoning without the latency penalty of a single long chain of thought.
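Meta has not said how Contemplating mode is implemented internally, but the latency argument is easy to illustrate. A toy Python sketch, assuming a stand-in reason() function that takes about a second per call; nothing here reflects Muse Spark's actual internals.

```python
# Toy illustration of sequential versus parallel reasoning latency.
# The reason() function is a stand-in, not Muse Spark's real mechanism.
import asyncio

async def reason(subproblem: str) -> str:
    """Stand-in for one agent reasoning over a slice of the problem."""
    await asyncio.sleep(1.0)  # pretend each agent "thinks" for ~1 second
    return f"conclusion for: {subproblem}"

async def sequential(subproblems: list[str]) -> list[str]:
    # One agent thinking longer: latency grows with the number of steps.
    return [await reason(p) for p in subproblems]

async def contemplating(subproblems: list[str]) -> list[str]:
    # Parallel fan-out: wall-clock latency stays near a single step,
    # at the cost of running several agents' worth of compute at once.
    return await asyncio.gather(*(reason(p) for p in subproblems))

if __name__ == "__main__":
    problems = ["decompose constraints", "search candidates", "verify answer"]
    # ~3 seconds sequentially versus ~1 second in parallel for this toy case.
    print(asyncio.run(contemplating(problems)))
```

The trade is latency for parallel compute, which is presumably why this ships as a distinct mode rather than the default.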

On Humanity’s Last Exam with tools, Contemplating mode pushes Muse Spark to 50.4. That puts it in the same conversation as Opus 4.6 at 53.1 and GPT-5.4 at 52.1. Not leading, but genuinely competitive with the frontier.

The parallel agent approach also connects back to the long horizon agentic story. Multiple agents working simultaneously on different parts of a problem is a different architecture from one model thinking harder. Whether it produces meaningfully better results in real workflows versus controlled benchmarks is something independent testing will need to answer.

Where Muse Spark leads

The clearest lead is in health and science reasoning. HealthBench Hard is the most striking number in the entire benchmark table. Muse Spark scores 42.8. Opus 4.6 scores 14.8. Gemini 3.1 Pro scores 20.6. That is not a marginal win; it is a different category of performance on medical reasoning tasks. If Meta is serious about building a personal health assistant, this is the benchmark that backs that claim up most convincingly.

The multimodal science story holds up too. CharXiv Reasoning at 86.4 leads every model in the comparison by a meaningful margin. For anyone working with scientific literature, research papers, or data heavy documents, that number matters in practice.

DeepSearchQA at 74.8 leads the table as well, suggesting the model handles open ended research style queries better than its competitors right now.

Current limitations

ARC AGI 2 is the number that gives me pause. Muse Spark scores 42.5 while Gemini 3.1 Pro scores 76.5 and GPT-5.4 scores 76.1. That is not a small gap on abstract reasoning tasks; it is a 34 point difference against the leading models.

Coding tells a similar story. LiveCodeBench Pro comes in at 80.0 against GPT-5.4’s 87.5. Competitive but not leading, and for developers choosing a model primarily for coding work that gap is relevant.

Terminal-Bench 2.0 at 59.0 trails Gemini at 68.5 and GPT-5.4 at 75.1 by a wider margin than most other categories.

The pattern is consistent. Muse Spark was built for multimodal perception, health, and research tasks. It was not built to be the best coding model or the strongest abstract reasoner. Knowing that going in saves you from expecting something the benchmarks do not support.

A closed door with an open promise

Muse Spark is genuinely impressive in the areas Meta designed it for: health reasoning, multimodal perception, scientific analysis. Those are not marketing claims; the benchmark gaps over competing models in those categories are real and in some cases surprisingly large.

But it comes with a trade. The community that trusted Meta because of Llama gets nothing here. No weights, no local deployment, no building on top of it. You get an API preview if Meta decides you qualify and a chat interface at meta.ai for everyone else.

That is not necessarily wrong. Closed models can be excellent tools. But it is a different relationship than what Meta spent years building with developers, and worth being clear eyed about before you start depending on it.

If your work involves health, scientific documents, or visual reasoning, Muse Spark is worth trying the moment the API opens up. If you need strong coding performance or abstract reasoning, the benchmarks say GPT-5.4 and Gemini still perform better.

Meta built something genuinely capable. They just decided not to share it this time.
