Cohere spent the past year deploying North, its enterprise AI workspace, with actual customers doing actual work. Agentic question answering over company file systems. Data analysis across spreadsheets. Multi-session memory that has to hold up in production. Command A+ is what came out of that, a model shaped by a year of watching enterprise workflows break and figuring out why.
The result is a 218B mixture-of-experts model with 25B active parameters at inference time, available today on Hugging Face under Apache 2.0. It replaces five separate models in the Command A family, each of which handled one thing. This one handles all of them, and on most of the tasks those specialist models were built for, it wins.
Five models became one
The Command A family going into this release was fragmented: Command A for general use, Reasoning for complex problem solving, Vision for multimodal, Translate for multilingual, and tool use handled separately. Five models with five sets of infrastructure to manage.
Command A+ consolidates all of it. One model, support for 48 languages (up from 23), multimodal reasoning included, tool use built in, reasoning mode available. For an enterprise team managing private deployments, that matters. Fewer models means fewer hardware configurations and fewer versioning headaches.
The consolidation only works if the unified model actually matches the specialists. On the agentic tasks that matter most for North, it doesn’t just match them. Agentic QA accuracy improved 20% over Command A Reasoning. Spreadsheet analysis quality improved 32%. Memory performance, which tests whether the model can use context from a previous session to answer questions in a new one, jumped from 39% to 54%. These are meaningful gains over the specialist it replaced.
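The memory figure is given as two scores, so it can be read either as a gain in percentage points or as a relative improvement. The article does not say which reading applies to the 20% and 32% figures. A minimal sketch of the two calculations, using only the 39% and 54% scores above (the function names are illustrative):

```python
def absolute_gain_points(before_pct: float, after_pct: float) -> float:
    """Difference between two scores, in percentage points."""
    return after_pct - before_pct


def relative_gain_pct(before_pct: float, after_pct: float) -> float:
    """Improvement expressed as a percentage of the starting score."""
    return (after_pct - before_pct) / before_pct * 100


# Memory benchmark scores quoted in the article.
before, after = 39, 54
print(f"{absolute_gain_points(before, after)} percentage points")
print(f"{relative_gain_pct(before, after):.1f}% relative improvement")
```

On those numbers the jump is 15 percentage points, or roughly a 38% relative improvement over the earlier score.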
The efficiency numbers
218B total parameters sounds like a cluster problem. It isn’t, and that distinction is the whole point of the MoE architecture here.
In a dense model every parameter fires for every token. Command A+ activates 25B parameters per token and leaves the rest idle, which keeps per-token compute close to that of a much smaller model. With W4A4 quantization shrinking the weights, it runs on two NVIDIA H100s, or a single Blackwell GPU, with what Cohere describes as an imperceptible quality difference versus the full-precision version. For teams trying to deploy privately, on their own hardware, without routing sensitive data through an external API, that minimum spec changes the conversation.
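A back-of-the-envelope check on that hardware floor, offered as a rough sketch rather than Cohere’s own sizing. It estimates weight memory alone for 218B parameters at the three released precisions. The assumptions are mine: 80 GB H100s, 4-bit weights counted as 0.5 bytes per parameter, all experts resident in GPU memory, and no allowance for KV cache, activations, or quantization scale overhead. Real requirements will be higher.

```python
import math

TOTAL_PARAMS = 218e9     # total parameters (from the article)
ACTIVE_PARAMS = 25e9     # parameters active per token (from the article)
H100_MEMORY_GB = 80      # assumption: 80 GB H100 variant

# Assumed storage cost per weight at each precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "W4": 0.5}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in GB (10^9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


def min_gpus(memory_gb: float, gpu_memory_gb: float = H100_MEMORY_GB) -> int:
    """Smallest GPU count whose combined memory covers the weights alone."""
    return math.ceil(memory_gb / gpu_memory_gb)


for name, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    print(f"{name}: {gb:.0f} GB of weights, at least {min_gpus(gb)} x 80 GB GPUs")

# Share of the model doing work on any one token.
print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions the weights come to about 436 GB in BF16, 218 GB in FP8, and 109 GB at 4 bits, which is where two 80 GB cards become plausible. The 25B active parameters cut the compute per token. It is the quantization that shrinks the memory footprint.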
Command A+ is also meaningfully faster than its predecessor. Against Command A Reasoning at the same quantization and concurrency levels, it delivers up to 63% higher output tokens per second and cuts time to first token by up to 17%. W4A4 quantization adds another 47% speed increase on top of that. Cohere also used speculative decoding optimized specifically for the MoE architecture, adding a further 1.5x to 1.6x inference speedup.
There’s also a new tokenizer. Command A+ is the first Cohere model to use it, and the compression gains matter especially for non-European languages: Arabic tokenization improved 20%, Korean 16%, and Japanese 18%. Fewer tokens per response means lower inference cost per query, which compounds quickly at enterprise scale.
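To make the compounding concrete, here is a small sketch of how a tokenizer gain could translate into cost. It assumes that "improved N%" means N% fewer tokens for the same text, which the article does not state explicitly. The query volume, token count, and price used in the example are hypothetical placeholders, not Cohere figures.

```python
# Assumption: each figure is read as the fraction of tokens saved on the same text.
TOKEN_REDUCTION = {"Arabic": 0.20, "Korean": 0.16, "Japanese": 0.18}


def tokens_after(old_tokens: int, reduction: float) -> int:
    """Token count for the same text after the assumed reduction."""
    return round(old_tokens * (1 - reduction))


def monthly_savings(
    old_tokens_per_query: int,
    queries_per_month: int,
    cost_per_million_tokens: float,
    reduction: float,
) -> float:
    """Cost saved per month, in the same currency as the per-token price."""
    old_total = old_tokens_per_query * queries_per_month
    new_total = tokens_after(old_tokens_per_query, reduction) * queries_per_month
    return (old_total - new_total) / 1e6 * cost_per_million_tokens


# Hypothetical workload: 1,000 tokens per query, 1M queries a month, 1.00 per million tokens.
for language, reduction in TOKEN_REDUCTION.items():
    saved = monthly_savings(1_000, 1_000_000, 1.00, reduction)
    print(f"{language}: {tokens_after(1_000, reduction)} tokens per query, {saved:.2f} saved per month")
```

The absolute numbers depend entirely on the placeholder workload. The saving scales linearly with query volume, which is why a per-token gain adds up at enterprise scale.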
Where it’s genuinely strong
The benchmark Cohere is most confident about is the one that’s hardest to fake: τ²-Bench Telecom, which tests multi-step agentic task completion in realistic enterprise scenarios. Command A Reasoning scored 37% on it. Command A+ scores 85%. That’s not an incremental gain; it’s a different category of capability on the task the model was explicitly built for.
Terminal-Bench Hard went from 3% to 25%. That’s still not a number that makes Command A+ a coding specialist, but it reflects what happens when a model designed around real workflow completion gets properly trained on the full agentic loop rather than just code generation in isolation.
Multimodal reasoning is new to this model and the numbers are solid. MMMU Pro at 63%, MathVista at 80.6% up from 73.5% with Command A Vision, CharXiv reasoning at 52.7% up from 46.9%. Document understanding across charts, tables, and mixed-format files is where enterprise multimodal use actually lives, and these benchmarks test exactly that.
The multilingual part is also genuinely expanded. 48 languages versus 23 in the previous generation, with reasoning capability extending to Arabic, Japanese, and Korean in a way the earlier models didn’t support. Cohere tested this with an internal Arabic, Japanese, and Korean translation of AIME 2025, a mathematics benchmark, to verify that reasoning quality holds across languages, not just translation fluency. That’s a meaningful distinction for global enterprise deployments.
On the Artificial Analysis Intelligence Index, Command A+ scores 37, which Cohere says outperforms other leading open models. That index is a composite of general capability across tasks, and the score reflects a model that’s genuinely strong across multiple dimensions rather than optimized narrowly for one benchmark category.
What it doesn’t do well
General chat quality is not a priority here. If you’re evaluating this as a conversational assistant or a writing tool, the benchmarks will disappoint. That’s a design choice rather than a flaw in the model, but it’s worth being clear about before someone deploys it expecting a well-rounded assistant and gets a very capable but narrowly focused one instead.
The model also requires vLLM or Transformers for inference. That’s standard for open weights models at this scale, but enterprise teams running custom inference stacks should verify compatibility before assuming it drops into existing infrastructure cleanly.
Hardware is the other honest constraint. Two H100s is the minimum, and minimum specs in practice often mean acceptable performance rather than good performance. Teams expecting to run demanding agentic workflows at scale will likely need more than the floor. A single Blackwell GPU works too, but Blackwell hardware is still not cheap or widely available outside major cloud providers.
The agentic coding number, 25% on Terminal-Bench Hard, is better than its predecessor but still limited in absolute terms. For teams where coding is the primary use case, there are open models better suited to that specific task.
Who is this for
The Apache 2.0 license and the two H100 minimum spec are doing a lot of work here, and they’re pointing at the same customer.
Enterprise teams who need to keep data on their own infrastructure. Companies in regulated industries where sending queries to an external API isn’t an option. Organizations that have been told sovereign AI matters but haven’t had an open model with this capability profile available to actually deploy.
Command A+ is not trying to be the best general purpose chatbot. The useful part is agentic task completion, private deployment, multilingual reasoning, and multimodal document understanding, packaged into a single model that a team with two H100s can actually run.
For developers who want to try it before committing to infrastructure, the weights are on Hugging Face in BF16, FP8, and W4A4 quantizations. Cohere also has a free Space to test it and a managed inference option through Model Vault for teams that want enterprise-grade deployment without managing the hardware themselves.
The open source release also means the community gets visibility into how the model is built, something Cohere has been less forthcoming about in previous generations. Whether that translates into meaningful community contributions or just more informed evaluation remains to be seen.




