Mercury Diffusion LLM Promises 1,000+ Tokens per Second for Coding

Inception Labs says its Mercury diffusion language models can generate code at more than 1,000 tokens per second on H100 GPUs.

Inception Labs has introduced Mercury, a family of diffusion large language models designed to generate text and code in a different way from standard autoregressive LLMs.

Image: Inception Labs Mercury diffusion language model announcement. Source: Inception Labs official announcement.

What happened?

Inception Labs says Mercury is the first commercial-scale diffusion large language model family. Instead of generating text strictly left to right, diffusion language models refine an output through a coarse-to-fine process. The company argues that this approach can improve latency and cost for workloads where speed matters.

The first public model in the family is Mercury Coder, a code-focused diffusion LLM available to test in Inception's playground. The company says enterprise customers can access code and generalist models through an API or on-premise deployments.

The headline numbers

Inception Labs claims Mercury Coder can run at more than 1,000 tokens per second on NVIDIA H100 GPUs. In the company's published comparison table, Mercury Coder Mini is listed at 1,109 tokens per second and Mercury Coder Small at 737 tokens per second on a coding workload, so only the Mini figure clears the 1,000 mark.

The company says this makes Mercury 5-10x faster than speed-optimized autoregressive models in its comparisons, while maintaining competitive benchmark results on coding tests such as HumanEval, MBPP, EvalPlus, MultiPL-E, LiveCodeBench, BigCodeBench, and fill-in-the-middle tasks.

Why it matters for developers

AI coding tools are increasingly limited by latency, especially when agents need to plan, call tools, inspect files, revise code, and run multiple steps. If diffusion language models can deliver useful code output at much higher throughput, they could make subagents, autocomplete, code review, and large-scale refactoring feel more interactive.

For coding assistants: faster generation can reduce wait time during long edits or multi-file changes.
For agent workflows: low latency matters when each task requires several model calls.
For enterprises: Inception is positioning Mercury as a drop-in replacement for standard LLMs, with API, on-premise deployment, and fine-tuning support.
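
To make the agent-workflow point concrete, here is a back-of-the-envelope sketch in Python. It uses the two throughput figures from Inception's comparison table. The step count and tokens per step are hypothetical, and the estimate covers generation time only, ignoring network overhead, prompt processing, and tool execution.

```python
# Throughput figures (tokens per second) as reported in Inception Labs'
# comparison table. They are vendor claims, not independent measurements.
REPORTED_TOKENS_PER_SECOND = {
    "Mercury Coder Mini": 1109.0,
    "Mercury Coder Small": 737.0,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to emit num_tokens at a steady rate (generation time only)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def agent_task_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total generation time for an agent making `steps` sequential model calls."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


if __name__ == "__main__":
    # Hypothetical workload: 8 sequential calls of 400 output tokens each.
    for name, tps in REPORTED_TOKENS_PER_SECOND.items():
        total = agent_task_seconds(steps=8, tokens_per_step=400, tokens_per_second=tps)
        print(f"{name}: {total:.2f} s of generation time")
```

Under these assumed numbers, the eight-step task needs roughly 2.9 seconds of generation at the reported Mini rate and roughly 4.3 seconds at the reported Small rate. Because the calls are sequential, any per-call slowdown is multiplied by the number of steps.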

What to verify before adopting

The performance numbers are from Inception Labs' own announcement and technical materials, so teams should run their own evaluations before moving production workloads. The most important checks are code correctness, benchmark relevance, tool-calling behavior, latency under real traffic, pricing, security requirements, and integration compatibility.
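
One way to start on the latency check is a small timing harness, sketched below in Python. It does not assume anything about Inception's API: `generate` is a hypothetical adapter that you would write around whatever client you use, taking a prompt and returning the number of output tokens produced. The `clock` parameter exists only so the harness can be tested with a fake timer.

```python
import time
from statistics import median
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the median end-to-end tokens per second across the prompts.

    `generate` is a hypothetical adapter: it sends one prompt to the model
    under test and returns the count of output tokens it received.
    """
    rates = []
    for prompt in prompts:
        start = clock()
        tokens = generate(prompt)
        elapsed = clock() - start
        if elapsed > 0:  # skip samples too fast for the timer to resolve
            rates.append(tokens / elapsed)
    if not rates:
        raise ValueError("no timed samples collected")
    return median(rates)
```

Because this times the whole call, the result includes network and queueing delay and will usually come in below a vendor's raw decode figure. Running it with prompts drawn from your own codebase, and under concurrent load, gives a fairer comparison than a single request.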

Still, Mercury is worth watching because it is not just another model checkpoint. It represents a different generation method for language models, and that could matter if AI coding systems continue moving toward many fast model calls instead of a few long responses.
