Introducing Mercury, the first commercial-scale diffusion large language model

We trained diffusion large language models that are up to 10x faster and cheaper than current LLMs, pushing the frontier of intelligence and speed for language models.

Takeaways

1. We are announcing the Mercury family of diffusion large language models (dLLMs), a new generation of LLMs that push the frontier of fast, high-quality text generation.

2. Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips.

3. A code generation model, Mercury Coder, is available to test in a playground. We offer enterprise clients access to code and generalist models via an API and on-premise deployments.

Our Vision — Next Generation LLMs Powered By Diffusion

Current large language models are autoregressive, meaning that they generate text left to right, one token at a time. Generation is inherently sequential—a token cannot be generated until all the text that comes before it has been generated—and generating each token requires evaluating a neural network with billions of parameters. Frontier LLM companies are betting on test-time computation to increase reasoning and error-correction capabilities, but generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. A paradigm shift is needed to make high-quality AI solutions truly accessible.
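To make the sequential bottleneck concrete before turning to that shift, here is a minimal sketch of greedy autoregressive decoding. It is illustrative only: `model` is a hypothetical Transformer that maps a token sequence to per-position logits, and the loop structure is the point, since every new token costs one more full forward pass through the network.

```python
# Minimal sketch of greedy autoregressive decoding (illustrative only).
# `model` is a hypothetical Transformer returning logits of shape
# (sequence_length, vocab_size).
import torch

def autoregressive_generate(model, prompt_ids, max_new_tokens, eos_id=None):
    tokens = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # one full network evaluation per token
        next_id = logits[-1].argmax()             # greedy pick of the next token
        tokens = torch.cat([tokens, next_id.view(1)])
        if eos_id is not None and next_id.item() == eos_id:
            break                                 # stop at end-of-sequence
    return tokens
```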

Diffusion models provide such a paradigm shift. These models operate with a “coarse-to-fine” generation process, where the output is refined from pure noise over a few “denoising” steps, as illustrated in the video above.

Because diffusion models are not restricted to only considering previous output, they are better at reasoning and at structuring their responses. And because diffusion models can continually refine their outputs, they can correct mistakes and hallucinations. For these reasons, diffusion powers all of the most prominent AI solutions for video, image, and audio generation, including Sora, Midjourney, and Riffusion. However, applications of diffusion to discrete data such as text and code have never been successful. Until now.

Mercury Coder — Frontier Intelligence at 1000+ Tokens per Second

We are excited to announce Mercury Coder, our first publicly available dLLM.

Mercury Coder pushes the frontier of AI capabilities: it is 5-10x faster than the current generation of LLMs, providing high-quality responses at low costs. Our work builds on breakthrough research from our founders, who pioneered the first diffusion models for images and co-invented core generative AI techniques such as Direct Preference Optimization, Flash Attention, and Decision Transformers.


A dLLM is a drop-in replacement for a typical autoregressive LLM, supporting all of its use cases, including RAG, tool use, and agentic workflows. When prompted with a query, a dLLM does not produce the answer one token at a time; instead, it generates the answer in a coarse-to-fine way, as illustrated in the animation above. Improvements are suggested by a neural network, in our case a Transformer model, which is trained on large amounts of data to globally improve the quality of the answer by modifying multiple tokens in parallel.
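As a rough illustration of this coarse-to-fine process (not Inception's actual training or sampling algorithm), one common family of masked discrete-diffusion samplers starts from a fully masked answer and, at each step, lets a Transformer propose every token in parallel, committing only the most confident positions. In the sketch below, `denoiser` and `MASK_ID` are hypothetical assumptions, not parts of our released system.

```python
# Conceptual sketch of masked discrete-diffusion decoding (illustrative only;
# not Inception's actual algorithm). `denoiser` is a hypothetical Transformer
# returning logits of shape (sequence_length, vocab_size).
import torch

MASK_ID = 0  # hypothetical id of a special [MASK] token

def diffusion_generate(denoiser, prompt_ids, answer_len, num_steps=8):
    answer = torch.full((answer_len,), MASK_ID)            # start from "pure noise"
    for step in range(num_steps):
        logits = denoiser(torch.cat([prompt_ids, answer]))[len(prompt_ids):]
        proposal = logits.argmax(dim=-1)                    # all positions proposed in parallel
        confidence = logits.softmax(dim=-1).max(dim=-1).values
        # Coarse-to-fine schedule: commit a growing fraction of the most
        # confident positions, leaving the rest masked for later refinement.
        k = max(1, answer_len * (step + 1) // num_steps)
        chosen = torch.zeros(answer_len, dtype=torch.bool)
        chosen[confidence.topk(k).indices] = True
        answer = torch.where(chosen & (answer == MASK_ID), proposal, answer)
    return answer
```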


Mercury Coder is a dLLM specifically optimized for code generation. When evaluated on standard coding benchmarks, Mercury Coder achieves excellent quality across numerous benchmarks, often surpassing the performance of speed-optimized autoregressive models like GPT-4o Mini and Claude 3.5 Haiku while being up to 10x faster.

| Model | Throughput (toks/sec) | HumanEval | MBPP | EvalPlus | MultiPL-E | LiveCodeBench | BigCodeBench | Fill-in-the-Middle |
|---|---|---|---|---|---|---|---|---|
| Mercury Coder Mini | 1109 | 88.0 | 77.1 | 78.6 | 74.1 | 17.0 | 42.0 | 82.2 |
| Mercury Coder Small | 737 | 90.0 | 76.6 | 80.4 | 76.2 | 25.0 | 45.5 | 84.8 |
| Gemini 2.0 Flash-Lite | 201 | 90.0 | 75.0 | 77.3 | 79.5 | 18.0 | 44.4 | 60.1 |
| Claude 3.5 Haiku | 61 | 86.0 | 78.0 | 75.1 | 72.3 | 31.0 | 45.4 | 45.5 |
| GPT-4o Mini | 59 | 88.0 | 74.6 | 78.5 | 72.0 | 23.0 | 46.8 | 60.9 |
| Qwen 2.5 Coder 7B | 207 | 90.0 | 80.0 | 79.3 | 75.3 | 9.0 | 41.4 | 56.1 |
| DeepSeek Coder V2 Lite | 92 | 92.1 | 81.0 | 82.1 | 79.1 | 37.8 | 50.0 | 46.9 |


What sets dLLMs apart is their speed. Even speed-optimized autoregressive models top out at around 200 tokens per second, while we serve Mercury Coder on commodity NVIDIA H100s at over 1000 tokens per second, a 5x increase. Compared with some frontier models, which can run at less than 50 tokens per second, the speedup exceeds 20x.


The throughput achieved by dLLMs was previously attainable only on specialized hardware from providers such as Groq, Cerebras, and SambaNova. Our algorithmic improvements are orthogonal to hardware acceleration, and the speedups would compound on faster chips.

[Chart: speed comparison, output tokens per second on a coding workload]

We are also excited to report that developers prefer Mercury’s code completions compared to existing code models. When benchmarked on Copilot Arena, Mercury Coder Mini is tied for second place, surpassing the performance of speed-optimized models like GPT-4o Mini and Gemini-1.5-Flash and even of larger models like GPT-4o. At the same time, it is the fastest model, about 4 times faster than GPT-4o Mini.


We invite you to explore the capabilities of our models firsthand in our playground, hosted in partnership with Lambda Labs. Experience Mercury Coder's accuracy in generating high-quality code in a fraction of the time, as demonstrated in the video below. 

What this means for AI applications

Our early adopters, which include market leaders in customer support, code generation, and enterprise automation, are successfully replacing standard autoregressive base models with our dLLMs as drop-in replacements. This translates into better user experiences and reduced costs. In latency-sensitive applications, our partners were often constrained to use smaller, less capable models to meet strict latency requirements. Thanks to dLLMs' superior speed, these partners can now use larger, more capable models while adhering to their original cost and latency requirements.


We offer access to our models through an API and via on-premise deployments. Our models are fully compatible with existing hardware, datasets, and supervised fine-tuning (SFT) and alignment (RLHF) pipelines. Fine-tuning support is available for both deployment options.
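For illustration, calling a dLLM from an existing application could look like any other hosted-model request. The sketch below is hypothetical: the endpoint URL, model name, and JSON fields are placeholders, not the documented Mercury API.

```python
# Hypothetical request sketch; the URL, model name, and JSON schema are
# placeholders, not the documented Mercury API.
import requests

response = requests.post(
    "https://api.example.com/v1/chat/completions",   # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mercury-coder-small",               # placeholder model name
        "messages": [
            {"role": "user", "content": "Write a function that checks if a string is a palindrome."}
        ],
        "max_tokens": 256,
    },
    timeout=30,
)
print(response.json())
```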


Please sign up below to get early access to our API and reach out to sales@inceptionlabs.ai to discuss how dLLMs can transform your genAI applications.


Sign up for API access


What’s next?

Mercury Coder is the first in a series of upcoming dLLMs. A model designed for chat applications is in closed beta.


Diffusion language models will unlock a new set of capabilities for LLMs, including:


  1. Improved agents — dLLMs' speed and efficiency make them ideal for agentic applications that require extensive planning and lengthy generation.

  2. Advanced reasoning — dLLMs can leverage error correction to fix hallucinations and improve answers while still thinking in seconds, unlike current autoregressive reasoning models that take minutes.

  3. Controllable generation — dLLMs can edit their output and generate tokens in any order, allowing users to infill text, align outputs with objectives like safety, or produce outputs that reliably conform to user-specified formats (a minimal infilling sketch follows this list).

  4. Edge applications — Given their efficiency, dLLMs excel in resource-constrained environments such as edge deployments on phones and laptops.
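As a minimal illustration of the controllable-generation point above, and under the same assumptions as the earlier denoising sketch (`denoiser` and the mask token are hypothetical), fill-in-the-middle amounts to masking only the gap and keeping the surrounding context fixed while the model refines the masked span.

```python
# Conceptual infilling sketch (illustrative only): mask the gap, keep the
# prefix and suffix fixed, and let the hypothetical `denoiser` refine only
# the masked span over a few steps.
import torch

def infill(denoiser, prefix_ids, suffix_ids, gap_len, mask_id=0, num_steps=8):
    gap = torch.full((gap_len,), mask_id)
    for _ in range(num_steps):
        sequence = torch.cat([prefix_ids, gap, suffix_ids])
        logits = denoiser(sequence)[len(prefix_ids):len(prefix_ids) + gap_len]
        gap = logits.argmax(dim=-1)    # refine the gap; the context never changes
    return gap
```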

Mercury Coder is available for testing in our playground.