AI Inference

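The era of judging performance by raw speed alone is over. What matters now is throughput, efficiency, and overall economics at scale. As AI evolves from giving one-shot answers to performing multi-step reasoning, demand for inference, and pressure on the economics behind it, keeps rising. Because each query now generates far more tokens, this shift significantly increases compute demand. Alongside overall throughput, metrics such as tokens per watt, cost per million tokens, and tokens per second per user are just as important. For power-constrained AI factories, NVIDIA's continuous software optimizations translate into higher token revenue over time, which further underscores the value of our ongoing technology development.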
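
The three metrics named above are simple ratios, and a short sketch can make the relationships concrete. The following Python example is illustrative only: the `Deployment` class, its field names, and every input number are hypothetical and do not come from the tables on this page.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical deployment figures (not taken from this page)."""
    tokens_per_sec: float      # aggregate output throughput
    power_watts: float         # average system power draw
    cost_per_hour_usd: float   # assumed fully loaded hourly cost
    concurrent_users: int      # requests being decoded at the same time


def tokens_per_watt(d: Deployment) -> float:
    # tokens/sec divided by watts, i.e. tokens per joule
    return d.tokens_per_sec / d.power_watts


def cost_per_million_tokens(d: Deployment) -> float:
    tokens_per_hour = d.tokens_per_sec * 3600
    return d.cost_per_hour_usd / tokens_per_hour * 1_000_000


def tokens_per_sec_per_user(d: Deployment) -> float:
    return d.tokens_per_sec / d.concurrent_users


# All values below are made up for illustration.
example = Deployment(
    tokens_per_sec=100_000,
    power_watts=50_000,
    cost_per_hour_usd=180.0,
    concurrent_users=2_000,
)
print(f"tokens per watt:         {tokens_per_watt(example):.2f}")
print(f"cost per million tokens: ${cost_per_million_tokens(example):.2f}")
print(f"tokens/sec per user:     {tokens_per_sec_per_user(example):.1f}")
```

Under a fixed power budget, any software improvement that raises `tokens_per_sec` improves the first two ratios at once, which is the point the paragraph makes about power-constrained deployments.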

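The Pareto curve shows how NVIDIA Blackwell balances the full range of production priorities: cost, energy efficiency, throughput, and responsiveness. Optimizing a system for a single scenario tends to reduce deployment flexibility and leads to inefficiency at other points on the curve. NVIDIA's full-stack design approach delivers both efficiency and value across many real-world production scenarios. Blackwell's leadership stems from deep hardware-software co-design and reflects a full-stack architecture built for speed, efficiency, and scalability.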
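
To make the idea of a Pareto curve concrete, here is a minimal sketch of how a frontier can be extracted from a set of measured operating points. It assumes each point is a pair (tokens/sec per user, tokens/sec per GPU) with higher being better on both axes. The function name and the sample points are hypothetical and are not measurements from this page.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    A point is dominated when another point is at least as good on both
    axes and strictly better on at least one of them.
    """
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Hypothetical operating points: (tokens/sec per user, tokens/sec per GPU).
configs = [(20, 900), (50, 600), (50, 400), (100, 250), (80, 200), (150, 80)]
print(pareto_frontier(configs))
# [(20, 900), (50, 600), (100, 250), (150, 80)]
```

A configuration tuned only for the high-throughput end, such as `(20, 900)` here, sits on the frontier but says nothing about how the same system behaves at high interactivity, which is why the whole curve is compared rather than a single point.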

In this blog post, learn how Mixture of Experts powers smarter frontier AI models and runs up to 10x faster on NVIDIA Blackwell NVL72.

Learn about the methodology used to obtain these results, and how to reproduce the tests by running the Benchmarking Recipes yourself.

MLPerf Inference v6.0 Performance Benchmarks

Offline Scenario - Closed Division

Network	Throughput	GPU	Server	GPU Version	QSL Size	Target Accuracy	Dataset
DeepSeek R1	2,494,310 tokens/sec	288x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	4388	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
	486,141 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	4388	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
	70,326 tokens/sec	8x B300	NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	4388	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
	58,582 tokens/sec	8x B200	Nebius B200 n1 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	4388	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
gpt-oss 120B	1,046,150 tokens/sec	72x GB300	Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	6396	99% of 83.13%	AIME25, GPQA Diamond, LiveCodeBench v6
	879,542 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	6396	99% of 83.13%	AIME25, GPQA Diamond, LiveCodeBench v6
	111,496 tokens/sec	8x B300	Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT)	NVIDIA B300	6396	99% of 83.13%	AIME25, GPQA Diamond, LiveCodeBench v6
	93,071 tokens/sec	8x B200	LLM-D v0.5.0,Openshift 4.20.12,NVIDIA 8xB200-SXM-180GB	NVIDIA B200	6396	99% of 83.13%	AIME25, GPQA Diamond, LiveCodeBench v6
Qwen3-VL 235B	61 tokens/sec	4x GB300	NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	Shopify Product Catalogue
	44 tokens/sec	4x GB200	NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	Shopify Product Catalogue
	78 tokens/sec	8x B300	Nebius B300 n1 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	Shopify Product Catalogue
	79 tokens/sec	8x B200	Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0	NVIDIA B200	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	Shopify Product Catalogue
Llama3.1 405B	19,512 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	15,462 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,971 tokens/sec	8x B300	Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT)	NVIDIA B300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,350 tokens/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	1,126,850 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	OpenOrca (max_seq_len=1024)
	888,054 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	OpenOrca (max_seq_len=1024)
	112,954 tokens/sec	8x B300	NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	OpenOrca (max_seq_len=1024)
	104,572 tokens/sec	8x B200	HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT)	NVIDIA B200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	OpenOrca (max_seq_len=1024)
Llama3.1 8B	166,745 tokens/sec	8x B300	XA NB3I-E12	NVIDIA B300	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	CNN Dailymail (v3.0.0, max_seq_len=2048)
	160,403 tokens/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	CNN Dailymail (v3.0.0, max_seq_len=2048)
Wan2.2	0.037 samples/sec	4x GB300	NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	248	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	VBench prompts
	0.027 samples/sec	4x GB200	NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	248	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	VBench prompts
	0.059 samples/sec	8x B300	NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	248	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	VBench prompts
	0.046 samples/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	248	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	VBench prompts
DLRMv3	104,637 samples/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	34996	99% of FP32 and 99.9% of FP32 (AUC=80.31%)	Synthetic Streaming 100B Dataset
	10,737 samples/sec	8x B200	Camarero PDI200A2HG-810 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	34996	99% of FP32 and 99.9% of FP32 (AUC=80.31%)	Synthetic Streaming 100B Dataset
Whisper	50,562 samples/sec	8x B300	NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	1633	99% of FP32 and 99.9% of FP32 (WER=2.0671%)	LibriSpeech
	49,327 samples/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	1633	99% of FP32 and 99.9% of FP32 (WER=2.0671%)	LibriSpeech
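
Because the rows above mix 72-GPU racks with 8-GPU servers, it can help to normalize throughput per GPU before comparing. The sketch below does this for the Llama2 70B Offline rows, using the GPU counts and tokens/sec values copied from the table. The short system labels and the helper name are illustrative. Treat the result as a rough comparison only, since the rows come from different submitters and systems.

```python
# (label, GPU count, tokens/sec) copied from the Llama2 70B Offline rows above.
llama2_70b_offline = [
    ("GB300 NVL72", 72, 1_126_850),
    ("GB200 NVL72", 72, 888_054),
    ("8x B300", 8, 112_954),
    ("8x B200", 8, 104_572),
]


def per_gpu(rows):
    """Map each system label to its throughput divided by its GPU count."""
    return {name: total / gpus for name, gpus, total in rows}


rates = per_gpu(llama2_70b_offline)
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} tokens/sec/GPU")

# Ratio of per-GPU throughput between the two NVL72 entries.
speedup = rates["GB300 NVL72"] / rates["GB200 NVL72"]
print(f"GB300 vs GB200 per GPU: {speedup:.2f}x")
```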

Server Scenario - Closed Division

Network	Throughput	GPU	Server	GPU Version	QSL Size	Target Accuracy	MLPerf Server Latency Constraints (ms)	Dataset
DeepSeek R1	1,555,110 tokens/sec	288x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
	336,106 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
	60,413 tokens/sec	8x B300	Nebius B300 n1 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
	51,693 tokens/sec	8x B200	Nebius B200 n1 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
gpt-oss 120B	1,096,770 tokens/sec	72x GB300	Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	6396	99% of 83.13%	TTFT/TPOT: 3000 ms/80 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	899,218 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	6396	99% of 83.13%	TTFT/TPOT: 3000 ms/80 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	110,655 tokens/sec	8x B300	Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT)	NVIDIA B300	6396	99% of 83.13%	TTFT/TPOT: 3000 ms/80 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	87,444 tokens/sec	8x B200	Nebius B200 n1 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	6396	99% of 83.13%	TTFT/TPOT: 3000 ms/80 ms	AIME25, GPQA Diamond, LiveCodeBench v6
Qwen3-VL 235B	43 tokens/sec	4x GB300	Nebius GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	12 s	Shopify Product Catalogue
	38 tokens/sec	4x GB200	NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	12 s	Shopify Product Catalogue
	45 tokens/sec	8x B300	Nebius B300 n1 (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	12 s	Shopify Product Catalogue
	68 tokens/sec	8x B200	Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0	NVIDIA B200	48289	99% of BF16 (Category Hierarchical F1 Score >= 0.7824)	12 s	Shopify Product Catalogue
Llama3.1 405B	18,628 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	14,134 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,484 tokens/sec	8x B300	QuantaGrid D75H-10U (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	984 tokens/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	868,278 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	810,104 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	108,392 tokens/sec	8x B300	PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	103,627 tokens/sec	8x B200	HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT)	NVIDIA B200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
Llama3.1 8B	148,067 tokens/sec	8x B300	XA NB3I-E12	NVIDIA B300	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	TTFT/TPOT: 2000 ms/100 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
	131,270 tokens/sec	8x B200	HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT)	NVIDIA B200	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	TTFT/TPOT: 2000 ms/100 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
Wan2.2**	31 seconds	4x GB300	NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	248	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	N/A	VBench prompts
	40 seconds	4x GB200	NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	248	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	N/A	VBench prompts
	21 seconds	8x B300	G894-SD3-AAX7	NVIDIA B300	248	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	N/A	VBench prompts
	25 seconds	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	248	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	N/A	VBench prompts
DLRMv3	99,997 queries/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	34996	99% of FP32 (AUC=80.31%)	80 ms	Synthetic Streaming 100B Dataset
	10,007 queries/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	34996	99% of FP32 (AUC=80.31%)	80 ms	Synthetic Streaming 100B Dataset
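
The TTFT/TPOT pairs in the latency column can be read as a per-user speed floor. The sketch below converts a TPOT limit (time per output token) into the implied minimum tokens/sec per user, and combines it with TTFT (time to first token) into an upper bound on response time. It treats both limits as simple per-request bounds, which is a simplifying assumption rather than a description of how MLPerf evaluates them. The function names and the 1,000-token response length are hypothetical.

```python
def min_tokens_per_sec_per_user(tpot_ms: float) -> float:
    """Decode rate implied by a time-per-output-token limit."""
    return 1000.0 / tpot_ms


def response_time_bound_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Bound on total response time: the first token arrives within TTFT,
    and each of the remaining tokens arrives within TPOT."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


# Constraints copied from the Server table above.
print(min_tokens_per_sec_per_user(80))    # DeepSeek R1: 12.5 tokens/sec per user
print(min_tokens_per_sec_per_user(200))   # Llama2 70B: 5.0 tokens/sec per user

# Hypothetical 1,000-token response under the DeepSeek R1 Server limits.
print(response_time_bound_ms(2000, 80, 1000))  # 81920 ms
```

Read this way, a tighter TPOT limit such as the 15 ms DeepSeek R1 limit in the Interactive table corresponds to a much higher per-user decode rate (about 67 tokens/sec), which helps explain why throughput under those constraints is lower.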

Interactive Scenario - Closed Division

Network	Throughput	GPU	Server	GPU Version	QSL Size	Target Accuracy	MLPerf Server Latency Constraints (ms)	Dataset
DeepSeek R1	250,634 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 1500 ms/15 ms	mlperf_deepseek_r1
	240,318 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 1500 ms/15 ms	mlperf_deepseek_r1
	4,935 tokens/sec	8x B300	G894-SD3-AAX7	NVIDIA B300	4388	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 1500 ms/15 ms	mlperf_deepseek_r1
gpt-oss 120B	677,199 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	6396	99% of 83.13%	TTFT/TPOT: 2000 ms/20 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	624,929 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	6396	99% of 83.13%	TTFT/TPOT: 2000 ms/20 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	26,006 tokens/sec	8x B300	XA NB3I-E12	NVIDIA B300	6396	99% of 83.13%	TTFT/TPOT: 2000 ms/20 ms	AIME25, GPQA Diamond, LiveCodeBench v6
	13,155 tokens/sec	8x B200	Nebius B200 n1 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	6396	99% of 83.13%	TTFT/TPOT: 2000 ms/20 ms	AIME25, GPQA Diamond, LiveCodeBench v6
Llama3.1 405B	18,365 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	14,010 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	765 tokens/sec	8x B300	G894-SD3-AAX7	NVIDIA B300	8313	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	814,128 tokens/sec	72x GB300	NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT)	NVIDIA GB300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	754,855 tokens/sec	72x GB200	NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT)	NVIDIA GB200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	70,724 tokens/sec	8x B300	PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT)	NVIDIA B300	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	61,300 tokens/sec	8x B200	HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT)	NVIDIA B200	24576	99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
Llama3.1 8B	128,633 tokens/sec	8x B300	G894-SD3-AAX7	NVIDIA B300	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	TTFT/TPOT: 500 ms/30 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
	128,750 tokens/sec	8x B200	NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT)	NVIDIA B200	13368	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644)	TTFT/TPOT: 500 ms/30 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)

**The primary metric for Wan2.2 in the Server Scenario is measured in seconds (lower is better).
MLPerf™ v6.0 Inference Closed Division. NVIDIA platform results from the following entries: 6.0-0006, 6.0-0010, 6.0-0024, 6.0-0039, 6.0-0040, 6.0-0048, 6.0-0062, 6.0-0072, 6.0-0073, 6.0-0074, 6.0-0075, 6.0-0076, 6.0-0077, 6.0-0078, 6.0-0080, 6.0-0081, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0089, 6.0-0091, 6.0-0094, 6.0-0098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	DEP4	1000	1000	5,764 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 235B A22B	DEP4	1024	8192	3,389 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 235B A22B	DEP4	1024	32768	1,255 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 235B A22B	DEP4	8192	1024	1,410 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 235B A22B	DEP4	32768	1024	319 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 30B A3B	TP1	1000	1000	26,971 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 30B A3B	TP1	1024	8192	13,497 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 30B A3B	TP1	1024	32768	4,494 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 30B A3B	TP1	8192	1024	5,735 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Qwen3 30B A3B	TP1	32768	1024	1,265 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Llama v4 Maverick	DEP4	1000	1000	11,337 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Llama v4 Maverick	DEP4	1024	8192	5,174 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Llama v4 Maverick	DEP4	1024	32768	2,204 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Llama v4 Maverick	DEP4	8192	1024	3,279 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
Llama v4 Maverick	DEP4	32768	1024	859 output tokens/sec/gpu	4x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
GPT-OSS 20B	TP1	1000	1000	53,812 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
GPT-OSS 20B	TP1	1024	8192	34,702 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
GPT-OSS 20B	TP1	1024	32768	14,589 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
GPT-OSS 20B	TP1	8192	1024	11,904 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200
GPT-OSS 20B	TP1	32768	1024	2,645 output tokens/sec/gpu	1x B200	DGX B200	FP4	TensorRT-LLM 1.1	NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
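
The Parallelism and GPU columns in these tables can be related with a small helper. The sketch below assumes that the GPU count equals the product of the degrees in the parallelism string, which agrees with the rows on this page (for example, `DEP4` on 4x B200 and `DEP2 PP2` on 4x RTX PRO 6000). It then multiplies a per-GPU throughput by that count to estimate the aggregate rate. Both function names are hypothetical and not part of any NVIDIA tool.

```python
import re


def gpus_from_parallelism(spec: str) -> int:
    """Multiply the degrees in a spec such as "TP1", "DEP4" or "DEP2 PP2"."""
    degrees = re.findall(r"(TP|PP|DEP)(\d+)", spec)
    if not degrees:
        raise ValueError(f"unrecognized parallelism spec: {spec!r}")
    count = 1
    for _, degree in degrees:
        count *= int(degree)
    return count


def aggregate_throughput(per_gpu_tokens_per_sec: float, spec: str) -> float:
    """Estimate total output tokens/sec from a per-GPU figure."""
    return per_gpu_tokens_per_sec * gpus_from_parallelism(spec)


print(gpus_from_parallelism("DEP2 PP2"))    # 4
print(gpus_from_parallelism("TP1"))         # 1
# Qwen3 235B A22B, 1000/1000, DEP4 on B200: 5,764 output tokens/sec/gpu.
print(aggregate_throughput(5_764, "DEP4"))  # 23056
```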

RTX PRO 6000 Blackwell Server Edition Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	DEP2 PP2	1000	1000	1,731 output tokens/sec/gpu	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B	DEP8	1024	8192	711 output tokens/sec/gpu	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 235B A22B	DEP2 PP2	32768	1024	70 output tokens/sec/gpu	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B	TP1	1000	1000	9,938 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B	TP1	1024	8192	3,621 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B	TP1	8192	1024	1,914 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Qwen3 30B A3B	TP1	32768	1024	374 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.1	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2	TP1	500	500	1,711 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2	TP1	1000	4000	790 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 9B v2	TP1	4000	1000	1,238 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	500	500	1,229 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	1000	4000	1,202 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	4000	1000	1,071 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B	TP1	500	500	6,616 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B	TP1	1000	4000	4,957 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Nemotron 3 Nano 30B	TP1	4000	1000	5,353 output tokens/sec/gpu	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 6000 Blackwell Server Edition

RTX PRO 4500 Blackwell Server Edition Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Nemotron Nano 9B v2	TP1	500	500	945 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2	TP1	1000	4000	410 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 9B v2	TP1	4000	1000	636 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	500	500	678 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	1000	4000	681 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition
Nemotron Nano 12B v2	TP1	4000	1000	566 output tokens/sec/gpu	1x RTX PRO 4500	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 1.2.0	NVIDIA RTX PRO 4500 Blackwell Server Edition

H200 Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	DEP4	1000	1000	3,288 output tokens/sec/gpu	4x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Qwen3 235B A22B	DEP4	1024	8192	1,417 output tokens/sec/gpu	4x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Qwen3 235B A22B	DEP4	8192	1024	627 output tokens/sec/gpu	4x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Qwen3 235B A22B	DEP4	32768	1024	134 output tokens/sec/gpu	4x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Llama v4 Maverick	DEP8	1000	1000	4,146 output tokens/sec/gpu	8x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Llama v4 Maverick	DEP8	1024	8192	1,157 output tokens/sec/gpu	8x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Llama v4 Maverick	DEP8	1024	32768	679 output tokens/sec/gpu	8x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
Llama v4 Maverick	DEP8	8192	1024	1,276 output tokens/sec/gpu	8x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
GPT-OSS 20B	TP1	1000	1000	13,858 output tokens/sec/gpu	1x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
GPT-OSS 20B	TP1	1024	8192	12,743 output tokens/sec/gpu	1x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
GPT-OSS 20B	TP1	1024	32768	output tokens/sec/gpu	1x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
GPT-OSS 20B	TP1	8192	1024	4,015 output tokens/sec/gpu	1x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200
GPT-OSS 20B	TP1	32768	1024	9,154 output tokens/sec/gpu	1x H200	DGX H200	FP8	TensorRT-LLM 1.1	NVIDIA H200

H100 Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	DEP8	1000	1000	1,932 output tokens/sec/gpu	8x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB
Qwen3 235B A22B	DEP8	1024	8192	873 output tokens/sec/gpu	8x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB
GPT-OSS 20B	TP1	1000	1000	11,557 output tokens/sec/gpu	1x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB
GPT-OSS 20B	TP1	1024	8192	8,617 output tokens/sec/gpu	1x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB
GPT-OSS 20B	TP1	8192	1024	3,366 output tokens/sec/gpu	1x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB
GPT-OSS 20B	TP1	32768	1024	785 output tokens/sec/gpu	1x H100	DGX H100	FP8	TensorRT-LLM 1.1	H100-SXM5-80GB

L40S Inference Performance

Model	Parallelism	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v4 Scout	TP2 PP2	128	2048	1,105 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP2 PP2	128	4096	707 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP4	2048	128	561 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP4	5000	500	307 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP2 PP2	500	2000	1,093 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP2 PP2	1000	1000	920 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP2 PP2	1000	2000	884 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	TP2 PP2	2048	2048	615 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP4	128	2048	1,694 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP2 PP2	128	4096	972 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP4	500	2000	1,413 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP4	1000	1000	1,498 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP4	1000	2000	1,084 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	TP4	2048	2048	773 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	128	128	8,471 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	128	4096	2,888 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	2048	128	1,017 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	5000	500	863 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	500	2000	4,032 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	1000	2000	3,134 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	2048	2048	2,148 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	TP1	20000	2000	280 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
