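The era of judging performance by raw speed alone is over; what matters now is throughput, efficiency, and overall economics at scale. As AI evolves from giving one-shot answers to performing multi-step reasoning, demand for inference, and for favorable economics behind it, keeps rising. Because each query now generates far more tokens, this shift significantly increases compute requirements. Beyond aggregate throughput, metrics such as tokens per watt, cost per million tokens, and tokens per second per user are equally critical. For power-constrained AI factories, NVIDIA's continuous software optimizations translate into higher token revenue over time, which underscores the value of our ongoing technology advances.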
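As a rough illustration of how these metrics relate to one another, the sketch below derives them from an aggregate throughput figure. The function names and all input values (power draw, hourly cost, concurrent users) are hypothetical placeholders chosen for round numbers, not measured NVIDIA results.

```python
def tokens_per_watt(tokens_per_sec: float, power_watts: float) -> float:
    """Tokens generated per second for each watt drawn (equals tokens per joule)."""
    return tokens_per_sec / power_watts


def cost_per_million_tokens(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


def tokens_per_sec_per_user(tokens_per_sec: float, concurrent_users: int) -> float:
    """Average generation rate seen by each concurrent user."""
    return tokens_per_sec / concurrent_users


# Hypothetical deployment, for illustration only.
THROUGHPUT = 100_000.0   # aggregate tokens/sec
POWER_WATTS = 10_000.0   # system power draw in watts
COST_PER_HOUR = 36.0     # total hourly cost, in any currency
USERS = 2_000            # concurrent users

efficiency = tokens_per_watt(THROUGHPUT, POWER_WATTS)
unit_cost = cost_per_million_tokens(THROUGHPUT, COST_PER_HOUR)
interactivity = tokens_per_sec_per_user(THROUGHPUT, USERS)

print(f"{efficiency:.1f} tokens/sec/W")
print(f"{unit_cost:.2f} per million tokens")
print(f"{interactivity:.1f} tokens/sec/user")
```

With a fixed power budget, any software gain in throughput raises tokens per watt and lowers cost per million tokens in the same proportion, which is why the three metrics are usually tracked together.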
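The Pareto curve shows how NVIDIA Blackwell balances the full range of production priorities: cost, energy efficiency, throughput, and responsiveness. Optimizing a system for a single scenario tends to limit deployment flexibility and leads to inefficiency at other points on the curve. NVIDIA's full-stack design approach delivers efficiency and value across many real-world production scenarios. Blackwell's leadership stems from deep hardware-software co-design, reflecting a full-stack architecture built for speed, efficiency, and scalability.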
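To make the idea of a Pareto curve concrete, here is a minimal sketch that extracts the Pareto-optimal operating points from a set of candidate configurations. Each point is a pair (tokens/sec/user, tokens/sec/GPU), where higher is better on both axes. The function name and the sample points are hypothetical and are not taken from any NVIDIA measurement.

```python
from typing import List, Tuple

# (interactivity in tokens/sec/user, throughput in tokens/sec/GPU)
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes."""
    # Walk from the most interactive point downward; a point is kept only
    # if it offers more throughput than every more-interactive point.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier: List[Point] = []
    best_throughput = float("-inf")
    for interactivity, throughput in ordered:
        if throughput > best_throughput:
            frontier.append((interactivity, throughput))
            best_throughput = throughput
    return frontier


# Hypothetical operating points from sweeping batch size and parallelism.
candidates = [(10, 900), (50, 600), (50, 400), (100, 300), (80, 250), (200, 100)]
frontier = pareto_frontier(candidates)
print(frontier)
```

Points such as (80, 250) drop out because another configuration, here (100, 300), is better on both axes. A system tuned for only one end of the frontier gives up the flexibility to serve the other operating points efficiently.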
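In this blog, you can learn how Mixture of Experts powers smarter frontier AI models and how they run up to 10x faster on NVIDIA Blackwell NVL72.
Learn about the methodology used to obtain these results, and how to reproduce the tests by running the Benchmarking Recipes yourself.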
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 2,494,310 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 486,141 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 70,326 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 58,582 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| gpt-oss 120B | 1,046,150 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 879,542 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 111,496 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 93,071 tokens/sec | 8x B200 | LLM-D v0.5.0,Openshift 4.20.12,NVIDIA 8xB200-SXM-180GB | NVIDIA B200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 61 tokens/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Qwen3-VL 235B | 44 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Qwen3-VL 235B | 78 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Qwen3-VL 235B | 79 tokens/sec | 8x B200 | Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Llama3.1 405B | 19,512 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 15,462 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,971 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,350 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 1,126,850 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 888,054 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 112,954 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 104,572 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 166,745 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 160,403 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Wan2.2 | 0.037 samples/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| Wan2.2 | 0.027 samples/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| Wan2.2 | 0.059 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| Wan2.2 | 0.046 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| DLRMv3 | 104,637 samples/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| DLRMv3 | 10,737 samples/sec | 8x B200 | Camarero PDI200A2HG-810 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| Whisper | 50,562 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 49,327 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
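
The systems in this table range from 8 to 288 GPUs, so system-level throughput is not directly comparable across rows. One simple way to compare them is to divide by the GPU count, as sketched below for the DeepSeek R1 rows. This per-GPU view is a derived figure, not an official MLPerf metric, and it ignores differences in submitter, software stack, and interconnect.

```python
# (system, total tokens/sec, GPU count) copied from the DeepSeek R1 rows above.
DEEPSEEK_R1_ROWS = [
    ("GB300 NVL72", 2_494_310, 288),
    ("GB200 NVL72", 486_141, 72),
    ("DGX B300", 70_326, 8),
    ("Nebius B200 n1", 58_582, 8),
]


def per_gpu(total_throughput: float, gpu_count: int) -> float:
    """Normalize system throughput by the number of GPUs."""
    return total_throughput / gpu_count


per_gpu_results = {
    name: per_gpu(total, gpus) for name, total, gpus in DEEPSEEK_R1_ROWS
}

for name, value in per_gpu_results.items():
    print(f"{name}: {value:,.0f} tokens/sec/GPU")
```
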
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 1,555,110 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 336,106 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 60,413 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 51,693 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| gpt-oss 120B | 1,096,770 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 899,218 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 110,655 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 87,444 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 43 tokens/sec | 4x GB300 | Nebius GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Qwen3-VL 235B | 38 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Qwen3-VL 235B | 45 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Qwen3-VL 235B | 68 tokens/sec | 8x B200 | Dell B200,8xB200-SXM-180GB,RHEL 10.1,vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Llama3.1 405B | 18,628 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 14,134 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,484 tokens/sec | 8x B300 | QuantaGrid D75H-10U (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 984 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 868,278 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 810,104 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 108,392 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 103,627 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 148,067 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 131,270 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Wan2.2** | 31 seconds | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| Wan2.2** | 40 seconds | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| Wan2.2** | 21 seconds | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 248 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | N/A | VBench prompts |
| Wan2.2** | 25 seconds | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| DLRMv3 | 99,997 queries/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |
| DLRMv3 | 10,007 queries/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |
| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 250,634 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 240,318 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 4,935 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf_deepseek_r1 |
| gpt-oss 120B | 677,199 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 624,929 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 26,006 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| gpt-oss 120B | 13,155 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Llama3.1 405B | 18,365 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 14,010 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 765 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 814,128 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 754,855 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 70,724 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 61,300 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 128,633 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 128,750 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
** For Wan2.2 in the Server scenario, the primary metric is measured in seconds (lower is better).
MLPerf™ v6.0 Inference Closed Division. NVIDIA platform results from the following entries: 6.0-0006, 6.0-0010, 6.0-0024, 6.0-0039, 6.0-0040, 6.0-0048, 6.0-0062, 6.0-0072, 6.0-0073, 6.0-0074, 6.0-0075, 6.0-0076, 6.0-0077, 6.0-0078, 6.0-0080, 6.0-0081, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0089, 6.0-0091, 6.0-0094, 6.0-0098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For data on the various MLPerf™ scenarios, click here.
For MLPerf™ latency constraints, click here.
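
To see what the TTFT/TPOT limits in the tables above mean for a single user, the sketch below converts a time-per-output-token (TPOT) limit into the minimum per-user generation rate it implies, and bounds the time needed to stream a full response. The `LatencyConstraint` class and its method names are hypothetical. MLPerf evaluates these limits statistically over many queries, so treating them as hard per-request bounds is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyConstraint:
    ttft_ms: float  # limit on time to first token, in milliseconds
    tpot_ms: float  # limit on time per output token, in milliseconds

    def min_tokens_per_sec_per_user(self) -> float:
        """Slowest per-user generation rate that still meets the TPOT limit."""
        return 1000.0 / self.tpot_ms

    def max_response_ms(self, output_tokens: int) -> float:
        """Upper bound on streaming a response: first token, then the rest."""
        return self.ttft_ms + (output_tokens - 1) * self.tpot_ms


# TTFT/TPOT pairs listed for DeepSeek R1 and Llama2 70B in the tables above.
CONSTRAINTS = {
    "DeepSeek R1, 2000 ms/80 ms": LatencyConstraint(2000, 80),
    "DeepSeek R1, 1500 ms/15 ms": LatencyConstraint(1500, 15),
    "Llama2 70B, 2000 ms/200 ms": LatencyConstraint(2000, 200),
    "Llama2 70B, 450 ms/40 ms": LatencyConstraint(450, 40),
}

for label, constraint in CONSTRAINTS.items():
    rate = constraint.min_tokens_per_sec_per_user()
    print(f"{label}: at least {rate:.1f} tokens/sec/user")
```

The tighter 15 ms TPOT limit requires each user to receive tokens more than five times faster than the 80 ms limit does, which is why throughput under the tighter constraints is lower for the same hardware.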
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 5,764 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 3,389 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 32768 | 1,255 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 1,410 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 319 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 26,971 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 13,497 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 32768 | 4,494 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 5,735 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 1,265 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1000 | 1000 | 11,337 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 8192 | 5,174 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 32768 | 2,204 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 8192 | 1024 | 3,279 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 32768 | 1024 | 859 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 53,812 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 34,702 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | 14,589 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 11,904 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 2,645 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP2 PP2 | 1000 | 1000 | 1,731 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 711 output tokens/sec/gpu | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP2 PP2 | 32768 | 1024 | 70 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 9,938 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 3,621 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 1,914 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 374 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 1,711 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 790 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 1,238 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 1,229 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 1,202 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 1,071 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 500 | 500 | 6,616 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 1000 | 4000 | 4,957 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 4000 | 1000 | 5,353 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 945 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 410 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 636 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 678 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 681 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 566 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 3,288 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 1,417 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 627 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 134 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1000 | 1000 | 4,146 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 8192 | 1,157 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 32768 | 679 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 8192 | 1024 | 1,276 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 13,858 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 12,743 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 4,015 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 9,154 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
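
Because the B200 and H200 tables both list Qwen3 235B A22B with DEP4 on 4 GPUs, their per-GPU throughputs can be compared row by row. The sketch below computes the ratio for each input/output length that appears in both tables. Note that the B200 runs use FP4 and the H200 runs use FP8, so the ratio reflects the whole platform and precision choice, not the hardware alone. The dictionary and function names are hypothetical.

```python
from typing import Dict, Tuple

# Keys are (input length, output length); values are output tokens/sec/gpu
# for Qwen3 235B A22B, DEP4, copied from the B200 and H200 tables above.
B200_FP4: Dict[Tuple[int, int], float] = {
    (1000, 1000): 5764,
    (1024, 8192): 3389,
    (8192, 1024): 1410,
    (32768, 1024): 319,
}
H200_FP8: Dict[Tuple[int, int], float] = {
    (1000, 1000): 3288,
    (1024, 8192): 1417,
    (8192, 1024): 627,
    (32768, 1024): 134,
}


def speedups(
    new: Dict[Tuple[int, int], float], old: Dict[Tuple[int, int], float]
) -> Dict[Tuple[int, int], float]:
    """Per-configuration throughput ratio for configurations present in both."""
    return {shape: new[shape] / old[shape] for shape in new if shape in old}


ratios = speedups(B200_FP4, H200_FP8)
for (isl, osl), ratio in ratios.items():
    print(f"ISL {isl} / OSL {osl}: {ratio:.2f}x")
```
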
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | DEP8 | 1000 | 1000 | 1,932 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 873 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 11,557 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 8,617 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 3,366 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 785 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | TP2 PP2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP2 PP2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
TP: Tensor Parallelism
PP: Pipeline Parallelism
DEP: Data Expert Parallelism
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 7.32 videos/min | - | 8202.75 | 1x B200 | DGX B200 | 26.02-py3 | Mixed | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.89 images/sec | - | 507.41 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| BEVFusion Head | 1 | 2464.55 images/sec | 6 images/sec/watt | 0.41 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.47 images/sec | - | 2130.4 | 1x B200 | DGX B200 | 26.02-py3 | FP4 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Base | 128 | 4,948 samples/sec | 6 samples/sec/watt | 25.87 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Large | 128 | 3,223 samples/sec | 3 samples/sec/watt | 39.71 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Base | 2048 | 9,480 samples/sec | 10 samples/sec/watt | 216.04 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Large | 1024 | 3,381 samples/sec | 4 samples/sec/watt | 302.83 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v10 M | 1 | 846.98 images/sec | 1.19 images/sec/watt | 1.18 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v11 M | 1 | 1034.36 images/sec | 1.4 images/sec/watt | 0.97 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
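
As a reading aid, most rows in these tables are consistent with throughput being roughly batch size divided by latency. This is an observation about the published numbers, not a documented methodology, and some rows (Stable Diffusion XL, for example) do not follow it. A minimal sketch that checks the relation against three B200 rows, with the hypothetical helper `throughput_from_latency`:

```python
def throughput_from_latency(batch_size: int, latency_ms: float) -> float:
    """Samples/sec implied if one batch completes every latency_ms milliseconds."""
    return batch_size / (latency_ms / 1000.0)


# (network, batch size, latency in ms, reported throughput) from the B200 table.
rows = [
    ("HF Swin Base", 128, 25.87, 4948.0),
    ("HF ViT Base", 2048, 216.04, 9480.0),
    ("Yolo v10 M", 1, 1.18, 846.98),
]

for name, batch, latency_ms, reported in rows:
    implied = throughput_from_latency(batch, latency_ms)
    rel_diff = abs(implied - reported) / reported
    print(f"{name}: implied {implied:.0f}/s, reported {reported:.0f}/s, "
          f"difference {rel_diff:.2%}")
```

Treat a large difference as a sign that the row was measured with overlapping or pipelined requests rather than as an error in the table.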
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 1.05 images/sec | - | 954 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5072 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| BEVFusion Head | 1 | 1738.51 images/sec | 5 images/sec/watt | 0.58 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Base | 32 | 2,719 samples/sec | 5 samples/sec/watt | 11.77 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Large | 32 | 1,517 samples/sec | 3 samples/sec/watt | 21.1 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF ViT Base | 32 | 4,011 samples/sec | - | 8 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| HF ViT Large | 16 | 1,280 samples/sec | - | 13 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Yolo v11 M | 1 | 465 images/sec | 1 images/sec/watt | 2.15 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 1 | 0.4 images/sec | - | 2514 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| Flux Image Generator | 1 | 0.07 images/sec | - | 13816 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large QAT | 64 | 2,720 samples/sec | - | 24 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | INT8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large | 64 | 1,507 samples/sec | - | 42 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | Mixed | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Base | 16 | 1,403 samples/sec | - | 11 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Large | 4 | 449 samples/sec | - | 9 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 4.83 videos/min | - | 12414.37 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.61 images/sec | - | 760.29 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| BEVFusion Head | 1 | 2006.49 images/sec | 6 images/sec/watt | 0.5 | 1x H200 | DGX H200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5010.27 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Base | 128 | 3,009 samples/sec | 4 samples/sec/watt | 42.54 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Large | 128 | 1,821 samples/sec | 3 samples/sec/watt | 70.28 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Base | 1024 | 4,943 samples/sec | 7 samples/sec/watt | 207.15 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Large | 1024 | 1,702 samples/sec | 2 samples/sec/watt | 601.64 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v10 M | 1 | 431.92 images/sec | 0.68 images/sec/watt | 2.32 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v11 M | 1 | 518.04 images/sec | 0.8 images/sec/watt | 1.93 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVFusion Head | 1 | 2006.78 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,919 samples/sec | 4 samples/sec/watt | 43.84 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,752 samples/sec | 3 samples/sec/watt | 73.04 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Base | 1024 | 4,728 samples/sec | 7 samples/sec/watt | 216.57 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,629 samples/sec | 2 samples/sec/watt | 1256.97 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.06 images/sec | 0.66 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v11 M | 1 | 505.3 images/sec | 0.8 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Video Diffusion | 1 | 4.68 videos/min | - | 12811.33 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.54 images/sec | - | 780.31 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1999.27 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,866 samples/sec | 4 samples/sec/watt | 44.67 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,767 samples/sec | 3 samples/sec/watt | 72.42 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,864 samples/sec | 7 samples/sec/watt | 421.03 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 1,679 samples/sec | 2 samples/sec/watt | 1219.62 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 403.68 images/sec | 0.68 images/sec/watt | 2.48 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 0.76 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVFusion Head | 1 | 1958.07 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Base | 32 | 1,396 samples/sec | 4 samples/sec/watt | 22.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Large | 32 | 716 samples/sec | 2 samples/sec/watt | 44.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,662 samples/sec | 5 samples/sec/watt | 616.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Large | 1024 | 597 samples/sec | 2 samples/sec/watt | 1716.6 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v10 M | 1 | 274.78 images/sec | 0.79 images/sec/watt | 3.64 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v11 M | 1 | 310 images/sec | 0.9 images/sec/watt | 3.23 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384
Looking only at compute price or FLOPs per dollar gives an incomplete picture of inference TCO. For AI inference TCO, the metric that matters most is cost per token, that is, the price-performance actually delivered. Running Dynamo and TensorRT-LLM at an interactivity of 116 TPS per user, GB300 NVL72 achieves an inference cost of $0.123 per million tokens, the lowest cost per token of any platform according to SemiAnalysis InferenceX benchmarks as of April 2026.
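
Cost per million tokens can be derived from two quantities: what the hardware costs per hour and how many tokens it produces per second. The sketch below shows that arithmetic. The hourly cost and throughput are hypothetical placeholders chosen for illustration, not figures from the benchmarks cited above, and `cost_per_million_tokens` is an illustrative helper rather than part of any benchmark tool.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """USD per one million output tokens at a sustained token rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: $3.00 per GPU-hour and 10,000 output tokens/sec per GPU.
hourly_cost_usd = 3.00
tokens_per_sec = 10_000

cost = cost_per_million_tokens(hourly_cost_usd, tokens_per_sec)
print(f"${cost:.4f} per million tokens")

# At the same hourly cost, doubling throughput halves the cost per token.
print(f"${cost_per_million_tokens(hourly_cost_usd, 2 * tokens_per_sec):.4f} "
      "per million tokens at 2x throughput")
```

Because throughput sits in the denominator, two systems with the same FLOPs per dollar can still differ widely in cost per token if one sustains a higher token rate.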
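NVIDIA's inference cost per million tokens has improved markedly across GPU generations. According to SemiAnalysis InferenceX benchmarks from Q1 2026, on low-latency agentic workloads, NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper, thanks to hardware-software co-design. Software optimization brings continuing gains as well: token output on GB200 rose 4x within three months, and cost per token fell proportionally.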
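NVIDIA's TensorRT-LLM and Dynamo software stack keeps lowering inference cost without any hardware change. According to SemiAnalysis InferenceX benchmarks as of April 2026, the cost per million tokens for GPT-OSS-120B on NVIDIA Blackwell B200 fell from $0.11 at launch to $0.02 within two months, a roughly 5x improvement from software alone. Each TensorRT-LLM release typically raises throughput through operator and kernel fusion, quantization improvements, and scheduling optimizations, which further reduces the inference cost per token.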
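
The software-only gains quoted above follow from a simple relation: when the hardware cost is unchanged, cost per token scales inversely with throughput. The sketch below works through both quoted figures under that assumption. The helper names are illustrative, and the starting cost of 1.0 is a normalized placeholder rather than a published price.

```python
def cost_after_speedup(cost_before: float, throughput_gain: float) -> float:
    """Cost per token after a throughput gain, assuming fixed hardware cost."""
    return cost_before / throughput_gain


def improvement_factor(cost_before: float, cost_after: float) -> float:
    """How many times cheaper the later cost is than the earlier one."""
    return cost_before / cost_after


# A 4x rise in token output leaves one quarter of the original cost per token.
print(cost_after_speedup(1.0, 4.0))

# Published B200 prices per million tokens for GPT-OSS-120B: $0.11, then $0.02.
factor = improvement_factor(0.11, 0.02)
print(f"{factor:.1f}x cheaper")
```

The rounded published prices give a factor of 5.5, in line with the roughly 5x improvement stated in the text.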