# AI Inference

The era of judging systems on raw speed alone is over; what matters now is throughput, efficiency, and overall economics at scale. As AI evolves from producing one-shot answers to executing multi-step reasoning, demand for inference, and for the economics behind it, keeps climbing. Because each query now generates far more tokens, this shift sharply increases compute demand. Beyond aggregate throughput, metrics such as tokens per watt, cost per million tokens, and tokens per second per user are equally critical. For power-constrained AI factories, NVIDIA's continuous software optimizations translate into more token revenue over time, underscoring the value of our ongoing technology advances.
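
The relationships between these metrics are simple ratios. Below is a minimal Python sketch, using purely hypothetical numbers rather than measured NVIDIA results, showing how aggregate throughput converts into tokens per watt, cost per million tokens, and tokens per second per user:

```python
# Minimal sketch (hypothetical numbers): relating the inference-economics
# metrics discussed above. All inputs are illustrative assumptions,
# not measured NVIDIA results.

def tokens_per_watt(total_tokens_per_sec: float, power_watts: float) -> float:
    return total_tokens_per_sec / power_watts

def cost_per_million_tokens(instance_cost_per_hour: float,
                            total_tokens_per_sec: float) -> float:
    tokens_per_hour = total_tokens_per_sec * 3600
    return instance_cost_per_hour / (tokens_per_hour / 1e6)

def tokens_per_sec_per_user(total_tokens_per_sec: float,
                            concurrent_users: int) -> float:
    return total_tokens_per_sec / concurrent_users

# Illustrative example: a node serving 50,000 aggregate output tokens/sec
# at 10 kW and $60/hour, with 400 concurrent users.
tps, watts, cost_hr, users = 50_000, 10_000, 60.0, 400
print(f"{tokens_per_watt(tps, watts):.1f} tokens/sec/watt")
print(f"${cost_per_million_tokens(cost_hr, tps):.3f} per million tokens")
print(f"{tokens_per_sec_per_user(tps, users):.0f} tokens/sec/user")
```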

The Pareto curve clearly shows how NVIDIA Blackwell balances the full set of production priorities: cost, energy efficiency, throughput, and responsiveness. Optimizing a system for a single scenario often reduces deployment flexibility and sacrifices efficiency at other points on the curve. NVIDIA's full-stack design approach delivers both efficiency and value across a wide range of real production scenarios. Blackwell's leadership stems from deep hardware-software co-design, reflecting a full-stack architecture built for speed, efficiency, and scalability.
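
As a concrete illustration of the Pareto-curve idea, the sketch below filters a set of hypothetical (per-GPU throughput, per-user interactivity) operating points down to the non-dominated frontier; the sample values are invented for illustration, not taken from any benchmark:

```python
# Minimal sketch: extracting the Pareto frontier from hypothetical
# (throughput, interactivity) operating points, as in the curves
# discussed above. Higher is better on both axes.

def pareto_frontier(points):
    """Keep points not dominated by any other point.

    Each point is (tokens_per_sec_per_gpu, tokens_per_sec_per_user).
    """
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical operating points swept over batch size / concurrency.
points = [(5000, 20), (4200, 35), (3000, 60), (2800, 55), (1500, 90)]
print(pareto_frontier(points))
# (2800, 55) is dominated by (3000, 60); the other four form the frontier.
```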

In this [blog](https://blogs.nvidia.cn/blog/mixture-of-experts-frontier-models/), learn how Mixture of Experts powers smarter frontier AI models and achieves up to a 10x speedup on NVIDIA Blackwell NVL72.

[View other performance data](/deep-learning-performance-training-inference)

Learn about the [methodology](https://github.com/NVIDIA/dgxc-benchmarking/blob/develop/devzone-repro.md) behind these results, and reproduce the tests yourself by running the Benchmarking Recipes.

## MLPerf Inference v6.0 Performance Benchmarks

### Offline Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | Dataset |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek R1 | 2,494,310 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf\_deepseek\_r1 |
| | 486,141 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf\_deepseek\_r1 |
| | 70,326 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf\_deepseek\_r1 |
| | 58,582 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | mlperf\_deepseek\_r1 |
| gpt-oss 120B | 1,046,150 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 879,542 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 111,496 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 93,071 tokens/sec | 8x B200 | LLM-D v0.5.0, OpenShift 4.20.12, NVIDIA 8x B200-SXM-180GB | NVIDIA B200 | 6396 | 99% of 83.13% | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 61 tokens/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 44 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 78 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| | 79 tokens/sec | 8x B200 | Dell B200, 8x B200-SXM-180GB, RHEL 10.1, vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | Shopify Product Catalogue |
| Llama3.1 405B | 19,512 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 15,462 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,971 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,350 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 1,126,850 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | OpenOrca (max\_seq\_len=1024) |
| | 888,054 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | OpenOrca (max\_seq\_len=1024) |
| | 112,954 tokens/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | OpenOrca (max\_seq\_len=1024) |
| | 104,572 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | OpenOrca (max\_seq\_len=1024) |
| Llama3.1 8B | 166,745 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |
| | 160,403 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |
| Wan2.2 | 0.037 samples/sec | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.027 samples/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.059 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| | 0.046 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | VBench prompts |
| DLRMv3 | 104,637 samples/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| | 10,737 samples/sec | 8x B200 | Camarero PDI200A2HG-810 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Streaming 100B Dataset |
| Whisper | 50,562 samples/sec | 8x B300 | NVIDIA DGX B300 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| | 49,327 samples/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 1633 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |

### Server Scenario - Closed Division 

| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek R1 | 1,555,110 tokens/sec | 288x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf\_deepseek\_r1 |
| | 336,106 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf\_deepseek\_r1 |
| | 60,413 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf\_deepseek\_r1 |
| | 51,693 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf\_deepseek\_r1 |
| gpt-oss 120B | 1,096,770 tokens/sec | 72x GB300 | Nebius GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 899,218 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 110,655 tokens/sec | 8x B300 | Cisco UCS C880A M8 (8x NVIDIA B300-SXM-270GB, TensorRT) | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 87,444 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 3000 ms/80 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Qwen3-VL 235B | 43 tokens/sec | 4x GB300 | Nebius GB300 NVL72 (4x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 38 tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 45 tokens/sec | 8x B300 | Nebius B300 n1 (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| | 68 tokens/sec | 8x B200 | Dell B200, 8x B200-SXM-180GB, RHEL 10.1, vLLM CentML:mlperf-inf-mm-q3vl-v6.0 | NVIDIA B200 | 48289 | 99% of BF16 (Category Hierarchical F1 Score >= 0.7824) | 12 s | Shopify Product Catalogue |
| Llama3.1 405B | 18,628 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 14,134 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 1,484 tokens/sec | 8x B300 | QuantaGrid D75H-10U (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 984 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 868,278 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max\_seq\_len=1024) |
| | 810,104 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max\_seq\_len=1024) |
| | 108,392 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max\_seq\_len=1024) |
| | 103,627 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max\_seq\_len=1024) |
| Llama3.1 8B | 148,067 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |
| | 131,270 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |
| Wan2.2\*\* | 31 seconds | 4x GB300 | NVIDIA GB300 NVL72 (4x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| | 40 seconds | 4x GB200 | NVIDIA GB200 NVL72 (4x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| | 21 seconds | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 248 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | N/A | VBench prompts |
| | 25 seconds | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 248 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | N/A | VBench prompts |
| DLRMv3 | 99,997 queries/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |
| | 10,007 queries/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 34996 | 99% of FP32 (AUC=80.31%) | 80 ms | Synthetic Streaming 100B Dataset |

### Interactive Scenario - Closed Division 

| Network | Throughput | GPU | Server | GPU Version | QSL Size | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek R1 | 250,634 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf\_deepseek\_r1 |
| | 240,318 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf\_deepseek\_r1 |
| | 4,935 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 4388 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 1500 ms/15 ms | mlperf\_deepseek\_r1 |
| gpt-oss 120B | 677,199 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 624,929 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 26,006 tokens/sec | 8x B300 | XA NB3I-E12 | NVIDIA B300 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| | 13,155 tokens/sec | 8x B200 | Nebius B200 n1 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 6396 | 99% of 83.13% | TTFT/TPOT: 2000 ms/20 ms | AIME25, GPQA Diamond, LiveCodeBench v6 |
| Llama3.1 405B | 18,365 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 14,010 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| | 765 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact\_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens\_per\_sample=684.68) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 814,128 tokens/sec | 72x GB300 | NVIDIA GB300 NVL72 (72x GB300-288GB\_aarch64, TensorRT) | NVIDIA GB300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max\_seq\_len=1024) |
| | 754,855 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 (72x GB200-186GB\_aarch64, TensorRT) | NVIDIA GB200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max\_seq\_len=1024) |
| | 70,724 tokens/sec | 8x B300 | PowerEdge XE9780L (8x B300-SXM-270GB, TensorRT) | NVIDIA B300 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max\_seq\_len=1024) |
| | 61,300 tokens/sec | 8x B200 | HPE ProLiant Compute XD685 (8x NVIDIA B200 180GB, TensorRT) | NVIDIA B200 | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens\_per\_sample=294.45) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max\_seq\_len=1024) |
| Llama3.1 8B | 128,633 tokens/sec | 8x B300 | G894-SD3-AAX7 | NVIDIA B300 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |
| | 128,750 tokens/sec | 8x B200 | NVIDIA DGX B200 (8x B200-SXM-180GB, TensorRT) | NVIDIA B200 | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen\_len=8167644) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max\_seq\_len=2048) |

\*\*The primary metric on Wan2.2 in the Server scenario is measured in seconds (lower is better).  
 MLPerf™ v6.0 Inference Closed Division. NVIDIA platform results from the following entries: 6.0-0006, 6.0-0010, 6.0-0024, 6.0-0039, 6.0-0040, 6.0-0048, 6.0-0062, 6.0-0072, 6.0-0073, 6.0-0074, 6.0-0075, 6.0-0076, 6.0-0077, 6.0-0078, 6.0-0080, 6.0-0081, 6.0-0083, 6.0-0084, 6.0-0085, 6.0-0089, 6.0-0091, 6.0-0094, 6.0-0098. MLPerf name and logo are trademarks. See [https://mlcommons.org/](https://mlcommons.org/) for more information.   
 For MLPerf™ various scenario data, click [here](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#scenarios)  
 For MLPerf™ latency constraints, click [here](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division)
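
To illustrate how the TTFT/TPOT constraints in the Server and Interactive tables are interpreted, the sketch below computes both metrics from hypothetical per-request timestamps; the default limits mirror the 2000 ms / 80 ms pair used for several workloads above:

```python
# Minimal sketch: computing TTFT and TPOT from per-token timestamps and
# checking them against a latency-constraint pair. All timestamps here
# are hypothetical.

def ttft_ms(request_start: float, first_token_time: float) -> float:
    """Time To First Token, in milliseconds."""
    return (first_token_time - request_start) * 1000

def tpot_ms(first_token_time: float, last_token_time: float,
            num_output_tokens: int) -> float:
    """Time Per Output Token: mean inter-token latency after the first token."""
    return (last_token_time - first_token_time) * 1000 / (num_output_tokens - 1)

def meets_constraints(ttft: float, tpot: float,
                      ttft_limit: float = 2000.0,
                      tpot_limit: float = 80.0) -> bool:
    return ttft <= ttft_limit and tpot <= tpot_limit

# Hypothetical request: starts at t=0.0 s, first token at 0.85 s,
# 256 output tokens finishing at 18.0 s.
ttft = ttft_ms(0.0, 0.85)        # 850 ms
tpot = tpot_ms(0.85, 18.0, 256)  # ~67.3 ms
print(ttft, tpot, meets_constraints(ttft, tpot))
```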

## LLM Inference Performance of NVIDIA Data Center Products

### B200 Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 5,764 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 3,389 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 1024 | 32768 | 1,255 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 1,410 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 319 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 26,971 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 13,497 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 1024 | 32768 | 4,494 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 5,735 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 1,265 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1000 | 1000 | 11,337 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 8192 | 5,174 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 1024 | 32768 | 2,204 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 8192 | 1024 | 3,279 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| Llama v4 Maverick | DEP4 | 32768 | 1024 | 859 output tokens/sec/gpu | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 53,812 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 34,702 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | 14,589 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 11,904 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 2,645 output tokens/sec/gpu | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.1 | NVIDIA B200 |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism
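
For reference, the "output tokens/sec/gpu" figures above are per-GPU normalizations of aggregate throughput. A minimal sketch of that normalization, using hypothetical run numbers rather than the measured values in the table:

```python
# Minimal sketch: normalizing aggregate benchmark output to the
# "output tokens/sec/gpu" unit used in these tables. Inputs are hypothetical.

def output_tokens_per_sec_per_gpu(total_output_tokens: int,
                                  elapsed_sec: float,
                                  num_gpus: int) -> float:
    return total_output_tokens / elapsed_sec / num_gpus

# Hypothetical run: a 4-GPU DEP4 configuration generating 1,382,400
# output tokens in 60 seconds -> 5,760 output tokens/sec/gpu.
print(output_tokens_per_sec_per_gpu(1_382_400, 60.0, 4))
```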

### RTX PRO 6000 Blackwell Server Edition Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B | DEP2 PP2 | 1000 | 1000 | 1,731 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 711 output tokens/sec/gpu | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 235B A22B | DEP2 PP2 | 32768 | 1024 | 70 output tokens/sec/gpu | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1000 | 1000 | 9,938 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 1024 | 8192 | 3,621 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 8192 | 1024 | 1,914 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Qwen3 30B A3B | TP1 | 32768 | 1024 | 374 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.1 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 1,711 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 790 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 1,238 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 1,229 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 1,202 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 1,071 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 500 | 500 | 6,616 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 1000 | 4000 | 4,957 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Nemotron 3 Nano 30B | TP1 | 4000 | 1000 | 5,353 output tokens/sec/gpu | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism

### RTX PRO 4500 Blackwell Server Edition Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nemotron Nano 9B v2 | TP1 | 500 | 500 | 945 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 1000 | 4000 | 410 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 9B v2 | TP1 | 4000 | 1000 | 636 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 500 | 500 | 678 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 1000 | 4000 | 681 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |
| Nemotron Nano 12B v2 | TP1 | 4000 | 1000 | 566 output tokens/sec/gpu | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 1.2.0 | NVIDIA RTX PRO 4500 Blackwell Server Edition |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism

### H200 Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B | DEP4 | 1000 | 1000 | 3,288 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 1024 | 8192 | 1,417 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 8192 | 1024 | 627 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Qwen3 235B A22B | DEP4 | 32768 | 1024 | 134 output tokens/sec/gpu | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1000 | 1000 | 4,146 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 8192 | 1,157 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 1024 | 32768 | 679 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| Llama v4 Maverick | DEP8 | 8192 | 1024 | 1,276 output tokens/sec/gpu | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 13,858 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 12,743 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 1024 | 32768 | - | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 4,015 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 9,154 output tokens/sec/gpu | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.1 | NVIDIA H200 |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism

### H100 Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 235B A22B | DEP8 | 1000 | 1000 | 1,932 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| Qwen3 235B A22B | DEP8 | 1024 | 8192 | 873 output tokens/sec/gpu | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1000 | 1000 | 11,557 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 1024 | 8192 | 8,617 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 8192 | 1024 | 3,366 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |
| GPT-OSS 20B | TP1 | 32768 | 1024 | 785 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.1 | H100-SXM5-80GB |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism

### L40S Inference Performance

| Model | Parallelism | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama v4 Scout | TP2 PP2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP4 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | TP2 PP2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP2 PP2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | TP4 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | TP1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |

TP: Tensor Parallelism   
 PP: Pipeline Parallelism   
 DEP: Data Expert Parallelism

## Inference Performance of NVIDIA Data Center Products

### B200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Video Diffusion | 1 | 7.32 videos/min | - | 8202.75 | 1x B200 | DGX B200 | 26.02-py3 | Mixed | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.89 images/sec | - | 507.41 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| BEVFusion Head | 1 | 2464.55 images/sec | 6 images/sec/watt | 0.41 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.47 images/sec | - | 2130.4 | 1x B200 | DGX B200 | 26.02-py3 | FP4 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Base | 128 | 4,948 samples/sec | 6 samples/sec/watt | 25.87 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF Swin Large | 128 | 3,223 samples/sec | 3 samples/sec/watt | 39.71 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Base | 2048 | 9,480 samples/sec | 10 samples/sec/watt | 216.04 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| HF ViT Large | 1024 | 3,381 samples/sec | 4 samples/sec/watt | 302.83 | 1x B200 | DGX B200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v10 M | 1 | 846.98 images/sec | 1.19 images/sec/watt | 1.18 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |
| Yolo v11 M | 1 | 1034.36 images/sec | 1.4 images/sec/watt | 0.97 | 1x B200 | DGX B200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA B200 |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

### RTX PRO 6000 Blackwell Server Edition Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion XL | 1 | 1.05 images/sec | - | 954 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5072 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| BEVFusion Head | 1 | 1738.51 images/sec | 5 images/sec/watt | 0.58 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Base | 32 | 2,719 samples/sec | 5 samples/sec/watt | 11.77 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF Swin Large | 32 | 1,517 samples/sec | 3 samples/sec/watt | 21.1 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |
| HF ViT Base | 32 | 4,011 samples/sec | - | 8 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| HF ViT Large | 16 | 1,280 samples/sec | - | 13 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 6000 BSE |
| Yolo v11 M | 1 | 465 images/sec | 1 images/sec/watt | 2.15 | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | RTX PRO 6000 BSE |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

### RTX PRO 4500 Blackwell Server Edition Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion XL | 1 | 0.4 images/sec | - | 2514 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| Flux Image Generator | 1 | 0.07 images/sec | - | 13816 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP4 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large QAT | 64 | 2,720 samples/sec | - | 24 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | INT8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF Bert Large | 64 | 1,507 samples/sec | - | 42 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | Mixed | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Base | 16 | 1,403 samples/sec | - | 11 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |
| HF ViT Large | 4 | 449 samples/sec | - | 9 | 1x RTX PRO 4500 | Supermicro SYS-521GE-TNRT | 26.01-py3 | FP8 | Synthetic | TensorRT 10.14.1 | RTX PRO 4500 BSE |

HF Bert Large, HF ViT Base, HF ViT Large Sequence Length = 384

### H200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Video Diffusion | 1 | 4.83 videos/min | - | 12414.37 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.61 images/sec | - | 760.29 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| BEVFusion Head | 1 | 2006.49 images/sec | 6 images/sec/watt | 0.5 | 1x H200 | DGX H200 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.2 images/sec | - | 5010.27 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Base | 128 | 3,009 samples/sec | 4 samples/sec/watt | 42.54 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF Swin Large | 128 | 1,821 samples/sec | 3 samples/sec/watt | 70.28 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Base | 1024 | 4,943 samples/sec | 7 samples/sec/watt | 207.15 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| HF ViT Large | 1024 | 1,702 samples/sec | 2 samples/sec/watt | 601.64 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v10 M | 1 | 431.92 images/sec | 0.68 images/sec/watt | 2.32 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |
| Yolo v11 M | 1 | 518.04 images/sec | 0.8 images/sec/watt | 1.93 | 1x H200 | DGX H200 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA H200 |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

### GH200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFusion Head | 1 | 2006.78 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,919 samples/sec | 4 samples/sec/watt | 43.84 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,752 samples/sec | 3 samples/sec/watt | 73.04 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Base | 1024 | 4,728 samples/sec | 7 samples/sec/watt | 216.57 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,629 samples/sec | 2 samples/sec/watt | 1256.97 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.06 images/sec | 0.66 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |
| Yolo v11 M | 1 | 505.3 images/sec | 0.8 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA GH200 |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

### H100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Video Diffusion | 1 | 4.68 videos/min | - | 12811.33 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.54 images/sec | - | 780.31 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1999.27 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,866 samples/sec | 4 samples/sec/watt | 44.67 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,767 samples/sec | 3 samples/sec/watt | 72.42 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,864 samples/sec | 7 samples/sec/watt | 421.03 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 1,679 samples/sec | 2 samples/sec/watt | 1219.62 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 403.68 images/sec | 0.68 images/sec/watt | 2.48 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 0.76 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | H100 SXM5-80GB |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

### L40S Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFusion Head | 1 | 1958.07 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Base | 32 | 1,396 samples/sec | 4 samples/sec/watt | 22.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF Swin Large | 32 | 716 samples/sec | 2 samples/sec/watt | 44.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,662 samples/sec | 5 samples/sec/watt | 616.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| HF ViT Large | 1024 | 597 samples/sec | 2 samples/sec/watt | 1716.6 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | FP8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v10 M | 1 | 274.78 images/sec | 0.79 images/sec/watt | 3.64 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |
| Yolo v11 M | 1 | 310 images/sec | 0.9 images/sec/watt | 3.23 | 1x L40S | Supermicro SYS-521GE-TNRT | 26.02-py3 | INT8 | Synthetic | TensorRT 10.15.1 | NVIDIA L40S |

HF Swin Base, HF Swin Large, HF ViT Base, HF ViT Large Sequence Length = 384

## See More Performance Data

### Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy.
This is the most effective way to verify that an AI system is ready to deliver valuable results in production.

[Learn more](https://developer.nvidia.com/deep-learning-performance-training-inference/training)

### AI Pipeline  

NVIDIA Riva is an application framework for building multimodal conversational AI services that deliver real-time performance on GPUs.

[Learn more](/deep-learning-performance-training-inference/conversational-ai)

## NVIDIA Data Center Deep Learning Product Performance FAQ

Looking only at compute unit price or FLOPs per dollar gives an incomplete picture of inference TCO. For AI inference TCO, the metric that matters most is cost per token: the price-performance actually delivered. Running Dynamo and TensorRT-LLM at 116 TPS per-user interactivity, GB300 NVL72 achieves an inference cost of $0.123 per million tokens, the lowest cost per token of any major platform according to the [SemiAnalysis InferenceX benchmark](https://inferencex.semianalysis.com/) as of April 2026.


NVIDIA's cost per million tokens has improved dramatically across GPU generations: according to the Q1 2026 [SemiAnalysis InferenceX benchmark](https://inferencex.semianalysis.com/), on low-latency agentic workloads, hardware-software co-design gives NVIDIA Blackwell Ultra (GB300 NVL72) up to 50x higher throughput per MW and up to 35x lower cost per token than NVIDIA Hopper. Software optimization also delivers continuous gains: GB200 token output increased 4x within three months, with cost per token falling proportionally.

NVIDIA's TensorRT-LLM and Dynamo software stacks keep lowering inference cost without any hardware change. According to the [SemiAnalysis InferenceX benchmark](https://inferencex.semianalysis.com/) as of April 2026, the cost per million tokens of NVIDIA Blackwell B200 on the GPT-OSS-120B model fell from $0.11 at launch to $0.02 within two months, roughly a 5x improvement from software alone. Each TensorRT-LLM release typically raises throughput through operator/kernel fusion, quantization improvements, and scheduler optimizations, further amortizing the inference cost per token.
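
To see how a throughput gain maps to cost per million tokens at fixed hardware cost, here is a minimal sketch; the hourly instance price and throughput numbers are illustrative assumptions chosen to mirror the roughly 5x improvement described above, not published figures:

```python
# Minimal sketch: how throughput gains from software releases translate into
# cost per million tokens at a fixed hourly instance price. The hourly price
# and throughputs below are illustrative assumptions, not published figures.

def cost_per_million_tokens(instance_cost_per_hour: float,
                            tokens_per_sec: float) -> float:
    return instance_cost_per_hour * 1e6 / (tokens_per_sec * 3600)

HOURLY_COST = 40.0  # hypothetical 8-GPU instance price, $/hour

# Hypothetical aggregate throughputs before and after software optimization.
for label, tps in [("at launch", 101_000), ("after optimizations", 505_000)]:
    print(f"{label}: ${cost_per_million_tokens(HOURLY_COST, tps):.3f}/M tokens")
# A 5x throughput gain cuts cost per token 5x at the same hardware cost.
```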


