Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
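The "time to train" metric used throughout this page can be sketched as a loop that stops the clock only when the model first reaches the required quality target, rather than after a fixed number of epochs. The sketch below is illustrative only (the function and the toy 1-D "model" are not part of any MLPerf harness); in a real benchmark, `step_fn` and `eval_fn` would be a framework training step and a held-out evaluation.

```python
import time

def train_to_convergence(step_fn, eval_fn, target, max_steps=100_000):
    """Run training steps until the eval metric first reaches the target.

    Mirrors the MLPerf-style 'time to train': the clock stops at the
    first evaluation that meets the required quality, not after a
    fixed step budget.
    """
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        step_fn()
        if eval_fn() >= target:
            return step, time.perf_counter() - start
    raise RuntimeError(f"did not reach {target} in {max_steps} steps")

# Toy stand-in for a model: gradient descent pulling w toward 2.0,
# with "accuracy" defined as closeness to the optimum.
w = 0.0

def step_fn():
    global w
    w -= 0.1 * 2 * (w - 2.0)   # gradient of (w - 2)^2

def eval_fn():
    return 1.0 - abs(w - 2.0) / 2.0

steps, seconds = train_to_convergence(step_fn, eval_fn, target=0.999)
print(steps, "steps to convergence")
```

The point of the convergence criterion is that throughput alone can be gamed (e.g. by lowering precision past the point where accuracy suffers); stopping the clock at a quality target measures useful work.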

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack.


NVIDIA Performance on MLPerf 2.1 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Single Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MXNet | ResNet-50 v1.5 | 14.746 | 75.90% classification | 8x H100 | DGX H100 | 2.1-2091 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 27.688 | 75.90% classification | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 13.101 | 0.908 Mean DICE score | 8x H100 | DGX H100 | 2.1-2091 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 22.989 | 0.908 Mean DICE score | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 6.378 | 0.72 Mask-LM accuracy | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 16.549 | 0.72 Mask-LM accuracy | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 20.348 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 37.916 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | RNN-T | 18.202 | 0.058 Word Error Rate | 8x H100 | DGX H100 | 2.1-2091 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 29.948 | 0.058 Word Error Rate | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RetinaNet | 38.050 | 34.0% mAP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 82.529 | 34.0% mAP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 174.584 | 50% win rate vs. checkpoint | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Go | H100-SXM5-80GB |
| TensorFlow | MiniGo | 161.848 | 50% win rate vs. checkpoint | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2040 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.063 | 0.8025 AUC | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.625 | 0.8025 AUC | 8x A100 | Fujitsu: PRIMERGY-GX2570M6-hugectr | 2.1-2033 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Multi Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MXNet | ResNet-50 v1.5 | 4.508 | 75.90% classification | 32x H100 | DGX H100 | 2.1-2093 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 4.523 | 75.90% classification | 64x A100 | DGX A100 | 2.1-2065 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.555 | 75.90% classification | 1,024x A100 | DGX A100 | 2.1-2073 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.319 | 75.90% classification | 4,216x A100 | DGX A100 | 2.1-2080 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 5.347 | 0.908 Mean DICE score | 24x H100 | DGX H100 | 2.1-2092 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 3.437 | 0.908 Mean DICE score | 72x A100 | Azure: ND96amsr_A100_v4_n9 | 2.1-2009 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 1.216 | 0.908 Mean DICE score | 768x A100 | DGX A100 | 2.1-2072 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 1.797 | 0.72 Mask-LM accuracy | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 2.497 | 0.72 Mask-LM accuracy | 64x A100 | DGX A100 | 2.1-2068 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.421 | 0.72 Mask-LM accuracy | 1,024x A100 | DGX A100 | 2.1-2074 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.208 | 0.72 Mask-LM accuracy | 4,096x A100 | DGX A100 | 2.1-2079 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 7.338 | 0.377 Box min AP and 0.339 Mask min AP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 8.293 | 0.377 Box min AP and 0.339 Mask min AP | 64x A100 | HPE-ProLiant-XL675d-Gen10-Plus_A100-SXM-80GB_pytorch | 2.1-2049 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 2.750 | 0.377 Box min AP and 0.339 Mask min AP | 384x A100 | DGX A100 | 2.1-2071 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | RNN-T | 7.534 | 0.058 Word Error Rate | 32x H100 | DGX H100 | 2.1-2093 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 6.910 | 0.058 Word Error Rate | 64x A100 | DGX A100 | 2.1-2066 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RNN-T | 2.151 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 2.1-2076 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RetinaNet | 11.798 | 34.0% mAP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 12.763 | 34.0% mAP | 64x A100 | DGX A100 | 2.1-2068 | Mixed | OpenImages | A100-SXM4-80GB |
| PyTorch | RetinaNet | 2.349 | 34.0% mAP | 1,280x A100 | DGX A100 | 2.1-2075 | Mixed | OpenImages | A100-SXM4-80GB |
| PyTorch | RetinaNet | 1.843 | 34.0% mAP | 2,048x A100 | DGX A100 | 2.1-2078 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 92.522 | 50% win rate vs. checkpoint | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Go | H100-SXM5-80GB |
| TensorFlow | MiniGo | 73.038 | 50% win rate vs. checkpoint | 64x A100 | DGX A100 | 2.1-2067 | Mixed | Go | A100-SXM4-80GB |
| TensorFlow | MiniGo | 16.231 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 2.1-2077 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.515 | 0.8025 AUC | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.653 | 0.8025 AUC | 64x A100 | DGX A100 | 2.1-2064 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.588 | 0.8025 AUC | 112x A100 | DGX A100 | 2.1-2070 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

MLPerf™ v2.1 Training Closed: 2.1-2038, 2.1-2039, 2.1-2033, 2.1-2040, 2.1-2065, 2.1-2068, 2.1-2049, 2.1-2066, 2.1-2064, 2.1-2067, 2.1-2009, 2.1-2070, 2.1-2071, 2.1-2072, 2.1-2073, 2.1-2074, 2.1-2075, 2.1-2076, 2.1-2077, 2.1-2078, 2.1-2079, 2.1-2080, 2.1-2091, 2.1-2092, 2.1-2093 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM5-80GB is a preview submission
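As a quick example of reading the single-node table above, the H100-vs-A100 comparison reduces to a ratio of the time-to-train entries (values copied from the 8-GPU rows; the dictionary below is just a convenient container, not part of any benchmark tooling):

```python
# Time-to-train values (minutes) copied from the single-node table above.
time_to_train = {
    "ResNet-50 v1.5": {"8x H100": 14.746, "8x A100": 27.688},
    "BERT":           {"8x H100": 6.378,  "8x A100": 16.549},
    "Mask R-CNN":     {"8x H100": 20.348, "8x A100": 37.916},
}

# Speedup = A100 time / H100 time (higher means H100 trains faster).
speedups = {net: t["8x A100"] / t["8x H100"] for net, t in time_to_train.items()}
for net, s in speedups.items():
    print(f"{net}: {s:.2f}x")
```

Note the comparison is only apples-to-apples within a benchmark round: the two rows share the quality target and dataset but run on different server platforms.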


NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Strong Scaling - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | CosmoFlow | 3.79 | Mean average error 0.124 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 1.57 | IOU 0.82 | 2,048x A100 | DGX A100 | 2.0-8005 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 21.93 | Forces mean absolute error 0.036 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Weak Scaling - Closed Division

| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v2.0 Training HPC rules and guidelines, see the MLCommons website.
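The weak-scaling figures above report models trained per minute across the whole 4,096-GPU partition. Assuming the full partition stays busy (a simplifying assumption, not stated by the benchmark), dividing GPU count by that throughput gives an approximate GPU-minutes-per-model cost:

```python
gpus = 4096  # size of the weak-scaling partition in the table above

# models/min values copied from the weak-scaling table.
throughput = {"CosmoFlow": 4.21, "DeepCAM": 6.40, "OpenCatalyst": 0.66}

# Approximate GPU-minutes consumed per trained model instance.
gpu_minutes = {net: gpus / t for net, t in throughput.items()}
for net, gm in gpu_minutes.items():
    print(f"{net}: ~{gm:,.0f} GPU-minutes per model")
```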

Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 97 | 0.55 Training Loss | 321,306 total output mels/sec | 8x A100 | DGX A100 | 22.10-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | WaveGlow | 227 | -5.7 Training Loss | 1,869,185 output samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | GNMT v2 | 19 | 24.39 BLEU Score | 960,732 total tokens/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | NCF | 0.35 | 0.96 Hit Rate at 10 | 159,982,051 samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | Transformer XL Base | 187 | 22.35 Perplexity | 710,002 total tokens/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | TFT - Traffic | 1 | 0.08 P90 | 134,270 items/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | TFT - Electricity | 2 | 0.03 P90 | 134,971 items/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | HiFiGAN | 1,748 | 9.56 Training Loss | 62,639 total output mels/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | LJSpeech-1.1 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 1,044 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | U-Net Medical | 2 | 0.89 DICE Score | 1,080 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 3 | 92.48 F1 | 2,800 sequences/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | EfficientNet-B0 | 526 | 76.67 Top 1 | 21,485 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-40GB |
| TensorFlow | 2.10.0 | SIM | 1 | 0.82 AUC | 3,269,087 samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A100-SXM4-80GB |

A40 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 50,144,826 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 1.13.0a0 | Tacotron2 | 113 | 0.57 Training Loss | 269,899 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.11-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 1.13.0a0 | WaveGlow | 445 | -5.7 Training Loss | 940,768 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 1.13.0a0 | GNMT v2 | 46 | 24.38 BLEU Score | 321,760 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.11-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 450 | 22.41 Perplexity | 297,106 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.11-py3 | Mixed | 128 | WikiText-103 | A40 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 748 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A40 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 4 | 92.46 F1 | 1,105 sequences/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A40 |

A30 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 121 | 0.51 Training Loss | 256,736 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | WaveGlow | 429 | -5.74 Training Loss | 985,229 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | GNMT v2 | 46 | 24.24 BLEU Score | 319,191 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 1.13.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 54,535,299 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 1.13.0a0 | BERT-LARGE | 10 | 90.71 F1 | 301 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | SQuAD v1.1 | A30 |
| PyTorch | 1.13.0a0 | FastPitch | 435 | 2.7 Training Loss | 180,819 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 147 | 23.69 Perplexity | 228,197 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 32 | WikiText-103 | A30 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 2 | 0.08 P90 | 82,285 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A30 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 3 | 0.03 P90 | 82,065 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A30 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 678 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A30 |
| TensorFlow | 2.10.0 | U-Net Medical | 2 | 0.89 DICE Score | 486 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 5 | 92.58 F1 | 977 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.10.0 | SIM | 1 | 0.81 AUC | 2,250,516 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 139 | 0.53 Training Loss | 220,186 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 1.13.0a0 | WaveGlow | 567 | -5.7 Training Loss | 739,602 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.10-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 1.13.0a0 | GNMT v2 | 52 | 24.25 BLEU Score | 277,159 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 1.13.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 47,791,819 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 655 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A10 |
| TensorFlow | 1.15.5 | U-Net Medical | 14 | 0.89 DICE Score | 359 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 5 | 92.78 F1 | 771 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.10.0 | SIM | 1 | 0.81 AUC | 2,180,220 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A10 |

T4 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 231 | 0.53 Training Loss | 130,930 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | WaveGlow | 1,089 | -5.81 Training Loss | 383,020 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | GNMT v2 | 95 | 24.24 BLEU Score | 152,304 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| PyTorch | 1.13.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 26,301,627 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 10 | 0.08 P90 | 33,694 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 16 | 0.03 P90 | 33,609 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Industrial | 2 | 0.99 IoU Threshold 0.99 | 331 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Medical | 42 | 0.9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 10 | 92.73 F1 | 382 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
| TensorFlow | 1.15.5 | Transformer XL Base | 909 | 22.31 Perplexity | 36,121 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16 | WikiText-103 | NVIDIA T4 |
| TensorFlow | 2.10.0 | SIM | 2 | 0.81 AUC | 1,125,154 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16384 | Amazon Reviews | NVIDIA T4 |


V100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 151 | 0.53 Training Loss | 208,130 total output mels/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | WaveGlow | 402 | -5.73 Training Loss | 1,059,562 output samples/sec | 8x V100 | DGX-2 | 22.10-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | GNMT v2 | 33 | 24.21 BLEU Score | 440,850 total tokens/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 99,138,714 samples/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | BERT-LARGE | 7 | 90.78 F1 | 398 sequences/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | TFT - Traffic | 2 | 0.08 P90 | 88,986 items/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | TFT - Electricity | 3 | 0.03 P90 | 88,647 items/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 643 images/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Medical | 14 | 0.9 DICE Score | 467 images/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | Transformer XL Base | 310 | 22.7 Perplexity | 106,475 total tokens/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 4 | 92.61 F1 | 1,346 sequences/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
| TensorFlow | 2.10.0 | SIM | 1 | 0.82 AUC | 2,212,761 samples/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | V100-SXM3-32GB |

Single-GPU Training

Some scenarios, such as single-GPU throughput, are not typically used in real-world training. The table below is provided for reference, as an indication of a platform’s single-chip throughput.

Related Resources

Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 41,908 total output mels/sec | 1x A100 | DGX A100 | 22.11-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | WaveGlow | 255,720 output samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | FastPitch | 161,081 frames/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | GNMT v2 | 171,369 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | Transformer XL Large | 16,084 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | Transformer XL Base | 86,441 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | nnU-Net | 1,126 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 391 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | BERT Large Pre-Training Phase 2 | 294 sequences/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 391 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | SE3 Transformer | 3,274 molecules/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | TFT - Traffic | 17,342 items/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | TFT - Electricity | 17,285 items/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | HiFiGAN | 19,919 total output mels/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 353 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB |
| TensorFlow | 2.10.0 | U-Net Medical | 150 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 368 sequences/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | NCF | 51,889,497 samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | EfficientNet-B0 | 3,255 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-80GB |
| TensorFlow | 2.10.0 | SIM | 588,800 samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A100-SXM4-80GB |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
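A rough multi-GPU scaling efficiency can be estimated by dividing an 8-GPU throughput from the converged-training table by 8x the corresponding 1-GPU figure. Using the GNMT v2 mixed-precision entries for A100 above (this is only an indication, since the 8-GPU run also carries convergence overheads):

```python
# Throughput values copied from the A100 tables above (GNMT v2, mixed precision).
one_gpu = 171_369     # total tokens/sec, 1x A100
eight_gpu = 960_732   # total tokens/sec, 8x A100 (converged-training table)

# Scaling efficiency: measured 8-GPU throughput vs. 8x the 1-GPU figure.
efficiency = eight_gpu / (8 * one_gpu)
print(f"8-GPU scaling efficiency: {efficiency:.0%}")
```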

A40 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 35,907 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 1.13.0a0 | WaveGlow | 148,179 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 1.13.0a0 | GNMT v2 | 80,263 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 1.13.0a0 | NCF | 19,499,362 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A40 |
| PyTorch | 1.13.0a0 | Transformer XL Large | 10,023 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | WikiText-103 | A40 |
| PyTorch | 1.13.0a0 | FastPitch | 93,824 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 40,845 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | WikiText-103 | A40 |
| PyTorch | 1.13.0a0 | nnU-Net | 561 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40 |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 181 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Imagenet2012 | A40 |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 181 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Imagenet2012 | A40 |
| PyTorch | 1.13.0a0 | SE3 Transformer | 1,900 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 9,642 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A40 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 9,542 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A40 |
| PyTorch | 1.13.0a0 | HiFiGAN | 10,333 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A40 |
| TensorFlow | 1.15.5 | U-Net Industrial | 122 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A40 |
| TensorFlow | 2.10.0 | U-Net Medical | 70 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A40 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 160 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A40 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A30 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 34,406 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | WaveGlow | 153,115 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | FastPitch | 91,968 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 1.13.0a0 | NCF | 21,620,400 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A30 |
| PyTorch | 1.13.0a0 | GNMT v2 | 91,214 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 18,368 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | WikiText-103 | A30 |
| PyTorch | 1.13.0a0 | Transformer XL Large | 7,150 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 4 | WikiText-103 | A30 |
| PyTorch | 1.13.0a0 | nnU-Net | 590 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30 |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 191 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A30 |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 188 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A30 |
| PyTorch | 1.13.0a0 | SE3 Transformer | 2,152 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 10,437 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A30 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 10,463 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A30 |
| PyTorch | 1.13.0a0 | HiFiGAN | 10,605 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A30 |
| TensorFlow | 1.15.5 | U-Net Industrial | 116 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A30 |
| TensorFlow | 2.10.0 | U-Net Medical | 74 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 1.15.5 | Transformer XL Base | 18,259 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | WikiText-103 | A30 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 162 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.10.0 | SIM | 404,661 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A30 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A10 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 29,310 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 1.13.0a0 | WaveGlow | 116,803 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 1.13.0a0 | FastPitch | 74,233 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 15,388 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | WikiText-103 | A10 |
| PyTorch | 1.13.0a0 | GNMT v2 | 64,713 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 1.13.0a0 | NCF | 16,211,650 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A10 |
| PyTorch | 1.13.0a0 | Transformer XL Large | 6,133 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 4 | WikiText-103 | A10 |
| PyTorch | 1.13.0a0 | nnU-Net | 447 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10 |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 146 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.10-py3 | Mixed | 32 | Imagenet2012 | A10 |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 145 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A10 |
| PyTorch | 1.13.0a0 | SE3 Transformer | 1,686 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 8,066 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A10 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 8,036 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A10 |
| PyTorch | 1.13.0a0 | HiFiGAN | 8,113 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A10 |
| TensorFlow | 1.15.5 | U-Net Industrial | 100 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A10 |
| TensorFlow | 2.10.0 | U-Net Medical | 51 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 119 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.10.0 | SIM | 368,967 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A10 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

T4 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 17,983 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | WaveGlow | 56,267 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | FastPitch | 34,006 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | GNMT v2 | 30,963 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| PyTorch | 1.13.0a0 | NCF | 7,741,117 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
| PyTorch | 1.13.0a0 | Transformer XL Base | 8,856 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | Transformer XL Large | 2,798 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | nnU-Net | 202 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4 |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | SE3 Transformer | 638 molecules/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
| PyTorch | 1.13.0a0 | TFT - Traffic | 4,317 items/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
| PyTorch | 1.13.0a0 | TFT - Electricity | 4,317 items/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
| PyTorch | 1.13.0a0 | HiFiGAN | 2,803 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 64 | LJSpeech-1.1 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Industrial | 45 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 58 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
| TensorFlow | 2.10.0 | EfficientNet-B0 | 638 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
| TensorFlow | 2.10.0 | SIM | 175,996 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.10-py3 | Mixed | 131072 | Amazon Reviews | NVIDIA T4 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec



V100 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | 1.13.0a0 | Tacotron2 | 31,218 total output mels/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | WaveGlow | 156,982 output samples/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | FastPitch | 87,316 frames/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | GNMT v2 | 77,935 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | NCF | 24,122,383 samples/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | Transformer XL Base | 17,155 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | Transformer XL Large | 7,172 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | nnU-Net | 660 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | EfficientNet-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | SE3 Transformer | 2,106 molecules/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | TFT - Traffic | 11,743 items/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | TFT - Electricity | 11,695 items/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
| PyTorch | 1.13.0a0 | HiFiGAN | 9,695 total output mels/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 118 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Medical | 68 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| TensorFlow | 2.10.0 | Electra Fine Tuning | 185 sequences/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | Transformer XL Base | 18,671 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
| TensorFlow | 2.10.0 | SIM | 368,684 samples/sec | 1x V100 | DGX-2 | 22.10-py3 | Mixed | 131072 | Amazon Reviews | V100-SXM3-32GB |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

AI Inference

Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack.


MLPerf Inference v2.1 Performance Benchmarks

Offline Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 v1.5 | 81,292 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 335,144 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 5,589 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 316,342 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
| RetinaNet | 960 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 4,739 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 74 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 4,345 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages | 0.3755 mAP |
| 3D-UNet | 5 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 26 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 0.51 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 25 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
| RNN-T | 22,885 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 106,726 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 1,918 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 102,784 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
| BERT | 7,921 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 13,968 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 1,757 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 247 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 12,822 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.87% f1 |
| DLRM | 695,298 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 2,443,220 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 314,992 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 38,995 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 2,291,310 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraint (ms) | Dataset
ResNet-50 v1.5 | 58,995 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 76.46% Top1 | 15 | ImageNet
 | 300,064 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet
 | 3,527 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet
 | 236,057 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet
RetinaNet | 848 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 0.3755 mAP | 100 | OpenImages
 | 4,096 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages
 | 45 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages
 | 3,997 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.3755 mAP | 100 | OpenImages
RNN-T | 21,488 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
 | 104,020 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
 | 1,347 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
 | 90,005 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech
BERT | 6,195 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
 | 12,815 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
 | 1,572 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
 | 164 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
 | 10,795 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.87% f1 | 130 | SQuAD v1.1
DLRM | 545,174 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
 | 2,390,910 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
 | 298,565 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
 | 35,991 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
 | 1,326,940 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 288,733 samples/sec | 93.68 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
 | 252,721 samples/sec | 122.19 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet
RetinaNet | 4,122 samples/sec | 1.32 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages
 | 3,805 samples/sec | 1.73 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages
3D-UNet | 23 samples/sec | 0.008 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019
 | 19 samples/sec | 0.011 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019
RNN-T | 84,508 samples/sec | 27.79 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
 | 78,750 samples/sec | 38.88 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech
BERT | 11,152 samples/sec | 3.33 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
 | 11,158 samples/sec | 4.37 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1
DLRM | 2,128,420 samples/sec | 641.77 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs
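A throughput/efficiency pair implicitly reports total accelerator power, since power = throughput ÷ (throughput per watt). As an illustration (the helper below is hypothetical, and the ~385 W per GPU it implies is a back-of-the-envelope estimate, not a reported measurement):

```python
def board_power_watts(throughput, throughput_per_watt):
    """Total accelerator board power implied by a throughput/efficiency pair."""
    return throughput / throughput_per_watt

# ResNet-50 v1.5 offline on 8x A100 SXM: 288,733 samples/sec at
# 93.68 samples/sec/watt implies roughly 3,082 W across 8 GPUs,
# i.e. about 385 W per A100.
total = board_power_watts(288_733, 93.68)
print(round(total), round(total / 8))  # 3082 385
```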

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 229,055 queries/sec | 78.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
 | 185,047 queries/sec | 87.2 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet
RetinaNet | 3,896 queries/sec | 1.25 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages
 | 2,296 queries/sec | 1.21 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages
RNN-T | 88,003 queries/sec | 25.44 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
 | 74,995 queries/sec | 33.88 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech
BERT | 9,995 queries/sec | 2.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
 | 7,494 queries/sec | 3.45 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1
DLRM | 2,002,080 queries/sec | 592.73 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs

MLPerf™ v2.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99.9% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0082, 2.1-0084, 2.1-0085, 2.1-0087, 2.1-0088, 2.1-0089, 2.1-0121, 2.1-0122. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM-80GB is a preview submission.
BERT-Large sequence length = 384.
DLRM throughput is measured in samples, where one sample averages 270 pairs.
1x1g.10gb denotes a MIG configuration: the workload runs on a single 1g.10gb MIG slice (one compute slice with 10 GB of memory) of a single A100.
For MLPerf™ data across other scenarios and for the full latency constraints, see the MLCommons results pages.
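The MIG shorthand is compact; as an illustration, a small helper (hypothetical, not part of any NVIDIA tooling) can unpack notation like `1x1g.10gb` into its parts:

```python
import re

def parse_mig_config(s):
    """Parse a MIG shorthand like '1x1g.10gb'.

    Returns (instance_count, compute_slices, memory_gb). Hypothetical
    helper for illustration only; not part of any NVIDIA tool.
    """
    m = re.fullmatch(r"(\d+)x(\d+)g\.(\d+)gb", s)
    if not m:
        raise ValueError(f"not a MIG shorthand: {s!r}")
    return tuple(int(g) for g in m.groups())

# '1x1g.10gb' -> one workload instance on a 1g.10gb slice
# (1 compute slice, 10 GB of memory) of a single A100.
print(parse_mig_config("1x1g.10gb"))  # (1, 1, 10)
```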

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v2.1


NVIDIA took top performance spots on all MLPerf™ Inference 2.1 tests, the AI industry's leading benchmark suite. For inference submissions we have typically used a custom A100 serving harness, designed and optimized to deliver the highest possible inference performance on MLPerf™ workloads, which must run on bare metal. As the results referenced below show, Triton Inference Server delivered performance comparable to this custom harness.

MLPerf™ v2.1 A100 Inference Closed: ResNet-50 v1.5, RetinaNet, BERT 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0088, 2.1-0090. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.

 

NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 31.784 | 755 inf/sec | 384 | 22.11-py3
BERT Large Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 61.186 | 784 inf/sec | 384 | 22.11-py3
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 38.159 | 629 inf/sec | 384 | 22.11-py3
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 72.731 | 660 inf/sec | 384 | 22.11-py3
BERT Base Inference | A100-SXM4-80GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 3.966 | 6,050 inf/sec | 128 | 22.11-py3
BERT Base Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 6.75 | 7,110 inf/sec | 128 | 22.11-py3
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 4.435 | 5,408 inf/sec | 128 | 22.11-py3
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 7.766 | 6,179 inf/sec | 128 | 22.11-py3
DLRM Inference | A100-SXM4-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 1 | 65,536 | 28 | 2.206 | 12,687 inf/sec | - | 22.11-py3
DLRM Inference | A100-SXM4-80GB | pytorch_libtorch | PyTorch | Mixed | 2 | 2 | 65,536 | 28 | 2.243 | 24,953 inf/sec | - | 22.11-py3
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 1 | 65,536 | 30 | 2.316 | 12,946 inf/sec | - | 22.08-py3
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 1 | 2 | 65,536 | 30 | 2.39 | 25,093 inf/sec | - | 22.08-py3
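The latency and throughput columns in these Triton tables are consistent with Little's law: with N concurrent clients each keeping one batch of size B in flight, steady-state throughput ≈ N × B / latency. A quick sanity check (illustrative Python, using the A100-SXM4-40GB BERT Large rows):

```python
def expected_throughput(concurrent_clients, client_batch, latency_ms):
    """Little's law estimate: inferences/sec sustained when each of
    `concurrent_clients` keeps one batch of `client_batch` in flight."""
    return concurrent_clients * client_batch / (latency_ms / 1000.0)

# 24 clients, batch 1, 31.784 ms latency -> ~755 inf/sec (measured: 755)
print(round(expected_throughput(24, 1, 31.784)))  # 755
# 24 clients, batch 2, 61.186 ms latency -> ~784 inf/sec (measured: 784)
print(round(expected_throughput(24, 2, 61.186)))  # 784
```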

A30 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 66.753 | 359 inf/sec | 384 | 22.11-py3
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 20 | 108.243 | 370 inf/sec | 384 | 22.11-py3
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.254 | 3,308 inf/sec | 128 | 22.11-py3
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 13.004 | 3,690 inf/sec | 128 | 22.11-py3

A10 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 101.549 | 236 inf/sec | 384 | 22.11-py3
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 198.296 | 242 inf/sec | 384 | 22.11-py3
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 10.831 | 2,220 inf/sec | 128 | 22.11-py3
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 2 | 2 | 1 | 20 | 16.999 | 2,353 inf/sec | 128 | 22.11-py3

T4 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 24 | 255.522 | 94 inf/sec | 384 | 22.11-py3
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 20 | 427.107 | 94 inf/sec | 384 | 22.11-py3
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 24 | 25.149 | 954 inf/sec | 128 | 22.11-py3
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 24 | 46.74 | 1,027 inf/sec | 128 | 22.11-py3


V100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 91.464 | 263 inf/sec | 384 | 22.11-py3
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 20 | 146.731 | 273 inf/sec | 384 | 22.11-py3
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 11.298 | 2,124 inf/sec | 128 | 22.11-py3
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 19.138 | 2,508 inf/sec | 128 | 22.11-py3
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 3.452 | 8,688 inf/sec | - | 22.11-py3
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 1 | 2 | 65,536 | 30 | 3.739 | 16,041 inf/sec | - | 22.11-py3

Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: Mixed | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: Mixed | Dataset: Synthetic

 

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 11,807 images/sec | 63 images/sec/watt | 0.68 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 128 | 30,814 images/sec | 80 images/sec/watt | 4.15 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 225 | 32,500 images/sec | - | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 11,524 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 128 | 30,004 images/sec | 76 images/sec/watt | 4.27 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 216 | 31,228 images/sec | - | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
BERT-BASE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 7,300 sequences/sec | 27 sequences/sec/watt | 1.1 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 15,147 sequences/sec | 38 sequences/sec/watt | 8.45 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
BERT-LARGE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 2,679 sequences/sec | 9 sequences/sec/watt | 2.99 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 4,965 sequences/sec | 12 sequences/sec/watt | 25.78 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
EfficientNet-B0 | 8 | 9,155 images/sec | 58 images/sec/watt | 0.87 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 30,273 images/sec | 95 images/sec/watt | 4.23 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
EfficientNet-B4 | 8 | 2,593 images/sec | 12 images/sec/watt | 3.09 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 128 | 4,588 images/sec | 12 images/sec/watt | 27.9 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB

A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128
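At the larger batch sizes these runs are throughput-bound, so the reported latency is close to batch_size ÷ throughput, which is a useful consistency check when reading the tables. Illustrative Python (the helper name is our own):

```python
def implied_latency_ms(batch_size, samples_per_sec):
    """Time to process one batch when the GPU is saturated."""
    return batch_size / samples_per_sec * 1000.0

# A100 ResNet-50, batch 128 at 30,814 images/sec -> ~4.15 ms, as reported.
print(round(implied_latency_ms(128, 30_814), 2))  # 4.15
```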

A100 1/7 MIG Inference Performance

Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,747 images/sec | 31 images/sec/watt | 2.13 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 30 | 4,352 images/sec | - | 6.89 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 4,708 images/sec | 38 images/sec/watt | 27.19 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 3,661 images/sec | 31 images/sec/watt | 2.19 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 29 | 4,189 images/sec | - | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 4,555 images/sec | 36 images/sec/watt | 28.1 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
BERT-BASE | 8 | 1,866 sequences/sec | 15 sequences/sec/watt | 4.29 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 2,304 sequences/sec | 16 sequences/sec/watt | 55.55 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
BERT-LARGE | 8 | 610 sequences/sec | 5 sequences/sec/watt | 13.11 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 741 sequences/sec | 5 sequences/sec/watt | 172.76 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB

A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 26,040 images/sec | 80 images/sec/watt | 2.16 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 29 | 30,190 images/sec | - | 6.73 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 32,792 images/sec | 85 images/sec/watt | 27.35 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 25,299 images/sec | 77 images/sec/watt | 2.22 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 29 | 29,285 images/sec | - | 2.94 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 31,749 images/sec | 83 images/sec/watt | 28.26 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
BERT-BASE | 8 | 12,941 sequences/sec | 34 sequences/sec/watt | 4.34 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 15,157 sequences/sec | 38 sequences/sec/watt | 59.15 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
BERT-LARGE | 8 | 4,210 sequences/sec | 12 sequences/sec/watt | 13.32 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB
 | 128 | 4,806 sequences/sec | 12 sequences/sec/watt | 186.49 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB

A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128

 

A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 10,034 images/sec | 38 images/sec/watt | 0.8 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 107 | 15,868 images/sec | - | 6.74 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 15,727 images/sec | 53 images/sec/watt | 8.14 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
ResNet-50v1.5 | 8 | 9,724 images/sec | 36 images/sec/watt | 0.82 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 100 | 14,965 images/sec | - | 6.68 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 14,950 images/sec | 50 images/sec/watt | 8.56 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
BERT-BASE | 8 | 5,602 sequences/sec | 19 sequences/sec/watt | 1.43 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 7,735 sequences/sec | 26 sequences/sec/watt | 16.55 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
BERT-LARGE | 8 | 1,796 sequences/sec | 6 sequences/sec/watt | 4.45 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 2,359 sequences/sec | 8 sequences/sec/watt | 54.27 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
EfficientNet-B0 | 8 | 9,343 images/sec | 50 images/sec/watt | 0.86 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 19,243 images/sec | 64 images/sec/watt | 6.65 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
EfficientNet-B4 | 8 | 1,943 images/sec | 7 images/sec/watt | 4.12 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40
 | 128 | 2,630 images/sec | 9 images/sec/watt | 48.68 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 8,851 images/sec | 69 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 107 | 15,973 images/sec | - | 6.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 16,057 images/sec | 98 images/sec/watt | 7.97 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
ResNet-50v1.5 | 8 | 8,725 images/sec | 67 images/sec/watt | 0.92 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 102 | 15,271 images/sec | - | 6.68 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 15,541 images/sec | 95 images/sec/watt | 8.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-BASE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 5,124 sequences/sec | 31 sequences/sec/watt | 1.56 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 7,590 sequences/sec | 46 sequences/sec/watt | 16.86 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-LARGE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 1,775 sequences/sec | 11 sequences/sec/watt | 4.51 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 2,438 sequences/sec | 15 sequences/sec/watt | 52.5 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
EfficientNet-B0 | 8 | 7,511 images/sec | 74 images/sec/watt | 1.07 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 16,697 images/sec | 102 images/sec/watt | 7.67 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
EfficientNet-B4 | 8 | 1,719 images/sec | 12 images/sec/watt | 4.65 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 2,358 images/sec | 14 images/sec/watt | 54.29 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container

 

A30 1/4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,623 images/sec | 44 images/sec/watt | 2.21 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 29 | 4,285 images/sec | - | 6.77 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 4,604 images/sec | 52 images/sec/watt | 27.8 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
ResNet-50v1.5 | 8 | 3,543 images/sec | 41 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 28 | 4,126 images/sec | - | 6.79 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 4,456 images/sec | 50 images/sec/watt | 28.73 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-BASE | 8 | 1,879 sequences/sec | 20 sequences/sec/watt | 4.26 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 2,276 sequences/sec | 22 sequences/sec/watt | 56.23 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-LARGE | 8 | 604 sequences/sec | 6 sequences/sec/watt | 13.25 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 742 sequences/sec | 7 sequences/sec/watt | 172.57 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container

 

A30 4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 14,088 images/sec | 86 images/sec/watt | 2.28 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 27 | 16,432 images/sec | - | 6.59 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 17,343 images/sec | 106 images/sec/watt | 29.63 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
ResNet-50v1.5 | 8 | 13,734 images/sec | 84 images/sec/watt | 2.34 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 26 | 15,777 images/sec | - | 6.61 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 16,745 images/sec | 102 images/sec/watt | 30.69 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-BASE | 8 | 6,896 sequences/sec | 42 sequences/sec/watt | 4.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 7,742 sequences/sec | 47 sequences/sec/watt | 66.35 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
BERT-LARGE | 8 | 2,190 sequences/sec | 13 sequences/sec/watt | 14.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30
 | 128 | 2,450 sequences/sec | 15 sequences/sec/watt | 209.68 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container

 

A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 7,826 images/sec | 52 images/sec/watt | 1.02 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 73 | 10,950 images/sec | - | 6.67 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 11,520 images/sec | 77 images/sec/watt | 11.11 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
ResNet-50v1.5 | 8 | 7,628 images/sec | 51 images/sec/watt | 1.05 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 70 | 10,694 images/sec | - | 6.55 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 10,852 images/sec | 73 images/sec/watt | 11.8 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
BERT-BASE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 4,149 sequences/sec | 28 sequences/sec/watt | 1.93 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 5,100 sequences/sec | 34 sequences/sec/watt | 25.1 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
BERT-LARGE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 1,248 sequences/sec | 9 sequences/sec/watt | 6.41 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 1,576 sequences/sec | 11 sequences/sec/watt | 81.22 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
EfficientNet-B0 | 8 | 8,241 images/sec | 55 images/sec/watt | 0.97 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 14,102 images/sec | 94 images/sec/watt | 9.08 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
EfficientNet-B4 | 8 | 1,535 images/sec | 10 images/sec/watt | 5.21 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10
 | 128 | 1,828 images/sec | 12 images/sec/watt | 70.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container

 

A2 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 2,621 images/sec | 44 images/sec/watt | 3.05 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 19 | 2,927 images/sec | - | 6.49 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 3,059 images/sec | 51 images/sec/watt | 41.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
ResNet-50v1.5 | 8 | 2,519 images/sec | 42 images/sec/watt | 3.18 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 18 | 2,809 images/sec | - | 6.76 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 3,059 images/sec | 51 images/sec/watt | 41.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
BERT-BASE | 8 | 1,132 sequences/sec | 19 sequences/sec/watt | 7.07 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 1,194 sequences/sec | 20 sequences/sec/watt | 107.23 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
BERT-LARGE | 8 | 339 sequences/sec | 6 sequences/sec/watt | 23.58 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 362 sequences/sec | 6 sequences/sec/watt | 353.44 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
EfficientNet-B0 | 8 | 3,044 images/sec | 59 images/sec/watt | 2.63 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 3,929 images/sec | 65 images/sec/watt | 32.58 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
EfficientNet-B4 | 8 | 469 images/sec | 8 images/sec/watt | 17.05 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2
 | 128 | 514 images/sec | 9 images/sec/watt | 249.04 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power

 

T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,811 images/sec | 54 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 31 | 4,615 images/sec | - | 6.72 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 5,003 images/sec | 72 images/sec/watt | 25.59 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,740 images/sec | 53 images/sec/watt | 2.14 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 28 | 4,309 images/sec | - | 6.5 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 4,864 images/sec | 69 images/sec/watt | 26.32 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
BERT-BASE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 1,684 sequences/sec | 24 sequences/sec/watt | 4.75 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 1,855 sequences/sec | 27 sequences/sec/watt | 69 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
BERT-LARGE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 550 sequences/sec | 8 sequences/sec/watt | 14.55 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 526 sequences/sec | 8 sequences/sec/watt | 243.35 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
EfficientNet-B0 | 8 | 4,722 images/sec | 68 images/sec/watt | 1.69 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 6,388 images/sec | 92 images/sec/watt | 20.04 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
EfficientNet-B4 | 8 | 786 images/sec | 11 images/sec/watt | 10.18 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
 | 128 | 886 images/sec | 13 images/sec/watt | 144.49 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container



V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 4,398 images/sec | 15 images/sec/watt | 1.82 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 7,896 images/sec | 23 images/sec/watt | 16.21 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
ResNet-50v1.5 | 8 | 4,283 images/sec | 14 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 7,495 images/sec | 22 images/sec/watt | 17.08 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
BERT-BASE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 2,359 sequences/sec | 7 sequences/sec/watt | 3.39 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 3,111 sequences/sec | 9 sequences/sec/watt | 41.15 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
BERT-LARGE | 1 | For batch size 1, see the Triton Inference Server results above
 | 2 | For batch size 2, see the Triton Inference Server results above
 | 8 | 777 sequences/sec | 2 sequences/sec/watt | 10.3 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 947 sequences/sec | 3 sequences/sec/watt | 135.11 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
EfficientNet-B0 | 8 | 4,690 images/sec | 22 images/sec/watt | 1.71 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 9,493 images/sec | 30 images/sec/watt | 13.48 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
EfficientNet-B4 | 8 | 951 images/sec | 3 images/sec/watt | 8.42 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB
 | 128 | 1,258 images/sec | 4 images/sec/watt | 101.76 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container


Inference Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following the links to the NGC catalog scripts

A100 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 11,644 images/sec | - | 0.69 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 128 | 28,444 images/sec | - | 4.5 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 8 | 11,453 images/sec | - | 0.7 | 1x A100 | AWS EC2 p4d.24xlarge | 22.10-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
 | 128 | 28,528 images/sec | - | 4.49 | 1x A100 | AWS EC2 p4d.24xlarge | 22.10-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
 | 8 | 11,334 images/sec | - | 0.71 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
 | 128 | 29,613 images/sec | - | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
BERT-LARGE | 8 | 2,663 sequences/sec | - | 3 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 128 | 4,966 sequences/sec | - | 25.78 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB
 | 8 | 2,569 sequences/sec | - | 3.11 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
 | 128 | 5,008 sequences/sec | - | 25.56 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
 | 8 | 2,698 sequences/sec | - | 2.96 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
 | 128 | 4,907 sequences/sec | - | 26.09 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB

BERT-Large: Sequence Length = 128

T4 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 3,354 images/sec | - | 2.39 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,195 images/sec | - | 30.52 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
BERT-LARGE | 8 | 487 sequences/sec | - | 16.41 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
BERT-LARGE | 128 | 433 sequences/sec | - | 295.37 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4


V100 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 4,297 images/sec | - | 1.86 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB
ResNet-50v1.5 | 128 | 7,212 images/sec | - | 17.75 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB
BERT-LARGE | 8 | 707 sequences/sec | - | 11.32 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB
BERT-LARGE | 128 | 920 sequences/sec | - | 139.09 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

A100 ASR Benchmarks

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 11.4 | 1 | A100 SXM4-40GB
Citrinet | n-gram | 64 | 64.1 | 64 | A100 SXM4-40GB
Citrinet | n-gram | 128 | 103 | 126 | A100 SXM4-40GB
Citrinet | n-gram | 256 | 166.7 | 250 | A100 SXM4-40GB
Citrinet | n-gram | 384 | 235 | 371 | A100 SXM4-40GB
Citrinet | n-gram | 512 | 311 | 490 | A100 SXM4-40GB
Citrinet | n-gram | 768 | 492 | 717 | A100 SXM4-40GB
Conformer | n-gram | 1 | 16.8 | 1 | A100 SXM4-40GB
Conformer | n-gram | 64 | 109 | 64 | A100 SXM4-40GB
Conformer | n-gram | 128 | 130 | 126 | A100 SXM4-40GB
Conformer | n-gram | 256 | 236 | 249 | A100 SXM4-40GB
Conformer | n-gram | 384 | 342 | 369 | A100 SXM4-40GB
Conformer | n-gram | 512 | 485 | 486 | A100 SXM4-40GB

A100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 10.47 | 1 | A100 SXM4-40GB
Citrinet | n-gram | 8 | 15.14 | 8 | A100 SXM4-40GB
Citrinet | n-gram | 16 | 26.2 | 16 | A100 SXM4-40GB
Citrinet | n-gram | 32 | 39.1 | 32 | A100 SXM4-40GB
Citrinet | n-gram | 48 | 48 | 48 | A100 SXM4-40GB
Citrinet | n-gram | 64 | 55.4 | 64 | A100 SXM4-40GB
Conformer | n-gram | 1 | 14.69 | 1 | A100 SXM4-40GB
Conformer | n-gram | 8 | 37.7 | 8 | A100 SXM4-40GB
Conformer | n-gram | 16 | 41.5 | 16 | A100 SXM4-40GB
Conformer | n-gram | 32 | 55.7 | 32 | A100 SXM4-40GB
Conformer | n-gram | 48 | 66.8 | 48 | A100 SXM4-40GB
Conformer | n-gram | 64 | 82.2 | 63 | A100 SXM4-40GB

A100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 32 | 4390 | A100 SXM4-40GB
Conformer | n-gram | 32 | 1700 | A100 SXM4-40GB

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
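A way to read the streaming numbers: N concurrent real-time streams offer the server N seconds of audio per wall-clock second, so streaming RTFX is capped near the stream count, and the ratio RTFX / streams shows how close the server runs to the offered load. A minimal sketch, using the Citrinet 128-stream row from the table above:

```python
# RTFX = seconds of audio processed per wall-clock second.
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    return audio_seconds / wall_seconds

# 128 real-time streams offer 128 s of audio per wall-clock second.
offered_rtfx = 128          # streams, each playing in real time
measured_rtfx = 126         # Citrinet at 128 streams (table above)

# The server keeps up with ~98% of the offered real-time load.
print(round(measured_rtfx / offered_rtfx, 2))  # 0.98
```

Offline mode has no real-time cap, which is why its RTFX values (e.g. 4390 for Citrinet on A100) are far above the stream count.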

A30 ASR Benchmarks

A30 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 14.64 | 1 | A30
Citrinet | n-gram | 64 | 101 | 63 | A30
Citrinet | n-gram | 128 | 152 | 126 | A30
Citrinet | n-gram | 256 | 272 | 249 | A30
Citrinet | n-gram | 384 | 393 | 368 | A30
Citrinet | n-gram | 512 | 569 | 484 | A30
Conformer | n-gram | 1 | 21.76 | 1 | A30
Conformer | n-gram | 64 | 134 | 63 | A30
Conformer | n-gram | 128 | 216 | 126 | A30
Conformer | n-gram | 256 | 397 | 248 | A30
Conformer | n-gram | 384 | 672 | 364 | A30

A30 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 13.74 | 1 | A30
Citrinet | n-gram | 8 | 29.4 | 8 | A30
Citrinet | n-gram | 16 | 44.2 | 16 | A30
Citrinet | n-gram | 32 | 58.7 | 32 | A30
Citrinet | n-gram | 48 | 65.8 | 48 | A30
Citrinet | n-gram | 64 | 83 | 63 | A30
Conformer | n-gram | 1 | 20.32 | 1 | A30
Conformer | n-gram | 8 | 42.2 | 8 | A30
Conformer | n-gram | 16 | 51.5 | 16 | A30
Conformer | n-gram | 32 | 71.3 | 32 | A30
Conformer | n-gram | 48 | 103.9 | 48 | A30
Conformer | n-gram | 64 | 126.8 | 63 | A30

A30 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 32 | 3142 | A30
Conformer | n-gram | 32 | 1120 | A30


A10 ASR Benchmarks

A10 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 12.93 | 1 | A10
Citrinet | n-gram | 64 | 88.5 | 64 | A10
Citrinet | n-gram | 128 | 162.6 | 126 | A10
Citrinet | n-gram | 256 | 316 | 248 | A10
Citrinet | n-gram | 384 | 486 | 367 | A10
Citrinet | n-gram | 512 | 710 | 481 | A10
Conformer | n-gram | 1 | 15.33 | 1 | A10
Conformer | n-gram | 64 | 133 | 63 | A10
Conformer | n-gram | 128 | 234 | 126 | A10
Conformer | n-gram | 256 | 434 | 247 | A10

A10 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 10.405 | 1 | A10
Citrinet | n-gram | 8 | 20.22 | 8 | A10
Citrinet | n-gram | 16 | 29.8 | 16 | A10
Citrinet | n-gram | 32 | 49.1 | 32 | A10
Citrinet | n-gram | 48 | 67.6 | 48 | A10
Citrinet | n-gram | 64 | 84.7 | 63 | A10
Conformer | n-gram | 1 | 13.49 | 1 | A10
Conformer | n-gram | 8 | 33.8 | 8 | A10
Conformer | n-gram | 16 | 40.9 | 16 | A10
Conformer | n-gram | 32 | 71.5 | 32 | A10
Conformer | n-gram | 48 | 108 | 48 | A10
Conformer | n-gram | 64 | 140 | 63 | A10

A10 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 32 | 2719 | A10
Conformer | n-gram | 32 | 992 | A10


V100 ASR Benchmarks

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 13.91 | 1 | V100 SXM2-16GB
Citrinet | n-gram | 64 | 87.9 | 63 | V100 SXM2-16GB
Citrinet | n-gram | 128 | 153 | 125 | V100 SXM2-16GB
Citrinet | n-gram | 256 | 283.7 | 246 | V100 SXM2-16GB
Citrinet | n-gram | 384 | 407 | 363 | V100 SXM2-16GB
Citrinet | n-gram | 512 | 590 | 474 | V100 SXM2-16GB
Conformer | n-gram | 1 | 22.3 | 1 | V100 SXM2-16GB
Conformer | n-gram | 64 | 153 | 63 | V100 SXM2-16GB
Conformer | n-gram | 128 | 230.6 | 125 | V100 SXM2-16GB
Conformer | n-gram | 256 | 400 | 245 | V100 SXM2-16GB
Conformer | n-gram | 384 | 716 | 359 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 13.26 | 1 | V100 SXM2-16GB
Citrinet | n-gram | 8 | 24.32 | 8 | V100 SXM2-16GB
Citrinet | n-gram | 16 | 32.9 | 16 | V100 SXM2-16GB
Citrinet | n-gram | 32 | 50.5 | 32 | V100 SXM2-16GB
Citrinet | n-gram | 48 | 65 | 48 | V100 SXM2-16GB
Citrinet | n-gram | 64 | 84.1 | 63 | V100 SXM2-16GB
Conformer | n-gram | 1 | 19.7 | 1 | V100 SXM2-16GB
Conformer | n-gram | 8 | 55 | 8 | V100 SXM2-16GB
Conformer | n-gram | 16 | 52.3 | 16 | V100 SXM2-16GB
Conformer | n-gram | 32 | 76.7 | 32 | V100 SXM2-16GB
Conformer | n-gram | 48 | 119.8 | 47 | V100 SXM2-16GB
Conformer | n-gram | 64 | 143 | 63 | V100 SXM2-16GB

V100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 32 | 2693 | V100 SXM2-16GB
Conformer | n-gram | 32 | 964 | V100 SXM2-16GB




T4 ASR Benchmarks

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 26.7 | 1 | NVIDIA T4
Citrinet | n-gram | 64 | 170.8 | 63 | NVIDIA T4
Citrinet | n-gram | 128 | 342 | 125 | NVIDIA T4
Citrinet | n-gram | 256 | 736 | 242 | NVIDIA T4
Conformer | n-gram | 1 | 59.1 | 1 | NVIDIA T4
Conformer | n-gram | 64 | 310 | 63 | NVIDIA T4
Conformer | n-gram | 128 | 505 | 124 | NVIDIA T4

T4 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 1 | 25.9 | 1 | NVIDIA T4
Citrinet | n-gram | 8 | 57 | 8 | NVIDIA T4
Citrinet | n-gram | 16 | 60.5 | 16 | NVIDIA T4
Citrinet | n-gram | 32 | 93.1 | 32 | NVIDIA T4
Citrinet | n-gram | 48 | 139.7 | 47 | NVIDIA T4
Conformer | n-gram | 1 | 53.4 | 1 | NVIDIA T4
Conformer | n-gram | 8 | 82 | 8 | NVIDIA T4
Conformer | n-gram | 16 | 104.1 | 16 | NVIDIA T4
Conformer | n-gram | 32 | 239 | 32 | NVIDIA T4

T4 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
Citrinet | n-gram | 32 | 1322 | NVIDIA T4
Conformer | n-gram | 32 | 488 | NVIDIA T4


A100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + HiFi-GAN | 1 | 0.021 | 0.003 | 145 | A100 SXM4-40GB
FastPitch + HiFi-GAN | 4 | 0.037 | 0.006 | 336 | A100 SXM4-40GB
FastPitch + HiFi-GAN | 6 | 0.046 | 0.007 | 395 | A100 SXM4-40GB
FastPitch + HiFi-GAN | 8 | 0.056 | 0.009 | 421 | A100 SXM4-40GB
FastPitch + HiFi-GAN | 10 | 0.059 | 0.01 | 434 | A100 SXM4-40GB
FastPitch + HiFi-GAN | 32 | 0.339 | 0.015 | 437 | A100 SXM4-40GB

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
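For TTS there is no real-time cap on RTFX, so dividing throughput by the stream count gives the per-stream generation speed relative to real time. A small illustration, using the 32-stream A100 row above:

```python
# TTS RTFX = seconds of audio generated per wall-clock second.
streams = 32
measured_rtfx = 437   # FastPitch + HiFi-GAN, 32 streams (table above)

# Each stream generates audio roughly 13.7x faster than real time.
per_stream_speedup = measured_rtfx / streams
print(round(per_stream_speedup, 1))  # 13.7
```

This per-stream figure, together with the latency-to-first-audio column, is what determines how responsive a deployed voice service feels at a given concurrency.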

A30 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + HiFi-GAN | 1 | 0.022 | 0.004 | 127 | A30
FastPitch + HiFi-GAN | 4 | 0.044 | 0.007 | 267 | A30
FastPitch + HiFi-GAN | 6 | 0.064 | 0.009 | 292 | A30
FastPitch + HiFi-GAN | 8 | 0.082 | 0.011 | 310 | A30
FastPitch + HiFi-GAN | 10 | 0.091 | 0.013 | 318 | A30
FastPitch + HiFi-GAN | 16 | 0.196 | 0.014 | 332 | A30
FastPitch + HiFi-GAN | 32 | 0.427 | 0.019 | 349 | A30


A10 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + HiFi-GAN | 1 | 0.021 | 0.004 | 127 | A10
FastPitch + HiFi-GAN | 4 | 0.049 | 0.008 | 235 | A10
FastPitch + HiFi-GAN | 6 | 0.072 | 0.011 | 250 | A10
FastPitch + HiFi-GAN | 8 | 0.096 | 0.014 | 256 | A10
FastPitch + HiFi-GAN | 16 | 0.218 | 0.02 | 278 | A10
FastPitch + HiFi-GAN | 32 | 0.521 | 0.024 | 284 | A10


V100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + HiFi-GAN | 1 | 0.024 | 0.005 | 104 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 4 | 0.055 | 0.009 | 215 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 6 | 0.08 | 0.012 | 227 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 8 | 0.108 | 0.015 | 232 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 10 | 0.119 | 0.018 | 235 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 16 | 0.238 | 0.022 | 254 | V100 SXM2-16GB
FastPitch + HiFi-GAN | 32 | 0.562 | 0.026 | 264 | V100 SXM2-16GB




T4 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + HiFi-GAN | 1 | 0.05 | 0.007 | 64 | NVIDIA T4
FastPitch + HiFi-GAN | 4 | 0.096 | 0.016 | 121 | NVIDIA T4
FastPitch + HiFi-GAN | 6 | 0.142 | 0.022 | 127 | NVIDIA T4
FastPitch + HiFi-GAN | 8 | 0.188 | 0.028 | 132 | NVIDIA T4
FastPitch + HiFi-GAN | 10 | 0.218 | 0.03 | 134 | NVIDIA T4
FastPitch + HiFi-GAN | 16 | 0.412 | 0.042 | 142 | NVIDIA T4
FastPitch + HiFi-GAN | 32 | 1.024 | 0.047 | 145 | NVIDIA T4


 

Last updated: January 7th, 2023