Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.
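As an illustration of this methodology, the sketch below measures wall-clock time-to-train against a fixed accuracy target. It is only a sketch: `model`, `train_loader`, and `evaluate` are hypothetical placeholders, not the actual benchmark harness, and the target shown is borrowed from the MLPerf ResNet-50 quality target.

```python
import time

TARGET_ACCURACY = 0.759  # e.g. the MLPerf ResNet-50 target of 75.90% Top-1

def train_to_convergence(model, train_loader, evaluate, max_epochs=100):
    """Train until the validation metric reaches the target, and report
    wall-clock time-to-train in minutes (the metric tabulated below)."""
    start = time.perf_counter()
    for epoch in range(max_epochs):
        for batch in train_loader:
            model.training_step(batch)   # placeholder for one optimizer step
        accuracy = evaluate(model)       # placeholder validation pass
        if accuracy >= TARGET_ACCURACY:
            minutes = (time.perf_counter() - start) / 60.0
            return epoch + 1, minutes
    raise RuntimeError("did not converge within max_epochs")
```

The key point is that the clock stops only when the quality target is met, so time-to-train captures both per-step speed and the number of steps needed to converge.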

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack.


NVIDIA Performance on MLPerf 2.0 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 2.0 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 27.227 | 75.90% classification | 8x A100 | Inspur: NF5688M6 | 2.0-2069 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 4.503 | 75.90% classification | 64x A100 | DGX A100 | 2.0-2094 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.555 | 75.90% classification | 1,024x A100 | DGX A100 | 2.0-2101 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.319 | 75.90% classification | 4,216x A100 | DGX A100 | 2.0-2107 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | 3D U-Net | 21.278 | 0.908 Mean DICE score | 8x A100 | H3C: R5500G5-Intelx8A100-SXM-80GB | 2.0-2060 | Mixed | KiTS 2019 | A100-SXM4-80GB
MXNet | 3D U-Net | 3.437 | 0.908 Mean DICE score | 72x A100 | Azure: ND96amsr_A100_v4_n9 | 2.0-2007 | Mixed | KiTS 2019 | A100-SXM4-80GB
MXNet | 3D U-Net | 1.216 | 0.908 Mean DICE score | 768x A100 | DGX A100 | 2.0-2100 | Mixed | KiTS 2019 | A100-SXM4-80GB
PyTorch | BERT | 15.869 | 0.72 Mask-LM accuracy | 8x A100 | Inspur: NF5688M6 | 2.0-2070 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 2.942 | 0.72 Mask-LM accuracy | 64x A100 | DGX A100 | 2.0-2095 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.421 | 0.72 Mask-LM accuracy | 1,024x A100 | DGX A100 | 2.0-2102 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.206 | 0.72 Mask-LM accuracy | 4,096x A100 | DGX A100 | 2.0-2106 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 40.917 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | Inspur: NF5688M6 | 2.0-2070 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 8.447 | 0.377 Box min AP and 0.339 Mask min AP | 64x A100 | DGX A100 | 2.0-2095 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 3.085 | 0.377 Box min AP and 0.339 Mask min AP | 384x A100 | DGX A100 | 2.0-2099 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | RNN-T | 28.759 | 0.058 Word Error Rate | 8x A100 | Inspur: NF5488A5 | 2.0-2066 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 6.91 | 0.058 Word Error Rate | 64x A100 | DGX A100 | 2.0-2095 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 2.151 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 2.0-2104 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RetinaNet | 84.397 | mAP of 0.34 | 8x A100 | DGX A100 | 2.0-2091 | Mixed | OpenImages | A100-SXM4-80GB
PyTorch | RetinaNet | 14.462 | mAP of 0.34 | 64x A100 | DGX A100 | 2.0-2095 | Mixed | OpenImages | A100-SXM4-80GB
PyTorch | RetinaNet | 4.253 | mAP of 0.34 | 1,280x A100 | DGX A100 | 2.0-2103 | Mixed | OpenImages | A100-SXM4-80GB
TensorFlow | MiniGo | 255.672 | 50% win rate vs. checkpoint | 8x A100 | H3C: R5500G5-AMDx8A100-SXM-80GB | 2.0-2059 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 73.038 | 50% win rate vs. checkpoint | 64x A100 | DGX A100 | 2.0-2096 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 16.231 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 2.0-2105 | Mixed | Go | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.597 | 0.8025 AUC | 8x A100 | Inspur: NF5688M6 | 2.0-2068 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.653 | 0.8025 AUC | 64x A100 | DGX A100 | 2.0-2093 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.588 | 0.8025 AUC | 112x A100 | DGX A100 | 2.0-2098 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB

MLPerf™ v2.0 Training Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.


NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
MXNet | CosmoFlow | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB
PyTorch | DeepCAM | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division

Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

MLPerf™ v1.0 Training HPC Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, see the MLCommons website.

Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 107 | .56 Training Loss | 289,921 total output mels/sec | 8x A100 | DGX A100 | 22.07-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | WaveGlow | 241 | -5.72 Training Loss | 1,763,443 output samples/sec | 8x A100 | DGX A100 | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | GNMT v2 | 16 | 24.4 BLEU Score | 962,349 total tokens/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.13.0a0 | NCF | 0.37 | .96 Hit Rate at 10 | 155,071,904 samples/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.13.0a0 | Transformer XL Base | 178 | 22.37 Perplexity | 744,933 total tokens/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 680 | 77.11 Top 1 | 13,135 images/sec | 8x A100 | DGX A100 | 22.07-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | SE3 Transformer | 9 | .04 MAE | 22,339 molecules/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB
Tensorflow | 1.15.5 | ResNext101 | 188 | 79.25 Top 1 | 10,316 images/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | SE-ResNeXt101 | 217 | 79.72 Top 1 | 8,951 images/sec | 8x A100 | DGX A100 | 22.07-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,056 images/sec | 8x A100 | DGX A100 | 22.07-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB
Tensorflow | 1.15.5 | U-Net Medical | 5 | .89 DICE Score | 986 images/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 3 | 92.55 F1 | 2,823 sequences/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB

A40 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 48,441,850 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 131072 | MovieLens 20M | A40
PyTorch | 1.13.0a0 | Tacotron2 | 115 | .56 Training Loss | 268,967 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.07-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.13.0a0 | WaveGlow | 464 | -5.74 Training Loss | 907,704 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.13.0a0 | GNMT v2 | 54 | 24.24 BLEU Score | 324,183 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.13.0a0 | Transformer XL Base | 440 | 22.33 Perplexity | 302,876 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 128 | WikiText-103 | A40
PyTorch | 1.13.0a0 | EfficientNet-B0 | 866 | 77.37 Top 1 | 10,307 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 875 | 77.32 Top 1 | 10,282 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | SE3 Transformer | 13 | .04 MAE | 14,150 molecules/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A40
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 734 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 2 | DAGM2007 | A40
Tensorflow | 1.15.5 | ResNeXt101 | 413 | 79.25 Top 1 | 4,670 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 1.15.5 | SE-ResNeXt101 | 469 | 79.83 Top 1 | 4,115 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.07-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.6 F1 | 1,132 sequences/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A40

A30 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 121 | .53 Training Loss | 255,080 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | WaveGlow | 474 | -5.64 Training Loss | 889,286 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | GNMT v2 | 54 | 24.18 BLEU Score | 324,186 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 57,621,616 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 1.13.0a0 | FastPitch | 435 | 2.7 Training Loss | 180,819 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 16 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | Transformer XL Base | 147 | 23.69 Perplexity | 228,197 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 1.13.0a0 | ResNeXt101 | 503 | 79.89 Top 1 | 3,938 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 112 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | EfficientNet-B0 | 878 | 77.06 Top 1 | 10,062 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | SE3 Transformer | 12 | .04 MAE | 16,339 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A30
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 682 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 2 | DAGM2007 | A30
Tensorflow | 1.15.5 | U-Net Medical | 10 | .9 DICE Score | 461 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A30
Tensorflow | 1.15.5 | Transformer-XL Base | 389 | 22.28 Perplexity | 84,570 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | WikiText-103 | A30
Tensorflow | 1.15.5 | ResNeXt101 | 462 | 79.29 Top 1 | 4,184 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | 1.15.5 | SE-ResNeXt101 | 551 | 79.91 Top 1 | 3,514 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 96 | Imagenet2012 | A30
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 5 | 92.69 F1 | 990 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A30

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second.

A10 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 141 | .54 Training Loss | 215,014 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.13.0a0 | WaveGlow | 590 | -5.75 Training Loss | 710,549 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.13.0a0 | GNMT v2 | 56 | 24.31 BLEU Score | 256,704 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.13.0a0 | NCF | 1 | .96 Best Hit Rate at 10 | 51,046,635 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 1,115 | 77.21 Top 1 | 7,975 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | SE3 Transformer | 15 | .04 MAE | 12,422 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A10
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 645 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 2 | DAGM2007 | A10
Tensorflow | 1.15.5 | U-Net Medical | 13 | .9 DICE Score | 344 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 8 | EM segmentation challenge | A10
Tensorflow | 1.15.5 | ResNext101 | 575 | 79.29 Top 1 | 3,355 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.06-py3 | Mixed | 128 | Imagenet2012 | A10
Tensorflow | 1.15.5 | SE-ResNeXt101 | 674 | 79.84 Top 1 | 2,867 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 96 | Imagenet2012 | A10
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 6 | 92.64 F1 | 753 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A10

T4 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | ResNeXt101 | 1,434 | 79.92 Top 1 | 1,429 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | Tacotron2 | 242 | .53 Training Loss | 126,370 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.13.0a0 | WaveGlow | 999 | -5.69 Training Loss | 420,050 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.13.0a0 | GNMT v2 | 92 | 24.49 BLEU Score | 156,900 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.06-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.13.0a0 | NCF | 2 | .96 Hit Rate at 10 | 26,199,847 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-B0 | 2,328 | 77.08 Top 1 | 3,869 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 2,281 | 77.4 Top 1 | 3,893 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | SE3 Transformer | 36 | .04 MAE | 4,838 molecules/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 294 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.07-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Medical | 60 | .89 DICE Score | 151 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | 1.15.5 | ResNext101 | 1,341 | 79.13 Top 1 | 1,437 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
Tensorflow | 1.15.5 | SE-ResNeXt101 | 1,626 | 79.59 Top 1 | 1,185 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.07-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 10 | 92.7 F1 | 378 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4


V100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 176 | .49 Training Loss | 170,958 total output mels/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.13.0a0 | WaveGlow | 399 | -5.81 Training Loss | 1,069,377 output samples/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.13.0a0 | GNMT v2 | 33 | 24.1 BLEU Score | 443,670 total tokens/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 96,492,012 samples/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.13.0a0 | EfficientNet-B0 | 934 | 77.08 Top 1 | 9,611 images/sec | 8x V100 | DGX-2 | 22.07-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 948 | 77.05 Top 1 | 9,465 images/sec | 8x V100 | DGX-2 | 22.07-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB
PyTorch | 1.13.0a0 | SE3 Transformer | 13 | .04 MAE | 14,630 molecules/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 634 images/sec | 8x V100 | DGX-2 | 22.06-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
Tensorflow | 1.15.5 | ResNext101 | 401 | 79.15 Top 1 | 4,827 images/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | 1.15.5 | SE-ResNeXt101 | 494 | 79.71 Top 1 | 3,931 images/sec | 8x V100 | DGX-2 | 22.07-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | 1.15.5 | U-Net Medical | 13 | .9 DICE Score | 466 images/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
Tensorflow | 1.15.5 | Transformer XL Base | 318 | 22.32 Perplexity | 103,869 total tokens/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.62 F1 | 1,376 sequences/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB

Converged Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
Tensorflow | - | BERT-LARGE | 10 | 91.36 F1 | 825 sequences/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB
Tensorflow | - | BERT-LARGE | 13 | 91.4 F1 | 745 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 22.07-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB

BERT-Large = BERT-Large fine-tuning (SQuAD v1.1) with a sequence length of 384.

V100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
Tensorflow | - | BERT-LARGE | 29 | 91.07 F1 | 164 sequences/sec | 8x V100 | GCP N1-HIGHMEM-64 | 22.07-py3 | Mixed | 3 | SQuAD v1.1 | V100-SXM2-16GB

BERT-Large = BERT-Large fine-tuning (SQuAD v1.1) with a sequence length of 384.

Converged Multi-Node Training Performance of NVIDIA GPU

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Multi-Node Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Total GPUs | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 296 | 1.53 Training Loss | 25,365 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 169 | 1.35 Training Loss | 5,112 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 253 | 1.35 Training Loss | - | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 160 | 1.51 Training Loss | 48,380 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 87 | 1.34 Training Loss | 9,961 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 136 | 1.34 Training Loss | - | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 87 | 1.49 Training Loss | 89,062 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 46 | 1.34 Training Loss | 19,169 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 73 | 1.34 Training Loss | - | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 51 | 1.5 Training Loss | 153,429 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 36,887 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 42 | 1.33 Training Loss | - | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 186 | 18.25 Perplexity | 454,979 total tokens/sec | 16x A100 | 2 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 105 | 18.27 Perplexity | 822,173 total tokens/sec | 64x A100 | 4 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 63 | 18.34 Perplexity | 1,389,494 total tokens/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB

BERT-Large Pre-Training Phase 1 sequence length = 128; Phase 2 sequence length = 512.
Starting from the 21.09-py3 container, ECC is enabled.
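For reference, the Phase 1 throughputs in the multi-node table above can be turned into a scaling-efficiency figure. The helper below is illustrative, not part of the benchmark scripts; the two throughput values are copied from the 64-GPU and 1,024-GPU Phase 1 rows.

```python
def scaling_efficiency(base_gpus, base_tput, scaled_gpus, scaled_tput):
    """Achieved fraction of ideal linear speedup when scaling out."""
    ideal_tput = base_tput * (scaled_gpus / base_gpus)
    return scaled_tput / ideal_tput

# BERT-Large Phase 1: 25,365 seq/s on 64x A100 vs. 300,769 seq/s on 1,024x A100
efficiency = scaling_efficiency(64, 25_365, 1_024, 300_769)
print(f"{efficiency:.0%} of linear scaling")
```

A value of 1.0 would mean perfectly linear scaling; values below that reflect communication and synchronization overhead at larger GPU counts.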

Single-GPU Training

Some scenarios, such as single-GPU throughput, aren’t used in real-world training. The table below is provided for reference, as an indication of a platform’s single-chip throughput.

Related Resources

Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.
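A single-chip throughput number of the kind tabulated below is typically obtained by timing a fixed number of steps after a warm-up period. The sketch below shows only the general idea, not the NGC benchmark scripts themselves; `step_fn` is a placeholder for one training step.

```python
import time

def measure_throughput(step_fn, batch_size, warmup=10, iters=100):
    """Run warm-up steps (excluded from timing), then time `iters` steps
    and report items processed per second (e.g. images/sec)."""
    for _ in range(warmup):
        step_fn()   # warm-up: excludes one-time startup and compilation costs
    # with a GPU framework, synchronize the device before reading the clock
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed
```

Warm-up matters because the first iterations include kernel compilation and allocator growth that would otherwise understate steady-state throughput.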


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 39,145 total output mels/sec | 1x A100 | DGX A100 | 22.08-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | WaveGlow | 257,894 output samples/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | FastPitch | 75,749 frames/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 32 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | GNMT v2 | 171,432 total tokens/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.13.0a0 | NCF | 38,865,126 samples/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.13.0a0 | ResNeXt101 | 1,237 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | Transformer-XL Large | 16,738 total tokens/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | Transformer-XL Base | 91,058 total tokens/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | nnU-Net | 1,147 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB
PyTorch | 1.13.0a0 | EfficientNet-B4 | 389 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | BERT Large Pre-Training Phase 2 | 302 sequences/sec | 1x A100 | DGX A100 | 22.07-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | BERT Large Pre-Training Phase 1 | 853 sequences/sec | 1x A100 | DGX A100 | 22.07-py3 | Mixed | 512 | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 1,616 images/sec | 1x A100 | DGX A100 | 22.07-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-40GB
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 388 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.13.0a0 | SE3 Transformer | 3,191 molecules/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB
Tensorflow | 1.15.5 | ResNeXt101 | 1,324 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | SE-ResNeXt101 | 1,154 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | U-Net Industrial | 368 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
Tensorflow | 1.15.5 | U-Net Medical | 149 images/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 372 sequences/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB
Tensorflow | 1.15.5 | NCF | 44,622,967 samples/sec | 1x A100 | DGX A100 | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second. A hyphen in the Server column indicates a pre-production server.

A40 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 35,652 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.13.0a0 | WaveGlow | 145,818 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.13.0a0 | GNMT v2 | 81,394 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.13.0a0 | NCF | 18,502,201 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | A40
PyTorch | 1.13.0a0 | Transformer-XL Large | 10,143 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | WikiText-103 | A40
PyTorch | 1.13.0a0 | FastPitch | 78,848 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | LJSpeech 1.1 | A40
PyTorch | 1.13.0a0 | Transformer-XL Base | 42,286 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | WikiText-103 | A40
PyTorch | 1.13.0a0 | nnU-Net | 562 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40
PyTorch | 1.13.0a0 | EfficientNet-B0 | 1,359 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | EfficientNet-B4 | 182 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 1,362 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 182 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Imagenet2012 | A40
PyTorch | 1.13.0a0 | SE3 Transformer | 1,838 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A40
Tensorflow | 1.15.5 | U-Net Industrial | 123 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | DAGM2007 | A40
Tensorflow | 1.15.5 | U-Net Medical | 70 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A40
Tensorflow | 1.15.5 | ResNeXt101 | 621 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 1.15.5 | SE-ResNeXt101 | 568 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A40

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second. A hyphen in the Server column indicates a pre-production server.

A30 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 32,752 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | WaveGlow | 153,596 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | FastPitch | 70,526 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | LJSpeech 1.1 | A30
PyTorch | 1.13.0a0 | NCF | 20,322,005 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | A30
PyTorch | 1.13.0a0 | GNMT v2 | 90,990 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 1.13.0a0 | Transformer-XL Base | 19,328 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 1.13.0a0 | ResNeXt101 | 588 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 112 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | Transformer-XL Large | 7,209 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 4 | WikiText-103 | A30
PyTorch | 1.13.0a0 | nnU-Net | 582 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30
PyTorch | 1.13.0a0 | EfficientNet-B0 | 1,328 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | EfficientNet-B4 | 186 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 1,326 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 184 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | Imagenet2012 | A30
PyTorch | 1.13.0a0 | SE3 Transformer | 2,112 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A30
Tensorflow | 1.15.5 | ResNeXt101 | 597 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | 1.15.5 | SE-ResNeXt101 | 497 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 96 | Imagenet2012 | A30
Tensorflow | 1.15.5 | U-Net Industrial | 117 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | DAGM2007 | A30
Tensorflow | 1.15.5 | U-Net Medical | 68 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A30
Tensorflow | 1.15.5 | Transformer-XL Base | 18,505 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | WikiText-103 | A30
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A30
Tensorflow | 1.15.5 | Wide and Deep | 299,047 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second. A hyphen in the Server column indicates a pre-production server.

A10 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | Tacotron2 | 28,564 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.13.0a0 | WaveGlow | 114,489 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.13.0a0 | FastPitch | 61,589 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | LJSpeech 1.1 | A10
PyTorch | 1.13.0a0 | Transformer-XL Base | 15,741 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | WikiText-103 | A10
PyTorch | 1.13.0a0 | GNMT v2 | 64,911 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.13.0a0 | ResNeXt101 | 418 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 112 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | NCF | 15,495,851 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | A10
PyTorch | 1.13.0a0 | Transformer-XL Large | 6,027 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 4 | WikiText-103 | A10
PyTorch | 1.13.0a0 | nnU-Net | 445 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10
PyTorch | 1.13.0a0 | EfficientNet-B0 | 1,108 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | EfficientNet-B4 | 146 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 1,102 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 145 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 32 | Imagenet2012 | A10
PyTorch | 1.13.0a0 | SE3 Transformer | 1,636 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A10
Tensorflow | 1.15.5 | ResNeXt101 | 450 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | Imagenet2012 | A10
Tensorflow | 1.15.5 | SE-ResNeXt101 | 391 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 96 | Imagenet2012 | A10
Tensorflow | 1.15.5 | U-Net Industrial | 99 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | DAGM2007 | A10
Tensorflow | 1.15.5 | U-Net Medical | 49 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 8 | EM segmentation challenge | A10
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 122 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A10
Tensorflow | 1.15.5 | Wide and Deep | 274,172 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second. A hyphen in the Server column indicates a pre-production server.

T4 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.13.0a0 | ResNeXt101 | 190 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | Tacotron2 | 17,622 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.13.0a0 | WaveGlow | 51,801 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.13.0a0 | FastPitch | 30,534 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.13.0a0 | GNMT v2 | 30,990 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.13.0a0 | NCF | 7,300,045 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4
PyTorch | 1.13.0a0 | Transformer-XL Base | 9,080 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
PyTorch | 1.13.0a0 | SE-ResNeXt101 | 152 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | Transformer-XL Large | 2,734 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4
PyTorch | 1.13.0a0 | nnU-Net | 204 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-B0 | 507 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-B4 | 67 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B0 | 507 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 67 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4
PyTorch | 1.13.0a0 | SE3 Transformer | 620 molecules/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Industrial | 44 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | 1.15.5 | SE-ResNeXt101 | 161 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4
Tensorflow | 1.15.5 | ResNeXt101 | 183 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 56 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4
Tensorflow | 1.15.5 | Wide and Deep | 190,709 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames per second.



V100 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
PyTorch1.13.0a0ResNeXt101598 images/sec1x V100DGX-222.08-py3Mixed112Imagenet2012V100-SXM3-32GB
1.13.0a0Tacotron223,774 total output mels/sec1x V100DGX-222.08-py3Mixed104LJSpeech 1.1V100-SXM3-32GB
1.13.0a0WaveGlow155,021 output samples/sec1x V100DGX-222.08-py3Mixed10LJSpeech 1.1V100-SXM3-32GB
1.13.0a0FastPitch44,179 frames/sec1x V100DGX-222.08-py3Mixed16LJSpeech 1.1V100-SXM3-32GB
1.13.0a0GNMT v278,543 total tokens/sec1x V100DGX-222.08-py3Mixed128wmt16-en-deV100-SXM3-32GB
1.13.0a0NCF23,025,458 samples/sec1x V100DGX-222.08-py3Mixed1048576MovieLens 20MV100-SXM3-32GB
1.13.0a0Transformer-XL Base18,332 total tokens/sec1x V100DGX-222.08-py3Mixed32WikiText-103V100-SXM3-32GB
1.13.0a0Transformer-XL Large7,354 total tokens/sec1x V100DGX-222.08-py3Mixed8WikiText-103V100-SXM3-32GB
1.13.0a0nnU-Net653 images/sec1x V100DGX-222.08-py3Mixed64Medical Segmentation DecathlonV100-SXM3-32GB
1.13.0a0EfficientNet-B01,579 images/sec1x V100DGX-222.07-py3Mixed256Imagenet2012V100-SXM3-32GB
1.13.0a0EfficientNet-B4220 images/sec1x V100DGX-222.08-py3Mixed64Imagenet2012V100-SXM3-32GB
1.13.0a0EfficientNet-WideSE-B01,576 images/sec1x V100DGX-222.06-py3Mixed256Imagenet2012V100-SXM3-32GB
1.13.0a0EfficientNet-WideSE-B4220 images/sec1x V100DGX-222.08-py3Mixed64Imagenet2012V100-SXM3-32GB
1.13.0a0SE3 Transformer2,043 molecules/sec1x V100DGX-222.08-py3Mixed240Quantum Machines 9V100-SXM3-32GB
TensorFlow1.15.5ResNeXt101639 images/sec1x V100DGX-222.08-py3Mixed128Imagenet2012V100-SXM3-32GB
1.15.5SE-ResNeXt101545 images/sec1x V100DGX-222.08-py3Mixed96Imagenet2012V100-SXM3-32GB
1.15.5U-Net Industrial118 images/sec1x V100DGX-222.08-py3Mixed16DAGM2007V100-SXM3-32GB
1.15.5U-Net Medical68 images/sec1x V100DGX-222.08-py3Mixed8EM segmentation challengeV100-SXM3-32GB
2.8.0Electra Base Fine Tuning188 sequences/sec1x V100DGX-222.05-py3Mixed32SQuAD v1.1V100-SXM3-32GB
1.15.5Transformer-XL Base18,514 total tokens/sec1x V100DGX-222.08-py3Mixed16WikiText-103V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

Single-GPU Training Performance of NVIDIA GPUs on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.52,887 images/sec1x A100GCP A2-HIGHGPU-1G22.07-py3Mixed192ImageNet2012A100-SXM4-40GB
PyTorch-DLRM3,450,000 records/sec1x A100GCP A2-HIGHGPU-1G22.07-py3Mixed32768Criteo Terabyte DatasetA100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
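Single-GPU throughput figures like those above convert to a rough epoch time as dataset size divided by throughput. A minimal sketch, assuming the ImageNet2012 training set size of roughly 1,281,167 images and steady throughput (the helper name is illustrative, not part of the NGC scripts):

```python
# Rough epoch-time estimate from steady-state training throughput.
# Assumption: ImageNet2012 has ~1,281,167 training images and the
# throughput figure holds for the whole epoch (warm-up ignored).
IMAGENET_TRAIN_IMAGES = 1_281_167

def seconds_per_epoch(images_per_sec: float) -> float:
    """Approximate wall-clock seconds for one training epoch."""
    return IMAGENET_TRAIN_IMAGES / images_per_sec

# Example: the MXNet ResNet-50 v1.5 row above (2,887 images/sec on 1x A100)
print(round(seconds_per_epoch(2887) / 60, 1))  # ~7.4 minutes per epoch
```

This ignores data-loading stalls and evaluation passes, so treat it as a lower bound on real epoch time.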

T4 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.5450 images/sec1x T4AWS EC2 g4dn.4xlarge22.06-py3Mixed192ImageNet2012NVIDIA T4
-ResNet-50 v1.5432 images/sec1x T4GCP N1-HIGHMEM-822.07-py3Mixed192ImageNet2012NVIDIA T4
TensorFlow-ResNet-50 v1.5419 images/sec1x T4AWS EC2 g4dn.4xlarge22.06-py3Mixed256Imagenet2012NVIDIA T4

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384



V100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.51,467 images/sec1x V100GCP N1-HIGHMEM-822.07-py3Mixed192ImageNet2012V100-SXM2-16GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

AI Inference

Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v2.1 Performance Benchmarks

Offline Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionDatasetTarget Accuracy
ResNet-50 v1.581,292 samples/sec1x H100NVIDIA H100H100-SXM-80GBImageNet76.46% Top1
335,144 samples/sec8x A100DGX A100A100 SXM-80GBImageNet76.46% Top1
5,589 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBImageNet76.46% Top1
316,342 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet76.46% Top1
RetinaNet960 samples/sec1x H100NVIDIA H100H100-SXM-80GBOpenImages0.3755 mAP
4,739 samples/sec8x A100DGX A100A100 SXM-80GBOpenImages0.3755 mAP
74 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBOpenImages0.3755 mAP
4,345 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBOpenImages0.3755 mAP
3D-UNet5 samples/sec1x H100NVIDIA H100H100-SXM-80GBKiTS 20190.863 DICE mean
26 samples/sec8x A100DGX A100A100 SXM-80GBKiTS 20190.863 DICE mean
0.51 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBKiTS 20190.863 DICE mean
25 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBKiTS 20190.863 DICE mean
RNN-T22,885 samples/sec1x H100NVIDIA H100H100-SXM-80GBLibriSpeech7.45% WER
106,726 samples/sec8x A100DGX A100A100 SXM-80GBLibriSpeech7.45% WER
1,918 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBLibriSpeech7.45% WER
102,784 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech7.45% WER
BERT7,921 samples/sec1x H100NVIDIA H100H100-SXM-80GBSQuAD v1.190.87% f1
13,968 samples/sec8x A100DGX A100A100 SXM-80GBSQuAD v1.190.87% f1
1,757 samples/sec1x A100DGX A100A100 SXM-80GBSQuAD v1.190.87% f1
247 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBSQuAD v1.190.87% f1
12,822 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.190.87% f1
DLRM695,298 samples/sec1x H100NVIDIA H100H100-SXM-80GBCriteo 1TB Click Logs80.25% AUC
2,443,220 samples/sec8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
314,992 samples/sec1x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
38,995 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
2,291,310 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBCriteo 1TB Click Logs80.25% AUC

Server Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionTarget AccuracyMLPerf Server Latency
Constraints (ms)
Dataset
ResNet-50 v1.558,995 queries/sec1x H100NVIDIA H100H100-SXM-80GB76.46% Top115ImageNet
300,064 queries/sec8x A100DGX A100A100 SXM-80GB76.46% Top115ImageNet
3,527 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB76.46% Top115ImageNet
236,057 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB76.46% Top115ImageNet
RetinaNet848 queries/sec1x H100NVIDIA H100H100-SXM-80GB0.3755 mAP100OpenImages
4,096 queries/sec8x A100DGX A100A100 SXM-80GB0.3755 mAP100OpenImages
45 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB0.3755 mAP100OpenImages
3,997 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB0.3755 mAP100OpenImages
RNN-T21,488 queries/sec1x H100NVIDIA H100H100-SXM-80GB7.45% WER1,000LibriSpeech
104,020 queries/sec8x A100DGX A100A100 SXM-80GB7.45% WER1,000LibriSpeech
1,347 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB7.45% WER1,000LibriSpeech
90,005 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB7.45% WER1,000LibriSpeech
BERT6,195 queries/sec1x H100NVIDIA H100H100-SXM-80GB90.87% f1130SQuAD v1.1
12,815 queries/sec8x A100DGX A100A100 SXM-80GB90.87% f1130SQuAD v1.1
1,572 queries/sec1x A100DGX A100A100 SXM-80GB90.87% f1130SQuAD v1.1
164 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB90.87% f1130SQuAD v1.1
10,795 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB90.87% f1130SQuAD v1.1
DLRM545,174 queries/sec1x H100NVIDIA H100H100-SXM-80GB80.25% AUC30Criteo 1TB Click Logs
2,390,910 queries/sec8x A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
298,565 queries/sec1x A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
35,991 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
1,326,940 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB80.25% AUC30Criteo 1TB Click Logs

Power Efficiency Offline Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5288,733 samples/sec93.68 samples/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
252,721 samples/sec122.19 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet
RetinaNet4,122 samples/sec1.32 samples/sec/watt8x A100DGX A100A100 SXM-80GBOpenImages
3,805 samples/sec1.73 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBOpenImages
3D-UNet23 samples/sec0.008 samples/sec/watt8x A100DGX A100A100 SXM-80GBKiTS 2019
19 samples/sec0.011 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBKiTS 2019
RNN-T84,508 samples/sec27.79 samples/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
78,750 samples/sec38.88 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech
BERT11,152 samples/sec3.33 samples/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
11,158 samples/sec4.37 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.1
DLRM2,128,420 samples/sec641.77 samples/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs
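Dividing the two reported columns back-computes the implied average system power draw. This is illustrative arithmetic only; MLPerf™ power results are measured for the whole submission system, so the per-accelerator figure below is a share of system power, not a GPU measurement:

```python
# Implied average power from the reported throughput and efficiency columns.
def implied_power_watts(throughput: float, throughput_per_watt: float) -> float:
    """Back-compute average power (W) from throughput / (throughput per watt)."""
    return throughput / throughput_per_watt

# DLRM offline row above: 2,128,420 samples/sec at 641.77 samples/sec/watt
total_w = implied_power_watts(2_128_420, 641.77)
per_gpu_w = total_w / 8  # rough per-accelerator share of an 8x A100 system
print(round(total_w), round(per_gpu_w))  # ~3316 W total, ~415 W per accelerator
```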

Power Efficiency Server Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5229,055 queries/sec78.93 queries/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
185,047 queries/sec87.2 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet
RetinaNet3,896 queries/sec1.25 queries/sec/watt8x A100DGX A100A100 SXM-80GBOpenImages
2,296 queries/sec1.21 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBOpenImages
RNN-T88,003 queries/sec25.44 queries/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
74,995 queries/sec33.88 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech
BERT9,995 queries/sec2.93 queries/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
7,494 queries/sec3.45 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.1
DLRM2,002,080 queries/sec592.73 queries/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs

MLPerf™ v2.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99.9% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0082, 2.1-0084, 2.1-0085, 2.1-0087, 2.1-0088, 2.1-0089, 2.1-0121, 2.1-0122. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM-80GB is a preview submission
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample
1x1g.10gb denotes the MIG configuration: the workload runs on a single 1g.10gb MIG slice (one compute slice with 10GB of memory) on a single A100.
For MLPerf™ data across the various scenarios, click here
For MLPerf™ latency constraints, click here
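The MIG notation used in these tables is regular enough to parse mechanically. A small illustrative helper (the function name and error handling are my own, not NVIDIA tooling; the profile naming itself follows NVIDIA's MIG convention, e.g. 1g.10gb = 1 compute slice, 10 GB):

```python
import re

def parse_mig_notation(s: str):
    """Parse strings like '1x1g.10gb' into (instance_count, compute_slices, mem_gb)."""
    m = re.fullmatch(r"(\d+)x(\d+)g\.(\d+)gb", s)
    if m is None:
        raise ValueError(f"not a MIG configuration string: {s!r}")
    return tuple(int(g) for g in m.groups())

print(parse_mig_notation("1x1g.10gb"))  # (1, 1, 10): one 1g.10gb slice
print(parse_mig_notation("7x1g.5gb"))   # (7, 1, 5): seven 1g.5gb slices
```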

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf™ v2.1


NVIDIA landed top performance spots on all MLPerf™ Inference 2.1 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness. This custom harness has been designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v2.1 A100 Inference Closed: ResNet-50 v1.5, RetinaNet, BERT 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0088, 2.1-0090. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.

 

NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

NetworkAcceleratorModel FormatFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA100-SXM4-40GBtensorrt_planTensorRTMixed4112432.705734 inf/sec38422.08-py3
BERT Large InferenceA100-SXM4-80GBtensorrt_planTensorRTMixed4212462.407770 inf/sec38422.08-py3
BERT Large InferenceA100-PCIE-40GBtensorrt_planTensorRTMixed4112438.956616 inf/sec38422.08-py3
BERT Large InferenceA100-PCIE-40GBtensorrt_planTensorRTMixed4212474.039648 inf/sec38422.08-py3
BERT Base InferenceA100-SXM4-80GBtensorrt_planTensorRTMixed411244.1945,721 inf/sec12822.08-py3
BERT Base InferenceA100-SXM4-40GBtensorrt_planTensorRTMixed421247.0096,848 inf/sec12822.08-py3
BERT Base InferenceA100-PCIE-40GBtensorrt_planTensorRTMixed411244.8194,979 inf/sec12822.08-py3
BERT Base InferenceA100-PCIE-40GBtensorrt_planTensorRTMixed421248.1865,862 inf/sec12822.08-py3
DLRM InferenceA100-SXM4-40GBpytorch_libtorchPyTorchMixed2165,536302.35112,756 inf/sec-22.08-py3
DLRM InferenceA100-SXM4-40GBpytorch_libtorchPyTorchMixed2265,536282.22325,185 inf/sec-22.08-py3
DLRM InferenceA100-PCIE-40GBpytorch_libtorchPyTorchMixed4165,536302.31612,946 inf/sec-22.08-py3
DLRM InferenceA100-PCIE-40GBpytorch_libtorchPyTorchMixed4265,536302.15227,875 inf/sec-22.07-py3
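These client-side numbers are consistent with Little's Law for a closed-loop benchmark: sustained throughput is roughly (concurrent client requests × client batch size) divided by mean latency. A quick check against the rows above (an illustrative helper, not part of the Triton tooling):

```python
# Little's Law estimate for a closed-loop client driving Triton:
# inferences/sec = in-flight inferences / time each spends in the system.
def expected_throughput(concurrent_requests: int, client_batch: int,
                        latency_ms: float) -> float:
    """Estimated inferences/sec given concurrency, batch size, and mean latency."""
    return concurrent_requests * client_batch / (latency_ms / 1000.0)

# BERT Large row above: 24 concurrent requests, batch 1, 32.705 ms -> ~734 inf/sec
print(round(expected_throughput(24, 1, 32.705)))
# Batch 2 row: 24 concurrent requests, batch 2, 62.407 ms -> ~769 inf/sec (table: 770)
print(round(expected_throughput(24, 2, 62.407)))
```

The small residual versus the table comes from latency variance and reporting granularity.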

A30 Triton Inference Server Performance

NetworkAcceleratorModel FormatFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA30tensorrt_planTensorRTMixed4112468.117352 inf/sec38422.08-py3
BERT Large InferenceA30tensorrt_planTensorRTMixed2211687.427366 inf/sec38422.08-py3
BERT Base InferenceA30tensorrt_planTensorRTMixed411247.6793,125 inf/sec12822.08-py3
BERT Base InferenceA30tensorrt_planTensorRTMixed221169.5023,367 inf/sec12822.08-py3

A10 Triton Inference Server Performance

NetworkAcceleratorModel FormatFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA10tensorrt_planTensorRTMixed41124107.907222 inf/sec38422.08-py3
BERT Large InferenceA10tensorrt_planTensorRTMixed22124211.233228 inf/sec38422.08-py3
BERT Base InferenceA10tensorrt_planTensorRTMixed2112411.0782,166 inf/sec12822.08-py3
BERT Base InferenceA10tensorrt_planTensorRTMixed4212421.2612,257 inf/sec12822.08-py3

T4 Triton Inference Server Performance

NetworkAcceleratorModel FormatFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceNVIDIA T4tensorrt_planTensorRTMixed111892.26487 inf/sec38422.08-py3
BERT Large InferenceNVIDIA T4tensorrt_planTensorRTMixed1218183.10687 inf/sec38422.08-py3
BERT Base InferenceNVIDIA T4tensorrt_planTensorRTMixed1112426.479906 inf/sec12822.08-py3
BERT Base InferenceNVIDIA T4tensorrt_planTensorRTMixed1212043.281924 inf/sec12822.08-py3


V100 Triton Inference Server Performance

NetworkAcceleratorModel FormatFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceV100 SXM2-32GBtensorrt_planTensorRTMixed2112496.163249 inf/sec38422.08-py3
BERT Large InferenceV100 SXM2-32GBtensorrt_planTensorRTMixed42124189.991253 inf/sec38422.08-py3
BERT Base InferenceV100 SXM2-32GBtensorrt_planTensorRTMixed4112412.5671,910 inf/sec12822.08-py3
BERT Base InferenceV100 SXM2-32GBtensorrt_planTensorRTMixed4212420.9072,295 inf/sec12822.08-py3
DLRM InferenceV100-SXM2-32GBpytorch_libtorchPyTorchMixed2165,536303.3588,931 inf/sec-22.08-py3
DLRM InferenceV100-SXM2-32GBpytorch_libtorchPyTorchMixed2265,536303.53216,983 inf/sec-22.08-py3

Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

NetworkNetwork
Type
Batch
Size
ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
BERT-Large with SparsityAttention946,188 sequences/sec--1x A100DGX A100-INT8SQuAD v1.1-A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: Mixed | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: Mixed | Dataset: Synthetic

 

A100 Full Chip Inference Performance

NetworkBatch SizeFull Chip ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50811,431 images/sec57 images/sec/watt0.71x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12830,244 images/sec79 images/sec/watt4.231x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
21931,828 images/sec- images/sec/watt6.881x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
ResNet-50v1.5811,383 images/sec55 images/sec/watt0.71x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12829,733 images/sec75 images/sec/watt4.311x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
21330,810 images/sec- images/sec/watt6.911x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
BERT-BASE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
87,307 sequences/sec26 sequences/sec/watt1.091x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12815,353 sequences/sec38 sequences/sec/watt8.341x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-40GB
BERT-LARGE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
82,718 sequences/sec9 sequences/sec/watt2.941x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1284,981 sequences/sec12 sequences/sec/watt25.71x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-40GB
EfficientNet-B089,246 images/sec62 images/sec/watt0.871x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12829,800 images/sec91 images/sec/watt4.31x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
EfficientNet-B482,609 images/sec11 images/sec/watt3.071x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1284,614 images/sec12 images/sec/watt27.741x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128

A100 1/7 MIG Inference Performance

NetworkBatch Size1/7 MIG ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5083,723 images/sec34 images/sec/watt2.151x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
304,309 images/sec- images/sec/watt6.961x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1284,671 images/sec37 images/sec/watt27.41x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
ResNet-50v1.583,620 images/sec30 images/sec/watt2.211x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
284,125 images/sec- images/sec/watt6.791x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1284,526 images/sec38 images/sec/watt28.281x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
BERT-BASE81,886 sequences/sec15 sequences/sec/watt4.241x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1282,342 sequences/sec17 sequences/sec/watt54.651x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
BERT-LARGE8618 sequences/sec5 sequences/sec/watt12.951x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
128744 sequences/sec5 sequences/sec/watt172.081x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

NetworkBatch Size7 MIG ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50825,682 images/sec79 images/sec/watt2.181x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
2929,917 images/sec- images/sec/watt6.791x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12832,531 images/sec88 images/sec/watt27.621x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
ResNet-50v1.5825,056 images/sec77 images/sec/watt2.241x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
2828,819 images/sec- images/sec/watt6.81x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12831,490 images/sec82 images/sec/watt28.541x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
BERT-BASE813,095 sequences/sec34 sequences/sec/watt4.291x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
12815,342 sequences/sec40 sequences/sec/watt58.531x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
BERT-LARGE84,214 sequences/sec11 sequences/sec/watt13.311x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB
1284,812 sequences/sec12 sequences/sec/watt186.611x A100DGX A10022.08-py3INT8SyntheticTensorRT 8.4.2A100-SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128

 

A40 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5089,917 images/sec38 images/sec/watt0.811x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
10615,867 images/sec- images/sec/watt6.681x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
12815,867 images/sec53 images/sec/watt8.071x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
ResNet-50v1.589,686 images/sec37 images/sec/watt0.831x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
10115,016 images/sec- images/sec/watt6.731x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
12815,171 images/sec51 images/sec/watt8.441x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
BERT-BASE85,609 sequences/sec19 sequences/sec/watt1.431x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
1287,753 sequences/sec26 sequences/sec/watt16.511x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
BERT-LARGE81,724 sequences/sec6 sequences/sec/watt4.641x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
1282,386 sequences/sec8 sequences/sec/watt53.651x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
EfficientNet-B089,194 images/sec50 images/sec/watt0.871x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
12819,286 images/sec65 images/sec/watt6.641x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
EfficientNet-B481,950 images/sec7 images/sec/watt4.11x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40
1282,648 images/sec9 images/sec/watt48.341x A40GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A40

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A30 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5088,883 images/sec68 images/sec/watt0.91x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
10515,657 images/sec- images/sec/watt6.711x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
12815,973 images/sec97 images/sec/watt8.011x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
ResNet-50v1.588,666 images/sec69 images/sec/watt0.921x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
10115,071 images/sec- images/sec/watt6.71x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
12815,468 images/sec94 images/sec/watt8.281x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-BASE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
85,187 sequences/sec32 sequences/sec/watt1.541x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1287,680 sequences/sec49 sequences/sec/watt16.671x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-LARGE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
81,788 sequences/sec11 sequences/sec/watt4.481x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1282,437 sequences/sec15 sequences/sec/watt52.531x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
EfficientNet-B087,455 images/sec76 images/sec/watt1.071x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
12816,627 images/sec101 images/sec/watt7.71x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
EfficientNet-B481,724 images/sec12 images/sec/watt4.641x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1282,360 images/sec14 images/sec/watt54.251x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A30 1/4 MIG Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5083,581 images/sec43 images/sec/watt2.231x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
294,253 images/sec- images/sec/watt6.821x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1284,620 images/sec50 images/sec/watt27.711x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
ResNet-50v1.583,486 images/sec42 images/sec/watt2.31x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
274,143 images/sec- images/sec/watt6.521x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1284,461 images/sec48 images/sec/watt28.691x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-BASE81,903 sequences/sec21 sequences/sec/watt4.21x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1282,320 sequences/sec23 sequences/sec/watt55.181x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-LARGE8608 sequences/sec6 sequences/sec/watt13.171x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
128747 sequences/sec7 sequences/sec/watt171.411x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A30 4 MIG Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50813,975 images/sec85 images/sec/watt2.291x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
2716,175 images/sec- images/sec/watt6.71x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
12817,191 images/sec104 images/sec/watt29.881x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
ResNet-50v1.5813,516 images/sec82 images/sec/watt2.381x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
2615,607 images/sec- images/sec/watt6.71x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
12816,658 images/sec101 images/sec/watt30.91x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-BASE86,920 sequences/sec42 sequences/sec/watt4.641x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1287,818 sequences/sec47 sequences/sec/watt65.711x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
BERT-LARGE82,105 sequences/sec14 sequences/sec/watt15.341x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30
1282,441 sequences/sec15 sequences/sec/watt210.391x A30GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A10 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5087,774 images/sec52 images/sec/watt1.031x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
7110,857 images/sec- images/sec/watt6.631x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
12811,284 images/sec77 images/sec/watt11.341x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
ResNet-50v1.587,531 images/sec51 images/sec/watt1.061x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
12810,666 images/sec71 images/sec/watt121x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
BERT-BASE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
83,949 sequences/sec26 sequences/sec/watt2.031x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
1284,976 sequences/sec33 sequences/sec/watt25.721x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
BERT-LARGE (for Batch Sizes 1 and 2, refer to the Triton Inference Server results above)
81,263 sequences/sec8 sequences/sec/watt6.331x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
1281,549 sequences/sec10 sequences/sec/watt82.641x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
EfficientNet-B088,160 images/sec55 images/sec/watt0.981x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
12813,835 images/sec93 images/sec/watt9.251x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
EfficientNet-B481,529 images/sec10 images/sec/watt5.231x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10
1281,873 images/sec13 images/sec/watt68.331x A10GIGABYTE G482-Z52-0022.08-py3INT8SyntheticTensorRT 8.4.2A10

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A2 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 2,610 images/sec | 43 images/sec/watt | 3.06 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50 | 19 | 2,901 images/sec | - | 6.55 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50 | 128 | 3,027 images/sec | 51 images/sec/watt | 42.28 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50v1.5 | 8 | 2,527 images/sec | 42 images/sec/watt | 3.17 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50v1.5 | 18 | 2,761 images/sec | - | 6.52 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50v1.5 | 128 | 2,917 images/sec | 49 images/sec/watt | 43.89 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
BERT-BASE | 8 | 1,125 sequences/sec | 19 sequences/sec/watt | 7.11 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
BERT-BASE | 128 | 1,183 sequences/sec | 20 sequences/sec/watt | 108.24 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
BERT-LARGE | 8 | 341 sequences/sec | 6 sequences/sec/watt | 23.46 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
BERT-LARGE | 128 | 365 sequences/sec | 6 sequences/sec/watt | 350.61 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
EfficientNet-B0 | 8 | 2,992 images/sec | 58 images/sec/watt | 2.67 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
EfficientNet-B0 | 128 | 3,906 images/sec | 65 images/sec/watt | 32.77 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
EfficientNet-B4 | 8 | 469 images/sec | 8 images/sec/watt | 17.04 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
EfficientNet-B4 | 128 | 516 images/sec | 9 images/sec/watt | 248.16 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power

 

T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,840 images/sec | 55 images/sec/watt | 2.08 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50 | 30 | 4,562 images/sec | - | 6.58 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50 | 128 | 4,686 images/sec | 67 images/sec/watt | 27.32 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,594 images/sec | 51 images/sec/watt | 2.23 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50v1.5 | 27 | 4,213 images/sec | - | 6.65 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,448 images/sec | 63 images/sec/watt | 28.78 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-BASE | 8 | 1,684 sequences/sec | 24 sequences/sec/watt | 4.75 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
BERT-BASE | 128 | 1,776 sequences/sec | 25 sequences/sec/watt | 72 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-LARGE | 8 | 542 sequences/sec | 8 sequences/sec/watt | 14.77 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
BERT-LARGE | 128 | 528 sequences/sec | 8 sequences/sec/watt | 242.47 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
EfficientNet-B0 | 8 | 4,689 images/sec | 68 images/sec/watt | 1.71 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
EfficientNet-B0 | 128 | 6,147 images/sec | 88 images/sec/watt | 20.82 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
EfficientNet-B4 | 8 | 782 images/sec | 11 images/sec/watt | 10.23 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
EfficientNet-B4 | 128 | 845 images/sec | 12 images/sec/watt | 151.46 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container



V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 4,363 images/sec | 16 images/sec/watt | 1.83 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
ResNet-50 | 128 | 7,938 images/sec | 23 images/sec/watt | 16.13 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
ResNet-50v1.5 | 8 | 4,260 images/sec | 14 images/sec/watt | 1.88 | 1x V100 | DGX-2 | 22.08-py3 | Mixed | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
ResNet-50v1.5 | 128 | 7,611 images/sec | 22 images/sec/watt | 16.82 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-BASE | 8 | 2,125 sequences/sec | 6 sequences/sec/watt | 3.76 | 1x V100 | DGX-2 | 22.08-py3 | Mixed | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
BERT-BASE | 128 | 3,148 sequences/sec | 10 sequences/sec/watt | 40.66 | 1x V100 | DGX-2 | 22.08-py3 | Mixed | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-LARGE | 8 | 729 sequences/sec | 2 sequences/sec/watt | 10.97 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
BERT-LARGE | 128 | 948 sequences/sec | 3 sequences/sec/watt | 134.97 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
EfficientNet-B0 | 8 | 4,399 images/sec | 22 images/sec/watt | 1.82 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
EfficientNet-B0 | 128 | 8,972 images/sec | 29 images/sec/watt | 14.27 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
EfficientNet-B4 | 8 | 907 images/sec | 3 images/sec/watt | 8.82 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB
EfficientNet-B4 | 128 | 1,204 images/sec | 4 images/sec/watt | 106.33 | 1x V100 | DGX-2 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | V100-SXM3-32GB

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container


Inference Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 11,495 images/sec | - | 0.7 | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 28,222 images/sec | - | 4.54 | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
ResNet-50v1.5 | 8 | 11,288 images/sec | - | 0.71 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 28,211 images/sec | - | 4.54 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
ResNet-50v1.5 | 8 | 11,334 images/sec | - | 0.71 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 29,613 images/sec | - | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 8 | 2,569 sequences/sec | - | 3.11 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
BERT-LARGE | 128 | 5,008 sequences/sec | - | 25.56 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB
BERT-LARGE | 8 | 2,698 sequences/sec | - | 2.96 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 128 | 4,907 sequences/sec | - | 26.09 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB

BERT-Large: Sequence Length = 128

T4 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 3,351 images/sec | - | 2.39 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4
ResNet-50v1.5 | 128 | 3,885 images/sec | - | 32.95 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,308 images/sec | - | 2.42 | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | INT8 | Synthetic | TensorRT 8.2.5 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,143 images/sec | - | 30.89 | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | INT8 | Synthetic | TensorRT 8.2.5 | NVIDIA T4
BERT-LARGE | 8 | 475 sequences/sec | - | 16.83 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4
BERT-LARGE | 128 | 430 sequences/sec | - | 297.77 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4


V100 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 4,257 images/sec | - | 1.88 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB
ResNet-50v1.5 | 128 | 7,360 images/sec | - | 17.39 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB
ResNet-50v1.5 | 8 | 3,824 images/sec | - | 2.09 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | - | V100-SXM2-16GB
ResNet-50v1.5 | 128 | 7,043 images/sec | - | 18.17 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | - | V100-SXM2-16GB
BERT-LARGE | 8 | 684 sequences/sec | - | 11.69 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB
BERT-LARGE | 128 | 915 sequences/sec | - | 139.89 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

A100 ASR Benchmarks

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 11.4 | 1 | A100 SXM4-40GB
citrinet | n-gram | 64 | 64.1 | 64 | A100 SXM4-40GB
citrinet | n-gram | 128 | 103 | 126 | A100 SXM4-40GB
citrinet | n-gram | 256 | 166.7 | 250 | A100 SXM4-40GB
citrinet | n-gram | 384 | 235 | 371 | A100 SXM4-40GB
citrinet | n-gram | 512 | 311 | 490 | A100 SXM4-40GB
citrinet | n-gram | 768 | 492 | 717 | A100 SXM4-40GB
conformer | n-gram | 1 | 16.8 | 1 | A100 SXM4-40GB
conformer | n-gram | 64 | 109 | 64 | A100 SXM4-40GB
conformer | n-gram | 128 | 130 | 126 | A100 SXM4-40GB
conformer | n-gram | 256 | 236 | 249 | A100 SXM4-40GB
conformer | n-gram | 384 | 342 | 369 | A100 SXM4-40GB
conformer | n-gram | 512 | 485 | 486 | A100 SXM4-40GB

A100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 10.47 | 1 | A100 SXM4-40GB
citrinet | n-gram | 8 | 15.14 | 8 | A100 SXM4-40GB
citrinet | n-gram | 16 | 26.2 | 16 | A100 SXM4-40GB
citrinet | n-gram | 32 | 39.1 | 32 | A100 SXM4-40GB
citrinet | n-gram | 48 | 48 | 48 | A100 SXM4-40GB
citrinet | n-gram | 64 | 55.4 | 64 | A100 SXM4-40GB
conformer | n-gram | 1 | 14.69 | 1 | A100 SXM4-40GB
conformer | n-gram | 8 | 37.7 | 8 | A100 SXM4-40GB
conformer | n-gram | 16 | 41.5 | 16 | A100 SXM4-40GB
conformer | n-gram | 32 | 55.7 | 32 | A100 SXM4-40GB
conformer | n-gram | 48 | 66.8 | 48 | A100 SXM4-40GB
conformer | n-gram | 64 | 82.2 | 63 | A100 SXM4-40GB

A100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 4390 | A100 SXM4-40GB
conformer | n-gram | 32 | 1700 | A100 SXM4-40GB

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.5.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
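RTFX, as the footnote defines it, is seconds of audio processed per second of wall-clock time. A minimal sketch of the metric (the `rtfx` helper is illustrative, not part of Riva):

```python
# RTFX (real-time factor) for ASR: seconds of audio processed
# per second of wall-clock time.
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    return audio_seconds / wall_seconds

# A single real-time stream can never exceed RTFX = 1, because audio
# arrives at one second per second; with 64 concurrent streams the
# ceiling is 64, which is why the 64-stream rows above report ~63-64.
single = rtfx(audio_seconds=60, wall_seconds=60)       # 1.0
batched = rtfx(audio_seconds=64 * 60, wall_seconds=60)  # 64.0
```

In streaming mode RTFX therefore tracks the stream count until the GPU saturates, at which point latency grows instead.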

A30 ASR Benchmarks

A30 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 14.64 | 1 | A30
citrinet | n-gram | 64 | 101 | 63 | A30
citrinet | n-gram | 128 | 152 | 126 | A30
citrinet | n-gram | 256 | 272 | 249 | A30
citrinet | n-gram | 384 | 393 | 368 | A30
citrinet | n-gram | 512 | 569 | 484 | A30
conformer | n-gram | 1 | 21.76 | 1 | A30
conformer | n-gram | 64 | 134 | 63 | A30
conformer | n-gram | 128 | 216 | 126 | A30
conformer | n-gram | 256 | 397 | 248 | A30
conformer | n-gram | 384 | 672 | 364 | A30

A30 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 13.74 | 1 | A30
citrinet | n-gram | 8 | 29.4 | 8 | A30
citrinet | n-gram | 16 | 44.2 | 16 | A30
citrinet | n-gram | 32 | 58.7 | 32 | A30
citrinet | n-gram | 48 | 65.8 | 48 | A30
citrinet | n-gram | 64 | 83 | 63 | A30
conformer | n-gram | 1 | 20.32 | 1 | A30
conformer | n-gram | 8 | 42.2 | 8 | A30
conformer | n-gram | 16 | 51.5 | 16 | A30
conformer | n-gram | 32 | 71.3 | 32 | A30
conformer | n-gram | 48 | 103.9 | 48 | A30
conformer | n-gram | 64 | 126.8 | 63 | A30

A30 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 3142 | A30
conformer | n-gram | 32 | 1120 | A30


A10 ASR Benchmarks

A10 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 12.93 | 1 | A10
citrinet | n-gram | 64 | 88.5 | 64 | A10
citrinet | n-gram | 128 | 162.6 | 126 | A10
citrinet | n-gram | 256 | 316 | 248 | A10
citrinet | n-gram | 384 | 486 | 367 | A10
citrinet | n-gram | 512 | 710 | 481 | A10
conformer | n-gram | 1 | 15.33 | 1 | A10
conformer | n-gram | 64 | 133 | 63 | A10
conformer | n-gram | 128 | 234 | 126 | A10
conformer | n-gram | 256 | 434 | 247 | A10

A10 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 10.405 | 1 | A10
citrinet | n-gram | 8 | 20.22 | 8 | A10
citrinet | n-gram | 16 | 29.8 | 16 | A10
citrinet | n-gram | 32 | 49.1 | 32 | A10
citrinet | n-gram | 48 | 67.6 | 48 | A10
citrinet | n-gram | 64 | 84.7 | 63 | A10
conformer | n-gram | 1 | 13.49 | 1 | A10
conformer | n-gram | 8 | 33.8 | 8 | A10
conformer | n-gram | 16 | 40.9 | 16 | A10
conformer | n-gram | 32 | 71.5 | 32 | A10
conformer | n-gram | 48 | 108 | 48 | A10
conformer | n-gram | 64 | 140 | 63 | A10

A10 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 2719 | A10
conformer | n-gram | 32 | 1006 | A10


V100 ASR Benchmarks

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 13.91 | 1 | V100 SXM2-16GB
citrinet | n-gram | 64 | 87.9 | 63 | V100 SXM2-16GB
citrinet | n-gram | 128 | 153 | 126 | V100 SXM2-16GB
citrinet | n-gram | 256 | 283.7 | 246 | V100 SXM2-16GB
citrinet | n-gram | 384 | 407 | 363 | V100 SXM2-16GB
citrinet | n-gram | 512 | 590 | 474 | V100 SXM2-16GB
conformer | n-gram | 1 | 22.3 | 1 | V100 SXM2-16GB
conformer | n-gram | 64 | 153 | 63 | V100 SXM2-16GB
conformer | n-gram | 128 | 230.6 | 125 | V100 SXM2-16GB
conformer | n-gram | 256 | 400 | 245 | V100 SXM2-16GB
conformer | n-gram | 384 | 716 | 359 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 13.26 | 1 | V100 SXM2-16GB
citrinet | n-gram | 8 | 24.32 | 8 | V100 SXM2-16GB
citrinet | n-gram | 16 | 32.9 | 16 | V100 SXM2-16GB
citrinet | n-gram | 32 | 50.5 | 32 | V100 SXM2-16GB
citrinet | n-gram | 48 | 65 | 48 | V100 SXM2-16GB
citrinet | n-gram | 64 | 84.1 | 63 | V100 SXM2-16GB
conformer | n-gram | 1 | 19.7 | 1 | V100 SXM2-16GB
conformer | n-gram | 8 | 55 | 8 | V100 SXM2-16GB
conformer | n-gram | 16 | 52.3 | 16 | V100 SXM2-16GB
conformer | n-gram | 32 | 76.7 | 32 | V100 SXM2-16GB
conformer | n-gram | 48 | 119.8 | 47 | V100 SXM2-16GB
conformer | n-gram | 64 | 143 | 63 | V100 SXM2-16GB

V100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 2693 | V100 SXM2-16GB
conformer | n-gram | 32 | 964 | V100 SXM2-16GB




T4 ASR Benchmarks

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 26.7 | 1 | NVIDIA T4
citrinet | n-gram | 64 | 170.8 | 63 | NVIDIA T4
citrinet | n-gram | 128 | 342 | 125 | NVIDIA T4
citrinet | n-gram | 256 | 736 | 242 | NVIDIA T4
conformer | n-gram | 1 | 59.1 | 1 | NVIDIA T4
conformer | n-gram | 64 | 310 | 63 | NVIDIA T4
conformer | n-gram | 128 | 505 | 124 | NVIDIA T4

T4 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 25.9 | 1 | NVIDIA T4
citrinet | n-gram | 8 | 57 | 8 | NVIDIA T4
citrinet | n-gram | 16 | 60.5 | 16 | NVIDIA T4
citrinet | n-gram | 32 | 93.1 | 32 | NVIDIA T4
citrinet | n-gram | 48 | 139.7 | 47 | NVIDIA T4
conformer | n-gram | 1 | 53.4 | 1 | NVIDIA T4
conformer | n-gram | 8 | 82 | 8 | NVIDIA T4
conformer | n-gram | 16 | 104.1 | 16 | NVIDIA T4
conformer | n-gram | 32 | 239 | 32 | NVIDIA T4

T4 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 1322 | NVIDIA T4
conformer | n-gram | 32 | 488 | NVIDIA T4


A100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.021 | 0.003 | 145 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 4 | 0.037 | 0.006 | 336 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 6 | 0.046 | 0.007 | 395 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 8 | 0.056 | 0.009 | 421 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 10 | 0.059 | 0.01 | 434 | A100 SXM4-40GB

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.5.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
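For streaming TTS, the two latency columns combine into a rough estimate of total delivery time: the first-audio latency is paid once, then the inter-chunk latency for each remaining chunk. A sketch using the single-stream A100 row above (the chunk count is a hypothetical input; actual chunking depends on Riva's configuration):

```python
# Rough wall-time estimate for one streamed TTS utterance, using the
# single-stream A100 FastPitch + Hifi-GAN row above: 0.021 s to first
# audio, 0.003 s between subsequent chunks.
def stream_time(n_chunks: int, first_s: float = 0.021, between_s: float = 0.003) -> float:
    return first_s + (n_chunks - 1) * between_s

# e.g. a hypothetical 20-chunk utterance
total_s = stream_time(20)
```

The first chunk dominates perceived responsiveness, which is why the tables report it separately from the steady-state chunk cadence.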

A30 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.022 | 0.004 | 127 | A30
FastPitch + Hifi-GAN | 4 | 0.044 | 0.007 | 267 | A30
FastPitch + Hifi-GAN | 6 | 0.064 | 0.009 | 292 | A30
FastPitch + Hifi-GAN | 8 | 0.082 | 0.011 | 310 | A30
FastPitch + Hifi-GAN | 10 | 0.091 | 0.013 | 318 | A30


A10 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.021 | 0.004 | 127 | A10
FastPitch + Hifi-GAN | 4 | 0.049 | 0.008 | 235 | A10
FastPitch + Hifi-GAN | 6 | 0.072 | 0.011 | 250 | A10
FastPitch + Hifi-GAN | 8 | 0.096 | 0.014 | 256 | A10


V100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.024 | 0.005 | 104 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 4 | 0.055 | 0.009 | 215 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 6 | 0.08 | 0.012 | 227 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 8 | 0.108 | 0.015 | 232 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 10 | 0.119 | 0.018 | 235 | V100 SXM2-16GB




T4 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.05 | 0.007 | 64 | NVIDIA T4
FastPitch + Hifi-GAN | 4 | 0.096 | 0.016 | 121 | NVIDIA T4
FastPitch + Hifi-GAN | 6 | 0.142 | 0.022 | 127 | NVIDIA T4
FastPitch + Hifi-GAN | 8 | 0.188 | 0.028 | 132 | NVIDIA T4
FastPitch + Hifi-GAN | 10 | 0.218 | 0.03 | 134 | NVIDIA T4


 

Last updated: September 19th, 2022