NVIDIA Data Center Deep Learning Product Performance
Reproducible Performance
Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide
Related Resources
Read why training to convergence is essential for enterprise AI adoption.
Learn about The Full-Stack Optimizations Fueling NVIDIA MLPerf Training 2.1 Leadership.
Access containers in the NVIDIA NGC™ catalog.
NVIDIA Supercomputer Wins Every Benchmark in MLPerf HPC 2.0
HPC Performance
Review the latest GPU-acceleration factors of popular HPC applications.
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
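In practice, time to train is measured as the wall-clock time until the network first reaches the target metric, which is why the tables below report both a Time to Train column and the accuracy achieved. A minimal sketch of that measurement loop is shown below, assuming hypothetical train_one_epoch and evaluate callables; the real NGC training scripts implement this logic per network.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_metric, max_epochs=100):
    """Run epochs until the validation metric first meets the target;
    return (epochs_run, wall-clock minutes). Training only counts as
    'done' once the convergence criterion is met."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_metric:
            return epoch, (time.perf_counter() - start) / 60.0
    raise RuntimeError(f"no convergence to {target_metric} within {max_epochs} epochs")

# Toy stand-ins (hypothetical): the metric climbs by 0.1 per epoch.
state = {"acc": 0.0}
epochs, minutes = time_to_train(
    train_one_epoch=lambda: state.update(acc=state["acc"] + 0.1),
    evaluate=lambda: state["acc"],
    target_metric=0.75,
)
print(f"converged after {epochs} epochs in {minutes:.4f} minutes")
```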
Related Resources
Read our blog on convergence for more details.
Get up and running quickly with NVIDIA’s complete solution stack:
Pull software containers from NVIDIA NGC.
NVIDIA Performance on MLPerf 2.1 Training Benchmarks
BERT Time to Train on A100
PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
MLPerf Training Performance
NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Single Node - Closed Division
NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Multi Node - Closed Division
MLPerf™ v2.1 Training Closed: 2.1-2038, 2.1-2039, 2.1-2033, 2.1-2040, 2.1-2065, 2.1-2068, 2.1-2049, 2.1-2066, 2.1-2064, 2.1-2067, 2.1-2009, 2.1-2070, 2.1-2071, 2.1-2072, 2.1-2073, 2.1-2074, 2.1-2075, 2.1-2076, 2.1-2077, 2.1-2078, 2.1-2079, 2.1-2080, 2.1-2091, 2.1-2092, 2.1-2093 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM5-80GB is a preview submission
NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Strong Scaling - Closed Division
NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Weak Scaling - Closed Division
Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
PyTorch | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
PyTorch | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |
MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v2.0 Training HPC rules and guidelines, click here
Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 97 | .55 Training Loss | 321,306 total output mels/sec | 8x A100 | DGX A100 | 22.10-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | WaveGlow | 227 | -5.7 Training Loss | 1,869,185 output samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | GNMT v2 | 19 | 24.39 BLEU Score | 960,732 total tokens/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | NCF | 0.35 | .96 Hit Rate at 10 | 159,982,051 samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | Transformer XL Base | 187 | 22.35 Perplexity | 710,002 total tokens/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | TFT - Traffic | 1 | .08 P90 | 134,270 items/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | TFT - Electricity | 2 | .03 P90 | 134,971 items/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | HiFiGAN | 1,748 | 9.56 Training Loss | 62,639 total output mels/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | LJSpeech-1.1 | A100-SXM4-80GB |
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,044 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
Tensorflow | 2.10.0 | U-Net Medical | 2 | .89 DICE Score | 1,080 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
Tensorflow | 2.10.0 | Electra Fine Tuning | 3 | 92.48 F1 | 2,800 sequences/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
Tensorflow | 2.10.0 | EfficientNet-B0 | 526 | 76.67 Top 1 | 21,485 images/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-40GB |
Tensorflow | 2.10.0 | SIM | 1 | .82 AUC | 3,269,087 samples/sec | 8x A100 | DGX A100 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A100-SXM4-80GB |
A40 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 50,144,826 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A40 |
PyTorch | 1.13.0a0 | Tacotron2 | 113 | .57 Training Loss | 269,899 total output mels/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.11-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
PyTorch | 1.13.0a0 | WaveGlow | 445 | -5.7 Training Loss | 940,768 output samples/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
PyTorch | 1.13.0a0 | GNMT v2 | 46 | 24.38 BLEU Score | 321,760 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.11-py3 | Mixed | 128 | wmt16-en-de | A40 |
PyTorch | 1.13.0a0 | Transformer XL Base | 450 | 22.41 Perplexity | 297,106 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.11-py3 | Mixed | 128 | WikiText-103 | A40 |
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 748 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A40 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 4 | 92.46 F1 | 1,105 sequences/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
A30 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 121 | .51 Training Loss | 256,736 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | WaveGlow | 429 | -5.74 Training Loss | 985,229 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | GNMT v2 | 46 | 24.24 BLEU Score | 319,191 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A30 |
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 54,535,299 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A30 |
PyTorch | 1.13.0a0 | BERT-LARGE | 10 | 90.71 F1 | 301 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | SQuAD v1.1 | A30 |
PyTorch | 1.13.0a0 | FastPitch | 435 | 2.7 Training Loss | 180,819 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | Transformer XL Base | 147 | 23.69 Perplexity | 228,197 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 32 | WikiText-103 | A30 |
PyTorch | 1.13.0a0 | TFT - Traffic | 2 | .08 P90 | 82,285 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A30 |
PyTorch | 1.13.0a0 | TFT - Electricity | 3 | .03 P90 | 82,065 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A30 |
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 678 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A30 |
Tensorflow | 2.10.0 | U-Net Medical | 2 | .89 DICE Score | 486 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A30 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 5 | 92.58 F1 | 977 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
Tensorflow | 2.10.0 | SIM | 1 | .81 AUC | 2,250,516 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A30 |
A10 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 139 | .53 Training Loss | 220,186 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
PyTorch | 1.13.0a0 | WaveGlow | 567 | -5.7 Training Loss | 739,602 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.10-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
PyTorch | 1.13.0a0 | GNMT v2 | 52 | 24.25 BLEU Score | 277,159 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A10 |
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 47,791,819 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | A10 |
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 655 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 2 | DAGM2007 | A10 |
Tensorflow | 1.15.5 | U-Net Medical | 14 | .89 DICE Score | 359 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A10 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 5 | 92.78 F1 | 771 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
Tensorflow | 2.10.0 | SIM | 1 | .81 AUC | 2,180,220 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | A10 |
T4 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 231 | .53 Training Loss | 130,930 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.13.0a0 | WaveGlow | 1,089 | -5.81 Training Loss | 383,020 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.13.0a0 | GNMT v2 | 95 | 24.24 BLEU Score | 152,304 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
PyTorch | 1.13.0a0 | NCF | 2 | .96 Hit Rate at 10 | 26,301,627 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
PyTorch | 1.13.0a0 | TFT - Traffic | 10 | .08 P90 | 33,694 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
PyTorch | 1.13.0a0 | TFT - Electricity | 16 | .03 P90 | 33,609 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
Tensorflow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 331 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
Tensorflow | 1.15.5 | U-Net Medical | 42 | .9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 10 | 92.73 F1 | 382 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
Tensorflow | 1.15.5 | Transformer XL Base | 909 | 22.31 Perplexity | 36,121 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16 | WikiText-103 | NVIDIA T4 |
Tensorflow | 2.10.0 | SIM | 2 | .81 AUC | 1,125,154 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.11-py3 | Mixed | 16384 | Amazon Reviews | NVIDIA T4 |
V100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 151 | .53 Training Loss | 208,130 total output mels/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | WaveGlow | 402 | -5.73 Training Loss | 1,059,562 output samples/sec | 8x V100 | DGX-2 | 22.10-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | GNMT v2 | 33 | 24.21 BLEU Score | 440,850 total tokens/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 99,138,714 samples/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | BERT-LARGE | 7 | 90.78 F1 | 398 sequences/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | TFT - Traffic | 2 | .08 P90 | 88,986 items/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | TFT - Electricity | 3 | .03 P90 | 88,647 items/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 643 images/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
Tensorflow | 1.15.5 | U-Net Medical | 14 | .9 DICE Score | 467 images/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
Tensorflow | 1.15.5 | Transformer XL Base | 310 | 22.7 Perplexity | 106,475 total tokens/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
Tensorflow | 2.10.0 | Electra Fine Tuning | 4 | 92.61 F1 | 1,346 sequences/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
Tensorflow | 2.10.0 | SIM | 1 | .82 AUC | 2,212,761 samples/sec | 8x V100 | DGX-2 | 22.11-py3 | Mixed | 16384 | Amazon Reviews | V100-SXM3-32GB |
Single-GPU Training
Some scenarios aren’t used in real-world training, such as single-GPU throughput. The table below provides an indication of a platform’s single-chip throughput.
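The throughput numbers below follow the usual single-device measurement pattern: run a few warmup iterations (excluded from timing, so one-time startup costs don't skew the result), then time a fixed number of steps and divide items processed by elapsed time. A minimal sketch, with the step callable as a hypothetical stand-in for one training iteration:

```python
import time

def measure_throughput(step, batch_size, warmup_iters=10, timed_iters=100):
    """Return items/sec for a single device: warmup iterations are run
    first and excluded from timing, then a fixed number of steps is timed."""
    for _ in range(warmup_iters):
        step()
    start = time.perf_counter()
    for _ in range(timed_iters):
        step()
    elapsed = time.perf_counter() - start
    return timed_iters * batch_size / elapsed

# Hypothetical stand-in for one training step (forward + backward + update).
print(f"{measure_throughput(lambda: time.sleep(0.001), batch_size=128):,.0f} items/sec")
```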
Related Resources
Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.
Pull software containers from NVIDIA NGC.
Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 41,908 total output mels/sec | 1x A100 | DGX A100 | 22.11-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | WaveGlow | 255,720 output samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | FastPitch | 161,081 frames/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | GNMT v2 | 171,369 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | Transformer XL Large | 16,084 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | Transformer XL Base | 86,441 total tokens/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | nnU-Net | 1,126 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 391 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | BERT Large Pre-Training Phase 2 | 294 sequences/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 391 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | SE3 Transformer | 3,274 molecules/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | TFT - Traffic | 17,342 items/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | TFT - Electricity | 17,285 items/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
PyTorch | 1.13.0a0 | HiFiGAN | 19,919 total output mels/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A100-SXM4-80GB |
Tensorflow | 1.15.5 | U-Net Industrial | 353 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB |
Tensorflow | 2.10.0 | U-Net Medical | 150 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
Tensorflow | 2.10.0 | Electra Fine Tuning | 368 sequences/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
Tensorflow | 1.15.5 | NCF | 51,889,497 samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
Tensorflow | 2.10.0 | EfficientNet-B0 | 3,255 images/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-80GB |
Tensorflow | 2.10.0 | SIM | 588,800 samples/sec | 1x A100 | DGX A100 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A100-SXM4-80GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A40 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 35,907 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
PyTorch | 1.13.0a0 | WaveGlow | 148,179 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
PyTorch | 1.13.0a0 | GNMT v2 | 80,263 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A40 |
PyTorch | 1.13.0a0 | NCF | 19,499,362 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A40 |
PyTorch | 1.13.0a0 | Transformer XL Large | 10,023 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | WikiText-103 | A40 |
PyTorch | 1.13.0a0 | FastPitch | 93,824 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
PyTorch | 1.13.0a0 | Transformer XL Base | 40,845 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | WikiText-103 | A40 |
PyTorch | 1.13.0a0 | nnU-Net | 561 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40 |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 181 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Imagenet2012 | A40 |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 181 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Imagenet2012 | A40 |
PyTorch | 1.13.0a0 | SE3 Transformer | 1,900 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
PyTorch | 1.13.0a0 | TFT - Traffic | 9,642 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A40 |
PyTorch | 1.13.0a0 | TFT - Electricity | 9,542 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A40 |
PyTorch | 1.13.0a0 | HiFiGAN | 10,333 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A40 |
Tensorflow | 1.15.5 | U-Net Industrial | 122 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A40 |
Tensorflow | 2.10.0 | U-Net Medical | 70 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A40 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 160 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A30 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 34,406 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | WaveGlow | 153,115 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | FastPitch | 91,968 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
PyTorch | 1.13.0a0 | NCF | 21,620,400 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A30 |
PyTorch | 1.13.0a0 | GNMT v2 | 91,214 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A30 |
PyTorch | 1.13.0a0 | Transformer XL Base | 18,368 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | WikiText-103 | A30 |
PyTorch | 1.13.0a0 | Transformer XL Large | 7,150 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 4 | WikiText-103 | A30 |
PyTorch | 1.13.0a0 | nnU-Net | 590 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30 |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 191 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A30 |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 188 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A30 |
PyTorch | 1.13.0a0 | SE3 Transformer | 2,152 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
PyTorch | 1.13.0a0 | TFT - Traffic | 10,437 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A30 |
PyTorch | 1.13.0a0 | TFT - Electricity | 10,463 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A30 |
PyTorch | 1.13.0a0 | HiFiGAN | 10,605 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A30 |
Tensorflow | 1.15.5 | U-Net Industrial | 116 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A30 |
Tensorflow | 2.10.0 | U-Net Medical | 74 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A30 |
Tensorflow | 1.15.5 | Transformer XL Base | 18,259 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | WikiText-103 | A30 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 162 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
Tensorflow | 2.10.0 | SIM | 404,661 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A30 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A10 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 29,310 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
PyTorch | 1.13.0a0 | WaveGlow | 116,803 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
PyTorch | 1.13.0a0 | FastPitch | 74,233 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
PyTorch | 1.13.0a0 | Transformer XL Base | 15,388 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | WikiText-103 | A10 |
PyTorch | 1.13.0a0 | GNMT v2 | 64,713 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | wmt16-en-de | A10 |
PyTorch | 1.13.0a0 | NCF | 16,211,650 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | A10 |
PyTorch | 1.13.0a0 | Transformer XL Large | 6,133 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 4 | WikiText-103 | A10 |
PyTorch | 1.13.0a0 | nnU-Net | 447 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10 |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 146 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.10-py3 | Mixed | 32 | Imagenet2012 | A10 |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 145 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 32 | Imagenet2012 | A10 |
PyTorch | 1.13.0a0 | SE3 Transformer | 1,686 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
PyTorch | 1.13.0a0 | TFT - Traffic | 8,066 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Traffic | A10 |
PyTorch | 1.13.0a0 | TFT - Electricity | 8,036 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 1024 | Electricity | A10 |
PyTorch | 1.13.0a0 | HiFiGAN | 8,113 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | A10 |
Tensorflow | 1.15.5 | U-Net Industrial | 100 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | DAGM2007 | A10 |
Tensorflow | 2.10.0 | U-Net Medical | 51 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | A10 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 119 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
Tensorflow | 2.10.0 | SIM | 368,967 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | Mixed | 131072 | Amazon Reviews | A10 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
T4 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 17,983 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.13.0a0 | WaveGlow | 56,267 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.13.0a0 | FastPitch | 34,006 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.13.0a0 | GNMT v2 | 30,963 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
PyTorch | 1.13.0a0 | NCF | 7,741,117 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
PyTorch | 1.13.0a0 | Transformer XL Base | 8,856 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
PyTorch | 1.13.0a0 | Transformer XL Large | 2,798 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 |
PyTorch | 1.13.0a0 | nnU-Net | 202 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4 |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
PyTorch | 1.13.0a0 | SE3 Transformer | 638 molecules/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
PyTorch | 1.13.0a0 | TFT - Traffic | 4,317 items/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
PyTorch | 1.13.0a0 | TFT - Electricity | 4,317 items/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
PyTorch | 1.13.0a0 | HiFiGAN | 2,803 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 64 | LJSpeech-1.1 | NVIDIA T4 |
Tensorflow | 1.15.5 | U-Net Industrial | 45 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
Tensorflow | 1.15.5 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
Tensorflow | 2.10.0 | Electra Fine Tuning | 58 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
Tensorflow | 2.10.0 | EfficientNet-B0 | 638 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
Tensorflow | 2.10.0 | SIM | 175,996 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.10-py3 | Mixed | 131072 | Amazon Reviews | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
V100 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.13.0a0 | Tacotron2 | 31,218 total output mels/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | WaveGlow | 156,982 output samples/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | FastPitch | 87,316 frames/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | GNMT v2 | 77,935 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | NCF | 24,122,383 samples/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | Transformer XL Base | 17,155 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | Transformer XL Large | 7,172 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | nnU-Net | 660 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Medical Segmentation Decathlon | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | EfficientNet-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | EfficientNet-WideSE-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | SE3 Transformer | 2,106 molecules/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | TFT - Traffic | 11,743 items/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | TFT - Electricity | 11,695 items/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
PyTorch | 1.13.0a0 | HiFiGAN | 9,695 total output mels/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 128 | LJSpeech-1.1 | V100-SXM3-32GB |
Tensorflow | 1.15.5 | U-Net Industrial | 118 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB |
Tensorflow | 1.15.5 | U-Net Medical | 68 images/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
Tensorflow | 2.10.0 | Electra Fine Tuning | 185 sequences/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
Tensorflow | 1.15.5 | Transformer XL Base | 18,671 total tokens/sec | 1x V100 | DGX-2 | 22.11-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
Tensorflow | 2.10.0 | SIM | 368,684 samples/sec | 1x V100 | DGX-2 | 22.10-py3 | Mixed | 131072 | Amazon Reviews | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
AI Inference
Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
Related Resources
Learn how NVIDIA landed top performance spots on all MLPerf Inference 2.1 tests.
Read the inference whitepaper to explore the evolving landscape and get an overview of inference platforms.
Learn how Dynamic Batching can increase throughput on Triton with Benefits of Triton (a configuration sketch follows this list).
For additional data on Triton performance in the offline and online server scenarios, please refer to ResNet-50 v1.5
Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:
Achieve the most efficient inference performance with NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs.
Maximize performance and simplify the deployment of AI models with the NVIDIA Triton™ Inference Server.
Pull software containers from NVIDIA NGC to race into production.
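For background on the Dynamic Batch Size (Triton) column in the tables that follow: Triton's dynamic batcher combines individual client requests into larger server-side batches before execution. A minimal sketch of enabling it is below; the model name, preferred batch sizes, and queue delay are hypothetical values chosen for illustration, while name, platform, max_batch_size, dynamic_batching, preferred_batch_size, and max_queue_delay_microseconds are standard fields of a Triton config.pbtxt.

```python
from pathlib import Path

# Hypothetical repository layout: model_repository/<model>/config.pbtxt.
# The pbtxt text below enables Triton's dynamic batcher for a TensorRT plan.
CONFIG = """\
name: "bert_base"
platform: "tensorrt_plan"
max_batch_size: 128
dynamic_batching {
  preferred_batch_size: [ 32, 64 ]
  max_queue_delay_microseconds: 100
}
"""

path = Path("model_repository/bert_base/config.pbtxt")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(CONFIG)
print(f"wrote {path}")
```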
MLPerf Inference v2.1 Performance Benchmarks
Offline Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 81,292 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 335,144 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 5,589 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 316,342 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
RetinaNet | 960 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | OpenImages | 0.3755 mAP |
RetinaNet | 4,739 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
RetinaNet | 74 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
RetinaNet | 4,345 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages | 0.3755 mAP |
3D-UNet | 5 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 26 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 0.51 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 25 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
RNN-T | 22,885 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 106,726 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 1,918 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 102,784 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
BERT | 7,921 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | SQuAD v1.1 | 90.87% f1 |
BERT | 13,968 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
BERT | 1,757 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
BERT | 247 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
BERT | 12,822 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.87% f1 |
DLRM | 695,298 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 2,443,220 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 314,992 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 38,995 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 2,291,310 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |
Server Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
---|---|---|---|---|---|---|---|
ResNet-50 v1.5 | 58,995 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 300,064 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 3,527 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 236,057 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet |
RetinaNet | 848 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 0.3755 mAP | 100 | OpenImages |
RetinaNet | 4,096 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages |
RetinaNet | 45 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages |
RetinaNet | 3,997 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.3755 mAP | 100 | OpenImages |
RNN-T | 21,488 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 104,020 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 1,347 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 90,005 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech |
BERT | 6,195 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
BERT | 12,815 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
BERT | 1,572 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
BERT | 164 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
BERT | 10,795 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
DLRM | 545,174 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 2,390,910 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 298,565 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 35,991 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 1,326,940 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
Power Efficiency Offline Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 288,733 samples/sec | 93.68 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 252,721 samples/sec | 122.19 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
RetinaNet | 4,122 samples/sec | 1.32 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages |
RetinaNet | 3,805 samples/sec | 1.73 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages |
3D-UNet | 23 samples/sec | 0.008 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 |
3D-UNet | 19 samples/sec | 0.011 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 |
RNN-T | 84,508 samples/sec | 27.79 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 78,750 samples/sec | 38.88 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
BERT | 11,152 samples/sec | 3.33 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 11,158 samples/sec | 4.37 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
DLRM | 2,128,420 samples/sec | 641.77 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
Power Efficiency Server Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 229,055 queries/sec | 78.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 185,047 queries/sec | 87.2 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
RetinaNet | 3,896 queries/sec | 1.25 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages |
RetinaNet | 2,296 queries/sec | 1.21 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages |
RNN-T | 88,003 queries/sec | 25.44 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 74,995 queries/sec | 33.88 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
BERT | 9,995 queries/sec | 2.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 7,494 queries/sec | 3.45 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
DLRM | 2,002,080 queries/sec | 592.73 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
MLPerf™ v2.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99.9% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0082, 2.1-0084, 2.1-0085, 2.1-0087, 2.1-0088, 2.1-0089, 2.1-0121, 2.1-0122. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM-80GB is a preview submission
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample
1x1g.10gb is a notation used to refer to the MIG configuration. In this example, the workload is running on a single MIG slice, with 10GB of memory on a single A100.
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here
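The notes above imply two simple conversions when reading these tables: DLRM throughput in user-item pairs is samples/sec multiplied by the 270-pair average, and each power-efficiency row implies an average measured power of throughput divided by throughput per watt. A quick worked example using figures from the tables above (arithmetic on published numbers, not new measurements):

```python
# DLRM note: one sample averages 270 user-item pairs, so
# pairs/sec = samples/sec * 270.
dlrm_samples_per_sec = 2_443_220            # 8x A100, DGX A100, offline
print(f"{dlrm_samples_per_sec * 270:,.0f} pairs/sec")

# Power-efficiency tables: throughput / (throughput per watt) gives the
# implied average power measured during the run.
resnet_samples_per_sec = 288_733            # ResNet-50 v1.5, 8x A100, DGX A100
resnet_samples_per_sec_per_watt = 93.68
print(f"{resnet_samples_per_sec / resnet_samples_per_sec_per_watt:,.0f} W average")
```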
NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v2.1
NVIDIA landed top performance spots on all MLPerf™ Inference 2.1 tests, the AI-industry’s leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness. This custom harness has been designed and optimized specifically for providing the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.
MLPerf™ v2.1 A100 Inference Closed: ResNet-50 v1.5, RetinaNet, BERT 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0088, 2.1-0090. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.
NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server
A100 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 31.784 | 755 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 61.186 | 784 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 38.159 | 629 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 72.731 | 660 inf/sec | 384 | 22.11-py3 |
BERT Base Inference | A100-SXM4-80GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 3.966 | 6,050 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 6.75 | 7,110 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 4.435 | 5,408 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 7.766 | 6,179 inf/sec | 128 | 22.11-py3 |
DLRM Inference | A100-SXM4-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 1 | 65,536 | 28 | 2.206 | 12,687 inf/sec | - | 22.11-py3 |
DLRM Inference | A100-SXM4-80GB | pytorch_libtorch | PyTorch | Mixed | 2 | 2 | 65,536 | 28 | 2.243 | 24,953 inf/sec | - | 22.11-py3 |
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 1 | 65,536 | 30 | 2.316 | 12,946 inf/sec | - | 22.08-py3 |
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 1 | 2 | 65,536 | 30 | 2.39 | 25,093 inf/sec | - | 22.08-py3 |
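The Latency and Throughput columns above are consistent with a steady-state queueing view of the benchmark: with a fixed number of concurrent client requests in flight, inferences/sec ≈ concurrent requests × client batch size / latency. A quick check against three rows of the A100 table (arithmetic on the published numbers, not part of the measurement harness):

```python
# Steady-state relation (Little's law):
#   inferences/sec = concurrent requests * client batch size / latency (s)
rows = [
    # (concurrency, client batch, latency ms, published inf/sec)
    (24, 1, 31.784, 755),    # BERT Large, A100-SXM4-40GB, batch 1
    (24, 2, 61.186, 784),    # BERT Large, A100-SXM4-40GB, batch 2
    (28, 1, 2.206, 12687),   # DLRM, A100-SXM4-40GB, batch 1
]
for concurrency, batch, latency_ms, published in rows:
    implied = concurrency * batch / (latency_ms / 1000.0)
    print(f"implied {implied:,.0f} inf/sec vs published {published:,} inf/sec")
```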
A30 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 66.753 | 359 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 20 | 108.243 | 370 inf/sec | 384 | 22.11-py3 |
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.254 | 3,308 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 13.004 | 3,690 inf/sec | 128 | 22.11-py3 |
A10 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 101.549 | 236 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 198.296 | 242 inf/sec | 384 | 22.11-py3 |
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 10.831 | 2,220 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 2 | 2 | 1 | 20 | 16.999 | 2,353 inf/sec | 128 | 22.11-py3 |
T4 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 24 | 255.522 | 94 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 20 | 427.107 | 94 inf/sec | 384 | 22.11-py3 |
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 24 | 25.149 | 954 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 24 | 46.74 | 1,027 inf/sec | 128 | 22.11-py3 |
V100 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 91.464 | 263 inf/sec | 384 | 22.11-py3 |
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 20 | 146.731 | 273 inf/sec | 384 | 22.11-py3 |
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 11.298 | 2,124 inf/sec | 128 | 22.11-py3 |
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 19.138 | 2,508 inf/sec | 128 | 22.11-py3 |
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 3.452 | 8,688 inf/sec | - | 22.11-py3 |
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 1 | 2 | 65,536 | 30 | 3.739 | 16,041 inf/sec | - | 22.11-py3 |
Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
Inference Image Classification on CNNs with TensorRT
ResNet-50 v1.5 Throughput
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: Mixed | Dataset: Synthetic
ResNet-50 v1.5 Power Efficiency
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.1 | Batch Size = 128 | 22.11-py3 | Precision: Mixed | Dataset: Synthetic
A100 Full Chip Inference Performance
Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 11,807 images/sec | 63 images/sec/watt | 0.68 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
ResNet-50 | 128 | 30,814 images/sec | 80 images/sec/watt | 4.15 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50 | 225 | 32,500 images/sec | - images/sec/watt | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 8 | 11,524 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
ResNet-50v1.5 | 128 | 30,004 images/sec | 76 images/sec/watt | 4.27 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 216 | 31,228 images/sec | - images/sec/watt | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page |
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server page |
BERT-BASE | 8 | 7,300 sequences/sec | 27 sequences/sec/watt | 1.1 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 128 | 15,147 sequences/sec | 38 sequences/sec/watt | 8.45 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page |
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server page |
BERT-LARGE | 8 | 2,679 sequences/sec | 9 sequences/sec/watt | 2.99 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-LARGE | 128 | 4,965 sequences/sec | 12 sequences/sec/watt | 25.78 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
EfficientNet-B0 | 8 | 9,155 images/sec | 58 images/sec/watt | 0.87 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
EfficientNet-B0 | 128 | 30,273 images/sec | 95 images/sec/watt | 4.23 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
EfficientNet-B4 | 8 | 2,593 images/sec | 12 images/sec/watt | 3.09 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
EfficientNet-B4 | 128 | 4,588 images/sec | 12 images/sec/watt | 27.9 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
A100 1/7 MIG Inference Performance
Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,747 images/sec | 31 images/sec/watt | 2.13 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50 | 30 | 4,352 images/sec | - images/sec/watt | 6.89 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50 | 128 | 4,708 images/sec | 38 images/sec/watt | 27.19 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 8 | 3,661 images/sec | 31 images/sec/watt | 2.19 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 29 | 4,189 images/sec | - images/sec/watt | 6.92 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 128 | 4,555 images/sec | 36 images/sec/watt | 28.1 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 8 | 1,866 sequences/sec | 15 sequences/sec/watt | 4.29 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 128 | 2,304 sequences/sec | 16 sequences/sec/watt | 55.55 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-LARGE | 8 | 610 sequences/sec | 5 sequences/sec/watt | 13.11 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-LARGE | 128 | 741 sequences/sec | 5 sequences/sec/watt | 172.76 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
A100 7 MIG Inference Performance
Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 26,040 images/sec | 80 images/sec/watt | 2.16 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50 | 29 | 30,190 images/sec | - images/sec/watt | 6.73 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50 | 128 | 32,792 images/sec | 85 images/sec/watt | 27.35 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 8 | 25,299 images/sec | 77 images/sec/watt | 2.22 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 29 | 29,285 images/sec | - images/sec/watt | 2.94 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
ResNet-50v1.5 | 128 | 31,749 images/sec | 83 images/sec/watt | 28.26 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 8 | 12,941 sequences/sec | 34 sequences/sec/watt | 4.34 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-BASE | 128 | 15,157 sequences/sec | 38 sequences/sec/watt | 59.15 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-LARGE | 8 | 4,210 sequences/sec | 12 sequences/sec/watt | 13.32 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
BERT-LARGE | 128 | 4,806 sequences/sec | 12 sequences/sec/watt | 186.49 | 1x A100 | DGX A100 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-80GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
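Comparing the three A100 tables above shows how MIG partitioning scales: seven 1g.10gb slices running concurrently deliver close to seven times the single-slice throughput, and for this workload slightly exceed the full-chip number. A quick check with the ResNet-50 batch-128 figures (arithmetic on the published numbers, not new measurements):

```python
one_slice = 4_708        # ResNet-50, batch 128, 1/7 MIG, images/sec
seven_slices = 32_792    # ResNet-50, batch 128, 7 MIG slices, images/sec
full_chip = 30_814       # ResNet-50, batch 128, full A100, images/sec

print(f"7 x one slice  = {7 * one_slice:,} images/sec")   # 32,956: near-linear scaling
print(f"7-MIG measured = {seven_slices:,} images/sec")
print(f"full chip      = {full_chip:,} images/sec")
print(f"per-slice scaling ratio = {seven_slices / one_slice:.2f}x")
```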
A40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 10,034 images/sec | 38 images/sec/watt | 0.8 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
ResNet-50 | 107 | 15,868 images/sec | - images/sec/watt | 6.74 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
ResNet-50 | 128 | 15,727 images/sec | 53 images/sec/watt | 8.14 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
ResNet-50v1.5 | 8 | 9,724 images/sec | 36 images/sec/watt | 0.82 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
ResNet-50v1.5 | 100 | 14,965 images/sec | - images/sec/watt | 6.68 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
ResNet-50v1.5 | 128 | 14,950 images/sec | 50 images/sec/watt | 8.56 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
BERT-BASE | 8 | 5,602 sequences/sec | 19 sequences/sec/watt | 1.43 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
BERT-BASE | 128 | 7,735 sequences/sec | 26 sequences/sec/watt | 16.55 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
BERT-LARGE | 8 | 1,796 sequences/sec | 6 sequences/sec/watt | 4.45 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
BERT-LARGE | 128 | 2,359 sequences/sec | 8 sequences/sec/watt | 54.27 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
EfficientNet-B0 | 8 | 9,343 images/sec | 50 images/sec/watt | 0.86 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
EfficientNet-B0 | 128 | 19,243 images/sec | 64 images/sec/watt | 6.65 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
EfficientNet-B4 | 8 | 1,943 images/sec | 7 images/sec/watt | 4.12 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
EfficientNet-B4 | 128 | 2,630 images/sec | 9 images/sec/watt | 48.68 | 1x A40 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A40 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
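All rows above were measured with INT8 TensorRT engines. The sketch below shows one way to build such an engine from an ONNX model with the TensorRT 8.x Python API; `resnet50.onnx` is a placeholder path, and a real INT8 deployment also needs a calibrator (or explicit per-tensor dynamic ranges) for acceptable accuracy.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50.onnx", "rb") as f:  # placeholder model path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # required for calibrated INT8 accuracy

engine_bytes = builder.build_serialized_network(network, config)
with open("resnet50_int8.plan", "wb") as f:
    f.write(engine_bytes)
```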
A30 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 8,851 images/sec | 69 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
107 | 15,973 images/sec | - images/sec/watt | 6.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 16,057 images/sec | 98 images/sec/watt | 7.97 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
ResNet-50v1.5 | 8 | 8,725 images/sec | 67 images/sec/watt | 0.92 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
102 | 15,271 images/sec | - images/sec/watt | 6.68 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 15,541 images/sec | 95 images/sec/watt | 8.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 5,124 sequences/sec | 31 sequences/sec/watt | 1.56 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 7,590 sequences/sec | 46 sequences/sec/watt | 16.86 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 1,775 sequences/sec | 11 sequences/sec/watt | 4.51 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 2,438 sequences/sec | 15 sequences/sec/watt | 52.5 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
EfficientNet-B0 | 8 | 7,511 images/sec | 74 images/sec/watt | 1.07 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 16,697 images/sec | 102 images/sec/watt | 7.67 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
EfficientNet-B4 | 8 | 1,719 images/sec | 12 images/sec/watt | 4.65 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 2,358 images/sec | 14 images/sec/watt | 54.29 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
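As the footnote says, Efficiency is throughput divided by board power. A rough cross-check is sketched below, assuming the A30's 165 W board power limit as the denominator; the table uses measured board power, so the result is approximate.

```python
# Efficiency = throughput / board power. The A30's board power limit is 165 W;
# measured draw during the run may differ slightly.
throughput = 16_057          # images/sec, ResNet-50 batch 128 on A30 (row above)
board_power_watts = 165
print(f"{throughput / board_power_watts:.0f} images/sec/watt")  # ~97; table reports 98
```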
A30 1/4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,623 images/sec | 44 images/sec/watt | 2.21 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
29 | 4,285 images/sec | - images/sec/watt | 6.77 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 4,604 images/sec | 52 images/sec/watt | 27.8 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
ResNet-50v1.5 | 8 | 3,543 images/sec | 41 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
28 | 4,126 images/sec | - images/sec/watt | 6.79 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 4,456 images/sec | 50 images/sec/watt | 28.73 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-BASE | 8 | 1,879 sequences/sec | 20 sequences/sec/watt | 4.26 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 2,276 sequences/sec | 22 sequences/sec/watt | 56.23 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-LARGE | 8 | 604 sequences/sec | 6 sequences/sec/watt | 13.25 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 742 sequences/sec | 7 sequences/sec/watt | 172.57 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
A30 4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 14,088 images/sec | 86 images/sec/watt | 2.28 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
27 | 16,432 images/sec | - images/sec/watt | 6.59 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 17,343 images/sec | 106 images/sec/watt | 29.63 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
ResNet-50v1.5 | 8 | 13,734 images/sec | 84 images/sec/watt | 2.34 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
26 | 15,777 images/sec | - images/sec/watt | 6.61 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
128 | 16,745 images/sec | 102 images/sec/watt | 30.69 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-BASE | 8 | 6,896 sequences/sec | 42 sequences/sec/watt | 4.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 7,742 sequences/sec | 47 sequences/sec/watt | 66.35 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 | |
BERT-LARGE | 8 | 2,190 sequences/sec | 13 sequences/sec/watt | 14.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
128 | 2,450 sequences/sec | 15 sequences/sec/watt | 209.68 | 1x A30 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
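Comparing this 4 MIG table with the 1/4 MIG table above shows how close MIG comes to linear scaling: four isolated instances deliver nearly four times a single instance's throughput. A quick computation on the batch-8 ResNet-50 rows:

```python
# MIG scaling check on A30, batch-8 ResNet-50 rows from the two tables above.
single_mig = 3_623     # images/sec on one 1/4 MIG slice
four_mig = 14_088      # aggregate images/sec across all 4 slices
scaling = four_mig / (4 * single_mig)
print(f"scaling efficiency: {scaling:.1%}")  # ~97% of perfectly linear
```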
A10 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 7,826 images/sec | 52 images/sec/watt | 1.02 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 |
73 | 10,950 images/sec | - images/sec/watt | 6.67 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
128 | 11,520 images/sec | 77 images/sec/watt | 11.11 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
ResNet-50v1.5 | 8 | 7,628 images/sec | 51 images/sec/watt | 1.05 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 |
70 | 10,694 images/sec | - images/sec/watt | 6.55 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
128 | 10,852 images/sec | 73 images/sec/watt | 11.8 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 4,149 sequences/sec | 28 sequences/sec/watt | 1.93 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
128 | 5,100 sequences/sec | 34 sequences/sec/watt | 25.1 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 1,248 sequences/sec | 9 sequences/sec/watt | 6.41 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
128 | 1,576 sequences/sec | 11 sequences/sec/watt | 81.22 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
EfficientNet-B0 | 8 | 8,241 images/sec | 55 images/sec/watt | 0.97 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 |
128 | 14,102 images/sec | 94 images/sec/watt | 9.08 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 | |
EfficientNet-B4 | 8 | 1,535 images/sec | 10 images/sec/watt | 5.21 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 |
128 | 1,828 images/sec | 12 images/sec/watt | 70.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A10 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
A2 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 2,621 images/sec | 44 images/sec/watt | 3.05 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
19 | 2,927 images/sec | - images/sec/watt | 6.49 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
128 | 3,059 images/sec | 51 images/sec/watt | 41.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
ResNet-50v1.5 | 8 | 2,519 images/sec | 42 images/sec/watt | 3.18 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
18 | 2,809 images/sec | - images/sec/watt | 6.76 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
128 | 3,059 images/sec | 51 images/sec/watt | 41.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
BERT-BASE | 8 | 1,132 sequences/sec | 19 sequences/sec/watt | 7.07 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
128 | 1,194 sequences/sec | 20 sequences/sec/watt | 107.23 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
BERT-LARGE | 8 | 339 sequences/sec | 6 sequences/sec/watt | 23.58 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
128 | 362 sequences/sec | 6 sequences/sec/watt | 353.44 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
EfficientNet-B0 | 8 | 3,044 images/sec | 59 images/sec/watt | 2.63 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
128 | 3,929 images/sec | 65 images/sec/watt | 32.58 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 | |
EfficientNet-B4 | 8 | 469 images/sec | 8 images/sec/watt | 17.05 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
128 | 514 images/sec | 9 images/sec/watt | 249.04 | 1x A2 | GIGABYTE G482-Z52-00 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A2 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
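These measurements can be approximated from the NGC TensorRT container with the `trtexec` tool. The sketch below is one such invocation; the input tensor name ("input") and the ONNX path are placeholders to adjust for your exported model.

```python
import subprocess

# Approximate the tables' methodology with trtexec from the TensorRT container.
cmd = [
    "trtexec",
    "--onnx=resnet50.onnx",          # placeholder ONNX model
    "--int8",
    "--shapes=input:128x3x224x224",  # batch 128, as in the tables above
    "--warmUp=500",                  # ms of warm-up before timing
    "--duration=10",                 # seconds of timed runs
]
subprocess.run(cmd, check=True)
```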
T4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,811 images/sec | 54 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
31 | 4,615 images/sec | - images/sec/watt | 6.72 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
128 | 5,003 images/sec | 72 images/sec/watt | 25.59 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
ResNet-50v1.5 | 8 | 3,740 images/sec | 53 images/sec/watt | 2.14 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
28 | 4,309 images/sec | - images/sec/watt | 6.5 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
128 | 4,864 images/sec | 69 images/sec/watt | 26.32 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 1,684 sequences/sec | 24 sequences/sec/watt | 4.75 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
128 | 1,855 sequences/sec | 27 sequences/sec/watt | 69 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 550 sequences/sec | 8 sequences/sec/watt | 14.55 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
128 | 526 sequences/sec | 8 sequences/sec/watt | 243.35 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
EfficientNet-B0 | 8 | 4,722 images/sec | 68 images/sec/watt | 1.69 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
128 | 6,388 images/sec | 92 images/sec/watt | 20.04 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
EfficientNet-B4 | 8 | 786 images/sec | 11 images/sec/watt | 10.18 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
128 | 886 images/sec | 13 images/sec/watt | 144.49 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
V100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 4,398 images/sec | 15 images/sec/watt | 1.82 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB |
128 | 7,896 images/sec | 23 images/sec/watt | 16.21 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
ResNet-50v1.5 | 8 | 4,283 images/sec | 14 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB |
128 | 7,495 images/sec | 22 images/sec/watt | 17.08 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 2,359 sequences/sec | 7 sequences/sec/watt | 3.39 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
128 | 3,111 sequences/sec | 9 sequences/sec/watt | 41.15 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page |
2 | For Batch Size 2, please refer to the Triton Inference Server page |
8 | 777 sequences/sec | 2 sequences/sec/watt | 10.3 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
128 | 947 sequences/sec | 3 sequences/sec/watt | 135.11 | 1x V100 | DGX-2 | 22.11-py3 | Mixed | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
EfficientNet-B0 | 8 | 4,690 images/sec | 22 images/sec/watt | 1.71 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB |
128 | 9,493 images/sec | 30 images/sec/watt | 13.48 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB | |
EfficientNet-B4 | 8 | 951 images/sec | 3 images/sec/watt | 8.42 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB |
128 | 1,258 images/sec | 4 images/sec/watt | 101.76 | 1x V100 | DGX-2 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM3-32GB |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
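The BERT engines above are built for a fixed sequence length of 128, so every input must be padded or truncated to exactly 128 tokens. A sketch using the Hugging Face tokenizer (an assumption; any tokenizer that pads to a fixed length works):

```python
from transformers import AutoTokenizer  # assumes the transformers package

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(
    "An example query for a fixed-shape BERT engine.",
    padding="max_length",   # pad to exactly 128 tokens...
    truncation=True,        # ...and truncate anything longer
    max_length=128,
    return_tensors="np",
)
print(enc["input_ids"].shape)  # (1, 128) -- matches the engines' fixed profile
```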
Inference Performance of NVIDIA GPUs on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 11,644 images/sec | - images/sec/watt | 0.69 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
128 | 28,444 images/sec | - images/sec/watt | 4.5 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB | |
8 | 11,453 images/sec | - images/sec/watt | 0.7 | 1x A100 | AWS EC2 p4d.24xlarge | 22.10-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB | |
128 | 28,528 images/sec | - images/sec/watt | 4.49 | 1x A100 | AWS EC2 p4d.24xlarge | 22.10-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB | |
8 | 11,334 images/sec | - images/sec/watt | 0.71 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
128 | 29,613 images/sec | - images/sec/watt | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
BERT-LARGE | 8 | 2,663 sequences/sec | - sequences/sec/watt | 3 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB |
128 | 4,966 sequences/sec | - sequences/sec/watt | 25.78 | 1x A100 | GCP A2-HIGHGPU-1G | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | A100-SXM4-40GB | |
8 | 2,569 sequences/sec | - sequences/sec/watt | 3.11 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB | |
128 | 5,008 sequences/sec | - sequences/sec/watt | 25.56 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB | |
8 | 2,698 sequences/sec | - sequences/sec/watt | 2.96 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
128 | 4,907 sequences/sec | - sequences/sec/watt | 26.09 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-Large: Sequence Length = 128
T4 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 3,354 images/sec | - images/sec/watt | 2.39 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
128 | 4,195 images/sec | - images/sec/watt | 30.52 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 | |
BERT-LARGE | 8 | 487 sequences/sec | - sequences/sec/watt | 16.41 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4 |
128 | 433 sequences/sec | - sequences/sec/watt | 295.37 | 1x T4 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | NVIDIA T4
V100 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 4,297 images/sec | - images/sec/watt | 1.86 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB |
128 | 7,212 images/sec | - images/sec/watt | 17.75 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB | |
BERT-LARGE | 8 | 707 sequences/sec | - sequences/sec/watt | 11.32 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB |
128 | 920 sequences/sec | - sequences/sec/watt | 139.09 | 1x V100 | GCP N1-HIGHMEM-8 | 22.11-py3 | INT8 | Synthetic | TensorRT 8.5.1 | V100-SXM2-16GB
Conversational AI
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Related Resources
Download and get started with NVIDIA Riva.
Riva Benchmarks
A100 ASR Benchmarks
A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 11.4 | 1 | A100 SXM4-40GB |
citrinet | n-gram | 64 | 64.1 | 64 | A100 SXM4-40GB |
citrinet | n-gram | 128 | 103 | 126 | A100 SXM4-40GB |
citrinet | n-gram | 256 | 166.7 | 250 | A100 SXM4-40GB |
citrinet | n-gram | 384 | 235 | 371 | A100 SXM4-40GB |
citrinet | n-gram | 512 | 311 | 490 | A100 SXM4-40GB |
citrinet | n-gram | 768 | 492 | 717 | A100 SXM4-40GB |
conformer | n-gram | 1 | 16.8 | 1 | A100 SXM4-40GB |
conformer | n-gram | 64 | 109 | 64 | A100 SXM4-40GB |
conformer | n-gram | 128 | 130 | 126 | A100 SXM4-40GB |
conformer | n-gram | 256 | 236 | 249 | A100 SXM4-40GB |
conformer | n-gram | 384 | 342 | 369 | A100 SXM4-40GB |
conformer | n-gram | 512 | 485 | 486 | A100 SXM4-40GB |
A100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 10.47 | 1 | A100 SXM4-40GB |
citrinet | n-gram | 8 | 15.14 | 8 | A100 SXM4-40GB |
citrinet | n-gram | 16 | 26.2 | 16 | A100 SXM4-40GB |
citrinet | n-gram | 32 | 39.1 | 32 | A100 SXM4-40GB |
citrinet | n-gram | 48 | 48 | 48 | A100 SXM4-40GB |
citrinet | n-gram | 64 | 55.4 | 64 | A100 SXM4-40GB |
conformer | n-gram | 1 | 14.69 | 1 | A100 SXM4-40GB |
conformer | n-gram | 8 | 37.7 | 8 | A100 SXM4-40GB |
conformer | n-gram | 16 | 41.5 | 16 | A100 SXM4-40GB |
conformer | n-gram | 32 | 55.7 | 32 | A100 SXM4-40GB |
conformer | n-gram | 48 | 66.8 | 48 | A100 SXM4-40GB |
conformer | n-gram | 64 | 82.2 | 63 | A100 SXM4-40GB |
A100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
citrinet | n-gram | 32 | 4390 | A100 SXM4-40GB |
conformer | n-gram | 32 | 1700 | A100 SXM4-40GB |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
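RTFX normalizes audio throughput by wall-clock time: seconds of audio processed per second of processing. For streaming ASR with N real-time input streams, RTFX is approximately N by construction, which is why the throughput column tracks the stream count; the offline figures instead measure raw batch throughput. A minimal illustration with assumed numbers:

```python
# RTFX = seconds of audio processed per wall-clock second.
audio_seconds = 3_600      # one hour of audio in a batch (illustrative)
wall_clock_seconds = 0.82  # hypothetical processing time
rtfx = audio_seconds / wall_clock_seconds
print(f"RTFX = {rtfx:.0f}")
# At the offline Citrinet RTFX of 4,390 reported above, one hour of audio
# takes 3600 / 4390 ~= 0.82 s to transcribe.
```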
A30 ASR Benchmarks
A30 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 14.64 | 1 | A30 |
citrinet | n-gram | 64 | 101 | 63 | A30 |
citrinet | n-gram | 128 | 152 | 126 | A30 |
citrinet | n-gram | 256 | 272 | 249 | A30 |
citrinet | n-gram | 384 | 393 | 368 | A30 |
citrinet | n-gram | 512 | 569 | 484 | A30 |
conformer | n-gram | 1 | 21.76 | 1 | A30 |
conformer | n-gram | 64 | 134 | 63 | A30 |
conformer | n-gram | 128 | 216 | 126 | A30 |
conformer | n-gram | 256 | 397 | 248 | A30 |
conformer | n-gram | 384 | 672 | 364 | A30 |
A30 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 13.74 | 1 | A30 |
citrinet | n-gram | 8 | 29.4 | 8 | A30 |
citrinet | n-gram | 16 | 44.2 | 16 | A30 |
citrinet | n-gram | 32 | 58.7 | 32 | A30 |
citrinet | n-gram | 48 | 65.8 | 48 | A30 |
citrinet | n-gram | 64 | 83 | 63 | A30 |
conformer | n-gram | 1 | 20.32 | 1 | A30 |
conformer | n-gram | 8 | 42.2 | 8 | A30 |
conformer | n-gram | 16 | 51.5 | 16 | A30 |
conformer | n-gram | 32 | 71.3 | 32 | A30 |
conformer | n-gram | 48 | 103.9 | 48 | A30 |
conformer | n-gram | 64 | 126.8 | 63 | A30 |
A30 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
citrinet | n-gram | 32 | 3142 | A30 |
conformer | n-gram | 32 | 1120 | A30 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
A10 ASR Benchmarks
A10 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 12.93 | 1 | A10 |
citrinet | n-gram | 64 | 88.5 | 64 | A10 |
citrinet | n-gram | 128 | 162.6 | 126 | A10 |
citrinet | n-gram | 256 | 316 | 248 | A10 |
citrinet | n-gram | 384 | 486 | 367 | A10 |
citrinet | n-gram | 512 | 710 | 481 | A10 |
conformer | n-gram | 1 | 15.33 | 1 | A10 |
conformer | n-gram | 64 | 133 | 63 | A10 |
conformer | n-gram | 128 | 234 | 126 | A10 |
conformer | n-gram | 256 | 434 | 247 | A10 |
A10 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 10.405 | 1 | A10 |
citrinet | n-gram | 8 | 20.22 | 8 | A10 |
citrinet | n-gram | 16 | 29.8 | 16 | A10 |
citrinet | n-gram | 32 | 49.1 | 32 | A10 |
citrinet | n-gram | 48 | 67.6 | 48 | A10 |
citrinet | n-gram | 64 | 84.7 | 63 | A10 |
conformer | n-gram | 1 | 13.49 | 1 | A10 |
conformer | n-gram | 8 | 33.8 | 8 | A10 |
conformer | n-gram | 16 | 40.9 | 16 | A10 |
conformer | n-gram | 32 | 71.5 | 32 | A10 |
conformer | n-gram | 48 | 108 | 48 | A10 |
conformer | n-gram | 64 | 140 | 63 | A10 |
A10 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
citrinet | n-gram | 32 | 2719 | A10 |
conformer | n-gram | 32 | 992 | A10 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
V100 ASR Benchmarks
V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 13.91 | 1 | V100 SXM2-16GB |
citrinet | n-gram | 64 | 87.9 | 63 | V100 SXM2-16GB |
citrinet | n-gram | 128 | 153 | 125 | V100 SXM2-16GB |
citrinet | n-gram | 256 | 283.7 | 246 | V100 SXM2-16GB |
citrinet | n-gram | 384 | 407 | 363 | V100 SXM2-16GB |
citrinet | n-gram | 512 | 590 | 474 | V100 SXM2-16GB |
conformer | n-gram | 1 | 22.3 | 1 | V100 SXM2-16GB |
conformer | n-gram | 64 | 153 | 63 | V100 SXM2-16GB |
conformer | n-gram | 128 | 230.6 | 125 | V100 SXM2-16GB |
conformer | n-gram | 256 | 400 | 245 | V100 SXM2-16GB |
conformer | n-gram | 384 | 716 | 359 | V100 SXM2-16GB |
V100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 13.26 | 1 | V100 SXM2-16GB |
citrinet | n-gram | 8 | 24.32 | 8 | V100 SXM2-16GB |
citrinet | n-gram | 16 | 32.9 | 16 | V100 SXM2-16GB |
citrinet | n-gram | 32 | 50.5 | 32 | V100 SXM2-16GB |
citrinet | n-gram | 48 | 65 | 48 | V100 SXM2-16GB |
citrinet | n-gram | 64 | 84.1 | 63 | V100 SXM2-16GB |
conformer | n-gram | 1 | 19.7 | 1 | V100 SXM2-16GB |
conformer | n-gram | 8 | 55 | 8 | V100 SXM2-16GB |
conformer | n-gram | 16 | 52.3 | 16 | V100 SXM2-16GB |
conformer | n-gram | 32 | 76.7 | 32 | V100 SXM2-16GB |
conformer | n-gram | 48 | 119.8 | 47 | V100 SXM2-16GB |
conformer | n-gram | 64 | 143 | 63 | V100 SXM2-16GB |
V100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
citrinet | n-gram | 32 | 2693 | V100 SXM2-16GB |
conformer | n-gram | 32 | 964 | V100 SXM2-16GB |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
T4 ASR Benchmarks
T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 26.7 | 1 | NVIDIA T4 |
citrinet | n-gram | 64 | 170.8 | 63 | NVIDIA T4 |
citrinet | n-gram | 128 | 342 | 125 | NVIDIA T4 |
citrinet | n-gram | 256 | 736 | 242 | NVIDIA T4 |
conformer | n-gram | 1 | 59.1 | 1 | NVIDIA T4 |
conformer | n-gram | 64 | 310 | 63 | NVIDIA T4 |
conformer | n-gram | 128 | 505 | 124 | NVIDIA T4 |
T4 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
citrinet | n-gram | 1 | 25.9 | 1 | NVIDIA T4 |
citrinet | n-gram | 8 | 57 | 8 | NVIDIA T4 |
citrinet | n-gram | 16 | 60.5 | 16 | NVIDIA T4 |
citrinet | n-gram | 32 | 93.1 | 32 | NVIDIA T4 |
citrinet | n-gram | 48 | 139.7 | 47 | NVIDIA T4 |
conformer | n-gram | 1 | 53.4 | 1 | NVIDIA T4 |
conformer | n-gram | 8 | 82 | 8 | NVIDIA T4 |
conformer | n-gram | 16 | 104.1 | 16 | NVIDIA T4 |
conformer | n-gram | 32 | 239 | 32 | NVIDIA T4 |
T4 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
citrinet | n-gram | 32 | 1322 | NVIDIA T4 |
conformer | n-gram | 32 | 488 | NVIDIA T4 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.8.1 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
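These benchmarks run against a live Riva server; a client-side sketch using the nvidia-riva-client Python package is below. The server URI, audio file, and format are assumptions; offline recognition corresponds to the Offline Mode tables above.

```python
import riva.client  # pip install nvidia-riva-client

auth = riva.client.Auth(uri="localhost:50051")  # assumed local Riva server
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,   # must match the audio file below
    language_code="en-US",
    max_alternatives=1,
)

with open("sample.wav", "rb") as f:  # placeholder 16 kHz mono WAV
    audio_bytes = f.read()

# Offline recognition, as in the "Offline Mode" tables above.
response = asr.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```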
A100 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.021 | 0.003 | 145 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 4 | 0.037 | 0.006 | 336 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 6 | 0.046 | 0.007 | 395 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 8 | 0.056 | 0.009 | 421 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 10 | 0.059 | 0.01 | 434 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 32 | 0.339 | 0.015 | 437 | A100 SXM4-40GB |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
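"Avg Latency to first audio" is measured with streaming synthesis: the clock starts at the request and stops when the first audio chunk arrives. A client-side sketch under the same assumptions as the ASR example above (the voice name is a placeholder):

```python
import time
import riva.client  # pip install nvidia-riva-client

auth = riva.client.Auth(uri="localhost:50051")  # assumed local Riva server
tts = riva.client.SpeechSynthesisService(auth)

start = time.perf_counter()
responses = tts.synthesize_online(               # streaming synthesis
    text="Measuring time to first audio.",
    voice_name="English-US.Female-1",            # placeholder voice name
    language_code="en-US",
    sample_rate_hz=44100,
)
first_chunk = next(iter(responses))              # blocks until the first audio chunk
print(f"time to first audio: {time.perf_counter() - start:.3f} s")
```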
A30 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.022 | 0.004 | 127 | A30 |
FastPitch + Hifi-GAN | 4 | 0.044 | 0.007 | 267 | A30 |
FastPitch + Hifi-GAN | 6 | 0.064 | 0.009 | 292 | A30 |
FastPitch + Hifi-GAN | 8 | 0.082 | 0.011 | 310 | A30 |
FastPitch + Hifi-GAN | 10 | 0.091 | 0.013 | 318 | A30 |
FastPitch + Hifi-GAN | 16 | 0.196 | 0.014 | 332 | A30 |
FastPitch + Hifi-GAN | 32 | 0.427 | 0.019 | 349 | A30 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
A10 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.021 | 0.004 | 127 | A10 |
FastPitch + Hifi-GAN | 4 | 0.049 | 0.008 | 235 | A10 |
FastPitch + Hifi-GAN | 6 | 0.072 | 0.011 | 250 | A10 |
FastPitch + Hifi-GAN | 8 | 0.096 | 0.014 | 256 | A10 |
FastPitch + Hifi-GAN | 16 | 0.218 | 0.02 | 278 | A10 |
FastPitch + Hifi-GAN | 32 | 0.521 | 0.024 | 284 | A10 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
V100 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.024 | 0.005 | 104 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 4 | 0.055 | 0.009 | 215 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 6 | 0.08 | 0.012 | 227 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 8 | 0.108 | 0.015 | 232 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 10 | 0.119 | 0.018 | 235 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 16 | 0.238 | 0.022 | 254 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 32 | 0.562 | 0.026 | 264 | V100 SXM2-16GB |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
T4 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.05 | 0.007 | 64 | NVIDIA T4 |
FastPitch + Hifi-GAN | 4 | 0.096 | 0.016 | 121 | NVIDIA T4 |
FastPitch + Hifi-GAN | 6 | 0.142 | 0.022 | 127 | NVIDIA T4 |
FastPitch + Hifi-GAN | 8 | 0.188 | 0.028 | 132 | NVIDIA T4 |
FastPitch + Hifi-GAN | 10 | 0.218 | 0.03 | 134 | NVIDIA T4 |
FastPitch + Hifi-GAN | 16 | 0.412 | 0.042 | 142 | NVIDIA T4 |
FastPitch + Hifi-GAN | 32 | 1.024 | 0.047 | 145 | NVIDIA T4 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.8.1 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
Last updated: January 7th, 2023