Discover How Tensor Cores Accelerate Your Mixed Precision Models
From intelligent assistants to autonomous robots and beyond, your deep learning models are addressing challenges that are rapidly growing in complexity. But converging these models has become increasingly difficult and often leads to underperforming and inefficient training cycles.
You don’t have to let those limitations slow your work. NVIDIA Ampere, Volta and Turing GPUs powered by Tensor Cores give you an immediate path to faster training and greater deep learning performance. The third generation of Tensor Cores, introduced in the NVIDIA Ampere architecture, provides a huge performance boost and delivers new precisions to cover the full spectrum required from research to production: FP32, TensorFloat-32 (TF32), FP16, INT8, INT4 and bfloat16. With Tensor Cores enabled, you can dramatically accelerate your throughput and reduce AI training times.
NVIDIA GPUs with Tensor Cores enabled have already helped Fast.AI and AWS achieve impressive performance gains and powered NVIDIA to the top spots on MLPerf, the first industry-wide AI benchmark.
Customer Success Stories

AWS recommends Tensor Cores for the most complex deep learning models and scientific applications
Performance Benchmarks
High Performance Computing
“Machine learning researchers, data scientists, and engineers want to accelerate time to solution. When TensorFloat-32 is natively integrated into PyTorch, it will enable out of the box acceleration with zero code changes while maintaining accuracy of FP32 when using the NVIDIA Ampere Architecture based GPUs.”
— The PyTorch Team
“TensorFloat-32 provides a huge out of the box performance increase for AI applications for training and inference while preserving FP32 levels of accuracy. We plan to make TensorFloat-32 supported natively in TensorFlow to enable data scientists to benefit from dramatically higher speedups in NVIDIA A100 Tensor Core GPUs without any code changes.”
— Kemal El Moujahid, Director of Product Management for TensorFlow
“Nuance Research advances and applies conversational AI technologies to power solutions that redefine how humans and computers interact. The rate of our advances reflects the speed at which we train and assess deep learning models. With Automatic Mixed Precision, we’ve realized a 50% speedup in TensorFlow-based ASR model training without loss of accuracy via a minimal code change. We’re eager to achieve a similar impact in our other deep learning language processing applications.”
— Wenxuan Teng, Senior Research Manager, Nuance Communications
“Automated mixed precision powered by NVIDIA Tensor Core GPUs on Alibaba allows us to instantly speed up AI models nearly 3X. Our researchers appreciated the ease of turning on this feature to instantly accelerate our AI.”
— Wei Lin, Sr Director, Alibaba Computing Platform
“Clova AI pursues advanced multimodal platforms as a partnership between Korea’s top search engine, NAVER, and Japan’s top messenger, LINE. Clova AI’s LaRva team focuses on language understanding in this platform to enable AI-based services. Using automatic mixed precision powered by NVIDIA Tensor Core GPUs increased throughput and enabled us to double our batch size for massive models like RoBERTa. With these optimizations we achieved a training speedup of 2x while still maintaining accuracy. We expect this improved technology can enhance our many NLP services including AI for Contact Center. That means significant cost savings in our model production and enhanced services for customers in shorter time.”
— Dongjun Lee and Sungdong Kim, Machine Learning Engineers, NAVER
Learn How Tensor Cores Accelerate Your Models
Accelerated models speed your time to insight. With the third generation of Tensor Cores in NVIDIA Ampere GPUs, you can unlock up to 10X higher FLOPS using TF32 with zero code changes. The new TF32 format delivers the accuracy of FP32 while increasing performance dramatically. Additionally, with automatic mixed precision enabled, you can gain a further 3X performance boost with FP16.
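To make the formats concrete, here is a minimal sketch that emulates their rounding in plain Python. It assumes the commonly documented bit layouts: FP32 has 8 exponent bits and 23 mantissa bits, TF32 keeps the 8 exponent bits but only 10 mantissa bits, bfloat16 keeps 8 exponent bits and 7 mantissa bits, and FP16 has 5 exponent bits and 10 mantissa bits. The helper names (round_fp32_mantissa, to_tf32, to_bf16, to_fp16) are hypothetical and are not part of any NVIDIA or framework API. This is a software illustration only and says nothing about how the hardware is implemented.

```python
import struct


def round_fp32_mantissa(x: float, keep_bits: int) -> float:
    """Round x to FP32, then keep only `keep_bits` explicit mantissa bits.

    Uses round-to-nearest-even on the dropped bits. Illustration only:
    finite inputs are assumed, Inf/NaN are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - keep_bits                      # low mantissa bits to discard
    bits += (1 << (drop - 1)) - 1 + ((bits >> drop) & 1)
    bits &= 0xFFFFFFFF & ~((1 << drop) - 1)    # clear the discarded bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_tf32(x: float) -> float:
    """Emulated TF32: FP32 exponent range, 10 mantissa bits."""
    return round_fp32_mantissa(x, 10)


def to_bf16(x: float) -> float:
    """Emulated bfloat16: FP32 exponent range, 7 mantissa bits."""
    return round_fp32_mantissa(x, 7)


def to_fp16(x: float) -> float:
    """IEEE half precision via struct's 'e' format; overflow maps to inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


if __name__ == "__main__":
    small_step = 1.0 + 2 ** -12   # representable in FP32, finer than 10 mantissa bits
    large = 1e38                  # inside FP32/TF32 range, far outside FP16 range
    for name, fn in [("tf32", to_tf32), ("bf16", to_bf16), ("fp16", to_fp16)]:
        print(name, fn(small_step), fn(large))
```

In this emulation, TF32 and FP16 resolve the same steps near 1.0 because both keep 10 mantissa bits, while TF32 and bfloat16 still represent very large values that overflow FP16.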
On previous generations of Tensor Cores in Volta and Turing GPUs, you can get a 3X performance increase over FP32 using automatic mixed precision (performing matrix multiplies in FP16 and accumulating the results in FP32 while maintaining accuracy) with just a couple of lines of code. Automatic mixed precision delivers this by:
- Halving storage requirements (enables increased batch size on a fixed memory budget) with super-linear benefit.
- Generating half the memory traffic by reducing the size of gradient and activation tensors.
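The two ideas above, half-size storage and FP16 multiplies with FP32 accumulation, can be sketched in plain Python. This is a software emulation under stated assumptions, not how Tensor Cores are programmed: rounding is emulated with the struct module's half-precision and single-precision formats, and the names fp16, fp32, tensor_bytes and dot are hypothetical helpers introduced only for this illustration.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp32(x: float) -> float:
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


BYTES_FP16 = struct.calcsize("<e")  # bytes per FP16 element
BYTES_FP32 = struct.calcsize("<f")  # bytes per FP32 element


def tensor_bytes(num_elements: int, bytes_per_element: int) -> int:
    """Storage needed for a dense tensor with the given element size."""
    return num_elements * bytes_per_element


def dot(a, b, accumulate):
    """Dot product with FP16 inputs and a chosen accumulator precision.

    Each product of two FP16 values is computed in double precision, then
    the running sum is rounded to the accumulator format after every add.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + fp16(x) * fp16(y))
    return acc


if __name__ == "__main__":
    n = 1_000_000
    print("FP32 bytes:", tensor_bytes(n, BYTES_FP32))
    print("FP16 bytes:", tensor_bytes(n, BYTES_FP16))

    a = [0.001] * 10_000
    b = [1.0] * 10_000
    print("FP16 accumulate:", dot(a, b, fp16))
    print("FP32 accumulate:", dot(a, b, fp32))
```

In this toy case the FP16 accumulator stops growing once the running sum is large enough that each small product is rounded away, while the FP32 accumulator stays close to the expected sum. That is the motivation for keeping the accumulation in FP32 even when the inputs are stored in FP16.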
Mixed Precision Training Techniques Using Tensor Cores For Deep Learning
Learn how mixed precision accelerates your models
Implementation of your deep learning workflows is seamless. NVIDIA provides out-of-the-box models to get started immediately, as well as tools to optimize your models for Tensor Cores.
Customer Implementation Whitepapers
Technical Blogs
Open Source Software Optimizations for Mixed Precision Training on Tensor Cores
Automatic Mixed Precision for auto enabling of Tensor Cores in PyTorch
Automatic Mixed Precision for auto enabling of Tensor Cores in TensorFlow
Developer Resources
Webinar: Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide
Webinar: Mixed-Precision Training of Neural Networks
Webinar: Real-World Examples Training Neural Networks with Mixed Precision
Webinar: Automatic Mixed Precision (AMP) – easily enable mixed precision in your model with 2 lines of code
Blog: AI’s Latest Precision Format Delivers 20x Speed-Ups with TensorFloat-32
Blog: Comparison between precision computing techniques
Containers And Out-Of-The-Box Optimized Models Get You Running Quickly
You can try Tensor Cores in the cloud (any major CSP) or on GPUs in your own data center. NVIDIA NGC is a comprehensive catalog of deep learning and scientific applications in easy-to-use software containers to get you started immediately.
Quickly experiment with Tensor Core optimized, out-of-the-box deep learning models from NVIDIA.
Get Tensor Core Optimized Examples
Application specific examples readily available for popular deep learning frameworks
Access Tensor Core Optimized Examples via NVIDIA NGC and GitHub:
Get NVIDIA NGC Model Scripts (Choose Examples)
GitHub Repository
Implement Tensor Cores To Easily Speed Up Your Own Models
Realize faster performance on your own models with NVIDIA resources. Analyze your models with NVIDIA's profiler tools and optimize your Tensor Cores implementation with helpful documentation.
Analyze your model
NVIDIA NVProf is a profiler that helps you analyze your own model and optimize it for mixed precision on Tensor Cores.
MXNet
Enabling Automatic Mixed Precision in MXNet
PyTorch
Enabling Automatic Mixed Precision in PyTorch
Webinar: Automatic Mixed Precision – easily enable mixed precision in your model with 2 lines of code
DevBlog: Tools For Easy Mixed Precision Training in PyTorch
TensorFlow
Enabling Automatic Mixed Precision in TensorFlow
Tutorial: TensorFlow ResNet-50 with Mixed-Precision
PaddlePaddle
Enabling Automatic Mixed Precision in PaddlePaddle
Documentation
SDK: Mixed-precision best practices