Discover How Tensor Cores Accelerate Your Mixed Precision Models

From intelligent assistants to autonomous robots and beyond, your deep learning models are tackling challenges that are rapidly growing in complexity. But training these models to convergence has become increasingly difficult, often leading to long, inefficient training cycles.

You don’t have to let those limitations slow your work. NVIDIA Ampere, Volta, and Turing GPUs powered by Tensor Cores give you an immediate path to faster training and greater deep learning performance. The third generation of Tensor Cores, introduced in the NVIDIA Ampere architecture, provides a major performance boost and delivers new precisions covering the full spectrum from research to production: FP32, TensorFloat-32 (TF32), FP16, INT8, INT4, and bfloat16. With Tensor Cores enabled, you can dramatically increase throughput and reduce AI training times.
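These precisions differ in how they split their bits between exponent (dynamic range) and mantissa (precision). A small illustrative sketch of the standard bit widths follows; the `FORMATS` table is for exposition only, not an NVIDIA API:

```python
# Bit layouts of the floating-point precisions above (sign bit omitted).
# FP32, TF32, and bfloat16 share an 8-bit exponent (same dynamic range),
# while FP16 trades range (5-bit exponent) for the same 10-bit mantissa as TF32.
FORMATS = {
    "FP32":     (8, 23),
    "TF32":     (8, 10),
    "FP16":     (5, 10),
    "bfloat16": (8, 7),
}
# INT8 and INT4 are integer formats, so they have no exponent/mantissa split.

for name, (exp_bits, man_bits) in FORMATS.items():
    # The mantissa width sets the rounding step at 1.0 (machine epsilon,
    # 2^-mantissa); the exponent width sets the representable range.
    print(f"{name:9s} exponent={exp_bits} bits  mantissa={man_bits} bits  eps=2^-{man_bits}")
```

This is why TF32 can stand in for FP32 in training: it covers the same numeric range, giving up only low-order mantissa precision.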

New To Tensor Cores?

See how Tensor Cores accelerate your AI training and deployment


NVIDIA GPUs with Tensor Cores enabled have already helped Fast.AI and AWS achieve impressive performance gains and powered NVIDIA to the top spots on MLPerf, the first industry-wide AI benchmark.

Customer Success Stories

Nuance achieved a 50% speedup in ASR and NLP training using mixed precision

Learn More

AWS recommends Tensor Cores for the most complex deep learning models and scientific applications

Learn More

Performance Benchmarks

NVIDIA Captures Top Spots on MLPerf - World’s First Industry-Wide AI Benchmark by Leveraging Tensor Cores


See NVIDIA AI product performance across multiple frameworks, models and GPUs

Learn More

High Performance Computing

NVIDIA Tensor Core GPUs Power 5 of 6 Gordon Bell Finalists in Scientific Applications

Learn More

Using Mixed Precision for FP64 Scientific Computing

Learn More

“Machine learning researchers, data scientists, and engineers want to accelerate time to solution. When TensorFloat-32 is natively integrated into PyTorch, it will enable out of the box acceleration with zero code changes while maintaining accuracy of FP32 when using the NVIDIA Ampere Architecture based GPUs.”

— The PyTorch Team

“TensorFloat-32 provides a huge out of the box performance increase for AI applications for training and inference while preserving FP32 levels of accuracy. We plan to make TensorFloat-32 supported natively in TensorFlow to enable data scientists to benefit from dramatically higher speedups in NVIDIA A100 Tensor Core GPUs without any code changes.”

— Kemal El Moujahid, Director of Product Management for TensorFlow

“Nuance Research advances and applies conversational AI technologies to power solutions that redefine how humans and computers interact. The rate of our advances reflects the speed at which we train and assess deep learning models. With Automatic Mixed Precision, we’ve realized a 50% speedup in TensorFlow-based ASR model training without loss of accuracy via a minimal code change. We’re eager to achieve a similar impact in our other deep learning language processing applications.”

— Wenxuan Teng, Senior Research Manager, Nuance Communications

“Automated mixed precision powered by NVIDIA Tensor Core GPUs on Alibaba allows us to instantly speedup AI models nearly 3X. Our researchers appreciated the ease of turning on this feature to instantly accelerate our AI.”

— Wei Lin, Sr. Director, Alibaba Computing Platform

“Clova AI pursues advanced multimodal platforms as a partnership between Korea’s top search engine, NAVER, and Japan’s top messenger, LINE. Clova AI’s LaRva team focuses on language understanding in this platform to enable AI-based services. Using automatic mixed precision powered by NVIDIA Tensor Core GPUs increased throughput and enabled us to double our batch size for massive models like RoBERTa. With these optimizations, we achieved a 2x training speedup while maintaining accuracy. We expect this improved technology to enhance many of our NLP services, including AI for Contact Center. That means significant cost savings in model production and enhanced services for customers in less time.”

— Dongjun Lee and Sungdong Kim, Machine Learning Engineers, NAVER

Learn How Tensor Cores Accelerate Your Models

Accelerated models speed your time to insight, and NVIDIA Tensor Cores accelerate your models. With the third generation of Tensor Cores in NVIDIA Ampere GPUs, you can unlock up to 10X higher FLOPS using TF32 with zero code changes. The new TF32 format delivers FP32-level accuracy while dramatically increasing performance. With automatic mixed precision enabled, you can gain a further 3X performance boost using FP16.
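TF32 keeps FP32's 8-bit exponent, so it covers the same numeric range, but rounds the mantissa from 23 bits to 10. The rounding can be sketched with nothing but the standard library; the `tf32_round` helper below is hypothetical and ignores edge cases such as infinities and NaNs that real hardware handles:

```python
import struct

def tf32_round(x: float) -> float:
    """Illustrative only: round an FP32 value to TF32 precision
    (8-bit exponent kept, mantissa cut from 23 bits to 10)."""
    # Reinterpret the float32 bit pattern as a 32-bit unsigned integer.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    dropped = 23 - 10  # 13 low-order mantissa bits are discarded
    bits = (bits + (1 << (dropped - 1))) & 0xFFFFFFFF   # round to nearest
    bits &= ~((1 << dropped) - 1) & 0xFFFFFFFF          # clear the dropped bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(tf32_round(1.0))             # 1.0 -- exactly representable in TF32
print(abs(tf32_round(1/3) - 1/3))  # small: only low-order precision is lost
```

Because only the trailing mantissa bits go, values near 1.0 move by at most about one part in 2^11, which is why TF32 training typically matches FP32 accuracy.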

On the previous generations of Tensor Cores in Volta and Turing GPUs, you can get a 3X performance increase over FP32 using automatic mixed precision (performing matrix multiplies in FP16 and accumulating the results in FP32, while maintaining accuracy) with just a couple of lines of code. Automatic Mixed Precision delivers this speedup by:

  • Halving storage requirements, which enables larger batch sizes on a fixed memory budget, often with super-linear benefit.
  • Halving memory traffic by reducing the size of gradient and activation tensors.
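The multiply-in-FP16, accumulate-in-FP32 scheme described above can be sketched in pure Python using `struct`'s half-precision format; the helper names are illustrative, not an NVIDIA API. The comparison also shows why the accumulator needs the wider format: an FP16 accumulator stops absorbing small contributions once they fall below half the spacing between adjacent FP16 values.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 (FP16) value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# FP16 stores 2 bytes per element versus 4 for FP32, which is where the
# halved storage and memory traffic come from.
assert struct.calcsize('<e') == 2 and struct.calcsize('<f') == 4

def dot_fp32_accumulate(a, b):
    # Multiply at FP16 precision, accumulate at full precision --
    # the Tensor Core scheme described in the text.
    acc = 0.0
    for x, y in zip(a, b):
        acc += fp16(fp16(x) * fp16(y))
    return acc

def dot_fp16_accumulate(a, b):
    # Accumulate in FP16 as well: the running sum eventually stops growing.
    acc = 0.0
    for x, y in zip(a, b):
        acc = fp16(acc + fp16(fp16(x) * fp16(y)))
    return acc

a = [0.001] * 4096
b = [1.0] * 4096
print(dot_fp32_accumulate(a, b))  # ~4.098, close to the true 4.096
print(dot_fp16_accumulate(a, b))  # stalls at 4.0: increments round away
```

The FP32-accumulated result stays near the true sum, while the all-FP16 version flatlines once the partial sum reaches 4.0, because each 0.001 increment is smaller than half of FP16's spacing there.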

Mixed Precision Training Techniques Using Tensor Cores For Deep Learning

Learn how mixed precision accelerates your models


Implementing Tensor Cores in your deep learning workflows is seamless. NVIDIA provides out-of-the-box models to get you started immediately, as well as tools to optimize your own models for Tensor Cores.

Customer Implementation Whitepapers

Facebook: Scaling NMT 5X with Mixed Precision (arXiv, Sep 2018)

Learn More

Baidu Research and NVIDIA on Mixed Precision Training (ICLR 2018)

Learn More

Technical Blogs

Open Source Software Optimizations for Mixed Precision Training on Tensor Cores

Learn More

Automatic Mixed Precision: Automatically Enabling Tensor Cores in PyTorch

Learn More

Automatic Mixed Precision: Automatically Enabling Tensor Cores in TensorFlow

Learn More

Developer Resources

Webinar: Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide

Learn More

Webinar: Mixed-Precision Training of Neural Networks

Learn More

Webinar: Real-World Examples Training Neural Networks with Mixed Precision

Learn More

Webinar: Automatic Mixed Precision (AMP) – easily enable mixed precision in your model with 2 lines of code

Learn More

Blog: AI’s Latest Precision Format Delivers 20x Speed-Ups with TensorFloat-32

Learn More

Blog: Comparison between precision computing techniques

Learn More

Containers And Out-Of-The-Box Optimized Models Get You Running Quickly

You can try Tensor Cores in the cloud (on any major CSP) or on GPUs in your own datacenter. NVIDIA NGC is a comprehensive catalog of deep learning and scientific applications, packaged in easy-to-use software containers, to get you started immediately.

Quickly experiment with Tensor Core-optimized, out-of-the-box deep learning models from NVIDIA.

Get Tensor Core Optimized Examples

Application-specific examples are readily available for popular deep learning frameworks


Access Tensor Core Optimized Examples via NVIDIA NGC and GitHub:

Get NVIDIA NGC Containers
(Pre-Packaged Examples)

PyTorch | TensorFlow | MXNet

Get NVIDIA NGC Model Scripts
(Choose Examples)


GitHub Repository


Implement Tensor Cores To Easily Speed Up Your Own Models

Realize faster performance on your own models with NVIDIA resources. Analyze your models with NVIDIA’s profiler tools, and optimize your Tensor Core implementation with helpful documentation.

Analyze your model

NVIDIA NVProf is a profiler that makes it easy to analyze your model and identify opportunities to optimize mixed precision on Tensor Cores.



Enabling Automatic Mixed Precision in MXNet

Learn More


Enabling Automatic Mixed Precision in PyTorch

Learn More

Webinar: Automatic Mixed Precision – easily enable mixed precision in your model with 2 lines of code

Learn More

DevBlog: Tools For Easy Mixed Precision Training in PyTorch

Learn More


Enabling Automatic Mixed Precision in TensorFlow

Learn More

Tutorial: TensorFlow ResNet-50 with Mixed-Precision

Learn More


Enabling Automatic Mixed Precision in PaddlePaddle

Learn More


SDK: Mixed-Precision Best Practices

Learn More