NVIDIA TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

Get Started

TensorRT-based applications perform up to 40X faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision while preserving high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive products.

TensorRT is built on CUDA®, NVIDIA’s parallel programming model, and enables you to optimize inference leveraging libraries, development tools, and technologies in CUDA-X™ for artificial intelligence, autonomous machines, high-performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also leverages sparse tensor cores providing an additional performance boost.

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, fraud detection, and natural language processing. Reduced precision inference significantly reduces application latency, which is a requirement for many real-time services, as well as autonomous and embedded applications.

With TensorRT, developers can focus on creating novel AI-powered applications rather than performance tuning for inference deployment.

1. Reduced Precision

Maximizes throughput by quantizing models to INT8 while preserving accuracy; the build sketch after this list shows where these precision flags are set

2. Layer and Tensor Fusion

Optimizes use of GPU memory and bandwidth by fusing nodes into a single kernel

3. Kernel Auto-Tuning

Selects the best data layouts and algorithms for the target GPU platform

4. Dynamic Tensor Memory

Minimizes memory footprint and reuses memory for tensors efficiently

5. Multi-Stream Execution

Uses a scalable design to process multiple input streams in parallel

6. Time Fusion

Optimizes recurrent neural networks over time steps with dynamically generated kernels
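
Most of these optimizations are applied automatically while TensorRT builds an engine. As a rough illustration only, here is a minimal build sketch using the TensorRT 8.x Python API (the C++ API follows the same flow); the file names model.onnx and engine.plan are placeholders, not shipped samples.

```python
# Minimal engine-build sketch (TensorRT 8.x Python API assumed).
# "model.onnx" and "engine.plan" are placeholder file names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Import a trained model via the ONNX parser.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels where profitable
# config.set_flag(trt.BuilderFlag.INT8)  # INT8 additionally needs a calibrator
# config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # Ampere sparse tensor cores

# Layer/tensor fusion, kernel auto-tuning, and memory planning happen here.
engine_bytes = builder.build_serialized_network(network, config)
with open("engine.plan", "wb") as f:
    f.write(engine_bytes)
```

Building can take a while because the auto-tuner times candidate kernels on the actual target GPU, which is why engines are typically serialized once and reused.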


World-Leading Inference Performance

TensorRT powered NVIDIA’s wins across all performance tests in the industry-standard MLPerf Inference benchmark. It accelerates every model across the data center and edge in computer vision, speech-to-text, natural language understanding (BERT), and recommender systems.

Benchmark charts: Conversational AI, Computer Vision, Recommender Systems (see all benchmarks on the MLPerf results page).

Accelerates Every Inference Platform

TensorRT can optimize and deploy applications to the data center, as well as embedded and automotive environments. It powers inference solutions such as NVIDIA TAO, NVIDIA DRIVE®, NVIDIA Clara™, and NVIDIA JetPack™.

TensorRT is also integrated with application-specific SDKs such as NVIDIA DeepStream, Riva, Merlin™, Maxine™, and Broadcast Engine to give developers a unified path to deploy intelligent video analytics, conversational AI, recommender systems, video conferencing, and streaming apps in production.


Supports All Major Frameworks

NVIDIA works closely with deep learning framework developers to achieve optimized performance for inference on AI platforms using TensorRT. If you are performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.
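
The Python API mirrors the C++ API. As a hedged sketch (TensorRT 8.x bindings and pycuda assumed; engine.plan is a placeholder engine file from an earlier build), deserializing and running an engine looks roughly like this:

```python
# Minimal inference sketch (TensorRT 8.x Python API and pycuda assumed).
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("engine.plan", "rb") as f:  # placeholder serialized engine
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer pair per binding (inputs and outputs).
bindings, buffers = [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(tuple(engine.get_binding_shape(i)), dtype=dtype)
    device = cuda.mem_alloc(host.nbytes)
    bindings.append(int(device))
    buffers.append((host, device, engine.binding_is_input(i)))

# Copy inputs in, execute synchronously, copy outputs back.
for host, device, is_input in buffers:
    if is_input:
        cuda.memcpy_htod(device, host)  # fill `host` with real data first
context.execute_v2(bindings)
for host, device, is_input in buffers:
    if not is_input:
        cuda.memcpy_dtoh(host, device)
```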

Below are a few integrations with information on how to get started.

TensorFlow

TensorRT and TensorFlow are tightly integrated so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT.
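
As a rough sketch of that integration (a recent TensorFlow 2.x build with TF-TRT support assumed; the SavedModel paths are hypothetical), conversion is a few lines:

```python
# Minimal TF-TRT conversion sketch (recent TensorFlow 2.x assumed).
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="my_saved_model",    # hypothetical input path
    precision_mode=trt.TrtPrecisionMode.FP16,  # run supported ops in FP16
)
converter.convert()                   # replaces supported subgraphs with TRT ops
converter.save("my_trt_saved_model")  # reload later with tf.saved_model.load
```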

MATLAB

MATLAB is integrated with TensorRT through GPU Coder so that engineers and scientists using MATLAB can automatically generate high-performance inference engines for NVIDIA Jetson™, DRIVE, and Data Center platforms.

ONNX

TensorRT provides an ONNX parser to easily import ONNX models from popular frameworks into TensorRT. It’s also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference in the ONNX format.
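
The ONNX parser route is sketched in the build example above. Through ONNX Runtime, TensorRT can instead be enabled as an execution provider; a minimal sketch, assuming an onnxruntime-gpu build with TensorRT support and a placeholder model.onnx:

```python
# Minimal ONNX Runtime sketch with the TensorRT execution provider.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model file
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
inp = session.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
outputs = session.run(None, {inp.name: x})
```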



Introductory Resources

Introductory Blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.


Introductory Webinar

Watch this webinar to learn about TensorRT 8.0 features and the tools that simplify the inference workflow.


Developer Guide

See how to get started with TensorRT in this step-by-step developer guide and API reference.


NVIDIA TensorRT is a free download for members of the NVIDIA Developer Program and in the NVIDIA NGC™ catalog. Open-source samples and parsers are available from GitHub.
