
Five AI Model Optimization Techniques for Faster, Smarter Inference

As AI models grow in scale and architectural complexity, researchers and engineers continue to explore new techniques for optimizing the performance of AI systems in production and reducing overall cost.

Model optimization is a family of techniques focused on making inference serving more efficient, lowering compute cost, improving user experience, and enabling systems to scale. It spans a range of methods, from fast, low-effort approaches such as model quantization to powerful multi-step pipelines such as pruning with knowledge distillation, offering a valuable balance between performance and resources.

This post introduces five model optimization techniques implemented with NVIDIA Model Optimizer and explains how each one improves performance, total cost of ownership (TCO), and scalability when models are deployed on NVIDIA GPUs.

These are the most advanced and scalable methods Model Optimizer offers today, and teams can adopt them immediately to reduce cost per token, increase throughput, and speed up inference at scale.

Figure 1. The five model optimization techniques with the greatest impact: post-training quantization (fastest path to optimization), quantization-aware training (simple accuracy recovery), quantization-aware distillation (max accuracy and speedup), speculative decoding (speedup without model changes), and pruning and distillation (slim the model, keep the intelligence)

1. Post-training quantization

Post-training quantization (PTQ) is the fastest path to model optimization. Without modifying the original training pipeline, you take an existing model (FP16/BF16/FP8) and, using a calibration dataset, compress it into a lower-precision format such as FP8, NVFP4, INT8, or INT4. It is a suitable first optimization strategy for most teams: it integrates easily with Model Optimizer and quickly delivers significant latency and throughput gains, even on large foundation models.
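To make this concrete, here is a minimal sketch of PTQ using Model Optimizer's PyTorch quantization API (modelopt.torch.quantization). The calib_loader is a hypothetical calibration DataLoader, and the exact config names may vary between releases:

```python
import modelopt.torch.quantization as mtq

# calib_loader is a hypothetical DataLoader yielding a few hundred
# representative samples; PTQ only needs a small calibration set.
def forward_loop(model):
    for batch in calib_loader:
        model(batch)  # forward passes let the quantizer observe activation ranges

# Apply a built-in FP8 recipe; other recipes (for example, NVFP4 or INT8)
# are selected the same way by swapping in a different config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

After quantization, the model can be exported and served through the usual deployment path; no retraining is involved.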

Figure 2. How representable range and precision change when quantizing from FP16 to FP8 or FP4: FP16 spans ±65,504 with closely spaced values, FP8 narrows to ±448 with coarser spacing, and FP4 covers only ±6, illustrating the range-versus-precision trade-off as bit width shrinks
Pros:
– Fastest time to value
– Requires only a small calibration dataset
– Memory, latency, and throughput gains stack with other optimization techniques
– Supports highly customized quantization recipes (for example, an NVFP4 KV cache)
Cons:
– Accuracy can degrade; if quality of service drops below the SLA, supplement with other techniques such as QAT/QAD
Table 1. Pros and cons of PTQ

For more information, see How to Optimize Large Language Models with Post-Training Quantization for Performance While Preserving Accuracy.

2. Quantization-aware training

Quantization-aware training (QAT) adds a short, targeted fine-tuning phase that lets the model adapt to the errors introduced by low precision. It simulates quantization noise during the forward pass while computing gradients at higher precision. QAT is the recommended next step when you need more accuracy than PTQ can deliver.
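The mechanics are easy to sketch in plain PyTorch: the forward pass runs on fake-quantized weights, while a straight-through estimator (STE) passes gradients unchanged to the high-precision weights. This is an illustrative 4-bit example, not Model Optimizer's actual implementation:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulate low-precision rounding in the forward pass while letting
    gradients flow through unchanged (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, scale):
        # Round onto a signed 4-bit grid, then dequantize back to float.
        return torch.clamp(torch.round(w / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: the rounding step is treated as identity for gradients.
        return grad_output, None

def qat_forward(linear: torch.nn.Linear, x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale from the current weight range (a simple calibration choice).
    scale = linear.weight.detach().abs().max() / 7
    w_q = FakeQuantSTE.apply(linear.weight, scale)
    return torch.nn.functional.linear(x, w_q, linear.bias)
```

Because the optimizer still updates the high-precision weights, the model learns weight values that survive rounding with minimal accuracy loss.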

Figure 3. The QAT workflow: a model is prepared with a Model Optimizer quantization recipe and calibration data, then trained on a subset of the original data with fake-quantized weights in the forward pass and a straight-through estimator (STE) in the backward pass, looping until convergence
Pros:
– Recovers all or most of the accuracy lost to quantization at low precision
– Fully compatible with NVFP4, and especially helpful for FP4 stability
Cons:
– Requires additional training budget and data
– Takes longer to implement than a PTQ-only approach
Table 2. Pros and cons of QAT

For more information, see How Quantization-Aware Training Enables Accuracy Recovery at Low Precision.

3. Quantization-aware distillation

Quantization-aware distillation (QAD) goes a step beyond QAT. By adding distillation, the student model not only learns to cope with quantization error but also stays aligned with a full-precision teacher through a distillation loss. Building knowledge distillation into QAT meaningfully improves model quality, enabling ultra-low-precision deployment at inference time while better preserving accuracy. For downstream tasks that degrade noticeably after quantization, QAD is the more effective solution.
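The new ingredient relative to QAT is the loss function. A minimal sketch, assuming Hinton-style distillation where alpha and the temperature T are illustrative hyperparameters:

```python
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Standard task loss on the fake-quantized student's predictions.
    task = F.cross_entropy(student_logits, labels)
    # Distillation loss pulls the student's distribution toward the
    # full-precision teacher's softened distribution.
    distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * distill + (1 - alpha) * task
```

The student's forward pass uses the same fake quantization and STE machinery as QAT; only the training signal changes.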

Figure 4. QAD trains a low-precision student under the guidance of a full-precision teacher: the student runs a fake-quantized forward pass, the teacher a standard one, and a combined distillation-plus-task loss flows back through the student via an STE
Pros:
– High accuracy recovery
– Fits multi-stage post-training pipelines; easy to configure with stable convergence
Cons:
– Requires extra training cycles after pretraining
– Uses more memory
– Currently somewhat more complex to implement
Table 3. Pros and cons of QAD

For more information, see How Quantization-Aware Training Enables Accuracy Recovery at Low Precision.

4. Speculative decoding

The decode phase of inference is notoriously bound by its sequential algorithm. Speculative decoding attacks this bottleneck by introducing a smaller or more efficient draft model (such as EAGLE-3) that proposes several candidate tokens ahead of time, which the target model then verifies in parallel. This breaks the serial latency into independent steps, sharply reducing the number of forward passes needed to generate long sequences, without changing the target model's weights.
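Below is a greedy sketch of one propose-and-verify step, assuming draft and target are callables returning logits of shape [batch, seq, vocab] and a batch size of one; production systems (including EAGLE-style drafters) add rejection sampling for lossless sampled decoding and reuse KV caches:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    # 1. The cheap draft model proposes k candidate tokens autoregressively.
    proposal = tokens
    for _ in range(k):
        next_tok = draft(proposal).argmax(dim=-1)[:, -1:]
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2. The target model scores every position in one parallel forward pass.
    target_choice = target(proposal).argmax(dim=-1)

    # 3. Accept the longest prefix of draft tokens the target agrees with.
    accepted = tokens.shape[1]
    while accepted < proposal.shape[1] and proposal[0, accepted] == target_choice[0, accepted - 1]:
        accepted += 1

    # 4. Append the target's own next token, a free extra token either way.
    bonus = target_choice[:, accepted - 1 : accepted]
    return torch.cat([proposal[:, :accepted], bonus], dim=-1)
```

When the draft's acceptance rate is high, each target forward pass yields several tokens instead of one, which is where the latency win comes from.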

If you want an immediate generation speedup without retraining or quantizing the model, speculative decoding is the technique to reach for. It composes cleanly with the other optimization methods, stacking additional throughput and latency gains on top.

Figure 5. The draft-target approach to speculative decoding uses a two-model architecture: for the input "The quick", the draft proposes "brown", "fox", "hopped", and "over"; the target verifies "brown" and "fox", rejects "hopped" and everything after it, and contributes its own token "jumped" from the same forward pass
Pros:
– Effectively reduces decode latency
– Stacks across the stack with PTQ, QAT, QAD, and NVFP4
Cons:
– Needs tuning (the acceptance rate is critical)
– A second model or extra head must be maintained, depending on the variant
Table 4. Pros and cons of speculative decoding

For more information, see An Introduction to Speculative Decoding for Reducing Latency in AI Inference.

5. Pruning plus knowledge distillation

Pruning is a structural optimization that shrinks a model by removing weights, layers, or attention heads. Knowledge distillation then teaches the slimmed-down model to reason like its larger teacher. This multi-stage strategy improves efficiency permanently, because the model's baseline compute and memory footprint are reduced for good.
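As a toy sketch of the idea, the snippet below applies structured magnitude pruning to a single linear layer; real pipelines (such as Model Optimizer's pruning tools) score heads, channels, or entire layers across the whole network and re-wire downstream modules accordingly:

```python
import torch

def prune_linear_channels(layer: torch.nn.Linear, keep_ratio: float = 0.5) -> torch.nn.Linear:
    """Drop the output channels with the smallest L2 norm, permanently
    shrinking the layer's compute and memory footprint."""
    norms = layer.weight.norm(dim=1)                 # one importance score per output channel
    n_keep = max(1, int(layer.out_features * keep_ratio))
    keep = norms.topk(n_keep).indices.sort().values  # indices of surviving channels

    pruned = torch.nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned  # downstream layers must be re-wired to the new width
```

After pruning, a distillation loss like the one shown in the QAD section can be reused as-is to fine-tune the smaller student against the original model acting as teacher.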

Use pruning with knowledge distillation when the other techniques on this list cannot meet your application's memory or compute-efficiency requirements. It is also worth considering when your team is willing to make larger changes to an existing model to better fit a specialized downstream use case.

Figure 6. Teacher and student outputs in knowledge distillation training: despite being more compact, the student produces an output probability vector that closely matches the teacher's
Pros:
– Permanently reduces parameter count, locking in structural cost savings
– Lets a smaller model perform like a larger one
– Supports aggressive pruning, with distillation recovering the accuracy
Cons:
– Requires more workflow effort than PTQ alone
Table 5. Pros and cons of pruning plus knowledge distillation

For more information, see Pruning and Distilling Large Language Models Using NVIDIA TensorRT Model Optimizer.

Get started with AI model optimization

Optimization techniques come in many shapes and sizes. This post highlighted five major model optimization techniques available through Model Optimizer:

  • PTQ, QAT, QAD, pruning, and knowledge distillation shrink models and make them more efficient to run, reducing deployment cost.
  • Speculative decoding speeds up inference by cutting the latency of sequential generation.

To go deeper, explore the in-depth article that accompanies each technique for technical breakdowns, performance insights, and hands-on Jupyter Notebooks.

 
