
![nvmath-python](https://developer.download.nvidia.com/images/cudss-nvmath-python-green-r4@4x_CUT-1.png)

# nvmath-python

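nvmath-python (Beta) is an [open-source library](https://github.com/nvidia/nvmath-python) that bridges the Python scientific computing community and the [NVIDIA CUDA-X™ math libraries](https://developer.nvidia.cn/gpu-accelerated-libraries) by reimagining Python's performance-oriented APIs. It interoperates with and complements existing array libraries such as NumPy, CuPy, and PyTorch, and raises performance further through stateful APIs, just-in-time kernel fusion, custom callbacks, and scaling to multiple GPUs.

Python practitioners, library developers, and GPU kernel developers are adopting **nvmath-python** as a productive tool for scaling their scientific computing, data science, and AI workflows with modest effort.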

[Install with pip](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#install-from-pypi "Install with pip") | [Install with conda](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#install-from-conda "Install with conda")

**Other links:**

[Build from source](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#build-from-source "Build from source") | [GitHub](https://github.com/NVIDIA/nvmath-python/ "GitHub")

* * *

## Key Features

### Intuitive Pythonic API

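- **nvmath-python** redesigns the math library APIs to support **sophisticated use cases** that are hard to express with NumPy-like APIs or that would otherwise force a performance trade-off.
- **Host APIs** offer out-of-the-box simplicity together with rich customization through optional parameters, which expose the "knobs" of the underlying NVIDIA math libraries. Host APIs come in two flavors, generic and specialized. Generic APIs aim for a consistent user experience across memory and execution spaces, which makes them a good fit for portable code; in exchange, they may not support certain hardware-dependent data types and may not fully exploit the capabilities of a particular device. Specialized APIs have a narrower scope and may target only specific hardware platforms, but they exploit the hardware more fully at the cost of portability.
- **Device APIs** let you embed nvmath-python library calls in custom kernels written with Python compilers such as numba-cuda. You no longer need to write GEMM or FFT kernels from scratch.
- **Host APIs with callbacks** let you embed custom Python code in nvmath-python calls. An internal JIT mechanism compiles the custom code and fuses it with the nvmath-python operation for peak performance.
- **Stateful (class-form) APIs** let you split a complete math operation into specification, planning, autotuning, and execution phases. The expensive specification, planning, and autotuning steps run once up front, and their cost is amortized over many subsequent executions.
- Integration with the [Python logging facility](https://docs.python.org/3/library/logging.html) gives runtime insight into the details of the specification, planning, autotuning, and execution machinery.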
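The stateful pattern and the logging integration described in this list can be illustrated with a small, library-independent sketch. `PlannedScale`, its `plan` and `execute` methods, and the logger name are hypothetical names chosen for illustration; they mimic the specify, plan, and execute phases in plain Python and are not part of nvmath-python. Only the `logging` calls come from the standard library.

```python
import logging

# Route log records to the console; a library that logs through the standard
# `logging` module becomes visible in the same way.
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s %(levelname)s %(message)s",
    force=True,
)
logger = logging.getLogger("stateful_sketch")  # hypothetical logger name


class PlannedScale:
    """Hypothetical stateful operation y = factor * x: planned once, run many times."""

    def __init__(self, length):
        # Specification phase: record the problem shape.
        self.length = length
        self.plan_count = 0
        self._chunk = None

    def plan(self):
        # Planning phase: stands in for an expensive one-time setup step.
        self.plan_count += 1
        self._chunk = max(1, self.length // 4)
        logger.info("planned with chunk size %d", self._chunk)

    def execute(self, x, factor):
        # Execution phase: reuses the plan, so no setup cost is paid here.
        if self._chunk is None:
            raise RuntimeError("call plan() before execute()")
        if len(x) != self.length:
            raise ValueError("operand length differs from the planned length")
        out = []
        for start in range(0, self.length, self._chunk):
            out.extend(factor * v for v in x[start:start + self._chunk])
        return out


op = PlannedScale(length=8)
op.plan()  # setup cost is paid once
results = [op.execute(list(range(8)), factor=f) for f in (1, 2, 3)]  # three runs share one plan
```

The point of the design is that `plan()` runs once while `execute()` runs as often as needed, so the setup cost per execution shrinks as the number of executions grows.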

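### Interoperability with the Python Ecosystem

- **nvmath-python** works together with popular Python packages. These include GPU-based packages such as CuPy, PyTorch, and RAPIDS, as well as CPU-based packages such as NumPy, SciPy, and scikit-learn. You can keep using familiar data structures and workflows while accelerating math operations through **nvmath-python**.
- **nvmath-python** is not a replacement for array libraries such as NumPy, CuPy, and PyTorch. It does not implement array APIs for array creation, indexing, and slicing. **nvmath-python** is intended to be used alongside these array libraries. All of these dependencies are optional, and you are free to choose which array library (or libraries) to use with **nvmath-python**.
- **nvmath-python** supports both CPU and GPU execution and memory spaces. It eases the transition between CPU and GPU implementations and enables hybrid CPU-GPU workflows.
- Combined with Python compilers such as [numba-cuda](https://github.com/NVIDIA/numba-cuda), you can implement custom GPU kernels with embedded **nvmath-python** library calls.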

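### Scalable Performance

- nvmath-python pushes the limits of performance, delivering results comparable to the underlying native CUDA-X libraries such as the [cuBLAS](https://developer.nvidia.cn/cublas) family, the [cuFFT family](https://developer.nvidia.cn/cufft), [cuDSS](https://developer.nvidia.cn/cudss), and [cuRAND](https://developer.nvidia.cn/curand). With the stateful APIs, you can amortize the cost of the specification, planning, and autotuning phases over multiple executions.
- For CPU execution, nvmath-python uses the [NVPL library](https://developer.nvidia.cn/nvpl) to deliver strong performance on [NVIDIA Grace™ CPU](https://www.nvidia.cn/data-center/grace-cpu/) platforms. It also accelerates x86 hosts through the MKL library.
- Combined with Python compilers such as **[numba-cuda](https://github.com/NVIDIA/numba-cuda)**, you can now write high-performance kernels that involve GEMM, FFT, and/or RNG operations. Here are some examples of turning the "impossible" into the possible with **nvmath-python**:
  - [DGEMM emulation using int8 Tensor Cores](https://github.com/NVIDIA/nvmath-python/blob/main/examples/device/cublasdx_fp64_emulation.py)
  - [Convolution kernel](https://github.com/NVIDIA/nvmath-python/blob/main/examples/device/cufftdx_convolution_performance.py)
  - [Monte Carlo kernel](https://github.com/NVIDIA/nvmath-python/blob/main/examples/device/curand_philox_uniform4.py)
- **nvmath-python** scales beyond a single GPU, and even beyond a single node, without major coding effort. The multi-GPU multi-node (MGMN) APIs make it easy to move from a single-GPU implementation to MGMN and to scale to thousands of GPUs. The library also provides helper utilities that reshape (repartition) data as needed, again without major coding effort.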

* * *

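## Supported Operations

### Dense Linear Algebra - Generalized Matrix Multiplication

The library provides a generalized matrix multiplication (GEMM) that computes  
𝐃 = 𝐹(α ⋅ 𝐀 ⋅ 𝐁 + β ⋅ 𝐂), where 𝐀, 𝐁, and 𝐂 are matrices with compatible dimensions and layouts, α and β are scalars, and 𝐹(𝐗) is a predefined function (epilog) applied elementwise to the matrix 𝐗.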
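To make the formula concrete, here is a minimal reference sketch of the same arithmetic in plain Python. The function names `gemm_epilog` and `relu` are illustrative assumptions, not nvmath-python APIs, and the nested-list representation is chosen only to keep the example dependency-free; it shows what is computed, not how the library computes it.

```python
def gemm_epilog(a, b, c, alpha=1.0, beta=0.0, epilog=None):
    """Reference D = F(alpha * A @ B + beta * C) on matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("A must have as many columns as B has rows")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have shape (m, n)")
    f = epilog if epilog is not None else (lambda x: x)  # identity when no epilog is given
    return [
        [
            f(alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j])
            for j in range(n)
        ]
        for i in range(m)
    ]


def relu(x):
    # Example of an elementwise epilog F.
    return x if x > 0 else 0.0


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B equals A
c = [[1.0, 1.0], [1.0, 1.0]]
d = gemm_epilog(a, b, c, alpha=2.0, beta=-1.0, epilog=relu)
# 2 * A - C = [[1, -5], [5, 7]]; ReLU clamps the negative entry to 0.
```

In a fused implementation the epilog is applied while the product is still in fast memory, which avoids a second pass over 𝐃; the sketch applies it per element for the same reason.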

#### Documentation

- [Specialized Host APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/linalg/index.html)
- Generic Host APIs (coming soon)
- [Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/cublas.html)
- Distributed APIs (coming soon)

#### Tutorials and Examples

- [Blog: "Fusing Epilog Operations with Matrix Multiplication Using nvmath-python"](https://developer.nvidia.cn/blog/fusing-epilog-operations-with-matrix-multiplication-using-nvmath-python/)
- [Tutorial: "Introduction to GEMM with nvmath-python"](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/01_introduction.ipynb)
- [Tutorial: "Epilogs"](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/02_epilogs.ipynb)
- [Tutorial: "Implementing a Neural Network with nvmath-python"](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/03_backpropagation.ipynb)
- [Tutorial: "FP8 Computations with nvmath-python"](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/04_fp8.ipynb)
- [Host API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/linalg/advanced/matmul)
- [Device API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/device)

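- The **[Host API](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/linalg/index.html)** provides a specialized API in the nvmath.linalg.advanced submodule, backed by the [cuBLASLt library](https://developer.nvidia.com/nvmath-python). This API supports only the GPU execution space. The key distinguishing feature of the library is the ability to fuse the matrix operation and the epilog into a **single fused kernel**. The library also offers facilities for additional **autotuning**, which selects the best fused kernel for specific hardware and a specific problem size. Both **stateful** and **stateless** APIs are provided. Generic APIs will be implemented in a future release.
- The **[Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/cublas.html)** live in the nvmath.device submodule and are backed by the [cuBLASDx library](https://docs.nvidia.com/cuda/cublasdx/). They can be used inside [numba-cuda](https://github.com/NVIDIA/numba-cuda) kernels.
- **Distributed APIs** will be implemented in a future release.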

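![nvmath-python linear algebra performance](https://developer.download.nvidia.com/images/linear_alg_perf_CUT.jpg)
_Advanced matmul performance on H100 PCIe for matrices A[m × n], B[n × k], and bias[m], where m = 65536, n = 16384, k = 8192. The operands and the result use the bfloat16 data type; float32 is used for compute._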

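### Fast Fourier Transform

The library provides forward and inverse FFTs for complex-to-complex (C2C), complex-to-real (C2R), and real-to-complex (R2C) discrete Fourier transforms.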
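The relationship between the three transform kinds can be sketched with a naive discrete Fourier transform in plain Python. The helper names `dft`, `rfft`, and `irfft` below are defined locally for illustration and are not calls into nvmath-python; the sketch also assumes the common convention of scaling only the inverse transform by 1/n. It shows why R2C needs to return only n // 2 + 1 outputs (the rest follow from Hermitian symmetry) and how C2R recovers the real signal.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) C2C transform; the inverse is scaled by 1/n (assumed convention)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def rfft(x):
    """R2C: real input, keep only the n // 2 + 1 non-redundant outputs."""
    return dft([complex(v) for v in x])[: len(x) // 2 + 1]


def irfft(half, n):
    """C2R: rebuild the full Hermitian spectrum, then invert to real values."""
    full = list(half) + [half[n - k].conjugate() for k in range(len(half), n)]
    return [v.real for v in dft(full, inverse=True)]


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0]
spectrum = rfft(signal)                  # 6 real samples -> 4 complex outputs
restored = irfft(spectrum, len(signal))  # back to 6 real samples
```

An FFT computes the same result in O(n log n) time; the quadratic version is used here only because it fits in a few lines.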

#### Documentation

- [Host APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/fft/index.html)
- [Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/cufft.html)
- [Distributed APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/distributed-apis/fft/index.html)

#### Tutorials and Examples

- [Host API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/fft)
- [Device API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/device)
- [Distributed FFT API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/distributed/fft)
- [Distributed Reshape API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/distributed/reshape)

![nvmath-python FFT performance](https://developer.download.nvidia.com/images/fast_four_trans_perf_CUT.jpg)
_Fast Fourier transform performance on H100 PCIe for FFTs of size 512, computed in batches of 1048576 (2^20) using the complex64 data type._

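- The [Host APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/fft/index.html) live in the nvmath.fft submodule and are backed by the [cuFFT](/cufft#section-cufft) library. The APIs support both CPU and GPU execution spaces. On NVIDIA Grace™ CPU platforms the CPU backend is the NVPL library, while on x86 hosts MKL is available as the CPU backend. The key distinguishing feature of the library is the ability to fuse the FFT operation and custom callbacks written as Python functions into a **single fused kernel**. The library also offers facilities for additional **autotuning**, which selects the best fused kernel for specific hardware and a specific problem size. Both **stateful** and **stateless** APIs are provided.
- The [Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/cufft.html) live in the nvmath.device submodule and are backed by the [cuFFTDx library](/cufft#section-cufftdx). They can be used from within [numba-cuda](https://github.com/NVIDIA/numba-cuda) kernels.
- The [Distributed APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/distributed-apis/fft/index.html) live in the nvmath.distributed.fft submodule and are backed by the [cuFFTMp library](/cufft#section-cufftmp), enabling users to solve distributed 2D and 3D FFT problems at exascale.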

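### Random Number Generation

The library provides device APIs for generating random numbers inside GPU kernels written with **[numba-cuda](https://github.com/NVIDIA/numba-cuda)**. It offers a collection of pseudorandom and quasirandom bit generators, as well as sampling from popular probability distributions.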

#### Documentation

- [Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/curand.html)

#### Tutorials and Examples

- [Device API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/device)

![Geometric Brownian motion paths](https://developer.download.nvidia.com/images/gbm_paths.png)
_Geometric Brownian motion stock-pricing kernel written in numba-cuda using the nvmath-python device RNG_

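- The **[Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/curand.html)** live in the nvmath.device submodule and are backed by the [cuRAND library](https://developer.nvidia.cn/curand). They can be used inside [**numba-cuda**](https://developer.nvidia.com/nvmath-python) kernels to run Monte Carlo simulations efficiently on the GPU. Note that the library does not provide corresponding host APIs; instead, it recommends the random number generation facilities of the respective array libraries, such as NumPy and CuPy.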

**Bit RNGs:**

- MRG32k3a
- MTGP Mersenne Twister
- XORWOW
- Sobol quasirandom number generators

**Distribution RNGs:**

- Uniform distribution
- Normal distribution
- Log-normal distribution
- Poisson distribution
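As a CPU-side illustration of the Monte Carlo use case mentioned in this section, the sketch below simulates geometric Brownian motion paths with the standard-library `random` module. It is not the device API: the function `gbm_paths` and all of its parameters are assumptions made for this example. A GPU kernel would follow the same recurrence, with each thread drawing its normal samples from a device generator.

```python
import math
import random


def gbm_paths(s0, mu, sigma, horizon, steps, n_paths, seed=0):
    """Simulate geometric Brownian motion paths using the exact log-normal step."""
    rng = random.Random(seed)  # seeded CPU generator from the standard library
    dt = horizon / steps
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    paths = []
    for _ in range(n_paths):
        price = s0
        path = [price]
        for _ in range(steps):
            z = rng.gauss(0.0, 1.0)  # standard normal draw
            price *= math.exp(drift + vol * z)
            path.append(price)
        paths.append(path)
    return paths


paths = gbm_paths(s0=100.0, mu=0.05, sigma=0.2, horizon=1.0, steps=20, n_paths=5000, seed=42)
mean_terminal = sum(p[-1] for p in paths) / len(paths)
expected_terminal = 100.0 * math.exp(0.05 * 1.0)  # E[S_T] = S0 * exp(mu * T)
```

Comparing `mean_terminal` with `expected_terminal` is a quick sanity check: the sample mean should approach the analytical expectation as the number of paths grows.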

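### Sparse Linear Algebra - Direct Solver

The library provides specialized APIs for sparse linear algebra computations. Currently, it offers a specialized direct solver API for solving systems of linear equations  
𝐀 ⋅ 𝐗 = 𝐁, where 𝐀 is the known left-hand side (LHS) sparse matrix, 𝐁 is the known right-hand side (RHS) vector or a matrix of compatible shape, and 𝐗 is the unknown solution returned by the solver.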

#### Documentation

- [Specialized Host APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/sparse/index.html)
- Generic Host APIs (future)
- Device APIs (future)
- Distributed APIs (coming soon)

#### Tutorials and Examples

- [Host DSS API examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples/sparse/advanced/direct_solver)

![nvmath-python](https://developer.download.nvidia.com/images/DatacenterKV-2-1536x864.png)

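- The **[Host API](https://docs.nvidia.com/cuda/nvmath-python/latest/host-apis/sparse/index.html)** provides a specialized API in the nvmath.sparse.advanced submodule, backed by the [cuDSS library](https://developer.nvidia.cn/cudss). This API supports only GPU execution and hybrid GPU-CPU execution spaces. The key distinguishing feature of the library is the ability to solve a series of linear systems in batches. Explicit batching is supported when the linear systems are supplied as sequences of LHS and/or RHS; implicit batching is supported when the sequence is inferred from higher-dimensional tensors. Both **stateful** and **stateless** APIs are provided. Generic APIs will be implemented in a future release.
- **Device APIs** will be provided in a future release.
- **Distributed APIs** will be implemented in a future release.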
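The idea of explicit batching can be sketched without any GPU library. The example below is a dense toy in plain Python: `solve` uses Gaussian elimination with partial pivoting, and `solve_batch` applies it to a sequence of (LHS, RHS) pairs. Both names are hypothetical and chosen for illustration; a sparse direct solver such as cuDSS works on sparse factorizations instead, so this only illustrates the batching semantics, not the solver's algorithm or API.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting (dense toy)."""
    n = len(a)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off growth.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def solve_batch(lhs_seq, rhs_seq):
    """Explicit batching: one independent system per (LHS, RHS) pair."""
    if len(lhs_seq) != len(rhs_seq):
        raise ValueError("need one RHS per LHS")
    return [solve(a, b) for a, b in zip(lhs_seq, rhs_seq)]


lhs = [
    [[2.0, 0.0], [0.0, 4.0]],
    [[1.0, 1.0], [0.0, 2.0]],
]
rhs = [[2.0, 8.0], [3.0, 4.0]]
solutions = solve_batch(lhs, rhs)
```

Here `lhs` can also be read as a single tensor of shape (2, 2, 2) whose leading dimension indexes the systems, which is the intuition behind implicit batching.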

* * *

## Resources

- [nvmath-python documentation](https://docs.nvidia.com/cuda/nvmath-python/index.html)
- [nvmath-python on GitHub](https://github.com/nvidia/nvmath-python)
- [nvmath-python tutorials (Jupyter Notebooks)](https://github.com/NVIDIA/nvmath-python/tree/main/notebooks/)
- [nvmath-python examples](https://github.com/NVIDIA/nvmath-python/tree/main/examples)
- [CUDA-X GPU-accelerated libraries](https://developer.nvidia.cn/gpu-accelerated-libraries)
- [nvmath-python presentation at GTC](https://www.nvidia.cn/on-demand/session/gtc24-s62162/?start=1394)
- [Blog: "Fusing Epilog Operations with Matrix Multiplication Using nvmath-python"](https://developer.nvidia.cn/blog/fusing-epilog-operations-with-matrix-multiplication-using-nvmath-python/)
- [Tutorial: "Accelerating and Scaling Python for HPC"](https://github.com/samaid/pyhpc-tutorial)

**Get started with nvmath-python**

[Install now](https://docs.nvidia.com/cuda/nvmath-python/latest/quickstart.html "Install now")