AscendC 从入门到精通系列（四）使用 Pybind 调用 AscendC 算子

作者：zjun

2024-12-18
上海
本文字数：4117 字
阅读完需：约 14 分钟

如果已经通过 Ascend C 编程语言实现了算子，那该如何通过 pybind 进行调用呢？

1 Pybind 调用介绍

通过 PyTorch 框架进行模型的训练、推理时，会调用很多算子进行计算，其中的调用方式与 kernel 编译流程有关。

对于自定义算子工程，需要使用 PyTorch Ascend Adapter 中的 OP-Plugin 算子插件对功能进行扩展，让 torch 可以直接调用自定义算子包中的算子，详细内容可以参考 PyTorch 框架；
对于 KernelLaunch 开放式算子编程的方式，通过适配

Pybind 调用，可以实现 PyTorch 框架调用算子 kernel 程序。Pybind 是一个用于将 C++代码与 Python 解释器集成的库，实现原理是通过将 C++代码编译成动态链接库（DLL）或共享对象（SO）文件，使用 Pybind 提供的 API 将算子核函数与 Python 解释器进行绑定。在 Python 解释器中使用绑定的 C++函数、类和变量，从而实现 Python 与 C++代码的交互。在 Kernel 直调中使用时，就是将 Pybind 模块与算子核函数进行绑定，将其封装成 Python 模块，从而实现两者交互。

2 工程目录结构

该样例的工程目录结构如下：

├── CppExtensions │   ├── add_custom_test.py      // Python调用脚本 │   ├── add_custom.cpp          // 算子实现 │   ├── CMakeLists.txt          // 编译工程文件 │   ├── pybind11.cpp            // pybind11函数封装│   └── run.sh                  // 编译运行算子的脚本

复制代码

基于该算子工程，开发者进行算子开发的步骤如下：

完成算子 kernel 侧实现。
编写算子调用应用程序和定义 pybind 模块 pybind11.cpp。
编写 Python 调用脚本 add_custom_test.py，包括生成输入- 数据和真值数据，调用封装的模块以及验证结果。
编写 CMake 编译配置文件 CMakeLists.txt。
根据实际需要修改编译运行算子的脚本 run.sh 并执行该脚本，完成算子的编译运行和结果验证。

3 环境准备

3.1 安装 pytorch (这里以 2.1.0 版本为例)

// aarch64环境上安装pip3 install torch==2.1.0
// x86环境上安装pip3 install torch==2.1.0+cpu  --index-url https://download.pytorch.org/whl/cpu

复制代码

3.2 安装 torch-npu（昇腾适配 torch 的开发工程，这里以 Pytorch2.1.0、python3.9、CANN 版本 8.0.RC1.alpha002 为例）

 git clone https://gitee.com/ascend/pytorch.git -b v6.0.rc1.alpha002-pytorch2.1.0 cd pytorch/ bash ci/build.sh --python=3.9 pip3 install dist/*.whl

复制代码

3.3 安装 pybind11

pip3 install pybind11

复制代码

4 工程实现

4.1 算子 kernel 实现

之前的文章中，已经实现过，add_custom.cpp 内容如下：

/** * @file add_custom.cpp * * Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. */#include "kernel_operator.h"constexpr int32_t BUFFER_NUM = 2; // tensor num for each queue
class KernelAdd {public:    __aicore__ inline KernelAdd() {}    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength)    {        this->blockLength = totalLength / AscendC::GetBlockNum();        this->tileNum = 8;        this->tileLength = this->blockLength / this->tileNum / BUFFER_NUM;        xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);        yGm.SetGlobalBuffer((__gm__ half *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);        zGm.SetGlobalBuffer((__gm__ half *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));        pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(half));        pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(half));    }    __aicore__ inline void Process()    {        int32_t loopCount = this->tileNum * BUFFER_NUM;        for (int32_t i = 0; i < loopCount; i++) {            CopyIn(i);            Compute(i);            CopyOut(i);        }    }
private:    __aicore__ inline void CopyIn(int32_t progress)    {        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);        inQueueX.EnQue(xLocal);        inQueueY.EnQue(yLocal);    }    __aicore__ inline void Compute(int32_t progress)    {        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();        AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);        outQueueZ.EnQue<half>(zLocal);        inQueueX.FreeTensor(xLocal);        inQueueY.FreeTensor(yLocal);    }    __aicore__ inline void CopyOut(int32_t progress)    {        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);        outQueueZ.FreeTensor(zLocal);    }
private:    AscendC::TPipe pipe;    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;    AscendC::GlobalTensor<half> xGm;    AscendC::GlobalTensor<half> yGm;    AscendC::GlobalTensor<half> zGm;    uint32_t blockLength;    uint32_t tileNum;    uint32_t tileLength;};
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength){    KernelAdd op;    op.Init(x, y, z, totalLength);    op.Process();}

复制代码

4.2 实现 pybind11.cpp

1、按需包含头文件。需要注意的是，需要包含对应的核函数调用接口声明所在的头文件 alcrtlaunch_{kernel_name}.h（该头文件为工程框架自动生成，

#include"aclrtlaunch_add_custom.h"），kernel_name为算子核函数的名称。#include <pybind11/pybind11.h>#include <torch/extension.h>
#include "aclrtlaunch_add_custom.h"#include "torch_npu/csrc/core/npu/NPUStream.h"

复制代码

2、编写框架调用程序

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y){    // 运行资源申请，通过c10_npu::getCurrentNPUStream()的函数获取当前NPU上的流    auto acl_stream = c10_npu::getCurrentNPUStream().stream(false);    // 分配Device侧输出内存    at::Tensor z = at::empty_like(x);    uint32_t blockDim = 8;    uint32_t totalLength = 1;    for (uint32_t size : x.sizes()) {        totalLength *= size;    }    // 用ACLRT_LAUNCH_KERNEL接口调用核函数完成指定的运算    ACLRT_LAUNCH_KERNEL(add_custom)    (blockDim, acl_stream, const_cast<void *>(x.storage().data()), const_cast<void *>(y.storage().data()),     const_cast<void *>(z.storage().data()), totalLength);     // 将Device上的运算结果拷贝回Host并释放申请的资源     return z;}

复制代码

需要注意的是，输入 x，y 的内存是在 Python 调用脚本 add_custom_test.py（往下看）中分配的。3、定义 Pybind 模块将 C++函数封装成 Python 函数。PYBIND11_MODULE 是 Pybind11 库中的一个宏，用于定义一个 Python 模块。它接受两个参数，第一个参数是封装后的模块名，第二个参数是一个 Pybind11 模块对象，用于定义模块中的函数、类、常量等。通过调用 m.def()方法，可以将步骤 2 中函数 my_add::run_add_custom()转成 Python 函数 run_add_custom，使其可以在 Python 代码中被调用。

PYBIND11_MODULE(add_custom, m) { // 模块名add_custom，模块对象m  m.doc() = "add_custom pybind11 interfaces";  // optional module docstring  m.def("run_add_custom", &my_add::run_add_custom, ""); // 将函数run_add_custom与Pybind模块进行绑定}

复制代码

4.3 编写 Python 调用脚本

在 Python 调用脚本中，使用 torch 接口生成随机输入数据并分配内存，通过导入封装的自定义模块 add_custom，调用自定义模块 add_custom 中的 run_add_custom 函数，从而在 NPU 上执行算子。

import torchimport torch_npufrom torch_npu.testing.testcase import TestCase, run_testsimport sys, ossys.path.append(os.getcwd())import add_customtorch.npu.config.allow_internal_format = Falseclass TestCustomAdd(TestCase):    def test_add_custom_ops(self):        // 分配Host侧输入内存，并进行数据初始化        length = [8, 2048]        x = torch.rand(length, device='cpu', dtype=torch.float16)        y = torch.rand(length, device='cpu', dtype=torch.float16)        // 分配Device侧输入内存，并将数据从Host上拷贝到Device上        x_npu = x.npu()        y_npu = y.npu()        output = add_custom.run_add_custom(x_npu, y_npu)        cpuout = torch.add(x, y)        self.assertRtolEqual(output, cpuout)if __name__ == "__main__":    run_tests()

复制代码

4.4 编写 CMakeLists 实现 pybind11 文件编译

编译进工程的方式有很多，各个项目不一样，这里提供一个参考：operator/AddCustomSample/KernelLaunch/CppExtensions/CMakeLists.txt · Ascend/samples - 码云 - 开源中国 (gitee.com)

发布于: 刚刚阅读数: 4

zjun

关注

还未添加个人签名 2020-03-06 加入

还未添加个人简介

发布

暂无评论

创作场景

AscendC 从入门到精通系列（四）使用 Pybind 调用 AscendC 算子

1 Pybind 调用介绍

2 工程目录结构

3 环境准备

3.1 安装 pytorch (这里以 2.1.0 版本为例)

3.2 安装 torch-npu（昇腾适配 torch 的开发工程，这里以 Pytorch2.1.0、python3.9、CANN 版本 8.0.RC1.alpha002 为例）

3.3 安装 pybind11

4 工程实现

4.1 算子 kernel 实现

4.2 实现 pybind11.cpp

4.3 编写 Python 调用脚本

4.4 编写 CMakeLists 实现 pybind11 文件编译

zjun

评论