
How to Use Ascend's ATB Acceleration Library?

Author: zjun
  • 2024-12-18
    Shanghai


1 Introduction

The Ascend Transformer Boost acceleration library (hereafter "the ATB library") is an efficient, reliable acceleration library built on Huawei Ascend AI processors and designed specifically for training and inference of Transformer-class models. For background, see the earlier article "ATB是什么?" (What is ATB?).


So how does a programming newcomer implement an ATB operator?

2 Implementing an ATB Operator

The following draws on:


Operator Usage Guide - Acceleration Library Usage Guide - Ascend Transformer Boost Acceleration Library - Domain Acceleration Library Development - CANN Commercial Edition 8.0.RC2.2 Documentation - Ascend Community


Implementing an ATB operator takes roughly the following 10 steps, as shown in the flow diagram below.



step 1: Include the ACL and acceleration library header files


#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


Note the following:


  • Install the ATB .so packages first; only then are the headers available and will the program link without errors.

  • Different operators may require different header files.

  • Add any other headers you need yourself.


Reference:


Installation and Deployment - Ascend Transformer Boost Acceleration Library - Domain Acceleration Library Development - CANN Commercial Edition 8.0.RC2.2 Documentation - Ascend Community


step 2: Configure the deviceId


uint32_t deviceId = 0;
aclError status = aclrtSetDevice(deviceId);


Set the deviceId according to your needs. For example, on a single machine with multiple cards, an 8-card Ascend node exposes deviceIds 0-7.
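
A minimal sketch of checking how many devices are visible before choosing an id (aclrtGetDeviceCount is the standard ACL runtime query):

uint32_t deviceCount = 0;
aclError ret = aclrtGetDeviceCount(&deviceCount); // how many NPU devices are visible
if (ret != ACL_SUCCESS || deviceCount == 0) {
    std::cout << "no visible NPU device!";
    exit(0);
}
uint32_t deviceId = 0; // any value in [0, deviceCount - 1]
ret = aclrtSetDevice(deviceId);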


step 3: Create the operator object instance. As introduced in the earlier article "ATB是什么?" (What is ATB?), ATB has three kinds of operator implementations; each is described below.


1. Basic Operation (native operators)


First step: Construct the Operation parameters


Instantiate the parameter struct for the operator you want to create; the struct interfaces are defined in atb/infer_op_params.h and atb/train_op_params.h.


Taking the Mul operator as an example: Mul belongs to the Elewise category, so its parameters are constructed as follows:


atb::infer::ElewiseParam mulParam;
mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;


Second step: Create the operator object instance


atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(mulParam, &op);


2. Plugin mechanism (plugin operators)


Plugin operators require a kernel implemented with Ascend C or by other means.


For a complete example, jump straight to Section 3.2 of this article.


Reference:


Plugin Mechanism - ATB Operators


First step: Develop the operator


The example uses Ascend C to create an Add operator; you can implement custom operators by other means according to your needs.


Reference: kernel_add.cpp


plugin_op_demo/kernel/kernel_add.cpp · Si1verBul1et623548/atb-op-demo - Gitee.com


Second step: Create the operator object instance


CustomOperation *op = new CustomOperation("CustomOperation");


3. Graph Frame (graph operators)


Graph operators can be created and used in two ways: by configuring TensorIds or by configuring TensorNames.


The example graph operator structure (shown as a diagram in the original post) has two nodes: node 0 adds inputs a and b to produce a_add_b_output, and node 1 adds a_add_b_output and c to produce output. From this structure, the TensorId and TensorName assignments are:


Table 1: TensorId-to-TensorName mapping

Tensor              | TensorId | TensorName
graph input 0       | 0        | a
graph input 1       | 1        | b
graph input 2       | 2        | c
graph output        | 3        | output
graph intermediate  | 4        | a_add_b_output


Graph-building method 1: configure TensorIds


First step: Construct the Operation parameters


Unlike single-operator parameters, graph operator parameters carry graph-level information: the graph's nodes and its numbers of input, output, and intermediate tensors.


First, from the designed graph structure, count the graph's input tensors (say x of them), output tensors (say y), and intermediate tensors (say z). Input tensor ids take values in [0, x - 1], output tensor ids in [x, x + y - 1], and intermediate tensor ids in [x + y, x + y + z - 1]. For the example in Table 1, x = 3, y = 1, and z = 1, so the inputs get ids 0-2, the output gets id 3, and the intermediate gets id 4 (see the Tensor and TensorId columns).
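
For readability the ids can be named with an enum, the same technique the complete demo in Section 3.3.1 uses; a minimal sketch for the Table 1 graph (x = 3, y = 1, z = 1):

// Ids must run inputs first, then outputs, then intermediates
enum DemoTensorId : uint32_t {
    TENSOR_A = 0,   // graph input "a"
    TENSOR_B,       // graph input "b"
    TENSOR_C,       // graph input "c"
    TENSOR_OUTPUT,  // graph output "output" (id 3 = x)
    TENSOR_A_ADD_B  // intermediate "a_add_b_output" (id 4 = x + y)
};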


Then configure each node's information: the single-operator object instance created beforehand, plus its input and output tensors. Within the graph, a node's inputs and outputs may be graph inputs, graph outputs, or intermediates, so choose each id from the range matching the tensor's type.


The op0 and op1 instances in the example are created as described for single operators above.


atb::GraphParam graphParam;
graphParam.inTensorNum = 3;               // number of graph input tensors
graphParam.outTensorNum = 1;              // number of graph output tensors
graphParam.internalTensorNum = 1;         // number of graph intermediate tensors
graphParam.nodes.resize(2);               // number of nodes (single operators) in the graph
graphParam.nodes[0].operation = op0;      // single-operator instance for node 0
graphParam.nodes[0].inTensorIds = {0, 1}; // ids of the input tensors node 0 consumes
graphParam.nodes[0].outTensorIds = {4};   // id of the output tensor node 0 produces
graphParam.nodes[1].operation = op1;      // single-operator instance for node 1
graphParam.nodes[1].inTensorIds = {4, 2}; // ids of the input tensors node 1 consumes
graphParam.nodes[1].outTensorIds = {3};   // id of the output tensor node 1 produces


Second step: Create the operator object instance


atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(graphParam, &op);


Graph-building method 2: configure TensorNames


Building a graph by TensorId requires defining every id up front, which is tedious. This method instead identifies each tensor by a string, which is far more workable. The example mapping is in the Tensor and TensorName columns of Table 1 above.


First step: Create the graph operator builder


atb::GraphOpBuilder *graphOpBuilder;
CreateGraphOpBuilder(&graphOpBuilder);


Second step: Initialize the graph operator builder


// Lambda that infers the output TensorDescs (DataType, Format, Shape, etc.) from the graph's input TensorDescs
atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs, atb::SVector<atb::TensorDesc> &outTensorDescs) {
    outTensorDescs.at(0) = inTensorDescs.at(0);
    return atb::NO_ERROR;
};
graphOpBuilder->Init("DemoGraphOperation", inferShapeFunc, {"a", "b", "c"}, {"output"});


Third step: Build the graph with the graph operator builder


While building the graph you can reshape a tensor by defining a lambda; the total element count must stay the same before and after the reshape.
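
A minimal sketch of such a reshape lambda (atb::ReshapeFunc and the builder's Reshape call are the same interfaces the complete example in Section 3.3.2 uses; the tensor names here are illustrative):

// Merge the first two dims of a 3D tensor: [d0, d1, d2] -> [d0 * d1, d2];
// the total element count is unchanged, as required
atb::ReshapeFunc mergeFirstTwoDims = [](const atb::Dims &oldShape, atb::Dims &newShape) {
    newShape.dimNum = 2;
    newShape.dims[0] = oldShape.dims[0] * oldShape.dims[1];
    newShape.dims[1] = oldShape.dims[2];
};
graphOpBuilder->Reshape("a", mergeFirstTwoDims, "a_reshaped"); // registers "a_reshaped" as a new tensor name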


Single operators such as op0 are created as described in the single-operator section above.


graphOpBuilder->AddOperation(op0, {"a", "b"}, {"a_add_b_output"});
graphOpBuilder->AddOperation(op1, {"a_add_b_output", "c"}, {"output"});


Fourth step: Build the graph operator and destroy the builder


atb::Operation *op = graphOpBuilder->Build(); // check op against nullptr before use
DestroyGraphOpBuilder(graphOpBuilder);        // destroy the graph operator builder


step 4: Create the input and output tensors and store them in a VariantPack. The VariantPack holds the lists of input and output tensors. Each input tensor passed in a VariantPack must be larger than 0 and no larger than 256 GB.
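
Before the full listing below, a small guard for that size constraint might look like this (a sketch; ValidateInTensors is an illustrative helper, not an ATB API):

// Check every VariantPack input tensor before Setup/Execute
constexpr uint64_t kMaxTensorBytes = 256ULL * 1024 * 1024 * 1024; // 256 GB
bool ValidateInTensors(const atb::VariantPack &pack)
{
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        uint64_t size = pack.inTensors.at(i).dataSize;
        if (size == 0 || size > kMaxTensorBytes) {
            return false; // outside the documented (0, 256 GB] range
        }
    }
    return true;
}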


// Set the properties of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Fill in each input tensor and allocate its device memory. The tensors are set
// up by hand here; real code can convert from torch tensors or other structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // an all-zero host buffer
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU
    }
}

// Fill in each output tensor and allocate its device memory, as for the inputs
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

// Build all input and output tensors as above and store them in a VariantPack
atb::VariantPack pack;
atb::SVector<atb::TensorDesc> intensorDescs;
atb::SVector<atb::TensorDesc> outtensorDescs;

uint32_t inTensorNum = op->GetInputNum();
uint32_t outTensorNum = op->GetOutputNum();
pack.inTensors.resize(inTensorNum);
intensorDescs.resize(inTensorNum);

CreateInTensorDescs(intensorDescs);
CreateInTensors(pack.inTensors, intensorDescs);

outtensorDescs.resize(outTensorNum);
pack.outTensors.resize(outTensorNum);
op->InferShape(intensorDescs, outtensorDescs);
CreateOutTensors(pack.outTensors, outtensorDescs);


step 5: Create the context and configure the stream. The Context is mainly responsible for managing the streams used on the NPU.


atb::Context *context = nullptr;
st = atb::CreateContext(&context);

aclrtStream stream = nullptr;
status = aclrtCreateStream(&stream);
context->SetExecuteStream(stream);


step 6: Call the Setup interface to compute the workspace size


uint64_t workspaceSize = 0;
st = op->Setup(pack, workspaceSize, context);


step 7: Allocate NPU memory according to the workspace size


void *workspace = nullptr;
if (workspaceSize != 0) {
    status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    if (status != 0) {
        std::cout << "alloc error!";
        exit(0);
    }
}


When the workspace size is 0, skip this step; attempting the allocation anyway will raise an error.


step 8: Call the Execute interface to run the operator


st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);


step 9: Destroy the created objects and free the memory


// Synchronize the stream, i.e. wait for the device-side computation to finish
auto ret = aclrtSynchronizeStream(stream);
if (ret != 0) {
    std::cout << "sync error!";
    exit(0);
}

status = aclrtDestroyStream(stream); // destroy the stream
st = atb::DestroyOperation(op);      // destroy the operator object
st = atb::DestroyContext(context);   // destroy the context
// Free the input tensors
for (size_t i = 0; i < pack.inTensors.size(); i++) {
    aclrtFree(pack.inTensors.at(i).deviceData);
}
// Free the output tensors (the loop already frees every output, so no extra per-tensor free is needed)
for (size_t i = 0; i < pack.outTensors.size(); i++) {
    aclrtFree(pack.outTensors.at(i).deviceData);
}
status = aclrtFree(workspace); // free the workspace
aclrtResetDevice(deviceId);    // reset the device


step 10: Run the demo. Compile the source file:


# Compile the demo with g++; demo.cpp is the demo source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" demo.cpp -l atb -l ascendcl -o demo


Here:


ATB_HOME_PATH is the installation path of the ATB library files.
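
Both path variables are normally exported by the toolkit's environment scripts; as a sketch (the paths below are illustrative defaults, not guaranteed for your installation):

# Typically set by the CANN / ATB set_env.sh scripts; adjust to your install
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
export ATB_HOME_PATH=/path/to/atb # wherever the ATB package is installed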


Run:


./demo # run the executable

3 Complete Code Files

3.1 Complete single-operator example

Name the file atb_mul_operation.cpp


// step1: include the ACL and acceleration library headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"

// Set the properties of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Fill in each input tensor and allocate its device memory. The tensors are set
// up by hand here; real code can convert from torch tensors or other structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // an all-zero host buffer
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU
    }
}

// Fill in each output tensor and allocate its device memory, as for the inputs
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

int main()
{
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the operator object instance; Mul belongs to the Elewise category
    // First: construct the Operation parameters
    atb::infer::ElewiseParam mulParam;
    mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;

    // Second: create the operator object instance
    atb::Operation *op = nullptr;
    atb::Status st = atb::CreateOperation(mulParam, &op);

    // step4: create the input/output tensors and store them in a VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call Setup to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call Execute to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the operator object
    st = atb::DestroyContext(context);   // destroy the context
    // Free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // Free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace); // free the workspace
    aclrtResetDevice(deviceId);    // reset the device

    return 0;
}


You can also refer to:


single_op_demo/single_op_demo.cpp · Si1verBul1et623548/atb-op-demo - Gitee.com

Compile and run:


# Compile the demo with g++; atb_mul_operation.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_mul_operation.cpp -l atb -l ascendcl -o atb_mul_operation

# Run the executable
./atb_mul_operation

3.2 Complete plugin-mechanism (plugin operator) example

Reference:


Si1verBul1et623548/atb-op-demo: gitee.com/geyunqi/atb-op-demo/tree/master/plugin_op_demo


Enter the plugin_op_demo directory and run


bash run.sh


The output appears in plugin_op_demo/build:


total 68
drwxr-xr-x. 3 root root  4096 Sep 29 20:02 ./
drwxr-xr-x. 5 root root  4096 Sep 29 20:02 ../
-rw-r--r--. 1 root root 14543 Sep 29 20:02 CMakeCache.txt
drwxr-xr-x. 6 root root  4096 Sep 29 20:02 CMakeFiles/
-rw-r--r--. 1 root root  5773 Sep 29 20:02 Makefile
-rw-r--r--. 1 root root  1664 Sep 29 20:02 cmake_install.cmake
-rwxr-xr-x. 1 root root 27720 Sep 29 20:02 libplugin_add.so*


As shown, the demo currently builds into a shared library (.so). The process it walks through, though, is enough to make clear how to write a plugin single operator.
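
As a sketch of consuming the resulting library (host_demo.cpp and the link layout are illustrative assumptions, not files from the demo repo):

# Hypothetical: link the plugin .so into a host program together with ATB and ACL
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" -L ./build host_demo.cpp -l plugin_add -l atb -l ascendcl -o host_demo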

3.3 Graph Frame (graph operators)

3.3.1 Implementation with graph-building method 1: configure TensorIds


Name the file atb_add_graph_by_tensor_id.cpp


// step1: include the ACL and acceleration library headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"

// Set the properties of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Fill in each input tensor and allocate its device memory. The tensors are set
// up by hand here; real code can convert from torch tensors or other structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2); // a host buffer filled with 2s
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU
    }
}

// Fill in each output tensor and allocate its device memory, as for the inputs
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

// Two points deserve close attention when constructing the graph parameters.
// 1. Tensor ids: the ATB graph interface distinguishes three tensor types
//    (input, output, and intermediate). Input/output tensors are the whole
//    graph's inputs and outputs; intermediate tensors live inside the graph.
//    Ids must run, from small to large, in the order inputs, outputs,
//    intermediates, and the count of each type must match the parameters.
// 2. Node order: arrange the nodes as an ordered queue that follows the
//    computation graph's topology, and keep the tensor-node relationships
//    consistent with the graph.
void CreateGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    // Graph-building flow
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 2;
    opGraph.nodes.resize(3);

    enum InTensorId { // define the tensor ids
        IN_TENSOR_A = 0,
        IN_TENSOR_B,
        IN_TENSOR_C,
        IN_TENSOR_D,
        ADD3_OUT, // graph output
        ADD1_OUT, // intermediate tensors
        ADD2_OUT
    };

    size_t nodeId = 0;
    atb::Node &addNode = opGraph.nodes.at(nodeId++);
    atb::Node &addNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addNode3 = opGraph.nodes.at(nodeId++);

    atb::infer::ElewiseParam addParam;
    addParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    atb::Status status = atb::CreateOperation(addParam, &addNode.operation);
    addNode.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B};
    addNode.outTensorIds = {ADD1_OUT};

    atb::infer::ElewiseParam addParam2;
    addParam2.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam2, &addNode2.operation);
    addNode2.inTensorIds = {IN_TENSOR_C, IN_TENSOR_D};
    addNode2.outTensorIds = {ADD2_OUT};

    atb::infer::ElewiseParam addParam3;
    addParam3.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam3, &addNode3.operation);
    addNode3.inTensorIds = {ADD1_OUT, ADD2_OUT};
    addNode3.outTensorIds = {ADD3_OUT};

    status = atb::CreateOperation(opGraph, operation);
}

// Copy the output tensor back to the host and print its values
void PrintOutTensorValue(atb::Tensor &outTensor)
{
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i++) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main()
{
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // First: construct the Operation parameters
    atb::Operation *op = nullptr;
    atb::GraphParam opGraph;

    // Second: build the graph and create the operator
    CreateGraphOperation(opGraph, &op);

    // step4: create the input/output tensors and store them in a VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call Setup to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call Execute to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // Print the output tensor's values
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the operator object
    st = atb::DestroyContext(context);   // destroy the context
    // Free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // Free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace); // free the workspace
    aclrtResetDevice(deviceId);    // reset the device

    return 0;
}


Compile and run:


# Compile the demo with g++; atb_add_graph_by_tensor_id.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id

# Run the executable
./atb_add_graph_by_tensor_id

# If it core-dumps at runtime, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, i.e.:
# g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id

3.3.2 Implementation with graph-building method 2: configure TensorNames


Name the file atb_add_graph_by_tensor_name.cpp


// step1: include the ACL and acceleration library headers
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"

// Set the properties of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Fill in each input tensor and allocate its device memory. The tensors are set
// up by hand here; real code can convert from torch tensors or other structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2); // a host buffer filled with 2s
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU
    }
}

// Fill in each output tensor and allocate its device memory, as for the inputs
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

static uint64_t DIM3 = 3;

struct LlamaMlpParamGb {
    bool transpose = true;
};

// Factory helpers for the single operators used in the graph
atb::Operation *Linear(const LlamaMlpParamGb &param)
{
    atb::Operation *op = nullptr;
    atb::infer::LinearParam linearParam;
    linearParam.hasBias = false;
    linearParam.transposeB = param.transpose;
    atb::CreateOperation(linearParam, &op);
    return op;
}

atb::Operation *Split(const LlamaMlpParamGb &param)
{
    atb::Operation *op = nullptr;
    atb::infer::SplitParam splitParam = {2, 2}; // split along dim 2 into 2 parts
    atb::CreateOperation(splitParam, &op);
    return op;
}

atb::Operation *Swish(const LlamaMlpParamGb &param)
{
    atb::Operation *op = nullptr;
    atb::infer::ActivationParam activationParam;
    activationParam.activationType = atb::infer::ActivationType::ACTIVATION_SWISH;
    atb::CreateOperation(activationParam, &op);
    return op;
}

atb::Operation *Mul(const LlamaMlpParamGb &param)
{
    atb::Operation *op = nullptr;
    atb::infer::ElewiseParam elewiseParam;
    elewiseParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;
    atb::CreateOperation(elewiseParam, &op);
    return op;
}

atb::Status CreateLlamaMlpOperationByGraphOpBuilder(const LlamaMlpParamGb &param, atb::Operation **operation)
{
    // Infer the graph output TensorDesc from the input TensorDescs
    atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs, atb::SVector<atb::TensorDesc> &outTensorDescs) {
        outTensorDescs.at(0) = inTensorDescs.at(0);
        outTensorDescs.at(0).shape.dimNum = DIM3;
        outTensorDescs.at(0).shape.dims[0] = inTensorDescs.at(0).shape.dims[0];
        outTensorDescs.at(0).shape.dims[1] = inTensorDescs.at(0).shape.dims[1];
        if (param.transpose == true) {
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[0] / 2;
        } else {
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[1] / 2;
        }
        return atb::NO_ERROR;
    };

    // Merge the first two dims of a 3D tensor: [d0, d1, d2] -> [d0 * d1, d2];
    // the element count stays constant, as a reshape requires
    atb::ReshapeFunc reshape_01_2 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 2;
        newShape.dims[0] = oldShape.dims[0] * oldShape.dims[1];
        newShape.dims[1] = oldShape.dims[2];
    };
    // Prepend a leading 1: [d0, d1] -> [1, d0, d1]
    atb::ReshapeFunc unsqueeze_0 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 3;
        newShape.dims[0] = 1;
        newShape.dims[1] = oldShape.dims[0];
        newShape.dims[2] = oldShape.dims[1];
    };

    atb::GraphOpBuilder *graphOpBuilder;
    CreateGraphOpBuilder(&graphOpBuilder);

    graphOpBuilder->Init("LlamaMlpGraphOp", inferShapeFunc, {"hidden_states", "weight"}, {"mlp_out"});

    graphOpBuilder->Reshape("hidden_states", reshape_01_2, "hidden_states_");
    graphOpBuilder->AddOperation(Linear(param), {"hidden_states_", "weight"}, {"linear_out"});
    graphOpBuilder->Reshape("linear_out", unsqueeze_0, "linear_out_");
    graphOpBuilder->AddOperation(Split(param), {"linear_out_"}, {"gate_out", "up_out"});
    graphOpBuilder->AddOperation(Swish(param), {"gate_out"}, {"swish_out"});
    graphOpBuilder->AddOperation(Mul(param), {"swish_out", "up_out"}, {"mlp_out"});

    *operation = graphOpBuilder->Build();  // check the result against nullptr before use
    DestroyGraphOpBuilder(graphOpBuilder); // destroy the graph operator builder
    return atb::NO_ERROR;
}

// Copy the output tensor back to the host and print its values
void PrintOutTensorValue(atb::Tensor &outTensor)
{
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i++) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main()
{
    // step2: configure the deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // First: construct the Operation parameters
    atb::Operation *op = nullptr;
    LlamaMlpParamGb mlpParam;

    // Second: build the graph operator with the builder
    CreateLlamaMlpOperationByGraphOpBuilder(mlpParam, &op);

    // step4: create the input/output tensors and store them in a VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call Setup to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call Execute to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for device-side computation to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // Print the output tensor's values
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the operator object
    st = atb::DestroyContext(context);   // destroy the context
    // Free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // Free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace); // free the workspace
    aclrtResetDevice(deviceId);    // reset the device

    return 0;
}


Compile and run:


# Compile the demo with g++; atb_add_graph_by_tensor_name.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_name.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_name

# Run the executable
./atb_add_graph_by_tensor_name

# If it core-dumps at runtime, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, i.e.:
# g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_name.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_name

