ONNX-to-OM Model Conversion and Inference Workflow

I. Resources and End-to-End Workflow

Accuracy and performance requirements

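Compared with the PyTorch model, the accuracy of OM model inference must not drop by more than 1%.
Performance meets the target only if 4 × the throughput of a single NPU chip exceeds the throughput of a GPU T4.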
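
The two criteria above can be checked mechanically once the four numbers have been measured. The following is a minimal sketch, not part of the original workflow: the function name `check_requirements` and the sample values are hypothetical, and it assumes that "1%" means one absolute percentage point of accuracy.

```shell
#!/usr/bin/env bash
# Check the two acceptance criteria for an OM model.
# Arguments (all supplied by the caller; none of the values come from the source):
#   $1 PyTorch accuracy in percent, $2 OM accuracy in percent,
#   $3 single-chip NPU throughput, $4 GPU T4 throughput (same unit, e.g. fps).
# Assumption: "1%" is read as 1 absolute percentage point.
check_requirements() {
    awk -v pt="$1" -v om="$2" -v npu="$3" -v t4="$4" 'BEGIN {
        acc_ok  = (pt - om) <= 1.0      # accuracy drop of at most 1 point
        perf_ok = (npu * 4) > t4        # 4 x single-chip NPU throughput beats T4
        printf "accuracy: %s, performance: %s\n", (acc_ok ? "PASS" : "FAIL"), (perf_ok ? "PASS" : "FAIL")
        exit !(acc_ok && perf_ok)       # exit status 0 only when both pass
    }'
}

# Hypothetical measurements, for illustration only.
check_requirements 76.3 75.8 400 1500
```

The function returns a non-zero exit status when either criterion fails, so it can gate a later step in a script.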

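II. Environment Setup

See the corresponding setup tutorial.

III. End-to-End Inference Example

Converting the ONNX model to an OM model:

1. Set the environment variables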

export install_path=/usr/local/Ascend/ascend-toolkit/latest
export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH
export ASCEND_OPP_PATH=${install_path}/opp
export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/

Variables exported directly on the command line apply only to the current terminal window. To make them permanent, add the export lines to ~/.bashrc:

vim ~/.bashrc
source ~/.bashrc
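
Editing ~/.bashrc by hand works, but repeated edits can leave duplicate export lines. As a minimal sketch, the helper below appends a line only if it is not already present. The name `append_once` is hypothetical, and the example writes to a scratch file rather than the real ~/.bashrc.

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
# $1: target file (e.g. ~/.bashrc), $2: the line to add.
append_once() {
    local file="$1" line="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example: try it on a scratch file first; point rc_file at ~/.bashrc once satisfied.
rc_file="$(mktemp)"
append_once "$rc_file" 'export install_path=/usr/local/Ascend/ascend-toolkit/latest'
append_once "$rc_file" 'export ASCEND_OPP_PATH=${install_path}/opp'
append_once "$rc_file" 'export ASCEND_OPP_PATH=${install_path}/opp'   # second call is a no-op
cat "$rc_file"
```

The single quotes keep `${install_path}` unexpanded, so it is resolved when the shell later reads the file.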

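2. Convert the ONNX model to an OM model file with atc

ATC (Ascend Tensor Compiler) converts network models from open-source frameworks (such as Caffe and TensorFlow), as well as single-operator description files defined in Ascend IR, into offline models supported by the Ascend AI processor. During conversion it can optimize operator scheduling, rearrange weight data, and optimize memory usage, and this model preprocessing can be completed without the device.
For tool usage, see CANN V100R020C10 Development Auxiliary Tools Guide (Inference) 01
and the ATC Tool User Guide in CANN 5.0.1 Development Auxiliary Tools Guide (Inference) 01.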

atc --framework=5 --model=model_rectify_random.onnx --input_format=NCHW --input_shape="input.1:1,3,112,112" --output=model_rectify_random --log=debug --soc_version=Ascend310
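
When several models are converted, it can help to assemble the command from variables and print it before running it. The following is a sketch under stated assumptions: the variable names are illustrative, while the flags are exactly those in the command above (`--framework=5` selects ONNX, and atc appends `.om` to the `--output` prefix).

```shell
#!/usr/bin/env bash
# Build the atc command from a few variables and print it before running.
# The flags are the ones used above; the variable names are illustrative.
model=model_rectify_random.onnx
output=${model%.onnx}                 # atc appends .om to the --output prefix
input_shape="input.1:1,3,112,112"
soc_version=Ascend310

atc_cmd=(atc --framework=5 --model="$model" --input_format=NCHW
         --input_shape="$input_shape" --output="$output"
         --log=debug --soc_version="$soc_version")

printf '%s ' "${atc_cmd[@]}"; echo
echo "expected OM file: ${output}.om"

# Run it only when atc is on PATH (i.e. the environment variables above are set).
if command -v atc >/dev/null 2>&1; then
    "${atc_cmd[@]}"
fi
```

Holding the command in an array keeps the quoted `--input_shape` value as one argument.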

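IV. Dataset Preprocessing

V. Offline Inference

5.1 Overview of the benchmark tool

The benchmark tool is a model inference tool developed by Huawei. It supports offline inference for many kinds of models, quickly reports a model's performance on Ascend310, and offers two modes: real-data inference and pure inference.
To obtain the tool and for usage instructions, see CANN V100R020C10 Inference Benchmark Tool User Guide 01.

5.2 Pure inference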

./benchmark_tools/benchmark.x86_64 -batch_size=1 -om_path=./model_rectify_random.om -round=50 -device_id=0

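VI. Profiling-Based Performance Analysis

First download the script repository 张东宇/onnx_tools. Its author has already wrapped the profiling steps, so it is simple to use.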

sh run_profiling.sh /root/huawei/benchmark_tools/benchmark.x86_64 /root/huawei/model_rectify_random.om /root/huawei/profiling_result 0
This generates a result folder. The main files to analyze are op_statistic_0_1.csv and op_summary_0_1.csv in its summary directory.
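
A quick way to find the most expensive operators in such a CSV is to sort by a time column. The helper below is a generic sketch, not part of onnx_tools: the name `top_by_column` is hypothetical, the demo data and its header names are synthetic, and the real column names must be read from the header row of the generated file. It assumes a plain CSV without quoted fields that contain commas.

```shell
#!/usr/bin/env bash
# Print the header plus the N rows with the largest value in a named column.
# $1: CSV file, $2: exact header name of a numeric column, $3: N (default 5).
# Assumes a plain CSV: one header row, no quoted fields containing commas.
top_by_column() {
    local file="$1" column="$2" n="${3:-5}" idx
    idx=$(head -n 1 "$file" | tr -d '\r' | awk -F',' -v col="$column" \
        '{ for (i = 1; i <= NF; i++) if ($i == col) { print i; exit } }')
    if [ -z "$idx" ]; then
        echo "column not found: $column" >&2
        return 1
    fi
    head -n 1 "$file"
    tail -n +2 "$file" | sort -t',' -k"${idx},${idx}nr" | head -n "$n"
}

# Synthetic data for illustration; real header names must be taken from the CSV itself.
demo="$(mktemp)"
printf '%s\n' 'OP Type,Count,Total Time(us)' 'Conv2D,20,950.5' 'Relu,20,80.2' 'MatMul,2,310.0' > "$demo"
top_by_column "$demo" 'Total Time(us)' 2
```

For the real files, the first argument would be a path such as the op_statistic_0_1.csv file in the summary directory.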

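VII. Model Tuning

7.1 Auto Tune for OM models on the NPU

For an introduction to the Auto Tune architecture, see:
Auto Tune Architecture, Ascend CANN (20.0, training scenario), Development Auxiliary Tools, Auto Tune Tool User Guide, Introduction to Auto Tune (Huawei Cloud)
Auto Tune Architecture, CANN 5.0.1 Development Auxiliary Tools Guide (Inference) 01 (Huawei)

7.1.1 First configure the environment variables. Before running Auto Tune, declare them with export in the current terminal; they are lost when the shell is closed.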

export install_path=/usr/local/Ascend/ascend-toolkit/latest
export LD_LIBRARY_PATH=${install_path}/acllib/lib64:${install_path}/atc/lib64:$LD_LIBRARY_PATH
export PATH=${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
export ASCEND_OPP_PATH=${install_path}/opp

# Optional Auto Tune environment variables
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
export REPEAT_TUNE=False
export TUNE_TIMEOUT=2000
# Name of the network node to tune; if this variable is set, only the specified node is tuned
export TUNE_OPS_NAME=conv_layers/Pad_1
export ASCEND_DEVICE_ID=0
export TE_PARALLEL_COMPILER=2
export ENABLE_TUNE_BANK=True

# Environment variable for offline tuning
export ENABLE_TUNE_DUMP=True

# Optional environment variable for offline tuning
export TUNE_DUMP_PATH=/home/HwHiAiUser/DumpData

7.1.2 atc commands

atc --input_format=ND --framework=5 --model=bert_base_batch_1_sim.onnx --input_shape="input_ids:1,512;token_type_ids:1,512;attention_mask:1,512" --output=bert_base_batch_1_sim_repeat_auto_2000 --auto_tune_mode="RL,GA" --log=info --soc_version=Ascend310 --op_select_implmode=high_performance
# For NCHW input, use the command below directly; compared with the earlier plain OM conversion, it only adds --auto_tune_mode="RL,GA" and --op_select_implmode=high_performance
atc --framework=5 --model=model_rectify_random.onnx --input_format=NCHW --input_shape="input.1:1,3,112,112" --output=model_rectify_random --auto_tune_mode="RL,GA" --log=debug --soc_version=Ascend310 --op_select_implmode=high_performance

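auto_tune_mode: enables Auto Tune
op_select_implmode=high_performance: enables high-performance mode
To enable repeat tuning: add --auto_tune_mode="RL,GA" and also export REPEAT_TUNE=True
If AIPP is used for image preprocessing, add --insert_op_conf=aipp_efficientnet-b0_pth.config
Operator precision is selected with --precision_mode; the default is force_fp16

Then turn Auto Tune off and run the atc conversion again: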

export TUNE_BANK_PATH=/usr/local/Ascend/ascend-toolkit/5.0.T205/x86_64-linux/atc/data/rl/Ascend310/custom
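# When setting the variable above, check the actual path on your system (for example, the version directory may be 5.0 rather than 5.0.T205); the directory should contain a JSON file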
atc --framework=5 --model=model_rectify_random.onnx --input_format=NCHW --input_shape="input.1:1,3,112,112" --output=model_rectify_random --log=debug --soc_version=Ascend310

Then test the performance again:

./benchmark_tools/benchmark.x86_64 -batch_size=1 -om_path=./model_rectify_random.om -round=50 -device_id=0

In the end, the inference time does improve to some extent.
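
To put a number on the improvement, the latency reported before and after tuning can be compared. The following is a minimal sketch: the function name `latency_gain_pct` is hypothetical and the two input values are placeholders, not measurements from this model.

```shell
#!/usr/bin/env bash
# Relative latency reduction in percent: (before - after) / before * 100.
# $1: latency before tuning, $2: latency after tuning (same unit).
latency_gain_pct() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before <= 0) { print "before must be > 0" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", (before - after) / before * 100
    }'
}

latency_gain_pct 4.0 3.0   # hypothetical values, not measurements
```

A negative result would mean the tuned model is slower than the baseline.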

The result can then be analyzed again with the profiling tool:

sh run_profiling.sh /root/huawei_2/benchmark_tools/benchmark.x86_64 /root/huawei_2/model_rectify_random.om /root/huawei_2/profiling_result 0