Broadly speaking, working with a machine learning model involves two phases: training iteration and deployment:

  • Training iteration: choosing the dataset, model architecture, loss function, and evaluation metrics, then training the model parameters to get as close as possible to state-of-the-art (SOTA) results.
  • Deployment: running the trained model in a target environment, with the focus on the deployment scenario, deployment method, throughput, and latency.

In practice, deep learning models are usually built with frameworks such as PyTorch or TensorFlow. Running inference directly through these frameworks is not very efficient, especially for online scenarios with strict latency requirements. After several years of exploration in industry and academia, a popular pipeline for model deployment has emerged:

This pipeline solves the two big problems of model deployment: with an intermediate representation bridging deep learning frameworks and inference engines, developers no longer have to worry about getting each complex framework to run in a new environment; and through graph-level optimization of the intermediate representation plus low-level operator optimization in the inference engine, the model runs much more efficiently.

Next, we will walk through the model deployment process step by step.

1. An Overview of ONNX

ONNX (Open Neural Network Exchange) is a standard format for describing computational graphs, released jointly by Facebook and Microsoft in 2017. ONNX already integrates with many deep learning frameworks (TensorFlow, PyTorch, Scikit-learn, MXNet, etc.) and many inference engines. It therefore serves as the bridge from deep learning frameworks to inference engines, much like the intermediate language of a compiler. Because framework support varies, ONNX is usually used to represent static graphs, which are easier to deploy.

The ONNX File Format

ONNX files are serialized with Protobuf. The key data structures defined in onnx.proto3 are listed below (see the sketch after this list for how to access them from a loaded model):

  • ModelProto: the model definition, containing version information, producer information, and a GraphProto.
  • GraphProto: contains repeated NodeProto, initializer, ValueInfoProto, and other elements that together make up a computational graph. Inside a GraphProto they are stored as lists, and the connections are expressed through the inputs and outputs of the nodes.
  • NodeProto: the ONNX computational graph is a directed acyclic graph (DAG); a NodeProto defines the operator type, the node's inputs and outputs, and its attributes.
  • ValueInfoProto: defines the types of variables such as inputs and outputs.
  • TensorProto: serialized weight data, including the data type, shape, etc.
  • AttributeProto: a named attribute that can hold basic data types (int, float, string, vector, etc.) or ONNX-defined structures (TENSOR, GRAPH, etc.).
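
To make these structures concrete, the short sketch below walks a loaded model and prints each of them; it assumes the linear.onnx file built in the next section already exists on disk.

import onnx

# Load a ModelProto (here: the linear.onnx built in the next section)
model = onnx.load('linear.onnx')
print(model.ir_version)                  # version info stored on the ModelProto

graph = model.graph                      # the GraphProto
print(graph.name)

for node in graph.node:                  # NodeProto: operator type and wiring
    print(node.op_type, list(node.input), list(node.output))

for value_info in graph.input:           # ValueInfoProto: typed inputs
    print(value_info.name, value_info.type.tensor_type.elem_type)

for init in graph.initializer:           # TensorProto: serialized weights (empty for linear.onnx)
    print(init.name, init.dims, init.data_type)

for attr in graph.node[0].attribute:     # AttributeProto: named operator attributes (none for Mul)
    print(attr.name)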

Building a Simple ONNX Graph

Below we construct a simple ONNX model, output = a * x + b, with the onnx helper API:

We use helper.make_tensor_value_info to create the ValueInfoProto objects for the inputs and the output, and helper.make_node to create the NodeProto objects: a Mul node that produces the intermediate tensor c, and an Add node that consumes c, which is how the two nodes are connected. helper.make_graph then assembles the NodeProto and ValueInfoProto lists into a GraphProto, and helper.make_model wraps the GraphProto into a ModelProto. Finally, onnx.checker.check_model validates the model, onnx.save writes it to disk, and onnx.load reads it back:
import onnx
from onnx import helper
from onnx import TensorProto

# input and output
a = helper.make_tensor_value_info('a', TensorProto.FLOAT, [10, 10])
x = helper.make_tensor_value_info('x', TensorProto.FLOAT, [10, 10])
b = helper.make_tensor_value_info('b', TensorProto.FLOAT, [10, 10])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT, [10, 10])

# Mul
mul = helper.make_node('Mul', ['a', 'x'], ['c'])

# Add
add = helper.make_node('Add', ['c', 'b'], ['output'])

# graph and model
graph = helper.make_graph([mul, add], 'linear_func', [a, x, b], [output])
model = helper.make_model(graph)

# save model
onnx.checker.check_model(model)
onnx.save(model, 'linear.onnx')

model_ = onnx.load('linear.onnx')
print(model_)

The complete printed information of the ONNX model is shown below:

ir_version: 8
graph {
  node {
    input: "a"
    input: "x"
    output: "c"
    op_type: "Mul"
  }
  node {
    input: "c"
    input: "b"
    output: "output"
    op_type: "Add"
  }
  name: "linear_func"
  input {
    name: "a"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  input {
    name: "x"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  input {
    name: "b"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  output {
    name: "output"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
}
opset_import {
  version: 17
}

We can also inspect the ONNX model structure with Netron:
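
If the netron Python package is installed (pip install netron), the viewer can also be launched from a script instead of the command line; a minimal sketch:

import netron

# Serves the model on a local port and opens it in the browser
netron.start('linear.onnx')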

Next, we use ONNX Runtime to verify that the result is correct:

import onnxruntime
import numpy as np

sess = onnxruntime.InferenceSession('linear.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)

output = sess.run(['output'], {'a': a, 'b': b, 'x': x})[0]

assert np.allclose(output, a * x + b)

We can also modify a loaded model, for example:

import onnx
model = onnx.load('linear.onnx')

node = model.graph.node
node[1].op_type = 'Sub'

onnx.checker.check_model(model)
onnx.save(model, 'linear_2.onnx')
After this change, the model computes a * x - b instead of a * x + b.
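
As a quick sanity check, the modified model can be run through ONNX Runtime just like before; a minimal sketch (CPU execution is enough here):

import onnxruntime
import numpy as np

sess = onnxruntime.InferenceSession('linear_2.onnx', providers=['CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)

output = sess.run(['output'], {'a': a, 'b': b, 'x': x})[0]
assert np.allclose(output, a * x - b)  # the Add node is now a Sub node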

Torch2ONNX

The model passed to torch.onnx.export is ultimately handled as a torch.jit.ScriptModule, and there are two ways to turn an ordinary torch.nn.Module into such a TorchScript model: tracing and scripting. If torch.onnx.export() receives a plain torch.nn.Module, it runs the model once with args and traces the executed operations to build the graph; if it receives a torch.jit.ScriptModule, the graph recorded by scripting is exported directly, and args is still needed to determine the input shapes and types.

The following example illustrates the difference between the two:

The model below applies its convolution in a loop whose length is controlled by the parameter n; we export versions with n=2 and n=3:
import torch

class Model(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.conv = torch.nn.Conv2d(3, 3, 3)

    def forward(self, x):
        for i in range(self.n):
            x = self.conv(x)
        return x


models = [Model(2), Model(3)]
model_names = ['model_2', 'model_3']

for model, model_name in zip(models, model_names):
    dummy_input = torch.rand(1, 3, 10, 10)
    dummy_output = model(dummy_input)
    model_trace = torch.jit.trace(model, dummy_input)
    model_script = torch.jit.script(model)

    # Tracing: equivalent to calling torch.onnx.export(model, ...) directly
    torch.onnx.export(model_trace, dummy_input, f'{model_name}_trace.onnx')
    # Scripting: torch.jit.script must be called first
    torch.onnx.export(model_script, dummy_input, f'{model_name}_script.onnx')
Because torch.jit.trace records only the operations executed in a single forward pass, the loop is unrolled during tracing: the traced ONNX model contains n separate Conv nodes, so models with different n have different graph structures. The scripted model, by contrast, preserves the loop as control flow in the exported graph.
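
To confirm this, you can load the exported files and count the Conv and Loop nodes in each graph; a small sketch using the file names from the export code above:

import onnx

for name in ['model_2_trace.onnx', 'model_3_trace.onnx',
             'model_2_script.onnx', 'model_3_script.onnx']:
    graph = onnx.load(name).graph
    conv_nodes = [node for node in graph.node if node.op_type == 'Conv']
    loop_nodes = [node for node in graph.node if node.op_type == 'Loop']
    print(name, 'Conv nodes:', len(conv_nodes), 'Loop nodes:', len(loop_nodes))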

2. Running BERT with ONNX Runtime

ONNX Runtime is a cross-platform machine learning inference accelerator, i.e. an "inference engine", maintained by Microsoft. ONNX Runtime works with ONNX directly: it can read and run .onnx files without converting them to any other format. In other words, for the PyTorch -> ONNX -> ONNX Runtime deployment pipeline, once the .onnx file is on the target device and runs under ONNX Runtime, the deployment is essentially done.
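
Before running the model, it is worth checking which execution providers the installed onnxruntime build exposes; a quick check (CUDAExecutionProvider only appears when the onnxruntime-gpu package is installed):

import onnxruntime

print(onnxruntime.__version__)
print(onnxruntime.get_device())               # e.g. 'GPU' or 'CPU'
print(onnxruntime.get_available_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']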

Below we run a BERT model with ONNX Runtime.

2.1 Loading the Data and Model

import torch
import onnx
import onnxruntime
import transformers
import os

# Whether allow overwriting existing ONNX model and download the latest script from GitHub
enable_overwrite = True
# Total samples to inference, so that we can get average latency
total_samples = 1000
# ONNX opset version
opset_version=11

cache_dir = "./squad"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
if not os.path.exists(predict_file):
    import wget
    print("Start downloading predict file.")
    wget.download(predict_file_url, predict_file)
    print("Predict file downloaded.")

model_name_or_path = "bert-large-uncased-whole-word-masking-finetuned-squad"
max_seq_length = 128
doc_stride = 128
max_query_length = 64

from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)

# Load pretrained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path, from_tf=False, config=config, cache_dir=cache_dir)
# load some examples
from transformers.data.processors.squad import SquadV1Processor

processor = SquadV1Processor()
examples = processor.get_dev_examples(None, filename=predict_file)

from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features( 
            examples=examples[:total_samples], # convert enough examples for this notebook
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset='pt'
        )

2.2 Exporting the ONNX Model

output_dir = "./onnx"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)   
export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))

import torch
use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")

# Get the first example data to run the model and export it to ONNX
data = dataset[0]
inputs = {
    'input_ids':      data[0].to(device).reshape(1, max_seq_length),
    'attention_mask': data[1].to(device).reshape(1, max_seq_length),
    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
}

# Set model to inference mode, which is required before exporting the model because some operators behave differently in 
# inference and training mode.
model.eval()
model.to(device)

if enable_overwrite or not os.path.exists(export_model_path):
    with torch.no_grad():
        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                            # model being run
                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)
                          f=export_model_path,                              # where to save the model (can be a file or file-like object)
                          opset_version=opset_version,                      # the ONNX version to export the model to
                          do_constant_folding=True,                         # whether to execute constant folding for optimization
                          input_names=['input_ids',                         # the model's input names
                                       'input_mask', 
                                       'segment_ids'],
                          output_names=['start', 'end'],                    # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes
                                        'input_mask' : symbolic_names,
                                        'segment_ids' : symbolic_names,
                                        'start' : symbolic_names,
                                        'end' : symbolic_names})
        print("Model exported at ", export_model_path)

2.3 PyTorch Inference

First, measure the baseline inference accuracy and latency in PyTorch:

import time

# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.
latency = []
with torch.no_grad():
    for i in range(total_samples):
        data = dataset[i]
        inputs = {
            'input_ids':      data[0].to(device).reshape(1, max_seq_length),
            'attention_mask': data[1].to(device).reshape(1, max_seq_length),
            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
        }
        start = time.time()
        outputs = model(**inputs)
        latency.append(time.time() - start)
print("PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

On an A10 GPU, PyTorch reports: PyTorch cuda Inference time = 12.33 ms

2.4 Inference with ONNX Runtime

import psutil
import onnxruntime
import numpy

assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()
device_name = 'gpu'

sess_options = onnxruntime.SessionOptions()

# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time so enable it for debugging only.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))

# Please change the value according to best setting in Performance Test Tool result.
sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)

session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
# Alternatively: onnxruntime.InferenceSession(onnx_path, providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])


latency = []
for i in range(total_samples):
    data = dataset[i]
    ort_inputs = {
        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),
        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),
        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()
    }
    start = time.time()
    ort_outputs = session.run(None, ort_inputs)
    latency.append(time.time() - start)
    
print("OnnxRuntime {} Inference time = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))

On the same A10 GPU, ONNX Runtime reports: OnnxRuntime gpu Inference time = 7.05 ms

That is a throughput improvement of roughly 75% (12.33 ms / 7.05 ms ≈ 1.75).

We also need to verify the accuracy:

print("***** Verifying correctness *****")
for i in range(2):    
    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))
    diff = ort_outputs[i] - outputs[i].cpu().numpy()
    max_diff = numpy.max(numpy.abs(diff))
    avg_diff = numpy.average(numpy.abs(diff))
    print(f'maximum_diff={max_diff} average_diff={avg_diff}')

The output is:

***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
maximum_diff=0.002591252326965332 average_diff=0.0004398506134748459
PyTorch and ONNX Runtime output 1 are close: True
maximum_diff=0.0033492445945739746 average_diff=0.00040213397005572915

The accuracy is within tolerance. This completes a full example of running a real model with ONNX Runtime.
