Broadly speaking, the machine learning model lifecycle can be divided into two phases: training iteration and deployment.
- Training iteration: fixing the dataset, model architecture, loss function, and evaluation metrics, then training the model parameters to get as close as possible to state-of-the-art (SOTA) results.
- Deployment: running the trained model in a specific target environment, where the focus shifts to the deployment scenario, the serving approach, throughput, and latency.
In practice, deep learning models are usually built with frameworks such as PyTorch or TensorFlow. Running inference directly through these frameworks is often not efficient enough, especially for online scenarios with strict latency requirements. After several years of exploration in industry and academia, a popular deployment pipeline has emerged: deep learning framework → intermediate representation → inference engine.
This pipeline solves two major problems in model deployment. With an intermediate representation that bridges deep learning frameworks and inference engines, developers no longer have to worry about getting each heavyweight framework to run in the target environment. And through graph-level optimizations on the intermediate representation plus the inference engine's low-level optimization of the operators, inference efficiency improves substantially.
Next, we will walk through the deployment process step by step.
1. A Closer Look at ONNX
ONNX (Open Neural Network Exchange) is a format for describing computation graphs in a standard way, jointly released by Facebook and Microsoft in 2017. ONNX already interfaces with many deep learning frameworks (TensorFlow, PyTorch, scikit-learn, MXNet, etc.) and many inference engines. It therefore serves as the bridge from deep learning frameworks to inference engines, much like an intermediate language in a compiler. Because framework support varies, ONNX is usually used to represent static graphs, which are easier to deploy.
The ONNX file format
ONNX files are serialized with Protobuf. The key data structures defined in the onnx.proto3 schema are the following (a short sketch showing how to access these fields comes after the list):
- ModelProto: the model definition, containing version information, producer metadata, and a GraphProto.
- GraphProto: contains repeated NodeProto, initializer, and ValueInfoProto fields, which together make up the computation graph. Inside a GraphProto these elements are stored as flat lists, and the connectivity is expressed through the input and output names of the nodes.
- NodeProto: the ONNX computation graph is a directed acyclic graph (DAG); a NodeProto defines an operator's type, its input and output names, and its attributes.
- ValueInfoProto: describes the type and shape of variables such as graph inputs and outputs.
- TensorProto: serialized weight data, including the data type, shape, and so on.
- AttributeProto: a named attribute that can hold basic data types (int, float, string, lists of these, etc.) as well as ONNX-defined structures (TENSOR, GRAPH, etc.).
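To make these relationships concrete, here is a small sketch that loads a model and walks the main protobuf fields (it assumes some .onnx file on disk, e.g. the linear.onnx built in the next subsection):
import onnx
# onnx.load returns a ModelProto
model = onnx.load('linear.onnx')
print(model.ir_version, model.opset_import)
graph = model.graph  # GraphProto
print(graph.name)
# NodeProto: operator type plus the input/output names that wire up the DAG
for node in graph.node:
    print(node.op_type, list(node.input), list(node.output))
# ValueInfoProto: names and tensor types of the graph inputs/outputs
for value_info in list(graph.input) + list(graph.output):
    print(value_info.name, value_info.type.tensor_type.elem_type)
# TensorProto: serialized weights (empty for the hand-built model below)
for init in graph.initializer:
    print(init.name, init.dims, init.data_type)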
Building a simple ONNX graph
Below we construct a simple ONNX model with the onnx helper API:
The construction goes from the inside out. First, helper.make_tensor_value_info creates the ValueInfoProto objects that describe the input and output tensors. Then helper.make_node builds the operator nodes (NodeProto): a Mul node whose output is the intermediate tensor c, and an Add node that takes c as one of its inputs, so the Mul→Add connectivity is expressed purely through tensor names. Next, helper.make_graph assembles these into a GraphProto; helper.make_graph takes the list of NodeProto, a graph name, and the input and output ValueInfoProto lists. Finally, helper.make_model wraps the GraphProto into a ModelProto; make_model can also attach metadata such as producer and version information.
After construction, onnx.checker.check_model validates the model, onnx.save writes it to disk, and onnx.load reads it back so that we can print and inspect it.
import onnx
from onnx import helper
from onnx import TensorProto
# input and output
a = helper.make_tensor_value_info('a', TensorProto.FLOAT, [10, 10])
x = helper.make_tensor_value_info('x', TensorProto.FLOAT, [10, 10])
b = helper.make_tensor_value_info('b', TensorProto.FLOAT, [10, 10])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT, [10, 10])
# Mul
mul = helper.make_node('Mul', ['a', 'x'], ['c'])
# Add
add = helper.make_node('Add', ['c', 'b'], ['output'])
# graph and model
graph = helper.make_graph([mul, add], 'linear_func', [a, x, b], [output])
model = helper.make_model(graph)
# save model
onnx.checker.check_model(model)
onnx.save(model, 'linear.onnx')
model_ = onnx.load('linear.onnx')
print(model_)
The complete model printed out looks like this:
ir_version: 8
graph {
node {
input: "a"
input: "x"
output: "c"
op_type: "Mul"
}
node {
input: "c"
input: "b"
output: "output"
op_type: "Add"
}
name: "linear_func"
input {
name: "a"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_value: 10
}
dim {
dim_value: 10
}
}
}
}
}
input {
name: "x"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_value: 10
}
dim {
dim_value: 10
}
}
}
}
}
input {
name: "b"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_value: 10
}
dim {
dim_value: 10
}
}
}
}
}
output {
name: "output"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_value: 10
}
dim {
dim_value: 10
}
}
}
}
}
}
opset_import {
version: 17
}
We can also inspect the ONNX model structure with Netron.
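If the netron Python package is installed (an assumption; Netron is also available as a desktop and browser app), the viewer can be launched directly from Python:
import netron
# Serves a local viewer for the model and opens it in the browser
netron.start('linear.onnx')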
Next, we verify the result's correctness with ONNX Runtime:
import onnxruntime
import numpy as np
sess = onnxruntime.InferenceSession('linear.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)
output = sess.run(['output'], {'a': a, 'b': b, 'x': x})[0]
assert np.allclose(output, a * x + b)
We can also modify a loaded model, for example:
import onnx
model = onnx.load('linear.onnx')
node = model.graph.node
node[1].op_type = 'Sub'
onnx.checker.check_model(model)
onnx.save(model, 'linear_2.onnx')
Re-running the checker confirms the graph is still valid; since the second node's op_type is now Sub, the saved linear_2.onnx computes a * x - b instead of a * x + b.
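As a quick sanity check (a small sketch reusing the verification pattern from above), the modified model can be run with ONNX Runtime:
import numpy as np
import onnxruntime
sess = onnxruntime.InferenceSession('linear_2.onnx', providers=['CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
output = sess.run(['output'], {'a': a, 'x': x, 'b': b})[0]
# The modified graph should now compute a * x - b
assert np.allclose(output, a * x - b)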
Torch2ONNX
torch.onnx.export accepts either a torch.nn.Module or a torch.jit.ScriptModule.
What torch.onnx.export ultimately needs is a torch.jit.ScriptModule, and there are two ways to turn ordinary PyTorch code into one: tracing (torch.jit.trace) and scripting (torch.jit.script).
If torch.onnx.export() is given a plain torch.nn.Module, it falls back to tracing: the model is run once on the provided args and the executed operations are recorded as the graph. If it is given a ScriptModule, the scripted graph (including control flow) is exported directly, and args are only used to supply example inputs.
The following example illustrates the difference between the two:
Here we define a model with a loop in its forward pass: the parameter n controls how many times the input is passed through the convolution. We then create two instances, one with n=2 and one with n=3, and export each of them with both methods:
import torch
class Model(torch.nn.Module):
def __init__(self, n):
super().__init__()
self.n = n
self.conv = torch.nn.Conv2d(3, 3, 3)
def forward(self, x):
for i in range(self.n):
x = self.conv(x)
return x
models = [Model(2), Model(3)]
model_names = ['model_2', 'model_3']
for model, model_name in zip(models, model_names):
dummy_input = torch.rand(1, 3, 10, 10)
dummy_output = model(dummy_input)
model_trace = torch.jit.trace(model, dummy_input)
model_script = torch.jit.script(model)
# Tracing: equivalent to calling torch.onnx.export(model, ...) directly
torch.onnx.export(model_trace, dummy_input, f'{model_name}_trace.onnx')
# Scripting: torch.jit.script must be called explicitly first
torch.onnx.export(model_script, dummy_input, f'{model_name}_script.onnx')
Comparing the exported files shows the difference: tracing (torch.jit.trace) unrolls the loop, so the n=2 and n=3 traced models contain 2 and 3 Conv nodes respectively, and the graph changes whenever n changes; the scripted models instead keep the loop as a control-flow (Loop) node in the graph.
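A quick way to confirm this (a sketch assuming the four files exported above are on disk) is to list the operator types in each graph:
import onnx
for name in ['model_2_trace.onnx', 'model_3_trace.onnx',
             'model_2_script.onnx', 'model_3_script.onnx']:
    ops = [node.op_type for node in onnx.load(name).graph.node]
    print(name, ops)
# Expected: the traced graphs list 2 or 3 Conv nodes, while the scripted
# graphs contain a Loop node (plus any supporting constant nodes).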
2. Running BERT with ONNX Runtime
ONNX Runtime is a cross-platform machine learning inference accelerator maintained by Microsoft, i.e. an "inference engine". ONNX Runtime consumes ONNX directly: it can read and run .onnx files without converting them to any other format. In other words, for the PyTorch -> ONNX -> ONNX Runtime deployment pipeline, once the .onnx file is on the target device and runs under ONNX Runtime, the deployment is essentially done.
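As a minimal illustration of this pipeline (a toy sketch using a small torch.nn.Linear model, separate from the BERT example below):
import numpy as np
import torch
import onnxruntime
# PyTorch -> ONNX: export a tiny model with tracing
model = torch.nn.Linear(4, 2).eval()
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, 'tiny.onnx', input_names=['x'], output_names=['y'])
# ONNX -> ONNX Runtime: load the file and run it
sess = onnxruntime.InferenceSession('tiny.onnx', providers=['CPUExecutionProvider'])
y = sess.run(['y'], {'x': dummy.numpy()})[0]
assert np.allclose(y, model(dummy).detach().numpy(), atol=1e-5)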
In the rest of this section we run a BERT model with ONNX Runtime.
2.1 Loading the data and model
import torch
import onnx
import onnxruntime
import transformers
import os
# Whether allow overwriting existing ONNX model and download the latest script from GitHub
enable_overwrite = True
# Total samples to inference, so that we can get average latency
total_samples = 1000
# ONNX opset version
opset_version=11
cache_dir = "./squad"
if not os.path.exists(cache_dir):
os.makedirs(cache_dir)
predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
if not os.path.exists(predict_file):
import wget
print("Start downloading predict file.")
wget.download(predict_file_url, predict_file)
print("Predict file downloaded.")
model_name_or_path = "bert-large-uncased-whole-word-masking-finetuned-squad"
max_seq_length = 128
doc_stride = 128
max_query_length = 64
from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)
# Load pretrained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path, from_tf=False, config=config, cache_dir=cache_dir)
# load some examples
from transformers.data.processors.squad import SquadV1Processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(None, filename=predict_file)
from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features(
examples=examples[:total_samples], # convert enough examples for this notebook
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=doc_stride,
max_query_length=max_query_length,
is_training=False,
return_dataset='pt'
)
2.2 Exporting the ONNX model
output_dir = "./onnx"
if not os.path.exists(output_dir):
os.makedirs(output_dir)
export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))
import torch
use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")
# Get the first example data to run the model and export it to ONNX
data = dataset[0]
inputs = {
'input_ids': data[0].to(device).reshape(1, max_seq_length),
'attention_mask': data[1].to(device).reshape(1, max_seq_length),
'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
}
# Set model to inference mode, which is required before exporting the model because some operators behave differently in
# inference and training mode.
model.eval()
model.to(device)
if enable_overwrite or not os.path.exists(export_model_path):
with torch.no_grad():
symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
torch.onnx.export(model, # model being run
args=tuple(inputs.values()), # model input (or a tuple for multiple inputs)
f=export_model_path, # where to save the model (can be a file or file-like object)
opset_version=opset_version, # the ONNX version to export the model to
do_constant_folding=True, # whether to execute constant folding for optimization
input_names=['input_ids', # the model's input names
'input_mask',
'segment_ids'],
output_names=['start', 'end'], # the model's output names
dynamic_axes={'input_ids': symbolic_names, # variable length axes
'input_mask' : symbolic_names,
'segment_ids' : symbolic_names,
'start' : symbolic_names,
'end' : symbolic_names})
print("Model exported at ", export_model_path)
2.3 PyTorch inference
First, we measure baseline accuracy and latency with PyTorch:
import time
# Measure latency. Timings inside a Jupyter notebook are not very accurate; a standalone Python script is recommended.
latency = []
with torch.no_grad():
for i in range(total_samples):
data = dataset[i]
inputs = {
'input_ids': data[0].to(device).reshape(1, max_seq_length),
'attention_mask': data[1].to(device).reshape(1, max_seq_length),
'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
}
start = time.time()
outputs = model(**inputs)
latency.append(time.time() - start)
print("PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))
On a single A10 GPU, PyTorch reports: PyTorch cuda Inference time = 12.33 ms
2.4 Inference with ONNX Runtime
import psutil
import onnxruntime
import numpy
assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()
device_name = 'gpu'
sess_options = onnxruntime.SessionOptions()
# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time so enable it for debugging only.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))
# Please change the value according to best setting in Performance Test Tool result.
sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)
session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
# onnxruntime.InferenceSession(onnx_path,providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
latency = []
for i in range(total_samples):
data = dataset[i]
ort_inputs = {
'input_ids': data[0].cpu().reshape(1, max_seq_length).numpy(),
'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),
'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()
}
start = time.time()
ort_outputs = session.run(None, ort_inputs)
latency.append(time.time() - start)
print("OnnxRuntime {} Inference time = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))
On a single A10 GPU, ONNX Runtime reports: OnnxRuntime gpu Inference time = 7.05 ms
That is roughly a 1.75x speedup (12.33 ms / 7.05 ms), i.e. about 75% higher throughput.
At the same time, we need to verify that the outputs match:
print("***** Verifying correctness *****")
for i in range(2):
print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))
diff = ort_outputs[i] - outputs[i].cpu().numpy()
max_diff = numpy.max(numpy.abs(diff))
avg_diff = numpy.average(numpy.abs(diff))
print(f'maximum_diff={max_diff} average_diff={avg_diff}')
The output:
***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
maximum_diff=0.002591252326965332 average_diff=0.0004398506134748459
PyTorch and ONNX Runtime output 1 are close: True
maximum_diff=0.0033492445945739746 average_diff=0.00040213397005572915
The differences are within tolerance. This completes an end-to-end example of running a real model with ONNX Runtime.