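Original: RUSTGO: CALLING RUST FROM GO WITH NEAR-ZERO OVERHEAD
Author: Filippo Valsorda
Translator: 雁惊寒
Summary: This post describes an experiment in calling Rust code from Go. You do not need to know Rust or compiler internals; knowing what a linker does is enough. The translation follows.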
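Go has excellent support for calling directly into assembly. Much of the fast cryptographic code in the stdlib is carefully optimized assembly, which brings speedups of more than 20 times.
However, writing assembly is hard, and reviewing it is even harder. It would be nice if we could write these hot functions in a higher-level language.
This post is the story of an experiment: calling Rust code from Go fast enough to stand in for that assembly. You do not need to know Rust or compiler internals; knowing what a linker does is enough.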
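Why Rust
I'll be upfront: I don't know Rust well, and I don't feel compelled to do my day-to-day programming in it. However, I do know that Rust is a very tweakable and optimizable language that is still more readable than assembly. (After all, everything is more readable than assembly!)
Go strives to find defaults that suit its core use cases, and it only accepts features that are fast enough to be enabled by default. I love it for that. But for what we are doing today, we need a language that can generate stack-only functions with safety checks manually turned off.
So if there is a language that we can constrain enough to behave like assembly, and optimize enough to be as useful as assembly, it might be Rust.
Finally, Rust is safe and actively developed, and, not least, there is already a good ecosystem of high-performance Rust cryptography code.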
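Why not cgo
Go has a Foreign Function Interface (FFI) called cgo. cgo lets Go programs call C functions in the most natural way possible, which unfortunately is not very natural at all.
By using the C Application Binary Interface (ABI) as the lingua franca of FFIs, we can call anything from anything: Rust can compile into a library exposing a C interface, and cgo can use that. It's awkward, but it works.
We can even use reverse-cgo to build Go into a C library and call it from any other language, as described in this post.
But cgo does a lot of work to make that possible: it sets up a whole stack for C to live in, it makes deferred calls to prepare for a panic in a Go callback... that could fill a post of its own.
As a result, the performance cost of each cgo call is far too high for our use case, which is small hot functions.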
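Linking it together
So here is the idea: if we have Rust code that is as constrained as assembly, we should be able to use it just like assembly and call straight into it, perhaps with a thin layer of glue.
We don't have to work at the intermediate representation level: since Go 1.3, the Go compiler converts both code and high-level assembly into machine code before linking.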
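cgo works this way too: it compiles the C code with the C compiler and the Go code with the Go compiler, then links everything together with clang or gcc. We can even pass flags to that linker with CGO_LDFLAGS.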
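Underneath all the safety features of cgo, there has to be a cross-language function call.
It would be nice to figure out how to do this without patching the compiler. First, let's figure out how to link a Go program with a Rust archive.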
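I could not find a decent way to link against a foreign object with go build other than #cgo directives. However, invoking cgo sends .s files to the C compiler instead of the Go assembler, and we need Go assembly.
Fortunately, go build is only a frontend: it collects files and invokes lower-level tools, and the -x flag shows what it runs.
I built this small Makefile by following a cgo build invoked with -x -ldflags "-v -linkmode=external '-extldflags=-v'".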
rustgo: rustgo.a
	go tool link -o rustgo -extld clang -buildmode exe -buildid b01dca11ab1e -linkmode external -v rustgo.a

rustgo.a: hello.go hello.o
	go tool compile -o rustgo.a -p main -buildid b01dca11ab1e -pack hello.go
	go tool pack r rustgo.a hello.o

hello.o: hello.s
	go tool asm -I "$(shell go env GOROOT)/pkg/include" -D GOOS_darwin -D GOARCH_amd64 -o hello.o hello.s
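This compiles a simple main package made of a Go file (hello.go) and a Go assembly file (hello.s).
Now, if we want to link in a Rust object, we first build it as a static library...

libhello.a: hello.rs
	rustc -g -O --crate-type staticlib hello.rs

...and then tell the external linker to link it all together.

rustgo: rustgo.a libhello.a
	go tool link -o rustgo -extld clang -buildmode exe -buildid b01dca11ab1e -linkmode external -v -extldflags='-lhello -L"$(CURDIR)"' rustgo.a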
$ make
go tool asm -I "/usr/local/Cellar/go/1.8.1_1/libexec/pkg/include" -D GOOS_darwin -D GOARCH_amd64 -o hello.o hello.s
go tool compile -o rustgo.a -p main -buildid b01dca11ab1e -pack hello.go
go tool pack r rustgo.a hello.o
rustc --crate-type staticlib hello.rs
note: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: System
note: library: c
note: library: m
go tool link -o rustgo -extld clang -buildmode exe -buildid b01dca11ab1e -linkmode external -v -extldflags="-lhello -L/Users/filippo/code/misc/rustgo" rustgo.a
HEADER = -H1 -T0x1001000 -D0x0 -R0x1000
searching for runtime.a in /usr/local/Cellar/go/1.8.1_1/libexec/pkg/darwin_amd64/runtime.a
searching for runtime/cgo.a in /usr/local/Cellar/go/1.8.1_1/libexec/pkg/darwin_amd64/runtime/cgo.a
0.00 deadcode
0.00 pclntab=166785 bytes, funcdata total 17079 bytes
0.01 dodata
0.01 symsize = 0
0.01 symsize = 0
0.01 reloc
0.01 dwarf
0.02 symsize = 0
0.02 reloc
0.02 asmb
0.02 codeblk
0.03 datblk
0.03 sym
0.03 headr
0.06 host link: "clang" "-m64" "-gdwarf-2" "-Wl,-headerpad,1144" "-Wl,-no_pie" "-Wl,-pagezero_size,4000000" "-o" "rustgo" "-Qunused-arguments" "/var/folders/ry/v14gg02d0y9cb2w9809hf6ch0000gn/T/go-link-412633279/go.o" "/var/folders/ry/v14gg02d0y9cb2w9809hf6ch0000gn/T/go-link-412633279/000000.o" "-g" "-O2" "-lpthread" "-lhello" "-L/Users/filippo/code/misc/rustgo"
0.34 cpu time
12641 symbols
5764 liveness data
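Jumping into Rust
Alright, the link succeeded. Now we need to somehow call the Rust function from our Go code.
We know how to call a Go function from Go. In assembly, the same call looks like CALL hello(SB), where SB is a virtual register that all global symbols are relative to.
To call an assembly function from Go, we make the compiler aware of it by declaring func hello() without a body.
I tried every combination of the above to call an external (Rust) function, but they all complained that either the symbol name or the function body could not be found.
Yet cgo somehow does manage to invoke such foreign functions in the end. How?
I stumbled on the answer a couple of days later.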
//go:cgo_import_static _cgoPREFIX_Cfunc__Cmalloc
//go:linkname __cgofn__cgoPREFIX_Cfunc__Cmalloc _cgoPREFIX_Cfunc__Cmalloc
var __cgofn__cgoPREFIX_Cfunc__Cmalloc byte
var _cgoPREFIX_Cfunc__Cmalloc = unsafe.Pointer(&__cgofn__cgoPREFIX_Cfunc__Cmalloc)
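That looks like an interesting pragma! //go:linkname just creates a symbol alias in the local scope, and the byte variable is most likely only there to give us something to take the address of. But //go:cgo_import_static imports an external symbol!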
hello.rs
#[no_mangle]
pub extern fn hello() {
println!("Hello, Rust!");
}
hello.go
package main
//go:cgo_import_static hello
func trampoline()
func main() {
println("Hello, Go!")
trampoline()
}
hello.s
TEXT ·trampoline(SB), 0, $2048
JMP hello(SB)
RET
CALL turned out to be a bit too smart to work, but with a simple JMP:
Hello, Go!
Hello, Rust!
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
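It crashes when it tries to return. Also, that $2048 value is the entire stack space Rust is allowed to use (assuming it even puts its stack in the right place). Still, I'm surprised it works at all!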
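Calling conventions
Now, to return cleanly and take some arguments, we need to look more closely at the Go and Rust calling conventions. A calling convention defines where arguments and return values sit across function calls.
The Go calling convention is described here and here. For Rust, we'll look at the default for FFI, which is the standard C calling convention.
We are also going to need a debugger.
The Go calling convention
The Go calling convention is mostly undocumented, but we need to understand it to proceed, so here is what we can learn from a disassembly. Let's look at a very simple function.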
// func foo(x, y uint64) uint64
TEXT ·foo(SB), 0, $256-24
MOVQ x+0(FP), DX
MOVQ DX, ret+16(FP)
RET
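foo has 256 (0x100) bytes of local frame, 16 bytes of arguments and 8 bytes of return value, and it returns its first argument.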
func main() {
foo(0xf0f0f0f0f0f0f0f0, 0x5555555555555555)
rustgo[0x49d785]: movabsq $-0xf0f0f0f0f0f0f10, %rax
rustgo[0x49d78f]: movq %rax, (%rsp)
rustgo[0x49d793]: movabsq $0x5555555555555555, %rax
rustgo[0x49d79d]: movq %rax, 0x8(%rsp)
rustgo[0x49d7a2]: callq 0x49d8a0 ; main.foo at hello.s:14
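The caller, seen above, does very little: it places the arguments on the stack at the bottom of its own frame (from rsp to 16(rsp); remember that the stack grows down) and executes CALL. The CALL pushes the return address onto the stack and jumps. There is no caller cleanup, just a plain RET.
Notice that rsp is fixed: the arguments are stored with movq, not push.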
rustgo`main.foo at hello.s:14:
rustgo[0x49d8a0]: movq %fs:-0x8, %rcx
rustgo[0x49d8a9]: leaq -0x88(%rsp), %rax
rustgo[0x49d8b1]: cmpq 0x10(%rcx), %rax
rustgo[0x49d8b5]: jbe 0x49d8ee ; main.foo + 78 at hello.s:14
[...]
rustgo[0x49d8ee]: callq 0x495d10 ; runtime.morestack_noctxt at asm_amd64.s:405
rustgo[0x49d8f3]: jmp 0x49d8a0 ; main.foo at hello.s:14
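The first four and last two instructions of the function check whether there is enough stack space and, if not, call runtime.morestack. They are probably skipped for NOSPLIT functions.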
rustgo[0x49d8b7]: subq $0x108, %rsp
[...]
rustgo[0x49d8e6]: addq $0x108, %rsp
rustgo[0x49d8ed]: retq
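Then comes the rsp management: the function subtracts 0x108, making room for the 0x100 bytes of frame plus 8 bytes of saved frame pointer in one step. So rsp points to the bottom of the function's frame and is managed by the callee. Before returning, rsp is restored to where it was, just past the return address.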
rustgo[0x49d8be]: movq %rbp, 0x100(%rsp)
rustgo[0x49d8c6]: leaq 0x100(%rsp), %rbp
[...]
rustgo[0x49d8de]: movq 0x100(%rsp), %rbp
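Next is the frame pointer, which is effectively pushed onto the stack right after the return address and then updated in rbp. So rbp is callee-saved too, and it should point to where the caller's rbp is stored so that stack traces can be unwound.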
rustgo[0x49d8ce]: movq 0x110(%rsp), %rdx
rustgo[0x49d8d6]: movq %rdx, 0x120(%rsp)
Finally, from the body itself we learn that return values go just above the arguments.
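As a hedged illustration of that layout, the sketch below computes FP-relative offsets for a function whose arguments and results are all 8-byte words, as in foo. The helper name frameLayout is made up for this example, and it ignores the padding rules that signatures with mixed-size types would need.

```go
package main

import "fmt"

// slot describes where one 8-byte value sits relative to the virtual FP register.
type slot struct {
	Name   string
	Offset int
}

// frameLayout is a hypothetical helper: it lays out arguments first and
// results right after them, 8 bytes each, as observed for foo.
// It assumes every value is one 8-byte word (no padding rules are modeled).
func frameLayout(args, results []string) (slots []slot, total int) {
	off := 0
	for _, name := range append(append([]string{}, args...), results...) {
		slots = append(slots, slot{name, off})
		off += 8
	}
	return slots, off
}

func main() {
	slots, total := frameLayout([]string{"x", "y"}, []string{"ret"})
	for _, s := range slots {
		fmt.Printf("%s+%d(FP)\n", s.Name, s.Offset)
	}
	fmt.Printf("args+results size: %d\n", total) // the 24 in $256-24
}
```

For foo this gives x at 0, y at 8 and ret at 16, which matches the x+0(FP) and ret+16(FP) operands in the assembly source.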
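Virtual registers
The Go documentation says that SP and FP are virtual registers, not just aliases of rsp and rbp.
Indeed, when SP is accessed together with a symbol name in Go assembly, as in x-8(SP), the offsets are adjusted relative to the real rsp so that SP points to the top of the frame rather than the bottom. A bare reference to the register, such as MOVQ SP, DX, accesses rsp directly.
The FP virtual register is also just an adjusted offset from rsp. It points to the bottom of the caller's frame, where the arguments are, and it cannot be accessed directly.
Note: Go maintains rbp and frame pointers to help debugging, but it still uses a fixed rsp and omit-stack-pointer-style rsp offsets for the virtual FP.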
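The disassembly of foo lets us check this arithmetic. The sketch below is only a model of what was observed there on amd64 with frame pointers enabled; the function name fpToRSP is hypothetical, and the layout it assumes (local frame, then saved rbp, then return address, then arguments) is read off that one example rather than taken from a specification.

```go
package main

import "fmt"

// fpToRSP converts an offset from the virtual FP register into an offset from
// the hardware rsp inside the function body.
// Assumed layout above rsp: the local frame, the saved rbp (8 bytes), the
// return address (8 bytes), and then the caller-provided arguments.
func fpToRSP(frameSize, fpOffset int) int {
	const savedRBP = 8
	const returnAddr = 8
	return frameSize + savedRBP + returnAddr + fpOffset
}

func main() {
	const frame = 0x100 // the $256 declared for foo
	fmt.Printf("x+0(FP)    -> %#x(%%rsp)\n", fpToRSP(frame, 0))
	fmt.Printf("ret+16(FP) -> %#x(%%rsp)\n", fpToRSP(frame, 16))
}
```

The two results, 0x110 and 0x120, are the same offsets that appear in the movq instructions of the compiled body.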
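The C calling convention
"sysv64", the default C calling convention on x86-64, is quite different:
Arguments are passed in registers: RDI, RSI, RDX, RCX, R8 and R9.
The return value goes in RAX.
Some registers are callee-saved: RBP, RBX and R12 through R15.
The stack must be aligned to 16 bytes. (I think this is why JMP worked and CALL did not: we had failed to align the stack.)
Frame pointers work the same way, and rustc generates them with -g.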
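To make the register-based convention concrete, here is a simplified model of how integer and pointer arguments are assigned under the System V AMD64 ABI. The function name assignIntArgs is invented for this sketch, and it deliberately ignores floating-point arguments, structs passed by value, and the exact stack layout of overflow arguments.

```go
package main

import "fmt"

// intArgRegs is the System V AMD64 order for integer and pointer arguments.
var intArgRegs = []string{"RDI", "RSI", "RDX", "RCX", "R8", "R9"}

// assignIntArgs is a simplified model: it only handles integer/pointer
// arguments and reports anything past the sixth as going to the stack.
func assignIntArgs(n int) []string {
	locs := make([]string, n)
	for i := range locs {
		if i < len(intArgRegs) {
			locs[i] = intArgRegs[i]
		} else {
			locs[i] = "stack"
		}
	}
	return locs
}

func main() {
	fmt.Println(assignIntArgs(2)) // e.g. a (dst, src) pair of pointers
	fmt.Println(assignIntArgs(7)) // the seventh argument spills to the stack
}
```

A glue routine between the two conventions therefore has to copy each Go stack argument into the matching register before calling, and copy RAX back into the Go return slot afterwards.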
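Gluing them together
Building a simple trampoline between the two conventions is not hard. We can also look at asmcgocall for inspiration, since it does roughly the same job for cgo.
We want the Rust function to use the stack space of our assembly function, since Go has already made sure that space exists. To do that, we roll rsp back from the end of the frame.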
package main
//go:cgo_import_static increment
func trampoline(arg uint64) uint64
func main() {
println(trampoline(41))
}
TEXT ·trampoline(SB), 0, $2048-16
MOVQ arg+0(FP), DI // Load the argument before messing with SP
MOVQ SP, BX // Save SP in a callee-saved register
ADDQ $2048, SP // Rollback SP to reuse this function's frame
ANDQ $~15, SP // Align the stack to 16-bytes
CALL increment(SB)
MOVQ BX, SP // Restore SP
MOVQ AX, ret+8(FP) // Place the return value on the stack
RET
#[no_mangle]
pub extern fn increment(a: u64) -> u64 {
return a + 1;
}
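The ANDQ $~15, SP step in the trampoline is the usual "clear the low four bits" trick for 16-byte alignment. A minimal Go sketch of the same arithmetic follows; the function name and the sample addresses are made up for illustration.

```go
package main

import "fmt"

// alignDown16 clears the low four bits of an address, which has the same
// effect as ANDQ $~15, SP: the result is the nearest multiple of 16 that is
// less than or equal to the input.
func alignDown16(sp uintptr) uintptr {
	return sp &^ 15
}

func main() {
	for _, sp := range []uintptr{0x7ffe1000, 0x7ffe1008, 0x7ffe100f} {
		fmt.Printf("%#x -> %#x\n", sp, alignDown16(sp))
	}
}
```

Rounding down moves the pointer by at most 15 bytes, so after the ADDQ $2048 it stays inside the frame the trampoline reserved.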
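CALL on macOS
CALL did not quite work on macOS. For some reason the function call was replaced there with an intermediate call to _cgo_thread_start, which is not that surprising given that we are using something called cgo_import_static and that CALL is a virtual instruction in Go assembly.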
callq 0x40a27cd ; x_cgo_thread_start + 29
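We can bypass that "helper" by using the full //go:linkname incantation found in the standard library to take a pointer to the function, and then calling through the function pointer, like this.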
import _ "unsafe"
//go:cgo_import_static increment
//go:linkname increment increment
var increment uintptr
var _increment = &increment
MOVQ ·_increment(SB), AX
CALL AX
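Is it fast?
The point of this whole exercise is to be able to call Rust in place of assembly, so a rustgo call must be almost as fast as an assembly call to be useful.
Benchmark time!
We'll compare incrementing a uint64 inline, through a //go:noinline function, through the rustgo call above, and through a cgo call to the very same Rust function.
Rust was compiled with -g -O, and these first numbers are from macOS.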
name time/op
CallOverhead/Inline 1.72ns ± 3%
CallOverhead/Go 4.60ns ± 2%
CallOverhead/rustgo 5.11ns ± 4%
CallOverhead/cgo 73.6ns ± 0%
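rustgo is 11% slower than a Go function call, and almost 15 times faster than cgo!
Performance is even better on Linux, where the function pointer workaround is not needed: the overhead is only 2%.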
name time/op
CallOverhead/Inline 1.67ns ± 2%
CallOverhead/Go 4.49ns ± 3%
CallOverhead/rustgo 4.58ns ± 3%
CallOverhead/cgo 69.4ns ± 0%
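For readers who want to reproduce the shape of the first two rows, here is a rough, self-contained harness. It is a sketch rather than the benchmark used for the numbers above: the function and variable names are invented, it only covers the Inline and Go cases, and the timings it prints will differ from machine to machine.

```go
package main

import (
	"fmt"
	"testing"
)

// incNoInline is kept out of line so each iteration pays for a real call.
//
//go:noinline
func incNoInline(x uint64) uint64 { return x + 1 }

// sink stops the compiler from discarding the loop results.
var sink uint64

func main() {
	inline := testing.Benchmark(func(b *testing.B) {
		var x uint64
		for i := 0; i < b.N; i++ {
			x++ // increment in place, no call
		}
		sink = x
	})
	call := testing.Benchmark(func(b *testing.B) {
		var x uint64
		for i := 0; i < b.N; i++ {
			x = incNoInline(x) // one real Go function call per iteration
		}
		sink = x
	})
	fmt.Println("CallOverhead/Inline", inline)
	fmt.Println("CallOverhead/Go    ", call)
}
```

testing.Benchmark can be called from an ordinary main function, which keeps the sketch to a single file.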
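A real example
For a real-world demo, I picked the excellent curve25519-dalek library, and specifically the task of multiplying the curve basepoint by a scalar and returning its Edwards representation.
The Cargo benchmarks swing widely between runs because of CPU frequency scaling, but they suggest the operation takes about 22.9μs ± 17%.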
test curve::bench::basepoint_mult ... bench: 17,276 ns/iter (+/- 3,057)
test curve::bench::edwards_compress ... bench: 5,633 ns/iter (+/- 858)
On the Go side, we expose a simple API.
func ScalarBaseMult(dst, in *[32]byte)
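On the Rust side, it is no different from building an interface for normal FFI.
I'll be honest, it took me a long time to figure out enough Rust to make this work.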
#![no_std]
extern crate curve25519_dalek;
use curve25519_dalek::scalar::Scalar;
use curve25519_dalek::constants;
#[no_mangle]
pub extern fn scalar_base_mult(dst: &mut [u8; 32], k: &[u8; 32]) {
let res = &constants::ED25519_BASEPOINT_TABLE * &Scalar(*k);
dst.clone_from(res.compress_edwards().as_bytes());
}
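To build the .a we use cargo build --release with a Cargo.toml that declares the dependencies, keeps debug info on, and configures curve25519-dalek without its default features.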
[package]
name = "ed25519-dalek-rustgo"
version = "0.0.0"
[lib]
crate-type = ["staticlib"]
[dependencies.curve25519-dalek]
version = "^0.9"
default-features = false
features = ["nightly"]
[profile.release]
debug = true
Finally, we need to adjust the trampoline to take two arguments and return no value.
TEXT ·ScalarBaseMult(SB), 0, $16384-16
MOVQ dst+0(FP), DI
MOVQ in+8(FP), SI
MOVQ SP, BX
ADDQ $16384, SP
ANDQ $~15, SP
MOVQ ·_scalar_base_mult(SB), AX
CALL AX
MOVQ BX, SP
RET
The result is a transparent Go call whose performance closely matches the pure Rust benchmark, and which is almost 6% faster than cgo!
name old time/op new time/op delta
RustScalarBaseMult 23.7μs ± 1% 22.3μs ± 4% -5.88% (p=0.003 n=5+7)
For comparison, github.com/agl/ed25519/edwards25519 provides similar functionality, and that pure-Go library takes almost 3 times as long.
h := &edwards25519.ExtendedGroupElement{}
edwards25519.GeScalarMultBase(h, &k)
h.ToBytes(&dst)
name time/op
GoScalarBaseMult 66.1μs ± 2%
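Packaging up
Now we know it actually works. But to be usable it has to be an importable package, not something forced into package main by an odd build process.
This is where //go:binary-only-package comes in. That annotation tells the compiler to ignore the package's source and use only the pre-built .a library file in $GOPATH/pkg.
If we can build a .a file that works with Go's native linker, we can redistribute it, and users can import the package as if it were a native one.
The Go side is simple.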
//go:binary-only-package
// Package edwards25519 implements operations on an Edwards curve that is
// isomorphic to curve25519.
//
// Crypto operations are implemented by calling directly into the Rust
// library curve25519-dalek, without cgo.
//
// You should not actually be using this.
package edwards25519
import _ "unsafe"
//go:cgo_import_static scalar_base_mult
//go:linkname scalar_base_mult scalar_base_mult
var scalar_base_mult uintptr
var _scalar_base_mult = &scalar_base_mult
// ScalarBaseMult multiplies the scalar in by the curve basepoint, and writes
// the compressed Edwards representation of the resulting point to dst.
func ScalarBaseMult(dst, in *[32]byte)
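The Makefile has to change quite a bit: since we are no longer building a binary, we cannot use go tool link.
A .a archive is just a collection of .o object files plus a symbol table. .a archives are managed by the ar UNIX tool or by its Go counterpart, go tool pack. The two formats differ slightly, so we use ar for libed25519_dalek_rustgo.a and go tool pack for edwards25519.a.
To bundle the two libraries I tried the simplest thing: extract libed25519_dalek_rustgo.a into a temporary folder, then pack the objects into edwards25519.a.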
edwards25519/edwards25519.a: edwards25519/rustgo.go edwards25519/rustgo.o target/release/libed25519_dalek_rustgo.a
	go tool compile -N -l -o $@ -p main -pack edwards25519/rustgo.go
	go tool pack r $@ edwards25519/rustgo.o # from edwards25519/rustgo.s
	mkdir -p target/release/libed25519_dalek_rustgo && cd target/release/libed25519_dalek_rustgo && \
		rm -f *.o && ar xv "$(CURDIR)/target/release/libed25519_dalek_rustgo.a"
	go tool pack r $@ target/release/libed25519_dalek_rustgo/*.o

.PHONY: install
install: edwards25519/edwards25519.a
	mkdir -p "$(shell go env GOPATH)/pkg/darwin_amd64/$(IMPORT_PATH)/"
	cp edwards25519/edwards25519.a "$(shell go env GOPATH)/pkg/darwin_amd64/$(IMPORT_PATH)/"
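Since the Makefile unpacks one archive with ar and repacks its members with go tool pack, it may help to see how simple the common ar container is. The sketch below lists the members of a classic ar archive held in memory. The helper names and the two sample members are invented, and it does not decode BSD "#1/LEN" or GNU long-name entries, nor any Go-specific details of the archives that go tool pack writes.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"strings"
)

const arMagic = "!<arch>\n"

// listAr returns the member names of a classic ar archive held in memory.
// Each member starts with a 60-byte header: name[16] mtime[12] uid[6] gid[6]
// mode[8] size[10] and the two terminator bytes "`\n".
func listAr(data []byte) ([]string, error) {
	if !bytes.HasPrefix(data, []byte(arMagic)) {
		return nil, fmt.Errorf("missing ar magic")
	}
	var names []string
	for off := len(arMagic); off+60 <= len(data); {
		hdr := data[off : off+60]
		if string(hdr[58:60]) != "`\n" {
			return nil, fmt.Errorf("bad header terminator at offset %d", off)
		}
		name := strings.TrimRight(string(hdr[0:16]), " ")
		size, err := strconv.Atoi(strings.TrimSpace(string(hdr[48:58])))
		if err != nil {
			return nil, err
		}
		names = append(names, strings.TrimSuffix(name, "/"))
		off += 60 + size + size%2 // member data is padded to an even length
	}
	return names, nil
}

// member builds one archive entry for the demo below (hypothetical helper).
func member(name string, body []byte) []byte {
	hdr := fmt.Sprintf("%-16s%-12s%-6s%-6s%-8s%-10d`\n", name, "0", "0", "0", "644", len(body))
	out := append([]byte(hdr), body...)
	if len(body)%2 == 1 {
		out = append(out, '\n')
	}
	return out
}

func main() {
	archive := []byte(arMagic)
	archive = append(archive, member("rustgo.o", []byte("abc"))...)
	archive = append(archive, member("core.o", []byte("wxyz"))...)
	names, err := listAr(archive)
	fmt.Println(names, err)
}
```

This is roughly what ar t does for short member names; real object archives also carry a symbol table member such as __.SYMDEF.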
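Imagine my surprise when it worked!
With the .a in place, it is just a matter of writing a simple program that uses the package.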
package main
import (
"bytes"
"encoding/hex"
"fmt"
"testing"
"github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519"
)
func main() {
input, _ := hex.DecodeString("39129b3f7bbd7e17a39679b940018a737fc3bf430fcbc827029e67360aab3707")
expected, _ := hex.DecodeString("1cc4789ed5ea69f84ad460941ba0491ff532c1af1fa126733d6c7b62f7ebcbcf")
var dst, k [32]byte
copy(k[:], input)
edwards25519.ScalarBaseMult(&dst, &k)
if !bytes.Equal(dst[:], expected) {
fmt.Println("rustgo produces a wrong result!")
}
fmt.Printf("BenchmarkScalarBaseMult\t%v\n", testing.Benchmark(func(b *testing.B) {
for i := 0; i < b.N; i++ {
edwards25519.ScalarBaseMult(&dst, &k)
}
}))
}
And running go build!
$ go build -ldflags '-linkmode external -extldflags -lresolv'
$ ./ed25519-dalek-rustgo
BenchmarkScalarBaseMult 100000 19914 ns/op
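Well, it almost worked. We cheated: the binary would not link unless we linked it against libresolv. To be fair, the Rust compiler tried to tell us.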
note: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: System
note: library: resolv
note: library: c
note: library: m
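Now, linking against system libraries would be a problem, because it will never happen with internal linking and cross-compilation...
But hold on, libresolv? Why does our no_std, stack-only Rust library want to resolve DNS names?
I really meant no_std
The problem is that the library is not actually no_std. Look at everything in there: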
$ ar t target/release/libed25519_dalek_rustgo.a
__.SYMDEF
ed25519_dalek_rustgo-742a1d9f1c101d86.0.o
ed25519_dalek_rustgo-742a1d9f1c101d86.crate.allocator.o
curve25519_dalek-03e3ca0f6d904d88.0.o
subtle-cd04b61500f6e56a.0.o
std-72653eb2361f5909.0.o
panic_unwind-d0b88496572d35a9.0.o
unwind-da13b913698118f9.0.o
arrayref-2be0c0ff08ae2c7d.0.o
digest-f1373d68da35ca45.0.o
generic_array-95ca86a62dc11ddc.0.o
nodrop-7df18ca19bb4fc21.0.o
odds-3bc0ea0bdf8209aa.0.o
typenum-a61a9024d805e64e.0.o
rand-e0d585156faee9eb.0.o
alloc_system-c942637a1f049140.0.o
libc-e038d130d15e5dae.0.o
alloc-0e789b712308019f.0.o
std_unicode-9735142be30abc63.0.o
compiler_builtins-8a5da980a34153c7.0.o
absvdi2.o
absvsi2.o
absvti2.o
[... snip ...]
truncsfhf2.o
ucmpdi2.o
ucmpti2.o
core-9077840c2cc91cbf.0.o
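So how do we actually make it no_std? It turned into a whole side quest; here is a recap.
If any dependency is not no_std, the no_std flag is nullified. One of the curve25519-dalek dependencies had this problem, and cargo update fixed it.
Building a no_std staticlib is more like building a no_std executable, which is harder because it has to be self-contained.
We need to define the lang items that normally live in the standard library, such as panic_fmt.
panic_fmt also has to be marked no_mangle (rust-lang/rust#38281), compiler_builtins has to be imported explicitly (rust-lang/rust#43264), and the rlibc crate supplies memcpy.
This all boils down to a few arcane lines at the top of our lib.rs.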
#![no_std]
#![feature(lang_items, compiler_builtins_lib, core_intrinsics)]
use core::intrinsics;
#[allow(private_no_mangle_fns)] #[no_mangle] // rust-lang/rust#38281
#[lang = "panic_fmt"] fn panic_fmt() -> ! { unsafe { intrinsics::abort() } }
#[lang = "eh_personality"] extern fn eh_personality() {}
extern crate compiler_builtins; // rust-lang/rust#43264
extern crate rlibc;
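And with that, go build works on macOS.
Linux
On Linux, nothing works.
External linking complains that fmax and other symbols are missing, and it appears to be right.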
$ ld -r -o linux.o target/release/libed25519_dalek_rustgo/*.o
$ nm -u linux.o
U _GLOBAL_OFFSET_TABLE_
U abort
U fmax
U fmaxf
U fmaxl
U logb
U logbf
U logbl
U scalbn
U scalbnf
U scalbnl
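A friend suggested making sure I was using --gc-sections to strip dead code, which might reference things we don't actually need. Sure enough, this worked.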
$ go build -ldflags '-extld clang -linkmode external -extldflags -Wl,--gc-sections'
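But in the Makefile we are not using a linker at all, so where does --gc-sections go? The answer is to stop hacking .a files together and actually read the linker man page.
We can build a .o containing a given symbol and everything it references with ld -r --gc-sections -u $SYMBOL. -r makes the output reusable in a later link, and -u marks a symbol as needed so that it is not garbage collected along with everything else. $SYMBOL is scalar_base_mult in our case.
Why was this not a problem on macOS? It would have been if we had linked manually, but the macOS toolchain apparently strips dead symbols by default.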
$ ld -e _scalar_base_mult target/release/libed25519_dalek_rustgo/*.o
Undefined symbols for architecture x86_64:
"___assert_rtn", referenced from:
_compilerrt_abort_impl in int_util.o
"_copysign", referenced from:
___divdc3 in divdc3.o
___muldc3 in muldc3.o
"_copysignf", referenced from:
___divsc3 in divsc3.o
___mulsc3 in mulsc3.o
"_copysignl", referenced from:
___divxc3 in divxc3.o
___mulxc3 in mulxc3.o
"_fmax", referenced from:
___divdc3 in divdc3.o
"_fmaxf", referenced from:
___divsc3 in divsc3.o
"_fmaxl", referenced from:
___divxc3 in divxc3.o
"_logb", referenced from:
___divdc3 in divdc3.o
"_logbf", referenced from:
___divsc3 in divsc3.o
"_logbl", referenced from:
___divxc3 in divxc3.o
"_scalbn", referenced from:
___divdc3 in divdc3.o
"_scalbnf", referenced from:
___divsc3 in divsc3.o
"_scalbnl", referenced from:
___divxc3 in divxc3.o
ld: symbol(s) not found for inferred architecture x86_64
$ ld -e _scalar_base_mult -dead_strip target/release/libed25519_dalek_rustgo/*.o
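Dead-code stripping of this kind can be pictured as a reachability walk from the root symbol. The toy model below is only an illustration of the idea, not how any real linker is implemented: the function name reachable is invented, the __divdc3 edges mirror the undefined-symbol report above, and the scalar_base_mult to memcpy edge is an assumption made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// reachable returns every symbol that can be reached from root by following
// references, which is the set a garbage-collecting link would keep.
func reachable(refs map[string][]string, root string) []string {
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		sym := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[sym] {
			continue
		}
		seen[sym] = true
		stack = append(stack, refs[sym]...)
	}
	kept := make([]string, 0, len(seen))
	for sym := range seen {
		kept = append(kept, sym)
	}
	sort.Strings(kept)
	return kept
}

func main() {
	// Toy reference graph; the memcpy edge is hypothetical.
	refs := map[string][]string{
		"scalar_base_mult": {"memcpy"},
		"__divdc3":         {"fmax", "logb", "scalbn", "copysign"},
	}
	fmt.Println(reachable(refs, "scalar_base_mult"))
}
```

Because nothing reachable from scalar_base_mult refers to __divdc3 in this model, fmax and friends never need to be resolved, which is the effect --gc-sections and -dead_strip have on the real link.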
Here is the part of the Makefile that works with external linking. Note that macOS prefixes symbol names with an underscore.
edwards25519/edwards25519.a: edwards25519/rustgo.go edwards25519/rustgo.o edwards25519/libed25519_dalek_rustgo.o
	go tool compile -N -l -o $@ -p main -pack edwards25519/rustgo.go
	go tool pack r $@ edwards25519/rustgo.o edwards25519/libed25519_dalek_rustgo.o
edwards25519/libed25519_dalek_rustgo.o: target/$(TARGET)/release/libed25519_dalek_rustgo.a
ifeq ($(shell go env GOOS),darwin)
	$(LD) -r -o $@ -arch x86_64 -u "_$(SYMBOL)" $^
else
	$(LD) -r -o $@ --gc-sections -u "$(SYMBOL)" $^
endif
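The platform split in that rule can also be written as a small function, which makes the two differences easy to see. This is a sketch that simply mirrors the flags in the Makefile; the function name ldArgs and the output file name are made up, and nothing here invokes a linker.

```go
package main

import (
	"fmt"
	"strings"
)

// ldArgs mirrors the two branches of the Makefile rule: the darwin branch
// prefixes the symbol with an underscore and passes -arch, while the other
// branch passes --gc-sections.
func ldArgs(goos, symbol, out string, inputs ...string) []string {
	args := []string{"-r", "-o", out}
	if goos == "darwin" {
		args = append(args, "-arch", "x86_64", "-u", "_"+symbol)
	} else {
		args = append(args, "--gc-sections", "-u", symbol)
	}
	return append(args, inputs...)
}

func main() {
	for _, goos := range []string{"darwin", "linux"} {
		args := ldArgs(goos, "scalar_base_mult", "rustgo.o", "libed25519_dalek_rustgo.a")
		fmt.Println("ld", strings.Join(args, " "))
	}
}
```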
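The last missing piece was internal linking on Linux. In short, the Rust code was not being linked properly even though the build appeared to succeed: relocations were not applied, and the CALL instructions in the Rust function were left pointing at meaningless addresses. It turned out that //go:cgo_import_static is not enough for internal linking; //go:cgo_import_dynamic is needed as well.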
//go:cgo_import_static scalar_base_mult
//go:cgo_import_dynamic scalar_base_mult
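I still have no idea why leaving it out causes that behavior, but adding it finally made the rustgo package build with both external and internal linking, on both Linux and macOS.
Redistributable
Now that we can build a .a, we can follow the suggestion in the //go:binary-only-package documentation and build a tarball containing the linux_amd64 and darwin_amd64 .a files together with the package source, to be unpacked into a GOPATH.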
$ tar tf ed25519-dalek-rustgo_go1.8.3.tar.gz
src/github.com/FiloSottile/ed25519-dalek-rustgo/
src/github.com/FiloSottile/ed25519-dalek-rustgo/.gitignore
src/github.com/FiloSottile/ed25519-dalek-rustgo/Cargo.lock
src/github.com/FiloSottile/ed25519-dalek-rustgo/Cargo.toml
src/github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519/
src/github.com/FiloSottile/ed25519-dalek-rustgo/main.go
src/github.com/FiloSottile/ed25519-dalek-rustgo/Makefile
src/github.com/FiloSottile/ed25519-dalek-rustgo/release.sh
src/github.com/FiloSottile/ed25519-dalek-rustgo/src/
src/github.com/FiloSottile/ed25519-dalek-rustgo/target.go
src/github.com/FiloSottile/ed25519-dalek-rustgo/src/lib.rs
src/github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519/rustgo.go
src/github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519/rustgo.s
pkg/linux_amd64/github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519.a
pkg/darwin_amd64/github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519.a
Once installed like this, the package can be used just like a native one.
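The location of each pre-built archive follows a fixed pattern, which the sketch below spells out. It only reproduces the layout visible in the tarball listing; the function name archivePath is invented, and the paths are relative to whichever GOPATH the tarball is unpacked into.

```go
package main

import (
	"fmt"
	"path"
)

// archivePath mirrors the tarball layout: pkg/GOOS_GOARCH/<import path>.a
func archivePath(goos, goarch, importPath string) string {
	return path.Join("pkg", goos+"_"+goarch, importPath+".a")
}

func main() {
	const pkg = "github.com/FiloSottile/ed25519-dalek-rustgo/edwards25519"
	for _, target := range [][2]string{{"linux", "amd64"}, {"darwin", "amd64"}} {
		fmt.Println(archivePath(target[0], target[1], pkg))
	}
}
```

The two printed paths match the last two entries of the tarball listing.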
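The one thing to watch out for is that if we build Rust with -Ctarget-cpu=native, the result may not run on older CPUs. The benchmarks below compare a generic build, a Haswell build, and a native build.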
$ benchstat bench-none.txt bench-haswell.txt
name old time/op new time/op delta
ScalarBaseMult/rustgo 22.0μs ± 3% 20.2μs ± 2% -8.41% (p=0.001 n=7+6)
$ benchstat bench-haswell.txt bench-native.txt
name old time/op new time/op delta
ScalarBaseMult/rustgo 20.2μs ± 2% 20.1μs ± 2% ~ (p=0.945 n=6+7)
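As a finishing touch, the Makefile honors GOOS and GOARCH and maps them to Rust targets, so with a cross-compiling Rust setup you can even cross-compile the .a itself.
Turning it into a real thing
Well, this was fun.
To be clear, rustgo is not something you should use in production. For example, I suspect I should be saving g before the jump, the stack size is completely arbitrary, and shrinking the trampoline frame like that will probably confuse debuggers.
To make it a real thing, I would start by calling morestack manually from a NOSPLIT assembly function to make sure there is enough goroutine stack space, instead of rolling back rsp.
A Rust-side collection of FFI types, such as GoSlice, would also be nice.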
#[repr(C)]
struct GoSlice {
    array: *mut u8,
    // Go's int is pointer-sized (64 bits on amd64), so len and cap are not i32.
    len: isize,
    cap: isize,
}