Zig supports inline assembly, which is useful when:
- Writing an operating system, which requires direct access to special CPU registers and controllers
- Implementing syscalls in standard libraries
- Accessing microcontroller features on embedded systems
- Handwriting a performance-critical hot-path function where the optimizer doesn’t do the right thing
Inline assembly, however, can quickly get unwieldy, and getting the input/output constraints right can be tricky.
In this post, we’ll explore an alternative: write the assembly in separate files, run your favorite assembler, and then embed the resulting machine code using @embedFile. Running the assembler can obviously be automated from the build script.
The code is tested on macOS, but should work on Linux with minor modifications (primarily the syscall number: write is 0x2000004 on macOS and 1 on x86-64 Linux).
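As a rough sketch of the build-script automation mentioned above, a build.zig along these lines would run NASM before compiling. It assumes the Zig 0.11-era build API (newer releases spell the source path differently), an executable name of "asm" chosen here for illustration, and that nasm is on the PATH.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Assemble the raw machine code; the command runs from the build root.
    const nasm = b.addSystemCommand(&.{
        "nasm", "-f", "bin", "-o", "asm-add.bin", "asm-add.s",
    });

    // Assumption: Zig 0.11-style options. Later versions use b.path("asm.zig").
    const exe = b.addExecutable(.{
        .name = "asm",
        .root_source_file = .{ .path = "asm.zig" },
        .target = target,
        .optimize = optimize,
    });

    // Make sure the .bin file exists before @embedFile looks for it.
    exe.step.dependOn(&nasm.step);
    b.installArtifact(exe);
}
```

With this in place, zig build would assemble and compile in one go. The command is rerun on every build, which is fine for files this small.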
Object files vs. raw machine code
Why not simply generate object files from the assembler and link them with the Zig code? That’s likely the better option in most cases. However, in freestanding and embedded environments, you might not want the overhead and relocation complexity. Embedding machine code has very niche use cases, but, hey, it’s fun and still useful to know, and it’s an excuse to look into efficient use of ABI calling conventions.
Assembling
We’ll be using NASM to assemble the code, which allows us to emit raw machine code without any additional metadata.
Below is a simple snippet that adds two 64-bit numbers and returns the result:
[BITS 64]
mov rax, rdi
add rax, rsi
ret
There’s no function prologue and epilogue - we’re not using the stack, so there’s no need for the overhead. There isn’t even a label naming the function.
However, as far as the System V AMD64 ABI is concerned, this is a function, and we’ll be able to call it from Zig. Given the ABI calling convention, we know that the first argument is passed in rdi and the second in rsi, and we know that we should put the result in rax. The BITS 64 directive tells NASM to produce 64-bit machine code.
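For comparison, the assembly above behaves like the following Zig function. This is only a sketch and not part of the original code: the name addReference is made up here, and +% (wrapping addition) is used because the add instruction wraps silently on overflow, while plain + is safety-checked in debug builds.

```zig
const std = @import("std");

// Hypothetical reference implementation, for comparison only.
// callconv(.C) selects the platform C ABI. On x86-64 macOS and Linux that is
// System V AMD64: `a` arrives in rdi, `b` in rsi, the result leaves in rax.
fn addReference(a: u64, b: u64) callconv(.C) u64 {
    // The `add` instruction wraps on overflow, hence `+%` rather than `+`.
    return a +% b;
}

pub fn main() void {
    std.debug.print("{d}\n", .{addReference(1, 2)});
}
```

The function type used here is the same one the embedded machine code has to be cast to later on.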
Let’s assemble it:
nasm -f bin -o asm-add.bin asm-add.s
Thanks to -f bin, this will produce a file called asm-add.bin containing the raw machine code, with no metadata like headers and sections. Just raw machine code for each instruction in the assembly file.
You can check that the .bin file is correct by disassembling it:
ndisasm -b 64 asm-add.bin
This will output the same instructions as in the assembly file.
We might as well dig a little deeper and look at the actual machine code:
hexdump -C asm-add.bin
00000000 48 89 f8 48 01 f0 c3
That’s 7 bytes of machine code to add two numbers:
; 0x48 is the REX.W prefix, which selects 64-bit operands
; 0x89 is the MOV opcode (MOV r/m64, r64)
; 0xf8 is the ModR/M byte, which specifies the source and destination registers
48 89 f8: mov rax, rdi
; 0x48 is the REX.W prefix again
; 0x01 is the ADD opcode (ADD r/m64, r64)
; 0xf0 is the ModR/M byte, which specifies the operand registers
48 01 f0: add rax, rsi
; 0xc3 is simply the RET instruction. It pops the return address from the stack and jumps to it
; The return address is placed there by the CALL instruction that Zig generates.
c3: ret
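To make the ModR/M bytes above less opaque, here is a small sketch that splits such a byte into its three bit fields. The ModRM type and decodeModRM function are hypothetical helpers written for this illustration. The register numbers involved are rax = 0, rsi = 6 and rdi = 7.

```zig
const std = @import("std");

// Hypothetical helper type; field names follow the usual ModR/M terminology.
const ModRM = struct {
    mod: u2, // addressing mode; 0b11 means both operands are registers
    reg: u3, // register operand (the source for opcodes 0x89 and 0x01)
    rm: u3, // register/memory operand (the destination for 0x89 and 0x01)
};

fn decodeModRM(byte: u8) ModRM {
    return ModRM{
        .mod = @as(u2, @truncate(byte >> 6)), // top two bits
        .reg = @as(u3, @truncate(byte >> 3)), // middle three bits
        .rm = @as(u3, @truncate(byte)), // low three bits
    };
}

pub fn main() void {
    // 0xf8 belongs to `mov rax, rdi`, 0xf0 to `add rax, rsi`.
    for ([_]u8{ 0xf8, 0xf0 }) |byte| {
        const m = decodeModRM(byte);
        std.debug.print("0x{x}: mod={d} reg={d} rm={d}\n", .{ byte, m.mod, m.reg, m.rm });
    }
}
```

For 0xf8 the fields are mod = 3, reg = 7 (rdi) and rm = 0 (rax), which matches mov rax, rdi. For 0xf0 the reg field is 6 (rsi).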
Calling from Zig
Next, we’ll call the add function from Zig.
The process is roughly as follows:
- Allocate a page of memory
- Write the machine code to the page. The machine code is loaded at compile time with @embedFile
- Mark the page as readable and executable using mprotect
- Cast the address of the buffer to a function pointer
- Call the function
asm.zig
const std = @import("std");
pub fn main() !void {
const code = try std.heap.page_allocator.alignedAlloc(u8, std.mem.page_size, std.mem.page_size);
defer {
// In order to deallocate, we have to make the page writable again
std.os.mprotect(code, std.os.PROT.WRITE) catch unreachable;
std.heap.page_allocator.free(code);
}
// Wrap the code page in a buffer stream and write the machine code to it
var buf = std.io.fixedBufferStream(code);
_ = try buf.write(@embedFile("asm-add.bin"));
try std.os.mprotect(code, std.os.PROT.READ | std.os.PROT.EXEC);
// Make a Zig function pointer for adding two u64s and returning the result
const add: *const fn(a: u64, b: u64) callconv(.C) u64 = @ptrCast(code);
// Call the machine code through the function pointer
// This will put the arguments into rdi and rsi, and return the result in rax
const res = add(1, 2);
std.debug.print("Res = {d}\n", .{res});
}
Let’s run it:
zig run asm.zig
Res = 3
The structure is pretty nice: all the assembly code is in a separate file, and we can call it from Zig with no overhead beyond using registers for arguments, according to a specific calling convention.
If multiple CPU architectures are supported, the correct machine code file can be selected at compile time, typically by switching on builtin.cpu.arch.
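A minimal sketch of that compile-time selection follows. In a real program each arm would be an @embedFile of a per-architecture .bin file; literal bytes are used here as an assumption so the example compiles without extra files. The aarch64 bytes encode add x0, x0, x1 followed by ret and are included purely for illustration.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Only the arm matching the target architecture is analyzed, so a missing
// architecture turns into a compile error instead of a runtime surprise.
const add_code: []const u8 = switch (builtin.cpu.arch) {
    // mov rax, rdi; add rax, rsi; ret
    .x86_64 => "\x48\x89\xf8\x48\x01\xf0\xc3",
    // add x0, x0, x1; ret
    .aarch64 => "\x00\x00\x01\x8b\xc0\x03\x5f\xd6",
    else => @compileError("no machine code for this architecture"),
};

pub fn main() void {
    std.debug.print("{s}: {d} bytes of machine code\n", .{
        @tagName(builtin.cpu.arch),
        add_code.len,
    });
}
```

The rest of the program stays the same, since both variants follow their platform's C calling convention.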
Let’s take a quick look at what Zig generates for the add(1,2) call:
mov rax, qword ptr [rbp - 96]
mov edi, 1
mov esi, 2
call rax
The first line loads the pointer to the add function into rax. The next two lines put the arguments into edi and esi, and the last line calls the function through rax. Note that edi and esi are the lower 32-bit halves of rdi and rsi. On x86-64, writing to a 32-bit register zeroes the upper half of the corresponding 64-bit register, so reading rdi and rsi works as expected in the add implementation.
A larger example with fast memcpy and syscalls
Here’s an expanded version of the example, with two more functions: an AVX based memcpy, and an example of using syscalls to print a string to stdout.
asm.zig
const std = @import("std");
pub fn main() !void {
const code = try std.heap.page_allocator.alignedAlloc(u8, std.mem.page_size, std.mem.page_size);
defer {
// In order to deallocate, we have to make the page writable again
std.os.mprotect(code, std.os.PROT.WRITE) catch unreachable;
std.heap.page_allocator.free(code);
}
// Wrap the code page in a buffer stream and write the machine code to it
var buf = std.io.fixedBufferStream(code);
_ = try buf.write(@embedFile("asm-add.bin"));
try std.os.mprotect(code, std.os.PROT.READ | std.os.PROT.EXEC);
// Make a Zig function pointer for adding two u64s and returning the result
const add: *const fn(a: u64, b: u64) callconv(.C) u64 = @ptrCast(code);
// Call the machine code through the function pointer
const res = add(1, 2);
std.debug.print("Res = {d}\n", .{res});
// Fast memcpy. To make the example short, we simply overwrite the code page with the memcpy code.
// In a real program, we would append the memcpy code to the code page, at a suitably aligned offset.
try std.os.mprotect(code, std.os.PROT.WRITE);
try buf.seekTo(0);
_ = try buf.write(@embedFile("asm-opt-memcpy.bin"));
try std.os.mprotect(code, std.os.PROT.READ | std.os.PROT.EXEC);
// This memcpy requires the destination and source to be aligned to 32 bytes, and len to be a multiple of 64 bytes
const fast_memcpy: *const fn(dst: u64, src: u64, len: u64) callconv(.C) ?[*]const u8 = @ptrCast(code);
const dst = try std.heap.page_allocator.alignedAlloc(u8, 32, 64);
defer std.heap.page_allocator.free(dst);
const src = try std.heap.page_allocator.alignedAlloc(u8, 32, 64);
defer std.heap.page_allocator.free(src);
const test_bytes = "0123456789012345678901234567890123456789012345678901234567891234";
@memcpy(src, test_bytes);
_ = fast_memcpy(@intFromPtr(dst.ptr), @intFromPtr(src.ptr), 64);
if (std.mem.eql(u8, test_bytes, dst)) {
std.debug.print("fast_memcpy works!\n", .{});
} else {
std.debug.print("fast_memcpy failed!\n", .{});
}
// Next machine code file writes a message to stdout. We once again overwrite the code page for simplicity.
try std.os.mprotect(code, std.os.PROT.WRITE);
try buf.seekTo(0);
_ = try buf.write(@embedFile("asm-syscall.bin"));
try std.os.mprotect(code, std.os.PROT.READ | std.os.PROT.EXEC);
const hello_world: *const fn(msg: ?[*:0]const u8, len: u64) callconv(.C) void = @ptrCast(code);
hello_world("Hello, world!!!\n", 16);
}
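The listing above overwrites the code page for brevity. Appending each blob at an aligned offset instead could be sketched as follows. The helpers alignUp and appendCode are hypothetical names, the 16-byte alignment is an assumption rather than a requirement stated in this post, and a plain array stands in for the real code page.

```zig
const std = @import("std");

// Round `offset` up to the next multiple of `alignment` (a power of two).
fn alignUp(offset: usize, alignment: usize) usize {
    std.debug.assert(std.math.isPowerOfTwo(alignment));
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Copy `blob` into `page` at the first 16-byte aligned offset at or after
// `cursor.*`, advance the cursor, and return where the blob was placed.
fn appendCode(page: []u8, cursor: *usize, blob: []const u8) !usize {
    const start = alignUp(cursor.*, 16);
    if (start + blob.len > page.len) return error.OutOfSpace;
    @memcpy(page[start..][0..blob.len], blob);
    cursor.* = start + blob.len;
    return start;
}

pub fn main() !void {
    var page: [64]u8 = undefined; // stand-in for the mprotect-ed page
    var cursor: usize = 0;
    const add_code = "\x48\x89\xf8\x48\x01\xf0\xc3";
    const first = try appendCode(&page, &cursor, add_code);
    const second = try appendCode(&page, &cursor, add_code);
    std.debug.print("first at {d}, second at {d}\n", .{ first, second });
}
```

Each function pointer would then be taken from the page address plus the returned offset, and the page would only need a single switch to read and execute after all blobs are written.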
Below are the two new assembly files.
asm-opt-memcpy.s
In this case, we do need the function prologue and epilogue, because we’re using the stack to store the arguments and local variables.
This source is based on a disassembly on Compiler Explorer.
[BITS 64]
fast_memcpy:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 256
mov qword [rsp + 104], rdi
mov qword [rsp + 96], rsi
mov qword [rsp + 88], rdx
mov qword [rsp + 80], 64
.LBB1_1:
cmp qword [rsp + 88], 0
je .LBB1_3
mov rax, qword [rsp + 96]
mov qword [rsp + 120], rax
mov rax, qword [rsp + 120]
vmovdqa ymm0, [rax]
vmovdqa [rsp + 32], ymm0
mov rax, qword [rsp + 96]
add rax, 32
mov qword [rsp + 112], rax
mov rax, qword [rsp + 112]
vmovdqa ymm0, [rax]
vmovdqa [rsp], ymm0
mov rax, qword [rsp + 104]
vmovdqa ymm0, [rsp + 32]
mov qword [rsp + 232], rax
vmovdqa [rsp + 192], ymm0
vmovdqa ymm0, [rsp + 192]
mov rax, qword [rsp + 232]
vmovntdq [rax], ymm0
mov rax, qword [rsp + 104]
add rax, 32
vmovdqa ymm0, [rsp]
mov qword [rsp + 184], rax
vmovdqa [rsp + 128], ymm0
vmovdqa ymm0, [rsp + 128]
mov rax, qword [rsp + 184]
vmovntdq [rax], ymm0
mov rcx, qword [rsp + 80]
mov rax, qword [rsp + 88]
sub rax, rcx
mov qword [rsp + 88], rax
mov rax, qword [rsp + 80]
add rax, qword [rsp + 96]
mov qword [rsp + 96], rax
mov rax, qword [rsp + 80]
add rax, qword [rsp + 104]
mov qword [rsp + 104], rax
jmp .LBB1_1
.LBB1_3:
mov rsp, rbp
pop rbp
vzeroupper
ret
asm-syscall.s
[BITS 64]
; set up syscall arguments
mov rdx, rsi
mov rsi, rdi
; pick the write syscall on macOS
mov rax, 0x2000004
; stdout
mov rdi, 1
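; invoke
syscall
; return to the caller; without this, execution would run into whatever bytes follow in the code page
ret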
Running the expanded example:
zig run asm.zig
Res = 3
fast_memcpy works!
Hello, world!!!