On intercepting syscalls on Linux
System calls (syscalls) are the primary way for applications to request services from the operating system. Hooking a syscall to run custom code, or to replace its function entirely, has a lot of use cases: debugging, monitoring, emulation/virtualization, performance monitoring and malware behavior analysis. For the sake of this article, I will assume your CPU is x86-64 (x64), as not all the methods are portable. There are quite a few ways to hook syscalls on Linux, so let’s dive into them.
ptrace
Perhaps the most commonly used method, the ptrace syscall allows a tracer process to attach to a tracee process in order to control its flow of execution and edit its memory contents.
This is used by many debugging tools such as gdb and strace.
Specifically, the tracer can resume the tracee with PTRACE_SYSCALL, at which point the kernel marks the tracee’s syscalls as traced. When the tracee executes a syscall, the tracer is given control and can inspect or modify the syscall. Similarly, when the syscall returns, the tracer is given control again and can modify the return value.
This is the most robust way of handling syscall interception. The price is speed: every traced syscall stops the tracee twice and hands control to the tracer, so it is slow at runtime.
An example:
# Log open file syscalls of all processes
sudo strace -e trace=open,openat -f -p `pgrep -d, .`
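To make the enter/exit stops described above more concrete, here is a minimal sketch of a tracer in C. It is not how strace is implemented; the function name trace_syscalls is made up for this illustration, the register names (orig_rax, rax) assume x86-64, and pending signals are simply dropped rather than forwarded to keep the loop short.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv under ptrace and log every syscall.
 * Returns the number of syscalls seen returning, or -1 on error. */
long trace_syscalls(char *const argv[]) {
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Tracee: ask to be traced, then stop until the tracer is ready. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        raise(SIGSTOP);
        execvp(argv[0], argv);
        _exit(127);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Report syscall stops as SIGTRAP|0x80 so they differ from real signals. */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)PTRACE_O_TRACESYSGOOD) < 0)
        goto fail;

    long completed = 0;
    int entering = 1; /* syscall stops alternate between entry and exit */
    for (;;) {
        /* Resume until the next syscall entry or exit.
         * Simplification: a data argument of 0 drops any pending signal. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0)
            goto fail;
        if (waitpid(child, &status, 0) < 0)
            goto fail;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return completed;
        if (!WIFSTOPPED(status) || WSTOPSIG(status) != (SIGTRAP | 0x80))
            continue; /* some other stop, not a syscall */

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, child, NULL, &regs) < 0)
            goto fail;

        if (entering) {
            fprintf(stderr, "enter syscall %llu\n",
                    (unsigned long long)regs.orig_rax);
        } else {
            fprintf(stderr, "exit  syscall %llu = %lld\n",
                    (unsigned long long)regs.orig_rax, (long long)regs.rax);
            completed++;
        }
        entering = !entering;
    }

fail:
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return -1;
}
```

Calling trace_syscalls with an argv such as {"ls", "/", NULL} from a main function prints one line per stop to stderr. A tracer that wants to modify a syscall would write the registers back with PTRACE_SETREGS at the same two points.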
eBPF
eBPF is a virtual machine for sandboxed execution of user-defined code within the OS kernel. This code is written in a limited language also called eBPF. The language is limited (non-Turing complete) so that it can be statically checked for certain properties (e.g., no non-terminating programs) before it runs; the kernel's verifier performs this check when the program is loaded. It comes with Linux by default.
You can use it to run code in response to certain events within the kernel, most often syscalls and network activity. A common use case is inspecting/filtering network packets. While eBPF is great for monitoring syscalls, it does not allow modifications to arguments of incoming syscalls or their return values. In addition, due to eBPF being limited and sandboxed, operations such as network I/O cannot be performed in response to syscalls.
For a quick example:
// test.bpf
tracepoint:syscalls:sys_enter_openat {
    printf("PID %d (%s) opened file %s\n", pid, comm, str(args->filename));
}
You can run this script with bpftrace:
sudo bpftrace ./test.bpf
SystemTap
SystemTap is a tool that allows you to write a script to attach handlers which respond to events within the kernel. SystemTap compiles this script into a kernel module and loads it. To intercept syscalls, you can write a SystemTap script (*.stp) that attaches to certain events, e.g.:
// openprobe.stp
probe syscall.openat
{
    // Log the PID and the file being opened
    printf("PID %d (%s) is opening file: %s\n", pid(), execname(), user_string($filename))
}
You can then run it with
sudo stap ./openprobe.stp
This seems very simple, perhaps too good to be true. And it is, unfortunately. SystemTap has numerous issues, including its slow startup times and frequent crashes.
LD_PRELOAD
The LD_PRELOAD trick allows you to load a dynamic library before any others, so that its symbols override ones that are defined later. This means you can override the syscall wrapper functions in libc, such as read and write, effectively intercepting the underlying syscall as well. This is extremely fast, as interception adds no extra round trip to the kernel.
The downside is that you cannot intercept everything this way. Some programs statically link libc, or do not link libc at all. Moreover, not every syscall has an associated wrapper in libc, e.g. openat2. In these cases, LD_PRELOAD cannot help you.
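To make that limitation concrete, here is a small sketch of a program opening a file without going through the open wrapper at all. The function name raw_open_readonly is made up for this illustration. It uses libc's generic syscall function, so a preloaded library that only overrides open never sees the call.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a file read-only by issuing the openat syscall
 * through syscall(2) instead of calling the open() or openat() wrappers.
 * Returns a file descriptor, or -1 with errno set on failure. */
int raw_open_readonly(const char *path) {
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

A hook on syscall itself would catch this particular case, but a program that embeds the syscall instruction directly (for example, a statically linked binary) is out of reach for any LD_PRELOAD hook.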
You also cannot use this to log syscalls for all processes at once; it only applies to a process you launch with LD_PRELOAD set. The example is rather long because we have to deal with the fact that open is variadic (i.e. it accepts a variable number of arguments):
// ldpreload_open.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

// Pointer to the original open()
static int (*real_open)(const char *pathname, int flags, ...) = NULL;

// Override open()
int open(const char *pathname, int flags, ...) {
    if (!real_open) {
        // Look up the next definition of open() in load order (normally libc's)
        real_open = dlsym(RTLD_NEXT, "open");
    }

    // Log the call
    printf("[LD_PRELOAD] PID %d: open called with file: %s\n", getpid(), pathname);

    // The mode argument is only passed when O_CREAT or O_TMPFILE is set
    if ((flags & O_CREAT) || (flags & O_TMPFILE) == O_TMPFILE) {
        va_list args;
        va_start(args, flags);
        mode_t mode = va_arg(args, mode_t);
        va_end(args);
        return real_open(pathname, flags, mode);
    }
    return real_open(pathname, flags);
}
Now compile this to a shared library:
gcc -shared -fPIC -o ldpreload_open.so ldpreload_open.c -ldl
Then run a program with it:
LD_PRELOAD=./ldpreload_open.so ls /
zpoline
zpoline is a novel research project that employs dynamic binary rewriting to intercept syscalls entirely in userspace. It works as follows:
- Before the process executes its main function, zpoline binary-rewrites all syscall instructions to call %rax. This works because both of these instructions are 2 bytes long.
- The process will now instead jump to the address held in register rax. At this point, rax contains the number specifying which syscall the process wants to execute, which is between 0 and 456. This would normally be an invalid address (e.g., 0 is a null pointer), but zpoline has written trampoline code at these addresses (hence the name).
- The trampoline code is entered and calls a function in a dynamic library you specified at launch time, passing it the syscall number as well. Your function can run any custom code and return any value it wants for the syscall.
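The rewriting step can be sketched in a few lines of C. This is only an illustration of why the equal instruction length matters, not zpoline's actual code. The function name rewrite_syscalls is made up, it works on a plain byte buffer rather than on live executable pages, and it scans bytes naively; a real rewriter has to find instruction boundaries so that it does not patch a 0x0F 0x05 pair that happens to sit inside another instruction or in data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace every `syscall` (0x0F 0x05) in the buffer
 * with an indirect call through rax (0xFF 0xD0). Both encodings are two
 * bytes, so the patch never shifts the surrounding code.
 * Returns the number of instructions patched. */
size_t rewrite_syscalls(uint8_t *code, size_t len) {
    size_t patched = 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0x0F && code[i + 1] == 0x05) {
            code[i] = 0xFF;     /* opcode: indirect call */
            code[i + 1] = 0xD0; /* ModRM byte selecting rax as the target */
            patched++;
            i++; /* skip the second byte of the instruction just patched */
        }
    }
    return patched;
}
```

Because the replacement has exactly the same length, every jump target and offset in the rest of the code stays valid, which is what lets the rewrite happen in place.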
Binary-rewriting the entire process at the beginning is quite expensive. You also suffer this cost for any child processes spawned afterwards that also need their syscalls intercepted. However, once this is done, zpoline only incurs a slight runtime overhead. It is therefore best suited to long-lived processes that do not spawn a lot of child processes.
For an example of zpoline, first clone the repo:
git clone https://github.com/yasukata/zpoline
Compile the zpoline library:
make
Then in apps/basic, replace main.c’s contents with this:
#include <stdio.h>
#include <sys/syscall.h>

typedef long (*syscall_fn_t)(long, long, long, long, long, long, long);

static syscall_fn_t next_sys_call = NULL;

// a1 is the syscall number; a2..a7 are the syscall's own arguments
static long hook_function(long a1, long a2, long a3, long a4, long a5, long a6,
                          long a7) {
    if (a1 == SYS_openat) {
        // openat(dirfd, pathname, ...): the pathname is the second argument
        printf("openat called with filename:%s\n", (const char *)a3);
    }
    // printf("output from hook_function: syscall number %ld\n", a1);
    return next_sys_call(a1, a2, a3, a4, a5, a6, a7);
}

int __hook_init(long placeholder __attribute__((unused)),
                void *sys_call_hook_ptr) {
    // Save the original handler, then install ours in its place
    next_sys_call = *((syscall_fn_t *)sys_call_hook_ptr);
    *((syscall_fn_t *)sys_call_hook_ptr) = hook_function;
    return 0;
}
Make the example:
make
and run it with:
# Prerequisite: zpoline must be allowed to map its trampoline at address 0
sudo sh -c 'echo 0 > /proc/sys/vm/mmap_min_addr'
LD_PRELOAD=./libzpoline.so LIBZPHOOK=./apps/basic/libzphook_basic.so ls /
Summary
We’ve covered five ways of intercepting syscalls on Linux. For a quick summary of the methods we’ve compared, here is a table:
Method | Modify syscalls | Intercepts everything | Disallow syscalls | Fast at runtime | Fast startup | No kernel code needed |
---|---|---|---|---|---|---|
ptrace | ✔ | ✔ | ✔ | ✖ | ✔ | ✔ |
eBPF | ✖ | ✔ | ✖ | ✔ | ✔ | ✔ |
SystemTap | ✔ | ✔ | ✔ | ✖ | ✖ | ✖ |
LD_PRELOAD | ✔ | ✖ | ✖ | ✔ | ✔ | ✔ |
zpoline | ✔ | ✔ | ✖ | ✔ | ✖ | ✔ |