On intercepting syscalls on Linux
Sjors Holtrop

System calls (syscalls) are the primary way for applications to request services from the operating system. Hooking a syscall to run custom code, or to replace its behavior entirely, has many use cases: debugging, monitoring, emulation/virtualization, performance profiling, and malware behavior analysis. For the sake of this article, I will assume your CPU is x86-64 (x64), as not all the methods are portable. There are quite a few ways to hook syscalls on Linux, so let’s dive into them.

ptrace

Perhaps the most commonly used method, the ptrace syscall allows a tracer process to attach to a tracee process, control its flow of execution, and read or edit its registers and memory. Many debugging tools, such as gdb and strace, are built on it. Specifically, the tracer can resume the tracee with PTRACE_SYSCALL, which makes the kernel stop the tracee at its next syscall entry or exit. When the tracee enters a syscall, the tracer is given control and can inspect or modify the syscall number and its arguments. Similarly, when the syscall returns, the tracer is given control again and can modify the return value. This is the most robust way of handling syscall interception, because the kernel reports every syscall regardless of how the tracee issues it. The price is speed: every stop involves context switches between the tracee and the tracer. An example using strace:

# Log open file syscalls of all processes
sudo strace -e trace=open,openat -f -p `pgrep -d, .`

eBPF

eBPF is a virtual machine for sandboxed execution of user-defined code within the OS kernel. Programs are compiled to eBPF bytecode, usually from a restricted subset of C or from a higher-level front end such as bpftrace. The instruction set is deliberately limited (not Turing-complete) so that the kernel’s verifier can statically check certain properties (e.g., that the program always terminates) before the program is loaded. It comes with Linux by default.

You can use it to run code in response to certain events within the kernel, most often syscalls and network activity. A common use case is inspecting or filtering network packets. While eBPF is great for monitoring syscalls, it is not a general-purpose way to modify the arguments of incoming syscalls or their return values. In addition, because eBPF programs are limited and sandboxed, they cannot perform operations such as network I/O in response to syscalls.

For a quick example:

// test.bpf
tracepoint:syscalls:sys_enter_openat {
    printf("PID %d (%s) opened file %s\n", pid, comm, str(args->filename));
}

You can run this script with bpftrace:

sudo bpftrace ./test.bpf

SystemTap

SystemTap is a tool that allows you to write a script to attach handlers which respond to events within the kernel. SystemTap compiles this script into a kernel module and loads it. To intercept syscalls, you can write a SystemTap script (*.stp) that attaches to certain events, e.g.:

// openprobe.stp
probe syscall.openat
{
    // Log the PID and the file being opened
    printf("PID %d (%s) is opening file: %s\n", pid(), execname(), user_string($filename))
}

You can then run it with:

sudo stap ./openprobe.stp

This seems very simple, perhaps too good to be true. And it is, unfortunately. SystemTap has numerous issues, including slow startup times, since every script must first be compiled into a kernel module, and frequent crashes.

LD_PRELOAD

The LD_PRELOAD trick allows you to load a dynamic library before any others, so that its symbols take precedence over identically named symbols in the libraries loaded after it. This means you can override the syscall wrapper functions in libc, such as read and write, effectively intercepting the underlying syscall as well. This is extremely fast, as the hook runs inside the process itself and needs no extra round trips to the kernel or to another process.

The downside is that you cannot intercept everything this way. Some programs statically link libc, or do not link libc at all. Moreover, not every syscall has an associated wrapper in libc, e.g. openat2. In these cases, LD_PRELOAD cannot help you.

This approach only affects processes that are started with LD_PRELOAD set, so you cannot use it to log the syscalls of every process on the system or of a process that is already running. The example is rather long because we have to deal with the fact that open is variadic (i.e. it accepts a variable number of arguments):

// ldpreload_open.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdarg.h>
#include <unistd.h>

// Pointer to the original open()
static int (*real_open)(const char *pathname, int flags, ...) = NULL;

// Override open()
int open(const char *pathname, int flags, ...) {
    if (!real_open) {
        // Load the original open function
        real_open = dlsym(RTLD_NEXT, "open");
    }

    // Log to stderr so the hooked program's stdout stays untouched
    fprintf(stderr, "[LD_PRELOAD] PID %d: open called with file: %s\n", getpid(), pathname);

    // The optional mode argument is only present with O_CREAT or O_TMPFILE
    mode_t mode = 0;
    if ((flags & O_CREAT) || (flags & O_TMPFILE) == O_TMPFILE) {
        va_list args;
        va_start(args, flags);
        mode = va_arg(args, mode_t);
        va_end(args);
    }

    // open() ignores mode when it is not needed, so always passing it is safe
    int fd = real_open(pathname, flags, mode);

    return fd;
}

Now compile this to a shared library:

gcc -shared -fPIC -o ldpreload_open.so ldpreload_open.c -ldl

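Then run a program with it. Only calls that go through the open symbol are logged; a program that uses a different wrapper, such as openat or open64, would need that wrapper overridden as well: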

LD_PRELOAD=./ldpreload_open.so ls /

zpoline

zpoline is a novel research project that employs dynamic binary rewriting to intercept syscalls entirely in userspace. It works as follows:

  1. Before the process executes its main function, zpoline rewrites every syscall instruction in the binary to callq *%rax. This works because both instructions are exactly 2 bytes long (0f 05 and ff d0, respectively), so the replacement fits in place.

  2. The process will now instead call the address held in register rax. At this point, rax contains the number of the syscall the process wants to execute, which is a small integer (between 0 and 456). This would normally be an invalid address (e.g., 0 is a null pointer), but zpoline has mapped trampoline code at these addresses (hence the name).

  3. The trampoline code is entered and it will call a function in a dynamic library you specified at launch time. It provides you with the syscall number as well. Your function can run any custom code and return any value for the syscall it wants.

Binary-rewriting the entire process at the beginning is quite expensive. You also suffer this cost for any child processes spawned afterwards that also need their syscalls intercepted. However, once this is done, zpoline only incurs a slight runtime overhead. It is therefore best suited to long-lived processes that do not spawn a lot of child processes.

For an example of zpoline, first clone the repo:

git clone https://github.com/yasukata/zpoline

Compile the zpoline library:

cd zpoline
make

Then in apps/basic, replace main.c’s contents with this:

#include <stdio.h>
#include <sys/syscall.h>

typedef long (*syscall_fn_t)(long, long, long, long, long, long, long);

static syscall_fn_t next_sys_call = NULL;

// a1 is the syscall number; a2..a7 are the syscall's arguments
static long hook_function(long a1, long a2, long a3, long a4, long a5, long a6,
                          long a7) {
  if (a1 == SYS_openat) {
    // openat(dirfd, pathname, flags, mode): pathname is the second argument
    printf("openat called with filename: %s\n", (const char *)a3);
  }
  // Forward the call to the original syscall handler
  return next_sys_call(a1, a2, a3, a4, a5, a6, a7);
}

int __hook_init(long placeholder __attribute__((unused)),
                void *sys_call_hook_ptr) {
  next_sys_call = *((syscall_fn_t *)sys_call_hook_ptr);
  *((syscall_fn_t *)sys_call_hook_ptr) = hook_function;

  return 0;
}

Build the example from the repository root:

make -C apps/basic

and run it with:

# Prerequisite for zpoline to work: allow mapping memory at virtual address 0
sudo sh -c 'echo 0 > /proc/sys/vm/mmap_min_addr'
LD_PRELOAD=./libzpoline.so LIBZPHOOK=./apps/basic/libzphook_basic.so ls /

Summary

We’ve covered five ways of intercepting syscalls on Linux. For a quick summary of the methods we’ve compared, here is a table:

Method | Modify syscalls | Intercepts everything | Disallow syscalls | Fast at runtime | Fast startup | No kernel code needed
ptrace
eBPF
SystemTap
LD_PRELOAD
zpoline