Seccomp/BPF

Seccomp

Sandboxing is hard. But if it is needed and seccomp is used, ensure the following is done:

BPF filter handlers and ptrace

If the seccomp filter uses the SECCOMP_RET_TRACE action, then all BPF filter handlers must be reviewed. These handlers are usually implemented in a process (the “tracer”) that is tracing the sandboxed process (the “tracee”) with the ptrace mechanism. Ensure that the following is done.

  • All ptrace tracing options (e.g., PTRACE_O_TRACEFORK, PTRACE_O_TRACECLONE, etc.) are used on the tracee.
    • Otherwise, the tracee may escape the sandbox by creating child processes and calling execve.
  • The PTRACE_O_EXITKILL option is set on all sandboxed processes.
    • Otherwise, a crash of the tracer process allows malicious processes to trace sandboxed tracees and handle all SECCOMP_RET_TRACE actions (ptrace-stop events).
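The two items above can be sketched as a single option mask that the tracer applies to every tracee. This is a minimal illustration, not a complete tracer: the names `SANDBOX_PTRACE_OPTIONS` and `set_sandbox_options` are hypothetical, and the exact set of options a given sandbox needs is an assumption.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical option mask for a sandboxing tracer. */
const long SANDBOX_PTRACE_OPTIONS =
      PTRACE_O_EXITKILL       /* kill the tracee if the tracer exits */
    | PTRACE_O_TRACEFORK      /* automatically trace fork children */
    | PTRACE_O_TRACEVFORK     /* automatically trace vfork children */
    | PTRACE_O_TRACECLONE     /* automatically trace clone children */
    | PTRACE_O_TRACEEXEC      /* stop on execve */
    | PTRACE_O_TRACESECCOMP   /* report SECCOMP_RET_TRACE to the tracer */
    | PTRACE_O_TRACESYSGOOD;  /* mark syscall stops with SIGTRAP | 0x80 */

/* Applies the mask to an already attached tracee.
 * Returns 0 on success, -1 with errno set on failure. */
int set_sandbox_options(pid_t tracee)
{
    if (ptrace(PTRACE_SETOPTIONS, tracee, NULL,
               (void *)SANDBOX_PTRACE_OPTIONS) == -1)
        return -1;
    return 0;
}
```

A failure here should be treated as fatal for the sandbox, since the tracee would otherwise keep running with an incomplete set of options.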
  • Syscalls that the tracee may use to impact the tracer are forbidden:
    • rt_sigqueueinfo and rt_tgsigqueueinfo with any tracer’s TID
    • kill, tgkill, and tkill with any tracer’s TID
    • setpriority, sched_{setaffinity,setattr,setparam,setscheduler}, and prlimit64 with any tracer’s TID
    • process_vm{writev,readv} with any tracer’s TID
    • open, rename, and other similar syscalls on /proc/<tracer-pid> and /proc/<tracer-pid>/task/<tracer-tid>
  • The correct syscall table is used to determine which syscall caused the ptrace-stop event.
    • This information is not provided in the event.
    • To determine the syscall number, the tracer can do the following:
      • Use PTRACE_GET_SYSCALL_INFO in new kernels
      • Read registers (e.g., with PTRACE_GETREGSET) and check the following in old kernels (this is a bit tricky and is often prone to TOCTOU bugs):
        • Bitness of registers (RAX versus EAX)
        • Instruction used to execute the syscall (the int 0x80 and sysenter instructions in x64 use x86’s table, and syscall uses x64’s table)
        • CS register flags
        • Use of the vsyscall mechanism (RIP & ~0x0C00 == 0xFFFFFFFFFF600000)
          • If used, then finding the executed instruction is more complicated; the stack needs to be parsed to find the saved RIP.
    • The kernel downcasts syscall numbers to ints (at least modern kernels do, both for 64-bit and 32-bit syscall tables), and the BPF filter should use 32-bit opcodes for syscall numbers.
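On the BPF side, the syscall table can be pinned down by checking `seccomp_data.arch` before looking at the syscall number, and by loading the number with a 32-bit opcode. The sketch below assumes an x86-64 sandbox that sends only execve to the tracer; the array name `trace_execve_filter` and the choice to kill x32 callers are illustrative assumptions.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Same value as the kernel's __X32_SYSCALL_BIT. */
#define X32_SYSCALL_BIT 0x40000000u

struct sock_filter trace_execve_filter[] = {
    /* [0] A = seccomp_data.arch (32-bit load) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] only the x86-64 table is accepted; otherwise jump to [7] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 5),
    /* [2] A = seccomp_data.nr, an int, so a 32-bit load */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [3] x32 shares AUDIT_ARCH_X86_64 but sets bit 30: jump to [7] */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 3, 0),
    /* [4] execve goes to the tracer at [5], everything else to [6] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_execve, 0, 1),
    /* [5] */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    /* [6] */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [7] */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
```

Because the filter rejects every architecture except one, the ptrace handler for the traced syscall only has to interpret registers for that single table.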
  • Handlers interpret arguments (registers) according to the ABI (values may be silently truncated).
    • Both the BPF filter and a handler should be in agreement on arguments’ bitness. Check BPF opcode docs to establish bitness.
  • Memory-level race conditions are dealt with.
    • If the tracer reads the tracee’s memory, then the memory could have been asynchronously modified by another thread (within a single thread group or process) or by another process (if some memory was explicitly shared). It is hard to fix this race condition completely.
    • For a single process, this race condition can be fixed by pausing all other threads of the tracee when the syscall entry hook is called and unfreezing them on syscall exit. It is possible to pause a group of tasks (processes or threads) using the cgroups (control groups) Linux kernel feature. In cgroups v1, the freezer controller/subsystem can be used to do so. In cgroups v2, the cgroup.freeze file can be written to in order to freeze all tasks within a cgroup (and tasks in all descendant cgroups). We also recommend reading the Thread Granularity section of the cgroup v2 documentation.
    • To protect against shared memory race conditions, the tracer would have to freeze all processes that the memory map was shared with. Alternatively, the memory could be exclusively locked (assuming it cannot be unlocked by other processes; this solution would need further investigation).
    • The userfaultfd syscall enables attackers to win races with 100% reliability.
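For the cgroups v2 approach mentioned above, freezing amounts to writing "1" to the cgroup's cgroup.freeze file. The helper below is a minimal sketch: the name `set_cgroup_frozen` is hypothetical, and it assumes the tracer already knows the directory of the cgroup that contains the tracee's tasks.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Writes "1" (freeze) or "0" (thaw) to <cgroup_dir>/cgroup.freeze.
 * Returns 0 on success, -1 on failure. */
int set_cgroup_frozen(const char *cgroup_dir, int frozen)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                       /* path too long */

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;                       /* no such cgroup or no permission */

    ssize_t written = write(fd, frozen ? "1" : "0", 1);
    int close_rc = close(fd);
    return (written == 1 && close_rc == 0) ? 0 : -1;
}
```

Freezing is not instantaneous: the cgroup v2 documentation describes a "frozen" key in cgroup.events that reports when the operation has completed, so a tracer would need to wait for that before reading the tracee's memory.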
  • Operating-system-level race conditions are dealt with.
    • All the common vulnerabilities apply here, such as changing file paths with symlinks, and changing the tracee’s resources, like the current working directory, environment variables, and file descriptors (or anything under /proc/<pid>).
    • Note that the two race conditions above can happen inside a handler execution but also during handler versus kernel execution (inside a syscall).
  • If a syscall should be dropped (skipped), it is done in syscall-enter-stop (or ptrace-stop), not in syscall-exit-stop. The syscall-exit-stop event occurs after the syscall is executed, so even if the syscall’s return value is modified to indicate an error in that stop, the syscall has already executed and its effects cannot be undone.
  • If the tracee should be killed in a handler, it is done with SIGKILL (i.e., it is terminated immediately) and not by any delayed mechanism that would allow the tracee to execute after the handler returns but before the termination.
  • Signals and ptrace events are correctly handled.
    • The syscall-enter-stop and syscall-exit-stop events must be correctly tracked by the tracer. The ptrace API does not provide a means to differentiate between these two. A common exploit is to manually send a SIGTRAP signal to the tracee or to make the tracee execute int 3 (the software interrupt instruction) to confuse the handlers. A common solution is to use the PTRACE_O_TRACESYSGOOD option: it allows the tracer to easily differentiate between syscall-{enter,exit}-stop and other kinds of stops.
    • SIGKILL can terminate a process abruptly, and that event must be handled correctly.
      • The syscall-exit-stop event may not be delivered after syscall execution. Tracing of syscalls should start within syscall-enter-stop and end within either syscall-exit-stop or the tracee’s exit event. Otherwise, an executed syscall can be missed by the tracer.
      • According to the ptrace man page, “the tracer cannot assume that the ptrace-stopped tracee exists.”
      • The PTRACE_EVENT_EXIT event may or may not be delivered after SIGKILL, depending on kernel version.
    • A wait status with WIFEXITED(status) or WIFSIGNALED(status) is not always reported upon tracee termination. For example, it is not reported when a thread (that is not thread group leader) calls execve. The tracer should additionally use PTRACE_EVENT_EXEC to detect all possible tracee terminations.
    • When a new process is forked or cloned, the PTRACE_EVENT_CLONE (for the parent) and PTRACE_EVENT_STOP (for the child) events are delivered simultaneously in undetermined order. This behavior may enable race condition bugs.
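The stop handling described above can be sketched as a function that classifies the status returned by waitpid. It assumes the tracer set PTRACE_O_TRACESYSGOOD and PTRACE_O_TRACESECCOMP; the enum and the name `classify_stop` are hypothetical.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NOT_STOPPED,   /* tracee exited or was killed */
    STOP_SYSCALL,       /* syscall-enter-stop or syscall-exit-stop */
    STOP_SECCOMP,       /* PTRACE_EVENT_SECCOMP (SECCOMP_RET_TRACE) */
    STOP_OTHER_EVENT,   /* any other PTRACE_EVENT_* stop */
    STOP_SIGNAL         /* signal-delivery-stop, including a forged SIGTRAP */
};

enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NOT_STOPPED;

    /* ptrace events are reported in the bits above the stop signal. */
    int event = (status >> 16) & 0xffff;
    if (event == PTRACE_EVENT_SECCOMP)
        return STOP_SECCOMP;
    if (event != 0)
        return STOP_OTHER_EVENT;

    /* With PTRACE_O_TRACESYSGOOD, only real syscall stops carry bit 7. */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;

    return STOP_SIGNAL;
}
```

A SIGTRAP that the tracee sends to itself, or one raised by int 3, ends up in `STOP_SIGNAL` and cannot be mistaken for a syscall stop. The function does not tell syscall-enter-stop from syscall-exit-stop; the tracer still has to track that per thread.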
  • The clone syscall with the CLONE_UNTRACED flag is not allowed.
    • This flag allows the tracee to clone in a way that the child is not traced by the original tracer. The tracee can then attach to the new thread via ptrace and handle all the SECCOMP_RET_TRACE actions (effectively disabling relevant seccomp filters). Note that this trick does not work for seccomp actions that explicitly drop syscalls (like SECCOMP_RET_ERRNO)—these are still blocked. To prevent this vulnerability, the following must be done:
      • Handle the clone syscall and remove the flag in the ptrace-stop event or block clone with the flag (blocking should be implemented in BPF, as it is less error-prone).
        • The clone3 syscall must be blocked and its return value must be set to ENOSYS. This syscall stores its arguments in memory, so they cannot be inspected in the BPF filter; a ptrace handler would be vulnerable to TOCTOU attacks. An ENOSYS error makes programs fall back to using clone.
      • Note that the kernel may have reversed the order of clone arguments (the CONFIG_CLONE_BACKWARDS* configurations; consult the Flatpak seccomp implementation).
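The BPF-side blocking described above could look like the fragment below. It is a sketch under several assumptions: the flags are in the first syscall argument (the x86-64 order), the machine is little-endian so the low 32 bits of an argument sit at its base offset, and an architecture check has already run earlier in the filter. The array name `clone_filter` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/sched.h>      /* CLONE_UNTRACED */
#include <linux/seccomp.h>
#include <sys/syscall.h>

#ifndef SYS_clone3
#define SYS_clone3 435        /* fallback for older headers */
#endif

struct sock_filter clone_filter[] = {
    /* [0] A = syscall number (32-bit load) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] clone3: jump to [6] and fail with ENOSYS */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_clone3, 4, 0),
    /* [2] anything other than clone: jump to [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_clone, 0, 2),
    /* [3] A = low 32 bits of args[0], assumed to hold the clone flags */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* [4] CLONE_UNTRACED set: jump to [7] */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, CLONE_UNTRACED, 2, 0),
    /* [5] */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [6] */ BPF_STMT(BPF_RET | BPF_K,
                       SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    /* [7] */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
```

On architectures where the kernel uses a different clone argument order, the offset in instruction [3] would have to point at whichever argument carries the flags.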
  • If the tracer uses PTRACE_PEEK* ptrace calls, then errors are handled correctly: the errno must be consulted, not the return value.
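A minimal wrapper illustrating the errno rule for PTRACE_PEEK* requests is shown below. The name `peek_word` is hypothetical; the pattern of clearing errno before the call follows the ptrace man page.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads one word from the tracee's memory at addr.
 * Returns 0 and stores the word in *out, or returns -1 with errno set. */
int peek_word(pid_t tracee, void *addr, long *out)
{
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, tracee, addr, NULL);
    if (word == -1 && errno != 0)
        return -1;   /* genuine failure */
    *out = word;     /* -1 is a legitimate value when errno stayed 0 */
    return 0;
}
```

Checking only `word == -1` would misreport every memory word whose value is all ones as an error, and checking nothing would let a failed read pass as data.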
  • Return values of the tracer’s calls to ptrace(ATTACH/SEIZE) are checked.
    • An error means there is no effective sandboxing, as a malicious process may then try to attach to the tracee.
    • Note that the tracee (or the malicious process) may try to force an error in multiple ways, such as by using prctl(PR_SET_PTRACER) or prctl(PR_SET_DUMPABLE, 0), or by changing the Yama policy (/proc/sys/kernel/yama/ptrace_scope).
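Checking the attach result can be as small as the sketch below. The name `seize_tracee` is hypothetical, and passing only PTRACE_O_EXITKILL is a simplification; a real tracer would pass its full option mask.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/* Attaches to the tracee with PTRACE_SEIZE.
 * Returns 0 on success, -1 with errno set on failure.
 * On failure the caller is expected to stop the sandboxed process
 * (for example with kill(tracee, SIGKILL)) rather than let it run
 * without a tracer. */
int seize_tracee(pid_t tracee)
{
    if (ptrace(PTRACE_SEIZE, tracee, NULL,
               (void *)(long)PTRACE_O_EXITKILL) == -1)
        return -1;
    return 0;
}
```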
  • Syscalls executed via the vsyscall mechanism are handled correctly.
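The vsyscall address test given earlier in this checklist, RIP & ~0x0C00 == 0xFFFFFFFFFF600000, can be written as a small predicate. Explicit parentheses are needed in C, because == binds more tightly than &. The names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VSYSCALL_BASE UINT64_C(0xFFFFFFFFFF600000)

/* True if rip matches the vsyscall mask from the checklist; the legacy
 * entry points sit at offsets 0x000, 0x400, and 0x800 from the base. */
bool rip_in_vsyscall_page(uint64_t rip)
{
    return (rip & ~UINT64_C(0x0C00)) == VSYSCALL_BASE;
}
```

A tracer that reads the tracee's registers on a stop could call this on the saved RIP to decide whether it needs the more complicated stack-based lookup described above.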
  • The following obscure syscalls are blocked, if possible. While these syscalls should not enable seccomp bypasses, they usually should be blocked “just in case” and because they can be abused to circumvent filters for other syscalls:
This content is licensed under a Creative Commons Attribution 4.0 International license.