Seccomp #
Sandboxing is hard. But if it is needed and seccomp is used, ensure the following is done:
- The BPF filter checks the architecture (e.g., x86-64 versus i386).
- The BPF filter checks the ABI/calling convention (e.g., x86-64 versus x32 ABIs for the x86-64 architecture).
  - This means checking for syscalls with numbers greater than or equal to 0x40000000 (that is, with the __X32_SYSCALL_BIT flag set).
- Syscalls implemented in vsyscall and VDSO are handled correctly.
  - This is important only for optimized syscalls like gettimeofday, time, getcpu, and clock_gettime.
  - Seccomp interaction with VDSO requires further research.
- io_uring syscalls are blocked.
  - These syscalls allow programs to execute some syscalls without the BPF filter noticing them. Note that Docker blocks io_uring syscalls by default.
- All semantically equivalent syscalls are handled in the same way (e.g., chmod, fchmod, fchmodat, and fchmodat2; seccomp and prctl(PR_SET_SECCOMP)).
  - Consult the kernel’s syscall tables.
- Syscalls enabling code execution in the kernel (e.g., kexec_file_load, finit_module) are prevented.
  - A malicious kernel module can easily manipulate the seccomp sandbox.
- If any of the following syscalls are blocked or traced, then the restart_syscall syscall is also blocked or traced: poll, nanosleep, clock_nanosleep, or futex.
- Old kernel versions are supported if needed:
  - For Linux kernel versions prior to 5.4, the BPF filter checks for compat syscalls confusion (i.e., calling 64-bit ABI syscalls with the __X32_SYSCALL_BIT bit).
  - For Linux kernel versions prior to 4.8, the BPF filter disables the use of ptrace for all sandboxed processes.
  - For ancient Linux kernel versions, the sandbox blocks access to the /proc filesystem instead of seccomp-related syscalls.
  - The Android LG kernel’s buggy sys_set_media_ext syscall is handled correctly.
- If special handling of syscalls is needed (not simply allow/disallow), then mechanisms like landlock or namespaces are used.
- If special handling of syscalls is needed and seccomp must be used for the task, then the SECCOMP_SET_MODE_FILTER option is used with SECCOMP_RET_TRACE actions and ptrace.
  - SECCOMP_SET_MODE_FILTER is not used with SECCOMP_FILTER_FLAG_NEW_LISTENER and seccomp_unotify. This mechanism is inherently insecure. Consult the seccomp_unotify man page.
  - A similar syscall user dispatch mechanism is also inherently insecure.
  - The SECCOMP_RET_USER_NOTIF actions have precedence over the SECCOMP_RET_TRACE actions: after the seccomp sandbox is enabled, adding filters that return SECCOMP_RET_USER_NOTIF must not be allowed. The most secure solution is to forbid seccomp and prctl(PR_SET_SECCOMP) altogether.
  - The checklist on BPF filter handlers and ptrace below is consulted.
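The first few checklist items (architecture check, x32 ABI check, blocking io_uring and kernel code-loading syscalls) can be sketched as a classic BPF program. The following is a minimal illustration, not a complete sandbox: it assumes x86-64 syscall numbers, the helper names (`build_filter`, `pack_filter`, `run_filter`) are hypothetical, and the default-allow tail exists only to keep the example short (a real filter would normally use an allowlist). `run_filter` is a toy interpreter for the four opcodes used here, so the logic can be exercised without installing the filter.

```python
import struct

# Classic BPF opcode fields (values from <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_K = 0x00, 0x20, 0x00
BPF_JEQ, BPF_JGE = 0x10, 0x30

# Seccomp actions (<linux/seccomp.h>) and audit arch tokens (<linux/audit.h>).
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
X32_SYSCALL_BIT = 0x40000000
EPERM = 1

# Offsets into struct seccomp_data { int nr; __u32 arch; ... }.
OFF_NR, OFF_ARCH = 0, 4

# Assumption: x86-64 syscall numbers; equivalent syscalls are listed together.
DENIED = {
    "io_uring_setup": 425, "io_uring_enter": 426, "io_uring_register": 427,
    "init_module": 175, "finit_module": 313,
    "kexec_load": 246, "kexec_file_load": 320,
}

def stmt(code, k):
    return (code, 0, 0, k)

def jump(code, k, jt, jf):
    return (code, jt, jf, k)

def build_filter():
    prog = [
        # 1. Kill anything that is not native x86-64 (e.g., i386 via int 0x80).
        stmt(BPF_LD | BPF_W | BPF_ABS, OFF_ARCH),
        jump(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        # 2. x32 shares AUDIT_ARCH_X86_64, so also reject nr >= __X32_SYSCALL_BIT.
        #    The load is 32 bits wide, matching the kernel's int-sized nr.
        stmt(BPF_LD | BPF_W | BPF_ABS, OFF_NR),
        jump(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
        stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    ]
    # 3. Deny io_uring and kernel code-loading syscalls with EPERM.
    for nr in DENIED.values():
        prog.append(jump(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1))
        prog.append(stmt(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM))
    # Default-allow only for brevity; prefer an allowlist in practice.
    prog.append(stmt(BPF_RET | BPF_K, SECCOMP_RET_ALLOW))
    return prog

def pack_filter(prog):
    """Serialize as struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }[]."""
    return b"".join(struct.pack("=HBBI", *ins) for ins in prog)

def run_filter(prog, nr, arch):
    """Toy evaluator for the four opcodes above; returns the seccomp action."""
    data = struct.pack("=II", nr & 0xFFFFFFFF, arch) + bytes(56)
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD | BPF_W | BPF_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JMP | BPF_JEQ | BPF_K:
            pc += jt if acc == k else jf
        elif code == BPF_JMP | BPF_JGE | BPF_K:
            pc += jt if acc >= k else jf
        elif code == BPF_RET | BPF_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
```

To install such a program, the packed bytes would be referenced from a struct sock_fprog and passed to seccomp(SECCOMP_SET_MODE_FILTER, ...); that step is omitted here because it irreversibly restricts the calling process.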
BPF filter handlers and ptrace #
If the seccomp filter uses the SECCOMP_RET_TRACE action, then all BPF filter handlers must be reviewed. These handlers are usually implemented in a process (the “tracer”) that is tracing the sandboxed process (the “tracee”) with the ptrace mechanism. Ensure that the following is done.
- All ptrace tracing options (e.g., PTRACE_O_TRACEFORK, PTRACE_O_TRACECLONE) are used on the tracee.
  - Otherwise, the tracee may escape the sandbox by creating child processes and calling execve.
- The PTRACE_O_EXITKILL option is set on all sandboxed processes.
  - Otherwise, a crash of the tracer process allows malicious processes to trace sandboxed tracees and handle all SECCOMP_RET_TRACE actions (ptrace-stop events).
- Syscalls that the tracee may use to impact the tracer are forbidden:
  - rt_sigqueueinfo and rt_tgsigqueueinfo with any tracer’s TID
  - kill, tgkill, and tkill with any tracer’s TID
  - setpriority, sched_{setaffinity,setattr,setparam,setscheduler}, and prlimit64 with any tracer’s TID
  - process_vm{writev,readv} with any tracer’s TID
  - open, rename, and other similar syscalls on /proc/<tracer-pid> and /proc/<tracer-pid>/task/<tracer-tid>
- The correct syscall table is used to determine which syscall caused the ptrace-stop event.
  - This information is not provided in the event.
  - To determine the syscall number, the tracer can do the following:
    - Use PTRACE_GET_SYSCALL_INFO in new kernels (Linux 5.3 and later).
    - Read registers (e.g., with PTRACE_GETREGSET) and check the following in old kernels (this is a bit tricky and is often prone to TOCTOU bugs):
      - Bitness of registers (RAX versus EAX)
      - Instruction used to execute the syscall (the int 0x80 and sysenter instructions in x64 use x86’s table, and syscall uses x64’s table)
      - CS register flags
      - Use of the vsyscall mechanism ((RIP & ~0x0C00) == 0xFFFFFFFFFF600000)
        - If used, then finding the executed instruction is more complicated; the stack needs to be parsed to find the saved RIP.
- The kernel downcasts syscall numbers to ints (at least modern kernels do, both for 64-bit and 32-bit syscall tables), and the BPF filter should use 32-bit opcodes for syscall numbers.
- Handlers interpret arguments (registers) according to the ABI (values may be silently truncated).
  - Both the BPF filter and a handler should be in agreement on arguments’ bitness. Check BPF opcode docs to establish bitness.
- Memory-level race conditions are dealt with.
  - If the tracer reads the tracee’s memory, then the memory could have been asynchronously modified by another thread (within a single thread group or process) or by another process (if some memory was explicitly shared). It is hard to fix this race condition completely.
  - For a single process, this race condition can be fixed by pausing all other threads of the tracee when the syscall entry hook is called and unfreezing them on syscall exit. It is possible to pause a group of tasks (processes or threads) using the cgroups (control groups) Linux kernel feature. In cgroups v1, the freezer controller/subsystem can be used to do so. In cgroups v2, the cgroup.freeze file can be written to in order to freeze all tasks within a cgroup (and tasks in all descendant cgroups). We also recommend reading the Thread Granularity section of the cgroup v2 documentation.
  - To protect against shared memory race conditions, the tracer would have to freeze all processes that the memory map was shared with. Alternatively, the memory could be exclusively locked (assuming it cannot be unlocked by other processes; this solution would need further investigation).
  - The userfaultfd syscall enables attackers to win races with 100% reliability.
- Operating-system-level race conditions are dealt with.
  - All the common vulnerabilities apply here, such as changing file paths with symlinks, and changing the tracee’s resources, like the current working directory, environment variables, and file descriptors (or anything under /proc/<pid>).
  - Note that the two race conditions above can happen inside a handler execution but also during handler versus kernel execution (inside a syscall).
- If a syscall should be dropped (skipped), it is done in syscall-enter-stop (or ptrace-stop), not in syscall-exit-stop. The syscall-exit-stop event occurs after the syscall is executed, so even if the syscall’s return value is modified to indicate an error in that stop, the syscall has already executed and its effects cannot be undone.
- If the tracee should be killed in a handler, it is done with SIGKILL (i.e., it is terminated immediately) and not by any delayed mechanism that would allow the tracee to execute after the handler returns but before the termination.
- Signals and ptrace events are correctly handled.
  - The syscall-enter-stop and syscall-exit-stop events must be correctly tracked by the tracer. The ptrace API does not provide a means to differentiate between these two. A common exploit is to manually send a SIGTRAP signal to the tracee or to make the tracee execute int 3 (the software interrupt instruction) to confuse the handlers. A common solution is to use the PTRACE_O_TRACESYSGOOD option: it allows the tracer to easily differentiate between syscall-{enter,exit}-stop and other ptrace stops.
  - SIGKILL can terminate a process abruptly, and that event must be handled correctly.
    - The syscall-exit-stop event may not be delivered after syscall execution. Tracing of syscalls should start within syscall-enter-stop and end within either syscall-exit-stop or the tracee’s exit event. Otherwise, an executed syscall can be missed by the tracer.
    - According to the ptrace man page, “the tracer cannot assume that the ptrace-stopped tracee exists.”
    - The PTRACE_EVENT_EXIT event may or may not be delivered after SIGKILL, depending on kernel version.
  - A wait status with WIFEXITED(status) or WIFSIGNALED(status) set is not always delivered upon tracee termination. For example, it is not delivered when a thread (that is not thread group leader) calls execve. The tracer should additionally use PTRACE_EVENT_EXEC to detect all possible tracee terminations.
  - When a new process is forked or cloned, then PTRACE_EVENT_CLONE (for the parent) and PTRACE_EVENT_STOP (for the child) signals are delivered simultaneously in undetermined order. This behavior may enable race condition bugs.
- The clone syscall with the CLONE_UNTRACED flag is not allowed.
  - This flag allows the tracee to clone in a way that the child is not traced by the original tracer. The tracee can then attach to the new thread via ptrace and handle all the SECCOMP_RET_TRACE actions (effectively disabling relevant seccomp filters). Note that this trick does not work for seccomp actions that explicitly drop syscalls (like SECCOMP_RET_ERRNO); these are still blocked. To prevent this vulnerability, the following must be done:
    - Handle the clone syscall and remove the flag in the ptrace-stop event or block clone with the flag (blocking should be implemented in BPF, as it is less error-prone).
      - The clone3 syscall must be blocked and its return value must be set to ENOSYS. This syscall stores arguments in memory and cannot be inspected in the BPF filter; a ptrace handler would be vulnerable to TOCTOU attacks. The ENOSYS error makes programs fall back to using clone.
    - Note that the kernel may have reversed the order of clone arguments (the CONFIG_CLONE_BACKWARDS* configurations; consult the Flatpak seccomp implementation).
- If the tracer uses PTRACE_PEEK* ptrace calls, then errors are handled correctly: the errno must be consulted, not the return value.
- Return values of the tracer’s calls to ptrace(ATTACH/SEIZE) are checked.
  - An error means there is no effective sandboxing, as a malicious process may then try to attach to the tracee.
  - Note that the tracee (or the malicious process) may try to force an error in multiple ways, such as by using prctl(PR_SET_PTRACER) or prctl(PR_SET_DUMPABLE, 0), or by changing the YAMA policy (/proc/sys/kernel/yama/ptrace_scope).
- Syscalls executed via the vsyscall mechanism are handled correctly.
  - Such syscalls cannot be dynamically replaced, only dropped.
- The following obscure syscalls are blocked, if possible. While these syscalls should not enable seccomp bypasses, they usually should be blocked “just in case” and because they can be abused to circumvent filters for other syscalls:
  - modify_ldt
  - uselib
  - Filesystem-manipulating syscalls (chroot, pivot_root, mount, move_mount, open_tree, fsopen, fsmount)
  - VM/NUMA ops (move_pages, mbind, set_mempolicy, migrate_pages)
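Two of the tracer-side items above, setting the full option mask and telling syscall stops apart from injected SIGTRAPs, can be illustrated with a small sketch. It assumes the option and event values from <linux/ptrace.h> and the Linux wait-status encoding; `TRACER_OPTIONS` and `classify_status` are hypothetical names, and the function only decodes a status returned by waitpid, it does not call ptrace itself.

```python
import signal

SIGTRAP = int(signal.SIGTRAP)

# Option bits from <linux/ptrace.h>.
PTRACE_O_TRACESYSGOOD = 0x1
PTRACE_O_TRACEFORK = 0x2
PTRACE_O_TRACEVFORK = 0x4
PTRACE_O_TRACECLONE = 0x8
PTRACE_O_TRACEEXEC = 0x10
PTRACE_O_TRACEEXIT = 0x40
PTRACE_O_TRACESECCOMP = 0x80
PTRACE_O_EXITKILL = 0x100000

# Follow every child, mark syscall stops, and kill tracees if the tracer dies.
TRACER_OPTIONS = (
    PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK
    | PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT
    | PTRACE_O_TRACESECCOMP | PTRACE_O_EXITKILL
)

# PTRACE_EVENT_* codes, reported in bits 16 and up of the wait status.
PTRACE_EVENTS = {
    1: "fork", 2: "vfork", 3: "clone", 4: "exec",
    5: "vfork-done", 6: "exit", 7: "seccomp", 128: "stop",
}

def classify_status(status):
    """Decode a waitpid() status for a traced task into (kind, detail)."""
    if status == 0xFFFF:                 # WIFCONTINUED
        return ("continued", None)
    if (status & 0x7F) == 0:             # WIFEXITED
        return ("exited", (status >> 8) & 0xFF)
    if (status & 0xFF) != 0x7F:          # WIFSIGNALED (tracee is gone)
        return ("killed", status & 0x7F)
    # WIFSTOPPED: some kind of ptrace stop.
    event = status >> 16
    if event:
        return ("event", PTRACE_EVENTS.get(event, "unknown"))
    sig = (status >> 8) & 0xFF
    if sig == (SIGTRAP | 0x80):
        # Only possible with PTRACE_O_TRACESYSGOOD; a tracee-injected SIGTRAP
        # or int 3 arrives as plain SIGTRAP and lands in the branch below.
        return ("syscall-stop", None)
    return ("signal", sig)
```

Note that a "syscall-stop" result still does not say whether the stop is an entry or an exit; the tracer has to track that per task, and has to treat "killed", "exited", and the "exit" event as valid ends of an in-flight syscall.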