Difference between Linux capabilities and seccomp

I want to know the exact difference between Linux capabilities and seccomp.

I'll explain the exact differences below, but here's the general explanation: Capabilities involve various checks in kernel functions reachable by syscalls. If the check fails (i.e. the process lacks the necessary capability), the syscall is typically made to return an error. The check can be done either right at the beginning of a specific syscall, or deeper in the kernel in areas that might be reachable through multiple different syscalls (such as writing to a specific privileged file).

Seccomp is a syscall filter which is applied to all syscalls before they are run. A process can set up a filter which revokes its right to run certain syscalls, or certain syscalls with specific arguments. The filter is usually in the form of BPF bytecode (classic BPF, not eBPF), which the kernel uses to check whether or not a syscall is permitted for that process. Once a filter is applied, it cannot be loosened at all, only made more strict (assuming the syscalls responsible for loading a seccomp policy are still allowed).

Note that some syscalls cannot be restricted by either seccomp or capabilities because they are not real syscalls. This is the case with vDSO calls, which are userspace implementations of several syscalls that do not strictly need the kernel. Attempting to block getcpu() or gettimeofday() is futile for this reason, since a process will use the vDSO instead of the native syscall anyway. Thankfully, these syscalls (and their associated virtual implementations) are largely harmless.
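As a rough illustration of the vDSO point, a process can check its auxiliary vector to see whether the kernel mapped a vDSO at all. This is only a sketch: it assumes glibc's getauxval() (2.16 or later), and the helper name vdso_is_mapped is made up for this example.

```c
#include <sys/auxv.h>

/* Hypothetical helper: returns 1 if the kernel mapped a vDSO into this
   process, 0 otherwise. AT_SYSINFO_EHDR holds the vDSO's base address. */
int vdso_is_mapped(void)
{
    return getauxval(AT_SYSINFO_EHDR) != 0;
}
```

If this returns 1, libc may service calls such as gettimeofday() without ever entering the kernel, so neither a capability check nor a seccomp filter gets a chance to see them.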

Also, is it that Linux capabilities use seccomp internally, or is it the other way round, or are they both completely different?

They are implemented completely differently internally. I wrote another answer elsewhere on the current implementation of various sandboxing technologies explaining how they differ and what they're for.

Capabilities

Many syscalls which do privileged things include an internal check to ensure that the calling process has sufficient capabilities. The kernel stores the list of capabilities that a process has, and once a process drops capabilities, it cannot get them back on its own. For example, trying to open /dev/cpu/*/msr will fail unless the process calling the open() syscall has CAP_SYS_RAWIO. This can be seen in the kernel source code responsible for modifying MSRs (low-level CPU features):

static int msr_open(struct inode *inode, struct file *file)
{
    unsigned int cpu = iminor(file_inode(file));
    struct cpuinfo_x86 *c;

    if (!capable(CAP_SYS_RAWIO))
        return -EPERM;

    if (cpu >= nr_cpu_ids || !cpu_online(cpu))
        return -ENXIO;  /* No such CPU */

    c = &cpu_data(cpu);
    if (!cpu_has(c, X86_FEATURE_MSR))
        return -EIO;    /* MSR not supported */

    return 0;
}

Some syscalls won't run at all if the correct capability is not present, such as vhangup():

SYSCALL_DEFINE0(vhangup)
{
    if (capable(CAP_SYS_TTY_CONFIG)) {
        tty_vhangup_self();
        return 0;
    }
    return -EPERM;
}
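From user space, a process can ask the kernel which capabilities it currently holds before attempting a privileged operation. The following is a minimal sketch using the raw capget() syscall and the kernel's own headers; the helper name has_effective_cap is my own invention, and most real programs would use libcap instead.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `cap` is in the calling process's
   effective set, 0 if it is not, and -1 if capget() fails. */
int has_effective_cap(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,   /* 0 means "the calling process" */
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* the capability bits are split across two 32-bit words */
    return (data[CAP_TO_INDEX(cap)].effective & CAP_TO_MASK(cap)) != 0;
}
```

For instance, has_effective_cap(CAP_SYS_RAWIO) gives a rough prediction of whether the capable() check in msr_open() above would pass; an ordinary unprivileged process would normally get 0.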

Capabilities can be thought of as broad classes of privileged functionality that can be selectively removed from a process or user. The specific functions that have capability checks vary from kernel version to kernel version, and there is often bickering between kernel developers over whether or not a given function should require capabilities to run. Generally, removing capabilities from a process improves security by reducing the number of privileged actions it can perform. Note that some capabilities are considered root-equivalent, meaning that, even if you disable all other capabilities, they can, under some conditions, be used to regain full permissions. Many examples are given by the creator of grsecurity, Brad Spengler. An obvious example would be CAP_SYS_MODULE, which allows loading arbitrary kernel modules. Another would be CAP_SYS_ADMIN, which is a catch-all capability nearly equivalent to root.
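To make "selectively removed" concrete, here is a small sketch that drops a single capability from the calling thread with the raw capget()/capset() syscalls. The helper names drop_cap and cap_is_effective are hypothetical, and libcap would be the more usual way to do this.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: removes `cap` from the effective, permitted, and
   inheritable sets of the calling thread. Returns 0 on success, -1 on error. */
int drop_cap(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,   /* 0 means "the calling thread" */
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* clear the bit in all three sets; shrinking a set needs no privilege */
    data[CAP_TO_INDEX(cap)].effective   &= ~CAP_TO_MASK(cap);
    data[CAP_TO_INDEX(cap)].permitted   &= ~CAP_TO_MASK(cap);
    data[CAP_TO_INDEX(cap)].inheritable &= ~CAP_TO_MASK(cap);

    return syscall(SYS_capset, &hdr, data) == 0 ? 0 : -1;
}

/* Hypothetical helper: 1 if `cap` is still effective, 0 if not, -1 on error. */
int cap_is_effective(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;
    return (data[CAP_TO_INDEX(cap)].effective & CAP_TO_MASK(cap)) != 0;
}
```

After drop_cap(CAP_SYS_MODULE) succeeds, cap_is_effective(CAP_SYS_MODULE) should report 0. Note that this sketch leaves the bounding set alone; removing a capability from it as well takes prctl(PR_CAPBSET_DROP, cap), which itself requires CAP_SETPCAP.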

Mode 1 seccomp

There are two types of seccomp: mode 1 (strict) and mode 2 (filter). Mode 1 is extremely restrictive and, once enabled, only allows four syscalls. These syscalls are read(), write(), exit() (the raw syscall, not exit_group()), and rt_sigreturn(). A process is immediately sent the fatal SIGKILL signal from the kernel if it ever attempts to use a syscall that is not on the whitelist. This mode is the original seccomp mode and does not require generating and sending BPF bytecode to the kernel. A special syscall is made, after which mode 1 will be active for the lifetime of the process: seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL) or prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT). Once active, it cannot be turned off.

The following example program securely executes x86 machine code which returns 42:

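#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <linux/seccomp.h>

/* x86 machine code for "mov al,42; ret" aka "return 42" */
static const unsigned char code[] = "\xb0\x2a\xc3";

int main(void)
{
    int fd[2], ret = -1;
    void *mem;

    /* copy the code into executable memory first: modern toolchains map
       const data as non-executable, and mmap() is forbidden in mode 1 */
    mem = mmap(NULL, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED || pipe(fd) != 0)
        return 1;
    memcpy(mem, code, sizeof(code));

    /* spawn child process, connected by a pipe */
    if (fork() == 0) {
        close(fd[0]);

        /* enter mode 1 seccomp and execute untrusted bytecode */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
        ret = ((uint8_t (*)(void))mem)();

        /* send result over pipe, and exit with the raw exit() syscall,
           since libc's exit path uses the forbidden exit_group() */
        write(fd[1], &ret, sizeof(ret));
        syscall(SYS_exit, 0);
    } else {
        close(fd[1]);

        /* read the result from the pipe, and print it */
        read(fd[0], &ret, sizeof(ret));
        printf("untrusted bytecode returned %d\n", ret);
        wait(NULL);
    }
    return 0;
}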
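To see the SIGKILL behaviour described above, a parent can fork a child that enters mode 1 and then deliberately makes a syscall outside the four allowed ones. This is just a sketch; the function name is hypothetical, and it reports -1 when strict mode cannot be enabled (for example, when the process is already running under a seccomp filter).

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: forks a child that enters mode 1 and then makes a
   syscall outside the allowed set. Returns the signal that terminated the
   child, or -1 if strict mode could not be enabled or the child survived. */
int signal_after_forbidden_syscall(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);           /* strict mode is not available here */
        syscall(SYS_getpid);    /* not on the list: the kernel sends SIGKILL */
        syscall(SYS_exit, 3);   /* only reached if the kill did not happen */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

On a system where strict mode is available, the return value should be SIGKILL, even though getpid() is about as harmless as a syscall gets.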

Mode 1 is the original mode, and was added for the purpose of making it possible to execute untrusted bytecode for raw computations. A broker process would fork a child (and possibly set up communication via pipes), and the child would enable seccomp, preventing it from doing anything but reading from and writing to file descriptors that are already open, and exiting. This child process could then execute untrusted bytecode safely. Not many people used this mode, but before Linus could complain loudly enough to kill it, the Google Chrome team expressed a desire to use it for their browser. This created renewed interest in seccomp and saved it from an untimely death.

Mode 2 seccomp

The second mode, filter, also called seccomp-bpf, allows the process to send a fine-grained filter policy to the kernel, allowing or denying entire syscalls, or specific syscall arguments or ranges of arguments. The policy also specifies what happens in the event of a violation (for example, should the process be killed, or should the syscall merely be denied?) and whether or not the violation should be logged. Because Linux syscall arguments are passed in registers and the filter only sees those register values as integers, it's impossible to filter the memory contents that a syscall argument might point to. For example, although you can prevent open() from being called with the write-capable O_RDWR or O_WRONLY flags, you cannot whitelist the individual path that is opened. The reason for this is that, to seccomp, the path is nothing more than a pointer to memory containing the null-terminated filesystem path. There's no way to guarantee that the memory holding the path hasn't been changed by a sibling thread between the seccomp check passing and the pointer being dereferenced, short of putting it in read-only memory and denying memory-related syscalls access to it. For path-based restrictions, it's often necessary to use an LSM like AppArmor instead.

This is an example program using mode 2 seccomp to ensure it can only print its current PID. This program uses the libseccomp library (link with -lseccomp), which makes creating seccomp BPF filters easier, although it is also possible to do it the hard way without any abstracting library.

#include <seccomp.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
    /* initialize the libseccomp context */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    /* allow exiting */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* allow getting the current pid */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);

    /* allow changing data segment size, as required by glibc */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);

    /* allow writing up to 512 bytes to fd 1 */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 2,
        SCMP_A0(SCMP_CMP_EQ, 1),
        SCMP_A2(SCMP_CMP_LE, 512));

    /* if writing to any other fd, return -EBADF */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADF), SCMP_SYS(write), 1,
        SCMP_A0(SCMP_CMP_NE, 1));

    /* load and enforce the filters */
    seccomp_load(ctx);
    seccomp_release(ctx);

    printf("this process is %d\n", getpid());
}

Mode 2 seccomp was created because mode 1 obviously had its limitations. Not every task can be separated into a pure-bytecode process that could run in a child process and communicate via pipes or shared memory. This mode has far more features and its functionality continues to be slowly expanded. However, it still has its downsides. Safely using mode 2 seccomp requires a deep understanding of syscalls (want to block kill() from killing other processes? Too bad, you can kill processes with fcntl() too!). It is also fragile, as changes to the underlying libc can cause breakage. The glibc open() function, for example, no longer always uses the syscall of that name and instead may use openat(), breaking policies which whitelisted only the former.