A new capabilities patch for Linux

by David A. Madore

✱✱✱ This patch has been abandoned in favor of cuppabilities: see here for explanations ✱✱✱


What's this?

A patch to make Linux capabilities into something useful.

In short: currently (i.e., prior to applying this patch), Linux has capabilities, but they are (deliberately) crippled, and thus essentially useless, because nobody could agree on coherent semantics for them. This patch uncripples them and attempts to give them reasonable semantics that will, hopefully, break neither legacy Unix programs nor those that use the current capabilities system (essentially, Bind9 and NTP). Basically, capabilities are currently useless because they are never inheritable (=preserved across execve()), and this patch makes them so (but carefully enough so as not to confuse existing programs). Furthermore, whereas the current Linux capabilities are only “additional” capabilities (meaning that normal, non-root processes have none, and adding capabilities leads up to root), the patch also suggests (and, to some extent, implements) a new bunch of “regular” capabilities, which are present on all normal processes and can be removed so as to provide some measure of fault-containment for partially untrusted or potentially buggy programs (thus, these new capabilities can be said to lead down).

Note: Although I believe that this patch will not break anything, it is still little tested and should be considered alpha quality: it should on no account be applied on security-critical systems or on a system where local users are not to be trusted. The security implications are quite complex, and I could quite possibly be wrong in thinking that it doesn't open any local root hole.

Why is it abandoned?

This patch has been abandoned due to heavy criticism on the linux-kernel mailing list: essentially because it abandoned POSIX semantics, because it made capabilities inheritable by default, which some people do not want, and because it used the capabilities model (designed for overprivileged processes) to also model underprivileged processes, contrary to what was intended. So it was obvious that the patch could never gain sufficient acceptance to be included in the kernel. Rather than pursuing it independently, I am trying a more consensual approach: I am splitting the changes into two completely independent parts:

The two are entirely independent.

This web page is kept for documentation purposes.

Where can I get it?

See this FTP directory.

The present version (0.4.4) is to be applied against Linux version 2.6.18-rc6, although it should not be too picky about that. (The possibility of serving a git tree is being considered.)

What are capabilities?

Traditional Unix semantics know only two levels of privileges: root and non-root. Root processes are able to bypass essentially all security checks (mandatory access controls) in the kernel, whereas non-root processes are subject to all of them. There is no intermediate situation. This all-or-nothing solution has the merit of simplicity, but it also means that a program that requires any level of privileges must be made suid root, making it a privileged target for attack and thus dangerous. What capabilities do is split the single “root” privilege in thirty-odd mostly independent bits so that programs requiring special privileges can be given just those required and not full root privileges.

For example, the NTP daemon needs only the CAP_SYS_TIME capability, in order to set and skew the system clock: so a capability-aware version of it starts as root (so with all capabilities) but drops all the unnecessary ones early in the code.

Furthermore, the patch discussed here adds a new bunch of capabilities which are present in normal (non-root) processes and which can be removed to give a process even lower abilities: for example, all daemons could be run without the CAP_REG_SXID capability, thus making them incapable of elevating privileges by executing s[ug]id executables, so even if the daemon is compromised, the attacker could less easily exploit possible local root holes on the attacked machine (this could offer some measure of protection when running under chroot is not feasible).

How are Linux capabilities currently crippled?

Currently (i.e., prior to applying this patch), Linux has a notion of capabilities. However, it is almost entirely useless and, therefore, almost entirely unused. Roughly speaking, all root processes have all capabilities and all non-root processes have none: whenever an executable is execve()d as root, it gains all capabilities, and whenever it is execve()d as non-root, it loses all. Thus, there is no way to export capabilities from one program to another. Basically the only thing one can do with them is for a daemon (e.g., Bind9) to start as root, drop some (but not all) capabilities and switch to a different uid (with a special request, prctl(PR_SET_KEEPCAPS,1,…), to maintain capabilities across setuid()): better than no knowledge of capabilities at all but, still, not very useful. One cannot run a given program with restricted capabilities except by patching the program's code (that is, it must be made capability-aware, and very few programs are): there is simply no way to restrict capabilities in one program and from that point execute another (because all capabilities will be lost on execve()).

Furthermore, Linux entirely disables one of its capabilities, CAP_SETPCAP (which would have permitted transferring capabilities from one process to another to some extent), because it was incorrectly thought to be responsible for a past sendmail-related exploit. There's really no reason to disable this (useful) capability, and doing so further cripples the already deficient Linux caps system.

What does this patch do, in more detail?

Most importantly, this patch makes capabilities inheritable: i.e., when a process execve()s another executable, capabilities will be kept (even in the absence of filesystem support for capabilities). It's not really that simple, because we have to make sure not to break anything, but that's the gist of the idea: the detailed semantics are described below. The patch also restores the CAP_SETPCAP capability, which was removed for no real reason.

Furthermore, the patch adds a new bunch of capabilities: presently the Linux capabilities are 32-bit wide with normal non-root processes having 0 bits everywhere, and this patch makes them 64-bit wide with normal non-root processes having sixteen 1's in the (new) upper half (normal root processes have 1's everywhere, of course). Moving from 32-bit to 64-bit wide capability sets means that the kernel-level interface changes; however, so as not to break the (very few) programs and libraries that currently use capset() and capget(), the kernel checks the magic version number and will, if necessary, reply with the former interface.
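The version check can be pictured roughly as follows. This is only a sketch, not code from the patch: the function and macro names are invented for illustration, 0x19980330 is the version number of the stock interface, 0x20060903 is the value used by the sample programs further down this page, and the idea that a legacy caller simply sees the low 32 bits is an assumption.

```c
#include <stdint.h>

/* Hypothetical names for the two interface versions. */
#define CAP_VERSION_LEGACY  0x19980330u  /* stock 32-bit interface */
#define CAP_VERSION_PATCHED 0x20060903u  /* 64-bit interface of the patch */

/* Width, in bits, of the capability sets a caller expects,
   or 0 if the version number is not recognized. */
static int
cap_interface_width (uint32_t version)
{
  switch ( version )
    {
    case CAP_VERSION_LEGACY:
      return 32;
    case CAP_VERSION_PATCHED:
      return 64;
    default:
      return 0;
    }
}

/* Assumed view of a 64-bit set through the former interface:
   only the low 32 bits (the pre-patch capabilities) are visible. */
static uint32_t
cap_legacy_view (uint64_t caps)
{
  return (uint32_t) (caps & 0xffffffffULL);
}
```

With this sketch, a normal non-root process (only the regular bits set) would appear through the former interface as having no capabilities at all, which is what legacy programs expect.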

The patch adds a number of such “regular” (a better name would be welcome…) capabilities (most important among them is CAP_REG_SXID, which controls a process's ability to execute suid programs): but they are intended mostly as a proof-of-concept and it is quite possible that they will be changed in the future.

Finally, version 0.4.2 of the patch also adds filesystem support for capabilities (through extended attributes): this is a merge of a patch provided by Serge E. Hallyn, who is in no way to blame for my mischief.

The patch is also available in split form: part 1 introduces 64-bit wide capability sets, part 2 introduces the new inheritance rules, part 3 introduces the new (regular) capabilities, and part 4 (almost entirely Serge's work) introduces the filesystem support.

What are the permitted, effective and inheritable capability sets (for a process)?

Each process (or, more accurately, each task) has, at all times, not one but three sets of capabilities: they are called the permitted, effective and inheritable capability sets. Each capability can be present or absent in each of the sets.

The effective set is the one which is actually used to check permissions when making system calls that require capabilities. For example, a process needs to have the CAP_CHOWN capability in its effective set in order to execute the chown() system call.

The permitted set is the set of capabilities to which the process has access, at most. The effective set is, at all times, a subset of the permitted set: when a given capability is present in the permitted set, the process may, at will, add it to or remove it from its effective set. Once a capability is removed from the permitted set, however, it cannot be regained except by executing a suid executable or by having another process use CAP_SETPCAP. (This is quite similar to the effective and real/saved uid's in the traditional Unix approach.)

The inheritable set is also a subset of the permitted set, and corresponds to capabilities which will be passed across execve() (note that fork() does not affect capabilities in any way: both the parent and child processes receive the same capability sets as before the fork()). However, the fine print is a bit more complicated (and, in any case, in the present, pre-patch, situation, capabilities are simply not inherited).
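The relationship between the three sets can be sketched with plain bitmasks. The type and function names below are hypothetical and do not come from the patch or from libcap; the sketch only models the subset constraints described above.

```c
#include <stdint.h>

/* The three per-process sets, as 64-bit masks (hypothetical type name). */
struct cap_sets {
  uint64_t effective;
  uint64_t permitted;
  uint64_t inheritable;
};

/* Nonzero if effective and inheritable are both subsets of permitted. */
static int
cap_sets_valid (const struct cap_sets *s)
{
  return (s->effective & ~s->permitted) == 0
    && (s->inheritable & ~s->permitted) == 0;
}

/* Add a capability to the effective set; fails (-1) unless it is permitted. */
static int
cap_raise_effective (struct cap_sets *s, int cap)
{
  uint64_t bit = 1ULL << cap;

  if ( ! (s->permitted & bit) )
    return -1;
  s->effective |= bit;
  return 0;
}

/* Remove a capability from the permitted set, and hence from the others. */
static void
cap_drop_permitted (struct cap_sets *s, int cap)
{
  uint64_t bit = 1ULL << cap;

  s->permitted &= ~bit;
  s->effective &= ~bit;
  s->inheritable &= ~bit;
}
```

Once cap_drop_permitted() has been called for a capability, cap_raise_effective() fails for it, mirroring the fact that a capability dropped from the permitted set cannot be regained by the process itself.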

A previous Linux capabilities patch which I had written increased the number of sets to four, adding a bounding set to the story. This did not meet much enthusiasm and this functionality is now essentially replaced by the CAP_REG_SXID capability.

What are the permitted, effective and inheritable capability sets (for an executable file)?

Now not only does every process have three sets of capabilities, but, with filesystem support for capabilities, every executable file should also have three sets of capabilities, also (confusingly) called the inheritable (=allowed), permitted (=forced) and effective sets.

The executable's inheritable set corresponds to the set of capabilities it is willing to receive upon execve(); the executable's permitted set (a decidedly bad terminology! forced would be much better, but I am told it is deprecated) consists of capabilities which are automatically added upon execve(), whether the process possessed them or not (thus, this is similar to the traditional Unix suid mechanism); lastly, the executable's effective set indicates which capabilities should be initially made effective. (Contrary to the case of processes, there is no reason for an executable file's inheritable set to be a subset of the permitted set; in fact, quite the contrary: inheritable bits are interesting only when they are not in the permitted set. Sorry, I'm not the one to blame for the confusion.)

Any executable in the absence of filesystem support for capabilities, or any executable file which is not specially marked, is considered as though it had every bit set in the inheritable and effective sets and none in the permitted set—except when it's suid root, in which case it also has a full permitted set (so it will gain all capabilities upon execve()), or (in version 0.4.4 of the patch) if it's suid anything else or sgid, in which case all capability sets are equal to the set of “regular” capabilities (so as to provide a sanitized environment), except in the case when it would break normal Unix rules (for example, exec of a suid non-root or sgid program from real-uid=0 should only restrict the effective set—I'm afraid it's quite a mess).

There is also a (system-wide) capability bounding set, which controls which capabilities can actually be gained upon execve(): thus it can be used to permanently disable a certain capability (for all future processes).

What about filesystem support for capabilities? Does this patch add it?

Version 0.4.2 of the patch adds (optional) filesystem support for capabilities (but only for the low-order part of capabilities, i.e., those 32 bits which existed before the patch): it is controlled by an extended attribute with name security.capability (the format is as follows: the attribute must contain four 32-bit words in little-endian format, the first containing the version number 0x19980330 and the three next containing the effective, permitted and inheritable sets in this order). As previously explained, this is the merge of a patch provided by Serge E. Hallyn (though a few adaptations have been made, such as making the CAP_REG_SXID capability and nosuid mount option defeat the effective set in the executable's capabilities). Version 0.3.1 has no filesystem support.
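To make the attribute layout concrete, here is a minimal sketch that packs and unpacks the four 32-bit little-endian words described above. The helper names are hypothetical, and the layout is taken only from the description on this page, not checked against the patch source.

```c
#include <stddef.h>
#include <stdint.h>

#define XATTR_CAPS_VERSION 0x19980330u
#define XATTR_CAPS_SIZE    16  /* four 32-bit words */

/* Store a 32-bit value in little-endian byte order. */
static void
put_le32 (unsigned char *p, uint32_t v)
{
  p[0] = v & 0xff;
  p[1] = (v >> 8) & 0xff;
  p[2] = (v >> 16) & 0xff;
  p[3] = (v >> 24) & 0xff;
}

/* Read a 32-bit little-endian value. */
static uint32_t
get_le32 (const unsigned char *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
    | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Build the attribute value: version, effective, permitted, inheritable. */
static void
encode_caps_xattr (unsigned char buf[XATTR_CAPS_SIZE],
                   uint32_t eff, uint32_t per, uint32_t inh)
{
  put_le32 (buf, XATTR_CAPS_VERSION);
  put_le32 (buf + 4, eff);
  put_le32 (buf + 8, per);
  put_le32 (buf + 12, inh);
}

/* Parse an attribute value; returns -1 on bad size or version number. */
static int
decode_caps_xattr (const unsigned char *buf, size_t len,
                   uint32_t *eff, uint32_t *per, uint32_t *inh)
{
  if ( len != XATTR_CAPS_SIZE || get_le32 (buf) != XATTR_CAPS_VERSION )
    return -1;
  *eff = get_le32 (buf + 4);
  *per = get_le32 (buf + 8);
  *inh = get_le32 (buf + 12);
  return 0;
}
```

A buffer produced this way could then be stored under the name security.capability with the usual extended-attribute tools or system calls.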

With the present patch but without filesystem support, or for files which are unmarked, executables are assumed to have a full set of inheritable (=allowed) and effective capabilities (meaning that they will receive all inheritable and effective capabilities from their parent: this is necessary so as not to break Unix semantics) and an empty set of permitted (=forced) capabilities, except when they are suid root, in which case all sets contain all capabilities.
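The defaults for unmarked files can be summarized in a few lines. This is a sketch with made-up names, and it deliberately ignores the special handling of suid non-root and sgid files introduced in version 0.4.4.

```c
#include <stdint.h>

#define CAP_FULL_SET  0xffffffffffffffffULL
#define CAP_EMPTY_SET 0ULL

/* Capability sets attached to an executable file (hypothetical type). */
struct file_caps {
  uint64_t inheritable;  /* = allowed */
  uint64_t permitted;    /* = forced */
  uint64_t effective;
};

/* Sets assumed for a file carrying no capability marking:
   everything is allowed and effective, nothing is forced,
   unless the file is suid root, in which case everything is forced. */
static struct file_caps
unmarked_file_caps (int suid_root)
{
  struct file_caps f;

  f.inheritable = CAP_FULL_SET;
  f.effective = CAP_FULL_SET;
  f.permitted = suid_root ? CAP_FULL_SET : CAP_EMPTY_SET;
  return f;
}
```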

What are the semantics this patch creates for capabilities?

As explained above, each task has three sets of capabilities, the permitted, effective and inheritable sets. We must now describe how these sets are changed or consulted upon certain system calls.

Whenever a permission needs to be checked, the effective set is consulted. This is the standard Linux behavior, and I do not change this.

When a process fork()s, its capability sets are not modified: both the parent and child processes receive the same capability sets as before the fork(). This is also unchanged by the patch.

In order to set capabilities for a target task, the following checks are observed: first, the target must be the same process as the caller task or the caller must possess the CAP_SETPCAP capability (in its effective set). Second, the newly raised bits in the inheritable and permitted sets (of the target) must be part of the current permitted set of the caller. Thirdly, the constraints must be preserved of the (new) effective and inheritable sets (of the target) being subsets of the (new) permitted set (of the target). All of this is current Linux code, unchanged by the patch (except for the part about enforcing the inheritable set to be a subset of the permitted set: this may have been an oversight or perhaps a different interpretation of what the inheritable set means, so I found it cleaner to enforce the constraint by intersecting the requested inheritable set with the new permitted set).
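The three checks can be sketched as a single function over bitmasks. This is an illustration under stated assumptions rather than the kernel code: the names are hypothetical, CAP_SETPCAP is given its usual number 8, and a request whose effective set exceeds its permitted set is simply rejected here.

```c
#include <stdint.h>

#define CAP_SETPCAP 8

struct cap_sets {
  uint64_t effective;
  uint64_t permitted;
  uint64_t inheritable;
};

/* Try to install the requested sets on the target.
   Returns 0 and updates *target if allowed, -1 otherwise. */
static int
try_capset (const struct cap_sets *caller, int same_task,
            struct cap_sets *target, struct cap_sets req)
{
  uint64_t raised;

  /* 1. Acting on another task requires CAP_SETPCAP (effective). */
  if ( ! same_task && ! (caller->effective & (1ULL << CAP_SETPCAP)) )
    return -1;
  /* 2. Newly raised inheritable/permitted bits must be in the
        caller's current permitted set. */
  raised = (req.inheritable & ~target->inheritable)
    | (req.permitted & ~target->permitted);
  if ( raised & ~caller->permitted )
    return -1;
  /* 3. The new effective set must be a subset of the new permitted set;
        the inheritable set is intersected with it, as the patch does. */
  if ( req.effective & ~req.permitted )
    return -1;
  req.inheritable &= req.permitted;
  *target = req;
  return 0;
}
```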

Then we have compatibility rules for set*uid(): the reason for this is that legacy Unix programs gain or lose privileges by using the seteuid(), setuid() and cousin functions, so we must emulate them with capabilities and make sure they have the same behavior. This is how we do it, when a program does not explicitly request (using prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities upon set*uid(): when all three of the real, effective and saved uid's are set to non-zero (meaning the program wishes to permanently abandon its root privileges), all three capability sets are cleared of their additional (system) parts (all but bits 32–47); when the effective uid is set to non-zero, only the effective set of capabilities is thus affected, and when the effective uid is reset to zero, the effective set is raised to the full permitted set. This is, essentially, what the current Linux code does (except for the inheritable set). When the program did explicitly request (using prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities upon set*uid(), then nothing is altered, except that the inheritable set is cleared of additional (system) capabilities (so as to avoid surprising programs which expected capabilities not to be inherited): perhaps even this behavior could be suppressed using prctl(PR_SET_KEEPCAPS,2,…) or something.
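These compatibility rules can be modelled as follows. This is a sketch only: all names are hypothetical, and the condition that at least one uid was zero before the call is an assumption modelled on the stock kernel's emulation code rather than something stated above.

```c
#include <stdint.h>

#define CAP_REGULAR_SET 0x0000ffff00000000ULL  /* bits 32-47 */
#define CAP_FULL_SET    0xffffffffffffffffULL

struct cap_sets {
  uint64_t effective;
  uint64_t permitted;
  uint64_t inheritable;
};

struct uids {
  unsigned ruid, euid, suid;
};

/* Adjust capability sets after a set*uid() call changed the uids
   from "before" to "after". */
static void
emulate_setxuid (struct cap_sets *c, struct uids before,
                 struct uids after, int keepcaps)
{
  int was_root = before.ruid == 0 || before.euid == 0 || before.suid == 0;
  int all_nonzero = after.ruid != 0 && after.euid != 0 && after.suid != 0;

  if ( keepcaps )
    {
      /* Only the inheritable set loses its additional capabilities. */
      c->inheritable &= CAP_REGULAR_SET;
      return;
    }
  if ( was_root && all_nonzero )
    {
      /* Root privileges permanently abandoned. */
      c->permitted &= CAP_REGULAR_SET;
      c->effective &= CAP_REGULAR_SET;
      c->inheritable &= CAP_REGULAR_SET;
    }
  if ( before.euid == 0 && after.euid != 0 )
    c->effective &= CAP_REGULAR_SET;  /* temporary drop */
  if ( before.euid != 0 && after.euid == 0 )
    c->effective = c->permitted;      /* privileges regained */
}
```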

Finally, we must describe how the three capability sets are affected by execve(). Recall that there are also three capability sets associated with an executable file. Let us call P(per), P(eff) and P(inh) the permitted, effective and inheritable sets for the task before execve(), P′(per), P′(eff) and P′(inh) the corresponding sets after execve(), and F(per), F(eff) and F(inh) the permitted, effective and inheritable sets for an executable file. Finally, call bnd the system-wide capability bounding set. Then the rules enforced by the patch are as follows:

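    P′(per) ← (P(inh) ∩ F(inh)) ∪ (F(per) ∩ bnd)
    P′(eff) ← (P(inh) ∩ P(eff) ∩ F(inh)) ∪ (F(per) ∩ F(eff) ∩ bnd)
    P′(inh) ← P′(per)

or, for those who prefer plain ASCII: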

    P'(per) <- (P(inh) & F(inh)) | (F(per) & bnd)
    P'(eff) <- (P(inh) & P(eff) & F(inh)) | (F(per) & F(eff) & bnd)
    P'(inh) <- P'(per)

The first rule is exactly the one documented in the capabilities(7) manual page. The other two differ slightly, but this is demonstrably unavoidable if we are not to break traditional Unix semantics (the documented rule for the effective set is P′(eff) ← P′(per) ∩ F(eff) ≡ (P(inh) ∩ F(eff) ∩ F(inh)) ∪ (F(per) ∩ F(eff) ∩ bnd), but this implies that P′(eff) does not depend on P(eff), thus breaking the traditional Unix semantics that all of uid, euid and suid are preserved upon execve(); similarly, the documented rule for the inheritable set, viz., P′(inh) ← P(inh), means that if an executed suid program itself executes something else, its privileges would be lost).

To justify why the proposed rules are intuitive, consider this: the first part of the expression for P′(per) or P′(eff) represents the capabilities inherited by the exec'ed program (thus, it should be formed by combining those capabilities which the process had before exec and was willing to pass on, and those which the file is willing to inherit), and the second part represents the capabilities provoked by the exec, and is determined solely by file capabilities (the difference between the rule we use for the effective set and the one documented in capabilities(7) should not be a cause for alarm: the security-critical part is the one which concerns the forced bits, i.e., the second part of the expression and, for that, it is identical; in any case, no program can presently rely on the documented behavior since it is not at all implemented!).

As for the rule on the inheritable set, it is quite intuitive (unless they act otherwise, processes will propagate all their capabilities rather than merely those they themselves received in that way) and conforming to the Unix legacy behavior.

Now in the absence of filesystem support for capabilities, we must examine what happens for (a) a non-suid-root executable file (F(inh) and F(eff) are full and F(per) is empty), and (b) a suid-root executable file (F(inh), F(eff) and F(per) are all full). In the first case, the rules become:

    P'(per) <- P(inh)
    P'(eff) <- P(inh) & P(eff)
    P'(inh) <- P'(per)

which is quite unsurprising. In case (b), assuming the capability bounding set has not been decreased by the administrator, all sets are set to full, which is the desired behavior.
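The execve() rules translate directly into bit operations. The following is a sketch with hypothetical names, not the code of the patch; it leaves out the set*uid() compatibility handling and any special treatment of suid non-root files.

```c
#include <stdint.h>

struct cap_sets {
  uint64_t effective;
  uint64_t permitted;
  uint64_t inheritable;
};

/* Compute the sets after execve() from the process sets p,
   the file sets f and the system-wide bounding set bnd. */
static struct cap_sets
exec_transform (struct cap_sets p, struct cap_sets f, uint64_t bnd)
{
  struct cap_sets n;

  /* Inherited part, plus the part forced by the file. */
  n.permitted = (p.inheritable & f.inheritable) | (f.permitted & bnd);
  n.effective = (p.inheritable & p.effective & f.inheritable)
    | (f.permitted & f.effective & bnd);
  /* Everything permitted is passed on by default. */
  n.inheritable = n.permitted;
  return n;
}
```

For an unmarked file (case (a)) this reduces to keeping the process's inheritable set as the new permitted set, and for a suid-root file with a full bounding set (case (b)) all three resulting sets are full.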

Additionally (not in version 0.3.0 of the patch), the compatibility rules for set*uid() (described above) are applied also on execve(): this is to cover the (presumably very rare) case when a process running as root (some uid=0) executes a suid non-root executable, thus switching to a different euid and expecting to lose its effective capabilities (and possibly permitted/inheritable also, in case the process has real uid nonzero).

How do you know nothing will break?

Of course I can't be 100% sure unless I use a formal prover to certify the semantics, which is not really feasible. This is why I'd like the patch to be (1) peer-reviewed and (2) tested (on non-security-critical systems at first!). But I can offer some arguments.

What about suid non-root programs?

The question arises of what should be done about suid non-root (and sgid) programs. Version 0.4.4 of the patch behaves differently, in this respect, from prior versions.

Prior versions did not change the capabilities upon non-root suid/sgid exec. One might argue, however, that the patch makes suid non-root programs vulnerable, as they could be executed with fewer (regular) capabilities than they expect. However, this is not believed to be a serious problem, because (a) such programs are much rarer than suid root programs, (b) damage, if any, would be more limited (no special capabilities are at stake, only access to the filesystem), (c) removing regular capabilities makes system calls fail with a clean error code (nothing exotic like the setuid() function which exhibits a very subtle difference in behavior according as the CAP_SETUID capability is set or not, which made the sendmail exploit possible), and (d) system calls can always fail, so adding new causes for failure is not introducing anything significantly different. So I claim that this behavior is safe.

However, since security is a matter of excessive paranoia, version 0.4.4 offers a different behavior by default: non-root suid/sgid executables behave as though they had the inheritable (=allowed), effective and permitted (=forced) sets of capabilities all equal to the set of “regular” (normal, non-root) capabilities. Considering the rules of inheritance, this means that they start with exactly the regular capabilities in every set. Well, it's a bit more complicated: when root execs an sgid program, for example, it shouldn't drop capabilities (if you want the gory details: if all uids before exec are non-zero then all capability sets are set to the regular caps, and if any is zero then the inheritable (=allowed) and effective sets of the executable are assumed to be the full set and the permitted (=forced) set to the regular caps; the compatibility rules for set*uid() will take care of dropping caps if root uid is actually dropped permanently).
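The gory details can be sketched as follows, together with the rule for the permitted set upon execve(). The names are hypothetical and the sketch assumes an undiminished bounding set in the usage described afterwards.

```c
#include <stdint.h>

#define CAP_REGULAR_SET 0x0000ffff00000000ULL
#define CAP_FULL_SET    0xffffffffffffffffULL

struct file_caps {
  uint64_t inheritable;  /* = allowed */
  uint64_t permitted;    /* = forced */
  uint64_t effective;
};

/* Sets assumed for a suid non-root or sgid executable in version 0.4.4.
   any_uid_zero tells whether any uid of the caller was zero before exec. */
static struct file_caps
suid_nonroot_file_caps (int any_uid_zero)
{
  struct file_caps f;

  f.permitted = CAP_REGULAR_SET;
  f.inheritable = any_uid_zero ? CAP_FULL_SET : CAP_REGULAR_SET;
  f.effective = f.inheritable;
  return f;
}

/* Permitted set after exec: (P(inh) & F(inh)) | (F(per) & bnd). */
static uint64_t
permitted_after_exec (uint64_t p_inh, struct file_caps f, uint64_t bnd)
{
  return (p_inh & f.inheritable) | (f.permitted & bnd);
}
```

In this model a non-root caller that had dropped some regular capability still ends up with exactly the regular set after exec'ing such a program, whereas a root caller keeps its full permitted set.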

Is there a test suite somewhere?

There is an embryo of a test suite: see here. Just extract it and type make (as root). It doesn't test every aspect of the patch, though. Make sure to use the test suite version which matches that of the patch!

How can one make something useful of this patch?

So far, an upgrade of the libcap library remains to be written, so expect things to be a little rough. But it is still possible to write simple programs which make use of the patch. (I have chosen not to include linux/capability.h from the programs and, rather, redefine the constants, which would be a very bad habit in the long run but which is probably simpler while the code is still experimental.)

The following program (which should be run by an unprivileged user) runs a shell (or the program specified on the command line) without the CAP_REG_SXID capability. This means that, from this shell, it is impossible to elevate privileges by executing a set[ug]id program: so it would be a good idea to execute certain daemons from this wrapper.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/prctl.h>

#define _LINUX_CAPABILITY_VERSION  0x20060903

#define CAP_REG_SXID 35

typedef struct user_cap_header_struct {
        uint32_t version;
        pid_t pid;
} *cap_user_header_t;

typedef struct user_cap_data_struct {
        uint64_t effective;
        uint64_t permitted;
        uint64_t inheritable;
} *cap_user_data_t;

long
capget (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capget, header, dataptr);
}

long
capset (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capset, header, dataptr);
}

int
main (int argc, char *argv[])
{
  struct user_cap_header_struct header;
  struct user_cap_data_struct data;
  uint64_t mask = ~(1ULL<<CAP_REG_SXID);
  const char *shell;

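  /* Fetch the current capability sets of this process. */
  header.version = _LINUX_CAPABILITY_VERSION;
  header.pid = getpid ();
  if ( capget (&header, &data) < 0 )
    {
      perror ("capget");
      return 1;
    }
  /* Drop CAP_REG_SXID from all three sets. */
  data.effective &= mask;
  data.permitted &= mask;
  data.inheritable &= mask;
  /* Refuse to go on if the capability could not be dropped. */
  if ( capset (&header, &data) < 0 )
    {
      perror ("capset");
      return 1;
    }
  shell = getenv ("SHELL");
  if ( ! shell )
    shell = "/bin/sh";
  if ( argc > 1 )
    execvp (argv[1], argv+1);
  else
    execl (shell, shell, (char *) NULL);
  /* exec*() only returns on failure. */
  perror ("exec");
  return 1;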
}

The following program (which should be made suid root and then run as an unprivileged user) runs a shell (or the program specified on the command line) with the CAP_CHOWN capability. So, from that shell, chown functions as it does for root (although the user is otherwise unprivileged): if you install this program executable by a certain group, this effectively gives chown privilege to the members of that group. (Of course, CAP_CHOWN is an example: the same example could be used with other capabilities; see the capabilities(7) manual page for examples.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/prctl.h>

#define _LINUX_CAPABILITY_VERSION  0x20060903

#define CAP_CHOWN 0
#define CAP_REGULAR_SET 0x0000ffff00000000ULL

typedef struct user_cap_header_struct {
        uint32_t version;
        pid_t pid;
} *cap_user_header_t;

typedef struct user_cap_data_struct {
        uint64_t effective;
        uint64_t permitted;
        uint64_t inheritable;
} *cap_user_data_t;

long
capget (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capget, header, dataptr);
}

long
capset (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capset, header, dataptr);
}

int
main (int argc, char *argv[])
{
  struct user_cap_header_struct header;
  struct user_cap_data_struct data;
  uint64_t mask = CAP_REGULAR_SET|(1ULL<<CAP_CHOWN);
  const char *shell;

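  /* Keep capabilities across the setuid() call below. */
  if ( prctl (PR_SET_KEEPCAPS,1,0,0,0) < 0 )
    {
      perror ("prctl");
      return 1;
    }
  /* Drop root for good; going on as root would defeat the purpose. */
  if ( setuid (getuid ()) < 0 )
    {
      perror ("setuid");
      return 1;
    }
  header.version = _LINUX_CAPABILITY_VERSION;
  header.pid = getpid ();
  if ( capget (&header, &data) < 0 )
    {
      perror ("capget");
      return 1;
    }
  /* Keep only the regular capabilities plus CAP_CHOWN, in every set. */
  data.permitted &= mask;
  data.effective = data.permitted;
  data.inheritable = data.permitted;
  if ( capset (&header, &data) < 0 )
    {
      perror ("capset");
      return 1;
    }
  shell = getenv ("SHELL");
  if ( ! shell )
    shell = "/bin/sh";
  if ( argc > 1 )
    execvp (argv[1], argv+1);
  else
    execl (shell, shell, (char *) NULL);
  /* exec*() only returns on failure. */
  perror ("exec");
  return 1;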
}

What capabilities exist that I could play with?

See the capabilities(7) manual page for a list, or, better, read include/linux/capability.h (from the kernel source tree).

With this patch, capabilities come in two bunches: additional capabilities (numbers 0 through 31, and also 48 through 63, though those are unused) are not possessed by normal non-root processes, and the first 32 of them are exactly the capabilities of an unpatched Linux kernel, whereas regular capabilities, numbers 32 through 47, are normally possessed by all processes and can be removed to make a process underprivileged. The patch offers six of those, but they are to be thought of more as a “proof of concept” than as a serious proposal:

Further additions which might be considered could be: having a capability required for any kind of network access.
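The numbering of the two bunches described above can be captured in a small helper. This is a sketch with made-up names; the number 35 used for CAP_REG_SXID in the usage note is the one defined in the sample program on this page.

```c
#include <stdint.h>

enum cap_kind {
  CAP_KIND_ADDITIONAL,  /* 0-31 and 48-63: absent from normal processes */
  CAP_KIND_REGULAR,     /* 32-47: present in all normal processes */
  CAP_KIND_INVALID      /* outside the 64-bit range */
};

/* Classify a capability number according to the patch's layout. */
static enum cap_kind
cap_classify (int cap)
{
  if ( cap < 0 || cap > 63 )
    return CAP_KIND_INVALID;
  if ( cap >= 32 && cap <= 47 )
    return CAP_KIND_REGULAR;
  return CAP_KIND_ADDITIONAL;
}

/* Mask of all regular capabilities, built from the classification. */
static uint64_t
cap_regular_mask (void)
{
  uint64_t mask = 0;
  int cap;

  for ( cap = 0 ; cap < 64 ; cap++ )
    if ( cap_classify (cap) == CAP_KIND_REGULAR )
      mask |= 1ULL << cap;
  return mask;
}
```

For instance, cap_classify (0) (CAP_CHOWN) yields CAP_KIND_ADDITIONAL and cap_classify (35) (CAP_REG_SXID) yields CAP_KIND_REGULAR, while cap_regular_mask () gives the same value as the CAP_REGULAR_SET constant used in the second sample program.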

What are the differences between the various versions of the patch?