✱✱✱ This patch has been abandoned in favor of cuppabilities: see here for explanations ✱✱✱
A patch to make Linux capabilities into something useful.
In short: currently (i.e., prior to applying this patch), Linux has
capabilities, but they are (deliberately) crippled and thus essentially useless, because
nobody could agree on coherent semantics for them.  This patch uncripples them and attempts to give them reasonable semantics that will, hopefully,
break neither legacy Unix programs nor those that use the current
capabilities system (essentially, Bind9 and NTP).  Basically,
capabilities are currently useless because they are never
inheritable (=preserved across execve()), and
this patch makes them so (but carefully enough so as not to confuse
existing programs).  Furthermore, whereas the current Linux
capabilities are only “additional” capabilities (meaning
that normal, non-root processes have none, and adding capabilities
leads up to root), the patch also suggests (and, to some
extent, implements) a new bunch of
“regular” capabilities, which are present on all normal
processes and can be removed so as to provide some measure of
fault containment for partially untrusted or potentially buggy
programs (thus, these new capabilities can be said to lead
down).
Note: Although I believe that this patch will not break anything, it is still little tested and should be considered alpha quality. It should on no account be applied on security-critical systems or on a system where local users are not to be trusted: the security implications are quite complex, and I could quite possibly be wrong in thinking that it doesn't open any local root hole.
This patch has been abandoned due to heavy criticism on the linux-kernel mailing list: essentially because it abandoned POSIX semantics, because it made capabilities inheritable by default, which some people do not want, and because it used the capabilities model (designed for overprivileged processes) to also model underprivileged processes, contrary to what was intended. So it was obvious that the patch could never gain enough acceptance to be included in the kernel. Rather than pursuing it independently, I am trying a more consensual approach: I am splitting the changes into two completely independent parts:
inhcaps mount option to
make capabilities inheritable by default on that filesystem (while
otherwise retaining the POSIX.1e semantics),
and
The two are entirely independent.
This web page is kept for documentation purposes.
See this FTP directory.
The present version (0.4.4) is to be applied against Linux version 2.6.18-rc6, although it should not be too picky about that. (The possibility of serving a git tree is being considered.)
Traditional Unix semantics know only two levels of privileges: root and non-root. Root processes are able to bypass essentially all security checks (mandatory access controls) in the kernel, whereas non-root processes are subject to all of them. There is no intermediate situation. This all-or-nothing solution has the merit of simplicity, but it also means that a program that requires any level of privileges must be made suid root, making it a privileged target for attack and thus dangerous. What capabilities do is split the single “root” privilege into thirty-odd mostly independent bits, so that programs requiring special privileges can be given just those they need rather than full root privileges.
For example, the NTP
daemon needs only the CAP_SYS_TIME capability, in order
to set and skew the system clock: so a capability-aware version of it
starts as root (so with all capabilities) but drops all the
unnecessary ones early in the code.
Furthermore, the patch discussed here adds a new bunch of capabilities which are
present in normal (non-root) processes and which can be removed to
give a process even lower abilities: for example, all daemons could be
run without the CAP_REG_SXID capability, thus making them
incapable of elevating privileges by executing s[ug]id
executables, so even if the daemon is compromised, the attacker could
less easily exploit possible local root holes on the attacked machine
(this could offer some measure of protection when running under chroot
is not feasible).
Currently (i.e., prior to applying this patch), Linux has a notion
of capabilities.  However, it is almost entirely useless and,
therefore, almost entirely unused.  Roughly speaking, all root
processes have all capabilities and all non-root processes have none:
whenever an executable is execve()d as root, it gains all
capabilities, and whenever it is execve()d as non-root, it
loses all.  Thus, there is no way to export capabilities from one
program to another.  Basically the only thing one can do with them is
for a daemon (e.g., Bind9) to
start as root, drop some (but not all) capabilities and switch to a
different uid (with a special,
prctl(PR_SET_KEEPCAPS,1,…), request to maintain
capabilities across setuid()): better than no knowledge
of capabilities at all but, still, not very useful.  One cannot run a
given program with restricted capabilities except by patching the
program's code (that is, it must be made capability-aware, and very
few programs are): there is simply no way to restrict capabilities in
one program and from that point execute another (because all
capabilities will be lost on execve()).
Furthermore, Linux entirely disables one of its capabilities,
CAP_SETPCAP (which would have permitted transferring
capabilities from one process to another to some extent), because it
was incorrectly
thought
to be responsible for a past sendmail-related exploit.  There's really
no reason to disable this (useful) capability, and doing so further
cripples the already deficient Linux caps system.
Most importantly, this patch makes capabilities inheritable: i.e.,
when a process execve()s another executable, capabilities
will be kept (even in the absence of filesystem support for capabilities);
well, it's not really that simple, because we have to make sure not to
break anything, but that's the gist of the idea: the precise
semantics are described in detail below.
The patch also restores the CAP_SETPCAP capability which
was removed for no real reason.
Furthermore, the patch adds a new bunch of capabilities: presently the
Linux capabilities are 32-bit wide with normal non-root processes
having 0 bits everywhere, and this patch makes them
64-bit wide with normal non-root processes having sixteen 1's in
the (new) upper half (normal root processes have 1's
everywhere, of course).  Moving from 32-bit to 64-bit wide capability sets
means that the kernel-level interface changes; however, so as not to
break the (very few) programs and libraries that currently use
capset() and capget(), the kernel checks
the magic version number and will, if necessary, reply with the former
interface.
The patch adds a number of such “regular” (a better
name would be welcome…) capabilities (most important among them
is CAP_REG_SXID, which controls a process's ability to
execute suid programs): but they are intended mostly as a
proof-of-concept and it is quite possible that they will be changed in
the future.
Finally, version 0.4.2 of the patch also adds filesystem support for capabilities (through extended attributes): this is a merge of a patch provided by Serge E. Hallyn, who is in no way to blame for my mischief.
The patch is also available in split form: part 1 introduces 64-bit wide capability sets, part 2 introduces the new inheritance rules, part 3 introduces the new (regular) capabilities, and part 4 (almost entirely Serge's work) introduces the filesystem support.
Each process (or, more accurately, each task) has, at all times, not one but three sets of capabilities: they are called the permitted, effective and inheritable capability sets. Each capability can be present or absent in each of the sets.
The effective set is the one which is actually used to
check permissions when making system calls that require capabilities.
For example, a process needs to have the CAP_CHOWN
capability in its effective set in order to execute the
chown() system call.
The permitted set is the set of capabilities to which
the process has access, at most.  The effective set is, at all times,
a subset of the permitted set: when a given capability is present in
the permitted set, the process may, at will, add it to or remove it from
its effective set.  Once a capability is removed from the permitted
set, however, it cannot be regained except by executing a
suid executable or by having another process use
CAP_SETPCAP.  (This is quite similar to the effective and
real/saved uid's in the traditional Unix approach.)
The inheritable set is also a subset of the permitted
set, and corresponds to capabilities which will be passed across
execve() (note that fork() does not affect
capabilities in any way: both the parent and child processes receive
the same capability sets as before the fork()).  However,
the fine print is a bit more complicated
(and, in any case, in the present, pre-patch, situation, capabilities
are simply not inherited).
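The relationship between the three per-process sets can be sketched as a toy model in C. This is only an illustration of the rules stated above, not kernel code: the struct and the function names (task_caps, raise_effective, lower_effective, drop_permitted) are hypothetical, and the sets are represented as 64-bit masks as in the patched kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one task's capability sets (one bit per capability). */
struct task_caps {
    uint64_t permitted;
    uint64_t effective;
    uint64_t inheritable;
};

/* Raising a capability in the effective set is allowed only if it is
 * present in the permitted set. */
bool raise_effective(struct task_caps *t, unsigned cap)
{
    assert(cap < 64);
    uint64_t bit = UINT64_C(1) << cap;
    if (!(t->permitted & bit))
        return false;
    t->effective |= bit;
    return true;
}

/* Lowering a capability in the effective set is always allowed and is
 * reversible as long as the capability stays in the permitted set. */
void lower_effective(struct task_caps *t, unsigned cap)
{
    assert(cap < 64);
    t->effective &= ~(UINT64_C(1) << cap);
}

/* Dropping a capability from the permitted set also removes it from the
 * other two sets, so that both remain subsets of the permitted set.
 * The task cannot undo this by itself. */
void drop_permitted(struct task_caps *t, unsigned cap)
{
    assert(cap < 64);
    uint64_t bit = UINT64_C(1) << cap;
    t->permitted &= ~bit;
    t->effective &= ~bit;
    t->inheritable &= ~bit;
}
```

In this model, a capability lowered with lower_effective() can be raised again, whereas after drop_permitted() every later raise_effective() for that capability fails.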
A previous Linux capabilities patch which
I had written increased the number of sets to four, adding a
bounding set to the story.  This did not meet much
enthusiasm and this functionality is now essentially replaced by the
CAP_REG_SXID capability.
Now, not only does every process have three sets of capabilities but, with filesystem support for capabilities, every executable file should have three sets of capabilities as well, (confusingly) also called the inheritable (=allowed), permitted (=forced) and effective sets.
The executable's inheritable set corresponds to the set of
capabilities it is willing to receive upon execve(); the
executable's permitted set (a decidedly bad terminology! forced
would be much better, but I am told it is deprecated) are capabilities
which are automatically added upon execve(), whether the
process possessed them or not (thus, this is similar to the
traditional Unix suid mechanism); lastly, the
executable's effective set indicates which capabilities should be
initially made effective.  (Contrary to the situation for processes, there is no
reason for an executable file's inheritable set to be a subset of the
permitted set; in fact, quite the contrary: inheritable bits are
interesting only when they are not in the permitted sets—sorry,
I'm not the one to blame for the confusion.)
Any executable in the absence of filesystem support for capabilities, or
any executable file which is not specially marked, is considered as
though it had every bit set in the inheritable and effective sets and
none in the permitted set—except when it's suid
root, in which case it also has a full permitted set (so it will gain
all capabilities upon execve()), or (in version 0.4.4 of
the patch) if it's suid anything else or
sgid, in which case all capability sets are equal to the
set of “regular” capabilities (so as to provide a
sanitized environment), except in the case when it would break normal
Unix rules (for example, exec of a suid non-root or
sgid program from real-uid=0 should only
restrict the effective set—I'm afraid it's quite a mess).
There is also a (system-wide) capability bounding set, which
controls which capabilities can actually be gained upon
execve(): thus it can be used to permanently disable a
certain capability (for all future processes).
Version 0.4.2 of the patch adds (optional) filesystem support for
capabilities (but only for the low-order part
of capabilities, i.e., those 32 bits which existed before the patch):
it is controlled by an extended attribute with name
security.capability (the format is as follows: the
attribute must contain four 32-bit words in little-endian format, the
first containing the version number 0x19980330 and the
three next containing the effective, permitted and inheritable sets in
this order).  As previously explained, this is the merge of a patch provided by Serge
E. Hallyn (though a few adaptations have been made, such as
making the CAP_REG_SXID capability and
nosuid mount option defeat the effective set in the
executable's capabilities).  Version 0.3.1 has no filesystem
support.
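As an illustration of the attribute layout just described (four 32-bit little-endian words: version, effective, permitted, inheritable), here is a minimal C sketch that packs and unpacks such a 16-byte value. The function and macro names are hypothetical; only the layout and the version number 0x19980330 come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILE_CAP_MAGIC UINT32_C(0x19980330)  /* version word given in the text */
#define FILE_CAP_SIZE  16                    /* four 32-bit words */

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read a 32-bit little-endian value. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Build the attribute value: version, effective, permitted, inheritable. */
void encode_file_caps(uint8_t out[FILE_CAP_SIZE], uint32_t effective,
                      uint32_t permitted, uint32_t inheritable)
{
    assert(out != NULL);
    put_le32(out + 0, FILE_CAP_MAGIC);
    put_le32(out + 4, effective);
    put_le32(out + 8, permitted);
    put_le32(out + 12, inheritable);
}

/* Parse an attribute value; returns 0 on success, -1 on a version mismatch. */
int decode_file_caps(const uint8_t in[FILE_CAP_SIZE], uint32_t *effective,
                     uint32_t *permitted, uint32_t *inheritable)
{
    assert(in != NULL);
    if (get_le32(in + 0) != FILE_CAP_MAGIC)
        return -1;
    *effective = get_le32(in + 4);
    *permitted = get_le32(in + 8);
    *inheritable = get_le32(in + 12);
    return 0;
}
```

On a kernel carrying the patch, the resulting 16 bytes would presumably be attached to a file with setxattr(2) under the name security.capability. Note that the layout later adopted by mainline kernels for that attribute is different, so this sketch only applies to the format described here.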
With the present patch but without filesystem support, or for files which are unmarked, executables are assumed to have a full set of inheritable (=allowed) and effective capabilities (meaning that they will receive all inheritable and effective capabilities from their parent: this is necessary so as not to break Unix semantics) and an empty set of permitted (=forced) capabilities, except when they are suid root, in which case all sets contain all capabilities.
As explained above, each task has three sets of capabilities, the permitted, effective and inheritable sets. We must now describe how these sets are changed or consulted upon certain system calls.
Whenever a permission needs to be checked, the effective set is consulted. This is the standard Linux behavior, and I do not change this.
When a process fork()s, its capability sets are not
modified: both the parent and child processes receive the same
capability sets as before the fork().  This is also
unchanged by the patch.
In order to set capabilities for a target task, the
following checks are observed: first, the target must be the same
process as the caller task or the caller must possess the
CAP_SETPCAP capability (in its effective set).  Second,
the newly raised bits in the inheritable and permitted sets (of the
target) must be part of the current permitted set of the caller.
Third, the (new) effective and
inheritable sets (of the target) must remain subsets of the (new) permitted
set (of the target).  All of this is current Linux code, unchanged by
the patch (except for the part about enforcing the inheritable set to
be a subset of the permitted set: this may have been an oversight or
perhaps a different interpretation of what the inheritable set means,
so I found it cleaner to enforce the constraint by intersecting the
requested inheritable set with the new permitted set).
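The three checks can be modelled in a few lines of C. This is a simplified sketch, not the kernel's code: the names (struct caps, model_capset) are hypothetical, the order in which the checks and the intersection are applied is an assumption, and a refusal is modelled by returning -1 rather than setting errno. CAP_SETPCAP is capability number 8 in include/linux/capability.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP_SETPCAP 8  /* number from include/linux/capability.h */

/* Hypothetical model of one task's capability sets. */
struct caps {
    uint64_t effective;
    uint64_t permitted;
    uint64_t inheritable;
};

/* Model of the checks made when `caller` sets the capabilities of
 * `target` to `req`.  Returns 0 and updates *target on success,
 * -1 if the request is refused. */
int model_capset(const struct caps *caller, struct caps *target,
                 bool same_task, struct caps req)
{
    /* 1. Only the task itself, or a holder of CAP_SETPCAP, may do this. */
    if (!same_task && !(caller->effective & (UINT64_C(1) << CAP_SETPCAP)))
        return -1;

    /* 2. Newly raised permitted/inheritable bits must come from the
     *    caller's current permitted set. */
    uint64_t raised = (req.permitted & ~target->permitted)
                    | (req.inheritable & ~target->inheritable);
    if (raised & ~caller->permitted)
        return -1;

    /* 3. The new effective set must be a subset of the new permitted set;
     *    the inheritable set is forced to be one by intersection, as the
     *    patch does. */
    if (req.effective & ~req.permitted)
        return -1;
    req.inheritable &= req.permitted;

    *target = req;
    assert((target->inheritable & ~target->permitted) == 0);
    return 0;
}
```

With this model, a task can shrink its own sets freely, but a request that tries to regain a bit no longer in the caller's permitted set is refused.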
Then we have compatibility rules for set*uid(): the
reason for this is that legacy Unix programs gain or lose privileges
by using the seteuid(), setuid() and cousin
functions, so we must emulate them with capabilities and make sure
they have the same behavior.  This is how we do it, when a program
does not explicitly request (using
prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities
upon set*uid(): when all three of the real,
effective and saved uid's are set to non-zero (meaning
the program wishes to permanently abandon its root privileges), all
three capability sets are cleared of their additional (system) parts
(all but bits 32–47); when the effective uid is
set to non-zero, only the effective set of capabilities is thus
affected, and when the effective uid is reset to zero,
the effective set is raised to the full permitted set.  This is,
essentially, what the current Linux code does (except for the
inheritable set).  When the program did explicitly request (using
prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities
upon set*uid(), then nothing is altered, except that the
inheritable set is cleared of additional (system)
capabilities (so as to avoid surprising programs which
expected capabilities not to be inherited): perhaps even this behavior
could be suppressed using
prctl(PR_SET_KEEPCAPS,2,…) or something.
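The compatibility rules for set*uid() can be summarized by the following C sketch. It is a model of the description above rather than the patch's actual code: the names (struct uids, setuid_fixup) are hypothetical, and I assume that, when capabilities are kept, the inheritable set is cleared of its system part under the same condition (root being dropped permanently) as in the default case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits 32-47: the "regular" capabilities, which survive loss of root. */
#define CAP_REGULAR_SET UINT64_C(0x0000ffff00000000)

struct caps {
    uint64_t effective;
    uint64_t permitted;
    uint64_t inheritable;
};

/* Real, effective and saved uids of a task (hypothetical helper type). */
struct uids {
    unsigned ruid, euid, suid;
};

static bool any_zero(struct uids u)
{
    return u.ruid == 0 || u.euid == 0 || u.suid == 0;
}

/* Adjust capability sets after a set*uid() call changed `old` into `now`. */
void setuid_fixup(struct caps *c, struct uids old, struct uids now, bool keepcaps)
{
    /* Root is dropped permanently when no uid is zero any more. */
    bool dropped_root = any_zero(old) && !any_zero(now);

    if (keepcaps) {
        /* Assumption: only the inheritable set loses its system part. */
        if (dropped_root)
            c->inheritable &= CAP_REGULAR_SET;
        return;
    }
    if (dropped_root) {
        c->permitted &= CAP_REGULAR_SET;
        c->effective &= CAP_REGULAR_SET;
        c->inheritable &= CAP_REGULAR_SET;
    }
    if (old.euid == 0 && now.euid != 0)   /* temporary drop via euid */
        c->effective &= CAP_REGULAR_SET;
    if (old.euid != 0 && now.euid == 0)   /* euid set back to root */
        c->effective = c->permitted;
    assert((c->effective & ~c->permitted) == 0);
}
```

For instance, a root process that only changes its effective uid to a non-zero value keeps a full permitted set and gets its effective set back when it resets the effective uid to zero.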
Finally, we must describe how the three capability sets are
affected by execve().  Recall that there are also three capability sets associated with
an executable file.  Let us call P(per), P(eff)
and P(inh) the permitted, effective and inheritable sets
for the task before execve(), P′(per),
P′(eff) and P′(inh) the
corresponding sets after execve(), and F(per),
F(eff) and F(inh) the permitted, effective and
inheritable sets for an executable file.  Finally, call bnd
the system-wide capability bounding set.  Then the rules enforced by
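the patch are as follows:
P′(per) ← (P(inh) ∩ F(inh)) ∪ (F(per) ∩ bnd)
P′(eff) ← (P(inh) ∩ P(eff) ∩ F(inh)) ∪ (F(per) ∩ F(eff) ∩ bnd)
P′(inh) ← P′(per)
or, for those who prefer plain ASCII: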
    P'(per) <- (P(inh) & F(inh)) | (F(per) & bnd)
    P'(eff) <- (P(inh) & P(eff) & F(inh)) | (F(per) & F(eff) & bnd)
    P'(inh) <- P'(per)
The first rule is exactly the one documented in the
capabilities(7) manual page.  The other two differ
slightly, but this is demonstrably unavoidable if we are not to break
traditional Unix semantics (the documented rule for the effective set
is P′(eff) ← P′(per) ∩ F(eff) ≡ (P(inh) ∩ F(eff) ∩ F(inh)) ∪ (F(per) ∩ F(eff) ∩ bnd), but this implies that
P′(eff) does not depend on P(eff), thus
breaking the traditional Unix semantics that all of uid,
euid and suid are preserved upon
execve(); similarly, the documented rule for the
inheritable set, viz., P′(inh) ← P(inh), means that
if an executed suid program itself executes something
else, its privileges would be lost).  To justify why the proposed
rules are intuitive, consider this: the first part of the expression
for P′(per) or P′(eff) represents
the capabilities inherited by the exec'ed program (thus, it
should be formed by combining those capabilities which the process had
before exec and was willing to pass on, and those which the file is
willing to inherit), and the second part represents the capabilities
provoked by the exec, and is determined solely by file
capabilities (the difference between the rule we use for the effective
set and the one documented in capabilities(7) should not
be a cause for alarm: the security-critical part is the one which
concerns the forced bits, i.e., the second part of the expression and,
for that, it is identical; in any case, no program can presently rely
on the documented behavior since it is not at all implemented!).  As
for the rule on the inheritable set, it is quite intuitive (unless
they act otherwise, processes will propagate all their capabilities
rather than merely those they themselves received in that way) and
conforming to the Unix legacy behavior.
Now in the absence of filesystem support for capabilities, we must examine what happens for (a) a non-suid-root executable file (F(inh) and F(eff) are full and F(per) is empty), and (b) a suid-root executable file (F(inh), F(eff) and F(per) are all full). In the first case, the rules become:
P′(per) ← P(inh)
P′(eff) ← P(inh) ∩ P(eff)
P′(inh) ← P′(per)
which is quite unsurprising. In case (b), assuming the capability bounding set has not been decreased by the administrator, all sets are set to full, which is the desired behavior.
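The execve() rules given in plain ASCII above translate directly into C on 64-bit masks. The following is a minimal sketch of that transformation; the names (struct caps, exec_transform) are hypothetical and the code is a model, not an excerpt from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One triple of capability sets, used both for a task (P) and a file (F). */
struct caps {
    uint64_t effective;
    uint64_t permitted;
    uint64_t inheritable;
};

/* Compute the task's sets after execve(), given its sets before (p),
 * the file's sets (f) and the system-wide bounding set (bnd). */
struct caps exec_transform(struct caps p, struct caps f, uint64_t bnd)
{
    struct caps n;

    /* P'(per) <- (P(inh) & F(inh)) | (F(per) & bnd) */
    n.permitted = (p.inheritable & f.inheritable) | (f.permitted & bnd);

    /* P'(eff) <- (P(inh) & P(eff) & F(inh)) | (F(per) & F(eff) & bnd) */
    n.effective = (p.inheritable & p.effective & f.inheritable)
                | (f.permitted & f.effective & bnd);

    /* P'(inh) <- P'(per) */
    n.inheritable = n.permitted;

    /* The new effective set is a subset of the new permitted set
     * by construction. */
    assert((n.effective & ~n.permitted) == 0);
    return n;
}
```

Calling exec_transform() with a file whose inheritable and effective sets are full and whose permitted set is empty reproduces case (a): the task keeps P(inh) as its permitted set and P(inh) & P(eff) as its effective set. With all three file sets full and a full bounding set, every resulting set is full, as in case (b).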
Additionally (not in version 0.3.0 of the patch), the
compatibility rules for set*uid() (described above) are
applied also on execve(): this is to cover the
(presumably very rare) case when a process running as root (some
uid=0) executes a suid non-root
executable, thus switching to a different euid and
expecting to lose its effective capabilities (and possibly
permitted/inheritable also, in case the process has real
uid nonzero).
Of course I can't be 100% sure unless I use a formal prover to certify the semantics, which is not really feasible. This is why I'd like the patch to be (1) peer-reviewed and (2) tested (on non-security-critical systems at first!). But I can offer some arguments.
Behavior upon execve() is unsurprising: when executing a
suid root executable, all caps are set; when executing a
non-suid executable, all caps are preserved (since
non-caps-aware legacy programs always have the inheritable set equal
to the permitted set, this follows from the rules we described); and
the case of executing a suid non-root executable has also
been taken care of specifically (by applying the compatibility rules
for set*uid()).  Behavior upon set*uid() is
also preserved: the compatibility rules ensure that the effective
capabilities are synchronized with euid being zero and
the permitted/inheritable capabilities with some uid
being zero; note that the patch does not modify the
set*uid() functions in any way, and only modifies the
compatibility rules insofar as to keep the inheritable set
synchronized with the permitted set (for non-caps-aware legacy
programs) and to retain regular caps.
…execve(), it gets what it expects because the
compatibility rules for set*uid() clear the inheritable
set of additional (system) bits.  Furthermore, the kernel offers a
compatibility version of the
capset()/capget() interface so that binaries
will not break.
The question arises of what should be done about suid non-root (and sgid) programs. Version 0.4.4 of the patch behaves differently, in this respect, from prior versions.
Prior versions did not change the capabilities upon non-root
suid/sgid exec.  One might argue, however,
that the patch makes suid non-root programs vulnerable,
as they could be executed with fewer (regular) capabilities than they
expect.  However, this is not believed to be a serious problem,
because (a) such programs are much rarer than suid
root programs, (b) damage, if any, would be more limited (no
special capabilities are at stake, only access to the filesystem),
(c) removing regular capabilities makes system calls fail with a
clean error code (nothing exotic like the setuid()
function which exhibits a very subtle difference in behavior according
as the CAP_SETUID capability is set or not, which made
the sendmail exploit possible), and (d) system calls can always
fail, so adding new causes for failure is not introducing anything
significantly different.  So I claim that this behavior is safe.
However, since security is a matter of excessive paranoia, version
0.4.4 offers a different behavior by default: non-root
suid/sgid executables behave as though they
had the inheritable (=allowed), effective and permitted (=forced) sets
of capabilities all equal to the set of “regular” (normal,
non-root) capabilities.  Considering the rules of
inheritance, this means that they start with exactly the regular
capabilities in every set.  Well, it's a bit more complicated: when
root execs an sgid program, for example, it shouldn't
drop capabilities (if you want the gory details: if all
uids before exec are non-zero then all capability sets
are set to the regular caps, and if any is zero then the inheritable
(=allowed) and effective sets of the executable are assumed to be the
full set and the permitted (=forced) set to the regular caps; the
compatibility rules for set*uid() will take care of
dropping caps if root uid is actually dropped
permanently).
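The sets which version 0.4.4 assumes for an executable that carries no capability marking can be written down as a small C function. This is a sketch of the description above under simplifying assumptions: the names (enum exec_kind, assumed_file_caps) are hypothetical, a file that is both suid root and sgid is treated as suid root, and the effects of nosuid mounts or of a missing CAP_REG_SXID are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP_FULL_SET    UINT64_MAX
#define CAP_REGULAR_SET UINT64_C(0x0000ffff00000000)  /* bits 32-47 */

/* Capability sets attributed to an executable file. */
struct caps {
    uint64_t effective;
    uint64_t permitted;    /* = forced */
    uint64_t inheritable;  /* = allowed */
};

enum exec_kind {
    EXEC_PLAIN,       /* neither suid nor sgid */
    EXEC_SUID_ROOT,   /* suid root */
    EXEC_SXID_OTHER   /* suid non-root, or sgid */
};

/* File sets assumed for an unmarked executable; any_uid_zero tells whether
 * any of the real, effective or saved uid is zero before the exec. */
struct caps assumed_file_caps(enum exec_kind kind, bool any_uid_zero)
{
    struct caps f = { .effective = CAP_FULL_SET, .permitted = 0,
                      .inheritable = CAP_FULL_SET };

    if (kind == EXEC_PLAIN)
        return f;                        /* inherit everything, force nothing */
    if (kind == EXEC_SUID_ROOT) {
        f.permitted = CAP_FULL_SET;      /* gain all capabilities */
        return f;
    }
    assert(kind == EXEC_SXID_OTHER);
    f.permitted = CAP_REGULAR_SET;       /* force the regular caps back on */
    if (!any_uid_zero) {
        /* Sanitized environment: exactly the regular caps in every set. */
        f.effective = CAP_REGULAR_SET;
        f.inheritable = CAP_REGULAR_SET;
    }
    return f;
}
```

Feeding the result into the execve() rules shows the intended effect: an unprivileged process that had dropped some regular capabilities gets exactly the regular set back when it executes a suid non-root program.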
There is an embryo for a test suite: see here.
Just extract it and type make (as root).  It doesn't test
every aspect of the patch, though.  Make sure to use the test suite
version which matches that of the patch!
So far, an upgrade of the libcap library remains to be
written, so expect things to be a little rough.  But it is still
possible to write simple programs which make use of the patch.  (I
have chosen not to include linux/capability.h from the
programs and, rather, redefine the constants, which would be a very
bad habit in the long run but which is probably simpler while the code
is still experimental.)
The following program (which should be run by an unprivileged user)
runs a shell (or the program specified on the command line) without
the CAP_REG_SXID capability.  This means that, from this
shell, it is impossible to elevate privileges by executing a
set[ug]id program: so it would be a good idea to execute
certain daemons from this wrapper.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#define _LINUX_CAPABILITY_VERSION  0x20060903
#define CAP_REG_SXID 35
typedef struct user_cap_header_struct {
        uint32_t version;
        pid_t pid;
} *cap_user_header_t;
typedef struct user_cap_data_struct {
        uint64_t effective;
        uint64_t permitted;
        uint64_t inheritable;
} *cap_user_data_t;
long
capget (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capget, header, dataptr);
}
long
capset (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capset, header, dataptr);
}
int
main (int argc, char *argv[])
{
  struct user_cap_header_struct header;
  struct user_cap_data_struct data;
  uint64_t mask = ~(1ULL<<CAP_REG_SXID);
  const char *shell;
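  header.version = _LINUX_CAPABILITY_VERSION;
  header.pid = getpid ();
  if ( capget (&header, &data) != 0 )
    {
      perror ("capget");
      return EXIT_FAILURE;
    }
  /* Remove CAP_REG_SXID from all three sets. */
  data.effective &= mask;
  data.permitted &= mask;
  data.inheritable &= mask;
  /* Do not run anything if the capability could not be dropped. */
  if ( capset (&header, &data) != 0 )
    {
      perror ("capset");
      return EXIT_FAILURE;
    }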
  shell = getenv ("SHELL");
  if ( ! shell )
    shell = "/bin/sh";
  if ( argc > 1 )
    return execvp (argv[1], argv+1);
  else
    return execl (shell, shell, NULL);
}
The following program (which should be made suid root
and then run as an unprivileged user) runs a shell (or the program
specified on the command line) with the CAP_CHOWN
capability.  So, from that shell, chown functions as root
(although the user is otherwise unprivileged): if you install this
program executable by a certain group, this effectively gives
chown privilege to the members of that group.  (Of
course, CAP_CHOWN is an example: the same example could
be used with other capabilities—see the
capabilities(7) manual page for examples.)
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#define _LINUX_CAPABILITY_VERSION  0x20060903
#define CAP_CHOWN 0
#define CAP_REGULAR_SET 0x0000ffff00000000ULL
typedef struct user_cap_header_struct {
        uint32_t version;
        pid_t pid;
} *cap_user_header_t;
typedef struct user_cap_data_struct {
        uint64_t effective;
        uint64_t permitted;
        uint64_t inheritable;
} *cap_user_data_t;
long
capget (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capget, header, dataptr);
}
long
capset (cap_user_header_t header, cap_user_data_t dataptr)
{
  return syscall (SYS_capset, header, dataptr);
}
int
main (int argc, char *argv[])
{
  struct user_cap_header_struct header;
  struct user_cap_data_struct data;
  uint64_t mask = CAP_REGULAR_SET|(1ULL<<CAP_CHOWN);
  const char *shell;
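  /* Keep capabilities across the setuid() call below. */
  if ( prctl (PR_SET_KEEPCAPS,1,0,0,0) != 0 )
    {
      perror ("prctl");
      return EXIT_FAILURE;
    }
  /* Drop root permanently: never continue as root if this fails. */
  if ( setuid (getuid ()) != 0 )
    {
      perror ("setuid");
      return EXIT_FAILURE;
    }
  header.version = _LINUX_CAPABILITY_VERSION;
  header.pid = getpid ();
  if ( capget (&header, &data) != 0 )
    {
      perror ("capget");
      return EXIT_FAILURE;
    }
  /* Keep only the regular capabilities plus CAP_CHOWN. */
  data.permitted &= mask;
  data.effective = data.permitted;
  data.inheritable = data.permitted;
  if ( capset (&header, &data) != 0 )
    {
      perror ("capset");
      return EXIT_FAILURE;
    }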
  shell = getenv ("SHELL");
  if ( ! shell )
    shell = "/bin/sh";
  if ( argc > 1 )
    return execvp (argv[1], argv+1);
  else
    return execl (shell, shell, NULL);
}
See the capabilities(7) manual page for a list, or,
better, read include/linux/capability.h (from the kernel
source tree).
With this patch, capabilities come in two bunches: additional capabilities (numbers 0 through 31, and 48 through 63, but those are unused) are not possessed by normal non-root processes, and these are exactly the capabilities of an unpatched Linux kernel, whereas regular capabilities, numbers 32 through 47, are normally possessed by all processes and can be removed to make a process underprivileged. The patch offers six of those, but they are to be thought of more as a “proof of concept” than as a serious proposal:
CAP_REG_FORK (number 32) allows the process to
fork().
CAP_REG_OPEN (number 33) allows the process to
open() a file.
CAP_REG_EXEC (number 34) allows the process to
execve() an executable.
CAP_REG_SXID (number 35) allows the process to gain
privileges by execve()ing an s[ug]id
executable.  This is thought to be the most useful of the lot because
it provides a form of confinement against privilege escalation: it
would seem like a good idea to run various daemons with this
capability turned off.  In version 0.4.2 of the patch, this also turns
off the permitted (=forced) set of capabilities on an executable
file.
CAP_REG_WRITE (number 36) [introduced in version
0.4.2 of the patch] is required for the process to perform any kind of
write operation on the filesystem.  (This could also be quite
interesting; unfortunately, for the moment, it even forbids
writing to /dev/null, which confuses a lot of
scripts.)
CAP_REG_PTRACE (number 37) [introduced in version
0.4.3 of the patch] is required for the ptrace() system
call (except for self-inspection).
Further additions which might be considered could be: having a capability required for any kind of network access.
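For reference, the numbering above can be captured in a few C definitions, in the same spirit as the redefined constants in the example programs. The capability numbers and the value of CAP_REGULAR_SET come from this page; the helper functions (cap_mask, cap_is_regular, regular_range_mask) are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Regular capabilities defined by the patch (numbers from the list above). */
#define CAP_REG_FORK   32
#define CAP_REG_OPEN   33
#define CAP_REG_EXEC   34
#define CAP_REG_SXID   35
#define CAP_REG_WRITE  36
#define CAP_REG_PTRACE 37

/* Bits 32 through 47: what a normal non-root process holds. */
#define CAP_REGULAR_SET UINT64_C(0x0000ffff00000000)

/* Bit mask for a single capability number. */
uint64_t cap_mask(unsigned cap)
{
    assert(cap < 64);
    return UINT64_C(1) << cap;
}

/* True if the capability number lies in the "regular" range 32..47. */
bool cap_is_regular(unsigned cap)
{
    return cap >= 32 && cap <= 47;
}

/* Rebuild the regular set bit by bit, to show where the constant
 * CAP_REGULAR_SET comes from. */
uint64_t regular_range_mask(void)
{
    uint64_t m = 0;
    for (unsigned cap = 32; cap <= 47; cap++)
        m |= cap_mask(cap);
    return m;
}
```

A wrapper that wants to confine a process would clear the relevant bits, for example cap_mask(CAP_REG_SXID), from all three sets before calling execve(), as the first example program does.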
…execve().  It also
changes CAP_REG_SXID so that its absence will return
EPERM when attempting to execute a
suid/sgid executable (rather than execute it
with no permissions changed): this is to preserve the security of
executable but not readable images.
…CAP_REG_PTRACE capability.  Finally, it corrects a stupid
bug which forced the inheritable set of a process to be a subset of
the effective (rather than permitted) set.
…CAP_REG_WRITE capability, fixes
a couple of bugs and adds restrictions on kill() and
whatnot (part of the filesystem patch by Serge E. Hallyn).  It
also disallows the permitted (=forced) executable set in
nosuid-mounted filesystems and when the
CAP_REG_SXID capability is absent.