[ENS]
[ENS students]
[David Madore]
[Mathematics]
[Computer science]
[Programs]
[Linux]
[Literature]
(Buzzword mode ON.)
Unix filesystems (and, by extension, filesystems everywhere) follow a tree-like structure, and consequently have peculiar (but necessary) restrictions on their semantics; because of that, they employ (i.e. they can employ) a reference-counting algorithm (keeping track of “inode link counts”) to detect when resources can be freed.
Unix has had “hard links” (i.e. multiply-referenced
inodes) ever since the beginning. However, as part of the
above-mentioned restrictions, hard links cannot be used on directories
(actually, this limitation did not exist in the earliest versions of
Unix, say, V5; but it appeared as early as V7). Berkeley Unix
introduced symbolic links as a solution to this problem. Indeed,
since symbolic links are asymmetric, and do not participate in the
determination of what constitutes garbage (in GC terms they are
“weak pointers”; actually, they are just names), their
presence does not break the tree-like nature of the filesystem. But
symbolic links are not entirely satisfactory either, precisely because
of their asymmetric nature (or, rather, they are completely
satifactory, but only insofar as the filesystem is fixed — when
we start moving or removing things, it becomes obvious that symbolic
links are not perfect, “dangling symlinks” being just one
small aspect of this imperfection). Hard links, on the other hand,
are elegant but practically unusable because they are too limited:
they cannot be used on directories, and they cannot span across
devices. Because of this they have appeared as an odd and isolated
feature of Unix (and also the cause of many security problems because
sysadmins are not too aware of them or forget about their existence:
for example a recursive chown
on a user's home dir is a
dangerous thing since the user might have linked
/etc/passwd
into his home).
The GCFS project would relax these restrictions on the filesystem (that is, would provide a filesystem with more flexible semantics). This has advantages and drawbacks, of course, but the GCFS is more intended as an experiment with Unix filesystem semantics and Linux kernel hacking (HINT: c-o-o-l) than as a stable, usable, working filesystem. If this is not sufficiently clear, read: the GCFS is likely to turn your partition, or the whole Universe, into a thick goo, and make Asmodeus and all his legions flow through your nostrils; but the whole point is that this is fun.
In short, the GCFS would make the following extensions to the Unix filesystem semantics:
For a more detailed description of the desired semantics, see the “Semantics' description” section below.
This section describes the desired semantics of the garbage collected file system without worrying about either the implementation details, or the relation to Linux.
The most important improvement I suggest over traditional semantics is to permit hard links between directories. Two hard linked directories (but in fact, I should be speaking of “two names for the same directory”) always contain exactly the same files. Here is a simple example:
computer luser ~ $ mkdir /tmp/foo computer luser ~ $ cd /tmp/foo computer luser /tmp/foo $ mkdir bar computer luser /tmp/foo $ ls -li total 1 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 14:29 bar computer luser /tmp/foo $ ln bar baz computer luser /tmp/foo $ ls -li total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:29 bar 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:29 baz computer luser /tmp/foo $ cd bar computer luser /tmp/foo/bar $ touch qux computer luser /tmp/foo/bar $ ls -lai total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:30 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:29 .. 17294 -rw-rw-r-- 1 luser luser 0 Mar 32 14:30 qux computer luser /tmp/foo/bar $ cd ../baz computer luser /tmp/foo/baz $ ls -lai total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:30 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:29 .. 17294 -rw-rw-r-- 1 luser luser 0 Mar 32 14:30 qux computer luser /tmp/foo/baz $
In the above example, the file qux
was created in the
/tmp/foo/bar
directory, and it “magically”
appeared in the /tmp/foo/baz
directory. At this point, I
may want to remove the /tmp/foo/bar
directory (again,
this terminology is not good: I am merely removing the
/tmp/foo/bar
name of the directory node). It
would be rather pointless to require that the directory be empty
before it can be removed. We continue the example:
computer luser /tmp/foo/baz $ cd /tmp/foo computer luser /tmp/foo $ ls -li total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:30 bar 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 14:30 baz computer luser /tmp/foo $ rmdir bar computer luser /tmp/foo $ ls -li total 1 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 14:30 baz computer luser /tmp/foo $ ls -lai baz total 2 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 14:30 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:30 .. 17294 -rw-rw-r-- 1 luser luser 0 Mar 32 14:30 qux computer luser /tmp/foo $
Now since it was possible to remove the /tmp/foo/bar
directory without it being empty, there is no reason not to do the
same with the /tmp/foo/baz
directory:
computer luser /tmp/foo $ rmdir baz computer luser /tmp/foo $ ls -li total 0 computer luser /tmp/foo $
At this point, the sometime-called /tmp/foo/bar
directory, and the qux
file it contained are no longer
visible. Unless some process in the system has an open inode to one
of these (or has one of these as current working directory or local
root), they have become garbage, and they will be reclaimed
by the system through a mechanism that, for the moment we leave
unspecified. This is the garbage-collection step, which need not
worry us as far as defining the semantics goes, because garbage-collection has no semantics.
We now consider a more vicious example:
computer luser /tmp/foo $ mkdir bar computer luser /tmp/foo $ cd bar computer luser /tmp/foo/bar $ ln . loop computer luser /tmp/foo/bar $ ls -lai total 3 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 .. 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 loop computer luser /tmp/foo/bar $ cd loop computer luser /tmp/foo/bar/loop $ ls -li total 1 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 loop computer luser /tmp/foo/bar/loop $ ls -li loop total 1 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 loop computer luser /tmp/foo/bar/loop $ cd /tmp/foo/bar computer luser /tmp/foo/bar $
In the last step of the example above, we have been prudent and have
used cd /tmp/foo/bar
to go back to our starting point.
What “cd ..
” would have done is a bit subtle,
because of the combination of several factors (the shell, the Linux
dentry cache system, and the filesystem semantics for
“..
”), something we shall discuss later. Similarly, we have been careful to use
“ls -li
” rather than “ls
-lai
” in the above examples, for good reason.
At any rate, now that we have created a loop, we see how essential it
is to be able to rmdir
a non empty directory, otherwise
it would be plain impossible to get rid of that loop
directory entry. But with a GCFS, we can do:
computer luser /tmp/foo/bar $ rmdir loop computer luser /tmp/foo/bar $ ls -l total 2 17296 drwxrwxr-x 2 luser luser 1024 Mar 32 14:32 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 ..
Of course, nothing says we can't have more vicious structures. I suggest drawing the graph as it evolves to understand the following:
computer luser /tmp/foo/bar $ ln . baz computer luser /tmp/foo/bar $ ls -lai total 3 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 14:31 .. 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 baz computer luser /tmp/foo/bar $ ln .. qux computer luser /tmp/foo/bar $ ls -lai total 4 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 . 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:31 .. 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 baz 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:31 qux computer luser /tmp/foo/bar $ ls -li qux total 1 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 bar computer luser /tmp/foo/bar $ ln . qux/zot computer luser /tmp/foo/bar $ ls -li qux total 2 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 bar 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 zot computer luser /tmp/foo/bar $ mkdir qux/fum computer luser /tmp/foo/bar $ cd /tmp/foo/fum computer luser /tmp/foo/fum $ mv ../bar . computer luser /tmp/foo/fum $ ls -lai total 3 17299 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 . 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 .. 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 bar computer luser /tmp/foo/fum $ ls -lai bar total 4 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 . 17299 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 .. 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 baz 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 qux computer luser /tmp/foo/fum $ ls -li bar/baz total 2 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 baz 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 qux computer luser /tmp/foo/fum $ ls -li bar/baz/qux total 2 17299 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 fum 17296 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 zot computer luser /tmp/foo/fum $ rmdir bar computer luser /tmp/foo/fum $ ls -lai /tmp/foo total 13 17290 drwxrwxr-x 4 luser luser 1024 Mar 32 14:33 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 14:29 .. 17299 drwxrwxr-x 3 luser luser 1024 Mar 32 14:34 fum 17296 drwxrwxr-x 3 luser luser 1024 Mar 32 14:33 zot computer luser /tmp/foo/fum $ rmdir ../zot computer luser /tmp/foo/fum $ ls -i /tmp/foo 17299 fum computer luser /tmp/foo/fum $ cd computer luser ~ $ rmdir /tmp/foo computer luser ~ $
The distinction between names and (i)nodes is important in traditional Unix filesystems. In the GCFS, it is even much more important. I will now try to make this clear (see also the first paragraph of the formal specifications of the semantics).
A node is the primary component of a filesystem: a file, a directory, a symlink, or even more complicated things. In Unix, it is represented by an inode (I do not wish to dwell upon the subtle distinction between the node, which is the abstract filesystem element, and the inode, which is its concrete Unix representation, so I shall use both terms more or less interchangeably). Think of the filesystem as a graph, and each vertex of the graph as a node.
Names are the mechanism used to designate nodes in the filesystem. If you think of the filesystem as a graph, (simple) names are labels which are attached to the arrows of the graph — rather than the nodes. The target of the arrow is the node which is referred to by the name, and its source is a directory node, the directory in which the node lies. (The uniqueness conditions ensures that from a given directory there is only one arrow with a given name, i.e. the name determines the referred node.)
A pathname (as opposed to a simple name) is a succession of
names that represents a path in the graph, starting from the root node
(in the case of an absolute path) or any directory node (for a
relative path), following the successively named arrows. A pathname
has an abstract form (the path in the graph together with the names of
the arrows), and a concrete form (the caracter string used to
represent it, made by separating the names with the “directory
separator character” which is “/
” under
Unix).
This distinction between names and nodes is not
hair-splitting. It is important to understand that some operations
naturally deal with nodes, and others with names. An operation such
as the Unix chmod()
deals with a node; admittedly, the
chmod()
system call takes a pathname as argument, but in
fact, it operates on the node that is referred to by that pathname; in
fact, chmod()
should be, as much as possible, replaced by
fchmod()
, which truly operates on the node (represented
as an open file descriptor), because this avoids many race conditions.
On the other hand, unlink()
operates on a name
and not a node. One way to see this is to try to conceive a
funlink()
system call: clearly that doesn't make sense,
because in the presence of hard links (a node having several
names), it is not possible to determine which one should be
unlinked. What unlink()
does, is remove a name;
in a traditional Unix filesystem, the (referred to) node will
frequently be removed together with the name, but this is really an
accident, removing the node is really the task of a garbage collector
when the node no longer has any name.
Similarly, a little thought will reveal that the link()
operation takes two arguments of different nature. The second (the
new name) is truly a name, whereas the first (the old name) is only
used as a name insofar as it refers to a node. It would be quite
possible to conceive a flink()
system call, wherewith the
first argument would be an open file descriptor, and the second one a
new name to give to the node. (This could be useful to relink an
unlinked file which has been kept open.)
Under a traditional Unix filesystem, the rmdir()
system
call can be viewed as taking a node. This is only true because
directories can have no hard linkes, so a directory node has a unique
name. But, in fact, nobody sees it that way (fortunately for us), and
nobody has been sufficiently perverse to imagine a
frmdir()
system call. In the GCFS, there is
no good reason to distinguish unlink()
and
rmdir()
at all.
Note that Linux does not strictly adhere to this logic. One example
is the case of “rmdir(".")
”: the logical
meaning of this is, “remove the entry with the name
‘.
’ in the current directory”; and the
actual meaning is “remove the entry in the parent
directory that refers to the current directory”. In the
GCFS, doing “rmdir(".")
” will be
forbidden, because we always want to keep the
“.
” directory entry; however, if the
directory is /tmp/foo
say, then doing
“rmdir("/tmp/foo")
” will be permitted,
because that does mean “remove the
‘foo
’ entry in the /tmp
directory”. When any directory node has but one name, this is
no longer hair-splitting, it is quite essential to know which name
should be removed. Likewise, “rmdir("..")
”
will also be forbidden (but this is even more subtle).
..
”
The “..
” directory entry has its own peculiar
problems. In a traditional Unix filesystem, it refers to the parent
directory. With the freer semantics that we advocate, this expression
has no meaning, so we must settle the case of
“..
” somehow. And since it is extremely
frequently used, we must find a system that will coincide with
traditional semantics insofar as are defined.
I shall define two possible kinds of semantics for
“..
”. The first (“stateless
semantics”) just does its best to keep
“..
” pointing to the best candidate for a
parent. The second (“universal covering semantics”) makes
“..
” point to the Right Place, but at the
cost of some magic. I now explain these two systems.
..
”
In the first (“stateless semantics” — a completely
inappropriate term, but I didn't find a better one just now),
“.
” and “..
” are
ordinary directory entries in many respect. Concerning
“.
”, it always points to the directory itself
(it is a looping arrow), simply because it is created that way, and it
is not permitted to modify it or remove it (the “unlink”,
“rmdir” and “rename” opertions return EINVAL
when asked to remove, alter or move the “.
”
entry). Concerning “..
”, we have a weaker
guarantee: it always exists and always points to a directory node, but
that directory can be just about anything (though in most
“common cases” it will function as expected). The rules
are as follows: when the directory is created, together with a name,
“..
” is made to point to the parent
directory, i.e. the directory in which the name was bound;
linking the directory elsewhere, or unlinking a name to it, does not
change “..
”, and the “unlink”,
“rmdir” and “rename” operations return EINVAL
when asked to remove, alter or move the “..
”
entry, just like “.
”; but
“..
” does get modified when the
“rename” operation is applied to the directory node in
which it lives: it is then made to point to the target directory of
the node (that in which the new name for the directory will be
created). Note in particular that “rename
("/tmp/foo/bar", "/tmp/foo/bar")
” is not a no-op, since
it guarantees that the “..
” entry in
/tmp/foo/bar
will then point to /tmp/foo
.
Here is an example transcript in which “..
”
follows these “stateless semantics” rules:
computer luser ~ $ mkdir /tmp/corge computer luser ~ $ cd /tmp/corge computer luser /tmp/corge $ mkdir grault computer luser /tmp/corge $ ls -lai total 12 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 15:00 grault computer luser /tmp/corge $ ln grault ../flarp computer luser /tmp/corge $ ls -ladi ../flarp 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 flarp computer luser /tmp/corge $ cd ../flarp computer luser /tmp/flarp $ ls -lai total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 .. computer luser /tmp/flarp $ ls -lai .. total 12 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 grault computer luser /tmp/flarp $ rmdir ../grault computer luser /tmp/flarp $ ls -lai total 2 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 15:00 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 .. computer luser /tmp/flarp $ ls -lai .. total 11 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. computer luser /tmp/flarp $ strace mv /tmp/flarp /tmp/gorp 2>&1 | grep rename rename("/tmp/flarp", "/tmp/gorp") = 0 computer luser /tmp/flarp $ ls -lai total 11 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. computer luser /tmp/flarp $ ls -ladi ../gorp 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 15:00 gorp computer luser /tmp/flarp $ rmdir ../gorp computer luser /tmp/flarp $ ls -lai total 11 17291 drwxrwxr-x 1 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. computer luser /tmp/flarp $ rmdir ../corge computer luser /tmp/flarp $ cd computer luser ~ $
Note how the “..
” entry refers to
/tmp/corge
(even after the /tmp/corge/grault
name has been removed) until the move, and after that refers to
/tmp
. This may seem rather surprising, but in fact that
is normal Unix behavior: rename
on a directory will
change the “..
” entry of that directory.
..
”
The other semantics I will describe, the “universal
covering” semantics is rather more subtle. The point is that
there really isn't a “..
” entry in a
directory (nor is there a “.
” entry, in
fact). Essentially, “.
” in a path means
“ignore”, and “..
” means
“remove the last component”; thus, where
“..
” points depends not only on where
(i.e. on which node) we are, but also on how we got there. So
the idea is that we keep, for each open file and directory, and also
for the current working directory (and current local root) of each
process, not only the node it corresponds to, but also a “node
path”, i.e. the list of nodes traversed to get to the node
un question. The node is the last element of the node path; what is
meant by “.
” is “the node itself”
and what is meant by “..
” is “the
previous node in the node path” (actually, the node path
obtained by removing the last element of the given node path; if the
node path is a single node, it is untouched, because you can't go
lower than the root).
With this system, it may be possible to allow things such as
“rmdir(".")
” (meaning, remove the last arrow
followed to get to the current node; but then we need to remember not
only the nodes traversed, but the actual arrows followed, which is not
exactly the same thing since there can be several arrows —
having different names — with the same source and target).
However, for simplicity, we will continue to disallow them (with
EINVAL; morality: if Solaris can do it, why not us?). Apart from
that, “.
” and “..
”
are used to get a node (and not a name), and the node is obtained by
the process we have described (the last or before-last in the node
path).
With this kind of semantics, we have, e.g. the following transcript:
computer luser ~ $ mkdir /tmp/corge computer luser ~ $ cd /tmp/corge computer luser /tmp/corge $ mkdir grault computer luser /tmp/corge $ ls -lai total 12 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. 17291 drwxrwxr-x 2 luser luser 1024 Mar 32 15:00 grault computer luser /tmp/corge $ ln grault ../flarp computer luser /tmp/corge $ ls -ladi ../flarp 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 flarp computer luser /tmp/corge $ cd ../flarp computer luser /tmp/flarp $ ls -lai total 11 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. computer luser /tmp/flarp $ ls -lai ../corge total 12 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 2001 drwxrwxrwt 42 root root 10240 Mar 32 15:00 .. 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 grault computer luser /tmp/flarp $ cd ../corge/grault computer luser /tmp/corge/grault $ ls -lai total 2 17291 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 . 17290 drwxrwxr-x 3 luser luser 1024 Mar 32 15:00 .. computer luser /tmp/corge/grault $ rmdir /tmp/flarp computer luser /tmp/corge/grault $ rmdir /tmp/corge computer luser /tmp/corge/grault $ cd computer luser ~ $
The nice thing about these semantics is that the Linux VFS, since the 2.1.x kernels, essentially already incorporates them: the “dentries” of the Linux kernel correspond more or less to the “node paths” I have just described.
FIXME: Actually, there is a problem with
“rename”. Imagine, for example, that we are in
/tmp/corge/grault
and that we issue a rename
("/tmp/corge/grault", "/tmp/flarp")
. The logical thing, after
I have said, in the “universal covering” semantics, would
be for “..
” to continue to point to
/tmp/corge
since the node path has not been changed; this
is definitely what would happen after a link
("/tmp/corge/grault", "/tmp/flarp"); unlink
("/tmp/corge/grault")
. But with the semantics currently
enforced by the Linux kernel, after the “rename”
operation, “..
” will point to
/tmp
as in the “stateless semantics” descibed
previously. The way I understand it, this decision is taken the
kernel not so much for its semantic value, but rather to make sure the
cache works as it should. In the GCFS, the cache will in
any case be a considerable pain…
Requirement: The GCFS will follow any
one of these semantics, or even a combination of them, in the sense
that the “..
” entry is required to behave in
a certain way if (and only if) both kinds of semantics agree
that it should behave in that way. Other than that requirement, the
actual semantics used will be determined by ease of implementation (to
be examined).
The filesystem semantics deals with the following notions:
The “stateless” semantics will guarantee that every
directory node always has two entries, one with name
“.
”, whose associated node is the directory
node itself, and one with name “..
”, whose
associated node is always a directory node. With the “universal
covering” semantics, no directory node will have these entries
(though they will appear to be present anyway).
We now describe the various permitted operations which mutate the filesystem: “open”, “creat”, “mkdir”, “unlink”, “rmdir”, “link” and “rename”.
Each operation comes in two flavors: the “abstract” form, and the “concrete” form. The difference is that the abstract form does not perform path resolution whereas the concrete form does. We start by describing the abstract forms. They generally take a directory descriptor and a non empty path component, with the intention that the node upon which the action is performed is the node having the name of the path component, in the directory node referred by the descriptor. Note that, except otherwise specified, these operations are atomic.
Note: If you think these descriptions are incredibly
complex, you are right: they are. They need to be, however, because I
am concerned with getting all the gory error return codes right. For
example, under Linux, rename("foo/","bar")
returns
ENOTDIR is foo
exists but is not a directory; and
rename("foo","bar/")
returns ENOENT if foo
exists and is not a directory, except if the same is true of
bar
, in which case it returns ENOTDIR. Strange, isn't
it? So I tried in what follows to reproduce the Linux error codes as
precisely as possible: if this condition were relaxed, the description
would be considerably easier. If you think what follows is a pain to
read, think how much it was a pain to write :-b
open
open
operation
takes a directory descriptor and a path component. It looks up the
path component's name in the directory descriptor's node's directory
entries. If no matching entry is found, it fails with the error
ENOENT. Now assume an entry is found. If the path component had
trailing directory separator characters and the entry's node is not a
directory node, then open
fails with ENOTDIR.
(Naturally, other verifications are possible, notably verification of
execute permission on the directory node and of read or write
permission on the node found; they may cause other failures.)
Otherwise it succeeds, and returns a descriptor that refers to the
entry's node. In the “stateless” semantics, the
descriptor is precisely the entry's node; in the “universal
covering” semantics, it is obtained by appending the entry's
node to the node path. A special case of open
is when
the path component's name is “.
” or
“..
”, in which case the behavior depends on
the semantics under consideration; in “stateless”
semantics, there is no special behavior, the lookup performs as usual
(and always succeeds); in the “universal covering”
semantics, no lookup is performed: for “.
”,
the returned descriptor is that which was passed, whereas for
“..
” it is that which is obtained by removing
the last element from the node path (unless the node path is already
of length one).creat
open
, with the following differences. If the name is
matched, and the entry's node is a directory, it fails with EISDIR; if
the name is matched, and isn't a directory but the path component had
trailing directory separator characters, it fails with ENOTDIR. If
the name is unmatched but the path component had trailing directory
separator characters, it fails with ENOENT. Finally, if the name is
unmatched and the path component had no trailing directory separator
characters, it succeeds (there might be other conditions causing it to
fail, of course): it creates a fresh file node, adds an entry in the
directory node having the name of the path component, that refers to
that fresh file node, and returns a descriptor referring to the file
node. (With “universal covering” semantics, the
descriptor is obtained by adding the node at the end of the node
path.) When the path component's name is “.
”
or “..
”, then creat
always fails
with the error EISDIR.mkdir
creat
except
that it creates a directory node instead of a file node, and does not
return anything useful (except possibly an error code). If the named
entry already exists in the given directory node, it fails with EEXIST
(except if the path component had trailing directory separator
characters and the entry's node is not a directory node in which case
it fails with ENOTDIR). If the given path component name is
“.
” or “..
”, the
operation always fails with EEXIST. If the name is unmatched (whether
or not the path component name had trailing directory separator
characters) the function succeeds. The new directory node is
initially empty except possibly for two entries for
“.
” and “..
”: with
“universal covering” semantics, these entries are not
created; with “stateless” semantics, the
“.
” entry refers to the created directory
node itself, and the “..
” entry refers to the
node it was created in (the node referred by the first argument to
abstract mkdir
).unlink
.
” or “..
”
directory entry must fail (with EINVAL). If the entry does not exist,
the operation fails with ENOENT. If the entry exists but is not a
directory whereas the name had trailing directory separator
characters, the operation fails with ENOTDIR.rmdir
unlink
, except insofar as it will
only remove directories (or, if you want, this is the same as
unlink
with a name with trailing directory separator
characters).link
open
(if this fails,
then link
fails with the same error as
open
). After this source node has been obtained,
link
creates a directory entry in the given (target)
directory node, having the given (target) name, and which points to
the given (source) node. (If the name did already exist in the target
directory node, the operation fails with EEXIST, except if the target
component had trailing directory separated characters and its name
referred to a file node, in which case it is ENOTDIR instead.) The
target name may not be “.
” or
“..
” otherwise link
fails with
EEXIST. If the target path component has trailing directory separator
characters and the source node is not a directory, then it fails with
ENOENT. Note that the source node can be a directory node,
and in this case the “..
” entry in this
directory is not modified (I make this clear because
rename
may behave differently).rename
link
and unlink
. Abstractly,
rename
takes a (source) directory descriptor, a (source)
path component, a (target) directory descriptor and a (target) path
component. Neither the source nor the target name may be
“.
” or “..
”,
otherwise rename
fails with EINVAL. As a first step, the
link
operation is performed, with the difference that
according to the semantics, the “..
”
directory entry might be changed: in “stateless”
semantics, it is made to point to the target directory node. Also,
contrary to a normal link
, this is allowed to atomically
replace the target if it already existed (that is, rename
does not return EEXIST but rather removes the target). As a second
step, if the source and target (directory node and name) are not
identical, the source is removed as per unlink
(with
failures being ignored). Contrary to all operations described so far,
the rename
operation is not (necessarily)
atomic: there is a window between the two steps during which the moved
node might be seen under both the source and the target name.
The following table summarizes the result returned by the various
(abstract) mutation operations. The table has one column for each
operation described, except link
and rename
which have two columns (one for source and one for target). There are
four pairs of rows: the first corresponds to the case where the entry
does not exist in the directory, the second to the case where it
exists and refers to a file, the third to the case where it exists and
refers to a directory, the fourth to the case where it refers to
“.
” or “..
”. In
each pair of rows, the first row corresponds to the case where the
name had no trailing directory separator characters, and the second to
the case where it had. Each cell contains either the resulting error,
or “ok” to inticate success, or “fd” to
indicate success with a descriptor being returned. A greater sign in
a source column indicates that the target should then be checked. Two
entries separated by a slash in a target column correspond to the case
where the source node was a file or a directory. The value in
brackets corresponds to the traditional Unix semantics (no guarantee
about that, however).
Entry | Dirsep | open | creat
| mkdir | unlink | rmdir
| link source | link target
| rename source | rename target
|
---|---|---|---|---|---|---|---|---|---|---|
None | foo | ENOENT | fd | ok | ENOENT | ENOENT | ENOENT | ok | ENOENT | ok |
foo/ | ENOENT | ENOENT | ok | ENOENT | ENOENT | ENOENT | ENOENT/ok [ENOENT] | ENOENT | ENOENT/ok | |
File | foo | fd | fd | EEXIST | ok | ENOTDIR | > | EEXIST | > | ok |
foo/ | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | ENOTDIR | |
Directory | foo | fd | EISDIR | EEXIST | ok [EISDIR] | ok [(or ENOTEMPTY)] | > [EPERM] | EEXIST | > | ok [EISDIR] |
foo/ | fd | EISDIR | EEXIST | ok [EISDIR] | ok [(or ENOTEMPTY)] | > [EPERM] | EEXIST | > | ENOENT/ok [EISDIR] | |
. or .. | .. | fd | EISDIR | EEXIST | EINVAL [EISDIR] | EINVAL | > [EPERM] | EEXIST | EINVAL | EINVAL [EBUSY?] |
../ | fd | EISDIR | EEXIST | EINVAL [EISDIR] | EINVAL | > [EPERM] | EEXIST | EINVAL | EINVAL [EBUSY?] |
A proper garbage collector should have no semantics at all, i.e. it should be completely transparent: its only effect is on the amount of free space available on the device, but aside from this, user programs (in GC terms, the “mutator” should never be aware that the collector exists — provided space is unlimited, of course).
A node is said to be visible from a descriptor iff there is a path (of directory entries) to the former from the latter, or, in other words, iff the former has a name starting from the former. (As a convention, no node is ever visible from a descriptor that refers to a file.)
We consider the root node, and, for every process running in the
system, its current working directory, its local root directory (if
the chroot()
call exists), and all its open inodes (as
descriptors). These nodes will be called the root set. In
other words, the root set refers to those nodes which are
immediately accessible from some process. Any node which is visible
from a node in the root set is called live: so a
live node is one which can be named by some process in the
system. Any node which is not live is called
garbage: a garbage node is one which cannot be
named.
As far as the semantics go, garbage nodes might just as well be non existent. The collector's role is to see to it that the resources associated to them are freed.
Utilities such as Unix find
, which perform a recursive
filesystem traversal, will break in the GCFS, because
they will be captured by infinite loops, or at the very least they
will recurse multiple times through multiply linked directories.
There is no easy solution to this problem. It is part of the price that must be paid to free the filesystem from the treelike structure: graph recursion is evidently more complex than tree recursion. The only way out is to “mark” the already traversed nodes somehow, and this may take a considerable amount of memory.
If “stateless” semantics are
used, then the recursing program needs to also follow
“..
” because it may lead to a node that
cannot be reached by other means. With “universal cover” semantics, on the other
hand, “..
” always leads back to a node
already on the recursion stack, so there is no need to follow it.
One possibility, though it is by no means practical, is to perform the traversal once (or once in a while, if the filesystem is being modified), and to link all nodes (visible from a certain user — this involves security considerations) from a certain directory. Then the recursive traversal is replaced by an iterative traversal of the very large directory. Naturally, this directory might prevent garbage collection of the nodes that it links to (i.e. all the nodes), so it should be implemented with weak links if these are available.
Another thing to note is that some nodes in the filesystem may be “potential garbage” in that they are no longer linked from the root node, but may remain on disk while they are reachable from some descriptor being used by a process. Such nodes will not be visited by a recursive traversal.
The failure of the tar
utility is particularly
irritating, because tar
is very precious for performing
snapshots of parts of the filesystem to be archived, backed up, saved
for later use, or moved around. I do not see how tar
could easily be saved. For copies within the same filesystem, an
intriguing possibility is suggested by copy on
write.
I do not believe the GCFS has any notable effect on security (in the sense of constituting a security problem). Directory lookups and traversal are subject to the usual security checks. Quota checks are performed at the block level and are not perturbed by the way the filesystem is organized (though they might make the sweeping part of the collection more delicate).
There is a tricky thing concerning quotas, however: if Joe user has a large amount of resources on a disk shared with Sally, and Sally makes a hard link to Joe's resources, when Joe wants to remove the resources, he will essentially just remove his name to them, but not Sally's, so the resources will not be garbage-collected, and he might have quota problems. This problem is not specific to the GCFS since it is already true with Unix that any user can make a hard link to another user's file resources (if the file is large, and the first user removes it while the second user still has a link to it, the resource and the quota share will not be freed, either). The GCFS just makes the problem more visible because it also happens with directories. The cautious user will always truncate her files before removing them, and remove all the files within a directory before removing the directory itself.
However, the GCFS brings about with it a slight refinement of the usually coarse security mechanism of Unix. Indeed, the principle of “access through visibility” (which is in many ways the correct way of treating security) becomes slightly more usable. For example, a group of users might be able to cooperate on a project lying in a shared directory without belonging to a common group in the Unix sense of the word: each user simply has a hard link to the shared directory (which is world-writable) within a directory to which only he has access. In that way, the directory protection is offered not by the protection on the node itself but on the access to the node.
This “security through visibility” would need several
extensions to the Unix API to become truly interesting
(i.e. with descriptors taking the role of capabilities, as Mach
ports for example). For example, a flink()
system call
(allowing to relink in the filesystem an unlinked file or directory
which has been kept as an open descriptor by some process) and a
method of passing file open descriptors from one process to another
(and not just from father to child).
Weak links are something intermediate between symbolic links and hard links. They point to a well-defined node, like hard links, but they do not prevent garbage collection of this node. For reasons outlined in the part on recursive traversal, it may be desirable to introduce weak links. Furthermore, for reasons outlined in the security considerations above, it may be desirable to only allow users to only create weak links to another user's resources.
Weak links would be lazily deleted when the corresponding resource is deleted. In the sense that, when the resource is deleted, the link still exists on disk (of course: it isn't possible to look through all the disk to find the weak links to an object), but if any process tries to access the link in any way, then it is removed immediately.
Weak links are a possible improvement over the Unix FS API that does not really concern the GCFS, but I thought it worth noting.
A “copy-on-write link” behaves very much like a hard link, but with an important difference: rather than being two synchronized resources, it corresponds to two resources that are synchronized at some point, but may later diverge.
This means that the link remains a link as long as the resources are identical, but if they diverge, the link becomes a mere copy.
For example, if foo/
is a cow link to
bar/
(or vice versa, this is all
symmetric), and a file is added to foo/
, then the entire
foo/
directory is copied to a new node. But its content
will be made entirely of cow links to the entries of the
bar/
directory, except for the new entry that has been
added. In turn, if any of these subdirectories is modified, it will
be copied on write to a new node containing further cow
links.
This “copy on write” idea is also completely independent from the GCFS. I should be asking myself whether it could be implemented in a traditional (gc-less) filesystem, but every time I try to think about this question, my mind seems to go bezerk.
I have written a sample implementation, in the Scheme programming language, of the filesystem semantics I propose.
This code implements the “stateless” semantics for
“..
”, whereas this
one implements the “universal
cover” semantics.
First of all, it seems obvious to me that we want an incremental algorithm, i.e. one that can be performed in parallel with the normal filesystem activity (the “mutator” activity). A traditional GC must suspend all mutator activity in order to detect garbage. This is unacceptable in a Unix kernel. Instead, collection would be performed by a kernel thread that does not need to preempt over other processes.
No algorithm is known (that I know of) that frees the garbage by small amounts. Instead, the collector runs in cycles: it scans through the nodes in use to find out what does not constitute garbage, and frees what remains. In particular, if the mutator (a user process) demands the creation of a new node while there is none immediately available and while the collector has not completed its cycle, it must fail with ENOSPC (the alternative would be to suspend it until the collector has completed its cycle, but I don't think this is acceptable; or perhaps return EAGAIN, but then we might end up never returning ENOSPC).
[ENS]
[ENS students]
[David Madore]
[Mathematics]
[Computer science]
[Programs]
[Linux]
[Literature]
Last modified: $Date: 2000/02/04 23:25:05 $