GCFS: a Garbage-Collected Filesystem for Linux

[ENS] [ENS students] [David Madore]
[Mathematics] [Computer science] [Programs] [Linux] [Literature]

Introduction
Semantics' description
Garbage collection
- Choice of the algorithm

Introduction

(Buzzword mode ON.)

Unix filesystems (and, by extension, filesystems everywhere) follow a tree-like structure, and consequently have peculiar (but necessary) restrictions on their semantics; because of that, they employ (i.e. they can employ) a reference-counting algorithm (keeping track of “inode link counts”) to detect when resources can be freed.

Unix has had “hard links” (i.e. multiply-referenced inodes) ever since the beginning. However, as part of the above-mentioned restrictions, hard links cannot be used on directories (actually, this limitation did not exist in the earliest versions of Unix, say, V5; but it appeared as early as V7). Berkeley Unix introduced symbolic links as a solution to this problem. Indeed, since symbolic links are asymmetric, and do not participate in the determination of what constitutes garbage (in GC terms they are “weak pointers”; actually, they are just names), their presence does not break the tree-like nature of the filesystem. But symbolic links are not entirely satisfactory either, precisely because of their asymmetric nature (or, rather, they are completely satifactory, but only insofar as the filesystem is fixed — when we start moving or removing things, it becomes obvious that symbolic links are not perfect, “dangling symlinks” being just one small aspect of this imperfection). Hard links, on the other hand, are elegant but practically unusable because they are too limited: they cannot be used on directories, and they cannot span across devices. Because of this they have appeared as an odd and isolated feature of Unix (and also the cause of many security problems because sysadmins are not too aware of them or forget about their existence: for example a recursive chown on a user's home dir is a dangerous thing since the user might have linked /etc/passwd into his home).

The GCFS project would relax these restrictions on the filesystem (that is, would provide a filesystem with more flexible semantics). This has advantages and drawbacks, of course, but the GCFS is more intended as an experiment with Unix filesystem semantics and Linux kernel hacking (HINT: c-o-o-l) than as a stable, usable, working filesystem. If this is not sufficiently clear, read: the GCFS is likely to turn your partition, or the whole Universe, into a thick goo, and make Asmodeus and all his legions flow through your nostrils; but the whole point is that this is fun.

In short, the GCFS would make the following extensions to the Unix filesystem semantics:

Permit hard links between directories. This does away with the treelike nature of the filesystem.
Permit moving (or linking) directories to subdirectories of themselves. This introduces cycles in the structure.
Permit unlinking a non empty directory. The files and directories it contains will then be garbage-collected if this makes them inacessible.

For a more detailed description of the desired semantics, see the “Semantics' description” section below.

Semantics' description

This section describes the desired semantics of the garbage collected file system without worrying about either the implementation details, or the relation to Linux.

Informal discussion and examples

Simple examples

The most important improvement I suggest over traditional semantics is to permit hard links between directories. Two hard linked directories (but in fact, I should be speaking of “two names for the same directory”) always contain exactly the same files. Here is a simple example:

computer luser ~ $ mkdir /tmp/foo
computer luser ~ $ cd /tmp/foo
computer luser /tmp/foo $ mkdir bar
computer luser /tmp/foo $ ls -li
total 1
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 14:29 bar
computer luser /tmp/foo $ ln bar baz
computer luser /tmp/foo $ ls -li
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:29 bar
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:29 baz
computer luser /tmp/foo $ cd bar
computer luser /tmp/foo/bar $ touch qux
computer luser /tmp/foo/bar $ ls -lai
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:30 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:29 ..
  17294 -rw-rw-r--   1 luser    luser           0 Mar 32 14:30 qux
computer luser /tmp/foo/bar $ cd ../baz
computer luser /tmp/foo/baz $ ls -lai
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:30 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:29 ..
  17294 -rw-rw-r--   1 luser    luser           0 Mar 32 14:30 qux
computer luser /tmp/foo/baz $

In the above example, the file qux was created in the /tmp/foo/bar directory, and it “magically” appeared in the /tmp/foo/baz directory. At this point, I may want to remove the /tmp/foo/bar directory (again, this terminology is not good: I am merely removing the /tmp/foo/bar name of the directory node). It would be rather pointless to require that the directory be empty before it can be removed. We continue the example:

computer luser /tmp/foo/baz $ cd /tmp/foo
computer luser /tmp/foo $ ls -li
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:30 bar
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:30 baz
computer luser /tmp/foo $ rmdir bar
computer luser /tmp/foo $ ls -li
total 1
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 14:30 baz
computer luser /tmp/foo $ ls -lai baz
total 2
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 14:30 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:30 ..
  17294 -rw-rw-r--   1 luser    luser           0 Mar 32 14:30 qux
computer luser /tmp/foo $

Now since it was possible to remove the /tmp/foo/bar directory without it being empty, there is no reason not to do the same with the /tmp/foo/baz directory:

computer luser /tmp/foo $ rmdir baz
computer luser /tmp/foo $ ls -li
total 0
computer luser /tmp/foo $

At this point, the sometime-called /tmp/foo/bar directory, and the qux file it contained are no longer visible. Unless some process in the system has an open inode to one of these (or has one of these as current working directory or local root), they have become garbage, and they will be reclaimed by the system through a mechanism that, for the moment we leave unspecified. This is the garbage-collection step, which need not worry us as far as defining the semantics goes, because garbage-collection has no semantics.

Loops and tangled hierarchies

We now consider a more vicious example:

computer luser /tmp/foo $ mkdir bar
computer luser /tmp/foo $ cd bar
computer luser /tmp/foo/bar $ ln . loop
computer luser /tmp/foo/bar $ ls -lai
total 3
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 ..
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 loop
computer luser /tmp/foo/bar $ cd loop
computer luser /tmp/foo/bar/loop $ ls -li
total 1
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 loop
computer luser /tmp/foo/bar/loop $ ls -li loop
total 1
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 loop
computer luser /tmp/foo/bar/loop $ cd /tmp/foo/bar
computer luser /tmp/foo/bar $

In the last step of the example above, we have been prudent and have used cd /tmp/foo/bar to go back to our starting point. What “cd ..” would have done is a bit subtle, because of the combination of several factors (the shell, the Linux dentry cache system, and the filesystem semantics for “..”), something we shall discuss later. Similarly, we have been careful to use “ls -li” rather than “ls -lai” in the above examples, for good reason.

At any rate, now that we have created a loop, we see how essential it is to be able to rmdir a non empty directory, otherwise it would be plain impossible to get rid of that loop directory entry. But with a GCFS, we can do:

computer luser /tmp/foo/bar $ rmdir loop
computer luser /tmp/foo/bar $ ls -l
total 2
  17296 drwxrwxr-x   2 luser    luser        1024 Mar 32 14:32 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 ..

Of course, nothing says we can't have more vicious structures. I suggest drawing the graph as it evolves to understand the following:

computer luser /tmp/foo/bar $ ln . baz
computer luser /tmp/foo/bar $ ls -lai
total 3
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:31 ..
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 baz
computer luser /tmp/foo/bar $ ln .. qux
computer luser /tmp/foo/bar $ ls -lai
total 4
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 .
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:31 ..
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 baz
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:31 qux
computer luser /tmp/foo/bar $ ls -li qux
total 1
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 bar
computer luser /tmp/foo/bar $ ln . qux/zot
computer luser /tmp/foo/bar $ ls -li qux
total 2
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 bar
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 zot
computer luser /tmp/foo/bar $ mkdir qux/fum
computer luser /tmp/foo/bar $ cd /tmp/foo/fum
computer luser /tmp/foo/fum $ mv ../bar .
computer luser /tmp/foo/fum $ ls -lai
total 3
  17299 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 .
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 ..
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 bar
computer luser /tmp/foo/fum $ ls -lai bar
total 4
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 .
  17299 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 ..
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 baz
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 qux
computer luser /tmp/foo/fum $ ls -li bar/baz
total 2
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 baz
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 qux
computer luser /tmp/foo/fum $ ls -li bar/baz/qux
total 2
  17299 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 fum
  17296 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 zot
computer luser /tmp/foo/fum $ rmdir bar
computer luser /tmp/foo/fum $ ls -lai /tmp/foo
total 13
  17290 drwxrwxr-x   4 luser    luser        1024 Mar 32 14:33 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 14:29 ..
  17299 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:34 fum
  17296 drwxrwxr-x   3 luser    luser        1024 Mar 32 14:33 zot
computer luser /tmp/foo/fum $ rmdir ../zot
computer luser /tmp/foo/fum $ ls -i /tmp/foo
  17299 fum
computer luser /tmp/foo/fum $ cd
computer luser ~ $ rmdir /tmp/foo
computer luser ~ $

Names and nodes

The distinction between names and (i)nodes is important in traditional Unix filesystems. In the GCFS, it is even much more important. I will now try to make this clear (see also the first paragraph of the formal specifications of the semantics).

A node is the primary component of a filesystem: a file, a directory, a symlink, or even more complicated things. In Unix, it is represented by an inode (I do not wish to dwell upon the subtle distinction between the node, which is the abstract filesystem element, and the inode, which is its concrete Unix representation, so I shall use both terms more or less interchangeably). Think of the filesystem as a graph, and each vertex of the graph as a node.

Names are the mechanism used to designate nodes in the filesystem. If you think of the filesystem as a graph, (simple) names are labels which are attached to the arrows of the graph — rather than the nodes. The target of the arrow is the node which is referred to by the name, and its source is a directory node, the directory in which the node lies. (The uniqueness conditions ensures that from a given directory there is only one arrow with a given name, i.e. the name determines the referred node.)

A pathname (as opposed to a simple name) is a succession of names that represents a path in the graph, starting from the root node (in the case of an absolute path) or any directory node (for a relative path), following the successively named arrows. A pathname has an abstract form (the path in the graph together with the names of the arrows), and a concrete form (the caracter string used to represent it, made by separating the names with the “directory separator character” which is “/” under Unix).

This distinction between names and nodes is not hair-splitting. It is important to understand that some operations naturally deal with nodes, and others with names. An operation such as the Unix chmod() deals with a node; admittedly, the chmod() system call takes a pathname as argument, but in fact, it operates on the node that is referred to by that pathname; in fact, chmod() should be, as much as possible, replaced by fchmod(), which truly operates on the node (represented as an open file descriptor), because this avoids many race conditions.

On the other hand, unlink() operates on a name and not a node. One way to see this is to try to conceive a funlink() system call: clearly that doesn't make sense, because in the presence of hard links (a node having several names), it is not possible to determine which one should be unlinked. What unlink() does, is remove a name; in a traditional Unix filesystem, the (referred to) node will frequently be removed together with the name, but this is really an accident, removing the node is really the task of a garbage collector when the node no longer has any name.

Similarly, a little thought will reveal that the link() operation takes two arguments of different nature. The second (the new name) is truly a name, whereas the first (the old name) is only used as a name insofar as it refers to a node. It would be quite possible to conceive a flink() system call, wherewith the first argument would be an open file descriptor, and the second one a new name to give to the node. (This could be useful to relink an unlinked file which has been kept open.)

Under a traditional Unix filesystem, the rmdir() system call can be viewed as taking a node. This is only true because directories can have no hard linkes, so a directory node has a unique name. But, in fact, nobody sees it that way (fortunately for us), and nobody has been sufficiently perverse to imagine a frmdir() system call. In the GCFS, there is no good reason to distinguish unlink() and rmdir() at all.

Note that Linux does not strictly adhere to this logic. One example is the case of “rmdir(".")”: the logical meaning of this is, “remove the entry with the name ‘.’ in the current directory”; and the actual meaning is “remove the entry in the parent directory that refers to the current directory”. In the GCFS, doing “rmdir(".")” will be forbidden, because we always want to keep the “.” directory entry; however, if the directory is /tmp/foo say, then doing “rmdir("/tmp/foo")” will be permitted, because that does mean “remove the ‘foo’ entry in the /tmp directory”. When any directory node has but one name, this is no longer hair-splitting, it is quite essential to know which name should be removed. Likewise, “rmdir("..")” will also be forbidden (but this is even more subtle).

The problem with “`..`”

The “..” directory entry has its own peculiar problems. In a traditional Unix filesystem, it refers to the parent directory. With the freer semantics that we advocate, this expression has no meaning, so we must settle the case of “..” somehow. And since it is extremely frequently used, we must find a system that will coincide with traditional semantics insofar as are defined.

I shall define two possible kinds of semantics for “..”. The first (“stateless semantics”) just does its best to keep “..” pointing to the best candidate for a parent. The second (“universal covering semantics”) makes “..” point to the Right Place, but at the cost of some magic. I now explain these two systems.

Stateless semantics for “`..`”

In the first (“stateless semantics” — a completely inappropriate term, but I didn't find a better one just now), “.” and “..” are ordinary directory entries in many respect. Concerning “.”, it always points to the directory itself (it is a looping arrow), simply because it is created that way, and it is not permitted to modify it or remove it (the “unlink”, “rmdir” and “rename” opertions return EINVAL when asked to remove, alter or move the “.” entry). Concerning “..”, we have a weaker guarantee: it always exists and always points to a directory node, but that directory can be just about anything (though in most “common cases” it will function as expected). The rules are as follows: when the directory is created, together with a name, “..” is made to point to the parent directory, i.e. the directory in which the name was bound; linking the directory elsewhere, or unlinking a name to it, does not change “..”, and the “unlink”, “rmdir” and “rename” operations return EINVAL when asked to remove, alter or move the “..” entry, just like “.”; but “..” does get modified when the “rename” operation is applied to the directory node in which it lives: it is then made to point to the target directory of the node (that in which the new name for the directory will be created). Note in particular that “rename ("/tmp/foo/bar", "/tmp/foo/bar")” is not a no-op, since it guarantees that the “..” entry in /tmp/foo/bar will then point to /tmp/foo.

Here is an example transcript in which “..” follows these “stateless semantics” rules:

computer luser ~ $ mkdir /tmp/corge
computer luser ~ $ cd /tmp/corge
computer luser /tmp/corge $ mkdir grault
computer luser /tmp/corge $ ls -lai
total 12
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 15:00 grault
computer luser /tmp/corge $ ln grault ../flarp
computer luser /tmp/corge $ ls -ladi ../flarp
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 flarp
computer luser /tmp/corge $ cd ../flarp
computer luser /tmp/flarp $ ls -lai
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 ..
computer luser /tmp/flarp $ ls -lai ..
total 12
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 grault
computer luser /tmp/flarp $ rmdir ../grault
computer luser /tmp/flarp $ ls -lai
total 2
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 15:00 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 ..
computer luser /tmp/flarp $ ls -lai ..
total 11
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
computer luser /tmp/flarp $ strace mv /tmp/flarp /tmp/gorp 2>&1 | grep rename
rename("/tmp/flarp", "/tmp/gorp")       = 0
computer luser /tmp/flarp $ ls -lai
total 11
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
computer luser /tmp/flarp $ ls -ladi ../gorp
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 15:00 gorp
computer luser /tmp/flarp $ rmdir ../gorp
computer luser /tmp/flarp $ ls -lai
total 11
  17291 drwxrwxr-x   1 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
computer luser /tmp/flarp $ rmdir ../corge
computer luser /tmp/flarp $ cd
computer luser ~ $

Note how the “..” entry refers to /tmp/corge (even after the /tmp/corge/grault name has been removed) until the move, and after that refers to /tmp. This may seem rather surprising, but in fact that is normal Unix behavior: rename on a directory will change the “..” entry of that directory.

“Universal cover” semantics for “`..`”

The other semantics I will describe, the “universal covering” semantics is rather more subtle. The point is that there really isn't a “..” entry in a directory (nor is there a “.” entry, in fact). Essentially, “.” in a path means “ignore”, and “..” means “remove the last component”; thus, where “..” points depends not only on where (i.e. on which node) we are, but also on how we got there. So the idea is that we keep, for each open file and directory, and also for the current working directory (and current local root) of each process, not only the node it corresponds to, but also a “node path”, i.e. the list of nodes traversed to get to the node un question. The node is the last element of the node path; what is meant by “.” is “the node itself” and what is meant by “..” is “the previous node in the node path” (actually, the node path obtained by removing the last element of the given node path; if the node path is a single node, it is untouched, because you can't go lower than the root).

With this system, it may be possible to allow things such as “rmdir(".")” (meaning, remove the last arrow followed to get to the current node; but then we need to remember not only the nodes traversed, but the actual arrows followed, which is not exactly the same thing since there can be several arrows — having different names — with the same source and target). However, for simplicity, we will continue to disallow them (with EINVAL; morality: if Solaris can do it, why not us?). Apart from that, “.” and “..” are used to get a node (and not a name), and the node is obtained by the process we have described (the last or before-last in the node path).

With this kind of semantics, we have, e.g. the following transcript:

computer luser ~ $ mkdir /tmp/corge
computer luser ~ $ cd /tmp/corge
computer luser /tmp/corge $ mkdir grault
computer luser /tmp/corge $ ls -lai
total 12
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
  17291 drwxrwxr-x   2 luser    luser        1024 Mar 32 15:00 grault
computer luser /tmp/corge $ ln grault ../flarp
computer luser /tmp/corge $ ls -ladi ../flarp
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 flarp
computer luser /tmp/corge $ cd ../flarp
computer luser /tmp/flarp $ ls -lai
total 11
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
computer luser /tmp/flarp $ ls -lai ../corge
total 12
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
   2001 drwxrwxrwt  42 root     root        10240 Mar 32 15:00 ..
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 grault
computer luser /tmp/flarp $ cd ../corge/grault
computer luser /tmp/corge/grault $ ls -lai
total 2
  17291 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 .
  17290 drwxrwxr-x   3 luser    luser        1024 Mar 32 15:00 ..
computer luser /tmp/corge/grault $ rmdir /tmp/flarp
computer luser /tmp/corge/grault $ rmdir /tmp/corge
computer luser /tmp/corge/grault $ cd
computer luser ~ $

The nice thing about these semantics is that the Linux VFS, since the 2.1.x kernels, essentially already incorporates them: the “dentries” of the Linux kernel correspond more or less to the “node paths” I have just described.

FIXME: Actually, there is a problem with “rename”. Imagine, for example, that we are in /tmp/corge/grault and that we issue a rename ("/tmp/corge/grault", "/tmp/flarp"). The logical thing, after I have said, in the “universal covering” semantics, would be for “..” to continue to point to /tmp/corge since the node path has not been changed; this is definitely what would happen after a link ("/tmp/corge/grault", "/tmp/flarp"); unlink ("/tmp/corge/grault"). But with the semantics currently enforced by the Linux kernel, after the “rename” operation, “..” will point to /tmp as in the “stateless semantics” descibed previously. The way I understand it, this decision is taken the kernel not so much for its semantic value, but rather to make sure the cache works as it should. In the GCFS, the cache will in any case be a considerable pain…

Requirement: The GCFS will follow any one of these semantics, or even a combination of them, in the sense that the “..” entry is required to behave in a certain way if (and only if) both kinds of semantics agree that it should behave in that way. Other than that requirement, the actual semantics used will be determined by ease of implementation (to be examined).

Formal specification

Basic notions

The filesystem semantics deals with the following notions:

Nodes: A node is an abstract entity that makes up the filesystem. File nodes are where the actual data is stored, but our semantics will not deal with the data in question. Directory nodes contain filesystem metadata; (among other things) they contain a (finite) list of directory entries, in which no two entries have the same name.
Directory entry names: A directory entry name is a character string of nonzero length not containing the directory separator character (the ASCII “slash” character under Unix).
Directory entries: A directory entry is a pair consisting of a directory entry name and a node.
Pathnames: A pathname is a character string. Conceptually, it is seen as a sequence of directory entry names separated by (non empty) sequences of directory separator characters, with possible directory separator characters at the start and end of the string (“leading” and “trailing” directory separator characters). A pathname is said to be absolute if it has leading directory separator characters, relative otherwise. The relative part of a pathname is the pathname with all leading directory separator characters removed.
Path components: A path component is a directory entry name (the name of the component), possibly followed by some directory separator characters. The first path component of a non empty relative pathname is the initial substring of that pathname that ends just before the first non directory separator character that follows a directory separator character (or the whole pathname if there is no character matching that description). Thus, it is a path component (it is the longest path component that starts the pathname). (For example, the first path component of “foo//bar/baz///qux/” is “foo//”.) The remainder of the path, of course, is the string obtained by removing the first path component at the start of the path.
Descriptors: In the “stateless” semantics, a descriptor is merely a node. In the “universal covering” semantics, a descriptor is node path, i.e. a finite sequence of nodes of non zero length, all but the last one of which are directory nodes. In more general semantics, a descriptor might be something more complicated, but a descriptor always has a referred node (the descriptor itself it the “stateless” semantics, the last node of the node path in the “universal covering” semantics). Naturally, if we speak, for example, of a “directory descriptor”, we mean a descriptor whose referred node is a directory node.
Root node: The filesystem has a root node, and referring to it, a root descriptor. The root node is a directory node.

The “stateless” semantics will guarantee that every directory node always has two entries, one with name “.”, whose associated node is the directory node itself, and one with name “..”, whose associated node is always a directory node. With the “universal covering” semantics, no directory node will have these entries (though they will appear to be present anyway).

Mutating the filesystem

We now describe the various permitted operations which mutate the filesystem: “open”, “creat”, “mkdir”, “unlink”, “rmdir”, “link” and “rename”.

Each operation comes in two flavors: the “abstract” form, and the “concrete” form. The difference is that the abstract form does not perform path resolution whereas the concrete form does. We start by describing the abstract forms. They generally take a directory descriptor and a non empty path component, with the intention that the node upon which the action is performed is the node having the name of the path component, in the directory node referred by the descriptor. Note that, except otherwise specified, these operations are atomic.

Note: If you think these descriptions are incredibly complex, you are right: they are. They need to be, however, because I am concerned with getting all the gory error return codes right. For example, under Linux, rename("foo/","bar") returns ENOTDIR is foo exists but is not a directory; and rename("foo","bar/") returns ENOENT if foo exists and is not a directory, except if the same is true of bar, in which case it returns ENOTDIR. Strange, isn't it? So I tried in what follows to reproduce the Linux error codes as precisely as possible: if this condition were relaxed, the description would be considerably easier. If you think what follows is a pain to read, think how much it was a pain to write :-b

open: The abstract open operation takes a directory descriptor and a path component. It looks up the path component's name in the directory descriptor's node's directory entries. If no matching entry is found, it fails with the error ENOENT. Now assume an entry is found. If the path component had trailing directory separator characters and the entry's node is not a directory node, then open fails with ENOTDIR. (Naturally, other verifications are possible, notably verification of execute permission on the directory node and of read or write permission on the node found; they may cause other failures.) Otherwise it succeeds, and returns a descriptor that refers to the entry's node. In the “stateless” semantics, the descriptor is precisely the entry's node; in the “universal covering” semantics, it is obtained by appending the entry's node to the node path. A special case of open is when the path component's name is “.” or “..”, in which case the behavior depends on the semantics under consideration; in “stateless” semantics, there is no special behavior, the lookup performs as usual (and always succeeds); in the “universal covering” semantics, no lookup is performed: for “.”, the returned descriptor is that which was passed, whereas for “..” it is that which is obtained by removing the last element from the node path (unless the node path is already of length one).
creat: This operation functions much like open, with the following differences. If the name is matched, and the entry's node is a directory, it fails with EISDIR; if the name is matched, and isn't a directory but the path component had trailing directory separator characters, it fails with ENOTDIR. If the name is unmatched but the path component had trailing directory separator characters, it fails with ENOENT. Finally, if the name is unmatched and the path component had no trailing directory separator characters, it succeeds (there might be other conditions causing it to fail, of course): it creates a fresh file node, adds an entry in the directory node having the name of the path component, that refers to that fresh file node, and returns a descriptor referring to the file node. (With “universal covering” semantics, the descriptor is obtained by adding the node at the end of the node path.) When the path component's name is “.” or “..”, then creat always fails with the error EISDIR.
mkdir: Works similarly to creat except that it creates a directory node instead of a file node, and does not return anything useful (except possibly an error code). If the named entry already exists in the given directory node, it fails with EEXIST (except if the path component had trailing directory separator characters and the entry's node is not a directory node in which case it fails with ENOTDIR). If the given path component name is “.” or “..”, the operation always fails with EEXIST. If the name is unmatched (whether or not the path component name had trailing directory separator characters) the function succeeds. The new directory node is initially empty except possibly for two entries for “.” and “..”: with “universal covering” semantics, these entries are not created; with “stateless” semantics, the “.” entry refers to the created directory node itself, and the “..” entry refers to the node it was created in (the node referred by the first argument to abstract mkdir).
unlink: Takes a directory descriptor and a path component, and removes the directory entry by that name. Classically, this would only work if the directory entry referred to a file node (or, at least, not to a directory node), but because of the looser semantics we are advocating, there is no reason to make this restriction (and there are no conditions like “being non empty” on the directory to be removed, either, if a directory node entry is being removed). Note that only the entry is removed. Removing the node is the task of the garbage collector (see below). Trying to unlink the “.” or “..” directory entry must fail (with EINVAL). If the entry does not exist, the operation fails with ENOENT. If the entry exists but is not a directory whereas the name had trailing directory separator characters, the operation fails with ENOTDIR.
rmdir: In the garbage-collected file system, this is no difference from unlink, except insofar as it will only remove directories (or, if you want, this is the same as unlink with a name with trailing directory separator characters).
link: Link abstractly takes a source directory descriptor, a source path component, a target directory descriptor and a target component. The source descriptor and source path component are used to obtain a node, as per open (if this fails, then link fails with the same error as open). After this source node has been obtained, link creates a directory entry in the given (target) directory node, having the given (target) name, and which points to the given (source) node. (If the name did already exist in the target directory node, the operation fails with EEXIST, except if the target component had trailing directory separated characters and its name referred to a file node, in which case it is ENOTDIR instead.) The target name may not be “.” or “..” otherwise link fails with EEXIST. If the target path component has trailing directory separator characters and the source node is not a directory, then it fails with ENOENT. Note that the source node can be a directory node, and in this case the “..” entry in this directory is not modified (I make this clear because rename may behave differently).
rename: This acts somewhat like a combination of link and unlink. Abstractly, rename takes a (source) directory descriptor, a (source) path component, a (target) directory descriptor and a (target) path component. Neither the source nor the target name may be “.” or “..”, otherwise rename fails with EINVAL. As a first step, the link operation is performed, with the difference that according to the semantics, the “..” directory entry might be changed: in “stateless” semantics, it is made to point to the target directory node. Also, contrary to a normal link, this is allowed to atomically replace the target if it already existed (that is, rename does not return EEXIST but rather removes the target). As a second step, if the source and target (directory node and name) are not identical, the source is removed as per unlink (with failures being ignored). Contrary to all operations described so far, the rename operation is not (necessarily) atomic: there is a window between the two steps during which the moved node might be seen under both the source and the target name.

The following table summarizes the result returned by the various (abstract) mutation operations. The table has one column for each operation described, except link and rename which have two columns (one for source and one for target). There are four pairs of rows: the first corresponds to the case where the entry does not exist in the directory, the second to the case where it exists and refers to a file, the third to the case where it exists and refers to a directory, the fourth to the case where it refers to “.” or “..”. In each pair of rows, the first row corresponds to the case where the name had no trailing directory separator characters, and the second to the case where it had. Each cell contains either the resulting error, or “ok” to inticate success, or “fd” to indicate success with a descriptor being returned. A greater sign in a source column indicates that the target should then be checked. Two entries separated by a slash in a target column correspond to the case where the source node was a file or a directory. The value in brackets corresponds to the traditional Unix semantics (no guarantee about that, however).

Entry Dirsep open creat mkdir unlink rmdir link source link target rename source rename target
None foo ENOENT fd ok ENOENT ENOENT ENOENT ok ENOENT ok
foo/ ENOENT ENOENT ok ENOENT ENOENT ENOENT ENOENT/ok [ENOENT] ENOENT ENOENT/ok
File foo fd fd EEXIST ok ENOTDIR > EEXIST > ok
foo/ ENOTDIR ENOTDIR ENOTDIR ENOTDIR ENOTDIR ENOTDIR ENOTDIR ENOTDIR ENOTDIR
Directory foo fd EISDIR EEXIST ok [EISDIR] ok [(or ENOTEMPTY)] > [EPERM] EEXIST > ok [EISDIR]
foo/ fd EISDIR EEXIST ok [EISDIR] ok [(or ENOTEMPTY)] > [EPERM] EEXIST > ENOENT/ok [EISDIR]
. or .. .. fd EISDIR EEXIST EINVAL [EISDIR] EINVAL > [EPERM] EEXIST EINVAL EINVAL [EBUSY?]
../ fd EISDIR EEXIST EINVAL [EISDIR] EINVAL > [EPERM] EEXIST EINVAL EINVAL [EBUSY?]

Entry	Dirsep	`open`	`creat`	`mkdir`	`unlink`	`rmdir`	`link` source	`link` target	`rename` source	`rename` target
None	`foo`	ENOENT	fd	ok	ENOENT	ENOENT	ENOENT	ok	ENOENT	ok
`foo/`	ENOENT	ENOENT	ok	ENOENT	ENOENT	ENOENT	ENOENT/ok [ENOENT]	ENOENT	ENOENT/ok
File	`foo`	fd	fd	EEXIST	ok	ENOTDIR	>	EEXIST	>	ok
`foo/`	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR	ENOTDIR
Directory	`foo`	fd	EISDIR	EEXIST	ok [EISDIR]	ok [(or ENOTEMPTY)]	> [EPERM]	EEXIST	>	ok [EISDIR]
`foo/`	fd	EISDIR	EEXIST	ok [EISDIR]	ok [(or ENOTEMPTY)]	> [EPERM]	EEXIST	>	ENOENT/ok [EISDIR]
`.` or `..`	`..`	fd	EISDIR	EEXIST	EINVAL [EISDIR]	EINVAL	> [EPERM]	EEXIST	EINVAL	EINVAL [EBUSY?]
`../`	fd	EISDIR	EEXIST	EINVAL [EISDIR]	EINVAL	> [EPERM]	EEXIST	EINVAL	EINVAL [EBUSY?]

Garbage collection semantics

A proper garbage collector should have no semantics at all, i.e. it should be completely transparent: its only effect is on the amount of free space available on the device, but aside from this, user programs (in GC terms, the “mutator” should never be aware that the collector exists — provided space is unlimited, of course).

A node is said to be visible from a descriptor iff there is a path (of directory entries) to the former from the latter, or, in other words, iff the former has a name starting from the former. (As a convention, no node is ever visible from a descriptor that refers to a file.)

We consider the root node, and, for every process running in the system, its current working directory, its local root directory (if the chroot() call exists), and all its open inodes (as descriptors). These nodes will be called the root set. In other words, the root set refers to those nodes which are immediately accessible from some process. Any node which is visible from a node in the root set is called live: so a live node is one which can be named by some process in the system. Any node which is not live is called garbage: a garbage node is one which cannot be named.

As far as the semantics go, garbage nodes might just as well be non existent. The collector's role is to see to it that the resources associated to them are freed.

Other considerations

Recursive traversal

Utilities such as Unix find, which perform a recursive filesystem traversal, will break in the GCFS, because they will be captured by infinite loops, or at the very least they will recurse multiple times through multiply linked directories.

There is no easy solution to this problem. It is part of the price that must be paid to free the filesystem from the treelike structure: graph recursion is evidently more complex than tree recursion. The only way out is to “mark” the already traversed nodes somehow, and this may take a considerable amount of memory.

If “stateless” semantics are used, then the recursing program needs to also follow “..” because it may lead to a node that cannot be reached by other means. With “universal cover” semantics, on the other hand, “..” always leads back to a node already on the recursion stack, so there is no need to follow it.

One possibility, though it is by no means practical, is to perform the traversal once (or once in a while, if the filesystem is being modified), and to link all nodes (visible from a certain user — this involves security considerations) from a certain directory. Then the recursive traversal is replaced by an iterative traversal of the very large directory. Naturally, this directory might prevent garbage collection of the nodes that it links to (i.e. all the nodes), so it should be implemented with weak links if these are available.

Another thing to note is that some nodes in the filesystem may be “potential garbage” in that they are no longer linked from the root node, but may remain on disk while they are reachable from some descriptor being used by a process. Such nodes will not be visited by a recursive traversal.

The failure of the tar utility is particularly irritating, because tar is very precious for performing snapshots of parts of the filesystem to be archived, backed up, saved for later use, or moved around. I do not see how tar could easily be saved. For copies within the same filesystem, an intriguing possibility is suggested by copy on write.

Security considerations

I do not believe the GCFS has any notable effect on security (in the sense of constituting a security problem). Directory lookups and traversal are subject to the usual security checks. Quota checks are performed at the block level and are not perturbed by the way the filesystem is organized (though they might make the sweeping part of the collection more delicate).

There is a tricky thing concerning quotas, however: if Joe user has a large amount of resources on a disk shared with Sally, and Sally makes a hard link to Joe's resources, when Joe wants to remove the resources, he will essentially just remove his name to them, but not Sally's, so the resources will not be garbage-collected, and he might have quota problems. This problem is not specific to the GCFS since it is already true with Unix that any user can make a hard link to another user's file resources (if the file is large, and the first user removes it while the second user still has a link to it, the resource and the quota share will not be freed, either). The GCFS just makes the problem more visible because it also happens with directories. The cautious user will always truncate her files before removing them, and remove all the files within a directory before removing the directory itself.

However, the GCFS brings about with it a slight refinement of the usually coarse security mechanism of Unix. Indeed, the principle of “access through visibility” (which is in many ways the correct way of treating security) becomes slightly more usable. For example, a group of users might be able to cooperate on a project lying in a shared directory without belonging to a common group in the Unix sense of the word: each user simply has a hard link to the shared directory (which is world-writable) within a directory to which only he has access. In that way, the directory protection is offered not by the protection on the node itself but on the access to the node.

This “security through visibility” would need several extensions to the Unix API to become truly interesting (i.e. with descriptors taking the role of capabilities, as Mach ports for example). For example, a flink() system call (allowing to relink in the filesystem an unlinked file or directory which has been kept as an open descriptor by some process) and a method of passing file open descriptors from one process to another (and not just from father to child).

Weak links

Weak links are something intermediate between symbolic links and hard links. They point to a well-defined node, like hard links, but they do not prevent garbage collection of this node. For reasons outlined in the part on recursive traversal, it may be desirable to introduce weak links. Furthermore, for reasons outlined in the security considerations above, it may be desirable to only allow users to only create weak links to another user's resources.

Weak links would be lazily deleted when the corresponding resource is deleted. In the sense that, when the resource is deleted, the link still exists on disk (of course: it isn't possible to look through all the disk to find the weak links to an object), but if any process tries to access the link in any way, then it is removed immediately.

Weak links are a possible improvement over the Unix FS API that does not really concern the GCFS, but I thought it worth noting.

Copy-on-write links

A “copy-on-write link” behaves very much like a hard link, but with an important difference: rather than being two synchronized resources, it corresponds to two resources that are synchronized at some point, but may later diverge.

This means that the link remains a link as long as the resources are identical, but if they diverge, the link becomes a mere copy.

For example, if foo/ is a cow link to bar/ (or vice versa, this is all symmetric), and a file is added to foo/, then the entire foo/ directory is copied to a new node. But its content will be made entirely of cow links to the entries of the bar/ directory, except for the new entry that has been added. In turn, if any of these subdirectories is modified, it will be copied on write to a new node containing further cow links.

This “copy on write” idea is also completely independent from the GCFS. I should be asking myself whether it could be implemented in a traditional (gc-less) filesystem, but every time I try to think about this question, my mind seems to go bezerk.

Sample code

I have written a sample implementation, in the Scheme programming language, of the filesystem semantics I propose.

This code implements the “stateless” semantics for “..”, whereas this one implements the “universal cover” semantics.

Garbage collection

Choice of the algorithm

First of all, it seems obvious to me that we want an incremental algorithm, i.e. one that can be performed in parallel with the normal filesystem activity (the “mutator” activity). A traditional GC must suspend all mutator activity in order to detect garbage. This is unacceptable in a Unix kernel. Instead, collection would be performed by a kernel thread that does not need to preempt over other processes.

No algorithm is known (that I know of) that frees the garbage by small amounts. Instead, the collector runs in cycles: it scans through the nodes in use to find out what does not constitute garbage, and frees what remains. In particular, if the mutator (a user process) demands the creation of a new node while there is none immediately available and while the collector has not completed its cycle, it must fail with ENOSPC (the alternative would be to suspend it until the collector has completed its cycle, but I don't think this is acceptable; or perhaps return EAGAIN, but then we might end up never returning ENOSPC).

[ENS] [ENS students] [David Madore]
[Mathematics] [Computer science] [Programs] [Linux] [Literature]

David Madore

Last modified: $Date: 2000/02/04 23:25:05 $