Quines (self-replicating programs)

What is a quine? What is this page?

A “quine” (or “selfrep”) is a computer program which prints its own listing. This may sound either impossible, or trivial, or completely uninteresting, depending on your temper and your knowledge of computer science. Actually, it is possible, and there are some interesting ideas involved (in particular, writing a quine is not a hack that only works because the programming language has certain nice properties — it is a consequence of the general so-called “fixed-point” theorem, itself an instance of Cantor's ubiquitous diagonal argument).

Quines are so named after the American mathematician and logician Willard van Orman Quine (1908/06/25–2000/12/25) who introduced the concept. This page is dedicated to his memory.

I also dedicate this page to Douglas R. Hofstadter, who coined the name (in his justly famous book Gödel, Escher, Bach) and who so clearly explained quines' importance and their relation with Gödel's incompleteness theorem.

Introduction

A quine is a program which prints its own listing. This means that when the program is run, it must print out precisely those instructions which the programmer wrote as part of the program (including, of course, the instructions that do the printing, and the data used in the printing).

The easiest way to do that, of course, is to seek the source file on the disk, open it, and print its contents. That may be done, but it is considered cheating; besides, the program might not know where the source file is, it may have access to only the compiled data, or the programming language may simply forbid that sort of operation.

The interesting thing is that writing a quine does not depend on any kind of hack such as being able to read a source file, or even being able to represent quotes in several different ways. Any programming language which is Turing complete, and which is able to output any string (by a computable function of the string as program — this is a technical condition that is satisfied in every programming language in existence) has a quine program (and, in fact, infinitely many quine programs, and many similar curiosities) as follows by the fixed-point theorem. Moreover, the fixed-point theorem is constructive, so the construction of the quine is merely a matter of patience, not guesswork (or intelligence as some prefer to call it ;-). This is not to imply, of course, that actually writing a short or interesting quine may not demand a lot of cleverness. Still, it says that there is nothing “magical” behind quines; and also nothing says that they have to be obfuscated, difficult to read, or devoid of comments, as they often are.

A first attempt and example

We try writing a quine in C. We choose C because it is widely known, and also because the printf() function has features which will make writing a quine considerably easier (this is a mixed blessing: it is a gain because it makes the quine smaller, but it also makes it noticeably more obscure and “hackish”).

We will want the quine to be correct C code, so it will probably have to begin something like this:

#include <stdio.h>

int
main (void)
{

The first thing we want to do is print everything that precedes. Naïvely, we could write:

  printf("#include <stdio.h>\n\nint\nmain (void)\n{\n");

Then we need to print this line itself:

  printf("printf(\"#include <stdio.h>\\n\\nint\\nmain (void)\\n{\\n\");\n");

And so on. It should be obvious that this is not going to work (except if we intend to produce a quine of infinite length, which we do not).

This is the sort of reasoning which makes some people believe that quines don't exist. The problem is that we need to print something, so we use a character string (say s) to print it, and then we need to print s itself, so we use another character string, and so on…

But wait! If we intend to print s, we don't need another string: we can use s itself. So let's give it another try:

  char *s="#include <stdio.h>\n\nint\nmain (void)\n{\n";
  printf(s);  printf("char *s=\"%s\";\n",s);

Well, it still doesn't work. But we have introduced one of the central ideas in quine-writing lore: whereas it is probably necessary to use some data to represent the code to be printed, on the other hand it is possible to reuse these data to print the data themselves. Here we're still a bit naïve: we're using s “as it stands”, but that won't work because it contains some backslashes; these would need to be further backslashified. So we have two paths before us: the King's way is to proceed with backslashification, which will work because this is a computable process. However, since we are writing in C, we choose a shortcut which uses the nice properties of the printf function:

  char *s1="#include <stdio.h>%c%cint%cmain (void)%c{%c";
  char *s2="  char *s1=%c%s%c;%c  char *s2=%c%s%c;%c";
  char n='\n', q='"';
  printf(s1,n,n,n,n,n);
  printf(s2,q,s1,q,n,q,s2,q,n);

This is a partial quine: it prints the beginning of its own listing (something in no way remarkable, since any program which doesn't print anything is a “partial quine”). Here we have passed the “catching up point”: by this I mean that the program data printed includes the data representation itself. It is then generally trivial to complete the quine (here, things are still a bit tricky because we've been doing things in a more or less ad hoc manner, and some of the data are actually hidden in the printf() statements). Nevertheless, it is not very difficult to finish:

#include <stdio.h>

int
main (void)
{
  char *s1="#include <stdio.h>%c%cint%cmain (void)%c{%c";
  char *s2="  char *s%c=%c%s%c;%c  char *s%c=%c%s%c;%c";
  char *s3="  char n='%cn', q='%c', b='%c%c';%c";
  char *sp="  printf(";
  char *s4="%ss1,n,n,n,n,n);%c";
  char *s5="%ss2,'1',q,s1,q,n,'2',q,s2,q,n);%ss2,'3',q,s3,q,n,'p',q,sp,q,n);%c";
  char *s6="%ss2,'4',q,s4,q,n,'5',q,s5,q,n);%ss2,'6',q,s6,q,n,'7',q,s7,q,n);%c";
  char *s7="%ss2,'8',q,s8,q,n,'9',q,s9,q,n);%ss2,'0',q,s0,q,n,'x',q,sx,q,n);%c";
  char *s8="%ss3,b,q,b,b,n);%ss4,sp,n);%ss5,sp,sp,n);%c";
  char *s9="%ss6,sp,sp,n);%ss7,sp,sp,n);%ss8,sp,sp,sp,n);%c";
  char *s0="%ss9,sp,sp,sp,n);%ss0,sp,sp,n,n,n);%c  return 0;%c}%c";
  char *sx="--- This is an intron. ---";
  char n='\n', q='"', b='\\';
  printf(s1,n,n,n,n,n);
  printf(s2,'1',q,s1,q,n,'2',q,s2,q,n);  printf(s2,'3',q,s3,q,n,'p',q,sp,q,n);
  printf(s2,'4',q,s4,q,n,'5',q,s5,q,n);  printf(s2,'6',q,s6,q,n,'7',q,s7,q,n);
  printf(s2,'8',q,s8,q,n,'9',q,s9,q,n);  printf(s2,'0',q,s0,q,n,'x',q,sx,q,n);
  printf(s3,b,q,b,b,n);  printf(s4,sp,n);  printf(s5,sp,sp,n);
  printf(s6,sp,sp,n);  printf(s7,sp,sp,n);  printf(s8,sp,sp,sp,n);
  printf(s9,sp,sp,sp,n);  printf(s0,sp,sp,n,n,n);
  return 0;
}

Here we have a real quine (if you find it obscure, do not worry, much clearer examples will be given further below). Note the use of the s2 string to print several lines modeled on the same pattern. Also note how the backslash required no special treatment. And note the sx string which goes to show that the classical belief that everything in a quine must be doubled, is false (the meaning of the term “intron”, which comes from molecular biology, will be made clearer below).

This quine is intermediate in elegance: on the one hand it does not assume that the computer is using an ASCII character set (you see a lot of C quines which use the fact that double quotes have ASCII code 34 and that line feed has code 10), it is valid ANSI C (albeit with a warning, since I should have written “const char *” rather than just “char *”; this is still much better than the many quines which omit the return 0 at the end, or similar things), and the longest lines are just 80 characters (often quines have terribly long lines). On the other hand, the formatting is inelegant: don't conclude from the above example that quines need be so badly presented. Also, nothing says you can't have comments within quines. We will give much more elegant examples later.
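For comparison, the “King's way” mentioned earlier in this section (backslashification) is easy to sketch. The following is only an illustration: the name backslashify is hypothetical, and the routine handles just the three characters that matter for the strings used above (backslash, double quote and newline).

```c
#include <stddef.h>

/* Hypothetical helper: write into `out` (of capacity `cap`, terminating
 * NUL included) the text of `in` as it would have to appear between
 * double quotes in C source.  Only backslash, double quote and newline
 * are escaped.  Returns the length written, or (size_t) -1 if the
 * result does not fit. */
size_t
backslashify (const char *in, char *out, size_t cap)
{
  size_t n = 0;

  if ( cap == 0 )
    return (size_t) -1;
  for ( ; *in ; in++ )
    {
      int special = ( *in == '\\' || *in == '"' || *in == '\n' );

      /* Keep one byte in reserve for the terminating NUL. */
      if ( n + ( special ? 2 : 1 ) >= cap )
        return (size_t) -1;
      if ( special )
        out[n++] = '\\';
      out[n++] = ( *in == '\n' ) ? 'n' : *in;
    }
  out[n] = '\0';
  return n;
}
```

Since this is an ordinary computable transformation of the string, a quine can apply it to its own data when printing the data, instead of relying on printf() tricks.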

Principles for writing a quine

The basic idea is this:

It is impossible (in most programming languages) for a program to manipulate itself (i.e. its textual representation — or a representation from which its textual representation can be easily derived) directly.

So to make this possible anyway, we build the program from two parts, one which we call the code and one which we call the data. The data represents (the textual form of) the code, and it is derived in an algorithmic way from it (mostly, by putting quotation marks around it, but sometimes in a slightly more complicated way). The code uses the data to print the code (which is easy because the data represents the code); then it uses the data to print the data (which is possible because the data is obtained by an algorithmic transformation from the code).

This idea is summarized by the sentence “quine ‘quine’”. Here, the verb to quine (invented by Douglas R. Hofstadter) means “to write (a sentence fragment) a first time, and then to write it a second time, but with quotation marks around it” (for example, if we quine “say”, we get “say ‘say’”). Thus, if we quine “quine”, we get “quine ‘quine’”, so that the sentence “quine ‘quine’” is a quine… In this linguistic analogy, the verb “to quine” plays the role of the code, and “quine” in quotation marks plays the role of the data.
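The verb “to quine” is itself a tiny algorithm, and can be sketched in C. The function name quine_phrase is an assumption made for this illustration, and plain ASCII apostrophes stand in for the typographic quotation marks used in the text.

```c
#include <stdio.h>

/* Hypothetical helper: "quine" a sentence fragment, that is, write it a
 * first time as it stands and a second time between quotation marks.
 * The result goes to `out` (capacity `cap`); the return value is that
 * of snprintf, i.e. the length of the full result. */
int
quine_phrase (const char *fragment, char *out, size_t cap)
{
  /* The same input is used twice: unquoted (the "code" role) and
   * quoted (the "data" role). */
  return snprintf (out, cap, "%s '%s'", fragment, fragment);
}
```

Calling quine_phrase on “say” gives “say 'say'”, and calling it on “quine” gives “quine 'quine'”, the sentence discussed above.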

We will henceforth use the words “code” and “data” a lot, to designate the code and data parts of the quine as just explained.

If we are to take an analogy with cellular biology (thanks to Douglas Hofstadter again), what I have called the “code” would be the cell, and the “data” would be the cell's DNA: the cell is able to create a new cell using the DNA, and this involves, among other things, replicating the DNA itself. So the DNA (the data) contains all the necessary information for the replication, but without the cell (the code), or at least some other code to make the data live, it is a useless, inert, piece of data.

Note how the data may contain (depending on how it's interpreted) bits that aren't used to write the code, but are still copied when the data is written on the output. Such bits are called introns, in analogy with the parts of the genome which aren't used to produce proteins. The example we gave above had an intron (the string sx), clearly marked as such. Quite obviously an intron can be modified with great ease; it is a kind of subliminal information that is reproduced with the quine, although it is not necessary to the quine. The possible existence of introns will be the key feature making multi-quines (something we will talk about later) possible.

One word of warning: this code/data distinction in quines is pleasant and often helpful. It is not, however, completely valid in all circumstances. Sometimes the code and the data are not well distinguished, sometimes part of the code plays a data role, or vice versa. Some quines are far beyond my own modest understanding, and beyond my feeble attempts at classification and order. As in all things, caveat emptor. See this remark later in the text, however.

A second example: added clarity

We now use the principles outlined above to construct another quine, one which will be more elegant in its formatting (but a bit less portable because we will assume an ASCII coding of characters).

This time, we gather all the data in one place, one array containing the ASCII values of the characters making up the code, and we place this array at the beginning of the program. The code will use the array to first print the array (by printing it as a list of hexadecimal integers with a proper formatting) and then print the code (by converting the ASCII values to characters).

This is completely straightforward, and while this quine is far from the shortest, I think it is the clearest I have ever seen:

/* See comments below */

const unsigned char data[] = {
/* 000000 */  0x2f,  0x2a,  0x20,  0x54,  0x68,  0x69,  0x73,  0x20,
/* 0x0008 */  0x69,  0x73,  0x20,  0x61,  0x20,  0x73,  0x65,  0x6c,
/* 0x0010 */  0x66,  0x72,  0x65,  0x70,  0x20,  0x28,  0x71,  0x75,
/* 0x0018 */  0x69,  0x6e,  0x65,  0x29,  0x20,  0x70,  0x72,  0x6f,
/* Several lines snipped.  See the original file for a complete listing. */
/* 0x02c0 */  0x20,  0x28,  0x64,  0x61,  0x74,  0x61,  0x5b,  0x69,
/* 0x02c8 */  0x5d,  0x29,  0x3b,  0x0a,  0x20,  0x20,  0x72,  0x65,
/* 0x02d0 */  0x74,  0x75,  0x72,  0x6e,  0x20,  0x30,  0x3b,  0x0a,
/* 0x02d8 */  0x7d,  0x0a,
};

/* This is a selfrep (quine) program.  It uses the above data (which
 * is no other than the ASCII representation of everything starting
 * from this comment) to print its own listing. */

#include <stdio.h>

int
main (void)
     /* The main program.  We output the data in the format used at
      * the top of this file, and then we use it to generate the rest
      * of this file. */
{
  unsigned int i;

  printf ("/* See comments below */\n\n");
  printf ("const unsigned char data[] = {");
  for ( i=0 ; i<sizeof(data) ; i++ )
    {
      if ( i%8 == 0 )
	printf ("\n/* %0#6x */",i);
      printf ("  %0#4x,", data[i]);
    }
  printf ("\n};\n\n");
  for ( i=0 ; i<sizeof(data) ; i++ )
    putchar (data[i]);
  return 0;
}

This should make it obvious that there is nothing difficult at all in writing quines. In fact this is the sort of quine we obtain by directly applying the fixed-point theorem. As mentioned, the code contains two parts: that which copies the data (the nine lines following the blank one in the main() function) and that which uses the data to copy the code (the next two lines).

Naturally, the coding of the data might be much more complex than a straightforward ASCII encoding. We will return to that subject. Also note that here there are no introns, because ASCII does not permit this (there are no comments or any such things). However, we could trivially add an intron: create a new array, const unsigned char intron[], say, put whatever data we want in it, and use the same printing routines for intron[] as we did for data[] (of course, we need to modify the code, hence the data also, to do this, but once it is done, we can put anything in the intron without modifying anything).

Another point is to be noted: in what precedes I have omitted a great many lines from the data. Had I not given a pointer to the original file, could you have reconstructed the data? Evidently, yes, and without much difficulty: just take the code, take the ASCII value of each character, and tabulate them. This violates the so-called Central Dogma, stating that the data must be used (by the code) to deduce (i.e. to print) the code, but not the converse. In practice, though, there is nothing wrong with violating the Central Dogma: in fact, you can guess that I wrote the program by first writing the code and then calculating the data. However, introns cannot be reconstructed in that way (since the very point about introns is that all possible data will work).
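Going from the code to the data in this way is purely mechanical. Here is a minimal sketch of such a tabulating routine, reusing the layout and the printf formats of the listing above; the function name tabulate and its signature are assumptions made for this illustration, not something taken from the original file.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: tabulate the `len` bytes of `code` in the layout
 * used at the top of the quine (eight values per line, each line
 * preceded by a comment giving the offset), writing the result to
 * `out`. */
void
tabulate (const unsigned char *code, size_t len, FILE *out)
{
  size_t i;

  fprintf (out, "const unsigned char data[] = {");
  for ( i=0 ; i<len ; i++ )
    {
      if ( i%8 == 0 )
        fprintf (out, "\n/* %0#6x */", (unsigned int) i);
      fprintf (out, "  %0#4x,", (unsigned int) code[i]);
    }
  fprintf (out, "\n};\n\n");
}
```

Note that the # flag adds no 0x prefix to a zero value, which is why the very first offset comes out as 000000 rather than 0x0000.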

What if a part of the code had been missing? Then things are much better off. For example, if the comments had been gobbled, running the program itself would have restored them (from their value encoded in the data). Even without any code at all, you would probably have guessed that the data was the ASCII representation of something and been able to restore the something in question. But see the section on bootstrapping for more about this.

The fixed-point theorem

I have mentioned the fixed-point theorem and stated that it is at the heart of the existence of quines. I will now explain what this theorem states.

(Note that this is just one of very many fixed-point theorems abounding in mathematics. This has nothing to do, for example, with Brouwer's fixed-point theorem. I don't know that any specific name is attached to this one, but I suspect it would be something like Kleene's fixed-point theorem.)

I assume no familiarity with the theory of computability. However, it will help: if you are not familiar with it, what I am going to say may sound a bit vague (but read it anyway, because you probably will grasp the idea even if the details are obscure).

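Before I can state (and prove) the fixed-point theorem, I will recall some basic notions:

Programs are identified with their index (think of the listing, read as one big integer). We write φ_n for the partial function computed by the program n, so that φ_n(x, …) is the result, if any, of running n on the input x, ….

The s-m-n theorem states that there is a computable total function s such that φ_{s(m,x)}(…) = φ_m(x, …) for all m and x: in other words, s(m,x) is the program m with its first argument fixed to the value x.

The universality theorem states that the function taking n and x, … to φ_n(x, …) is itself computable: in other words, there is a program (an interpreter) which runs any program it is given.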

Using the s-m-n theorem and the universality theorem we can prove the fixed-point theorem. This states that for any computable total function h there exists an index (a program) n such that φ_n(…) = φ_{h(n)}(…).

In plain English, this means that if you have any algorithmic transformation h on programs then there exists a program n such that the program n does the same thing as the program h(n) resulting from the transformation. We will explain this with further examples in a second, but first we prove the statement.

For a given program t, we consider the program s(t,t) (given by the s-m-n theorem). Essentially, s(t,t) performs what t does when it is fed itself as input. We further consider the program h(s(t,t)) which results from the transformation h applied to s(t,t). Now by the universality theorem, there exists an index m such that φ_m(t, …) = φ_{h(s(t,t))}(…). In other words, there is a program m which takes a program t as input, and performs what the program h(s(t,t)) does. Then I claim that the program n = s(m,m) is the desired fixed point. Indeed, φ_n(…) = φ_{s(m,m)}(…). But by definition of s, this is φ_m(m, …), which in turn, by definition of m, is φ_{h(s(m,m))}(…) = φ_{h(n)}(…), quod erat demonstrandum.

To summarize the proof, we have taken the program m which, given a program t, interprets the program resulting from applying the given transformation h to t acting on itself, and we have applied that program to itself.
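Written out with explicit subscripts, the whole proof is a short chain of equalities. The following merely restates the argument above in LaTeX notation, with \varphi_n standing for the function computed by the program n and s(m,x) for the program m with its first argument fixed to x:

```latex
\begin{align*}
\varphi_m(t,\ldots) &= \varphi_{h(s(t,t))}(\ldots)
    && \text{(choice of $m$, valid for every $t$)} \\
n &= s(m,m)
    && \text{(definition of $n$)} \\
\varphi_n(\ldots) &= \varphi_{s(m,m)}(\ldots) \\
    &= \varphi_m(m,\ldots)
    && \text{(s-m-n theorem)} \\
    &= \varphi_{h(s(m,m))}(\ldots)
    && \text{(choice of $m$, with $t=m$)} \\
    &= \varphi_{h(n)}(\ldots)
\end{align*}
```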

How does the fixed-point theorem prove the existence of quines? This is very simple: for a given program t, consider the program h(t) that prints the listing of t. Obviously this h is computable. Now the fixed-point theorem tells us that there is a program n such that h(n) and n do the same thing, i.e. printing the listing of n. So n prints the listing of n.
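To see that this h is indeed computable, here is one possible way of writing it down, as a sketch only. The function name make_printer is hypothetical, and the choice of emitting one putchar() call per character (given by its numeric code) is made simply to avoid any quoting issue; it is not how the quines on this page proceed.

```c
#include <stdio.h>

/* Hypothetical sketch of the transformation h: given the listing `t`,
 * write to `out` the listing of a C program which prints `t`.  The
 * generated program contains one putchar() call per character of `t`,
 * using the character's numeric code in the current character set. */
void
make_printer (const char *t, FILE *out)
{
  fprintf (out, "#include <stdio.h>\n\nint\nmain (void)\n{\n");
  for ( ; *t ; t++ )
    fprintf (out, "  putchar (%d);\n", (int) (unsigned char) *t);
  fprintf (out, "  return 0;\n}\n");
}
```

The program produced for a listing t is generally much longer than t, and is of course not a quine: the fixed-point theorem is what guarantees a listing n for which n and h(n) behave identically.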

In practice, how do we construct n? Well, the proof of the fixed-point theorem answers this question as well. Since the proof used the universality theorem, it may seem like we need to construct an interpreter to apply the theorem. In fact, we need not: indeed, if you look closely at the proof, you will see that we used the universality theorem only for programs of the form h(…), so that we need only construct an interpreter for those programs; for our particular choice of h, this is trivial.

So consider a program t, taking an argument. We will assume that this argument is given as a variable data to be linked with the program. Then s(t,t) is the program obtained by setting this variable data to the textual value of the program (as a string, say, or as whatever coding we have chosen). Our program m takes an argument t (in the form of the data variable) and performs what h(s(t,t)) does, i.e. it prints the listing of s(t,t), which is none other than the listing of t with a definition of the variable data to be the text of t. And finally for our program n we take s(m,m), that is, we take this program m and link the variable data to be the text of the program. Quite evidently, this is precisely what we have been doing.

The fixed-point theorem has other amusing applications. Essentially, its intuitive (and effective) content is that a program may use its own source as a variable, i.e. adding to a programming language the ability for a program to manipulate itself (its source code) does not add to its expressive power. So there exists a program that compresses its own listing; there exists one which prints its own MD5 checksum (this is much easier than finding a program — indeed any file — that contains its MD5 checksum; still, someone I know thought it was impossible except by brute force — how rude — so I wrote such a program and won a bet like that); there exists a program that prints a second, different, program, that prints the first one again (here, h(t) would merely be a program that prints a lot of print calls for the various lines of t's listing); and so on.

(A passing note, which you may find a bit difficult to understand if you're not used to computability theory.) A different, perhaps more satisfactory, way of stating the fixed-point theorem would be to eliminate the universality theorem from it, and to say: for every computable function k there exists an n such that φ_n(…) = k(n, …). This corresponds more precisely to the intuitive content we have described. It is proved without the use of the universality theorem, using only the s-m-n theorem (for the actual proof, take the proof we have just given, and replace φ_{h(x)}(…) by k(x, …) everywhere). The advantage of formulating things like this is that we see that it also works for primitive recursive functions (which satisfy s-m-n but not universality), so in effect a primitive recursive function can also make use of its own number. By applying the universality theorem (the function φ_{h(x)}(…) is computable, so we can call it k(x, …)) we recover the fixed-point theorem as we have stated it. The examples we have given of the fixed-point theorem actually use the more restrictive (non-universal) version we have just stated. The following examples will use universality (and don't work for primitive recursive functions, which is clear because primitive recursive functions always terminate).

There also exists a program that interprets its own listing: we will return to this. Also, if we take for h the function which to a program x associates the program which calculates what x does, and, at the end (provided x terminates, of course) adds 1, we would have a program x which does the same thing as running x and adding 1 to the result, and that is only possible if x does not terminate, so that the fixed-point theorem also proves the existence of an endless loop.

Exercise: Louis Reasoner believes that the fixed-point theorem proves the existence of polyglot programs (i.e. programs that are valid and do the same thing in several different programming languages). His argument is this: for a given program t (in a first programming language) consider a translation of t into a second programming language, and interpret this program literally in the first language, giving h(t). By the fixed-point theorem, there exists n such that h(n) and n have the same effect, i.e. the text of the program h(n) has the same effect in the first language (that is h(n)) and in the second (that is n). What do you think of this argument?

Answer to the exercise (in rot13): Ybhvf vf rffragvnyyl pbeerpg, ohg gurer vf abguvat cebsbhaq urer. Gurer vf n uvqqra nffhzcgvba, anzryl gung gur frpbaq ynathntr vf noyr gb vagrecerg nal cebtenz gung vg vf srq: gurer vf ab jnl gb erfgevpg gb inyvq cebtenzf (naq pregnvayl vs gur svefg ynathntr npprcgf bayl cebtenzf ortvaavat jvgu na N naq gur frpbaq ynathntr bayl cebtenzf ortvaavat jvgu n O, jr jbhyq unir n uneq gvzr svaqvat n cbyltybg). Abgvpr gung gur frpbaq ynathntr qbrfa'g rira unir gb or Ghevat-pbzcyrgr. Fhowrpg gb gur vagrecergngvba tvira nobir bs gur svkrq-cbvag gurberz, jung Ybhvf' nethzrag nzbhagf gb vf guvf: gur cebtenz (jevggra va gur svefg ynathntr) jvyy eha na vagrecergre bs gur frpbaq ynathntr ba vgf bja fbhepr pbqr (fbzrguvat jr pna qb gunaxf gb gur svkrq-cbvag gurberz); abj rivqragyl fhpu n cebtenz qbrf gur fnzr guvat va obgu ynathntrf, anzryl vagrecerg gur fbhepr pbqr va gur frpbaq ynathntr. Guvf vf abg irel hfrshy sbe pbafgehpgvat n P/Crey cbyltybg sbe rknzcyr!

The fixed-point theorem gives a different point of view on quines from the one we have given so far. The ideas we have already expressed, notably the code/data dichotomy, are perhaps not very clearly apparent. Still, they are present: we should consider the s function from the s-m-n theorem as a means of adding data to a program (which would otherwise receive this data as an input), so the expression s(m,m) which we have seen says, in effect, add to the program m (the code) a representation of the program m itself (the data). Introns can exist because the function s is free to add extra data to the data required of it, if it wants.

Multi-quines: making use of introns

We start by saying what a bi-quine (or more generally a multi-quine) is. To begin, here is what it is not: a bi-quine is not a program which prints a second program, which in turn prints the first again (actually, it is that, but things are a bit more subtle). This is too easy to do (we have proved the existence of such using the fixed-point theorem): one program is almost a quine, and the other is merely a sequence of calls to print the code of the other one.

A multi-quine is also not a polyglot quine (a quine that can be read, and is a quine, in several different languages). True, polyglot quines actually are multi-quines if you think about it carefully (the converse is not true), but polyglot quines don't exist for every combination of programming languages (although it is true that some people have been incredibly smart at constructing them) whereas multi-quines do: polyglot quines are a hack whereas multi-quines are a general phenomenon.

A bi-quine is a very interesting kind of program: when run normally, it is a quine. But if it is called with a particular command line argument, it will print a different program, its “brother”. Its brother is also a quine, but in a different programming language, so its brother prints its own listing when run normally. But when run with a particular command line argument, the brother prints the listing of the original program. So in effect, a bi-quine is a set of two programs each of which is able to print either of the two. More generally, a multi-quine is a set of r different programs (in r different languages — without this condition we could take them all equal to a single quine), each of which is able to print any of the r programs (including itself) according to the command line argument it is passed. (Note that cheating is not allowed: the command line arguments must not be too long — passing the full text of a program is considered cheating ;-).

There are several ways to prove the existence of multi-quines using fixed-point theorems. Here is one (we leave it to the reader to fill in the missing details). We just consider the case of a bi-quine, i.e. r=2. We consider, in language 1, a program of two parameters that will normally print the first, but that will print the second if a special argument is passed to it. By the fixed-point theorem, we can assume that the first text is its own listing, so that we get a program of one parameter that will print its own listing except that it will print the parameter if called with a special argument. Do the same for language 2. We now have two programs. Substitute one in the other: there is a program, of one parameter, in language 1, that will print its own listing, except when it is called with a special argument, in which case it will print a program, in language 2, which prints its own listing except when it is called with a special argument, in which case it will print the initial parameter (passed to the first program). Finally, apply the fixed-point theorem to that. Voilà, we have the bi-quine.

So, to create multi-quines, we make use of introns (following, essentially, the proof given just above). We have r programs, so r code sets (one in each language); besides, each of the r programs has, in addition to its code set, r data sets, one representing each of the r code sets (so r-1 of the data sets are introns as far as the quine structure goes) in a given coding (in principle it would be possible for each of the r² data sets to use a different coding, but there is no reason to use a different coding for various data sets in the same program, and even between programs it is reasonable to use more or less similar codings, at least insofar as the programming languages allow this). When program i (running code set i in language i) is asked to produce the listing of program j, it will use its j-th data set to produce the j-th code set, and then it will use all of its r data sets to produce the r data sets of program j (coded in the same or in a similar way).
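The bookkeeping just described can be sketched as follows. This is a toy model under loud assumptions: the names struct dataset and emit_program are hypothetical, the data sets are written first and the code last (as in the second example on this page), and the “dataK = hex bytes” lines are a made-up neutral coding, whereas a real multi-quine must write the data sets in the syntax of language j.

```c
#include <stdio.h>
#include <stddef.h>

/* One data set: the bytes of one of the r code sets. */
struct dataset
{
  const unsigned char *bytes;
  size_t len;
};

/* Toy sketch: write to `out` the listing of program number j, given the
 * r data sets which every program of the multi-quine carries.  Returns
 * 0 on success, -1 (writing nothing) if j is out of range. */
int
emit_program (const struct dataset *sets, size_t r, size_t j, FILE *out)
{
  size_t k, i;

  if ( j >= r )
    return -1;
  /* All r data sets are copied, whichever program was asked for. */
  for ( k=0 ; k<r ; k++ )
    {
      fprintf (out, "data%u =", (unsigned int) k);
      for ( i=0 ; i<sets[k].len ; i++ )
        fprintf (out, " %02x", (unsigned int) sets[k].bytes[i]);
      fprintf (out, "\n");
    }
  /* Only the j-th data set is decoded to produce the code: the other
   * r-1 data sets are introns as far as program j is concerned. */
  fwrite (sets[j].bytes, 1, sets[j].len, out);
  return 0;
}
```

In an actual bi-quine, the value of j would be chosen from the command line argument, and each of the two programs would contain its own version of this routine, written in its own language.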

In practice, we write a quine program similar, say, to the second example we have given on this page, to which we add an intron. Using this intron, the quine is able, when passed a particular parameter, to produce a representation (valid in the second programming language) of the two data sets (the actual data of the quine and the intron) followed by some data specified by the intron. Then we do the same in the other programming language, with the data representation we have elected to produce (and the second program, when passed the special argument, must produce the data representation we have used in the first program). Finally, we synchronize the introns: we use the intron of the first program to represent the code of the second program and the intron of the second to represent the code of the first. (Remember, the nice thing about introns is that we can change them after the quine has been written, without removing its quinishness.)

If you would feel more comfortable with an example, I have written a C/Perl bi-quine. (For fun, I only give out the C version: if you want the Perl version you will have to run the program with the magic word as argument.) In the C version, c_data is the main data set and perl_data is an intron; in the Perl version, of course, things are reversed. (The coding is not quite the same, also, although both are hexadecimal.)

Bootstrapping: recovering the code from the data

As we have already explained and illustrated, a quine is basically a bunch of data, plus an active part, the code, which reads the data twice: once to reproduce the data, and once to reproduce the code; the data represents the code, and the code interprets that representation and recovers the code. There are two parts in the code: that which uses the data to copy the data and that which uses the data to copy the code.

Now what if we are given only the data part of the quine? In the analogy I have given with cellular biology, this is the equivalent of having the DNA (the genome is what I have been calling the “data”… ugh) and wanting to reconstruct a cell.

Well, it is a matter of how difficult the coding (another word to beware) is. If I give you the following quine fragment (the data part):

const char data [] =
"#include <stdio.h>\n\nint\nmain (void)\n{\n  unsigned int i;\n\n  p"
"rintf (\"const char data [] =\");\n  for ( i=0 ; data[i] ; i++ "
")\n    {\n      if ( i%60 == 0 )\n\tprintf (\"\\n\\\"\");\n      switc"
"h ( data[i] )\n\t{\n\tcase '\\\\':\n\tcase '\"':\n\t  printf (\"\\\\%c\", d"
"ata[i]);\n\t  break;\n\tcase '\\n':\n\t  printf (\"\\\\n\");\n\t  break;\n"
"\tcase '\\t':\n\t  printf (\"\\\\t\");\n\t  break;\n\tdefault:\n\t  printf"
" (\"%c\", data[i]);\n\t}\n      if ( i%60 == 59 || !data[i+1] )\n\t"
"printf (\"\\\"\");\n    }\n  printf (\";\\n\\n\");\n  for ( i=0 ; data["
"i] ; i++ )\n    putchar (data[i]);\n  return 0;\n}\n";

you probably won't have much trouble recovering the complete quine. This is because the representation chosen here is completely trivial. We can proceed as follows: just run the tiny instruction printf ("%s", data); on the above data and you get the code; put the code and the data together, and you get a first program which is almost the quine (it may differ in inessential factors, for example if you put the data after the code rather than before); but this program will produce the original quine when run. This process is called bootstrapping, and it is similar to the process of bootstrapping, say, a C compiler (you start with an initial C compiler, which may be much simpler, much less featureful, or much less efficient, than the C compiler you want to build, and you run it on the sources of the desired C compiler, giving a first binary C compiler, which you use a second time to recompile its own sources).

The possibility of bootstrapping means that to some extent quines are self-healing: if the code is damaged but still able to use the data to recover the original code, bootstrapping can be performed.

However, nothing says a quine must use a simple coding like ASCII. I have written a quine that stores, in its data, a compressed (gzipped) representation of the code. This means that whereas the code that uses the data to produce the data is trivial (it is the same as that used in our previous example), on the other hand the code that uses the data to produce the code is much more involved, because it must actually uncompress the data. (The gzip format is very strange and very unpleasant to uncompress. I have written a set of routines to decode it, which are included in the quine of course, and which I put in the public domain if they can be useful to anyone.) Here, the gzip program (plus a bit of interpreting the data as binary) could serve to bootstrap.

Similarly, if I give you the following piece of data:

const char data [] =
"#vapyhqr <fgqvb.u>\n\nvag\nznva (ibvq)\n{\n  hafvtarq vag v;\n\n  c"
"evags (\"pbafg pune qngn [] =\");\n  sbe ( v=0 ; qngn[v] ; v++ "
")\n    {\n      vs ( v%60 == 0 )\n\tcevags (\"\\a\\\"\");\n      fjvgp"
"u ( qngn[v] )\n\t{\n\tpnfr '\\\\':\n\tpnfr '\"':\n\t  cevags (\"\\\\%p\", q"
"ngn[v]);\n\t  oernx;\n\tpnfr '\\a':\n\t  cevags (\"\\\\a\");\n\t  oernx;\n"
"\tpnfr '\\g':\n\t  cevags (\"\\\\g\");\n\t  oernx;\n\tqrsnhyg:\n\t  cevags"
" (\"%p\", qngn[v]);\n\t}\n      vs ( v%60 == 59 || !qngn[v+1] )\n\t"
"cevags (\"\\\"\");\n    }\n  cevags (\";\\a\\a\");\n  sbe ( v=0 ; qngn["
"v] ; v++ )\n    {\n      vs ( ( qngn[v] >= 'N' && qngn[v] < 'A"
"' )\n\t   || ( qngn[v] >= 'n' && qngn[v] < 'a' ) )\n\tchgpune (q"
"ngn[v] + 13);\n      ryfr vs ( ( qngn[v] >= 'A' && qngn[v] <="
" 'M' )\n\t\t|| ( qngn[v] >= 'a' && qngn[v] <= 'm' ) )\n\tchgpune "
"(qngn[v] - 13);\n      ryfr\n\tchgpune (qngn[v]);\n    }\n  erghe"
"a 0;\n}\n";

you will have no trouble recovering the original program if you have a little bit of geek culture, but you probably get my point anyway.
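For readers who would rather not guess: the coding used in this fragment is rot13 (each letter is shifted thirteen places in the alphabet, everything else is left alone). A minimal decoder might look like the sketch below; the function names rot13_char and rot13_string are my own for this illustration, and ASCII is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Rot13 of a single character (ASCII assumed).  Letters in the first
 * half of the alphabet move forward by 13, letters in the second half
 * move back by 13, and every other character is returned unchanged. */
int
rot13_char (int c)
{
  if ( ( c >= 'A' && c < 'N' ) || ( c >= 'a' && c < 'n' ) )
    return c + 13;
  else if ( ( c >= 'N' && c <= 'Z' ) || ( c >= 'n' && c <= 'z' ) )
    return c - 13;
  else
    return c;
}

/* Apply rot13 in place to a NUL-terminated string.  Since rot13 is its
 * own inverse, the same function both encodes and decodes. */
void
rot13_string (char *s)
{
  assert (s != NULL);
  for ( ; *s ; s++ )
    *s = (char) rot13_char ((unsigned char) *s);
}
```

Running rot13_string over the unescaped contents of the data should give back the C source of the program, which itself contains a loop doing this same substitution.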

In fact, let us take an extreme example: I have written a quine that stores its code enciphered with the Blowfish cryptographic algorithm (by Bruce Schneier) in its data. Of course, the key is part of the code (without the key, the data is useless). Moreover, I have added an intron to the program, which is encrypted with the same key. When the program is run with the magic word as argument, it deciphers (and prints) the intron rather than printing its own listing. This has an amusing consequence: if the key is removed from the listing, then practically nothing is missing from the code, and yet it is impossible to bootstrap; even though we have most of the plain code, the complete ciphered data and secret, we can't do much with it because all is locked by a key (and Blowfish is not known to be vulnerable to a known-plaintext attack). In fact, the situation is even more ironic than that since the key is present in the encrypted data: we are, essentially, in the situation of someone locked outside his home with the key inside.

(Note that in writing this quine I have implemented the Blowfish encryption and decryption algorithm; in fact, the quine contains the full functions, far more than are necessary for what it does. I put these functions in the public domain: you can find them here without the quine part. Although I am using this just for fun, it is nevertheless strong crypto, so be careful about your local crypto laws.)

A point might be made here about the distinction between code and data: here I claim that the key is part of the code and not the data. The difference is not so much in how the key is used as in how it is stored. In fact, if the key is in the code (as in my quine) the program's skeleton is basically this:

/* See comments below */

const unsigned char data[] = {
/* Lots of encrypted data corresponding to everything starting
 * from the next comment. */
};

/* Code starts here */

/* Decryption routines omitted. */

const char key[] = "Foobar";

int
main (void)
{
  printf ("/* See comments below */\n\n");
  printf ("const unsigned char data[] = {\n");
  pretty_hexadecimal_printout (data);
  printf ("};\n\n");
  decipher (key, data);
  return 0;
}

and as explained, if the key is removed, it is “locked inside the house”. However, if we had some magical way of deciphering Blowfish, we could recover the key (even if our magical method did not let us do this a priori) because it is part of the code, so it is stored among the encrypted data. On the other hand, if the key is data, the program looks like this:

/* See comments below */

const char key[] = "Foobar";

const unsigned char data[] = {
/* Lots of encrypted data corresponding to everything starting
 * from the next comment. */
};

/* Code starts here */

/* Decryption routines omitted. */

int
main (void)
{
  printf ("/* See comments below */\n\n");
  printf ("const char key[] = \"%s\";\n\n", key);
  printf ("const unsigned char data[] = {\n");
  pretty_hexadecimal_printout (data);
  printf ("};\n\n");
  decipher (key, data);
  return 0;
}

This may not appear very different, but it is. This time, there isn't a copy of the key “inside the house”. The key is part of the data; it is the only part of the data that is stored in clear. I think there is something to this idea of distinguishing the “code” and “data” parts of a quine not by what they are used for but by how they are printed.

While it is true that some parts of the code can be recovered by a bootstrapping process, on the other hand, the data can never be recovered in that way. Any part of a quine which, if it is modified, does not change the program output (meaning that the program output is still the original quine), is not data, it is code. (This applies, for example, to the comments inside the data section of the program.) (Well, all right, I guess there is room for discussion.)

However, the data contains parts of a different nature: when they are modified, the output produced by the program is modified, but it remains a quine. Those are the introns we have already discussed at length. In a way, introns represent the exact opposite of the principle of bootstrapping: in the case of bootstrapping, we hope that after a certain number of iterations we will hit the original program again; but if we modify an intron, the program remains a quine, so it will not “heal” itself, it will just remain in its modified form.

Recapitulation

I have been introducing a great many names and concepts. I will summarize them here.

There are analogies with compilers (or interpreters) of course. An intron within a compiler would be something that cannot be bootstrapped, essentially because the compiler (or interpreter) merely copies the behavior of the underlying system (compiler) to itself. This is what Ken Thompson explains (he gives the example of \v in C) in his Turing Award speech quoted in the links section below. Irrelevant code differences are differences between two compilers which perform the same task (i.e. output the same binaries) but in a different way (their binaries are different), for example the same compiler compiled with two different compilers; then we can do a bootstrapping, i.e. recompile the compiler and obtain the “fixed-point” version.

Self-interpretation: using data as code

In this section I must give my examples in Scheme rather than in C because Scheme permits the manipulation of programs (meta-expressions) as data (symbolic expressions).

Consider the two following elegant Scheme quine programs. First this one:

(define (line-write x) (write x) (newline))
(define (d l) (map line-write l))
(define (mid) (display "(do '(") (newline))
(define (end) (display "))") (newline))
(define (do l) (d l) (mid) (d l) (end))
(do '(
(define (line-write x) (write x) (newline))
(define (d l) (map line-write l))
(define (mid) (display "(do '(") (newline))
(define (end) (display "))") (newline))
(define (do l) (d l) (mid) (d l) (end))
))

and second this one

(define x '(
(display "(define x '(")
(newline)
(map (lambda (s) (write s) (newline)) x)
(display "))")
(newline)
(display "(map eval x)")
(newline)
))
(map eval x)

The first one is easy enough to understand, and follows the usual pattern well: the five lines ending with the second-to-last are the “data” (as well as the two character strings, I suppose), and the rest is the “code”. The code (the do function essentially) uses the data (the l variable essentially) to print the code (the first (d l)) and then print the data (the second (d l)).

But the second example is a bit strange: evidently the x variable (the lines from the second to the eighth) is data. The code, essentially, is limited to the single instruction (map eval x). If you are unfamiliar with Scheme, this means: “consider x as a list of Scheme instructions and execute them”. So what we are doing here is using the data, in effect, as code. This is curious because the whole point of a quine, really, is to use code as data, and here we are using data as code. But in a way it makes sense if you consider x to be written in a programming language which is just like Scheme except that the code can be accessed as data… through the variable x! Then x's rôle is to print x itself plus the “interpreter” ((map eval x)).

I have also written a quine in Bourne shell along the same principles. It is rather subtle to understand, but I think it is worth the trouble. If you prefer the “dc” programming language, then compare this quine (along the lines of the first Scheme program above, i.e. the “normal” lines) and that one (which also uses the data-as-code principle and is shorter).

I'm not entirely sure whether this way of writing quines is actually qualitatively different from the “normal” way. (For example, do they correspond to a different proof of the fixed-point theorem, perhaps one that uses the universality theorem one more time? I can manufacture such a proof but it is not really convincing.) It is true that if we compare the two Scheme programs, or the two dc programs, given above, there seems to be an important difference (namely, that there is much more redundancy in the first than in the second). But maybe that is just a naïve way of thinking. Still, I can't help but think there is some relation with the two ways of writing the Curry Y (fixed-point) combinator: as λf.((λx.(f(xx)))(λx.(f(xx)))) or as λf.((λx.(xx))(λx.(f(xx)))). But maybe I've gone totally off my rocker there.
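To see how close the two forms of the combinator are, here is a short reduction, written with the abbreviation W for λx.(f(xx)); the labels Y₁ and Y₂ for the two forms are introduced only for this illustration.

```latex
\begin{align*}
W      &\equiv \lambda x.\,f\,(x\,x) \\
Y_1\,f &= W\,W \;\to_\beta\; f\,(W\,W) \\
Y_2\,f &= (\lambda x.\,x\,x)\,W \;\to_\beta\; W\,W \;\to_\beta\; f\,(W\,W)
\end{align*}
```

So the second form reaches the first after a single β-step: it writes W only once and lets the self-application λx.(xx) do the duplication. This is at least suggestive of the Scheme pair above, where the first program writes its five lines twice and the second writes them once and hands them to eval, though I would not claim the correspondence is exact.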

To conclude this section, I'd like to mention one program I wrote that I'm particularly fond of. It is not a quine and it is in no way as impressive; but in fact it was considerably more difficult to write than a quine. It consists of a (rather minimal) Scheme interpreter, written in Scheme. And that interpreter is applied to itself acting upon itself. So it is a Scheme interpreter trying to interpret a Scheme interpreter interpreting a Scheme interpreter interpreting… well, you get the picture. As each interpreter prints some debugging information about the program it is interpreting, this leads to a lot of output data (with curious properties; for example, search for the string “Now starting evaluation…” without quotes around it, and see how it becomes logarithmically rarer and rarer). If you have read the cryptic comment I made a while back on the use of the universality theorem in the fixed-point theorem, this is a case where we need the universality theorem, and indeed, it is the central part of our program (writing an interpreter). You should also note the analogy with Gödel's theorem, because this self-interpreting program is much closer to Gödel's theorem than ordinary quines. Naturally, if we allow the use of the eval function (but that's cheating), we can rewrite my program in a much simpler way:

((lambda (expr) (eval `(,expr ,expr)))
 '(lambda (expr) (eval `(,expr ,expr))))

(a cute endless loop).

Conclusion

Well, I've written much more than I intended to. I wanted to make this a small page on The Art Of Quine Programming, and it turned out to be quine (oh, what a strange slip! I meant “quite” of course) a monument.

I haven't given enormously many examples, but I hope the examples I've given were clear enough so that, if you didn't know how to write quines initially, now you do (even if you didn't understand all that's on this page). If you want more examples, have a look at my personal quines collection (all written by yours truly), which you can also access by FTP, or download as a single tarball. Also look at some of the links below, where a great many more quines can be found.

Yow! I've just lost the SOURCE CODE for all my QUINE PROGRAMS! What will I DO NOW with just the BINARIES?