SIMPLE Documentation

I got the name from a quote from the fortune program (``THE LESSER-KNOWN PROGRAMMING LANGUAGES #10: SIMPLE'') that describes a programming language (the ``Sheer Idiot's Monopurpose Programming Language Environment'') whose only possible statements are BEGIN, END and STOP. Since the first version of simple had exactly two commands, namely @void@ and @id@, I felt the name was rather appropriate. Since then, simple has grown quite immensely, and the name is now really more ironic than anything else.

Perhaps it should be called COMPLEX - Complex is an Oversized Macro Processing Language Element with Xtras.

What is SIMPLE?

As its name indicates, SIMPLE is a macro preprocessor. It can be described by the following formula: one third of m4, one third of Scheme, one third of sgml, and one large third of psychopathic mania.

Note that SIMPLE is the name of the language, and simple is the name of the program that implements this language. Since simple is the only existing SIMPLE implementation, confusion between the two - as is likely to happen even in this document - is not too important.

What is this document?

This document is the complete documentation of the simple program. It is at once the user's manual of the simple program, the reference manual of the SIMPLE language, and the specification (sort of).

This document, however, doesn't tell you how to compile simple or how to run it. For this, you must consult the installation instructions and the man page.

The version of simple described in this document is version 2.2.0.

What is SIMPLE's goal?

SIMPLE's reason for existence is this: I got fed up with the existence of many different document formats (among which TeX, LaTeX, HTML, troff), and I wanted a universal format that would convert to all of those. Besides, TeX is pretty damn ugly, so I wanted a format that would abstract TeX away and let me type something more elegant.

The texinfo format is a similar type of universal format. It is, however, mainly restricted to writing documentation. I wanted a format that would fit all sorts of uses. I imagined calling this format STeX, and having it look like HTML.

simple is what I actually wrote. It is the preprocessor used to convert STeX to any other format, by defining the STeX tags as SIMPLE macros.

What is a macro preprocessor exactly?

A macro preprocessor is a dedicated programming language, such as m4 or cpp, that looks partly like a programming language and partly like ``cat''.

While it is not possible to say exactly which programming languages are macro preprocessors and which are not, the main difference is that a macro preprocessor will often leave most of its input untouched. Thus, if you run cpp, m4 or simple on a file containing the single line ``Hello, world!'', you get that as output also (plus, in the case of cpp, an extra line that contains a line number).

While an ordinary programming language needs a special instruction (such as printf or write) to output something, a macro preprocessor does this by default. The only parts of the input that are actually interpreted by the macro preprocessor are called macros. Thus, if an expression occurs very frequently in your file, you might want to define a macro to represent it, and you will then use this macro instead of your expression, so as to make the file shorter. Naturally, this is the most limited kind of use for a macro preprocessor, but it is still an important one.

What are SIMPLE's strengths?

SIMPLE has a very logical syntax, much more so than m4 or cpp.
SIMPLE manipulates token lists rather than character lists, which makes it much less awkward than m4 for certain things (such as quoting).
The use of a macro introduction character (namely <, as in SGML) makes sure that SIMPLE's macros won't get in the way when you weren't expecting it.
As a programming language, SIMPLE is more usable than m4 (as for cpp, it doesn't even have Turing power, so forget it).
SIMPLE has a lexer that is much more powerful and flexible than that of m4 (which, essentially, doesn't have one).

What are SIMPLE's weaknesses?

SIMPLE's syntax makes it a little difficult to use in preprocessing programs (in C for example).
SIMPLE's quotation rules (although quite logical and elegant) mean that sometimes you have to scratch your head for a long time to determine how many levels of quote to use.
simple is rather slow and inefficient in terms of memory consumption.
simple doesn't have all the cool builtin macros that m4 has (output control is missing, notably - but that probably will come someday).
simple's error messages are utterly useless.

SIMPLE user's guide and tutorial

Convention: SIMPLE examples will be presented in the following way:

<def|greet|Hello, @1@!>%
<greet|world>

Hello, world!

The first part is the input which is presented to simple, and the second part (after the blank line) is the output produced by it. Thus, the previous example means that if you run simple on a file that contains just two lines, the first being <def|greet|Hello, @1@!>% and the second being <greet|world>, the output will just be one line, whose content is Hello, world!.

We encourage readers to try all the examples.

Basics

Using `simple` as `cat`

Let us start with something very simple.

DON'T PANIC

DON'T PANIC

In other words, simple just copies to the output whatever it is fed; that is true so long as the input does not contain any of the ten special characters, which are (by default) `, ", %, <, |, >, #, [, ] and @. (Note that the first character in this list is the backquote (grave accent), which may be difficult to tell apart from the right quote (acute accent, apostrophe) if your character font is weird.)

Here is another example:

Hello, world!
This SIMPLE business is *really* simple.
\bye

Hello, world!
This SIMPLE business is *really* simple.
\bye

Escaping special characters

Now suppose you want to use one of the special characters - that is, produce it in the output. Just typing it is likely not to do what you want. Luckily, however, SIMPLE provides a way of making a special character non-special, called escaping it.

Escaping one character

The backquote ` is SIMPLE's escape character. This means that any special character (including the backquote itself) loses its special meaning when preceded by the backquote (unless the backquote in question is itself... well, you get the picture). Here is a first example:

Santa Claus `<santa.claus`@toys.np`>

Santa Claus <santa.claus@toys.np>

The backquote prevented SIMPLE from interpreting the special characters <, > and @. If you left out those backquotes, you would get an error message.

Note that it is not an error to escape an ordinary (i.e. not special) character: it just leaves the ordinary character in question unaltered. For example:

Wonderful`!

Wonderful!

Thus, if you can't remember whether a character is special or not, you can always put a backquote before it - it won't hurt.

The backquote will also have effect on the backquote itself. Thus, if you want to produce a backquote in the output, you must double it in the input:

````How strange!''  said Alice.  ````I could have sworn you said ``hello'
to me.  I must be hearing voices.''

``How strange!''  said Alice.  ``I could have sworn you said `hello'
to me.  I must be hearing voices.''

Escaping several characters at once

It is occasionally useful to escape several characters in a row without having to put a backquote in front of each of them. There is a way to do that: put a double quote (") character around the whole string which should be escaped (one at the beginning and one at the end). Here is an example:

Santa Claus "<santa.claus@toys.np>"

Santa Claus <santa.claus@toys.np>

Of course, the quoted string doesn't have to contain any special characters. Ordinary characters will work fine.

"Simple is wonderful!"

Simple is wonderful!

Note that this method works provided that the characters in question do not include the backquote or the double quote characters themselves. Let us repeat that: within double quotes, every special character loses its special meaning except the backquote (which is still used to escape the next character, generally another backquote or double quote) and the double quote (which is used to terminate the escaped region).

A few more examples should make all this clear:

Santa Claus "<santa.claus@toys.np>"
"Percentage #2 is 25%, and that is < [or >] than @ home."
There are no quotes around "this".
However, there are quotes around `"this`".
"Even inside quotes, the backquote (``) must be escaped, as in: ````."
"The same applies to the quotes (`") themselves, obviously."

Santa Claus <santa.claus@toys.np>
Percentage #2 is 25%, and that is < [or >] than @ home.
There are no quotes around this.
However, there are quotes around "this".
Even inside quotes, the backquote (`) must be escaped, as in: ``.
The same applies to the quotes (") themselves, obviously.

Comments

The special character % is the comment character. Anything that follows it (if it is not escaped, that is), up to the end of the line, is ignored by SIMPLE. It will be gobbled, that is, it will not appear on the output. Here is an example:

This is not a comment.
%This is a comment.

This is not a comment.

Of course, if you want to produce a percent sign on the output, you must escape the comment character. Then, it will be copied to the output and it will not start a comment.

Note that a comment character causes SIMPLE to ignore everything that follows up to and including the end of line character. In other words, if a line ends with a comment, the next line will follow immediately on the same output line:

This is ordinary text %and this is a comment.
and this is the continuation of it.
Note how the new line character was swallowed by the comment.
%
10`% of 90 is 9.
`%This is not a comment %but this is.
so it should appear on the output.

This is ordinary text and this is the continuation of it.
Note how the new line character was swallowed by the comment.
10% of 90 is 9.
%This is not a comment so it should appear on the output.
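
The escaping and comment rules seen so far are easy to model outside of simple. The following Python sketch is not simple's actual lexer: the function name passthrough is made up for illustration, it treats characters as tokens, and it knows nothing about macros, quoting with brackets or #. It only implements the three rules described above: the backquote escapes the next character, double quotes escape a whole run, and an unescaped % swallows everything up to and including the end of the line.

```python
def passthrough(text: str) -> str:
    """Toy model of simple on macro-free input (hypothetical helper)."""
    out = []
    i, n = 0, len(text)
    in_quotes = False
    while i < n:
        ch = text[i]
        if ch == "`" and i + 1 < n:
            # Backquote: copy the next character literally, whatever it is.
            out.append(text[i + 1])
            i += 2
        elif ch == '"':
            # Unescaped double quote: toggle the escaped region, emit nothing.
            in_quotes = not in_quotes
            i += 1
        elif ch == "%" and not in_quotes:
            # Comment: skip up to and including the end-of-line character.
            eol = text.find("\n", i)
            i = n if eol == -1 else eol + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "This is ordinary text %and this is a comment.\n"
        "and this is the continuation of it.\n"
        "10`% of 90 is 9.\n"
    )
    print(passthrough(sample), end="")
```

Run on the comment example above, this model gives the same four output lines as simple does.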

Macros

So far, we've had SIMPLE do two things: repeating input and ignoring it. Luckily, however, a macro preprocessor has other uses. As the name indicates, most of those center around macros.

Defining our first macro

To define a macro in SIMPLE, the syntax is

<def|macro name|macro content>

To call a macro once it is defined, the syntax is

<macro name>

We start with a traditional greeting:

<def|greeting|Hello, world!><greeting>
<greeting>
<greeting>
How I like to say ````<greeting>''.

Hello, world!
Hello, world!
Hello, world!
How I like to say ``Hello, world!''.

This example defines a macro called <greeting> whose content is Hello, world!, and then calls that macro four times.

A few points to note

Note that in the example above, the definition of the macro is immediately followed by a call to the macro. This is because spaces, new lines, or any other character, are not ignored after a macro definition. If we added a new line character, it would appear on the output, which is bad:

Now we are about to define the macro.
<def|greeting|Hello, world!>
Now we have defined it.
Now we are about to use it.
<greeting>
Now we have used it.

Now we are about to define the macro.

Now we have defined it.
Now we are about to use it.
Hello, world!
Now we have used it.

(Note the blank line in position 2.)

Luckily, we have seen above how to gobble such unwanted new line characters: just add a comment character at the end of the line. Thus, it is frequent in SIMPLE that macro definitions (and a lot of other commands which don't produce output) be immediately followed by a comment character. This is how we would rewrite our first example for better readability:

<def|greeting|Hello, world!>%
<greeting>
<greeting>
<greeting>
How I like to say ````<greeting>''.

Hello, world!
Hello, world!
Hello, world!
How I like to say ``Hello, world!''.

Of course, it is still possible to have the string <greeting> appear as such on the output. To do that, you must simply escape it:

<def|greeting|Hello, world!>%
<greeting>
The input ````"<greeting>"'' produces the output ````<greeting>''.

Hello, world!
The input ``<greeting>'' produces the output ``Hello, world!''.

Macros with parameters

It is frequently useful to have macros which take a parameter, and whose expansion will depend on that parameter. SIMPLE makes this possible.

To call a macro with parameters, the syntax is

<macro name|parameters>

where the parameters are separated by the vertical bar (``pipe'') special character |.

In the macro expansion, the sequence @1@ is replaced by the value of the first parameter, the sequence @2@ by that of the second, and so on.

Here is an example of a macro with a parameter:

<def|greet|Hello, @1@!>%
<greet|Peter>
<greet|Paul>
<greet|Mary-Jane>
<greet|everybody else>
<greet|world>

Hello, Peter!
Hello, Paul!
Hello, Mary-Jane!
Hello, everybody else!
Hello, world!

Note that if a macro makes reference to more parameters than are actually supplied, the extra references will be expanded to an empty string. Conversely, if more parameters are passed to a macro than it actually uses, the extra parameters are simply ignored. Here is an example:

<def|secondarg|@2@>%
Without arguments: <secondarg>
With one argument: <secondarg|first>
With two arguments: <secondarg|first|second>
With three arguments: <secondarg|first|second|third>
With four arguments: <secondarg|first|second|third|fourth>

Without arguments: 
With one argument: 
With two arguments: second
With three arguments: second
With four arguments: second

(Although you probably cannot see it, there is exactly one blank space at the end of each of the first two output lines.)
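
As a rough illustration of the substitution rule (missing parameters become empty, extra ones are ignored), here is a small Python sketch. It is only a model under my own assumptions: the name substitute is hypothetical, only the @1@, @2@, ... forms are handled, and no evaluation takes place.

```python
import re


def substitute(body: str, args: list[str]) -> str:
    """Replace @1@, @2@, ... in body by the matching argument (toy model)."""

    def replacement(match: re.Match) -> str:
        k = int(match.group(1))
        # A reference past the end of the argument list expands to nothing.
        return args[k - 1] if k <= len(args) else ""

    return re.sub(r"@([1-9][0-9]*)@", replacement, body)


if __name__ == "__main__":
    for args in ([], ["first"], ["first", "second"], ["first", "second", "third"]):
        print(repr(substitute("@2@", args)))
```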

Builtin macros

Builtin macros are macros which are internal to simple. Rather than being defined by the user (as those we have seen up to now), they are written in C within the code of simple itself. It is not permitted to redefine a builtin macro (that is, to define a user macro having the same name as an existing builtin macro).

Builtin macros will not only expand to certain strings, but they can also have side effects. Up to now, we have seen one builtin macro, the <def> builtin, which takes two parameters: the name of a user macro to define (or redefine), and the actual definition. The <def> builtin itself expands to nothing (that is, when you write something like <def|foo|bar>, nothing actually goes to the output because of this - it is the act of calling <foo> which will produce something); however, it has one important side effect, namely defining a new user macro.

Now, SIMPLE has a lot of builtin macros, which will do various important things. Let me just mention a few, just to give a feeling of what they are. Actually, for the moment, you will probably find them pretty useless, because a lot still has to be said about how SIMPLE works before they can be used effectively.

The <quit> builtin will quit SIMPLE prematurely:

This sentence goes out.
<quit>
This one does not.

This sentence goes out.

(In fact, it quits so abruptly that, if you were feeding simple its data on the standard input, the second sentence will not even be read, and might go to another program reading the same standard input after simple.)

The <id> builtin just returns its first argument. The <void> builtin, on the other hand, ignores everything you pass to it, and expands to nothing at all.

<id|This is a string>
<void|This is not a pipe>
<id|First|Second|Third|Fourth>

This is a string

First

<head> returns the first character (actually, token) of its first argument. <tail> returns every character (actually, token) except the first:

<head|This is a string>
<tail|And this is another>
<head|Hi, Joe!><tail|Zello, world!>

T
nd this is another
Hello, world!

<translate> takes two parameters: the first is a string of even length, called a translation table, and the second is a string to translate. It will replace every occurrence in the string to translate of the first character of the translation table by the second character of the translation table, every occurrence of the third character by the fourth character, and so on.

<translate|aeeatnntoiio|Now is the time for all good men
to come to the aid of the party.>

Niw os nha noma fir ell giid mat
ni cima ni nha eod if nha perny.
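
For readers who find the translation table easier to grasp in code, here is a Python sketch of the same pairing rule. It is a model, not the builtin's source: the function name is mine, characters stand in for tokens, and what simple does with an odd-length table or a repeated source character is not stated here, so the sketch simply rejects the former and lets Python's rule (the last pair wins) decide the latter.

```python
def translate(table: str, text: str) -> str:
    """Map table[0]->table[1], table[2]->table[3], ... over text (toy model)."""
    if len(table) % 2 != 0:
        raise ValueError("translation table must have even length")
    # Characters at even positions are sources, those at odd positions targets.
    mapping = str.maketrans(table[0::2], table[1::2])
    # All replacements happen in one pass, so a->e and e->a do not cancel out.
    return text.translate(mapping)


if __name__ == "__main__":
    print(translate("aeeatnntoiio", "Now is the time for all good men"))
```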

The <len> builtin evaluates to the length (written in decimal) of its first argument:

<len|Now is the time>

15

Beyond the basics

Quoting

Introduction to quoting

Let's start with a simple situation. Suppose you want to define a macro <exit> which simply calls the <quit> builtin. The straightforward approach is this:

I am about to define the macro.
<def|exit|<quit>>%
I have now defined the macro.
<exit>
I have now used the macro.

I am about to define the macro.

It may appear surprising (especially to people familiar with Scheme) that only the first sentence gets output - and simple quits after that. But in fact, it is quite logical: we are defining <exit> to be the result of <quit> - and in order to find what that result is, simple will call the <quit> builtin, and therefore quit. What we really need is not to define <exit> to the result of <quit>, but rather to the (token) string <quit> itself, so that when it is evaluated, and only then, the <quit> builtin will be called - thus quitting the program. This is done as follows:

I am about to define the macro.
<def|exit|[<quit>]>%
I have now defined the macro.
<exit>
I have now used the macro.

I am about to define the macro.
I have now defined the macro.

In other words, the [ and ] special characters, called SIMPLE's quote characters, have prevented the <quit> builtin from being called. For this reason, they are also called ``evaluation preventers'', and quoting is also called ``preventing evaluation''.

It may seem a bit difficult at first, especially to people familiar with m4, to understand the difference between quoting and escaping. I think the best possible explanation is to see what would happen if we escaped <quit> rather than quoting it:

I am about to define the macro.
<def|exit|"<quit>">%
I have now defined the macro.
<exit>
I have now used the macro.

I am about to define the macro.
I have now defined the macro.
<quit>
I have now used the macro.

The <exit> macro was this time defined to be the character string <quit>, in which the < and > characters are not the special begin-of-macro and end-of-macro tokens, but simply the (dumb) ASCII characters. Therefore, nothing special happens when you call it.

Let us give one more, striking, example of quoting:

Compare this:
<def|test|foo>%
<def|foo|<test>>%
<def|bar|[<test>]>%
<def|test|bar>%
<foo> (this should be ````foo'')
<bar> (this should be ````bar'')

Compare this:
foo (this should be ``foo'')
bar (this should be ``bar'')

What happens is this: <test> is initially defined to be foo. Then <foo> is defined to be the value of <test>, that is, foo, whereas <bar> is defined to be <test>. Then <test> is redefined to be bar. At that point, when <foo> is evaluated, it gives foo (because it was defined as such), whereas <bar> gives <test>, which then gives bar, because <test> is defined as bar at that point.

How evaluation proceeds

It is now time to explain how SIMPLE evaluates things. We give a rather simplified explanation now, reserving the full explanation to later.

The first thing that SIMPLE does when it sees a macro call is evaluate the arguments. It does so even if the macro doesn't actually need the arguments, even if it ignores them. This model is called ``eager evaluation''. Thus, if the arguments produce any side effect, those will take place immediately. For example:

Before
<void|<quit>>
After

Before

Even though <void> ignores its argument, the argument <quit> gets evaluated, and simple exits immediately.

Evaluation of the arguments takes place in left to right order. For example:

<void|<def|x|1>|<def|x|2>>%
<x>

2

Here, the first argument is evaluated, and defines <x> to be 1, and then the second argument is evaluated, and redefines <x> to be 2. So after the call, <x> is equal to 2.

Once the arguments have been evaluated, they are passed, in evaluated form, to the actual macro. If the macro is a user-defined macro, the @1@, @2@, and so on, strings in its definition are replaced by the (evaluated) values of the corresponding arguments. If it is a builtin, what the builtin does depends on what it is. In any case, some expansion is produced (together with some possible side-effects if the macro is a builtin). After that, the expansion gets evaluated once more, and it is the resulting string that gives the final result of the evaluation. This last step makes it possible for a macro to call other macros: it contains the calling sequence for these other macros in its definition, and since the expansion gets reevaluated, the other macros will be called.

Finally, if the evaluation procedure ever runs into a quoted expression, it will just evaluate it to the quoted expression (without the quotes).

Let us now work through a few examples to make things clear.

<len|<id|Hello, world!>>
<len|<id|[Hello, world!]>>
<len|<id|[[Hello, world!]]>>
<len|<id|[[[Hello, world!]]]>>
<len|[<id|Hello, world!>]>
<len|[[<id|Hello, world!>]]>

13
13
13
15
18
20

In the first line, the <len> call first evaluates its first argument, <id|Hello, world!>. The latter first evaluates its first argument, Hello, world!. Since this does not contain any macro call, it is returned as such. The <id> builtin then expands to Hello, world!, which is reevaluated. Since it (still) does not contain any macro call, it is returned as such. Thus, <len> gets Hello, world! as its first argument. It then expands to the length of that string (namely 13). Then that expansion is reevaluated, and since it does not contain any macro call, it is evaluated again to 13.

The second line is similar. This time, <id> gets [Hello, world!] as its first argument, and the latter evaluates to Hello, world!, and everything then proceeds as previously. For the third line, the argument to <id> is [[Hello, world!]], which evaluates to [Hello, world!]. So the <id> builtin expands to [Hello, world!], but that gets reevaluated to Hello, world!, and then everything proceeds again in the same fashion (we still get 13 as answer). It is only after three levels of quote that the quotes overpower the <id>: on the fourth line, the <id> call finally evaluates to [Hello, world!], and <len> calculates the length of that, in other words 15.

On the fifth line, the parameter to <len> is [<id|Hello, world!>]. That evaluates to <id|Hello, world!>. This was not a macro call, so we do not have another evaluation. Therefore, <len> calculates the length of <id|Hello, world!>, and the result is 18, which gets reevaluated to 18. Similarly, on the last line, the length of [<id|Hello, world!>] is calculated, and it is 20.
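
To make the walk-through above concrete, here is a toy evaluator in Python. It is emphatically not how simple is written: the names scan, expand and evaluate are hypothetical, characters stand in for tokens, and only brackets, <id>, <len>, <def> and user macros with @1@-style parameters are modelled (no escaping, no comments, no #). The one modelling choice worth noting is that an expansion is pushed back onto the input and scanned again, which is how the sketch obtains the "expansion gets reevaluated" step.

```python
import re


def expand(call, macros):
    """Expand one call: call[0] is the macro name, call[1:] the evaluated arguments."""
    name, args = call[0], call[1:]

    def arg(k):
        return args[k - 1] if 1 <= k <= len(args) else ""

    if name == "id":
        return arg(1)
    if name == "len":
        return str(len(arg(1)))
    if name == "def":
        macros[arg(1)] = arg(2)  # side effect; <def> itself expands to nothing
        return ""
    # User macro (an undefined name raises KeyError in this sketch).
    return re.sub(r"@([1-9][0-9]*)@", lambda m: arg(int(m.group(1))), macros[name])


def scan(stream, macros, in_call):
    """Evaluate characters popped from stream; return the list of evaluated parts."""
    parts = [""]
    while stream:
        ch = stream.pop()
        if ch == "[":
            # Quoted expression: copy up to the balancing ], minus the outer pair.
            depth = 1
            while stream:
                ch = stream.pop()
                if ch == "[":
                    depth += 1
                elif ch == "]":
                    depth -= 1
                    if depth == 0:
                        break
                parts[-1] += ch
        elif ch == "<":
            # Eager evaluation: arguments are evaluated while the call is read.
            call = scan(stream, macros, True)
            # Push the expansion back so that it gets evaluated once more.
            stream.extend(reversed(expand(call, macros)))
        elif ch == "|" and in_call:
            parts.append("")
        elif ch == ">" and in_call:
            return parts
        else:
            parts[-1] += ch
    return parts


def evaluate(text, macros=None):
    macros = {} if macros is None else macros
    return scan(list(reversed(text)), macros, False)[0]


if __name__ == "__main__":
    for line in [
        "<len|<id|Hello, world!>>",
        "<len|<id|[Hello, world!]>>",
        "<len|<id|[[Hello, world!]]>>",
        "<len|<id|[[[Hello, world!]]]>>",
        "<len|[<id|Hello, world!>]>",
        "<len|[[<id|Hello, world!>]]>",
    ]:
        print(evaluate(line))
```

On the six lines discussed above, this model produces 13, 13, 13, 15, 18 and 20, in agreement with the explanation. Whether it agrees with simple on harder inputs is another matter, so treat it as a thinking aid only.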

Here is another example:

<def|double_bubble|[<@1@|@1@>]>%
<double_bubble|id>

id

In the first line, the first argument to <def> is double_bubble, and it does not change on evaluation. The second argument is [<@1@|@1@>] and it evaluates to <@1@|@1@>. Thus, <double_bubble> is defined as <@1@|@1@> (the <def> builtin itself evaluates to nothing). The second line calls <double_bubble> with id as first argument (which remains the same upon evaluation). Thus, <double_bubble> expands to <id|id>, which is then reevaluated. The first argument to that call is id, and it remains the same, so <id> expands to id and that gets reevaluated again, but does not change. Thus, the final result is id.

Naturally, replacing the second line by <double_bubble|double_bubble> would have caused an endless loop (however, since SIMPLE is ``properly tail-recursive'', no stack overflow is to be feared).

Make the best use of quotation

It is not always obvious what is to be quoted and what is not. For the <def> builtin, there is the following rule of thumb: quote the second argument (this is the most frequent case) when you think you are defining a function, and do not quote it when you think you are defining a variable. Thus, in the foobar example given earlier, you should imagine <foo> as a variable defined as foo (because that is what <test> was equal to when <foo> was defined), and <bar> as a function that returns the current value of the <test> variable.

The <if> builtin presents its particular problems. It is a builtin that generally takes 4 parameters. The first two parameters are compared: if they are equal, <if> expands to the third parameter, otherwise to the fourth. Here is an example:

<def|greet|[<if|@1@||[Hello!]|[Hello, @1@!]>]>%
<greet>
<greet|Peter>
<greet|world>
<greet|>

Hello!
Hello, Peter!
Hello, world!
Hello!

The outer level of quote in the definition of <greet> is evidently necessary: if it weren't there, SIMPLE would immediately (at the definition) compare @1@ and the empty string, and therefore define <greet> as Hello, @1@!. The inner level of quote isn't necessary in as trivial an example as this, because Hello!, Hello, Peter! and so on, expand to themselves anyway (note incidentally that quoting doesn't prevent parameter substitution). However, the inner quotes are necessary in a case such as this:

<def|prepare_to_greet|[<if|@1@||[<def|whom_to_greet|world>]|%
[<def|whom_to_greet|@1@>]>]>%
<def|greet|[Hello, <whom_to_greet>!]>%
<prepare_to_greet|Paul><greet>
<prepare_to_greet><greet>
<greet>

Hello, Paul!
Hello, world!
Hello, world!

For this reason, it is generally best to quote the two last arguments to <if> - so that only one of them actually gets evaluated, which is often the desired effect. However, the two first arguments will generally go unquoted (for example, to compare the first character of the first argument with an x, one will do <if|<head|@1@>|x|[<it_is_an_x>]|[<it_is_not_an_x>]>).

More about quoting

In the first place, it should be noted that the quote brackets need not be balanced with respect to the macro (angle) brackets. The following works fine:

<def|bgroup|[<]>%
<def|egroup|[>]>%
<id|Hello, world!>
<bgroup>id|Hello, world!>
<id|Hello, world!<egroup>
<bgroup>id|Hello, world!<egroup>

Hello, world!
Hello, world!
Hello, world!
Hello, world!

Similarly, the following works:

<def|greet2|Hello, @1@, and also you, @2@!>%
<greet2|Peter|Paul>
<def|two_people|[Judy|Jane]>%
<greet2|<two_people>>

Hello, Peter, and also you, Paul!
Hello, Judy, and also you, Jane!

In other words, what we have said about evaluation is not completely true: the parameter evaluation step does not localize all the parameters and then evaluate them all. Rather, it progressively evaluates what it receives, switching to the next parameter when it sees a |, and ending the parameter list when it sees a >. The name of the macro also is not special, as the following shows:

<def|verse0|The Moving Finger writes; and, having writ,>%
<def|verse1|Moves on: nor all your Piety nor Wit>%
<def|verse2|  Shall lure it back to cancel half a Line,>%
<def|verse3|Nor all your Tears wash out a Word of it.>%
<def|cur|0>%
<def|nextverse|[<def|cur|<+|<cur>|1>>]>%
<def|quoteverse|[<verse<cur>><nextverse>]>%
<quoteverse>
<quoteverse>
<quoteverse>
<quoteverse>

The Moving Finger writes; and, having writ,
Moves on: nor all your Piety nor Wit
  Shall lure it back to cancel half a Line,
Nor all your Tears wash out a Word of it.

Here, the <cur> variable serves as a counter. <nextverse> is a function that increases this counter by one, and <quoteverse> will first call the function whose name begins with verse, followed by the value of <cur>, and then call <nextverse>. The builtin function <+> calculates the sum of its arguments (which are supposed to be numbers). Note that the quotes around <def|cur|<+|<cur>|1>> are essential, but that a contrario there should be no quotes around <+|<cur>|1>; this is in accord with our rule of thumb given earlier, since <cur> is considered a variable whereas <nextverse> is considered a function.

Quoting a single token

Now assume we wanted to quote a single [ token. How could we do that? Obviously [[] couldn't work. To solve this problem, SIMPLE introduces another quoting method which is less convenient than brackets, but which works all the time: the pound (number, hash, sharp) sign, #. It quotes whatever token comes immediately after it - and that is generally the same as putting the token in question within brackets. For example, #< is about the same thing as [<] (not exactly the same thing - for example, the length of the former is 2 whereas that of the latter is 3 - SIMPLE does not internally convert one to the other, but they do have the same effect).

Since the # sign will quote a single token, if two quoting levels are required (such as are obtained by using two bracket levels), three # signs will be needed (the first to quote the second, and the second and third to quote whatever is to be quoted). This is essentially the reason why quoting with # is awkward.

As an example of how this works, let us consider the following:

<len|<id>>
<len|#<id#>>
<len|##<id>>
<len|###<id###>>

0
4
1
6

In the first line, <id> evaluates to the empty string, and its length is 0. In the second case, the parameter #<id#> evaluates to <id> and its length is 4. In the third line, the parameter ##<id> evaluates to # (the < is not quoted, so it does start a macro), which has length 1. In the last line, the parameter ###<id###> evaluates to #<id#> and that has length 6.

Quoting with # and quoting with brackets do not mix as well as one would like. Just as the " (double quote) character escapes every character except for ` and ", so the brackets quote every character except for # - as for brackets, they get quoted so long as they are balanced. To see how things get evaluated, use the <inputform> builtin, which converts a token string to harmless characters so you can see what's in it:

<inputform|[>]>
<inputform|#>>
<inputform|[]>
<inputform|#[>
<inputform|##>
<inputform|[[]]>
<inputform|[##]>
<inputform|[[##]]>
<inputform|###>>
<inputform|#[#]>
<inputform|[#[]>
<inputform|[[#[]]>
<inputform|[[###[]]>

>
>

[
#
[]
#
[#]
#>
[]
[
[[]
[#[]

Quoting parameters

Suppose we want to write a function that will take a string as input and return the string The length of the string is:(whatever). The obvious thing to write is something like this:

<def|whatis_len|[The length of the string is: <len|@1@>]>%
<whatis_len|Hello, world!>

The length of the string is: 13

Unfortunately, if after this we naïvely try to write <whatis_len|[<quit>]>, we find that it simply quits the program. What is wrong here? Well, [<quit>] evaluates to <quit>. Thus, the <whatis_len> call expands as The length of the string is: <len|<quit>>, and when this gets evaluated (remember that the result of macro expansion always gets reevaluated), the <quit> builtin forces simple to quit.

A better solution is this

<def|whatis_len|[The length of the string is: <len|[@1@]>]>%
<whatis_len|[<quit>]>

The length of the string is: 6

This time, the first parameter [<quit>] evaluates to <quit>, so that the <whatis_len> call expands to The length of the string is: <len|[<quit>]>. In this call, the [<quit>] parameter to <len> evaluates to <quit>, so that <len> expands to 6, hence the final result.

This solution, however, is not wholly robust. Indeed, if we then run <whatis_len|##>, we have a strange surprise: simple writes the beginning, ``The length of the string is:'', and then stops there. What is happening? If we trace the calls as above, we find that we end up at an evaluation of <len|[#]>. Now in this thing, the # quotes the following ], which then does not close the preceding [, so that the > gets quoted, and the <len macro call start is unterminated. (Ending the file here would cause versions of SIMPLE prior to 2.2.0 to go mad and possibly take the system down with them. This is now corrected.)

The correct thing would be to quote argument 1 using # rather than [ and ]. But this seems impossible: while it is obvious how to put [ and ] around argument 1, putting a # sign before every token of it is not directly possible. SIMPLE provides a solution, however: instead of writing @1@, write @,1@, and this will be replaced by argument 1, only with every token preceded by a #. This gives us:

<def|whatis_len|[The length of the string is: <len|@,1@>]>%
<whatis_len|Hello, world!>
<whatis_len|[<quit>]>
<whatis_len|[[<quit>]]>
<whatis_len|##>
<whatis_len|[<]>

The length of the string is: 13
The length of the string is: 6
The length of the string is: 8
The length of the string is: 1
The length of the string is: 1

Let us work out the third example (the other ones are similar). The parameter [[<quit>]] evaluates to [<quit>]. Therefore, @,1@ is replaced in the body of the definition by #[#<#q#u#i#t#>#]. This is the first parameter to <len>, and it evaluates to [<quit>], whose length is 8, hence the result.
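The quoting arithmetic of this walk-through can be checked with a small Python model. It is only a sketch under simplifying assumptions: the characters #, [ and ] stand directly for the QUOTE_NEXT, OPEN_QUOTE and CLOSE_QUOTE tokens, macro calls are ignored altogether, and the names quote and evaluate_quotes are invented for illustration. It is not how simple itself is implemented.

```python
def quote(tokens: str) -> str:
    """Model of @,1@: put a '#' (QUOTE_NEXT) before every token."""
    return "".join("#" + t for t in tokens)


def evaluate_quotes(tokens: str) -> str:
    """Strip one level of quotation, as one evaluation pass would.

    Only '#', '[' and ']' are interpreted; every other token is copied.
    """
    out = []
    depth = 0  # current bracket nesting level
    i = 0
    while i < len(tokens):
        t = tokens[i]
        if t == "#":
            # QUOTE_NEXT: drop the '#' and copy the next token untouched,
            # even if that token is a bracket.
            if i + 1 >= len(tokens):
                raise ValueError("Unterminated quotation")
            out.append(tokens[i + 1])
            i += 2
            continue
        if t == "[":
            depth += 1
            if depth > 1:  # only the outermost pair is removed
                out.append(t)
        elif t == "]":
            if depth == 0:
                raise ValueError("Misplaced special token")
            depth -= 1
            if depth > 0:
                out.append(t)
        else:
            out.append(t)
        i += 1
    if depth:
        raise ValueError("Unterminated quotation")
    return "".join(out)
```

With this model, quote("[<quit>]") gives #[#<#q#u#i#t#>#], and evaluating that once gives back [<quit>], which is eight tokens long. Evaluating [#] fails with an unterminated quotation, because the # quotes the closing bracket; this is the failure described for the bracket-based definition.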

Now when should one use @1@ and when should one use @,1@? This is not always obvious. In many cases, it is unimportant. When it is important, it is nearly always @,1@ that should be used - so when in doubt, use that. @1@ should be used only when you're sure your arguments need to be evaluated one more time - or if you're concerned about efficiency. If you need to pass some argument to another macro, then have no doubt, @,1@ is the thing you need. However, you should remember that if @,1@ is contained within brackets (except the outside pair of brackets just needed to protect the second argument of <def> from evaluation), those brackets won't be effective (they won't add another layer of quotation), because of the way brackets and # interact. There is no problem in having @,1@ expand an argument that contains brackets.

Parameter substitution

We have seen that @1@, @2@, and so on, in a user-defined macro, expand to the first, second, and so on, arguments respectively. Similarly, @,1@, @,2@, and so on, expand to the quoted arguments. What else can be used in that direction?

Well, in the first place, whatever it's good for, @0@ can be used to refer to the command name itself. Negative values count the argument from the end, @-1@ being the last argument, @-2@ the penultimate one, and so on (going down like that, we encounter the command name again, and arguments before that are considered empty).

If multiple commas are used, they will provide multiple levels of quotation. Thus, @,,1@ refers to the first argument quoted twice, and so on. Generally, you will want to use as many commas as there are levels of brackets around the argument reference (not counting the external brackets that are just used for <def>), possibly one more.

Apart from the comma, it is also possible to have a semicolon: @;1@ gets replaced by all the arguments, starting from the first, quoted and preceded by | tokens (which are themselves not quoted, of course). Note that even the first argument gets preceded by such a | token. (If there are no arguments then @;1@ gets replaced by the empty string.) Similarly, @;2@ gets replaced by all arguments from the second one on, @;0@ by all arguments including the function name (even the function name is preceded by a | token), and @;-3@ by the three last arguments, and so on. The expression @;@ can be used as a synonym for @;1@.

Next, @.1@ is to @;1@ what @1@ is to @,1@. In other words, it expands once again to all the arguments separated by |, but this time they are not quoted. (In general, this is less useful, since the arguments themselves may contain |, and it is then not possible to tell them apart with certitude.)

As for @?@, it expands to the number of parameters, not counting the macro name itself. Finally, @@ expands (or, more appropriately, contracts) to a single @ token.

Note that the recognition of the @...@ sequences is done directly on the user macro definition string, before any evaluation takes place. Thus, it is not possible to refer to the <n>-th parameter as @<n>@ or some such thing. If such a thing were needed, you should call an indexing function, passing to it the value of <n> and all the parameters of your macro (using @;1@).

Here are now a few examples of using these various strings:

<def|greet_all|[<if|@?@|0|[Hello, everybody!]|@?@|1|[Hello, @1@!]|%
[Hello<greet_helper@;1@>!]>]>%
<def|greet_helper|[<if|@?@|1|[ and @,1@]|%
[, @1@<greet_helper@;2@>]>]>%
<greet_all>
<greet_all|mighty Caesar>
<greet_all|Peter|Paul>
<greet_all|Groucho|Chico|Harpo>

Hello, everybody!
Hello, mighty Caesar!
Hello, Peter and Paul!
Hello, Groucho, Chico and Harpo!

Here we are using an extension of the <if> builtin which we haven't talked about: when it takes 7 parameters, it will return the third if the first two match; otherwise, if the fourth and fifth match, it will return the sixth; otherwise, it will return the seventh. Thus, if <greet_all> is given no arguments, it returns Hello, everybody!. If it is given one argument, it returns Hello, @1@!. Otherwise, it calls a function called <greet_helper> with all its arguments, to separate the names with commas, and put the word and before the last. It does this recursively: it writes a comma and a space followed by its first argument, and calls itself again with one less argument (@;2@), until it is down to one single argument (the last), in which case it writes the and before the argument. In this example we could very well have written @,1@ instead of @1@, it wouldn't have mattered (since the arguments are supposed to be devoid of special tokens anyway).
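For readers more at ease with a conventional language, the same recursion can be sketched in Python. This only mirrors the control flow of the two macros: the slice names[1:] plays the role of @;2@, and there is of course no quoting or reevaluation involved.

```python
def greet_helper(names):
    """Mirror of <greet_helper>: names holds the remaining arguments."""
    if len(names) == 1:
        # Down to the last argument: write "and" before it.
        return " and " + names[0]
    # Otherwise write ", first" and recurse on the rest (the role of @;2@).
    return ", " + names[0] + greet_helper(names[1:])


def greet_all(*names):
    """Mirror of <greet_all>: dispatch on the number of arguments (@?@)."""
    if len(names) == 0:
        return "Hello, everybody!"
    if len(names) == 1:
        return "Hello, " + names[0] + "!"
    return "Hello" + greet_helper(list(names)) + "!"
```

Calling greet_all("Groucho", "Chico", "Harpo") follows the same three steps as the SIMPLE version: two comma-prefixed names, then the last one prefixed by and.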

<def|first|[<def|second|[@,,1@ and @@,1@@]>]>%
<first|Romeo><second|Juliet>
<first|The Iliad><second|the Odyssey>
<first|So long,>%
````<second|thanks for all the fish>'' said the dolphins.
<inputform|<first|[<id>]><second|[<void>]>>

Romeo and Juliet
The Iliad and the Odyssey
``So long, and thanks for all the fish'' said the dolphins.
<id> and <void>

The important thing to note in the example above is the double at signs. The definition of <first> is <def|second|[@,,1@ and @@,1@@]>, so that if we look at what <first|Romeo> does, it expands to <def|second|[###R###o###m###e###o and @,1@]>, because @@ will become @ and @,,1@ expands to the first argument quoted twice. Now this is evaluated and therefore <second> is defined as #R#o#m#e#o and @,1@ (remember that the brackets do not quote the pound signs). Thus, <second|Juliet> then expands to #R#o#m#e#o and #J#u#l#i#e#t, which evaluates to Romeo and Juliet. In this case, of course, the quotation has been nothing more than pedantic. In the last example, however, the quotation is exactly calibrated so that the <second> call will evaluate to <id> and <void> (too little quotation would have evaluated those functions, and too much would have introduced spurious # signs).

More about the lexer

So far we have described quite precisely how token strings are manipulated, but we haven't exactly explained what token strings are, or how they are produced. So we now go back up from the evaluator to the lexer.

The only things that SIMPLE manipulates are token strings. There is one token for every possible character, plus eight special tokens, which makes 264 tokens in total. Tokens 0 to 255 are the ordinary character tokens, and they are not treated specially by simple (even the `, < and similar characters are not special). The eight special tokens are BEGIN_COMMAND, END_COMMAND, NEXT_PARAM, OPEN_QUOTE, CLOSE_QUOTE, QUOTE_NEXT, AT_SIGN and the very special token EOF (which is produced only at the end of the last input file). When up to now we have spoken of the < token, we really meant the BEGIN_COMMAND token, not the < character token, which is in no way special. It just so happens that (in the SIMPLE setup configuration) < produces the BEGIN_COMMAND token, whereas the < character token is produced by `<.

The BEGIN_COMMAND token (with value 257 to be precise) normally corresponds to <, and it causes SIMPLE to start macro expansion. The END_COMMAND token (value 258) normally corresponds to >, and it causes SIMPLE to terminate parameter collection and proceed to the actual expansion. The NEXT_PARAM token (value 259) normally corresponds to |, and it causes SIMPLE to move to the next parameter. The OPEN_QUOTE token (value 260) normally corresponds to [, and it causes SIMPLE to start a bracket quotation level, whereas the CLOSE_QUOTE token (value 261), normally ], causes SIMPLE to end it. The QUOTE_NEXT token (value 262), normally #, causes SIMPLE to quote the next token. Finally, the AT_SIGN token (value 263), normally @, does not have any special effect upon SIMPLE when it is read by the evaluator - it simply acts as a marker so that user-defined macro expansion can replace it by something else.
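The token numbering can be summarized as a small Python enumeration. This is a sketch, not a transcription of simple's sources: it assumes that the eight special tokens occupy the values 256 to 263 without gaps, which puts AT_SIGN at 263 and leaves 256 for EOF. The EOF value in particular is an inference, since the text never states it. The names SpecialToken and is_ordinary are made up for the example.

```python
from enum import IntEnum


class SpecialToken(IntEnum):
    EOF = 256            # assumed: the only value in 256..263 left unassigned
    BEGIN_COMMAND = 257  # normally produced by <
    END_COMMAND = 258    # normally produced by >
    NEXT_PARAM = 259     # normally produced by |
    OPEN_QUOTE = 260     # normally produced by [
    CLOSE_QUOTE = 261    # normally produced by ]
    QUOTE_NEXT = 262     # normally produced by #
    AT_SIGN = 263        # normally produced by @ (assumed value, see above)


def is_ordinary(token: int) -> bool:
    """Tokens 0 to 255 are the ordinary character tokens."""
    return 0 <= token <= 255
```

Note that is_ordinary(ord("<")) is true: the < character token is ordinary, and only BEGIN_COMMAND, which the < character happens to produce in the initial setup, is special.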

On the lexer level, the problem is different. The lexer receives a stream of input characters and is in charge of producing a stream of input tokens for the evaluator to use. Most characters produce the ordinary character token to which they correspond. The exceptions are SIMPLE's ten special characters, which are `, ", %, <, >, |, [, ], # and @. Of those, the last seven are active characters, which means that they will generate a certain string of tokens when encountered by the lexer. In SIMPLE's initial configuration, this string is a one-token string for each of those characters, as described above - but it is possible to change this and make certain characters active and produce a longer string. The other three characters are special for the lexer, but do not correspond to special tokens. The ` character is an ``escape next'' character, and it causes the next character encountered (whatever it is) to produce an ordinary character token. The " character is the ``escape string'' character, and it causes every following character, except ` and ", and up to the next ", to produce an ordinary character token. Finally, the % character is the ``comment'' character, and it causes every following character, up to the next \n (end-of-line) character, to produce no token at all.

Five builtin macros will affect how the lexer performs the character-to-token translation. The <cartype_ordinary> builtin takes one argument, a list of characters which are to be made ordinary characters. Note that by ``a list of characters'' we mean a list of character tokens - since they probably are special before the call, they must be escaped for the call to work properly. For example, the following command will cause SIMPLE to cease performing any special action on the characters it reads:

<cartype_ordinary|```"`%`<`>`|`[`]`#`@>``That's <done>!''
Now [and from now on] no character is special for SIMPLE.
(This is irrevocable, naturally!)
`, ", %, <, >, |, [, # and @ no longer act specially.

``That's <done>!''
Now [and from now on] no character is special for SIMPLE.
(This is irrevocable, naturally!)
`, ", %, <, >, |, [, # and @ no longer act specially.

<cartype_escape_next> causes a character (or several characters) to act as an escape-next character. This does not cause any previous escape-next characters to lose their special meaning (you must explicitly call <cartype_ordinary> for that). The same goes for <cartype_escape_string> and <cartype_comment>, which add new escape-string and comment characters.

As for <cartype_active>, it takes two arguments. The first should be just one single ordinary character (what happens if more than one is supplied is left unspecified), and the second should be a list of tokens. From then on, when that character is used, the given list of tokens will be produced by the lexer.

Some users might like the following setup:

<cartype_escape_next|\><cartype_ordinary|\`>%
<cartype_comment|;><cartype_ordinary|\%>;
<cartype_active|(|#<>(cartype_ordinary|\<>;
(cartype_active|)|#>>(cartype_ordinary|\>);
(cartype_active| |#|)(cartype_ordinary \|);
(cartype_active ' ##)(cartype_ordinary \[\]\#);
(cartype_active $ '@)(cartype_ordinary \@);
(def greet "Hello, "$1$"!");
(greet world)
(cartype_active * '(quit'));
*
"This never gets reached!"

Hello, world!

Appendices

Differences between SIMPLE and m4

The most important difference between m4 and SIMPLE is that SIMPLE has a lexer step whereas m4 does not. In other words, whereas SIMPLE manipulates token strings, m4 handles character strings. To make this difference more striking, consider the following m4 excerpt:

changequote({,})dnl
define(__lq,{`})define(__rq,{'})dnl
define(_lq,{changequote({,})__lq{}changequote(`,')})dnl
define(_rq,{changequote({,})__rq{}changequote(`,')})dnl
changequote(`,')dnl
This is a backquote: _lq

This is a backquote: `

What exactly does it do? It first of all changes the quote characters to be { and }, and then defines the macro __lq to be a left quote (this definition is now possible since the left quote is no longer special) and the macro __rq to be a right quote. After that, it defines the macro _lq to change the quote characters to be { and }, call __lq and then change the quote characters to ` and ' again. It similarly defines _rq. Finally, it defines the quote characters back to ` and '. When the macro _lq is called after that, it temporarily changes the quote characters to { and } and during that time calls the __lq macro, which produces a backquote.

Now why can't we do without _lq and _rq altogether and simply call __lq (which is, after all, defined to be a backquote)? Because even though the backquote was not special when the __lq macro was defined, it has not been made inactive. If it is special when the macro is called, the backquote will have whatever special effect it is supposed to have. So if you replace the last _lq in the example above by __lq, you do not obtain the desired effect.

This kind of thing is not completely absent in SIMPLE: we have seen something similar earlier. In fact, SIMPLE's quotation mechanism is very similar to m4's (but with the additional QUOTE_NEXT (#) token - and also with the difference that it is not possible to change what tokens produce quotation, although it is possible to change what characters produce those tokens). However, SIMPLE has another level that m4 does not have, namely escaping. Once a character has been escaped in SIMPLE, it becomes forever an ordinary token.

The nearest (almost line-for-line) equivalent to the above m4 program in SIMPLE would be:

<cartype_ordinary|``><cartype_escape_next|\>%
<def|__bq|`>%
<def|_bq|[<cartype_ordinary|`><cartype_escape_next|\\>%
<__bq><cartype_ordinary|\\><cartype_escape_next|`>]>%
<cartype_ordinary|\\><cartype_escape_next|`>%
This is a backquote: <_bq>

This is a backquote: `

However, with SIMPLE, this is unnecessarily complicated. For one thing, contrary to what happens with m4, here <__bq> will work just as well as (nay, better than) <_bq>. So <_bq> can be dispensed with altogether. In fact, the cartype macros can also be dispensed with, and we can write:

<def|__bq|``>%
This is a backquote: <__bq>

This is a backquote: `

Of course, we can do even simpler:

This is a backquote: ``

This is a backquote: `

Now let us go further in the hair-splitting and suppose we wanted to mimic the above m4 program at the quotation level in SIMPLE rather than at the escaping level. This is totally useless of course, but let us proceed anyway. The difficulty is that SIMPLE lacks the changequote macro that m4 has (the quote tokens are always OPEN_QUOTE and CLOSE_QUOTE). But we can still get around somewhat:

<def|__oq|#[><def|__cq|#]>%
<def|_oq|[<qdefof|__oq>]><def|_cq|[<qdefof|__cq>]>%
This is an open quote token (in input form): <inputform|<_oq>>

This is an open quote token (in input form): [

Here we take advantage of the <qdefof> builtin, which takes a user macro name and expands to the quoted definition of that macro. Thus, the <__oq> macro is defined as [ (the open quote token), and the <_oq> macro expands as the quoted definition of the <__oq> macro, thus as #[, which evaluates to [, and sequuntur sequentia.

Of course, these quotation subtleties are not the only difference between m4 and SIMPLE. Another one is the way macros are called: whereas SIMPLE uses beginning-of-call and end-of-call tokens, m4 macro calls are unadorned. The advantage of the m4 method, of course, is that it makes things shorter. The advantage of the SIMPLE method is that macro calls are well delimited, and no unpleasant little `' will have to be added here or there. Also, parameter treatment is more uniform and perhaps less confusing than the m4 way of doing things, in which cutemacro() is a macro call with one parameter.

Finally, the stocks of builtin macros of SIMPLE and m4 are quite different. In particular, SIMPLE does not have all the cool post-processing features that m4 has (I refer in particular to the divert family of macros). It is the author's intention, however, to eventually add those features to SIMPLE.

However, on the whole, m4 and SIMPLE are very much similar after all. For one thing, the author learned all he knows about writing macro processors by groveling over the m4 sources (although no part of the m4 sources has been borrowed into simple).

If you liked the foobar example earlier, here it is in m4:

Compare this:
define(`_test',`foo')dnl
define(`_foo',_test)dnl
define(`_bar',`_test')dnl
define(`_test',`bar')dnl
_foo (this should be ```foo''')
_bar (this should be ```bar''')

Compare this:
foo (this should be ``foo'')
bar (this should be ``bar'')

(here it is not possible to call the macros foo and bar, because in m4 macros interact unpleasantly with ``ordinary strings'').

Here is the final treat for this section:
(in m4)

define(double_bubble,`$1(`$1')')dnl
double_bubble(`double_bubble')

(in SIMPLE)

<def|double_bubble|[<@1@|@1@>]>%
<double_bubble|double_bubble>

Differences between SIMPLE and Scheme

Another language which SIMPLE resembles somewhat is Scheme. There are, however, some important differences which might cause confusion.

Probably the most important difference is that SIMPLE doesn't really have variables, only things that would be more appropriately described (at least in a Scheme-oriented terminology) as functions. Since it has only one data type (token strings), SIMPLE does not distinguish the number 4, the string "4", the function returning the number 4, or the function returning the string "4". After something like <def|a|[4]>, the ``variable'' a is either of them. So we say that SIMPLE only has functions.

Now consider writing <def|a|[4]> and then calling <a> in SIMPLE, as opposed to writing (define (a) 4) (which really means (define a (lambda () 4))) and then calling (a) in Scheme. In both cases, a is defined as the function returning 4. The way the call works, however, is subtly different: in Scheme, a evaluates to the function returning 4 (that is, (lambda () 4)), and (a) applies this function. In SIMPLE, however, a alone doesn't evaluate to anything: it is just a dumb string, namely "a". It is <a> that will actually look a up, find its value (namely 4), perform parameter substitution / macro expansion, and then reevaluate the result.

There is another difference in the example given above: in Scheme, given (define a (lambda () 4)), the a is not evaluated, for otherwise it would cause an undefined symbol error (compare with setq in Common Lisp, as opposed to set). This is because define is a syntactic form rather than an ordinary function. In SIMPLE, however, there is nothing special about the <def> builtin, and its first argument gets evaluated (just as its zeroth and second arguments); but it is simply the string a, so it doesn't wreak any havoc. In particular, it is possible in SIMPLE to write something like <def|name|a><def|<name>|4>, which is not possible in Scheme (at least not directly). Actually, it is possible to do this in Scheme provided an eval function is available, by doing (define (name) 'a) (eval `(define ,(name) (lambda () 4))), which closely mimics the above SIMPLE fragment.

Similarly, the second argument to lambda does not get evaluated in Scheme, which means that it is not possible in Scheme (at least not directly!) to define a function a which returns, say, the value of the function b applied with no arguments, at define time (as opposed to recalculating it each time). In SIMPLE this would be <def|a|<b>> as opposed to <def|a|[<b>]> (see also this example). Actually, it is possible to get around things in Scheme provided an eval function is available, by doing (define a (eval `(lambda () ,(b)))). Of course, the really rational thing to do would be to write (define a (b)), but we are insisting on using functions because that is the only thing that SIMPLE has.

Just to insist heavily, here is the foobar example again, translated into Scheme:

(display "Compare this:") (newline)
(define (test) "foo")
(define foo (eval `(lambda () ,(test))))
(define bar (lambda () (test)))
(define (test) "bar")
(display (foo)) (display " (this should be ``foo'')") (newline)
(display (bar)) (display " (this should be ``bar'')") (newline)

Compare this:
foo (this should be ``foo'')
bar (this should be ``bar'')

The fact that the result of any macro expansion is automatically reevaluated has been used as an argument against macro preprocessors, because it imposes an undue amount of quoting for things to work properly. This is not untrue, but the problem is not due to the reevaluation of macro expansion, but rather to the absence of variables. That is, writing in Scheme with only function variables is comparable to writing in SIMPLE. The evaluation steps are actually rather similar: first, the arguments (including the zeroth argument, the actual function) are evaluated; then, in Scheme, the zeroth argument is checked to be a function, whereas in SIMPLE it is checked to be the name of a function; after that, parameters are substituted for their value (in Scheme, these values are bound to the parameter names in the function's context, whereas in SIMPLE the @-strings are replaced by the value of the parameters, generally with quotation); the last step is the actual function call in Scheme, and the way SIMPLE simulates that is by reevaluating the result of the parameter substitution. That is a perfectly normal and necessary step. In Scheme also, if you want a function that will return (quit) without calling it, you must write (define a (lambda () (quote (quit)))), thus using two levels of quotation (the lambda and the quote), just like <def|a|[[<quit>]]> in SIMPLE.

Reference section

SIMPLE's algorithm

There are three steps in the SIMPLE preprocessing process: the lexer step, the expansion step and the output step.

Lexer

The lexer step reads the characters from the input files and produces tokens. There are five possible character types: ordinary characters, escape-next characters, escape-string characters, comment characters, and active characters. Each active character also has an associated list of tokens.

In normal functioning mode, the lexer will produce one ordinary character token for every ordinary character it reads. If it reads an escape-next character, it switches to escape-next mode; if it reads an escape-string character, it switches to escape-string mode; if it reads a comment character, it switches to comment mode; if it reads an active character, it looks up the corresponding token list, sets a pointer to the beginning of that list and enters active mode. Finally, an end of (last input) file causes the lexer to emit an EOF token. In escape-next mode, any character produces the corresponding character token, but also causes the lexer to quit escape-next mode. In escape-string mode, any character produces the corresponding character token, except escape-next characters and escape-string characters: the latter cause the lexer to return to normal mode, while the former cause it to enter the special escape-next-within-escape-string mode, which functions exactly as escape-next mode except that the lexer returns to escape-string mode afterwards (rather than normal mode). In comment mode, no character produces a token; the end-of-line character causes the lexer to return to normal mode. Finally, in active mode, the lexer does not swallow any character; it produces the next output token in the token list of the active character, and remains in active mode if there still are tokens to be produced, otherwise returns to normal mode.

Input files are organized on a stack, the input stack. When one of them comes to an end, it is popped from the input stack, and characters are then read from the next one down the stack. This is true whatever mode the lexer was in: thus, it is possible to have a comment start in one file and continue in the next (if the first file does not end with an end-of-line character). Or it is possible for an escape-next character at the end of one file to escape the first character of the next file. All this is strongly deprecated, however. Pushing input files on the input stack is done either through simple's command line, or using the <include> builtin.

When the last file on the input stack ends, the lexer is fed an EOF character. In normal mode, escape-next mode, escape-string mode, or escape-next-within-escape-string mode, this produces an EOF token. In comment mode, this is an error (the very lame ``reason'' is that this means you are trying to comment, therefore ignore, the EOF, which is illegal). Of course, EOF cannot be read in active mode.
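The mode switching described in this section can be sketched as a small state machine in Python. This is an illustration under stated assumptions rather than simple's actual code: it lexes a single string instead of an input stack, it emits the whole token list of an active character at once instead of keeping a pointer in a separate active mode, ordinary character tokens are represented by one-character strings, and the names Tok and Lexer are invented for the example.

```python
from enum import Enum, auto


class Tok(Enum):
    BEGIN_COMMAND = auto()
    END_COMMAND = auto()
    NEXT_PARAM = auto()
    OPEN_QUOTE = auto()
    CLOSE_QUOTE = auto()
    QUOTE_NEXT = auto()
    AT_SIGN = auto()
    EOF = auto()


class Lexer:
    def __init__(self):
        # Initial configuration: three lexer-only characters, seven active ones.
        self.escape_next = {"`"}
        self.escape_string = {'"'}
        self.comment = {"%"}
        self.active = {
            "<": [Tok.BEGIN_COMMAND], ">": [Tok.END_COMMAND],
            "|": [Tok.NEXT_PARAM], "[": [Tok.OPEN_QUOTE],
            "]": [Tok.CLOSE_QUOTE], "#": [Tok.QUOTE_NEXT],
            "@": [Tok.AT_SIGN],
        }

    def lex(self, text):
        """Yield the tokens for one input string, followed by EOF."""
        mode = "normal"
        for ch in text:
            if mode == "escape-next":
                yield ch                      # any character becomes ordinary
                mode = "normal"
            elif mode == "escape-next-in-string":
                yield ch
                mode = "escape-string"        # back to the string, not to normal
            elif mode == "escape-string":
                if ch in self.escape_string:
                    mode = "normal"           # closing escape-string character
                elif ch in self.escape_next:
                    mode = "escape-next-in-string"
                else:
                    yield ch
            elif mode == "comment":
                if ch == "\n":                # the end of line is swallowed too
                    mode = "normal"
            elif ch in self.escape_next:
                mode = "escape-next"
            elif ch in self.escape_string:
                mode = "escape-string"
            elif ch in self.comment:
                mode = "comment"
            elif ch in self.active:
                yield from self.active[ch]    # stands in for active mode
            else:
                yield ch
        if mode == "comment":
            raise ValueError("EOF inside a comment")
        yield Tok.EOF
```

For instance, lexing the six characters <a`<> produces BEGIN_COMMAND, the character token a, the character token <, END_COMMAND and then EOF: the escaped < is ordinary while the unescaped one is not.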

Expansion step

The macro interpreter reads the tokens that the lexer feeds it with.

Let us start with ordinary evaluation. It reads one token: if that token is a regular token it gets shipped out. If it is a QUOTE_NEXT token, the next token gets shipped out (but not the QUOTE_NEXT token itself). If it is OPEN_QUOTE, the following tokens, up to the next balanced CLOSE_QUOTE, all get shipped out and are not interpreted, except QUOTE_NEXT which still has the same effect. Neither the OPEN_QUOTE nor the corresponding CLOSE_QUOTE token gets shipped out, however. It is invalid for an END_COMMAND, NEXT_PARAM or CLOSE_QUOTE to be found during ordinary expansion (``Misplaced special token''). Along the same logic, I guess it should also be invalid for the AT_SIGN token to be found, but it actually turns out to be much more practical if it is allowed (a lot of macro definitions don't have to be quoted), so it is. When a BEGIN_COMMAND token is read, parameter evaluation begins (but the token itself is lost):

Parameter evaluation proceeds in exactly the same way as ordinary expansion (described above - indeed it is performed by the same procedure) with two notable differences: one, the tokens resulting from the expansion do not go to the output stream, but rather to the argument token lists, starting with argument list 0. Two, the tokens NEXT_PARAM and END_COMMAND are no longer invalid. When a NEXT_PARAM token is read, the current argument list is concluded and another one is started (it starts empty) to which the following tokens will be expanded (the NEXT_PARAM token itself does not go anywhere). When an END_COMMAND token is read, macro expansion is performed (described in the next paragraph). Note that if during parameter evaluation a BEGIN_COMMAND token is found, another parameter evaluation is started (recursively).

Macro expansion decodes argument 0 of the argument list, the command name. It is illegal for that argument to contain any special token, or the character 0 (because of the way character strings are coded in C). Any other character is valid (but using a space in a macro name is vigorously deprecated). When the command name is read, it is matched against the known command names: builtins first and user-defined later. If it does not match anything, an error is produced. Otherwise, the actual expansion is performed, based on whatever parameters (arguments) were read. Builtins and user expansion are somewhat different: builtins may have ``magical effects'' such as changing global variables, whereas user macros cannot (at least not directly). Both builtins and user macros, however, have an expansion value. The important thing to note is that this expansion value will not be shipped to the output but will be added back to the input, before the tokens produced by the lexer, and before any other ``extra input'' of that kind which may have been previously produced. Thus, after a macro is expanded, the results of this expansion are reread, and for example if they contain BEGIN_COMMAND tokens they may lead to further macro expansions. This is how we ensure that the result of macro expansion is reevaluated.
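The three paragraphs above (ordinary evaluation, parameter evaluation, and pushing the expansion back onto the input) can be sketched together in Python. This is a simplified model and not simple's implementation: it uses an explicit stack of open calls instead of recursion, ordinary character tokens are one-character strings, the lexer is replaced by a trivial tokens_of helper without escapes or comments, and the macro table in demo_expand (with a made-up <hello> macro) is purely hypothetical.

```python
from collections import deque
from enum import Enum, auto


class Tok(Enum):
    BEGIN_COMMAND = auto()
    END_COMMAND = auto()
    NEXT_PARAM = auto()
    OPEN_QUOTE = auto()
    CLOSE_QUOTE = auto()
    QUOTE_NEXT = auto()
    AT_SIGN = auto()
    EOF = auto()


SPECIAL = {"<": Tok.BEGIN_COMMAND, ">": Tok.END_COMMAND, "|": Tok.NEXT_PARAM,
           "[": Tok.OPEN_QUOTE, "]": Tok.CLOSE_QUOTE, "#": Tok.QUOTE_NEXT,
           "@": Tok.AT_SIGN}


def tokens_of(text):
    """Stand-in for the lexer: no escapes or comments, EOF appended."""
    return [SPECIAL.get(ch, ch) for ch in text] + [Tok.EOF]


def evaluate(tokens, expand):
    """Run the expansion step; expand(name, params) returns a token list."""
    pending = deque(tokens)   # input; expansions are pushed back at the front
    output = []               # what ordinary evaluation ships out
    calls = []                # one list of argument lists per open macro call

    def ship(tok):
        # Parameter evaluation collects into the current argument instead.
        (calls[-1][-1] if calls else output).append(tok)

    def next_quoted():
        tok = pending.popleft()
        if tok is Tok.EOF:
            raise ValueError("Unterminated quotation")
        return tok

    while True:
        tok = pending.popleft()
        if tok is Tok.EOF:
            if calls:
                raise ValueError("Unterminated command")
            return output
        if tok is Tok.QUOTE_NEXT:
            ship(next_quoted())
        elif tok is Tok.OPEN_QUOTE:
            depth = 1
            while depth:
                tok = next_quoted()
                if tok is Tok.QUOTE_NEXT:     # still active inside brackets
                    ship(next_quoted())
                    continue
                if tok is Tok.OPEN_QUOTE:
                    depth += 1
                elif tok is Tok.CLOSE_QUOTE:
                    depth -= 1
                if depth:                     # the outermost ] is not shipped
                    ship(tok)
        elif tok is Tok.BEGIN_COMMAND:
            calls.append([[]])                # start argument list 0
        elif tok is Tok.NEXT_PARAM and calls:
            calls[-1].append([])              # start the next argument
        elif tok is Tok.END_COMMAND and calls:
            args = calls.pop()
            if any(isinstance(t, Tok) for t in args[0]):
                raise ValueError("special token in command name")
            result = expand("".join(args[0]), args[1:])
            pending.extendleft(reversed(result))   # reread the expansion
        elif tok in (Tok.END_COMMAND, Tok.NEXT_PARAM, Tok.CLOSE_QUOTE):
            raise ValueError("Misplaced special token")
        else:
            ship(tok)                         # ordinary character or AT_SIGN


def demo_expand(name, params):
    # Hypothetical macro table: <id> as documented, <hello> made up.
    if name == "id":
        return params[0] if params else []
    if name == "hello":
        return list("Hello!")
    raise ValueError("unknown macro: " + name)
```

With this sketch, evaluating the tokens of <id|[<hello>]> yields Hello!: the brackets protect <hello> during parameter evaluation, <id> expands to its first argument, and that expansion is reread from the front of the input, where the <hello> call is finally performed.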

There isn't much to be said about expansion of builtins: it must be described individually for each builtin. Note well that each builtin has an expansion value and an effect, and either one (or both) may be nil. The <id> builtin for example has an expansion but no effect whereas the <out> builtin has an effect but no expansion.

User-defined macros on the other hand never have an effect. They only have an expansion, which is normally simply equal to their definition: the expansion of a user-defined macro is normally simply obtained by copying the definition. However, there is an exception to this (and an important one indeed), namely argument replacement. An argument replacement is triggered by the appearance of the AT_SIGN token in the definition string of the macro. When that token is found, tokens are read until the next AT_SIGN token and must constitute an ordinary string (i.e. contain no special token). Note that the appearance of a QUOTE_NEXT token in the definition string of a macro has no influence whatsoever on the process. It will not make the AT_SIGN any less active or less special. (However, of course, the @ regular character token is in no way special, but that is another matter.) What actually goes to the expansion string depends on what was found between the AT_SIGN tokens:

If there is no token between the two AT_SIGN tokens, one (single) AT_SIGN is appended to the expansion string.
If there is a single question mark (the ordinary question mark regular character token, that is) between the two AT_SIGN tokens, the number of parameters, in decimal, is appended to the expansion string. This number does not take into account argument zero (the name of the macro). Thus, it can be 0.
If there is an integer number between the two AT_SIGN tokens, the corresponding argument string is appended to the expansion string. Argument string 0 is the name of the macro itself. Arguments 1 and on are the actual parameters. A number greater than the number of arguments will not cause an error but will just produce the empty string (in other words, the expansion string remains as it was). Negative numbers count from the end, -1 referring to the last argument (and of course negative numbers that are too low just yield nothing).
If the string between the two AT_SIGN tokens is a number preceded by a period (dot), then each consecutive argument, starting from the one having the given number and going up to the last one, is appended to the expansion string, each argument being preceded by a NEXT_PARAM token (even the first one). If the period is alone and not followed by a number then that counts as one, so that all the parameters (except the function name itself) get appended to the expansion string, each one preceded by a NEXT_PARAM token. Note that if you are thinking of using this, you are probably wrong, unless you know exactly what you are doing; you more probably need a semicolon (see below).
If the string between the two AT_SIGN tokens is a number preceded by a comma, then the effect is the same as the number alone, except that the argument is not just copied to the expansion string but is copied with every token preceded by a QUOTE_NEXT token. In other words, something like @,3@ means ``the third argument, quoted''. This is frequently useful when the parameter should be passed to another macro, to prevent one evaluation too many. When the number is preceded by several commas, as many levels of quotation are produced as there are commas.
If the string between the two AT_SIGN tokens starts with a semicolon, then the effect is like the corresponding string with a period (see above) except that each parameter gets quoted as described above for the comma construction. Note however that the NEXT_PARAM tokens that separate the parameters do not get quoted.
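The replacement rules listed above can be condensed into a short Python sketch. It works on plain strings in which @, # and | stand for the AT_SIGN, QUOTE_NEXT and NEXT_PARAM tokens, which glosses over the distinction between tokens and characters. The function names are invented, and two behaviours are assumptions not settled by the text: a negative start for the period and semicolon forms is clamped to argument 0, and any other string between the @ signs is rejected.

```python
import re


def quote(tokens: str, levels: int = 1) -> str:
    """Precede every token by '#', once per requested level."""
    for _ in range(levels):
        tokens = "".join("#" + t for t in tokens)
    return tokens


def pick(args, n):
    """Argument n; negative n counts from the end; out of range gives ''."""
    if n < 0:
        n += len(args)
    return args[n] if 0 <= n < len(args) else ""


def expand_spec(spec, args):
    """Expansion of one @spec@ sequence; args[0] is the macro name."""
    if spec == "":
        return "@"                          # @@ contracts to one @
    if spec == "?":
        return str(len(args) - 1)           # the macro name is not counted
    m = re.fullmatch(r"(,*)(-?\d+)", spec)
    if m:                                   # @n@, @,n@, @,,n@, ...
        return quote(pick(args, int(m.group(2))), len(m.group(1)))
    m = re.fullmatch(r"([.;])(-?\d+)?", spec)
    if m:                                   # @.n@ (raw) and @;n@ (quoted)
        n = int(m.group(2)) if m.group(2) else 1
        start = max(n + len(args), 0) if n < 0 else n   # clamping is assumed
        levels = 1 if m.group(1) == ";" else 0
        return "".join("|" + quote(a, levels) for a in args[start:])
    raise ValueError("unrecognized argument reference: @" + spec + "@")


def substitute(definition, args):
    """Copy the definition, replacing every @...@ sequence."""
    parts = definition.split("@")
    if len(parts) % 2 == 0:
        raise ValueError("unpaired @ in definition")
    # Even-numbered pieces are literal text, odd-numbered ones are specs.
    return "".join(part if i % 2 == 0 else expand_spec(part, args)
                   for i, part in enumerate(parts))
```

As a check against the tutorial, substitute("<def|second|[@,,1@ and @@,1@@]>", ["first", "Romeo"]) gives <def|second|[###R###o###m###e###o and @,1@]>, which is the expansion of <first|Romeo> worked out there.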

We now say a word about the EOF token: it is invalid for this token to be found in any other operation than ordinary expansion. It is impossible (except for an internal error, of course :-) for this token to appear in the definition of a macro or a parameter of a well-completed call, so that the question is irrelevant at that point. But it can appear while expanding arguments, and that is an error (``Unterminated command''). Similarly, it is an error for that token to appear within a scan-for-CLOSE_QUOTE, or just after a QUOTE_NEXT (``Unterminated quotation'').

Output step

For the time being, the output step is extremely simple: regular character tokens get printed on the standard output. An EOF token terminates the session and flushes the output. Any other token, that is, any special token, is an error (``Non-output token found in output'').

Builtin macros

`<id>`

The <id> builtin can take any number of arguments. When called with no arguments, it does nothing and evaluates to nothing. When called with at least one argument, it evaluates to the first argument.

Note in particular that the first argument to <id> will be evaluated twice; indeed, arguments to macros are always evaluated, and the result of the macro call (here, the first argument again) is re-evaluated. So, in a way, <id> is the ``opposite'' of the quotes [ and ] (or #), and it corresponds more to some ``eval'' kind of function than to the identity.

`<void>`

The <void> builtin can take any number of arguments. In all cases, it does nothing and evaluates to nothing.

Note that even though all arguments are discarded, they are still evaluated, as arguments always are (except of course that quotes may prevent this). Thus, the <void> builtin can be useful, given a macro which does something and evaluates to something, to perform the macro's action while throwing the evaluation away.

`<out>`

The <out> builtin can take any number of arguments. When called with no arguments, it does nothing and evaluates to nothing. When called with at least one argument, it outputs the first argument on the output stream (that argument had better not contain any special tokens) and evaluates to nothing.

Thus, the <out> builtin bypasses all normal evaluation and puts a string directly on the output stream as soon as the builtin is evaluated.

The following example illustrates the difference between <id> and <out>:

<out|This text gets printed.>
<id|So does this one.>
<void|<out|This text also gets printed.>>
<void|<id|This one doesn't, however.>>
<def|double|@1@@1@>%
<double|<out|This text gets printed once.>>
<double|<id|This one, twice.>>

This text gets printed.
So does this one.
This text also gets printed.

This text gets printed once.
This one, twice.This one, twice.

`<def>`

The <def> builtin takes exactly two arguments; the first is a character string specifying which macro is to be defined, and the second is the definition to give the macro. If the macro was not already defined, it is created; if the macro already existed, the old definition is forgotten. The <def> builtin itself evaluates to nothing (and not to the definition just given).

Trying to redefine a builtin is an error.

Note that token strings such as @1@ and the like are interpreted not upon macro definition but upon macro call.

We note that one will generally want to quote the second argument to <def> unless one wants the macro calls it contains to be performed at the moment of definition (or, of course, unless the definition is just text, which is quite common). The first argument to <def> is just text, so it doesn't have to be quoted (though that doesn't hurt, naturally).

`<include>`

The <include> builtin takes exactly one argument. That argument is a character string that designates a file that is pushed on the input stack. Thus, in effect, it is included at that point in the current file. The current macro expansion is not affected. This builtin expands to nothing.

`<>`

The <> (empty name) builtin takes any number of arguments. With no arguments, it does nothing and expands to nothing. With at least one argument, it calls the macro whose name is given by the first argument, with parameters given by the subsequent arguments. That is, it explicitly demands the expansion of this macro. The <> builtin itself expands to nothing (the called macro directly places its result on the output).

`<quote>`

The <quote> builtin takes one argument, and evaluates to that argument with a QUOTE_NEXT token inserted before every token. In other words, <quote> returns its first argument, quoted. This quoting will just suffice to counter the fact that the macro's result (whether passed as argument to another macro, or directly on the input stream) gets evaluated again. So <quote> actually provides an effective identity function! And remember also that the parameter you pass to the <quote> builtin will also be evaluated - but you can prevent this by putting it between brackets.

`<dquote>`

The <dquote> builtin does twice the effect of the <quote> builtin: in other words, it puts three QUOTE_NEXT tokens before each token of its parameter. So the argument gets quoted twice, but in effect only once because one level of quote will (nearly) always disappear somehow.
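
The count of three may look surprising at first, so here is a small Python model of it. It is a sketch under the assumption that <dquote> behaves exactly like <quote> applied twice, which is what the description suggests; the function names and the string used for QUOTE_NEXT are invented for the illustration.

```python
QUOTE_NEXT = "QUOTE_NEXT"   # stand-in for the special token

def quote(tokens):
    """Insert a QUOTE_NEXT token before every token."""
    out = []
    for tok in tokens:
        out += [QUOTE_NEXT, tok]
    return out

def dquote(tokens):
    """Quote twice: the second pass also quotes the first pass's QUOTE_NEXT."""
    return quote(quote(tokens))

def quotes_before_first_token(tokens):
    """Number of QUOTE_NEXT tokens at the front of a token list."""
    count = 0
    while count < len(tokens) and tokens[count] == QUOTE_NEXT:
        count += 1
    return count

print(quote(["a", "b"]))
print(quotes_before_first_token(dquote(["a", "b"])))
```

In this model each extra level doubles the tokens already present and adds one, so n levels leave 2^n - 1 QUOTE_NEXT tokens in front of each original token: one for <quote>, three for <dquote>.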

`<inputform>`

The <inputform> builtin takes one argument, and replaces it by one possible input form which would produce that effect, under the standard SIMPLE lexer setup. In other words, a QUOTE_NEXT token (for example) gets replaced by a # character, whereas a # character gets replaced by the two characters `#, and so on (the " character is not used: ` is used for escaping all special characters).

For example,

<out|DEBUG: <inputform|@,1@>
>

might be a good thing to have at the beginning of a macro...

`<if>`

The <if> builtin takes a number of arguments that is not congruent to two modulo three (that is, any natural number is fine except 2, 5, 8, 11, 14 and so on). It reads these arguments in groups of three: if the first two are equal (as lists of tokens) then it evaluates to the third; otherwise, if the fourth and the fifth are equal then it evaluates to the sixth; otherwise, if the seventh and the eighth are equal then it evaluates to the ninth; and so on. If the number of arguments is congruent to one modulo three then upon doing this, there is one argument left over and that is the expansion when all tests fail; otherwise, the expansion when all tests fail is empty.

In other words, <if> provides the analog of an ``if... then... else if... else if... ... else...'' construction. Here is an example:

<def|duck|[<if|@1@|1|one|@1@|2|two|@1@|3|three|infinity>]>%
<duck|1>
<duck|2>
<duck|3>
<duck|4>

one
two
three
infinity

Note that one will almost always have to quote the conditional expansion strings (the ``then'' parameters) since one wants them evaluated only when the particular ``if'' case is realized. See this section for more information on this subject.
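
The way <if> walks through its arguments can be sketched in Python as follows. This models only the selection rule, on arguments that have already been evaluated; the function name simple_if is invented, and raising ValueError for a bad argument count is an assumption (the text does not say which error simple reports).

```python
def simple_if(*args):
    """Pick the expansion the way <if> does, three arguments at a time."""
    if len(args) % 3 == 2:
        # Assumption: the actual error message of simple is not documented here.
        raise ValueError("<if>: number of arguments congruent to 2 modulo 3")
    i = 0
    while i + 2 < len(args):
        if args[i] == args[i + 1]:   # the test of this group succeeds
            return args[i + 2]       # ... so its third member is the result
        i += 3
    # One argument left over is the ``else'' part; otherwise the result is empty.
    return args[i] if i < len(args) else ""

def duck(x):
    """Python counterpart of the <duck> macro defined above."""
    return simple_if(x, "1", "one", x, "2", "two", x, "3", "three", "infinity")

for n in ["1", "2", "3", "4"]:
    print(duck(n))
```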

`<defof>`

The <defof> builtin takes a user macro name and returns its definition, unquoted. This is about the same as actually calling the macro, except that macro parameter substitution is not performed: in other words, the resulting string will still contain @1@ and similar strings in the appropriate place.

If you are thinking of using this macro, what you really need is probably the following one:

`<qdefof>`

The <qdefof> builtin takes a user macro name and returns its definition, quoted. This is mainly useful for two things. One, if <mymacro> is a user-defined macro, and one wants to define <yourmacro> to the same thing, one can do

<def|yourmacro|<qdefof|mymacro>>

Two, if one wants to query the definition of a user-defined macro <mymacro>, the simple way is to do

<out|<inputform|<qdefof|mymacro>>>

(of course, unless you're using this inside a macro, you don't really need the <out>).

`<quit>`

The <quit> builtin takes any number of arguments and quits simple. It does this by sending an EOF token directly to the output.

`<error>`

The <error> builtin forces an error to be produced. If it is called with at least one argument, the first argument gives the error message.

`<head>`

The <head> builtin expects one argument, and that argument should not be empty. It returns the first token of its argument.

`<tail>`

The <tail> builtin expects one argument, and that argument should not be empty. It returns its argument with the first token removed.

`<ahead>`

The <ahead> builtin expects one argument, and that argument should not be empty. It returns the last token of its argument.

`<atail>`

The <atail> builtin expects one argument, and that argument should not be empty. It returns its argument with the last token removed.

`<len>`

The <len> builtin expects one argument, and returns the length of that argument (as a decimal number).
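
As a rough picture of how these five builtins fit together, here is a Python model in which a token string is represented by an ordinary Python string, one character per token. The function names mirror the builtins (with length standing in for <len>); raising ValueError on an empty argument is an assumption, since the text only says that the argument should not be empty.

```python
def _nonempty(tokens):
    if not tokens:
        # Assumption: what simple actually does with an empty argument is not stated.
        raise ValueError("argument should not be empty")
    return tokens

def head(tokens):
    return _nonempty(tokens)[0]      # first token

def tail(tokens):
    return _nonempty(tokens)[1:]     # everything but the first token

def ahead(tokens):
    return _nonempty(tokens)[-1]     # last token

def atail(tokens):
    return _nonempty(tokens)[:-1]    # everything but the last token

def length(tokens):
    return str(len(tokens))          # <len> returns a decimal number

s = "abc"
print(head(s), tail(s), ahead(s), atail(s), length(s))
```

In this model head followed by tail, or atail followed by ahead, always reassembles the original string, which is what makes the four builtins convenient for walking through an argument token by token.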

`<push>`

The <push> builtin expects at least one argument, a character string. It pushes all the other arguments on the token string stack named by the first argument.

Token string stacks live in a separate name space from ordinary variables. They cannot be accessed directly but only through builtins such as this one. They are automatically created on demand, and they are always created empty. It is quite legal for a stack to have an empty name, and that is even recommended if only one stack is being used. For example, <push||Peter|Paul> pushes the strings Peter and Paul on ``the'' stack.

`<last>`

The <last> builtin expects exactly one argument, a character string. It returns the token string on the top of the named token string stack (see the <push> builtin for a description of token string stacks).

`<pop>`

The <pop> builtin expects exactly one argument, a character string. It removes the token string on the top of the named token string stack (see the <push> builtin for a description of token string stacks).

`<poplast>`

The <poplast> builtin expects exactly one argument, a character string. It returns and removes the token string on the top of the named token string stack (see the <push> builtin for a description of token string stacks).

`<depth>`

The <depth> builtin expects exactly one argument, a character string. It returns the depth of the named token string stack (see the <push> builtin for a description of token string stacks).
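
The five stack builtins can be pictured with the following Python model. It is a sketch only: the dictionary named stacks and the function names are invented, the assumption is made that <push> pushes its arguments from left to right (so the last one ends up on top), and nothing in the text says what simple does when <last>, <pop> or <poplast> meets an empty stack (here Python simply raises IndexError).

```python
from collections import defaultdict

# One list per stack name; a stack springs into existence, empty, on first use.
stacks = defaultdict(list)

def push(name, *items):
    stacks[name].extend(items)       # assumption: pushed left to right

def last(name):
    return stacks[name][-1]          # look at the top without removing it

def pop(name):
    del stacks[name][-1]             # remove the top, return nothing

def poplast(name):
    return stacks[name].pop()        # return the top and remove it

def depth(name):
    return str(len(stacks[name]))

push("", "Peter", "Paul")            # like <push||Peter|Paul>
print(depth(""), last(""))
print(poplast(""), depth(""))
```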

Arithmetical operators

The <+> builtin expects any number of arguments, which must be character strings representing integers, and it computes their sum.

The <-> builtin expects two arguments, which must be character strings representing integers, and it computes the difference of the first argument minus the second.

The <*> builtin expects any number of arguments, which must be character strings representing integers, and it computes their product.

The <div> builtin expects two arguments, which must be character strings representing integers, the second one non zero, and it computes the quotient of the first argument by the second.

The <mod> builtin expects two arguments, which must be character strings representing integers, the second one non zero, and it computes the remainder of the division of the first argument by the second.

The <eq> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if they are equal (as integers), 0 otherwise.

The <neq> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if they are different (as integers), 0 otherwise.

The <gt> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if the first is greater than the second, 0 otherwise.

The <ge> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if the first is greater than or equal to the second, 0 otherwise.

The <lt> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if the first is less than the second, 0 otherwise.

The <le> builtin expects two arguments, which must be character strings representing integers, and it returns 1 if the first is less than or equal to the second, 0 otherwise.

The <and> builtin expects any number of arguments, which must be character strings representing integers, and it returns 1 if they are all non zero, 0 otherwise.

The <or> builtin expects any number of arguments, which must be character strings representing integers, and it returns 1 if at least one of them is non zero, 0 otherwise.

The <not> builtin expects one argument, which must be a character string representing an integer, and it returns 1 if that integer is zero, 0 otherwise.

The <band> builtin expects any number of arguments, which must be character strings representing integers, and it returns the bitwise AND of these integers.

The <bor> builtin expects any number of arguments, which must be character strings representing integers, and it returns the bitwise OR of these integers.

The <bnot> builtin expects one argument, which must be a character string representing an integer, and it returns the bitwise NOT of that integer.
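
The whole family can be summarised by a small Python table. This is a model of the descriptions above and not of the C code: the names BUILTINS and call are invented, and several details are assumptions because the text does not settle them, namely the result of <+>, <*>, <band> and <bor> with no arguments, the sign of <div> and <mod> results for negative operands (Python's floor division is used here), and the absence of any limit on the size of integers.

```python
import operator
from functools import reduce

def _ints(args):
    return [int(a) for a in args]    # arguments are strings representing integers

BUILTINS = {
    "+":    lambda *a: sum(_ints(a)),
    "-":    lambda a, b: int(a) - int(b),
    "*":    lambda *a: reduce(operator.mul, _ints(a), 1),
    "div":  lambda a, b: int(a) // int(b),   # assumption for negative operands
    "mod":  lambda a, b: int(a) % int(b),    # assumption for negative operands
    "eq":   lambda a, b: int(int(a) == int(b)),
    "neq":  lambda a, b: int(int(a) != int(b)),
    "gt":   lambda a, b: int(int(a) > int(b)),
    "ge":   lambda a, b: int(int(a) >= int(b)),
    "lt":   lambda a, b: int(int(a) < int(b)),
    "le":   lambda a, b: int(int(a) <= int(b)),
    "and":  lambda *a: int(all(_ints(a))),
    "or":   lambda *a: int(any(_ints(a))),
    "not":  lambda a: int(int(a) == 0),
    "band": lambda *a: reduce(operator.and_, _ints(a), -1),
    "bor":  lambda *a: reduce(operator.or_, _ints(a), 0),
    "bnot": lambda a: ~int(a),
}

def call(name, *args):
    """Apply a builtin to string arguments and return a decimal string."""
    return str(BUILTINS[name](*args))

print(call("+", "1", "2", "3"))
print(call("eq", "007", "7"))    # compared as integers, not as strings
```

Note the contrast with <if>, which compares its arguments as lists of tokens: 007 and 7 are equal for <eq> but different for <if>.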

Cartype macros

To understand what these macros do, please read the section on how the lexer works first.

The <cartype_ordinary> builtin expects one argument, which must be a character string. It makes every character of that string an ordinary character for the lexer.

The <cartype_escape_next> builtin expects one argument, which must be a character string. It makes every character of that string an escape-next character for the lexer.

The <cartype_escape_string> builtin expects one argument, which must be a character string. It makes every character of that string an escape-string character for the lexer.

The <cartype_comment> builtin expects one argument, which must be a character string. It makes every character of that string a comment character for the lexer.

The <cartype_active> builtin expects two arguments, the first of which must be a character string, which should be one character long. It makes that character an active character for the lexer, and associates to it the value of the second argument (an arbitrary token string, possibly empty).

`<translate>`

The <translate> builtin expects two arguments. The first argument, called the translation table, should be of even length. Every occurrence of the first token of the translation table within the second argument is replaced by the second token of the table. Every occurrence of the third token is replaced by the fourth, and so on. Note that these replacements take place simultaneously rather than consecutively.
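
The difference between simultaneous and consecutive replacement shows up as soon as the table swaps two tokens. The following Python sketch models <translate> on ordinary strings, one character per token. The function name and the ValueError on an odd-length table are invented for the illustration, and letting the first pair win when a token appears twice as a source is an assumption.

```python
def translate(table, text):
    """Replace table[0] by table[1], table[2] by table[3], ... all at once."""
    if len(table) % 2:
        # Assumption: the text only says the table should be of even length.
        raise ValueError("translation table should be of even length")
    mapping = {}
    for src, dst in zip(table[0::2], table[1::2]):
        mapping.setdefault(src, dst)   # assumption: the first pair for a token wins
    # Each token of the text is looked up exactly once, hence ``simultaneously''.
    return "".join(mapping.get(tok, tok) for tok in text)

print(translate("abba", "abc"))
```

With the table abba, a becomes b and b becomes a, so abc turns into bac. Doing the two replacements one after the other would instead turn abc into bbc and then into aac.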

`<find>`

The <find> builtin expects two arguments. It searches for a copy of the second argument within the first argument. If such an occurrence is found, it returns the position of its first character (0 being the beginning of the string). Otherwise, it returns the empty string.

`<substr>`

The <substr> builtin expects three arguments, of which the second and third must be character strings representing integers. It returns the sub-token-string of the first argument which starts at the position given by the second argument (0 being the beginning of the string) and whose length is given by the third argument.
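
<find> and <substr> are natural companions, since the position returned by the first can be fed to the second. Here is a Python model on ordinary strings, one character per token. The function names are invented, and the behaviour for positions or lengths that are negative or run past the end of the string is not specified by the text (the sketch just inherits Python's slicing rules there).

```python
def find(haystack, needle):
    """Position of the first occurrence of needle, or the empty string."""
    pos = haystack.find(needle)
    return "" if pos < 0 else str(pos)   # positions are counted from 0

def substr(tokens, start, length):
    """The part of tokens that begins at `start` and is `length` tokens long."""
    start, length = int(start), int(length)
    return tokens[start:start + length]

text = "hello world"
pos = find(text, "world")
print(pos)
print(substr(text, pos, "5"))
```

Because a failed search yields the empty string rather than a number, the result of <find> should be tested (for instance with <if>) before it is passed on to <substr>.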

Miscellaneous

What architectures does `simple` run on?

I wish I knew. Starting from version 2.1.0, I am using GNU automake and autoconf, which are supposed to make porting programs much easier. But of course, autoconf is no good so long as the C program itself does not use the appropriate #ifdefs where it should. So I tried fixing those portability problems I knew about (but I didn't always get a chance to try the fix out). Still, there may be many others that I am not aware of.

Anyhow, simple will work on Linux, with either libc5 or glibc2 (aka libc6) because that's its original platform. I've had it run on Solaris and I'm currently working on the SunOS port (still untested). I gather that it will work nicely on any recent flavor of Unix. Note that it requires an ANSI C compiler (just get gcc if you don't have one).

Portability checks include: the presence of getopt_long() (if you don't have that, you can't use long options), the presence of getopt() (if you don't have that, you can't use any options), the presence of memmove() (if you don't have that, some things will work real slow), the presence of snprintf() (if you don't have that, you might have a security problem, but in fact that should never happen because I allow 30 characters just for printing an integer and I guess that should always suffice) and the fact that realloc(NULL...) and free(NULL) do what they should (if they don't it ought to be no problem because I just use a small wrapper around them).

Beyond that, it gets complicated. An MS-DOS port seems hopeless because I make a very intensive use of realloc(), sometimes with rather large memory blocks, so the huge model would be a necessity, and probably the poor thing would choke itself out of memory real fast even if it can be compiled.

Where can I get `simple`?

simple comes in one tar.gz file which contains the C source, makefiles, test simple files, whatever STeX files already exist, a man page, this document (and the GNU General Public License). It is available through the WWW at the address http://www.eleves.ens.fr:8080/home/madore/simple.tgz, or in the /pub/Linux/apps/tex directory on Sunsite.

David Madore