In a previous Zig post, we wrote a small postfix notation calculator.
This post takes a look at something that take prefix notation to the extreme: a new language in the Lisp family called Bio
The goal of this project is to make it easy to play around with Lisp dialect ideas and to make a well-documented and readable Lisp interpreter for others to learn from. This article aims to point out some key areas people stumble on when writing a basic Lisp interpreter.
Acknowledgments: Bio is written in Zig, a fact I think awesome designer Joy Machs was able to neatly incorporate into the lambda/lisp oriented logo. I uncovered a stdlib bug while working on this; thanks to jumpnbrownweasel for fixing it quickly.
Scope
This initial iteration gives us a language with:
- Higher-order functions with lexical scoping and enough juice to implement recursion in the language itself
- The ability to create composite data types with polymorphic behavior.
- Macros
- Garbage collection and atom interning
- An error handling mechanism
- Tail call elimination
- A module system
- A language reference
- A small but extensible standard library written in Bio, with functions like filter, map, and quicksort
- A REPL and a couple of ways to load and evaluate source files
Examples
Here’s a few sample Bio expressions:
; Define a variable, note that (define) works as well
(var x 5)
; Define a couple of functions (one using λ and π symbols) and call them
(var square (lambda (x) (* x x)))
(var circumference (λ (r) (* 2 π r)))
(print "square is:" (square x) ", circumference is:" (circumference x))
; Define a macro. Arguments are not evaluated until needed by the macro expansion
(var if-above-10 (macro (val then else)
`(if (> ,val 10) ,then ,else)
))
; Prints aye
(if-above-10 12 (print 'aye) (print 'nay))
; Filter out negative numbers from a sorted list => (1 2 5 40)
(filter
(quicksort '(5 40 1 -3 2) <)
(λ (x) (>= x 0)))
; Double every element in a list => (2 4 6)
(map (λ (x) (* 2 x)) '(1 2 3))
; Summation through zipping lists => (2 6 11)
(map + '(0 2 5) '(1 2 3) '(1 2 3))
; Add ten to every item and print it
; Note that "each" is just a standard library function
(var mylist '(1 2 3))
(each mylist (lambda (x)
(print (+ x 10) ", ")))
; Tail-call elimination makes it possible to make
; any loop construct you want.
; Here's the while loop from the standard library in action:
(var c 0)
(while (< c 10)
(print "Value is now " c "\n")
(inc! c)
)
; Read two numbers. If the second one is zero, this will
; fail with "Division by zero"
(try (math.safe-div (io.read-number) (io.read-number))
(print "The result is: " #value)
(print "Failed: " #!)
)
; Import a point module. By convention, module variables are Pascal cased.
(var Point (import "examples/mod-point.lisp"))
; Make a position (a composite type), where
; new-point and new-location are just functions
(var mypos (Point new-point 61.5 42.2))
(var new-york (Point new-location 40.7554351 -73.9981619))
; Update using applicative syntax
((mypos update) 17.2 20.5)
; Update again, now using function call syntax but in a different environment
(mypos (update 27.2 20.5))
; Print the location by accessing the x and y variables of the position
(print "Moved to:" (mypos x) (mypos y) "\n")
; Print the location using the simple as-string version
(print "Located at" ((mypos as-string)) "\n")
; Print the location as longitude/latitude => "40.7554351° N 73.9981619° W"
(print "Located at" ((new-york as-string)) "\n")
; Introspection => "number, #t, #t"
(print (string (typename math.pi) ", " (symbol? 'a) ", " (callable? +)))
; Let's implement factorial in the fanciest way imaginable with lots of
; lambda symbols and a fixpoint combinator. The standard library also
; contains a direct-recursive version.
(var Z (λ (f) ((λ (g) (g g)) (λ (g) (f (λ (a) ((g g) a)))))))
(var ! (Z (λ (r) (λ (x) (if (< x 2) 1 (* x (r (- x 1))))))))
; Does it work?
(assert (= 120 (! 5)))
Source code and the language reference is available here: https://github.com/cryptocode/bio
Bio is work in progress; the article-source release contains the version that existed at the time of writing this article.
Using Bio
To build, clone the repository and run zig build
You can now use Bio in two ways:
- Start the REPL with
./bio
- Run a program with
./bio run myprogram.lisp
There are some examples you play around with, like this one that prints a Sierpinksi triangle. You can change the size by editing the source file.
./bio run examples/triangles.lisp
^
^ ^
^ ^
^ ^ ^ ^
^ ^
^ ^ ^ ^
^ ^ ^ ^
^ ^ ^ ^ ^ ^ ^ ^
^ ^
^ ^ ^ ^
^ ^ ^ ^
^ ^ ^ ^ ^ ^ ^ ^
^ ^ ^ ^
^ ^ ^ ^ ^ ^ ^ ^
^ ^ ^ ^ ^ ^ ^ ^
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
Here’s a persistent album database. It persists by saving Bio expressions to a file and then evaluating them on startup:
./bio run examples/albums.lisp
Welcome to the Bio Album Database
1: Add 2: Find by name 3: List all 4: Exit
What kind of Lisp flavor is it?
It’s a single namespace Lisp inspired by Scheme. Scheme, like Lisp itself, is a family of dialects, and Bio is yet another one. Some of the differences come from ideas I want to explore, such as the ability to treat environments as expressions, which is the building block for modules and composite data types.
The best way to learn Bio is to play around with it, type (env) in the REPL for a list of functions, and consult the language reference. There are also some example files in the repository.
The floor plan
The Bio source code is broken into its main areas of concern, thus filenames should be self explanatory: gc.zig contains the garbage collector, ast.zig the expression AST, intrinsics.zig contains the intrinsics, etc.
Dealing with intrinsics
A decent chunk of the code base is the implementation of intrinsic functions and symbols, such as stdList
to implement (list 1 2 3)
and #t
for the truth symbol. All these functions are prefixed “std” and have the same signature:
fn std<intrinsic>(ev: *Interpreter, env: *Env, args: []const *Expr) anyerror!*Expr
This signature fits the ExprValue
union type for intrinsic functions.
The reader and evaluator
The Interpreter
struct implements the logic to read and evaluate expressions. When initialized, a root environment is created and all the intrinsic symbols and functions are registered with it. This way, when your Bio program calls (list 1 2 3)
, the list
function is readily available in the root environment.
In a Lisp interpreter, we want to move from the world of text to the world of s-expressions as quickly as possible.
First order of business is to split expressions like (var x (* 2.45 scale))
into a stream of tokens like (, var, x, (, *, 2.45, scale, ), )
If we required the programmer to always put spaces around parentheses, we could use the std.mem.split
function. But that wouldn’t be very friendly, so we’re going to extend the idea of the Spliterator into a token iterator for Lisp expressions. We can now easily iterate over tokens using it.next()
The read()
function uses the Lisperator and produces exactly one expression: an atom
or a list
.
Of course, a list can contain other lists and atoms, so read
works recursively.
An atom is either a symbol
or a number
. Some symbols are predefined, like #t
and #f
for true and false. Symbols can also be strings, like "This message"
The whole REPL process goes like this:
- readBalancedExpr, which reads a complete expression that may span multiple lines.
- parseAndEvalExpression is then called, which in turn calls
parse
andeval
- parse is a small helper to set up the Lisperator, and call
read
with it. Theread
function recursively parses the s-expression. - eval recursively evaluates the expression and returns the result, which the REPL prints.
- parse is a small helper to set up the Lisperator, and call
Intrinsic functions
Rather than implementing all basic Lisp functions in Zig, we’ll implement a handful of general functions from which a Bio standard library can be built. For example, all functions querying lists, such as car
and cdr
, are built from an intrinsic range
function.
The standard library is located in std.lisp
, which is loaded on startup. For example, here’s how car
and cdr
are defined in terms of range
:
(var car (λ (list) (range list 0 1)))
(var cdr (λ (list) (range list 1)))
Note that λ
is a synonym for lambda
- you can use either. The λ version is less verbose when passing function literals to functions like filter
(many editors have plugins to automatically convert certain identifiers to symbols, like Symbol Complete for vscode)
Logical and relational functions, or “I can do that already?”
Everyone who implements a Lisp soon reaches this magical point where not only can you start implementing the language in itself, you start preferring it! This is not to say the host language is bad, it’s just that if you can extend your Lisp using Lisp, that makes everything easier. It enables you to experiment in the REPL, and you don’t have to worry about low-level details like memory management.
As an example, here’s how or
, <
, and <=
are implemented in the standard library:
(var or (lambda (x y) (if x #t (if y #t #f)) ) )
(var < (lambda (x y) (if (= (order x y) (- 1)) #t #f)))
(var <= (lambda (x y) (if (or (< x y) (= x y) ) #t #f)))
So first we define or
as a function, then we use it to implement <
using the order
intrinsic, which finally allows us to implement <=
.
Functional composition is great!
Alas, time for a reality check! Implementing or
as a function is wrong here because both arguments are always evaluated. We’ll fix this as soon when we get to macros.
Scopes
Early Lisp dialects had dynamic scoping, and this is still the default in Emacs Lisp.
Like most languages these days, Bio is lexically scoped. What’s the difference anyway?
First of all, if a variable is not bound to a formal argument or a local definition, it must be a free variable defined elsewhere. How exactly should Bio “look elsewhere”? This is where environments and their parents come in.
Every lambda executes in its own environment, containing arguments and local variables. An environment is just a hashmap from the symbol literal to the value.
(var x 5)
is essentially the addition of a hashmap entry in the current environment. The same is true when passing arguments.
Every environment, except the initial root environment, also has a parent environment.
- If the parent environment is the callers’ environment, we have dynamic scoping
- If the parent environment is the environment that existed when the lambda was defined, we have lexical scoping
The last point tells us exactly what we need to do in the implementation:
- When we define a lambda, store the current environment in the lambda expression so we can look it up when evaluating the lambda.
- When we evaluate the lambda, make a brand new environment, add arguments to that environment (using the formal parameter names as key), and then make the parent environment be whatever the current environment was when the lambda was defined.
Macros
“One can even conjecture that Lisp owes its survival specifically to the fact that its programs are lists, which everyone, including me, has regarded as a disadvantage.” – John McCarthy
I mentioned earlier that we can’t implement or
using a regular function, because all arguments are evaluated before the function is evaluated. Logical operators are supposed to use shortcut evaluation.
Implementing or
as a macro will fix the problem:
(var or (macro (expr1 expr2)
`(if ,expr1
#t
(if ,expr2 #t #f)
)
))
While the use and implementation of macros are similar to lambdas, there are some differences. These differences are key to unleashing the power of Lisp meta-programming:
- The macro arguments are not evaluated when calling the macro. It’s up to the body to decide if and when to evaluate them.
- The parent environment is different. For lambdas, the parent environment is the environment that existed when it was defined. This gives us lexical scoping. But that’s not what we want for macros; we kind of want to “paste” the expanded body into the very place it’s being called. For this reason, the parent environment for a macro is simply the current one (it’s not quite “pasting” though, since the macro has its own local environment; this helps avoid symbol clashes.)
- The last expression of the macro body (the result of evaluating the macro) is evaluated again! The first evaluation typically causes a quasiquote to be expanded into code. The second evaluation then evaluates that code.
Macros are Bio functions that write Bio code, enabling you to mutate the language to fit your needs. In Lisp, code is lists which we can manipulate with the full power of the language, where other languages usually have to degrade to something like building strings.
Quasiquoting and unquoting
The or
macro above uses quasiquoting to help write code. It’s not strictly necessary, but it’s usually more readable than producing lists and quoting manually.
Quasiquoting is like quoting, except that it allows select expressions to be unquoted. In terms of implementation, it’s important to note that unquote
and unquote-splicing
are always executed inside a quasiquote
.
Example:
(var x 4)
(var mylist '(1 2 3))
`(,@mylist ,x)
(1 2 3 4)
If you call (verbose)
, any quasiquote expansions will be printed.
Garbage collection
As part of devising Lisp, John McCarthy also described a clever method for automatic memory management: garbage collection. Lisp programmers detest manual memory management, so I guess we’ll have to implement it.
To prove that it works, the interpreter allocates from Zig’s GeneralPurposeAllocator with leak detection turned on.
Here are the main steps:
- When we allocate an expression, we register it with the GC
- If we don’t register it with the GC, it’s a pinned expression, such as the predefined
nil
expression; this is static data we don’t need to deallocate. - Every 100K allocation, we run the garbage collector, but it can also be run manually with
(gc)
. This will first mark all reachable expressions. Then it runs through all registered expressions. If it’s not marked, we can deallocate it. By reachable, I mean that the expression is either a key or a value in some environment. If it’s a key, it’s always a symbol expression (denoting a variable name). We also need to sweep environments. The marking step starts at the root environment and callsmarkEnvironment
on it. This will loop through all entries and mark the environment associated with the expression. For lambdas, this may be something different than the root environment, because another lambda may have returned it. We also mark any parents of reachable environments. Finally, we destroy any unmarked environments.
There’s a lot of simple optimizations that can be done here, such as not using a container for marking, and only doing partial sweeps to reduce GC pauses.
Error handling
Bio implements a simple error handling mechanism through the try
and error
functions.
(try (math.safe-div (io.read-number) (io.read-number))
(print "The doubled result is: " (* 2 #value))
(print "Failed: " #!)
The try
function is very similar to if
: the first branch is evaluated if the function succeeds, otherwise the second (optional) branch is evaluated. The #value
symbol contains the result of the tried expression, while #!
contains any error expressions. Error expressions are produced with the (error <expr>)
function.
For more information, see the Reference section on GitHub.
Tail call elimination
Lisp programmers just love recursion. In fact, some of them love it so much that they make Lisp dialects without loop constructs! Bio macros such as while
simply expand to a tail-recursive function, and these can fortunately be made efficient. While Bio offers a loop intrinsic, it isn’t strictly necessary.
For Bio to support recursion well, we have to implement tail call optimization, or TCO. Without it, the evaluator would keep calling itself through indirection recursion, blowing up the stack in a jiffy.
The solution is conceptually simple: the body of eval
in Zig is a loop. When eval
evaluates, say, if
, then the selected branch will become the next expression to evaluate. In the case of lambdas, the last expression becomes the next expression in the eval loop, and in this case we also have to change the loop’s current environment to the lambda environment. The same is true for environment expressions, such as (pos (update 3 4))
I think the easiest way to see how this works is to fire up zig run bio.zig
in a debugger, then step through eval
after pasting a minimal recursive Bio function in the REPL.
VSCode configuration
- Install a Lisp syntax plugin
- Add a run task for the currently open lisp file in
tasks.json
:
{
"label": "bio run",
"type": "shell",
"command": "${workspaceFolder}/bio run ${file}",
"problemMatcher": [],
"group": {
"kind": "build",
"isDefault": true
}
},
Reading list
- http://jmc.stanford.edu/articles/lisp/lisp.pdf
- http://jmc.stanford.edu/articles/recursive/recursive.pdf
- A different take on implementing Lisp in Zig can be found in the Mal repository (not mine)*
- A comptime Lisp
- A correct quasiquoting algorithm in appendix A
Going forward (feel free to contribute or fork the project)
- Add arbitrary precision number support, ideally a Scheme-like numerical tower system
- A more advanced macro system, including reader macros
- GC optimizations
- Expand the standard library
- Improve error handling by recording more context
- Integrate libffi or similar so std lib can do anything
- A compiler targeting a vm or maybe Zig stage2 ir