Compiling a Lisp: Booleans, characters, nil

September 2, 2020

firstprevious

Welcome back to the “Compiling a Lisp” series. Last time, we compiled integer literals. In today’s relatively short post, we’ll add the rest of the immediate types. Our programs will look like this:

In addition, since we’re not adding too much exciting stuff today, I made writing tests a little bit easier by adding fixtures. Now, if we want, we can get a pre-made Buffer object passed into the test, and then have it destroyed afterward.

Encoding

Since we’re coming back to the pointer tagging scheme, I’ve reproduced the “pointer templates” (I don’t think that’s a real term) diagram from the last post below.

High                                                         Low
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX00  Integer
0000000000000000000000000000000000000000000000000XXXXXXX00001111  Character
00000000000000000000000000000000000000000000000000000000X0011111  Boolean
0000000000000000000000000000000000000000000000000000000000101111  Nil
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX001  Pair
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX010  Vector
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX011  String
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX101  Symbol
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX110  Closure

Notice that we have a pattern among the other immediates (character, boolean, and nil) – the lower four bits are all the same, and that sets them apart from other pointer types.

Also notice that among those immediates, they can be discriminated by the two bits just above those four:

High                                                             Low
0000000000000000000000000000000000000000000000000XXXXXXX00[00][1111]  Character
00000000000000000000000000000000000000000000000000000000X0[01][1111]  Boolean
0000000000000000000000000000000000000000000000000000000000[10][1111]  Nil

So a lower four bits of 0b1111 means immediate, and from there 0b00 means character, 0b01 means boolean, and 0b10 means nil. There’s even room to add another immediate tag pattern (0b11) if we like.

Let’s add some of the symbolic constants for bit manipulation.

const unsigned int kImmediateTagMask = 0x3f;

const unsigned int kCharTag = 0xf;   // 0b00001111
const unsigned int kCharMask = 0xff; // 0b11111111
const unsigned int kCharShift = 8;

const unsigned int kBoolTag = 0x1f;  // 0b0011111
const unsigned int kBoolMask = 0x80; // 0b10000000
const unsigned int kBoolShift = 7;

Notice that we don’t have any for nil. That’s because nil is a singleton and has no payload at all. It’s just a solitary 0x2f.

For the others, we need to put the payload alongside the tag, and that requires a shift and a bitwise or. The first operation, the shift, moves the payload left enough that there’s space for a tag, and the or adds the tag.

word Object_encode_char(char value) {
  return ((word)value << kCharShift) | kCharTag;
}

char Object_decode_char(word value) {
  return (value >> kCharShift) & kCharMask;
}

word Object_encode_bool(bool value) {
  return ((word)value << kBoolShift) | kBoolTag;
}

bool Object_decode_bool(word value) { return value & kBoolMask; }

word Object_true() { return Object_encode_bool(true); }

word Object_false() { return Object_encode_bool(false); }

word Object_nil() { return 0x2f; }

For bool, we’ve done a little trick. Since we only care if the value is true or false, instead of doing both a shift and mask to decode, we can turn off the tag bits. The resulting value will be either 0b00000000 for false or 0b10000000 for true. Since any non-zero value is truthy in C, we can “cast” that to a C bool by just returning it.

Note that the cast from char and bool to word is necessary because — as I learned the hard way, several months ago — shifting a type left more to the left than the size has bits is either undefined or implementation-defined behavior. I can’t remember which offhand but the situation went sideways and left me scratching my head.

I added Object_true and Object_false because I thought they might come in handy at some point, but we don’t have a use for them now. If you are strongly against including dead weight code, then feel free to omit them.

Now let’s add some more AST utility functions before we move on to compiling:

bool AST_is_char(ASTNode *node) {
  return ((word)node & kImmediateTagMask) == kCharTag;
}

char AST_get_char(ASTNode *node) { return Object_decode_char((word)node); }

ASTNode *AST_new_char(char value) {
  return (ASTNode *)Object_encode_char(value);
}

bool AST_is_bool(ASTNode *node) {
  return ((word)node & kImmediateTagMask) == kBoolTag;
}

bool AST_get_bool(ASTNode *node) { return Object_decode_bool((word)node); }

ASTNode *AST_new_bool(bool value) {
  return (ASTNode *)Object_encode_bool(value);
}

bool AST_is_nil(ASTNode *node) { return (word)node == Object_nil(); }

ASTNode *AST_nil() { return (ASTNode *)Object_nil(); }

Enough talk about object encoding. Let’s compile some immediates.

Compiling

The implementation is much the same as for integers. Check the type, pull out the payload, move to rax.

int Compile_expr(Buffer *buf, ASTNode *node) {
  if (AST_is_integer(node)) {
    word value = AST_get_integer(node);
    Emit_mov_reg_imm32(buf, kRax, Object_encode_integer(value));
    return 0;
  }
  if (AST_is_char(node)) {
    char value = AST_get_char(node);
    Emit_mov_reg_imm32(buf, kRax, Object_encode_char(value));
    return 0;
  }
  if (AST_is_bool(node)) {
    bool value = AST_get_bool(node);
    Emit_mov_reg_imm32(buf, kRax, Object_encode_bool(value));
    return 0;
  }
  if (AST_is_nil(node)) {
    Emit_mov_reg_imm32(buf, kRax, Object_nil());
    return 0;
  }
  assert(0 && "unexpected node type");
}

I suppose we could coalesce these by checking if the node is any sort of immediate and then writing the address immediately back with Emit_mov_reg_imm32… but that would be breaking abstractions or something.

Testing

The testing is also so much the same — so much so, that I’ll only include the test for compiling characters. The other code is available from assets/code/lisp if you would like a reference.

TEST compile_char(Buffer *buf) {
  char value = 'a';
  ASTNode *node = AST_new_char(value);
  int compile_result = Compile_function(buf, node);
  ASSERT_EQ(compile_result, 0);
  // mov eax, imm('a'); ret
  byte expected[] = {0x48, 0xc7, 0xc0, 0x0f, 0x61, 0x00, 0x00, 0xc3};
  EXPECT_EQUALS_BYTES(buf, expected);
  Buffer_make_executable(buf);
  word result = Testing_execute_expr(buf);
  ASSERT_EQ(result, Object_encode_char(value));
  PASS();
}

You’ll notice that instead of void, the function now takes Buffer*. This is part of the new testing fixtures setup that I mentioned earlier. The implementation is a macro that uses greatest.h’s “pass a parameter to your test” feature. Running a test looks much the same:

RUN_BUFFER_TEST(compile_char);

Anyway, that’s a wrap for today. Next time we’ll add some unary primitives for querying and manipulating the objects we have already.

Mini Table of Contents