Writing a C Compiler: Part 2, Parser and Code Generation

November 26, 2021

I’ve worked a bit more on the compiler, implementing the parsing and code generation steps. So, the program to compile is as follows:

int main() {
    return 42;
}

Disclaimer: If you are interested in doing this yourself, I would recommend reading Nora Sandler’s blog posts. I’m providing the source code that I ended up with, whereas the instructions found in the blog posts provides good context and instructions without providing any actual code or being prescriptive about the implementation.

You can see the code I wrote for the lexer here. I’ve learned that the lexer actually performs a relatively simple task (compared to the other steps) in that it simply creates a list of tokens. It doesn’t provide any structure or try to find syntax errors.

The next step is to build the Abstract Syntax Tree (AST) based on the list of tokens. This is where the program gets it’s hiarchical structure. In order to create an AST, each node can be described a combination of tokens or specific kinds of other nodes. For example, a “function” node would be a combination of keywords for return type, tokens for opening and closing brackets, expressions for arguments, statements for the function body, etc. Things like expressions and statements are also nodes, which have their own parsing rules, including tokens and further nodes. As the tokens are parsed, a tree is created as nodes become nested.

Again, each node is handled differently. For the 3-line example program, there are four types of nodes:

Program (the toplevel node)
Function (a single main function)
Statement (a single return statement)
Expression (a single constant value integer)

As far as code goes, here’s what I got. Still goin’ strong with the ruby.

# frozen_string_literal: true

class ParseError < StandardError; end

class Node
  attr_reader :tokens

  def initialize(tokens)
    @tokens = tokens
  end
end

class Program < Node
  attr_reader :function

  def parse!
    @function = Function.new(tokens).parse!

    raise ParseError, format('root function must be called main, found %s', @function.id) if @function.id != 'main'

    self
  end
end

class Statement < Node
  attr_reader :is_return, :expression

  def parse!
    token = tokens.shift

    if token.type == Token::KEYWORD && token.value == 'return'
      @is_return = true
    else
      tokens.unshift(token)
    end

    @expression = Expression.new(tokens).parse!

    self
  end
end

class Expression < Node
  attr_reader :constant_value

  def parse!
    token = tokens.shift
    raise ParseError, format('unexepected, got %<type>s : %<value>s', **token) unless token.type == Token::CONST

    @constant_value = token.value

    token = tokens.shift
    raise ParseError, format('expected terminator, got %<type>s : %<value>s', **token) unless token.type == Token::TERM

    self
  end
end

class Function < Node
  attr_reader :return_type, :id, :statements

  def parse!
    @statements = []

    token = tokens.shift

    unless token.type == Token::KEYWORD && return_types.include?(token.value)
      raise ParseError, format('invalid function return type %<type>s : %<value>s', **token)
    end

    @return_type = token.value

    token = tokens.shift
    raise ParseError, format('expected function name, got %<type>s : %<value>s', **token) unless token.type == Token::ID

    @id = token.value

    token = tokens.shift
    raise ParseError, format('expected args, got %<type>s : %<value>s', **token) unless token.type == Token::PAREN

    token = tokens.shift
    raise ParseError, format('expected args, got %<type>s : %<value>s', **token) unless token.type == Token::END_PAREN

    token = tokens.shift
    raise ParseError, format('expected args, got %<type>s : %<value>s', **token) unless token.type == Token::BLOCK

    loop do
      if token.type == Token::END_BLOCK
        raise ParseError, format('missing return') if @statements.empty?

        break
      end

      raise ParseError, 'unexpected end of function' if tokens.empty?

      @statements << Statement.new(tokens).parse!

      raise ParseError, 'unexpected end of function' if tokens.empty?

      token = tokens.shift
    end

    self
  end

  def return_types
    ['int']
  end
end

class Parse
  attr_reader :compiler
  attr_accessor :curnode

  def initialize(compiler)
    @compiler = compiler
    @ast = Program.new(compiler.tokens)
    @curnode = @ast
  end

  def parse!
    @curnode.parse!
  end
end

I’m going to throw in the code generator here too, since (for now) it’s actually relatively simple. I’m reopening classes here, sorry. It’s work a work in progress.

# frozen_string_literal: true

class Function
  def compile!
    <<~ASM
      .globl #{id}
      #{id}:
      #{statements.map(&:compile!).join("\n")}
    ASM
  end
end

class Statement
  def compile!
    <<~ASM
      #{expression.compile!}
      #{is_return ? 'ret' : ''}
    ASM
  end
end

class Expression
  def compile!
    <<~ASM
      movl    $#{constant_value}, %eax
    ASM
  end
end

class Generate
  attr_reader :compiler

  def initialize(compiler)
    @compiler = compiler
    @buf = ''
  end

  def compile!
    main = compiler.ast.function
    main.compile!
  end
end

I’m a bit concerned how this will scale. It seems like the parsing of each node is very specific to that node. I guess it would be, I’m not sure what I’m trying to say here. I just assume what I’m doing here will break down quickly and stop working. We’ll see!

I’m glad I made it this far, I’m curious to see how much more I can build on what I have here. Looking forward to Part 2.