Writing a simple parser using a finite state machine

2025-06-06 21:21

I wrote a literate programming tool called Crycco and at its core is a very simple parser which extracts comment blocks from source code and organizes them into a document.

If you want to see an example, Crycco's website is made out of its own code by processing it with itself.

The format is very simple: you write a comment block, then some code, then another comment block, and so on. The comment blocks are taken as markdown, and the code blocks are taken as source code.

The document is then split in sections: 1 chunk of doc, 1 chunk of code, and processed to make it look nice, with syntax highlighting and so on.

The parser was just a couple dozen lines of code, but when I wanted to add more features I ran into a wall, so I rewrote it using a finite state machine (FSM) approach.

Since, again, Crycco is literate code, you can see the parser by just looking at the source code of the parser itself, which is in the Document class

Of course, that may not be enough, so let's go into details.

The state machine has a few states:

    enum State
      CommentBlock
      EnclosingCommentBlock
      CodeBlock
    end

A comment block is something like this:

    # This is a comment
    # More comment

An enclosing comment block is a "multiline" comment, like this:

 /* This is a comment
 More comment
 */

Code blocks are lines that are not comments :-)

So, suppose you are in the CommentBlock state, and you see a line that starts with # you stay in the same state.

If you see a line that does not start with #, you switch to the CodeBlock state.

When you are in the CommentBlock state, the line you are in is a comment. If you are in the CodeBlock state, the line is code.

Here are the possible transitions in this machine:

    state_machine State, initial: State::CodeBlock do
      event :comment, from: [State::CodeBlock], to: State::CommentBlock
      event :enclosing_comment_start, from: [State::CodeBlock], to: State::EnclosingCommentBlock
      event :enclosing_comment_end, from: [State::EnclosingCommentBlock], to: State::CodeBlock
      event :code, from: [State::CommentBlock], to: State::CodeBlock
    end

Then, to parse the document, we go line by line, and call the appropriate event depending on the line we are reading. That event may change the state or not.

For example:

        if is_comment.match(line) && !NOT_COMMENT.match(line)
          self.comment {
            # These blocks only execute when transitions are successful.
            #
            # So, this block is executed when we are transitioning
            # to a comment block, which means we are starting
            # a new section
            @sections << Section.new(@language)
          }
          # Because the docs section is supposed to be markdown, we need
          # to remove the comment marker from the line.
          processed_line = processed_line.sub(@language.match, "") unless @literate

And that's it! We send the proper events to the machine, the machine changes state, we handle the line according to what state we are in, and we end up with a nicely parsed document.

Parsers are somewhat scary, but they don't have to be. A finite state machine is a very simple way to write a parser that is easy to understand and maintain, and often is enough.

Have fun parsing!

Ralsina.Me — Roberto Alsina's website

Posts about parser

Writing a simple parser using a finite state machine