Writing a simple parser using a finite state machine
I wrote a literate programming tool called Crycco and at its core is a very simple parser which extracts comment blocks from source code and organizes them into a document.
If you want to see an example, Crycco's website is made out of its own code by processing it with itself.
The format is very simple: you write a comment block, then some code, then another comment block, and so on. The comment blocks are taken as markdown, and the code blocks are taken as source code.
The document is then split in sections: 1 chunk of doc, 1 chunk of code, and processed to make it look nice, with syntax highlighting and so on.
The parser was just a couple dozen lines of code, but when I wanted to add more features I ran into a wall, so I rewrote it using a finite state machine (FSM) approach.
Since, again, Crycco is literate code, you can see the parser by just looking at the source code of the parser itself, which is in the Document
class
Of course, that may not be enough, so let's go into details.
The state machine has a few states:
A comment block is something like this:
An enclosing comment block is a "multiline" comment, like this:
Code blocks are lines that are not comments :-)
So, suppose you are in the CommentBlock
state, and you see a line that starts with #
you stay in the same state.
If you see a line that does not start with #
, you switch to the CodeBlock
state.
When you are in the CommentBlock
state, the line you are in is a comment. If you are in the CodeBlock
state, the line is code.
Here are the possible transitions in this machine:
Then, to parse the document, we go line by line, and call the appropriate event depending on the line we are reading. That event may change the state or not.
For example:
And that's it! We send the proper events to the machine, the machine changes state, we handle the line according to what state we are in, and we end up with a nicely parsed document.
Parsers are somewhat scary, but they don't have to be. A finite state machine is a very simple way to write a parser that is easy to understand and maintain, and often is enough.
Have fun parsing!