Creating Languages For Dummies

2017-04-12 02:11

Intro

I don't have the usual programmer's education. I studied maths, and then dropped out of that, and am mostly self-taught. So, there are some parts of programming I always saw wearily, thinking to myself that I really should go to school to learn them. One remarkable such area is parsing and implementing languages.

Well... sure, school is always a good idea, but this is not that hard. In this article I will explain how to go from nothing to a functioning, extensible language, using Python and PyParsing. If you are as scared of grammars, parsers and all that jazz as I used to be, come along, it's pretty simple,

[TOC]

Grammar (part I)

What is a grammar? It's what defines what is and what is not valid in our language, and it's the hardest part to get right, mostly because it's all about making decisions.

First decision: language name. I will call this language Least Optimized Language Ever, or LOLE

I will describe the language informally, then show you how to describe that grammar formally using PyParsing.

PyParsing is "constructive", you start creating small blocks, then combine those blocks into larger things.

Identifiers

Some things in LOLE will have names. For example, variables, or functions. So, we need to define what a name is. Most languages allow you to use identifiers that start with a letter, and then use letters or numbers. Let's go even simpler: identifiers are words. Just letters.

identifier = Word(alphas)

Word is a PyParsing object that means "these characters". alphas is a constant that means "letters". So: "some letters".

You can actually try this in an interactive interpreter:

>>> from pyparsing import *
>>> identifier = Word(alphas)
>>> identifier.parseString("foo")
(['foo'], {})

identifier.parseString takes a string, and tries to parse it as an identifier. So, "foo" is an identifier, and we get the match. Nice!

What happens if we try something that doesn't match?

>>> identifier.parseString("42")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected W:(ABCD...) (at char 0), (line:1, col:1)

Surprise! You get an error. It's a pretty awful error, but it basically says "I was expecting a thing made of letters (ABCD...) and I did not get it".

Because error messages are important let's make that better already. There's always time to develop bad habits later.

>>> identifier = Word(alphas).setName('identifier')
>>> identifier.parseString("42")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected identifier (at char 0), (line:1, col:1)

Now it says it was expecting an identifier, and it did not get one. So it's more clear. Let's try one last thing:

>>> identifier.parseString("foo bar")
(['foo'], {})

You may say "hey, there are two identifiers there, and it only found one!" and you are right. That's because we are parsing identifier not identifierS. We told it to understand "foo bar" as an identifier, and it did just that. By default PyParsing will ignore anything left in the input after it found what it's looking for. If you want to make sure the whole input parses correctly, then you have to say so:

>>> identifier.parseString("foo bar", parseAll=True)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected end of text (at char 4), (line:1, col:5)

And yes, that is a valid error. We asked to know if "foo bar" is an identifier. It's not, because to be one it would have to end after "foo" because identifiers can't have spaces, and it does not.

Let's move to something more challenging....

Literals: Strings and Numbers

LOLE has literals for two types: strings and numbers. They are the usual thing, except that strings are python style, either single or double quoted, and support escaping quotes inside them.

I could now explain how to define numbers in terms of literals and combinations of symbols and numbers and so on. And strings as a list of characters surrounded by quotes. But I am going to tell you something much more important: learn your library.

literal = quotedString ^ pyparsing_common.number

quotedString is provided by PyParsing to define "strings as usual". And number defines "numbers as usual" so we don't have to do it. Because most languages need this, so having to do it each time is dumb.

We are also using ^ there. That means "one or the other". So 'literal' will successfully parse things that are strings and also things that are numbers.

>>> literal.parseString("-43.21E12")
([-43210000000000.0], {})
>>> literal.parseString(r"'this is a string with a \' in it!'")
(["'this is a string with a \\' in it!'"], {})

And of course, if you parse something else, it will fail:

>>> literal.parseString(r"foo")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyparsing.py", line 1632, in parseString
    raise exc
ParseException: Expected {quotedString using single or double quotes ^ {real number with scientific notation | real number | signed integer}} (at char 0), (line:1, col:1)

Like the error says, it was expecting either a string quoted with single or double quotes, or one of a few kinds of numbers.

So, now that we have identifiers and values, we can make this more like a real language and implement ...

Assignments

This is pretty simple:

foo = 42

So, an assignment is (for now):

An identifier
An equal sign
A literal value

And this is how it looks in PyParsing:

>>> assignment = identifier + Literal('=') + literal
>>> assignment.parseString('foo = 42')
(['foo', '=', 42], {})

Program

So far, we only support programs that do one thing and then end. That is not very useful. So let's define what a program is: "A list of statements separated with semicolons". So far, the only kind of statement we have is assignment, so it will look like this:

program = delimitedList(assignment, delim=';')

As usual, PyParsing does the heavy lifting for us: see what delimitedList does.

And now we can parse slightly more interesting things:

>>> program.parseString('foo = 42; bar="bat"')
(['foo', '=', 42, 'bar', '=', '"bat"'], {})

Let's stop defining grammar now for a little bit (we will come back to it) and think about how to use this thing we are creating.

Execution (Part I)

So we now have the world's most useless language, that only supports syntax for assigning values to variables. How do we make it do that?

Well, we need to take the output of parseString and process it, doing what our "program" says we should do. That's an interpreter.

So, how do we "run" ['foo', '=', 42, 'bar', '=', '"bat"']? For starters, I don't really like it. There are two statements there but we have six elements and no clear boundary between one statement and another. To fix that, we can change assignment:

>>> assignment = Group(identifier + Literal('=') + literal)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString('foo = 42; bar="bat"')
[['foo', '=', 42], ['bar', '=', '"bat"']]

Unsurprisingly, what Group does is group things, so we now have each statement as its own separate thing.

So, we could "run" this program like this:

>>> def run_program(code):
...    parsed = program.parseString(code)
...    vars = {}
...    for name, _, value in parsed:
...        vars[name] = value
...    return vars
>>> run_program('foo = 42; bar="bat"')
{'foo': 42, 'bar': '"bat"'}

And hey, that means this is an interpreter already. Not a very good one, tho!

Grammar (Part II)

Now, how about this LOLE program? foo = 42; bar = foo

As we have defined the language so far, that's illegal, because on the right side of the equal sign LOLE requires a literal, which is either a string or a number. What we really want on the right side is more general: an expression. An expression is something that eventually resolves to a value. So, variable names are expressions, function calls (when we have them) will be expressions, and things like 2 + 2 would be expressions if we implemented them.

So, let's introduce expressions, which are either literals, or identifiers (so far):

>>> expression = literal ^ identifier
>>> assignment = Group(identifier + Literal('=') + expression)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString("foo = 42; bar = foo")
[['foo', '=', 42], ['bar', '=', 'foo']]

Execution (Part II)

So, a LOLE expression in an assignment can currently be three kinds of things:

A number: 42
An identifier: 'foo'
A string: '"foo"'

What happens if we run this through run_program?

>>> run_program("foo = 42; bar = foo")
{'foo': 42, 'bar': 'foo'}

That is not really right, is it? We would want bar to be set to 42. The problem is that we are not calculating the value of expressions, but just passing them along. We need eval_expression:

>>> vars = {}
>>> def eval_expression(expr):
...    if isinstance(expr, str):
...        if expr[0] in '"\'':  # string literal
...            return expr[1:-1] # remove quotes
...        else: # identifier
...            return vars.get(expr)
...    # Number, return as-is
...    return expr

This solves that problem of "resolving" variables, and also that of extraneous quotes in our string literals. And then we need to use it in run_program, too:

>>> def run_program(code):
...    parsed = program.parseString(code)
...    for name, _, value in parsed:
...        vars[name] = eval_expression(value)
...    return vars
   
>>> run_program("foo = 'quux'; bar = foo")
{'foo': 'quux', 'bar': 'quux'}

Grammar (Part III)

Let's make LOLE slightly more interesting by adding functions. Since creating grammars is mostly a matter of making decisions, let's make a couple.

A function call is identifier() or identifier(arg) or identifier(foo=expr, bar=expr). If a function takes more than one argument, it has to use keyword arguments.
Functions are not defined in LOLE, they are provided by Python.

While keyword arguments look a lot like assignments, they are not the same thing, so we define them separately.

>>> kwarg = Group(identifier + Literal('=') + expression)
>>> function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
>>> print function_call.parseString('foo(bar=1, baz=2)')
[['foo', '(', ['bar', '=', 1], ['baz', '=', 2], ')']]
>>> print function_call.parseString('foo(bar=1)')
[['foo', '(', ['bar', '=', 1], ')']]
>>> print function_call.parseString('foo(42)')
[['foo', '(', 42, ')']]
>>> print function_call.parseString('foo()')
[['foo', '(', ')']]

This shows we have a number of problems in how we process our code.

How can we distinguish ['foo', '(', ')'] (a function call) from ['foo', '=', 4] (an assignment)? Clearly the output of parsing a function call is not distinctive enough. We can improve this by using a PyParsing feature called parseActions.

What parseActions do is process the result of parsing and allow you to use the knowledge of what you are parsing to do The Right Thing (TM).


class FunctionCall():
    pass

def function_call_action(tokens):
    f_call = FunctionCall()
    f_call.name = tokens[0][0]
    f_call.args = tokens[0][1:]
    return f_call
    
function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
function_call = function_call.setParseAction(function_call_action)

>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
['(', ')']

Do we really want args to contain the parenthesis? I say no. While they are useful for defining the syntax, we really don't need them at all, so we can use suppress to make our parser's output cleaner:

function_call = Group(identifier + Literal('(').suppress() + Optional(expression ^ delimitedList(kwarg)) + Literal(')').suppress())
function_call = function_call.setParseAction(function_call_action)

>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
[]
>>> print function_call.parseString('foo(42)')[0].args
[42]
>>> print function_call.parseString('foo(a=5, b=7)')[0].args
[(['a', '=', 5], {}), (['b', '=', 7], {})]

And since calling functions produces a value, it's a type of expression. And since we can just want to run a function and not care about what it returns, it's also a type of statement. Let's put all our grammar in one place:

identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward()
expression = literal ^ function_call
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression)
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^  delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)

statement = assignment ^ function_call
program = delimitedList(statement, delim=';')

The only new things are related to function_call. Since its definition needs to know about expression, and expression is defined using function_call we have a bit of a chicken-egg problem. The solution is to use Forward, so we say "hey, there is a thing called function_call, I will explain later".

Then, when actually defining function_call it uses << so the definition is applied to the Forward placeholder.

Execution (Part III)

Since we added function calls ... how do we execute those functions? By improving our run_program of course. So far, we treated every statement as an assignment, but now it can be either an assignment or a function_call.

Also, our eval_expression knew how to evaluate literals and variables, but not function calls, so that will need some tweaking also.

Let's do eval_expression first:

# This dictionary exposes functions to the LOLE interpreter
funcs = {
    'sin': sin,
}

def eval_expression(expr):
    if isinstance(expr, str):
       if expr[0] in '"\'':  # string literal
           return expr[1:-1] # remove quotes
       else: # identifier
           return vars.get(expr)
    elif isinstance(expr, FunctionCall):
        a = []
        kw = {}
        # Process function call arguments
        for arg in expr.args:
            if isinstance(arg, list):
                # argument is a kwarg
                kw[arg[0]] = eval_expression(arg[2])
            else:
                # argument is a literal value
                a.append(eval_expression(arg))
        return funcs[expr.name](*a, **kw)
    # Number, return as-is
    return expr

Sure, it got a little more complicated, but basically, all it does is add a special case for when the expression is a FunctionCall and handle that case by calling the named function with the given arguments. Since arguments are expressions themselves, we need to recursively call eval_expression to get their value.

And now we need to do something similar with run_program:

def run_program(code):
    parsed = program.parseString(code)
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

>>> print run_program('x = sin(0); sin(1)')
{'x': 0.0}

Again, while iterating over the statements, we check if the statement is a function call, in which case we evaluate it, or if it's an assignment, in which case we evaluate the right side and assign that value to the variable named on the left side.

This shows how extending the language works. If we add a new type of expression in the grammar, we add a new branch to eval_expression. If we add a new type of statement in the grammar, we add a branch to run_program. That is all there is to it. To prove it, let's add a more complex structure.

Grammar (Part IV)

What kind of language doesn't support an if statement? Well this one, until now. So let's add it. It could look like this:

if expression then { statements } [else { statements }]

The brackets around else mean that part is optional, since not all ifs need an else clause. The conditionally executed parts are a list of statements surrounded by braces, C-style.

Let's express this in PyParsing:

code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block))

statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')

program = delimitedList(statement, delim=';')

>>> print program.parseString('if 0 then {x=2} else {x=3}')
['if', 0, 'then', [['x', '=', 2]], 'else', [['x', '=', 3]]]

Again, we have to use Forward, now for code_block since if_stmt contains code_block which uses statement but if_stmt is also a statement. Also, some calls to setName so errors are nice and tidy, and use the Keyword element we had not seen before.

Just like with function calls, we end up with just a bunch of things in a list, which means when we want to execute it it will be annoying. We can improve it in exactly the same way via setParseAction:

class IfStmt():
    pass

def if_stmt_action(tokens):
    if_stmt = IfStmt()
    if_stmt.condition = tokens[0][1]
    if_stmt.then_block = tokens[0][3]
    if len(tokens[0]) > 4:  # has an else clause
        if_stmt.else_block = tokens[0][5]
    else:
        if_stmt.else_block = None
    return if_stmt

if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)

>>> if_stmt = print program.parseString('if 0 then {x=2} else {x=3}')[0]
>>> print if_stmt.condition
0
>>> print if_stmt.then_block
['x', '=', 2]
>>> print if_stmt.else_block
['x', '=', 3]

Execution (Part IV)

And, since we added a new kind of statement, we need to handle it in run_program:

def run_program(code):
    if isinstance(code, str):
        parsed = program.parseString(code)
    else:
        parsed = code
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        elif isinstance(statement, IfStmt):
            if eval_expression(statement.condition):
                run_program(statement.then_block)
            elif statement.else_block is not None:
                run_program(statement.else_block)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

>>> print run_program('if 0 then {x=2} else {x=3}')
{'x': 3}
>>> print run_program('if 1 then {x=2} else {x=3}') 
{'x': 2}

The new bits:

There is a branch to handle if statements. Since if statements contain code blocks, which are basically the same as a program, we end up calling run_program recursively.
To avoid complicating the example too much, run_program now needs to handle both unparsed programs (when we want to run a script) and parsed code, when it's called recursively. That is managed by the if at the beginning of the function.

Conclusion and Full Source Code

And there you go, we now have an extensible language that can call arbitrary Python functions! And here is the full listing:

from math import sin

from pyparsing import *
from pyparsing import pyparsing_common

class FunctionCall():
    pass

def function_call_action(tokens):
    f_call = FunctionCall()
    f_call.name = tokens[0][0]
    f_call.args = tokens[0][1:]
    return f_call

class IfStmt():
    pass

def if_stmt_action(tokens):
    if_stmt = IfStmt()
    if_stmt.condition = tokens[0][1]
    if_stmt.then_block = tokens[0][3]
    if len(tokens[0]) > 4:  # has an else clause
        if_stmt.else_block = tokens[0][5]
    else:
        if_stmt.else_block = None
    return if_stmt

identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward().setName('function call')
expression = (literal ^ function_call).setName('expression')
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression).setName('assignment')
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^  delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)

code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)

statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')

program = delimitedList(statement, delim=';')

funcs = {
    'sin': sin,
}

vars = {}

def eval_expression(expr):
    if isinstance(expr, str):
       if expr[0] in '"\'':  # string literal
           return expr[1:-1] # remove quotes
       else: # identifier
           return vars.get(expr)
    elif isinstance(expr, FunctionCall):
        a = []
        kw = {}
        # Process function call arguments
        for arg in expr.args:
            if isinstance(arg, list):
                # argument is a kwarg
                kw[arg[0]] = eval_expression(arg[2])
            else:
                # argument is a literal value
                a.append(eval_expression(arg))
        return funcs[expr.name](*a, **kw)
    # Number, return as-is
    return expr

def run_program(code):
    if isinstance(code, str):
        parsed = program.parseString(code)
    else:
        parsed = code
    for statement in parsed:
        if isinstance(statement, FunctionCall):
            eval_expression(statement)
        elif isinstance(statement, IfStmt):
            if eval_expression(statement.condition):
                run_program(statement.then_block)
            elif statement.else_block is not None:
                run_program(statement.else_block)
        else:  # assignment
            name, _, value = statement
            vars[name] = eval_expression(value)
    return vars

Exercises for the Reader

To make sure you actually understood this, the best way is to continue this work yourself. LOLE will stay forever unfinished so you can make it do interesting things while you learn by doing. Here are a few questions for you to explore in no particular order:

How would you make a LOLE program exit?
How would that exiting program return a result?
The FunctionCall and IfStmt classes are very sketchy. How would you improve them?
Could the logic in run_program to handle different types of statements be moved out, into classes like FunctionCall?
How would you implement comparisons so if is more useful?
What happens if you create a variable called 'if'?
What happens if you have a variable and a function with the same name? Is that good?
How would you implement lists? How about dictionaries?
How would you implement iteration?
How would you support comments?

Paul McGuire / 2017-04-12 13:29:

Very nice article! Good use of pyparsing features, including the new `pyparsing_common` expressions. You should post this link on reddit.

Roberto Alsina / 2017-04-12 13:33:

It's usually better if someone else posts to reddit. Could you? :-)

Ron Barak / 2017-04-13 16:24:

Erratum:
So we now have the world's most usesess language -> So we now have the world's most useless language

Roberto Alsina / 2017-04-16 18:32:

Thx, fixed, rolling out soon.

A. Jesse Jiryu Davis / 2017-04-13 21:12:

Beautifully explained, great work. I read to the end and enjoyed it even though I have no plans to write a parser. I bet that if I did this introduction would be invaluable.