Creating Languages For Dummies
Intro
I don't have the usual programmer's education. I studied maths, and then dropped out of that, and am mostly self-taught. So, there are some parts of programming I always saw wearily, thinking to myself that I really should go to school to learn them. One remarkable such area is parsing and implementing languages.
Well... sure, school is always a good idea, but this is not that hard. In this article I will explain how to go from nothing to a functioning, extensible language, using Python and PyParsing. If you are as scared of grammars, parsers and all that jazz as I used to be, come along, it's pretty simple,
[TOC]
Grammar (part I)
What is a grammar? It's what defines what is and what is not valid in our language, and it's the hardest part to get right, mostly because it's all about making decisions.
First decision: language name. I will call this language Least Optimized Language Ever, or LOLE
I will describe the language informally, then show you how to describe that grammar formally using PyParsing.
PyParsing is "constructive", you start creating small blocks, then combine those blocks into larger things.
Identifiers
Some things in LOLE will have names. For example, variables, or functions. So, we need to define what a name is. Most languages allow you to use identifiers that start with a letter, and then use letters or numbers. Let's go even simpler: identifiers are words. Just letters.
identifier = Word(alphas)
Word
is a PyParsing object that means "these characters". alphas
is a constant that means "letters". So: "some letters".
You can actually try this in an interactive interpreter:
>>> from pyparsing import *
>>> identifier = Word(alphas)
>>> identifier.parseString("foo")
(['foo'], {})
identifier.parseString
takes a string, and tries to parse it as an identifier. So, "foo"
is an identifier, and we get the match. Nice!
What happens if we try something that doesn't match?
>>> identifier.parseString("42")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected W:(ABCD...) (at char 0), (line:1, col:1)
Surprise! You get an error. It's a pretty awful error, but it basically says "I was expecting a thing made of letters (ABCD...) and I did not get it".
Because error messages are important let's make that better already. There's always time to develop bad habits later.
>>> identifier = Word(alphas).setName('identifier')
>>> identifier.parseString("42")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected identifier (at char 0), (line:1, col:1)
Now it says it was expecting an identifier, and it did not get one. So it's more clear. Let's try one last thing:
>>> identifier.parseString("foo bar")
(['foo'], {})
You may say "hey, there are two identifiers there, and it only found one!" and you are right. That's because we are parsing identifier not identifierS. We told it to understand "foo bar"
as an identifier, and it did just that. By default PyParsing will ignore anything left in the input after it found what it's looking for. If you want to make sure the whole input parses correctly, then you have to say so:
>>> identifier.parseString("foo bar", parseAll=True)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected end of text (at char 4), (line:1, col:5)
And yes, that is a valid error. We asked to know if "foo bar"
is an identifier. It's not, because to be one it would have to end after "foo"
because identifiers can't have spaces, and it does not.
Let's move to something more challenging....
Literals: Strings and Numbers
LOLE has literals for two types: strings and numbers. They are the usual thing, except that strings are python style, either single or double quoted, and support escaping quotes inside them.
I could now explain how to define numbers in terms of literals and combinations of symbols and numbers and so on. And strings as a list of characters surrounded by quotes. But I am going to tell you something much more important: learn your library.
literal = quotedString ^ pyparsing_common.number
quotedString
is provided by PyParsing to define "strings as usual". And number
defines "numbers as usual" so we don't have to do it. Because most languages need this, so having to do it each time is dumb.
We are also using ^
there. That means "one or the other". So 'literal' will successfully parse things that are strings and also things that are numbers.
>>> literal.parseString("-43.21E12")
([-43210000000000.0], {})
>>> literal.parseString(r"'this is a string with a \' in it!'")
(["'this is a string with a \\' in it!'"], {})
And of course, if you parse something else, it will fail:
>>> literal.parseString(r"foo")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected {quotedString using single or double quotes ^ {real number with scientific notation | real number | signed integer}} (at char 0), (line:1, col:1)
Like the error says, it was expecting either a string quoted with single or double quotes, or one of a few kinds of numbers.
So, now that we have identifiers and values, we can make this more like a real language and implement ...
Assignments
This is pretty simple:
foo = 42
So, an assignment is (for now):
- An identifier
- An equal sign
- A literal value
And this is how it looks in PyParsing:
>>> assignment = identifier + Literal('=') + literal
>>> assignment.parseString('foo = 42')
(['foo', '=', 42], {})
Program
So far, we only support programs that do one thing and then end. That is not very useful. So let's define what a program is: "A list of statements separated with semicolons". So far, the only kind of statement we have is assignment, so it will look like this:
program = delimitedList(assignment, delim=';')
As usual, PyParsing does the heavy lifting for us: see what delimitedList
does.
And now we can parse slightly more interesting things:
>>> program.parseString('foo = 42; bar="bat"')
(['foo', '=', 42, 'bar', '=', '"bat"'], {})
Let's stop defining grammar now for a little bit (we will come back to it) and think about how to use this thing we are creating.
Execution (Part I)
So we now have the world's most useless language, that only supports syntax for assigning values to variables. How do we make it do that?
Well, we need to take the output of parseString
and process it, doing what our "program" says we should do. That's an interpreter.
So, how do we "run" ['foo', '=', 42, 'bar', '=', '"bat"']
?
For starters, I don't really like it. There are two statements there but we have six elements and no clear boundary between one statement and another. To fix that, we can change assignment:
>>> assignment = Group(identifier + Literal('=') + literal)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString('foo = 42; bar="bat"')
[['foo', '=', 42], ['bar', '=', '"bat"']]
Unsurprisingly, what Group
does is group things, so we now have each statement as its own separate thing.
So, we could "run" this program like this:
>>> def run_program(code):
... parsed = program.parseString(code)
... vars = {}
... for name, _, value in parsed:
... vars[name] = value
... return vars
>>> run_program('foo = 42; bar="bat"')
{'foo': 42, 'bar': '"bat"'}
And hey, that means this is an interpreter already. Not a very good one, tho!
Grammar (Part II)
Now, how about this LOLE program? foo = 42; bar = foo
As we have defined the language so far, that's illegal, because on the right side of the equal sign LOLE requires a literal, which is either a string or a number. What we really want on the right side is more general: an expression. An expression is something that eventually resolves to a value. So, variable names are expressions, function calls (when we have them) will be expressions, and things like 2 + 2
would be expressions if we implemented them.
So, let's introduce expressions, which are either literals, or identifiers (so far):
>>> expression = literal ^ identifier
>>> assignment = Group(identifier + Literal('=') + expression)
>>> program = delimitedList(assignment, delim=';')
>>> print program.parseString("foo = 42; bar = foo")
[['foo', '=', 42], ['bar', '=', 'foo']]
Execution (Part II)
So, a LOLE expression in an assignment can currently be three kinds of things:
- A number:
42
- An identifier:
'foo'
- A string:
'"foo"'
What happens if we run this through run_program
?
>>> run_program("foo = 42; bar = foo")
{'foo': 42, 'bar': 'foo'}
That is not really right, is it? We would want bar
to be set to 42
. The problem is that we are not calculating the value of expressions, but just passing them along. We need eval_expression
:
>>> vars = {}
>>> def eval_expression(expr):
... if isinstance(expr, str):
... if expr[0] in '"\'': # string literal
... return expr[1:-1] # remove quotes
... else: # identifier
... return vars.get(expr)
... # Number, return as-is
... return expr
This solves that problem of "resolving" variables, and also that of extraneous quotes in our string literals. And then we need to use it in run_program, too:
>>> def run_program(code):
... parsed = program.parseString(code)
... for name, _, value in parsed:
... vars[name] = eval_expression(value)
... return vars
>>> run_program("foo = 'quux'; bar = foo")
{'foo': 'quux', 'bar': 'quux'}
Grammar (Part III)
Let's make LOLE slightly more interesting by adding functions. Since creating grammars is mostly a matter of making decisions, let's make a couple.
- A function call is
identifier()
oridentifier(arg)
oridentifier(foo=expr, bar=expr)
. If a function takes more than one argument, it has to use keyword arguments. - Functions are not defined in LOLE, they are provided by Python.
While keyword arguments look a lot like assignments, they are not the same thing, so we define them separately.
>>> kwarg = Group(identifier + Literal('=') + expression)
>>> function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
>>> print function_call.parseString('foo(bar=1, baz=2)')
[['foo', '(', ['bar', '=', 1], ['baz', '=', 2], ')']]
>>> print function_call.parseString('foo(bar=1)')
[['foo', '(', ['bar', '=', 1], ')']]
>>> print function_call.parseString('foo(42)')
[['foo', '(', 42, ')']]
>>> print function_call.parseString('foo()')
[['foo', '(', ')']]
This shows we have a number of problems in how we process our code.
How can we distinguish ['foo', '(', ')']
(a function call) from ['foo', '=', 4]
(an assignment)? Clearly the output of parsing a function call is not distinctive enough. We can improve this by using a PyParsing feature called parseActions
.
What parseActions
do is process the result of parsing and allow you to use the knowledge of what you are parsing to do The Right Thing (TM).
class FunctionCall():
pass
def function_call_action(tokens):
f_call = FunctionCall()
f_call.name = tokens[0][0]
f_call.args = tokens[0][1:]
return f_call
function_call = Group(identifier + Literal('(') + Optional(expression ^ delimitedList(kwarg)) + Literal(')'))
function_call = function_call.setParseAction(function_call_action)
>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
['(', ')']
Do we really want args to contain the parenthesis? I say no. While they are useful for defining the syntax, we really don't need them at all, so we can use suppress
to make our parser's output cleaner:
function_call = Group(identifier + Literal('(').suppress() + Optional(expression ^ delimitedList(kwarg)) + Literal(')').suppress())
function_call = function_call.setParseAction(function_call_action)
>>> print function_call.parseString('foo()')[0].name
foo
>>> print function_call.parseString('foo()')[0].args
[]
>>> print function_call.parseString('foo(42)')[0].args
[42]
>>> print function_call.parseString('foo(a=5, b=7)')[0].args
[(['a', '=', 5], {}), (['b', '=', 7], {})]
And since calling functions produces a value, it's a type of expression. And since we can just want to run a function and not care about what it returns, it's also a type of statement. Let's put all our grammar in one place:
identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward()
expression = literal ^ function_call
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression)
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^ delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)
statement = assignment ^ function_call
program = delimitedList(statement, delim=';')
The only new things are related to function_call
. Since its definition needs to know about expression
, and expression
is defined using function_call
we have a bit of a chicken-egg problem. The solution is to use Forward
, so we say "hey, there is a thing called function_call
, I will explain later".
Then, when actually defining function_call
it uses <<
so the definition is applied to the Forward
placeholder.
Execution (Part III)
Since we added function calls ... how do we execute those functions? By improving our run_program
of course. So far, we treated every statement as an assignment, but now it can be either an assignment or a function_call
.
Also, our eval_expression
knew how to evaluate literals and variables, but not function calls, so that will need some tweaking also.
Let's do eval_expression
first:
# This dictionary exposes functions to the LOLE interpreter
funcs = {
'sin': sin,
}
def eval_expression(expr):
if isinstance(expr, str):
if expr[0] in '"\'': # string literal
return expr[1:-1] # remove quotes
else: # identifier
return vars.get(expr)
elif isinstance(expr, FunctionCall):
a = []
kw = {}
# Process function call arguments
for arg in expr.args:
if isinstance(arg, list):
# argument is a kwarg
kw[arg[0]] = eval_expression(arg[2])
else:
# argument is a literal value
a.append(eval_expression(arg))
return funcs[expr.name](*a, **kw)
# Number, return as-is
return expr
Sure, it got a little more complicated, but basically, all it does is add a special case for when the expression
is a FunctionCall
and handle that case by calling the named function with the given arguments.
Since arguments are expressions themselves, we need to recursively call eval_expression
to get their value.
And now we need to do something similar with run_program:
def run_program(code):
parsed = program.parseString(code)
for statement in parsed:
if isinstance(statement, FunctionCall):
eval_expression(statement)
else: # assignment
name, _, value = statement
vars[name] = eval_expression(value)
return vars
>>> print run_program('x = sin(0); sin(1)')
{'x': 0.0}
Again, while iterating over the statements, we check if the statement is a function call, in which case we evaluate it, or if it's an assignment, in which case we evaluate the right side and assign that value to the variable named on the left side.
This shows how extending the language works. If we add a new type of expression
in the grammar, we add a new branch to eval_expression
. If we add a new type of statement
in the grammar, we add a branch to run_program
. That is all there is to it. To prove it, let's add a more complex structure.
Grammar (Part IV)
What kind of language doesn't support an if
statement? Well this one, until now. So let's add it. It could look like this:
if expression then { statements } [else { statements }]
The brackets around else
mean that part is optional, since not all ifs need an else
clause. The conditionally executed parts are a list of statements surrounded by braces, C-style.
Let's express this in PyParsing:
code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block))
statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')
program = delimitedList(statement, delim=';')
>>> print program.parseString('if 0 then {x=2} else {x=3}')
['if', 0, 'then', [['x', '=', 2]], 'else', [['x', '=', 3]]]
Again, we have to use Forward
, now for code_block
since if_stmt
contains code_block
which uses statement
but if_stmt
is also a statement
. Also, some calls to setName
so errors are nice and tidy, and use the Keyword
element we had not seen before.
Just like with function calls, we end up with just a bunch of things in a list, which means when we want to execute it it will be annoying. We can improve it in exactly the same way via setParseAction
:
class IfStmt():
pass
def if_stmt_action(tokens):
if_stmt = IfStmt()
if_stmt.condition = tokens[0][1]
if_stmt.then_block = tokens[0][3]
if len(tokens[0]) > 4: # has an else clause
if_stmt.else_block = tokens[0][5]
else:
if_stmt.else_block = None
return if_stmt
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)
>>> if_stmt = print program.parseString('if 0 then {x=2} else {x=3}')[0]
>>> print if_stmt.condition
0
>>> print if_stmt.then_block
['x', '=', 2]
>>> print if_stmt.else_block
['x', '=', 3]
Execution (Part IV)
And, since we added a new kind of statement, we need to handle it in run_program
:
def run_program(code):
if isinstance(code, str):
parsed = program.parseString(code)
else:
parsed = code
for statement in parsed:
if isinstance(statement, FunctionCall):
eval_expression(statement)
elif isinstance(statement, IfStmt):
if eval_expression(statement.condition):
run_program(statement.then_block)
elif statement.else_block is not None:
run_program(statement.else_block)
else: # assignment
name, _, value = statement
vars[name] = eval_expression(value)
return vars
>>> print run_program('if 0 then {x=2} else {x=3}')
{'x': 3}
>>> print run_program('if 1 then {x=2} else {x=3}')
{'x': 2}
The new bits:
There is a branch to handle if statements. Since if statements contain code blocks, which are basically the same as a program, we end up calling
run_program
recursively.To avoid complicating the example too much,
run_program
now needs to handle both unparsed programs (when we want to run a script) and parsed code, when it's called recursively. That is managed by theif
at the beginning of the function.
Conclusion and Full Source Code
And there you go, we now have an extensible language that can call arbitrary Python functions! And here is the full listing:
from math import sin
from pyparsing import *
from pyparsing import pyparsing_common
class FunctionCall():
pass
def function_call_action(tokens):
f_call = FunctionCall()
f_call.name = tokens[0][0]
f_call.args = tokens[0][1:]
return f_call
class IfStmt():
pass
def if_stmt_action(tokens):
if_stmt = IfStmt()
if_stmt.condition = tokens[0][1]
if_stmt.then_block = tokens[0][3]
if len(tokens[0]) > 4: # has an else clause
if_stmt.else_block = tokens[0][5]
else:
if_stmt.else_block = None
return if_stmt
identifier = Word(alphas)
literal = quotedString ^ pyparsing_common.number
function_call = Forward().setName('function call')
expression = (literal ^ function_call).setName('expression')
kwarg = Group(identifier + Literal('=') + expression)
assignment = Group(identifier + Literal('=') + expression).setName('assignment')
function_call << Group(identifier + Literal('(').suppress() + Optional(expression ^ delimitedList(kwarg)) + Literal(')').suppress()).setParseAction(function_call_action)
code_block = Forward().setName('code block')
if_stmt = Group(Keyword('if') + expression + Keyword('then') + code_block + Optional(Keyword('else') + code_block)).setParseAction(if_stmt_action)
statement = if_stmt ^ assignment ^ function_call
code_block << Group(Literal('{').suppress() + delimitedList(statement, delim=';') + Literal('}').suppress()).setName('code block')
program = delimitedList(statement, delim=';')
funcs = {
'sin': sin,
}
vars = {}
def eval_expression(expr):
if isinstance(expr, str):
if expr[0] in '"\'': # string literal
return expr[1:-1] # remove quotes
else: # identifier
return vars.get(expr)
elif isinstance(expr, FunctionCall):
a = []
kw = {}
# Process function call arguments
for arg in expr.args:
if isinstance(arg, list):
# argument is a kwarg
kw[arg[0]] = eval_expression(arg[2])
else:
# argument is a literal value
a.append(eval_expression(arg))
return funcs[expr.name](*a, **kw)
# Number, return as-is
return expr
def run_program(code):
if isinstance(code, str):
parsed = program.parseString(code)
else:
parsed = code
for statement in parsed:
if isinstance(statement, FunctionCall):
eval_expression(statement)
elif isinstance(statement, IfStmt):
if eval_expression(statement.condition):
run_program(statement.then_block)
elif statement.else_block is not None:
run_program(statement.else_block)
else: # assignment
name, _, value = statement
vars[name] = eval_expression(value)
return vars
Exercises for the Reader
To make sure you actually understood this, the best way is to continue this work yourself. LOLE will stay forever unfinished so you can make it do interesting things while you learn by doing. Here are a few questions for you to explore in no particular order:
- How would you make a LOLE program exit?
- How would that exiting program return a result?
- The
FunctionCall
andIfStmt
classes are very sketchy. How would you improve them? - Could the logic in
run_program
to handle different types of statements be moved out, into classes likeFunctionCall
? - How would you implement comparisons so
if
is more useful? - What happens if you create a variable called 'if'?
- What happens if you have a variable and a function with the same name? Is that good?
- How would you implement lists? How about dictionaries?
- How would you implement iteration?
- How would you support comments?
Very nice article! Good use of pyparsing features, including the new `pyparsing_common` expressions. You should post this link on reddit.
It's usually better if someone else posts to reddit. Could you? :-)
Erratum:
So we now have the world's most usesess language -> So we now have the world's most useless language
Thx, fixed, rolling out soon.
Beautifully explained, great work. I read to the end and enjoyed it even though I have no plans to write a parser. I bet that if I did this introduction would be invaluable.
cool :-)
A great guide for those people who are still starting in creating a language that would best fit on their standards. At least, it would be a great factor for them to even make their performance even better and would help them broaden their mind when it comes to programming.