Lately, I started learning Haskell, and I’m really enjoying the insight this provides. I’ll probably expand on that another time. Today I’d like to share with you the first bigger program I wrote in Haskell.
Inspiration for this post, as well as some of the initial ideas behind this post and code are taken from the “Error handling” chapter of the “Real World Haskell” book by Bryan O’Sullivan, Don Stewart, and John Goerzen. I highly recommend this book for everybody wanting to learn Haskell.
The Brainfuck language
When learning Haskell I posed myself a goal — write an optimizing compiler for the Brainfuck language. If you’re not familiar with Brainfuck — it’s a super simple toy language. It operates on an array of memory cells, each initially set to zero. There is a pointer, initially pointing to the first memory cell. You can manipulate the pointer and cells using one of 8 operations:
Symbol | Meaning |
---|---|
> |
Move the pointer to the right |
< |
Move the pointer to the left |
+ |
Increment the memory cell under the pointer |
- |
Decrement the memory cell under the pointer |
. |
Output the character signified by the cell at the pointer |
, |
Input a character and store it in the cell at the pointer |
[ |
Jump past the matching ] if the cell under the pointer is 0 |
] |
Jump back to the matching [ if the cell under the pointer is nonzero |
All other characters are considered comments.
If you wander why such a name is used to describe this language — let me present the “Hello World” of Brainfuck. I think it makes the name painfully obvious:
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>
---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
Given the language is so simple, there’s a lot of place for compiler optimizations - a great opportunity to learn about them.
Basic parser
But let’s focus on the parser itself. I know there are already great parsers in Haskell — most notably Parsec and Attoparsec. But I wanted to make my own to learn a little bit more about monads, and using Haskell in general.
Let me first define two types: the AST
that will be our target and
ParseError
that we’ll use to differentiate erroneous results:
data ParseError = Unexpected Char
| UnexpectedEof
| Unknown
deriving (Eq, Show)
data AST = PtrMove Int
| MemMove Int
| Output
| Input
| Loop [AST]
deriving (Eq, Show)
For the monads I’ll be using the
mtl
package.
Our Parser
monad will be composed of the inner State
monad holding the
string that is being parsed and an ExceptT
monad transformer to handle parsing
errors. In order to be able to simply derive the required monad typeclasses
we need to turn on the GeneralizedNewtypeDeriving
language extension.
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
newtype Parser a = P { runP :: ExceptT ParseError (State String) a
} deriving ( Monad
, MonadError ParseError
, MonadState String
)
Let’s now define the function we’ll use to run our monad. It simply untangles the various layers and tidies the result:
runParser :: Parser a -> String -> Either ParseError (a, String)
runParser parser str =
case (runState . runExceptT . runP) parser str of
(Left err, _) -> Left err
(Right r, rest) -> Right (r, rest)
Now let’s define the basic building parser — satisfy
that will allow us to
parse a character if it satisfies a predicate:
satisfy :: (Char -> Bool) -> Parser Char
satisfy predicate = do
s <- get
case s of
x:xs | predicate x -> do
put xs
return x
x:_ -> throwError (Unexpected x)
[] -> throwError UnexpectedEof
We retrieve the parser state (the string we’re parsing) from the underlying
State
monad and we check if the first character in the string matches. If it
does we update the state, and return the matched Char
. When we run the parser
it will be wrapped in a nice Right Char
value. If the predicate does not match
we throw and Unexpected Char
error. We can throw errors thanks to out
ExceptT
monad transformer. It will make the parser return Left (Unexpected Char)
if this branch will be invoked. If there was no input to parse to begin
with we throw UnexpectedEof
error.
Parser combinators
With those basic blocks in place we can start to think of ways to put together multiple parsers in order to process bigger chunks of input.
We need a way to choose between two parsers. We want to be able to make our
parser to try running one parser and if it fails to run a different one. We’ll
define an option
function that will do exactly that. We can also thing of it
as a way to combine two parsers together into one.
option :: Parser a -> Parser a -> Parser a
option parser1 parser2 = do
s <- get
parser1 `catchError` \_ -> do
put s
parser2
Once again we retrieve the state. Then we try to parse with the first parser.
The catchError
function is provided thanks to the ExceptT
transformer. It will try the thing on the left and it it fails it will process
the function on the right passing the error as the argument. We actually don’t
care what the error was, so we happily continue by restoring the initial state
(because we want to try to parse the same input again) and running the
second parser.
With this we can easily define a function that will receive a list of parsers
and try to apply them in sequence, returning the first successful parse. Let’s
call it choice
, because it will choose one parser from many.
choice :: [Parser a] -> Parser a
choice = foldr option (throwError Unknown)
The only thing that is not straightforward in this function is the default
value. By default we decide that our parser will be failing with an Unknown
error. We then fold over the given list of parsers matching them one by one,
until one succeeds. Thanks to laziness we don’t have to care about the parsers
that come after the one that succeeded. When folding over the parsers we’ll
catch the initial Unknown
error in our option
function if we’ll have any
parsers to try. If you pass an empty list of parsers to try we’ll return this
Unknown
error — because we can’t know what failed if we had nothing to try
out.
Now, I think, we need a way to run a single parser multiple times. Let’s define
a many
function that will take a parser and try to run it as many times as
possible, returning a list of the successfully parsed values. It may be a short
function, but I think it’s the most complicated one in this blog post. I’ll try
to explain it thoroughly.
many :: Parser a -> Parser [a]
many parser = recurse `catchError` \_ -> return []
where recurse = do
result <- parser
rest <- many parser
return (result:rest)
The complexity comes from the fact, that it involves some fancy manual recursion.
What is happening here? First we try to recurse
(whatever that means — let’s
ignore this for now), if that fails we simply return an empty list, ignoring the
error using the familiar catchError
function. So what is happening when we’re
recursing? First, we apply the parser once, and extract the result. Next we try
to recursively apply the parser more times extracting the rest of the parsable
input. Finally we cons the result of single parser run onto the rest of what we
could parse.
How does this really work? Let’s look at a simple example, and walk through it
step by step. Let’s say we try to parse letter 'a'
from the string "aab"
.
We go into our many
function, and immediately go into the recurse
helper.
There we apply the parser once getting the letter 'a'
as the result.
If we wanted to look at what we’ll return it will be 'a':rest
, where rest
is
whatever recursively calling ourselves will return.
So we carry on, and recursively enter our function once again, but this time
around we only have "ab"
as the input. Once again we’re able to parse another
'a'
. Now we can say we’ve built something like 'a':'a':rest
as our result.
Yet again we recurse into our function, this time with only the string "b"
as
the input. With this, it’s obvious that trying to parse 'a'
will fail. Ah! We
enter the error handling code and simply return an empty list. We can now go
back through our recursive applications and compose the final result:
'a':'a':[]
which simply is ['a', 'a']
. We also have "b"
left as our input.
It works!
Pff, that was quite complex. Hopefully that will be all the combinators that we’ll need for now.
Let’s parse some brainfuck!
Now that we’ve build ourselves some basic blocks, let’s look at how we can parse
a brainfuck program.
We need a basic parser that can handle a single brainfuck command. Let’s call it
parseOne
:
parseOne :: Parser AST
parseOne = choice [ transform '>' (PtrMove 1)
, transform '<' (PtrMove (-1))
, transform '+' (MemMove 1)
, transform '-' (MemMove (-1))
, transform ',' Output
, transform '.' Input
, parseLoop
]
where transform char ast = expect char >> return ast
expect char = satisfy (== char)
We define two helpers: expect
that simply expects to find a certain character
using our satisfy
function, that we defined earlier, and transform
that will
try to consume given character and return the ast block if that succeeds.
Using those helpers we define parsers for all the basic brainfuck commands. Then
we try to apply all of them in sequence using the choice
combinator we defined
earlier, until one of them is able to parse some input.
We also used a parseLoop
parser that handles (guess what!) parsing loops. Let’s define it
now:
parseLoop :: Parser AST
parseLoop = do
consume '['
steps <- many parseOne
consume ']'
return (Loop steps)
where consume char = satisfy (== char) >> return ()
I think it’s pretty straightforward — we first consume the opening bracket,
then using our many
combinator we try to parse as many elements as possible
(using the parseOne
parser we defined earlier), then we expect to find the
closing bracket. Finally we return the ast for the loop. The consume
helper
function is also quite simple. It tries to parse the character we provide, and
if it succeeds it returns the unit, because we don’t care about the actual
result.
Notice that those two functions are mutually recursive: parseLoop
will call
parseOne
, and parseOne
will call parseLoop
. We need this in order to
handle nested loops.
We need yet another small function to be able to parse whole programs — a way
to express that we expect the parsing to finish. Let’s define an eof
function
for this:
eof :: Parser ()
eof = do
s <- get
case s of
[] -> return ()
_ -> throwError UnexpectedEof
Once again it’s quite simple. We inspect the current state of the parser. If
it’s an empty string we have indeed reached the end and return a unit, because we
don’t have anything meaningful to return. In case there’s still something left
to parse we throw the UnexpectedEof
error. You may find the choice of error
puzzling - why the UnexpectedEof
error when we still have some things to
parse? This choice will become clear when you thing about the situation we’ll
reach this part of the code. Let’s say we want to parse a malformed loop:
"[.+,-"
. What will happen when we try to parse it with our parseLoop
parser?
It will fail when trying to match the closing bracket, and leave unparsed input.
If we apply the eof
parser at this point expecting the parsing to finish,
it’s clear that we should raise UnexpectedEof
error.
Finally we can define a parser for brainfuck:
parseAll :: Parser [AST]
parseAll = do
exprs <- many parseOne
eof
return exprs
We’re parsing all the simple commands. Finally we expect that we’ve parsed everything there was to parse, and thus reached eof.
With that parser we can assemble a parse
function to parse a string
representing the brainfuck program, and return the parsed ast or the error:
parse :: String -> Either ParseError [AST]
parse = fmap fst . runParser parseAll . filter isMeaningful
where isMeaningful = (`elem` "><+-,.[]")
We first filter the incoming string to only keep the meaningful brainfuck commands (everything else is a comment), run our parser, and finally extract the result.
Conclusion
Haskell is known for excellent parsers, and we can see why. In under 100 lines
of code we were able to define a full featured parser with error handling,
that’s easy and intuitive to use. There are many places this code could be
optimized, or where it could use more category theory spicing (for example we could
use many
as defined by Alternative
class from Control.Applicative
by
making our parser member of those classes, or reduce the choice
function to a
simple asum
on our parser type). But I think this implementation is quite clear and
focuses on what’s important, instead of focusing on intricacies of Haskell’s
typeclasses — as much as it’s interesting it was not the point of this post.
You can also get all the code used in this post.