Simple Monadic Parser in Haskell

Lately, I started learning Haskell, and I’m really enjoying the insight this provides. I’ll probably expand on that another time. Today I’d like to share with you the first bigger program I wrote in Haskell.

Inspiration for this post, as well as some of the initial ideas behind this post and code are taken from the “Error handling” chapter of the “Real World Haskell” book by Bryan O’Sullivan, Don Stewart, and John Goerzen. I highly recommend this book for everybody wanting to learn Haskell.

The Brainfuck language

When learning Haskell I posed myself a goal — write an optimizing compiler for the Brainfuck language. If you’re not familiar with Brainfuck — it’s a super simple toy language. It operates on an array of memory cells, each initially set to zero. There is a pointer, initially pointing to the first memory cell. You can manipulate the pointer and cells using one of 8 operations:

Symbol Meaning
> Move the pointer to the right
< Move the pointer to the left
+ Increment the memory cell under the pointer
- Decrement the memory cell under the pointer
. Output the character signified by the cell at the pointer
, Input a character and store it in the cell at the pointer
[ Jump past the matching ] if the cell under the pointer is 0
] Jump back to the matching [ if the cell under the pointer is nonzero

All other characters are considered comments.

If you wander why such a name is used to describe this language — let me present the “Hello World” of Brainfuck. I think it makes the name painfully obvious:

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>
---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.

Given the language is so simple, there’s a lot of place for compiler optimizations - a great opportunity to learn about them.

Basic parser

But let’s focus on the parser itself. I know there are already great parsers in Haskell — most notably Parsec and Attoparsec. But I wanted to make my own to learn a little bit more about monads, and using Haskell in general.

Let me first define two types: the AST that will be our target and ParseError that we’ll use to differentiate erroneous results:

data ParseError = Unexpected Char
                | UnexpectedEof
                | Unknown
                deriving (Eq, Show)

data AST = PtrMove Int
         | MemMove Int
         | Output
         | Input
         | Loop [AST]
         deriving (Eq, Show)

For the monads I’ll be using the mtl package. Our Parser monad will be composed of the inner State monad holding the string that is being parsed and an ExceptT monad transformer to handle parsing errors. In order to be able to simply derive the required monad typeclasses we need to turn on the GeneralizedNewtypeDeriving language extension.

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

newtype Parser a = P { runP :: ExceptT ParseError (State String) a
                     } deriving ( Monad
                                , MonadError ParseError
                                , MonadState String
                                )

Let’s now define the function we’ll use to run our monad. It simply untangles the various layers and tidies the result:

runParser :: Parser a -> String -> Either ParseError (a, String)
runParser parser str =
  case (runState . runExceptT . runP) parser str of
    (Left err, _)   -> Left err
    (Right r, rest) -> Right (r, rest)

Now let’s define the basic building parser — satisfy that will allow us to parse a character if it satisfies a predicate:

satisfy :: (Char -> Bool) -> Parser Char
satisfy predicate = do
  s <- get
  case s of
    x:xs | predicate x -> do
             put xs
             return x
    x:_ -> throwError (Unexpected x)
    []  -> throwError UnexpectedEof

We retrieve the parser state (the string we’re parsing) from the underlying State monad and we check if the first character in the string matches. If it does we update the state, and return the matched Char. When we run the parser it will be wrapped in a nice Right Char value. If the predicate does not match we throw and Unexpected Char error. We can throw errors thanks to out ExceptT monad transformer. It will make the parser return Left (Unexpected Char) if this branch will be invoked. If there was no input to parse to begin with we throw UnexpectedEof error.

Parser combinators

With those basic blocks in place we can start to think of ways to put together multiple parsers in order to process bigger chunks of input.

We need a way to choose between two parsers. We want to be able to make our parser to try running one parser and if it fails to run a different one. We’ll define an option function that will do exactly that. We can also thing of it as a way to combine two parsers together into one.

option :: Parser a -> Parser a -> Parser a
option parser1 parser2 = do
  s <- get
  parser1 `catchError` \_ -> do
    put s
    parser2

Once again we retrieve the state. Then we try to parse with the first parser. The catchError function is provided thanks to the ExceptT transformer. It will try the thing on the left and it it fails it will process the function on the right passing the error as the argument. We actually don’t care what the error was, so we happily continue by restoring the initial state (because we want to try to parse the same input again) and running the second parser.

With this we can easily define a function that will receive a list of parsers and try to apply them in sequence, returning the first successful parse. Let’s call it choice, because it will choose one parser from many.

choice :: [Parser a] -> Parser a
choice = foldr option (throwError Unknown)

The only thing that is not straightforward in this function is the default value. By default we decide that our parser will be failing with an Unknown error. We then fold over the given list of parsers matching them one by one, until one succeeds. Thanks to laziness we don’t have to care about the parsers that come after the one that succeeded. When folding over the parsers we’ll catch the initial Unknown error in our option function if we’ll have any parsers to try. If you pass an empty list of parsers to try we’ll return this Unknown error — because we can’t know what failed if we had nothing to try out.

Now, I think, we need a way to run a single parser multiple times. Let’s define a many function that will take a parser and try to run it as many times as possible, returning a list of the successfully parsed values. It may be a short function, but I think it’s the most complicated one in this blog post. I’ll try to explain it thoroughly.

many :: Parser a -> Parser [a]
many parser = recurse `catchError` \_ -> return []
  where recurse = do
          result <- parser
          rest   <- many parser
          return (result:rest)

The complexity comes from the fact, that it involves some fancy manual recursion. What is happening here? First we try to recurse (whatever that means — let’s ignore this for now), if that fails we simply return an empty list, ignoring the error using the familiar catchError function. So what is happening when we’re recursing? First, we apply the parser once, and extract the result. Next we try to recursively apply the parser more times extracting the rest of the parsable input. Finally we cons the result of single parser run onto the rest of what we could parse.

How does this really work? Let’s look at a simple example, and walk through it step by step. Let’s say we try to parse letter 'a' from the string "aab". We go into our many function, and immediately go into the recurse helper. There we apply the parser once getting the letter 'a' as the result. If we wanted to look at what we’ll return it will be 'a':rest, where rest is whatever recursively calling ourselves will return. So we carry on, and recursively enter our function once again, but this time around we only have "ab" as the input. Once again we’re able to parse another 'a'. Now we can say we’ve built something like 'a':'a':rest as our result. Yet again we recurse into our function, this time with only the string "b" as the input. With this, it’s obvious that trying to parse 'a' will fail. Ah! We enter the error handling code and simply return an empty list. We can now go back through our recursive applications and compose the final result: 'a':'a':[] which simply is ['a', 'a']. We also have "b" left as our input. It works!

Pff, that was quite complex. Hopefully that will be all the combinators that we’ll need for now.

Let’s parse some brainfuck!

Now that we’ve build ourselves some basic blocks, let’s look at how we can parse a brainfuck program. We need a basic parser that can handle a single brainfuck command. Let’s call it parseOne:

parseOne :: Parser AST
parseOne = choice [ transform '>' (PtrMove 1)
                  , transform '<' (PtrMove (-1))
                  , transform '+' (MemMove 1)
                  , transform '-' (MemMove (-1))
                  , transform ',' Output
                  , transform '.' Input
                  , parseLoop
                  ]
  where transform char ast = expect char >> return ast
        expect char = satisfy (== char)

We define two helpers: expect that simply expects to find a certain character using our satisfy function, that we defined earlier, and transform that will try to consume given character and return the ast block if that succeeds. Using those helpers we define parsers for all the basic brainfuck commands. Then we try to apply all of them in sequence using the choice combinator we defined earlier, until one of them is able to parse some input.

We also used a parseLoop parser that handles (guess what!) parsing loops. Let’s define it now:

parseLoop :: Parser AST
parseLoop = do
  consume '['
  steps <- many parseOne
  consume ']'
  return (Loop steps)
  where consume char = satisfy (== char) >> return ()

I think it’s pretty straightforward — we first consume the opening bracket, then using our many combinator we try to parse as many elements as possible (using the parseOne parser we defined earlier), then we expect to find the closing bracket. Finally we return the ast for the loop. The consume helper function is also quite simple. It tries to parse the character we provide, and if it succeeds it returns the unit, because we don’t care about the actual result.

Notice that those two functions are mutually recursive: parseLoop will call parseOne, and parseOne will call parseLoop. We need this in order to handle nested loops.

We need yet another small function to be able to parse whole programs — a way to express that we expect the parsing to finish. Let’s define an eof function for this:

eof :: Parser ()
eof = do
  s <- get
  case s of
    [] -> return ()
    _  -> throwError UnexpectedEof

Once again it’s quite simple. We inspect the current state of the parser. If it’s an empty string we have indeed reached the end and return a unit, because we don’t have anything meaningful to return. In case there’s still something left to parse we throw the UnexpectedEof error. You may find the choice of error puzzling - why the UnexpectedEof error when we still have some things to parse? This choice will become clear when you thing about the situation we’ll reach this part of the code. Let’s say we want to parse a malformed loop: "[.+,-". What will happen when we try to parse it with our parseLoop parser? It will fail when trying to match the closing bracket, and leave unparsed input. If we apply the eof parser at this point expecting the parsing to finish, it’s clear that we should raise UnexpectedEof error.

Finally we can define a parser for brainfuck:

parseAll :: Parser [AST]
parseAll = do
  exprs <- many parseOne
  eof
  return exprs

We’re parsing all the simple commands. Finally we expect that we’ve parsed everything there was to parse, and thus reached eof.

With that parser we can assemble a parse function to parse a string representing the brainfuck program, and return the parsed ast or the error:

parse :: String -> Either ParseError [AST]
parse = fmap fst . runParser parseAll . filter isMeaningful
  where isMeaningful = (`elem` "><+-,.[]")

We first filter the incoming string to only keep the meaningful brainfuck commands (everything else is a comment), run our parser, and finally extract the result.

Conclusion

Haskell is known for excellent parsers, and we can see why. In under 100 lines of code we were able to define a full featured parser with error handling, that’s easy and intuitive to use. There are many places this code could be optimized, or where it could use more category theory spicing (for example we could use many as defined by Alternative class from Control.Applicative by making our parser member of those classes, or reduce the choice function to a simple asum on our parser type). But I think this implementation is quite clear and focuses on what’s important, instead of focusing on intricacies of Haskell’s typeclasses — as much as it’s interesting it was not the point of this post.

You can also get all the code used in this post.

Where is my comment box!?

I don't do traditional comments, but you're welcome to send me an email to michal at muskala dot eu and I'll publish it at the bottom of the article as a comment.