Parsing

There are a lot of parsing libraries in Haskell. The considerations for choosing among them depend on:

  1. What kind of format you’re parsing
  2. How the parser will be used

In this article we’re going to discuss three packages:

The code examples use these imports:

-- base
import Control.Applicative
import Data.Char
import Data.Functor
import Data.Void
import Numeric.Natural

-- text
import qualified Data.Text as T

-- parsec
import qualified Text.Parsec      as Parsec
import qualified Text.Parsec.Text as Parsec

-- megaparsec
import qualified Text.Megaparsec            as Mega
import qualified Text.Megaparsec.Char       as Mega
import qualified Text.Megaparsec.Char.Lexer as Mega

-- attoparsec
import qualified Data.Attoparsec.Text as Atto

type MegaParser = Mega.Parsec Void T.Text

Error messages

What’s more important to you: Giving informative error messages for invalid inputs, or speed?

Error messages are really important when you’re writing a parser for something like a programming language, where the file you’re parsing is something that’s written by a person. Programmers make syntax errors all the time with our fallible human fingers, and the parser’s responsibility to point out where the mistake is and explain why it’s a mistake is just as important as its job to read correctly-written code.

Speed is more important when you’re parsing messages for a network protocol. Because the string you’re reading is generated by software, For example, in the Web Servers course, we use attoparsec to read HTTP requests, which are typically generated by HTTP client software such as curl. most of the time it won’t contain mistakes – and even if you do generate great explanatory error messages, there’s usually nobody there to read them anyway. Some informative error output is nice to have for the sake of debugging (either for a client that generated a faulty request, or for an incorrectly-written parser), but not nearly as important as in the former case.

Producing informative error information requires keeping track of more information while parsing, which comes at some performance cost. For a programming language implementation that’s usually only compiling one file at a time, the usability benefit is probably worth the tradeoff. For servers, when we may be considering how many thousands of requests they can process per second, perhaps the overhead of detailed error output is not worthwhile.

Incremental parsing

Do you need to run the parser incrementally?

In the Text.Parsec, there is essentially one way to run a parser:

Parsec.parse :: Parser a -> SourceName -> Text
    -> Either ParseError a

The return type is Either ParseError a, which means there are two possible outcomes:

  1. Left, with an error message explaining why the parser cannot parse the input.
  2. Right, with the result of a successful parse. Note that a result of Right does not give any indication as to whether the entire input was consumed.

In Data.Attoparsec.Text, you’ll find two ways to run a parser:

Atto.parse :: Parser a -> Text
    -> Result a
Atto.parseOnly :: Parser a -> Text
    -> Either String a

What parsec calls ‘parse’, attoparsec calls ‘parseOnly’. attoparsec’s parse function provides something more. Result a is similar to the failure-or-success output that parsec’s parse function provides, but with some additional information and a third possible outcome. The constructors for Result are:

  1. Fail, indicating that the parser definitely could not parse the input.
  2. Done, which contains a successful result of type a, as well as the remaining unconsumed input Text.
  3. Partial, indicating that the parser couldn’t finish parsing given only the input so far, but might be able to keep going if fed more input.

To illustrate, let’s consider a simple parser of type Parser (Char, Char, Char) that accepts exactly three letters (letters only, no numbers or other weird characters). With either parsec or attoparsec imported, we can express the parser like this:

(,,) <$> letter <*> letter <*> letter

Below we summarize the difference between parsec’s parse function and attoparsec’s incremental parse function by comparing the results on a handful of example inputs.

Input Parsec Attoparsec
"abc" Right: ('a', 'b', 'c') Done: ('a', 'b', 'c'); remaining input: ""
"abcd" Right: ('a', 'b', 'c') Done: ('a', 'b', 'c'); remaining input: "d"
"a3c" Left: Unexpected "3", expecting letter Fail: "letter"; remaining input: "3c"
"ab" Left: Unexpected end of input Partial

Do you need incremental parsing?

  • If you’re reading a file: Probably no.
  • If you’re reading an unbounded stream of messages from a socket: Probably yes.

You can see attoparsec’s incremental parsing feature in use in Web Servers, Lesson 9.

Backtracking

Do you want automatic backtracking?

Backtracking behavior is an important, subtle aspect of a parser. To understand it, first we have to appreciate the concept of parsing alternatives.

The most direct use of alternatives comes about when you use (<|>) or asum. asum [a, b, c] is equivalent to a <|> b <|> c Alternatives usually also sneak in when you use combinators like optional, some, and many. Most of these functions are from, or at least constrained by, the Alternative typeclass.

For example, let’s say we’re parsing a word, which we define as either

  • a list of letters; or
  • a list of numbers.

So “hello”, “world”, “2”, and “9817624” are words, and “$@!&”, “hel3o”, and “” are not.

data Word = Letters T.Text | Digits Natural
    deriving Show

We can write a parser for a word in megaparsec like this:

word :: MegaParser Word
word = letters <|> digits
  where
    letters = Letters <$> Mega.takeWhile1P (Just "letter") isAlpha
    digits  = Digits  <$> Mega.decimal

The way this parser runs looks like this:

Notice that the first character of the input completely determines the course of the rest of the parsing. If it’s a letter, we take the first branch and keep looking for more letters. If it’s a digit, we take the second branch and keep looking for more digits. Unfortunately, not all alternatives are so simple.

For the next example, we’ll write a very simple date parser. Specifically, we’ll be parsing year-and-month strings.

data Month = Month
    { theYear  :: Natural
    , theMonth :: Natural
    }
    deriving Show

We want the parser to accept strings in either of two formats: YYYY-MM or MM-YYYY. That is, the year must be four digits, the month must be two digits, and they must be separated by a dash, but either the year or the month may come first.

We’ll start by writing parsers for the smallest components:

yearP :: MegaParser Natural
yearP = read <$> Mega.count 4 Mega.digitChar

monthP :: MegaParser Natural
monthP = read <$> Mega.count 2 Mega.digitChar

dash :: MegaParser ()
dash = void (Mega.char '-')

Then we can combine them in different orders to define parsers for the two formats:

yearFirstP :: MegaParser Month
yearFirstP =
  do
    y <- yearP
    _ <- dash
    m <- monthP
    return (Month y m)

monthFirstP :: MegaParser Month
monthFirstP =
  do
    m <- monthP
    _ <- dash
    y <- yearP
    return (Month y m)

To write a parser that accepts either format, we might think we could combine the two with (<|>) like we did before.

monthP :: MegaParser Month
monthP = yearFirstP <|> monthFirstP

Why doesn’t this work? Let’s draw the flowchart again.

The first character of the input determines the course of the rest of the parsing. If the first character is a digit, the parser takes the first branch. There is no way to reach the second branch. To fix this, we need to backtrack – If, somewhere during the progress of the first branch, we find that the YYYY-MM parser fails, we need to rewind the input back to the start and try again with the MM-YYYY parser.

This is what the try function in parsec and megaparsec does: It turns a non-backtracking parser into a backtracking parser. Let’s modify monthP by applying try to the first alternative:

monthP :: MegaParser Month
monthP = Mega.try yearFirstP <|> monthFirstP

Now it can fall back appropriately to the second format.

Backtracking is something we have to do all the time when we’re reading prose, though you probably only notice it when you’re trying to read something that’s poorly written. Consider the following awkward sentence:

He told his friend that he went to the game with a lie.

You read left to right, recognizing

  1. “He” is the subject of this sentence.
  2. “told” is the verb.
  3. “his friend” is an indirect object; we are now anticipating a direct object, which will be what he told to his friend.

At this point, we have to guess what grammatical role the word “that” serves. Most readers will probably guess that the sentence has the form

He told his friend that […]

and that the remainder of the sentence is the object. But then we finish reading: “he went to the game with a lie.” How do you go to a game with a lie? That doesn’t make sense. If you read the sentence this way, you then have to back up and look for a different grammatical interpretation until you find one that works. In this case, what the author probably meant was

He told his friend (that he went to the game with) a lie.

Skillful writing avoids forcing the reader to do much backtracking, much like well-designed data formats avoid the need to write parsers that backtrack significantly.

Most parsing libraries provide support for backtracking parsers; what you need to know is whether they backtrack by default.

  • attoparsec backtracks automatically, so you never have to use the try function.
  • parsec and megaparsec do not, so you need to use try when you want backtracking.

This is an important distinction to understand, because very similar-looking parser expressions written using different libraries can give you different results.

Join Type Classes for courses and projects to get you started and make you an expert in FP with Haskell.