Parsing

There are a lot of parsing libraries in Haskell. The considerations for choosing among them depend on:

  1. What kind of format you’re parsing
  2. How the parser will be used

In this article we’re going to discuss three packages:

The code examples use these imports:

Error messages

What’s more important to you: Giving informative error messages for invalid inputs, or speed?

Error messages are really important when you’re writing a parser for something like a programming language, where the file you’re parsing is something that’s written by a person. Programmers make syntax errors all the time with our fallible human fingers, and the parser’s responsibility to point out where the mistake is and explain why it’s a mistake is just as important as its job to read correctly-written code.

Speed is more important when you’re parsing messages for a network protocol. Because the string you’re reading is generated by software,For example, in the Web Servers course, we use attoparsec to read HTTP requests, which are typically generated by HTTP client software such as curl. most of the time it won’t contain mistakes – and even if you do generate great explanatory error messages, there’s usually nobody there to read them anyway. Some informative error output is nice to have for the sake of debugging (either for a client that generated a faulty request, or for an incorrectly-written parser), but not nearly as important as in the former case.

Producing informative error information requires keeping track of more information while parsing, which comes at some performance cost. For a programming language implementation that’s usually only compiling one file at a time, the usability benefit is probably worth the tradeoff. For servers, when we may be considering how many thousands of requests they can process per second, perhaps the overhead of detailed error output is not worthwhile.

Incremental parsing

Do you need to run the parser incrementally?

In the Text.Parsec, there is essentially one way to run a parser:

The return type is Either ParseError a, which means there are two possible outcomes:

  1. Left, with giving an error message explaining why the parser cannot parse the input.
  2. Right, with the result of a successful parse.Note that a result of Right does not give any indication as to whether the entire input was consumed.

In Data.Attoparsec.Text, you’ll find two ways to run a parser:

What parsec calls ‘parse’, attoparsec calls ‘parseOnly’. Attoparsec’s parse function provides something more. Result a is similar to the failure-or-success output that parsec’s parse function provides, but with some additional information and a third possible outcome. The constructors for Result are:

  1. Fail, indicating that the parser definitely could not parse the input.
  2. Done, which contains a successful result of type a, as well as the remaining unconsumed input Text.
  3. Partial, indicating that the parser couldn’t finish parsing given only the input so far, but might be able to keep going if fed more input.

To illustrate, let’s consider a simple parser of type Parser (Char, Char, Char) that accepts exactly three letters (letters only, no numbers or other weird characters). With either Parsec or Attoparsec imported, we can express the parser like this:

Below we summarize the difference between Parsec’s parse function and Attoparsec’s incremental parse function by comparing the results on a handful of example inputs.

Input Parsec Attoparsec
"abc" Right: ('a', 'b', 'c') Done: ('a', 'b', 'c'); remaining input: ""
"abcd" Right: ('a', 'b', 'c') Done: ('a', 'b', 'c'); remaining input: "d"
"a3c" Left: Unexpected "3", expecting letter Fail: "letter"; remaining input: "3c"
"ab" Left: Unexpected end of input Partial

Do you need incremental parsing?

  • If you’re reading a file: Probably no.
  • If you’re reading an unbounded stream of messages from a socket: Probably yes.

Backtracking

Do you want automatic backtracking?

Backtracking behavior is an important subtle aspect of a parser. To understand it, first we have to appreciate the concept of parsing alternatives.

The most direct use of alternatives comes about when you use (<|>) or asum.asum [a, b, c] is equivalent to a <|> b <|> c Alternatives usually also sneak in when you use combinators like optional, some, and many.

For example, let’s say we’re parsing a word, which we define as either

  • a list of letters; or
  • a list of numbers

So “hello”, “world”, “2”, and “9817624” are words, and “$@!&”, “hel3o”, and "" are not.

We can write a parser for a word in Megaparsec like this:

The way this parser runs looks like this:

Notice that the first character of the input completely determines the course of the rest of the parsing. If it’s a letter, we take the first branch and keep looking for more letters. If it’s a digit, we take the second branch and keep looking for more digits. Unfortunately, not all alternatives are so simple.

For the next example, we’ll write a very simple date parser. Specifically, we’ll be parsing year-and-month strings.

We want the parser to accept strings in either of two formats: YYYY-MM or MM-YYYY. That is, the year must be four digits, the month must be two digits, and they must be separated by a dash, but either the year or the month may come first.

We’ll start by writing parsers for the smallest components:

Then we can combine them in different orders to define parsers for the two formats:

To write a parser that accepts either format, we might think we could combine the two with (<|>) like we did before.

Why doesn’t this work? Let’s draw the flowchart again.

The first character of the input determines the course of the rest of the parsing. If the first character is a digit, the parser takes the first branch. There is no way to reach the second branch. To fix this, we need to backtrack – If, somewhere during the progress of the first branch, we find that the YYYY-MM parser fails, we need to rewind the input back to the start and try again with the MM-YYYY parser.

This is what the try function in Parsec and Megaparsec does: It turns a non-backtracking parser into a backtracking parser. Let’s modify monthP by applying try to the first alternative:

Now it can fall back appropriately to the second format.

Backtracking is something we have to do all the time when we’re reading prose, though you probably only notice it when you’re trying to read something that’s poorly written. Consider the following awkward sentence:

He told his friend that he went to the game with a lie.

You read left to right, recognizing

  1. “He” is the subject of this sentence.
  2. “told” is the verb.
  3. “his friend” is an indirect object; we are now anticipating a direct object, which will be what he told to his friend.

At this point, we have to guess what grammatical role the word “that” serves. Most readers will probably guess that the sentence has the form

He told his friend that […]

and that the remainder of the sentence is the object. But then we finish reading: “he went to the game with a lie.” How do you go to a game with a lie? That doesn’t make sense. If you read the sentence this way, you then have to back up and look for a different grammatical interpretation until you find one that works. In this case, what the author probably meant was

He told his friend (that he went to the game with) a lie.

Skillful writing avoids forcing the reader to do much backtracking, much like well-designed data formats avoid the need to write parsers that backtrack significantly.

Most parsing libraries provide support for backtracking parsers; what you need to know is whether they backtrack by default.

  • Attoparsec backtracks automatically, so you never have to use the try function.
  • Parsec and Megaparsec do not, so you need to use try when you want backtracking.

This is an important distinction to understand, because very similar-looking parser expressions written using different libraries can give you different results.

Type Classes offers courses and projects to get you started and make you an expert in FP with Haskell. For $29/month, you get access to the complete archive and all the latest content.