There are a lot of parsing libraries in Haskell. The considerations for choosing among them depend on:
- What kind of format you’re parsing
- How the parser will be used
In this article we’re going to discuss three packages:
The code examples use these imports:
-- base import Control.Applicative import Data.Char import Data.Functor import Data.Void import Numeric.Natural -- text import qualified Data.Text as T -- parsec import qualified Text.Parsec as Parsec import qualified Text.Parsec.Text as Parsec -- megaparsec import qualified Text.Megaparsec as Mega import qualified Text.Megaparsec.Char as Mega import qualified Text.Megaparsec.Char.Lexer as Mega -- attoparsec import qualified Data.Attoparsec.Text as Atto type MegaParser = Mega.Parsec Void T.Text
What’s more important to you: Giving informative error messages for invalid inputs, or speed?
Error messages are really important when you’re writing a parser for something like a programming language, where the file you’re parsing is something that’s written by a person. Programmers make syntax errors all the time with our fallible human fingers, and the parser’s responsibility to point out where the mistake is and explain why it’s a mistake is just as important as its job to read correctly-written code.
Speed is more important when you’re parsing messages for a network protocol. Because the string you’re reading is generated by software,For example, in the Web Servers course, we use
attoparsec to read HTTP requests, which are typically generated by HTTP client software such as
curl. most of the time it won’t contain mistakes – and even if you do generate great explanatory error messages, there’s usually nobody there to read them anyway. Some informative error output is nice to have for the sake of debugging (either for a client that generated a faulty request, or for an incorrectly-written parser), but not nearly as important as in the former case.
Producing informative error information requires keeping track of more information while parsing, which comes at some performance cost. For a programming language implementation that’s usually only compiling one file at a time, the usability benefit is probably worth the tradeoff. For servers, when we may be considering how many thousands of requests they can process per second, perhaps the overhead of detailed error output is not worthwhile.
Do you need to run the parser incrementally?
Text.Parsec, there is essentially one way to run a parser:
The return type is
Either ParseError a, which means there are two possible outcomes:
Left, with giving an error message explaining why the parser cannot parse the input.
Right, with the result of a successful parse.Note that a result of
Rightdoes not give any indication as to whether the entire input was consumed.
Data.Attoparsec.Text, you’ll find two ways to run a parser:
What parsec calls ‘parse’, attoparsec calls ‘parseOnly’. Attoparsec’s
parse function provides something more.
Result a is similar to the failure-or-success output that parsec’s
parse function provides, but with some additional information and a third possible outcome. The constructors for
Fail, indicating that the parser definitely could not parse the input.
Done, which contains a successful result of type
a, as well as the remaining unconsumed input
Partial, indicating that the parser couldn’t finish parsing given only the input so far, but might be able to keep going if fed more input.
To illustrate, let’s consider a simple parser of type
Parser (Char, Char, Char) that accepts exactly three letters (letters only, no numbers or other weird characters). With either Parsec or Attoparsec imported, we can express the parser like this:
Below we summarize the difference between Parsec’s
parse function and Attoparsec’s incremental
parse function by comparing the results on a handful of example inputs.
|"abc"||Right: ('a', 'b', 'c')||Done: ('a', 'b', 'c'); remaining input: ""|
|"abcd"||Right: ('a', 'b', 'c')||Done: ('a', 'b', 'c'); remaining input: "d"|
|"a3c"||Left: Unexpected "3", expecting letter||Fail: "letter"; remaining input: "3c"|
|"ab"||Left: Unexpected end of input||Partial|
Do you need incremental parsing?
- If you’re reading a file: Probably no.
- If you’re reading an unbounded stream of messages from a socket: Probably yes.
Do you want automatic backtracking?
Backtracking behavior is an important subtle aspect of a parser. To understand it, first we have to appreciate the concept of parsing alternatives.
The most direct use of alternatives comes about when you use
asum [a, b, c] is equivalent to
a <|> b <|> c Alternatives usually also sneak in when you use combinators like
For example, let’s say we’re parsing a word, which we define as either
- a list of letters; or
- a list of numbers
So “hello”, “world”, “2”, and “9817624” are words, and “$@!&”, “hel3o”, and "" are not.
We can write a parser for a word in Megaparsec like this:
The way this parser runs looks like this:
Notice that the first character of the input completely determines the course of the rest of the parsing. If it’s a letter, we take the first branch and keep looking for more letters. If it’s a digit, we take the second branch and keep looking for more digits. Unfortunately, not all alternatives are so simple.
For the next example, we’ll write a very simple date parser. Specifically, we’ll be parsing year-and-month strings.
We want the parser to accept strings in either of two formats:
MM-YYYY. That is, the year must be four digits, the month must be two digits, and they must be separated by a dash, but either the year or the month may come first.
We’ll start by writing parsers for the smallest components:
Then we can combine them in different orders to define parsers for the two formats:
To write a parser that accepts either format, we might think we could combine the two with
(<|>) like we did before.
Why doesn’t this work? Let’s draw the flowchart again.
The first character of the input determines the course of the rest of the parsing. If the first character is a digit, the parser takes the first branch. There is no way to reach the second branch. To fix this, we need to backtrack – If, somewhere during the progress of the first branch, we find that the
YYYY-MM parser fails, we need to rewind the input back to the start and try again with the
This is what the
try function in Parsec and Megaparsec does: It turns a non-backtracking parser into a backtracking parser. Let’s modify
monthP by applying
try to the first alternative:
Now it can fall back appropriately to the second format.
Backtracking is something we have to do all the time when we’re reading prose, though you probably only notice it when you’re trying to read something that’s poorly written. Consider the following awkward sentence:
He told his friend that he went to the game with a lie.
You read left to right, recognizing
- “He” is the subject of this sentence.
- “told” is the verb.
- “his friend” is an indirect object; we are now anticipating a direct object, which will be what he told to his friend.
At this point, we have to guess what grammatical role the word “that” serves. Most readers will probably guess that the sentence has the form
He told his friend that […]
and that the remainder of the sentence is the object. But then we finish reading: “he went to the game with a lie.” How do you go to a game with a lie? That doesn’t make sense. If you read the sentence this way, you then have to back up and look for a different grammatical interpretation until you find one that works. In this case, what the author probably meant was
He told his friend (that he went to the game with) a lie.
Skillful writing avoids forcing the reader to do much backtracking, much like well-designed data formats avoid the need to write parsers that backtrack significantly.
Most parsing libraries provide support for backtracking parsers; what you need to know is whether they backtrack by default.
- Attoparsec backtracks automatically, so you never have to use the
- Parsec and Megaparsec do not, so you need to use
trywhen you want backtracking.
This is an important distinction to understand, because very similar-looking parser expressions written using different libraries can give you different results.