Parsing
There are a lot of parsing libraries in Haskell. The considerations for choosing among them depend on:
- What kind of format you’re parsing
- How the parser will be used
In this article we’re going to discuss three packages:
The code examples use these imports:
-- base
import Control.Applicative
import Data.Char
import Data.Functor
import Data.Void
import Numeric.Natural
-- text
import qualified Data.Text as T
-- parsec
import qualified Text.Parsec as Parsec
import qualified Text.Parsec.Text as Parsec
-- megaparsec
import qualified Text.Megaparsec as Mega
import qualified Text.Megaparsec.Char as Mega
import qualified Text.Megaparsec.Char.Lexer as Mega
-- attoparsec
import qualified Data.Attoparsec.Text as Atto
type MegaParser = Mega.Parsec Void T.Text
Error messages
What’s more important to you: Giving informative error messages for invalid inputs, or speed?
Error messages are really important when you’re writing a parser for something like a programming language, where the file you’re parsing is something that’s written by a person. Programmers make syntax errors all the time with our fallible human fingers, and the parser’s responsibility to point out where the mistake is and explain why it’s a mistake is just as important as its job to read correctly-written code.
Speed is more important when you’re parsing messages for a network protocol. Because the string you’re reading is generated by software, For example, in the Web Servers course, we use attoparsec
to read HTTP requests, which are typically generated by HTTP client software such as curl
. most of the time it won’t contain mistakes – and even if you do generate great explanatory error messages, there’s usually nobody there to read them anyway. Some informative error output is nice to have for the sake of debugging (either for a client that generated a faulty request, or for an incorrectly-written parser), but not nearly as important as in the former case.
Producing informative error information requires keeping track of more information while parsing, which comes at some performance cost. For a programming language implementation that’s usually only compiling one file at a time, the usability benefit is probably worth the tradeoff. For servers, when we may be considering how many thousands of requests they can process per second, perhaps the overhead of detailed error output is not worthwhile.
Incremental parsing
Do you need to run the parser incrementally?
In the Text.Parsec
, there is essentially one way to run a parser:
:: Parser a -> SourceName -> Text
Parsec.parse-> Either ParseError a
The return type is Either ParseError a
, which means there are two possible outcomes:
Left
, with an error message explaining why the parser cannot parse the input.Right
, with the result of a successful parse. Note that a result ofRight
does not give any indication as to whether the entire input was consumed.
In Data.Attoparsec.Text
, you’ll find two ways to run a parser:
:: Parser a -> Text
Atto.parse-> Result a
:: Parser a -> Text
Atto.parseOnly-> Either String a
What parsec
calls ‘parse’, attoparsec
calls ‘parseOnly’. attoparsec
’s parse
function provides something more. Result a
is similar to the failure-or-success output that parsec
’s parse
function provides, but with some additional information and a third possible outcome. The constructors for Result
are:
Fail
, indicating that the parser definitely could not parse the input.Done
, which contains a successful result of typea
, as well as the remaining unconsumed inputText
.Partial
, indicating that the parser couldn’t finish parsing given only the input so far, but might be able to keep going if fed more input.
To illustrate, let’s consider a simple parser of type Parser (Char, Char, Char)
that accepts exactly three letters (letters only, no numbers or other weird characters). With either parsec
or attoparsec
imported, we can express the parser like this:
<$> letter <*> letter <*> letter (,,)
Below we summarize the difference between parsec
’s parse
function and attoparsec
’s incremental parse
function by comparing the results on a handful of example inputs.
Input | Parsec | Attoparsec |
---|---|---|
"abc" | Right: ('a', 'b', 'c') | Done: ('a', 'b', 'c'); remaining input: "" |
"abcd" | Right: ('a', 'b', 'c') | Done: ('a', 'b', 'c'); remaining input: "d" |
"a3c" | Left: Unexpected "3", expecting letter | Fail: "letter"; remaining input: "3c" |
"ab" | Left: Unexpected end of input | Partial |
Do you need incremental parsing?
- If you’re reading a file: Probably no.
- If you’re reading an unbounded stream of messages from a socket: Probably yes.
You can see attoparsec
’s incremental parsing feature in use in Web Servers, Lesson 9.
Backtracking
Do you want automatic backtracking?
Backtracking behavior is an important, subtle aspect of a parser. To understand it, first we have to appreciate the concept of parsing alternatives.
The most direct use of alternatives comes about when you use (<|>)
or asum
. asum [a, b, c]
is equivalent to a <|> b <|> c
Alternatives usually also sneak in when you use combinators like optional
, some
, and many
. Most of these functions are from, or at least constrained by, the Alternative
typeclass.
For example, let’s say we’re parsing a word, which we define as either
- a list of letters; or
- a list of numbers.
So “hello”, “world”, “2”, and “9817624” are words, and “$@!&”, “hel3o”, and “” are not.
data Word = Letters T.Text | Digits Natural
deriving Show
We can write a parser for a word in megaparsec
like this:
word :: MegaParser Word
= letters <|> digits
word where
= Letters <$> Mega.takeWhile1P (Just "letter") isAlpha
letters = Digits <$> Mega.decimal digits
The way this parser runs looks like this:
Notice that the first character of the input completely determines the course of the rest of the parsing. If it’s a letter, we take the first branch and keep looking for more letters. If it’s a digit, we take the second branch and keep looking for more digits. Unfortunately, not all alternatives are so simple.
For the next example, we’ll write a very simple date parser. Specifically, we’ll be parsing year-and-month strings.
We want the parser to accept strings in either of two formats: YYYY-MM
or MM-YYYY
. That is, the year must be four digits, the month must be two digits, and they must be separated by a dash, but either the year or the month may come first.
We’ll start by writing parsers for the smallest components:
Then we can combine them in different orders to define parsers for the two formats:
To write a parser that accepts either format, we might think we could combine the two with (<|>)
like we did before.
monthP :: MegaParser Month
= yearFirstP <|> monthFirstP monthP
Why doesn’t this work? Let’s draw the flowchart again.
The first character of the input determines the course of the rest of the parsing. If the first character is a digit, the parser takes the first branch. There is no way to reach the second branch. To fix this, we need to backtrack – If, somewhere during the progress of the first branch, we find that the YYYY-MM
parser fails, we need to rewind the input back to the start and try again with the MM-YYYY
parser.
This is what the try
function in parsec
and megaparsec
does: It turns a non-backtracking parser into a backtracking parser. Let’s modify monthP
by applying try
to the first alternative:
monthP :: MegaParser Month
= Mega.try yearFirstP <|> monthFirstP monthP
Now it can fall back appropriately to the second format.
Backtracking is something we have to do all the time when we’re reading prose, though you probably only notice it when you’re trying to read something that’s poorly written. Consider the following awkward sentence:
He told his friend that he went to the game with a lie.
You read left to right, recognizing
- “He” is the subject of this sentence.
- “told” is the verb.
- “his friend” is an indirect object; we are now anticipating a direct object, which will be what he told to his friend.
At this point, we have to guess what grammatical role the word “that” serves. Most readers will probably guess that the sentence has the form
He told his friend that […]
and that the remainder of the sentence is the object. But then we finish reading: “he went to the game with a lie.” How do you go to a game with a lie? That doesn’t make sense. If you read the sentence this way, you then have to back up and look for a different grammatical interpretation until you find one that works. In this case, what the author probably meant was
He told his friend (that he went to the game with) a lie.
Skillful writing avoids forcing the reader to do much backtracking, much like well-designed data formats avoid the need to write parsers that backtrack significantly.
Most parsing libraries provide support for backtracking parsers; what you need to know is whether they backtrack by default.
attoparsec
backtracks automatically, so you never have to use thetry
function.parsec
andmegaparsec
do not, so you need to usetry
when you want backtracking.
This is an important distinction to understand, because very similar-looking parser expressions written using different libraries can give you different results.