Parser combinators are a functional approach to parsing where complex parsers are built by combining simpler ones. They’re a powerful application of monads in Haskell (Monads Basics, Common Monads, Monad Laws).
What are Parser Combinators?
A parser is a function that takes some input (typically a string) and produces a structured result. Parser combinators are higher-order functions (Higher-Order Functions) that take parsers as input and return new parsers.
The key insight is that we can:
- Create primitive parsers for basic elements
- Combine them to create more complex parsers
- Use the monadic structure to sequence parsing operations (Monads Basics)
The Parsec Library
Parsec is the most common parser combinator library in Haskell. Let’s look at its core concepts.
Installation
cabal install --lib parsecBasic Types
In Parsec, the primary type is:
type Parser a = Parsec String () awhich means a parser that consumes a String input, uses () as a custom state, and produces a value of type a.
Simple Parsers
Parsec provides many primitive parsers:
import Text.Parsec
import Text.Parsec.String (Parser)
-- Parse a single character
charParser :: Parser Char
charParser = char 'c'
-- Parse any digit
digitParser :: Parser Char
digitParser = digit
-- Parse a specific string
helloParser :: Parser String
helloParser = string "hello"Running Parsers
To run a parser on an input string:
parse :: Parsec s u a -> SourceName -> s -> Either ParseError aExamples:
parse charParser "input" "cat" -- Right 'c'
parse charParser "input" "dog" -- Left "unexpected 'd'"
parseTest :: Show a => Parsec s u a -> s -> IO ()
parseTest charParser "cat" -- 'c'Combining Parsers
The power of parser combinators comes from combining simpler parsers into more complex ones. This is a direct application of higher-order functions (Higher-Order Functions) and monadic sequencing (Monads Basics).
Sequencing Parsers
To parse one thing followed by another, we use monadic do notation (Monads Basics):
-- Parse "hello" followed by "world"
helloWorldParser :: Parser (String, String)
helloWorldParser = do
hello <- string "hello"
space
world <- string "world"
return (hello, world)Alternatives
To try one parser, and if it fails, try another, we use the <|> operator from the Alternative typeclass (Functors and Applicatives):
-- Parse either "cat" or "dog"
animalParser :: Parser String
animalParser = try (string "cat") <|> string "dog"The <|> operator comes from the Alternative typeclass and represents choice.
Repetition
To parse something multiple times:
-- Parse zero or more digits
digitsParser :: Parser String
digitsParser = many digit
-- Parse one or more digits
digitsParser1 :: Parser String
digitsParser1 = many1 digitBuilding Complex Parsers
Let’s build a more complex parser for a simple arithmetic expression. This example uses recursive data types (Algebraic Data Types) and pattern matching (Pattern Matching and Recursion).
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Expr as E
import Text.Parsec.Token (GenLanguageDef(..), makeTokenParser)
import qualified Text.Parsec.Token as Token
-- Define the language
def :: GenLanguageDef String () Identity
def = LanguageDef
{ commentStart = "/*"
, commentEnd = "*/"
, commentLine = "//"
, nestedComments = True
, identStart = letter
, identLetter = alphaNum <|> char '_'
, opStart = oneOf "+-*/="
, opLetter = oneOf "+-*/="
, reservedNames = ["if", "then", "else", "while", "do", "end"]
, reservedOpNames = ["+", "-", "*", "/", "="]
, caseSensitive = True
}
-- Create a token parser
tokenParser = makeTokenParser def
-- Extract useful parsers
identifier = Token.identifier tokenParser
reserved = Token.reserved tokenParser
reservedOp = Token.reservedOp tokenParser
parens = Token.parens tokenParser
integer = Token.integer tokenParser
whiteSpace = Token.whiteSpace tokenParser
-- Expression parser
expr :: Parser Integer
expr = E.buildExpressionParser table term
where
term = parens expr <|> integer
table = [ [binary "*" (*) E.AssocLeft, binary "/" div E.AssocLeft]
, [binary "+" (+) E.AssocLeft, binary "-" (-) E.AssocLeft]
]
binary name fun assoc = E.Infix (reservedOp name >> return fun) assoc
-- Parse and evaluate an expression
parseExpr :: String -> Either ParseError Integer
parseExpr input = parse (whiteSpace >> expr) "" inputBuilding a Parse Tree
Often, we want to build a structured representation of the input rather than evaluating it directly. This is a classic use of algebraic data types (Algebraic Data Types) and recursion (Pattern Matching and Recursion).
-- Data type for arithmetic expressions
data Expr = Lit Integer
| Add Expr Expr
| Sub Expr Expr
| Mul Expr Expr
| Div Expr Expr
deriving (Show)
-- Parser for a parse tree
exprTree :: Parser Expr
exprTree = E.buildExpressionParser table term
where
term = parens exprTree <|> (Lit <$> integer)
table = [ [binary "*" Mul E.AssocLeft, binary "/" Div E.AssocLeft]
, [binary "+" Add E.AssocLeft, binary "-" Sub E.AssocLeft]
]
binary name constructor assoc =
E.Infix (reservedOp name >> return constructor) assoc
-- After parsing, we can evaluate or transform the parse tree
eval :: Expr -> Integer
eval (Lit n) = n
eval (Add e1 e2) = eval e1 + eval e2
eval (Sub e1 e2) = eval e1 - eval e2
eval (Mul e1 e2) = eval e1 * eval e2
eval (Div e1 e2) = eval e1 `div` eval e2Common Parser Combinators
Basic Parsers
char :: Char -> Parser Char -- Parse a specific character
string :: String -> Parser String -- Parse a specific string
anyChar :: Parser Char -- Parse any character
letter :: Parser Char -- Parse a letter
digit :: Parser Char -- Parse a digit
alphaNum :: Parser Char -- Parse a letter or digit
space :: Parser Char -- Parse a space character
spaces :: Parser () -- Parse zero or more spacesCombinators
(<|>) :: Parser a -> Parser a -> Parser a -- Choice between parsers
try :: Parser a -> Parser a -- Backtracking
many :: Parser a -> Parser [a] -- Zero or more occurrences
many1 :: Parser a -> Parser [a] -- One or more occurrences
option :: a -> Parser a -> Parser a -- Optional parser with default
optional :: Parser a -> Parser () -- Optional parser, discarding result
between :: Parser open -> Parser close -> Parser a -> Parser a -- Brackets
sepBy :: Parser a -> Parser sep -> Parser [a] -- Separated list
sepBy1 :: Parser a -> Parser sep -> Parser [a] -- Non-empty separated list
endBy :: Parser a -> Parser sep -> Parser [a] -- List ending with separator
chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a -- Left-associative chainHandling Parse Errors
Parsec provides detailed error messages when parsing fails:
parse expr "" "1 + (2 * 3"
-- Left (line 1, column 10):
-- unexpected end of input
-- expecting ")" or digitYou can customize error messages using the <??> operator:
expr' = expr <?> "expression"Common Patterns
Lexing and Parsing
For complex languages, it’s common to separate lexical analysis (tokenization) from parsing:
-- Lexer (converts string to tokens)
lexer :: Parser [Token]
lexer = many (space *> token <* spaces)
-- Parser (converts tokens to AST)
parser :: [Token] -> Either ParseError ASTRecursive Parsers
For recursive structures like expressions, we need to handle precedence and associativity:
-- Manual approach (without buildExpressionParser)
expr = term `chainl1` addOp
term = factor `chainl1` mulOp
factor = parens expr <|> integer
addOp = (reservedOp "+" >> return Add) <|> (reservedOp "-" >> return Sub)
mulOp = (reservedOp "*" >> return Mul) <|> (reservedOp "/" >> return Div)Consumed Input
When dealing with ambiguous grammars, it’s important to use try to backtrack when a parser fails after consuming input:
-- Without try, this would fail if "let" is found but not followed by a valid definition
letExpr = try (do
reserved "let"
name <- identifier
reservedOp "="
value <- expr
reserved "in"
body <- expr
return (Let name value body)
) <|> otherExprPractical Example: A JSON Parser
Here’s a simple JSON parser using Parsec:
import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Applicative ((<$>), (<*>), (*>), (<*))
import Data.Char (digitToInt)
data JSON = JNull
| JBool Bool
| JNum Double
| JStr String
| JArr [JSON]
| JObj [(String, JSON)]
deriving (Show, Eq)
-- Whitespace
spaces :: Parser ()
spaces = skipMany (oneOf " \t\n\r")
-- JSON null
jNull :: Parser JSON
jNull = string "null" *> return JNull
-- JSON boolean
jBool :: Parser JSON
jBool = JBool <$> (true <|> false)
where true = string "true" *> return True
false = string "false" *> return False
-- JSON number
jNum :: Parser JSON
jNum = JNum <$> (do
s <- option 1 (char '-' *> return (-1))
n <- number
return (s * n))
where
number = do
i <- integer
f <- option 0 fraction
e <- option 0 exponent
return ((fromIntegral i + f) * (10 ** e))
integer = do
first <- digit
rest <- many digit
return (read (first:rest))
fraction = do
char '.'
digits <- many1 digit
let f = read digits :: Integer
l = fromIntegral (length digits)
return (fromIntegral f / (10 ^ l))
exponent = do
e <- oneOf "eE"
sign <- option 1 (char '+' *> return 1 <|> char '-' *> return (-1))
digits <- many1 digit
return (sign * (read digits))
-- JSON string
jStr :: Parser JSON
jStr = JStr <$> (char '"' *> many charParser <* char '"')
where
charParser = escapedChar <|> normalChar
normalChar = noneOf "\\\"\n\r\t"
escapedChar = char '\\' *> (
char '"' <|>
char '\\' <|>
char '/' <|>
(char 'b' *> return '\b') <|>
(char 'f' *> return '\f') <|>
(char 'n' *> return '\n') <|>
(char 'r' *> return '\r') <|>
(char 't' *> return '\t'))
-- JSON array
jArr :: Parser JSON
jArr = JArr <$> (char '[' *> spaces *> elements <* spaces <* char ']')
where
elements = sepBy (spaces *> jsonParser <* spaces) (char ',')
-- JSON object
jObj :: Parser JSON
jObj = JObj <$> (char '{' *> spaces *> pairs <* spaces <* char '}')
where
pairs = sepBy pair (spaces *> char ',' <* spaces)
pair = do
key <- jStr
spaces *> char ':' *> spaces
value <- jsonParser
case key of
JStr k -> return (k, value)
_ -> fail "Expected string key in object"
-- Main JSON parser
jsonParser :: Parser JSON
jsonParser = jNull <|> jBool <|> jNum <|> jStr <|> jArr <|> jObj
-- Parse JSON from a string
parseJSON :: String -> Either ParseError JSON
parseJSON = parse (spaces *> jsonParser <* spaces <* eof) ""Key Points to Remember
- Parser combinators allow you to build complex parsers by combining simpler ones
- Parsec is a popular parser combinator library in Haskell
- Parsers are monadic, allowing for clean sequencing of parsing operations
- The
<|>operator provides alternatives when parsing - Common combinators include
many,many1,sepBy, andbetween - Recursive parsers can handle nested structures like expressions
- Parser combinators provide a declarative, composable approach to parsing