Parser combinators are a functional approach to parsing where complex parsers are built by combining simpler ones. They’re a powerful application of monads in Haskell (Monads Basics, Common Monads, Monad Laws).

What are Parser Combinators?

A parser is a function that takes some input (typically a string) and produces a structured result. Parser combinators are higher-order functions (Higher-Order Functions) that take parsers as input and return new parsers.

The key insight is that we can:

  1. Create primitive parsers for basic elements
  2. Combine them to create more complex parsers
  3. Use the monadic structure to sequence parsing operations (Monads Basics)

The Parsec Library

Parsec is the most common parser combinator library in Haskell. Let’s look at its core concepts.

Installation

cabal install --lib parsec

Basic Types

In Parsec, the primary type is:

type Parser a = Parsec String () a

which means a parser that consumes a String input, uses () as a custom state, and produces a value of type a.

Simple Parsers

Parsec provides many primitive parsers:

import Text.Parsec
import Text.Parsec.String (Parser)
 
-- Parse a single character
charParser :: Parser Char
charParser = char 'c'
 
-- Parse any digit
digitParser :: Parser Char
digitParser = digit
 
-- Parse a specific string
helloParser :: Parser String
helloParser = string "hello"

Running Parsers

To run a parser on an input string:

parse :: Parsec s u a -> SourceName -> s -> Either ParseError a

Examples:

parse charParser "input" "cat"  -- Right 'c'
parse charParser "input" "dog"  -- Left "unexpected 'd'"
 
parseTest :: Show a => Parsec s u a -> s -> IO ()
parseTest charParser "cat"  -- 'c'

Combining Parsers

The power of parser combinators comes from combining simpler parsers into more complex ones. This is a direct application of higher-order functions (Higher-Order Functions) and monadic sequencing (Monads Basics).

Sequencing Parsers

To parse one thing followed by another, we use monadic do notation (Monads Basics):

-- Parse "hello" followed by "world"
helloWorldParser :: Parser (String, String)
helloWorldParser = do
  hello <- string "hello"
  space
  world <- string "world"
  return (hello, world)

Alternatives

To try one parser, and if it fails, try another, we use the <|> operator from the Alternative typeclass (Functors and Applicatives):

-- Parse either "cat" or "dog"
animalParser :: Parser String
animalParser = try (string "cat") <|> string "dog"

The <|> operator comes from the Alternative typeclass and represents choice.

Repetition

To parse something multiple times:

-- Parse zero or more digits
digitsParser :: Parser String
digitsParser = many digit
 
-- Parse one or more digits
digitsParser1 :: Parser String
digitsParser1 = many1 digit

Building Complex Parsers

Let’s build a more complex parser for a simple arithmetic expression. This example uses recursive data types (Algebraic Data Types) and pattern matching (Pattern Matching and Recursion).

import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Expr as E
import Text.Parsec.Token (GenLanguageDef(..), makeTokenParser)
import qualified Text.Parsec.Token as Token
 
-- Define the language
def :: GenLanguageDef String () Identity
def = LanguageDef
  { commentStart = "/*"
  , commentEnd = "*/"
  , commentLine = "//"
  , nestedComments = True
  , identStart = letter
  , identLetter = alphaNum <|> char '_'
  , opStart = oneOf "+-*/="
  , opLetter = oneOf "+-*/="
  , reservedNames = ["if", "then", "else", "while", "do", "end"]
  , reservedOpNames = ["+", "-", "*", "/", "="]
  , caseSensitive = True
  }
 
-- Create a token parser
tokenParser = makeTokenParser def
 
-- Extract useful parsers
identifier = Token.identifier tokenParser
reserved = Token.reserved tokenParser
reservedOp = Token.reservedOp tokenParser
parens = Token.parens tokenParser
integer = Token.integer tokenParser
whiteSpace = Token.whiteSpace tokenParser
 
-- Expression parser
expr :: Parser Integer
expr = E.buildExpressionParser table term
  where
    term = parens expr <|> integer
    table = [ [binary "*" (*) E.AssocLeft, binary "/" div E.AssocLeft]
            , [binary "+" (+) E.AssocLeft, binary "-" (-) E.AssocLeft]
            ]
    binary name fun assoc = E.Infix (reservedOp name >> return fun) assoc
 
-- Parse and evaluate an expression
parseExpr :: String -> Either ParseError Integer
parseExpr input = parse (whiteSpace >> expr) "" input

Building a Parse Tree

Often, we want to build a structured representation of the input rather than evaluating it directly. This is a classic use of algebraic data types (Algebraic Data Types) and recursion (Pattern Matching and Recursion).

-- Data type for arithmetic expressions
data Expr = Lit Integer
          | Add Expr Expr
          | Sub Expr Expr
          | Mul Expr Expr
          | Div Expr Expr
          deriving (Show)
 
-- Parser for a parse tree
exprTree :: Parser Expr
exprTree = E.buildExpressionParser table term
  where
    term = parens exprTree <|> (Lit <$> integer)
    table = [ [binary "*" Mul E.AssocLeft, binary "/" Div E.AssocLeft]
            , [binary "+" Add E.AssocLeft, binary "-" Sub E.AssocLeft]
            ]
    binary name constructor assoc = 
      E.Infix (reservedOp name >> return constructor) assoc
 
-- After parsing, we can evaluate or transform the parse tree
eval :: Expr -> Integer
eval (Lit n) = n
eval (Add e1 e2) = eval e1 + eval e2
eval (Sub e1 e2) = eval e1 - eval e2
eval (Mul e1 e2) = eval e1 * eval e2
eval (Div e1 e2) = eval e1 `div` eval e2

Common Parser Combinators

Basic Parsers

char :: Char -> Parser Char           -- Parse a specific character
string :: String -> Parser String     -- Parse a specific string
anyChar :: Parser Char                -- Parse any character
letter :: Parser Char                 -- Parse a letter
digit :: Parser Char                  -- Parse a digit
alphaNum :: Parser Char              -- Parse a letter or digit
space :: Parser Char                 -- Parse a space character
spaces :: Parser ()                  -- Parse zero or more spaces

Combinators

(<|>) :: Parser a -> Parser a -> Parser a  -- Choice between parsers
try :: Parser a -> Parser a                -- Backtracking
many :: Parser a -> Parser [a]             -- Zero or more occurrences
many1 :: Parser a -> Parser [a]            -- One or more occurrences
option :: a -> Parser a -> Parser a        -- Optional parser with default
optional :: Parser a -> Parser ()          -- Optional parser, discarding result
between :: Parser open -> Parser close -> Parser a -> Parser a  -- Brackets
sepBy :: Parser a -> Parser sep -> Parser [a]      -- Separated list
sepBy1 :: Parser a -> Parser sep -> Parser [a]     -- Non-empty separated list
endBy :: Parser a -> Parser sep -> Parser [a]      -- List ending with separator
chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a  -- Left-associative chain

Handling Parse Errors

Parsec provides detailed error messages when parsing fails:

parse expr "" "1 + (2 * 3"
-- Left (line 1, column 10):
-- unexpected end of input
-- expecting ")" or digit

You can customize error messages using the <??> operator:

expr' = expr <?> "expression"

Common Patterns

Lexing and Parsing

For complex languages, it’s common to separate lexical analysis (tokenization) from parsing:

-- Lexer (converts string to tokens)
lexer :: Parser [Token]
lexer = many (space *> token <* spaces)
 
-- Parser (converts tokens to AST)
parser :: [Token] -> Either ParseError AST

Recursive Parsers

For recursive structures like expressions, we need to handle precedence and associativity:

-- Manual approach (without buildExpressionParser)
expr = term `chainl1` addOp
term = factor `chainl1` mulOp
factor = parens expr <|> integer
addOp = (reservedOp "+" >> return Add) <|> (reservedOp "-" >> return Sub)
mulOp = (reservedOp "*" >> return Mul) <|> (reservedOp "/" >> return Div)

Consumed Input

When dealing with ambiguous grammars, it’s important to use try to backtrack when a parser fails after consuming input:

-- Without try, this would fail if "let" is found but not followed by a valid definition
letExpr = try (do
  reserved "let"
  name <- identifier
  reservedOp "="
  value <- expr
  reserved "in"
  body <- expr
  return (Let name value body)
) <|> otherExpr

Practical Example: A JSON Parser

Here’s a simple JSON parser using Parsec:

import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Applicative ((<$>), (<*>), (*>), (<*))
import Data.Char (digitToInt)
 
data JSON = JNull
          | JBool Bool
          | JNum Double
          | JStr String
          | JArr [JSON]
          | JObj [(String, JSON)]
          deriving (Show, Eq)
 
-- Whitespace
spaces :: Parser ()
spaces = skipMany (oneOf " \t\n\r")
 
-- JSON null
jNull :: Parser JSON
jNull = string "null" *> return JNull
 
-- JSON boolean
jBool :: Parser JSON
jBool = JBool <$> (true <|> false)
  where true = string "true" *> return True
        false = string "false" *> return False
 
-- JSON number
jNum :: Parser JSON
jNum = JNum <$> (do
  s <- option 1 (char '-' *> return (-1))
  n <- number
  return (s * n))
  where
    number = do
      i <- integer
      f <- option 0 fraction
      e <- option 0 exponent
      return ((fromIntegral i + f) * (10 ** e))
    integer = do
      first <- digit
      rest <- many digit
      return (read (first:rest))
    fraction = do
      char '.'
      digits <- many1 digit
      let f = read digits :: Integer
          l = fromIntegral (length digits)
      return (fromIntegral f / (10 ^ l))
    exponent = do
      e <- oneOf "eE"
      sign <- option 1 (char '+' *> return 1 <|> char '-' *> return (-1))
      digits <- many1 digit
      return (sign * (read digits))
 
-- JSON string
jStr :: Parser JSON
jStr = JStr <$> (char '"' *> many charParser <* char '"')
  where
    charParser = escapedChar <|> normalChar
    normalChar = noneOf "\\\"\n\r\t"
    escapedChar = char '\\' *> (
      char '"' <|> 
      char '\\' <|> 
      char '/' <|> 
      (char 'b' *> return '\b') <|>
      (char 'f' *> return '\f') <|>
      (char 'n' *> return '\n') <|>
      (char 'r' *> return '\r') <|>
      (char 't' *> return '\t'))
 
-- JSON array
jArr :: Parser JSON
jArr = JArr <$> (char '[' *> spaces *> elements <* spaces <* char ']')
  where
    elements = sepBy (spaces *> jsonParser <* spaces) (char ',')
 
-- JSON object
jObj :: Parser JSON
jObj = JObj <$> (char '{' *> spaces *> pairs <* spaces <* char '}')
  where
    pairs = sepBy pair (spaces *> char ',' <* spaces)
    pair = do
      key <- jStr
      spaces *> char ':' *> spaces
      value <- jsonParser
      case key of
        JStr k -> return (k, value)
        _ -> fail "Expected string key in object"
 
-- Main JSON parser
jsonParser :: Parser JSON
jsonParser = jNull <|> jBool <|> jNum <|> jStr <|> jArr <|> jObj
 
-- Parse JSON from a string
parseJSON :: String -> Either ParseError JSON
parseJSON = parse (spaces *> jsonParser <* spaces <* eof) ""

Key Points to Remember

  1. Parser combinators allow you to build complex parsers by combining simpler ones
  2. Parsec is a popular parser combinator library in Haskell
  3. Parsers are monadic, allowing for clean sequencing of parsing operations
  4. The <|> operator provides alternatives when parsing
  5. Common combinators include many, many1, sepBy, and between
  6. Recursive parsers can handle nested structures like expressions
  7. Parser combinators provide a declarative, composable approach to parsing