Parser API

pest provides several ways of accessing the results of a successful parse. The examples below use the following grammar:

number = { ASCII_DIGIT+ }                // one or more decimal digits
enclosed = { "(.." ~ number ~ "..)" }    // for instance, "(..6472..)"
sum = { number ~ " + " ~ number }        // for instance, "1362 + 12"

Tokens

pest represents successful parses using tokens. Whenever a rule matches, two tokens are produced: one at the start of the text that the rule matched, and one at the end. For example, the rule number applied to the string "3130 abc" would match and produce this pair of tokens:

"3130 abc"
 |   ^ end(number)
 ^ start(number)

Note that the rule doesn't match the entire input text. It only matches as much text as possible, then stops if successful.

A token is like a cursor in the input string. It has a character position in the string, as well as a reference to the rule that created it.

Nested rules

If a named rule contains another named rule, tokens will be produced for both rules. For instance, the rule enclosed applied to the string "(..6472..)" would match and produce these four tokens:

"(..6472..)"
 |  |   |  ^ end(enclosed)
 |  |   ^ end(number)
 |  ^ start(number)
 ^ start(enclosed)

Sometimes, tokens might not occur at distinct character positions. For example, when parsing the rule sum, the inner number rules share some start and end positions:

"1773 + 1362"
 |   |  |   ^ end(sum)
 |   |  |   ^ end(number)
 |   |  ^ start(number)
 |   ^ end(number)
 ^ start(number)
 ^ start(sum)

In fact, for a rule that matches empty input, the start and end tokens will be at the same position!

Interface

Tokens are exposed as the Token enum, which has Start and End variants. You can get an iterator of Tokens by calling tokens on a parse result:

let parse_result = Parser::parse(Rule::sum, "1773 + 1362").unwrap();
let tokens = parse_result.tokens();

for token in tokens {
    println!("{:?}", token);
}

After a successful parse, tokens will occur as nested pairs of matching Start and End. Both kinds of tokens have two fields:

  • rule, which explains which rule generated them; and
  • pos, which indicates their positions.

A start token's position is the first character that the rule matched. An end token's position is the first character that the rule did not match — that is, an end token refers to a position after the match. If a rule matched the entire input string, the end token points to an imaginary position after the string.

Pairs

Tokens are not the most convenient interface, however. Usually you will want to explore the parse tree by considering matching pairs of tokens. For this purpose, pest provides the Pair type.

A Pair represents a matching pair of tokens, or, equivalently, the spanned text that a named rule successfully matched. It is commonly used in several ways:

  • Determining which rule produced the Pair
  • Using the Pair as a raw &str
  • Inspecting the inner named sub-rules that produced the Pair
let pair = Parser::parse(Rule::enclosed, "(..6472..) and more text")
    .unwrap().next().unwrap();

assert_eq!(pair.as_rule(), Rule::enclosed);
assert_eq!(pair.as_str(), "(..6472..)");

let inner_rules = pair.into_inner();
println!("{}", inner_rules); // --> [number(3, 7)]

In general, a Pair might have any number of inner rules: zero, one, or more. For maximum flexibility, Pair::into_inner() returns Pairs, which is an iterator over each pair.

This means that you can use for loops on parse results, as well as iterator methods such as map, filter, and collect.

let pairs = Parser::parse(Rule::sum, "1773 + 1362")
    .unwrap().next().unwrap()
    .into_inner();

let numbers = pairs
    .clone()
    .map(|pair| str::parse(pair.as_str()).unwrap())
    .collect::<Vec<i32>>();
assert_eq!(vec![1773, 1362], numbers);

for (found, expected) in pairs.zip(vec!["1773", "1362"]) {
    assert_eq!(Rule::number, found.as_rule());
    assert_eq!(expected, found.as_str());
}

Pairs iterators are also commonly used via the next method directly. If a rule consists of a known number of sub-rules (for instance, the rule sum has exactly two sub-rules), the sub-matches can be extracted with next and unwrap:

let parse_result = Parser::parse(Rule::sum, "1773 + 1362")
    .unwrap().next().unwrap();
let mut inner_rules = parse_result.into_inner();

let match1 = inner_rules.next().unwrap();
let match2 = inner_rules.next().unwrap();

assert_eq!(match1.as_str(), "1773");
assert_eq!(match2.as_str(), "1362");

Sometimes rules will not have a known number of sub-rules, such as when a sub-rule is repeated with an asterisk *:

list = { number* }

In cases like these it is not possible to call .next().unwrap(), because the number of sub-rules depends on the input string — it cannot be known at compile time.

The parse method

A pest-derived Parser has a single method parse which returns a Result< Pairs, Error >. To access the underlying parse tree, it is necessary to match on or unwrap the result:

// check whether parse was successful
match Parser::parse(Rule::enclosed, "(..6472..)") {
    Ok(mut pairs) => {
        let enclosed = pairs.next().unwrap();
        // ...
    }
    Err(error) => {
        // ...
    }
}

Our examples so far have included the calls Parser::parse(...).unwrap().next().unwrap(). The first unwrap turns the result into a Pairs. If parsing had failed, the program would panic! We only use unwrap in these examples because we already know that they will parse successfully.

In the example above, in order to get to the enclosed rule inside of the Pairs, we use the iterator interface. The next() call returns an Option<Pair>, which we finally unwrap to get the Pair for the enclosed rule.

Using Pair and Pairs with a grammar

While the Result from Parser::parse(...) might very well be an error on invalid input, Pair and Pairs often have more subtle behavior. For instance, with this grammar:

number = { ASCII_DIGIT+ }
sum = { number ~ " + " ~ number }

this function will never panic:

fn process(pair: Pair<Rule>) -> f64 {
    match pair.as_rule() {
        Rule::number => str::parse(pair.as_str()).unwrap(),
        Rule::sum => {
            let mut pairs = pair.into_inner();

            let num1 = pairs.next().unwrap();
            let num2 = pairs.next().unwrap();

            process(num1) + process(num2)
        }
    }
}

str::parse(...).unwrap() is safe because the number rule only ever matches digits, which str::parse(...) can handle. And pairs.next().unwrap() is safe to call twice because a sum match always has two sub-matches, which is guaranteed by the grammar.

Since these sorts of guarantees are awkward to express with Rust types, pest only provides a few high-level types to represent parse trees. Nevertheless, you should rely on the meaning of your grammar for properties such as "contains n sub-rules", "is safe to parse to f32", and "never fails to match". Idiomatic pest code uses unwrap and unreachable!.

Spans and positions

Occasionally, you will want to refer to a matching rule in the context of the raw source text, rather than the interior text alone. For example, you might want to print the entire line that contained the match. For this you can use Span and Position.

A Span is returned from Pair::as_span. Spans have a start position and an end position (which correspond to the start and end tokens of the rule that made the pair).

Spans can be decomposed into their start and end Positions, which provide useful methods for examining the string around that position. For example, Position::line_col() finds out the line and column number of a position.

Essentially, a Position is a Token without a rule. In fact, you can use pattern matching to turn a Token into its component Rule and Position.