9.2 Extendable grammars

9.2 Extendable grammars

Grammar rules can be added with the Grammar command. This command is just an interface towards Camlp4, providing the semantic actions so that they build the expected AST. A simple grammar command has the following syntax:

Grammar entry nonterminal := rule-name LMP -> action .


The components have the following meaning:


grammar ::= Grammar entry gram-entry with ... with gram-entry
entry ::= ident
gram-entry ::= rule-name [: entry-type] := [production | ... | production]
rule-name ::= ident
entry-type ::= Ast | List
production ::= rule-name [ [prod-item ... prod-item] ] ->  action
rule-name ::= ident
prod-item ::= string
  | [entry :] entry-name  [( meta )]
action ::= [ [ast ... ast] ]
  | let pattern = action  in action
  | case action [: entry-type] of  [case | ... | case] esac
case ::= [pattern ... pattern] -> action
pattern ::= ast


Figure 9.2:Syntax of the grammar extension command

The exact syntax of the Grammar command is defined in Fig. 9.2. It is possible to extend a grammar with several rules at once.
Grammar  entry  nonterminal := production1
  |
×
×
×
  | productionn .
Productions are entered in reverse order (i.e. productionn before production1), so that the first rules have priority over the last ones. The set of rules can be read as an usual pattern matching.

Also, we can extend several grammars of a given universe at the same time. The order of non-terminals does not matter since they extend different grammars.
Grammar entry nonterminal1 := production11
      |
×
×
×
      | productionn11
  with
×
×
×
   
  with nonterminalp := production1p
      |
×
×
×
      | productionnpp .

9.2.1 Grammar entries

Let us describe the four predefined entries. Each of them (except prim) possesses an initial grammar for starting the parsing process.



The user can define new entries and new non-terminals, using the grammar extension command. A grammar does not have to be explicitly defined. But the grammars in the left member of rules must all be defined, possibly by the current grammar command. It may be convenient to define an empty grammar, just so that it may be called by other grammars, and extend this empty grammar later. Assume that the command command13 does not exist. The next command defines it with zero productions.

Coq < Grammar command command13 := .


The grammars of new entries do not have an initial grammar. To use them, they must be called (directly or indirectly) by grammars of predefined entries. We give an example of a (direct) call of the grammar newentry nonterm by command command. This following rule allows to use the syntax a&b for the conjunction a/\b.

Coq < Grammar newentry nonterm :=
Coq < ampersand [ "&" command:command($c) ] -> [$c].

Coq < Grammar command command :=
Coq < new_and [ command8($a) newentry:nonterm($b) ] -> [<<$a/\$b>>].

9.2.2 Left member of productions (LMP)

A LMP is composed of a combination of terminals (enclosed between double quotes) and grammar calls specifying the entry. It is enclosed between ``['' and ``]''. The empty LMP, represented by [ ], corresponds to e in formal language theory.

A grammar call is done by entry:nonterminal($id) where:

The elements entry and $id are optional. The grammar entry can be omitted if it is the same as the entry of the non-terminal we are extending. Also, $id is omitted if we do not want to get back the AST result. Thus a grammar call can be reduced to a non-terminal.

Each terminal must contain exactly one token. This token does not need to be already defined. If not, it will be automatically added. Nethertheless, any string cannot be a token (e.g. blanks should not appear in tokens since parsing would then depend on indentation). We introduce the notion of valid token, as a sequence, without blanks, of characters taken from the following list:
< > / \ - + = ; , | ! @ # % ^ & * ( ) ? : ~ $ _ ` ' a..z A..Z 0..9
that do not start with a character from
$ _ a..z A..Z ' 0..9


When an LMP is used in the parsing process of an expression, it is analyzed from left to right. Every token met in the LMP should correspond to the current token of the expression. As for the grammars calls, they are performed in order to recognize parts of the initial expression.


Warning: Unlike destructive expressions, if a variable appears several times in the LMP, the last binding hides the previous ones. Comparison can be performed only in the actions.


Example 1: Defining a syntax for inequality

The rule below allows us to use the syntax t1#t2 for the term ~t1=t2.

Coq < Grammar command command1 :=
Coq < not_eq [ command0($a) "#" command0($b) ] -> [<<~($a=$b)>>].


The level 1 of the grammar of terms is extended with one rule named not_eq. When this rule is selected, its LMP calls the grammar command command0. This grammar recognizes a term that it binds to the metavariable $a. Then it meets the token ``#'' and finally it calls the grammar command command0. This grammar returns the recognized term in $b. The action constructs the term ~($a=$b). Note that we use the command quotation on the right-hand side.


Warning: Metavariables are identifiers preceded by the ``$'' symbol. They cannot be replaced by identifiers. For instance, if we enter a rule with identifiers and not metavariables, an error occurs:

Coq < Grammar command command1 :=
Coq < not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
Warning: Could not globalize a
Warning: Could not globalize b
Toplevel input, characters 49-50
>   not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
>                     ^
Error: This ident is not a metavariable.


For instance, let us give the statement of the symmetry of #:

Coq < Goal (A:Set)(a,b:A) a#b -> b#a.
1 subgoal
  
  ============================
   (A:Set)(a,b:A)~a=b->~b=a


Conversely, one can force ~a=b to be printed a#b by giving pretty-printing rules. This is explained in section 9.3.


Example 2: Redefining vernac commands

Thanks to the following rule, ``|- term.'' will have the same effect as ``Goal term.''.

Coq < Grammar vernac vernac :=
Coq < thesis [ "|" "-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].


This rule allows putting blanks between the bar and the dash, as in

Coq < | - (A:Prop)A->A.
1 subgoal
  
  ============================
   (A:Prop)A->A


Assuming the previous rule has not been entered, we can forbid blanks with a rule that declares ``|-'' as a single token:

Coq < Grammar vernac vernac :=
Coq < thesis [ "|-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].

Coq < | - (A:Prop)A->A.
Toplevel input, characters 0-1
> | - (A:Prop)A->A.
> ^
Syntax error: illegal begin of vernac


If both rules were entered, we would have three tokens |, - and |-. The lexical ambiguity on the string |- is solved according to the longest match rule (see lexical conventions page X), i.e. |- would be one single token. To enforce the use of the first rule, a blank must be inserted between the bar and the dash.


Remark: The vernac commands should always be terminated by a period. When a syntax error is detected, the toplevel discards its input until it reaches a period token, and then resumes parsing.


Example 3: Redefining tactics

We can give names to repetitive tactic sequences. Thus in this example ``IntSp'' will correspond to the tactic Intros followed by Split.

Coq < Grammar tactic simple_tactic :=
Coq < intros_split [ "IntSp" ] -> [<:tactic:<Intros; Split>>].


Let us check that this works.

Coq < Goal (A,B:Prop)A/\B -> B/\A.
1 subgoal
  
  ============================
   (A,B:Prop)A/\B->B/\A

Coq < IntSp.
2 subgoals
  
  A : Prop
  B : Prop
  H : A/\B
  ============================
   B
subgoal 2 is:
 A


Note that the same result can be obtained in a simpler way with Tactic Definition (see page X). However, this macro can only define tactics which arguments are terms.


Example 4: Priority, left and right associativity of operators

The disjunction has a higher priority than conjunction. Thus A/\B\/C will be parsed as (A/\B)\/C and not as A/\(B\/C). The priority is done by putting the rule for the disjonction in a higher level than that of conjonction: conjunction is defined in the non-terminal command6 and disjunction in command7 (see file Logic.v in the library). Notice that the character ``\'' must be doubled (see lexical conventions for quoted strings on page X).

Coq < Grammar command command6 :=
Coq < and [ command5($c1) "/\\" command6($c2) ] -> [<<(and $c1 $c2)>>].

Coq < Grammar command command7 :=
Coq < or [ command6($c1) "\\/" command7($c2) ] -> [<<(or $c1 $c2)>>].


Thus conjunction and disjunction associate to the right since in both cases the priority of the right term (resp. command6 and command7) is higher than the priority of the left term (resp. command5 and command6). The left member of a conjunction cannot be itself a conjunction, unless you enclose it inside parenthesis.

The left associativity is done by calling recursively the non-terminal. Camlp4 deals with this recursion by first trying the non-left-recursive rules. Here is an example taken from the standard library, defining a syntax for the addition on integers:

Coq < Grammar znatural expr :=
Coq < expr_plus [ expr($p) "+" expr($c) ] -> [<<(Zplus $p $c)>>].

9.2.3 Actions

Every rule should generate an AST corresponding to the syntactic construction that it recognizes. This generation is done by an action. Thus every rule is associated to an action. The syntax has been defined in Fig. 9.2. We give some examples.

Simple actions

A simple action is an AST enclosed between ``['' and ``]''. It simply builds the AST by interpreting it as a constructive expression in the environment defined by the LMP. This case has been illustrated in all the previous examples. We will later see that grammars can also return AST lists.

Local definitions

When an action should generate a big term, we can use let pattern = action1 in action2expressions to construct it progressively. The action action1 is first computed, then it is matched against pattern which may bind metavariables, and the result is the evaluation of action2 in this new context.


Example 5:

From the syntax t1*+t2, we generate the term (plus (plus t1 t2) (mult t1 t2)).

Coq < Grammar command command1 :=
Coq < mult_plus [ command0($a) "*" "+" command0($b) ]
Coq < -> let $p1=[<<(plus $a $b)>>] in
Coq < let $p2=[<<(mult $a $b)>>] in
Coq < [<<(plus $p1 $p2)>>].


Let us give an example with this syntax:

Coq < Goal (O*+O)=O.
1 subgoal
  
  ============================
   (plus (plus O O) (mult O O))=O

Conditional actions

We recall the syntax of conditional actions:

case action of pattern1 -> action1 | ... | patternn -> actionn esac


The action to execute is chosen according to the value of action. The matching is performed from left to right. The selected action is the one associated to the first pattern that matches the value of action. This matching operation will bind the metavariables appearing in the selected pattern. The pattern matching does need being exhaustive, and no warning is emited. When the pattern matching fails a message reports in which grammar rule the failure happened.


Example 6: Overloading the ``+'' operator

The internal representaion of an expression such as A+B depends on the shape of A and B: The trick is to build a temporary AST: {A} generates the node (SQUASH A). When we parse A+B, we remove the SQUASH in A and B:

Coq < Grammar command command1 :=
Coq < squash [ "{" lcommand($lc) "}" ] -> [(SQUASH $lc)].

Coq < Grammar command lassoc_command4 :=
Coq < squash_sum
Coq < [ lassoc_command4($c1) "+" lassoc_command4($c2) ] ->
Coq < case [$c2] of
Coq < (SQUASH $T2) ->
Coq < case [$c1] of
Coq < (SQUASH $T1) -> [<<(sumbool $T1 $T2)>>]
Coq < | $_ -> [<<(sumor $c1 $T2)>>]
Coq < esac
Coq < | $_ -> [<<(sum $c1 $c2)>>]
Coq < esac.


The problem is that sometimes, the intermediate SQUASH node cannot re-shaped, then we have a very specific error:

Coq < Check {True}.
Toplevel input, characters 6-12
> Check {True}.
>       ^^^^^^
Error: Unrecognizable braces expression.



Example 7: Comparisons and non-linear patterns

The patterns may be non-linear: when an already bound metavariable appears in a pattern, the value yielded by the pattern matching must be equal, up to renaming of bound variables, to the current value. Note that this does not apply to the wildcard $_. For example, we can compare two arguments:

Coq < Grammar command command10 :=
Coq < refl_equals [ command9($c1) "||" command9($c2) ] ->
Coq < case [$c1] of $c2 -> [<<(refl_equal ? $c2)>>] esac.

Coq < Check ([x:nat]x || [y:nat]y).
(refl_equal nat->nat [y:nat]y)
     : ([y:nat]y)=([y:nat]y)


The metavariable $c1 is bound to [x:nat]x and $c2 to [y:nat]y. Since these two values are equal, the pattern matching succeeds. It fails when the two terms are not equal:

Coq < Check ([x:nat]x || [z:bool]z).
Toplevel input, characters 7-28
> Check ([x:nat]x || [z:bool]z).
>        ^^^^^^^^^^^^^^^^^^^^^
Error: during interpretation of grammar rule refl_equals,
 Grammar case failure. The ast (LAMBDA nat [x]x)
 does not match any of the patterns : $c2
 with constraints : $c1 = (LAMBDA nat [x]x)
                    $c2 = (LAMBDA bool [z]z)

9.2.4 Grammars of type List

Assume we want to define an non-terminal ne_identarg_list that parses an non-empty list of identifiers. If the grammars could only return AST's, we would have to define it this way:

Coq < Grammar tactic my_ne_ident_list :=
Coq < ident_list_cons [ identarg($id) my_ne_ident_list($l) ] ->
Coq < case [$l] of
Coq < (IDENTS ($LIST $idl)) -> [(IDENTS $id ($LIST $idl))]
Coq < esac
Coq < | ident_list_single [ identarg($id) ] -> [(IDENTS $id)].


But it would be inefficient: every time an identifier is read, we remove the ``boxing'' operator IDENTS, and put it back once the identifier is inserted in the list.

To avoid these awkward trick, we allow grammars to return AST lists. Hence grammars have a type (Ast or List), just like AST's do. Type-checking can be done statically.

The simple actions can produce lists by putting a list of constructive expressions one beside the other. As usual, the $LIST operator allows to inject AST list variables.

Coq < Grammar tactic ne_identarg_list : List :=
Coq < ne_idl_cons [ identarg($id) ne_identarg_list($idl) ]
Coq < -> [ $id ($LIST $idl) ]
Coq < | ne_idl_single [ identarg($id) ] -> [ $id ].


Note that the grammar type must be recalled in every extension command, or else the system could not discriminate between a single AST and an AST list with only one item. If omitted, the default type is Ast. The following command fails because ne_identarg_list is already defined with type List but the Grammar command header assumes its type is Ast.

Coq < Grammar tactic ne_identarg_list :=
Coq < list_two [ identarg($id1) identarg($id2) ] -> [ $id1 $id2 ].
Toplevel input, characters 15-31
> Grammar tactic ne_identarg_list :=
>                ^^^^^^^^^^^^^^^^
Error: Entry tactic:ne_identarg_list already exists with another type


All rules of a same grammar must have the same type. For instance, the following rule is refused because the command command1 grammar has been already defined with type Ast, and cannot be extended with a rule returning AST lists.

Coq < Grammar command command1 :=
Coq < carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
Toplevel input, characters 146-153
>   carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
>                                                       ^^^^^^^
Error: entry type is Ast, but the right hand side is a list

9.2.5 Limitations

The extendable grammar mechanism have four serious limitations. The first two are inherited from Camlp4.

The command Print Grammar prints the rules of a grammar. It is displayed by Camlp4. So, the actions are not printed, and the recursive calls are printed SELF. It is sometimes useful if the user wants to understand why parsing fails, or why a factorization was not done as expected.

Coq < Print Grammar command command8.
[ LEFTA
  [ Command.command7; "<->"; SELF
  | Command.command7; "->"; SELF
  | Command.command7 ] ]

Getting round the lack of factorization

The first limitation may require a non-trivial work, and may lead to ugly grammars, hardly extendable. Sometimes, we can use a trick to avoid these troubles. The problem arises in the Gallina syntax, to make Camlp4 factorize the rules for application and product. The natural grammar would be:

Coq < Grammar command command0 :=
Coq < parenthesis [ "(" command10($c) ")" ] -> [$c]
Coq < | product [ "(" prim:var($id) ":" command($c1) ")" command0($c2) ] ->
Coq < [(PROD $c1 [$id]$c2)]
Coq < with command10 :=
Coq < application [ command91($c1) ne_command91_list($lc) ] ->
Coq < [(APPLIST $c1 ($LIST $lc))]
Coq < | inject_com91 [ command91($c) ] -> [$c].

Coq < Check (x:nat)nat.
Toplevel input, characters 8-9
> Check (x:nat)nat.
>         ^
Syntax error: ')' expected after [Command.command10] (in [Command.command0])


But the factorization does not work, thus the product rule is never selected since identifiers match the command10 grammar. The trick is to parse the ident as a command10 and check a posteriori that the command is indeed an identifier:

Coq < Grammar command command0 :=
Coq < product [ "(" command10($c) ":" command($c1) ")" command0($c2) ] ->
Coq < [(PROD $c1 [$c]$c2)].

Coq < Check (x:nat)nat.
nat->nat
     : Set


We could have checked it explicitely with a case in the right-hand side of the rule, but the error message in the following example would not be as relevant:

Coq < Check (S O:nat)nat.
Toplevel input, characters 7-10
> Check (S O:nat)nat.
>        ^^^
Error: during interpretation of grammar rule product,
 the variable $c was bound to (APPLIST S O)
 instead of a (list of) variable(s).


This trick is not similar to the SQUASH node in which we could not detect the error while parsing. Here, the error pops out when trying to build an abstraction of $c2 over the value of $c. Since it is not bound to a variable, the right-hand side of the product grammar rule fails.


Retrieved by Memoweb from http://pauillac.inria.fr/coq/doc/node.2.0.1.html at 8/10/98