1 Parser reference manual
3 1. Introduction
5 The parse_string() utility for DGD parses strings as follows: first, the
6 string is broken up into two types of smaller strings, called "tokens" and
7 "whitespace". Next, the resulting array of tokens (whitespace is ignored) is
8 parsed as a sentence in the language defined by a formal grammar. For any
9 valid parsings, LPC functions are called as appropriate, which may transform
10 the array of tokens, or indicate that a particular parsing of the string
11 is yet to be considered invalid. Finally, the resultant array of tokens, if
12 any, is returned by the function.
14 Regular expressions are used to define tokens and whitespace, and the language
15 of those tokens is defined by a context-free grammar, which can be enhanced
16 with LPC function calls. Token/whitespace definitions and language definition
17 together are called "the grammar".
19 Internally, DGD generates two partial automatons for each grammar, one to
20 match tokens and one to parse a sentence of tokens. Only those parts of the
21 automatons are generated that are required to parse the sentences seen so far;
22 this effect is cumulative for successive calls in each object, ensuring that
23 the automatons are not needlessly generated more than once. Each object can
24 handle only one grammar at a time. If a grammar differs from the previous
25 one used by the same object, any existing generated automatons are wiped out
26 and constructed anew.
29 2. Grammar
31 A grammar consists of token rules and production rules, which may be given in
32 any order. Each grammar must contain at least one token rule and one
33 production rule.
36 2.1 Token rules
38 A token rule has the following format:
40 token = /regexp/
42 where "token" is an identifier name as in LPC, and "regexp" a regular
43 expression, with the following syntax:
45 c a single character except `.', `[', `\', `*', `+', `(', `)', `/'
46 . any single character
47 \c a single character c, including `.', `[', `\', `*', `+', `(', `)',
49 [set] a single character in the given set, which is constructed of
50 single characters such as `a' and/or character ranges such as
51 `a-z'. `\' may be used to escape the characters `]', `^', `-', `\'
52 [^set] a single character not in the given set
54 and, with regular expressions "a" and "b":
56 a* zero or more occurrences of a (highest precedence)
57 a? zero or one occurrences of a
58 a+ one or more occurrences of a
59 ab the concatenation of a and b
60 a|b a or b
61 (a) a (lowest precedence)
63 For any regular expression, the longest possible token will be matched. The
64 name "whitespace" is reserved for defining a special token, which is simply
65 skipped. More than one rule may be specified for each token, including
68 If a string matches more than token, the token for which the rule appears first
69 in the grammar is selected. If a string does not match any token, it is
70 rejected and parsing fails.
73 2.2 Production rules
75 A production rule has the following format:
77 symbol : rule
79 where "symbol" is an identifier name as in LPC for the production rule, and
80 "rule" consists of zero or more token symbols, production rule symbols, or
81 string constants delimited by single quotes.
83 A string constant in a production rule matches the given sequence of characters
84 directly, without reference to a token rule. Since a token rule is implicitly
85 generated for such string constants, the string constant is also said to be a
86 token. String constants take precedence over regular expressions, and any
87 token that matches both a string constant and a regular expression will always
88 match the string constant, only. `\' may be used in a string constant to
89 escape the single quote.
91 More than one production rule may be given for the same symbol. All
92 production rules taken together form a context-free grammar, for which the
93 start symbol is defined in the first rule. Production rules that are
94 unreachable from the start symbol are ignored. Any symbol used in a
95 production rule reachable from the start symbol must itself define a token
96 rule or a production rule.
98 Optionally, a rule can be followed by an LPC function name:
100 symbol : rule ? function
102 where "function" is the name of an LPC function to be called in the current
103 object if the specified rule has been matched.
106 3. Parsing
108 Normally, parse_string() returns a flat array of token strings (if the input
109 string could be parsed as a sentence in the language defined by the grammar)
110 or nil (if the input string could not be parsed). However, the return value
111 can be modified to an array containing arbitrary other values, including
112 sub-arrays, by specifying LPC functions in the grammar that will be called
113 while parsing. Moreover, if the grammar is ambiguous, sub-arrays for the
114 alternative may be automatically included in the result.
117 3.1 LPC functions
119 An LPC function called for a production rule will have a single argument,
120 an array with the parse tree that matches the production rule. LPC functions
121 are called in bottom-up order, so any function calls for sub-parse trees in
122 the current parse tree will already have been made.
124 The LPC function can return its argument unchanged, or modify its argument
125 before returning it, or return a completely different value. Whatever the
126 return value is, it is taken to be an array matching the current rule; if
127 it is not an array at all, the current parse tree (and all parse trees that
128 depend on it) is discarded altogether.
130 LPC functions are typically used to replace tokens in the parse tree by
131 their semantic value, or to add structure to a parse tree, which otherwise
132 would be represented as a flat array.
135 3.2 Ambiguous grammars
137 A grammar that allows more than one parse tree for the same input string is
138 said to be "ambiguous". The string parser will sort alternative parse trees
139 by the order of the topmost different rule in the grammar, and by default
140 ignore all of them but the first one.
142 An optional third argument to parse_string() can be used to retain an
143 additional number of alternative parse trees, which will be merged with the
144 first one; the argument specifies the number of extra alternatives to keep at
145 each point where sub-parse trees are handled by different rules in the grammar.
146 Instead of incorporating the first alternative immediately in the array
147 returned, a sub-array containing all of the alternatives as separate elements
148 is included, sorted by the order of their topmost rule in the grammar.
150 LPC functions called for rules can be used to invalidate alternative parse
151 trees. Invalidated alternatives will be removed from the overall result.
154 3.3 Writing grammars
156 Grammars can be parsed most efficiently if they are not ambiguous. Generally,
157 ambiguous grammars can be rewritten to be non-ambiguous. For example,
158 consider the following (partial) grammar, which is ambiguous:
160 Expr: number
161 Expr: Expr '+' Expr
163 It can be rewritten to be either left-recursive,
165 Expr: number
166 Expr: Expr '+' number
168 or right-recursive:
170 Expr: number
171 Expr: number '+' Expr
173 Neither rewritten grammar is ambiguous, and both explicitly specify the order
174 in which expressions are to be evaluated: from left to right, or from right to
175 left, respectively. If the order doesn't matter, left recursion should be
176 preferred, as this is parsed more efficiently.
This page was automatically generated by the
Visit the LXR main site for more