ArchWizard

DGD/

source navigation ]
diff markup ]
identifier search ]
file search ]
Version: [ 1.0.a0 ] [ 1.1 ] [ 1.2 ] [ 1.2p1 ] [ 1.2p2 ] [ 1.2p3 ] [ 1.2p4 ] [ 1.2.151 ]

  1                         Parser reference manual
  2 
  3 1. Introduction
  4 
  5 The parse_string() utility for DGD parses strings as follows: first, the
  6 string is broken up into two types of smaller strings, called "tokens" and
  7 "whitespace".  Next, the resulting array of tokens (whitespace is ignored) is
  8 parsed as a sentence in the language defined by a formal grammar.  For any
  9 valid parsings, LPC functions are called as appropriate, which may transform
 10 the array of tokens, or indicate that a particular parsing of the string
 11 is yet to be considered invalid.  Finally, the resultant array of tokens, if
 12 any, is returned by the function.
 13 
 14 Regular expressions are used to define tokens and whitespace, and the language
 15 of those tokens is defined by a context-free grammar, which can be enhanced
 16 with LPC function calls.  Token/whitespace definitions and language definition
 17 together are called "the grammar".
 18 
 19 Internally, DGD generates two partial automatons for each grammar, one to
 20 match tokens and one to parse a sentence of tokens.  Only those parts of the
 21 automatons are generated that are required to parse the sentences seen so far;
 22 this effect is cumulative for successive calls in each object, ensuring that
 23 the automatons are not needlessly generated more than once.  Each object can
 24 handle only one grammar at a time.  If a grammar differs from the previous
 25 one used by the same object, any existing generated automatons are wiped out
 26 and constructed anew.
 27 
 28 
 29 2. Grammar
 30 
 31 A grammar consists of token rules and production rules, which may be given in
 32 any order.  Each grammar must contain at least one token rule and one
 33 production rule.
 34 
 35 
 36 2.1 Token rules
 37 
 38 A token rule has the following format:
 39 
 40     token = /regexp/
 41 
 42 where "token" is an identifier name as in LPC, and "regexp" a regular
 43 expression, with the following syntax:
 44 
 45     c       a single character except `.', `[', `\', `*', `+', `(', `)', `/'
 46     .       any single character
 47     \c      a single character c, including `.', `[', `\', `*', `+', `(', `)',
 48             `/'
 49     [set]   a single character in the given set, which is constructed of
 50             single characters such as `a' and/or character ranges such as
 51             `a-z'.  `\' may be used to escape the characters `]', `^', `-', `\'
 52     [^set]  a single character not in the given set
 53 
 54 and, with regular expressions "a" and "b":
 55 
 56     a*      zero or more occurrences of a       (highest precedence)
 57     a?      zero or one occurrences of a
 58     a+      one or more occurrences of a
 59     ab      the concatenation of a and b
 60     a|b     a or b
 61     (a)     a                                   (lowest precedence)
 62 
 63 For any regular expression, the longest possible token will be matched.  The
 64 name "whitespace" is reserved for defining a special token, which is simply
 65 skipped.  More than one rule may be specified for each token, including
 66 whitespace.
 67 
 68 If a string matches more than token, the token for which the rule appears first
 69 in the grammar is selected.  If a string does not match any token, it is
 70 rejected and parsing fails.
 71 
 72 
 73 2.2 Production rules
 74 
 75 A production rule has the following format:
 76 
 77     symbol : rule
 78 
 79 where "symbol" is an identifier name as in LPC for the production rule, and
 80 "rule" consists of zero or more token symbols, production rule symbols, or
 81 string constants delimited by single quotes.
 82 
 83 A string constant in a production rule matches the given sequence of characters
 84 directly, without reference to a token rule.  Since a token rule is implicitly
 85 generated for such string constants, the string constant is also said to be a
 86 token.  String constants take precedence over regular expressions, and any
 87 token that matches both a string constant and a regular expression will always
 88 match the string constant, only.  `\' may be used in a string constant to
 89 escape the single quote.
 90 
 91 More than one production rule may be given for the same symbol.  All
 92 production rules taken together form a context-free grammar, for which the
 93 start symbol is defined in the first rule.  Production rules that are
 94 unreachable from the start symbol are ignored.  Any symbol used in a
 95 production rule reachable from the start symbol must itself define a token
 96 rule or a production rule.
 97 
 98 Optionally, a rule can be followed by an LPC function name:
 99 
100     symbol : rule ? function
101 
102 where "function" is the name of an LPC function to be called in the current
103 object if the specified rule has been matched.
104 
105 
106 3. Parsing
107 
108 Normally, parse_string() returns a flat array of token strings (if the input
109 string could be parsed as a sentence in the language defined by the grammar)
110 or nil (if the input string could not be parsed).  However, the return value
111 can be modified to an array containing arbitrary other values, including
112 sub-arrays, by specifying LPC functions in the grammar that will be called
113 while parsing.  Moreover, if the grammar is ambiguous, sub-arrays for the
114 alternative may be automatically included in the result.
115 
116 
117 3.1 LPC functions
118 
119 An LPC function called for a production rule will have a single argument,
120 an array with the parse tree that matches the production rule.  LPC functions
121 are called in bottom-up order, so any function calls for sub-parse trees in
122 the current parse tree will already have been made.
123 
124 The LPC function can return its argument unchanged, or modify its argument
125 before returning it, or return a completely different value.  Whatever the
126 return value is, it is taken to be an array matching the current rule; if
127 it is not an array at all, the current parse tree (and all parse trees that
128 depend on it) is discarded altogether.
129 
130 LPC functions are typically used to replace tokens in the parse tree by
131 their semantic value, or to add structure to a parse tree, which otherwise
132 would be represented as a flat array.
133 
134 
135 3.2 Ambiguous grammars
136 
137 A grammar that allows more than one parse tree for the same input string is
138 said to be "ambiguous".  The string parser will sort alternative parse trees
139 by the order of the topmost different rule in the grammar, and by default
140 ignore all of them but the first one.
141 
142 An optional third argument to parse_string() can be used to retain an
143 additional number of alternative parse trees, which will be merged with the
144 first one; the argument specifies the number of extra alternatives to keep at
145 each point where sub-parse trees are handled by different rules in the grammar.
146 Instead of incorporating the first alternative immediately in the array
147 returned, a sub-array containing all of the alternatives as separate elements
148 is included, sorted by the order of their topmost rule in the grammar.
149 
150 LPC functions called for rules can be used to invalidate alternative parse
151 trees.  Invalidated alternatives will be removed from the overall result.
152 
153 
154 3.3 Writing grammars
155 
156 Grammars can be parsed most efficiently if they are not ambiguous.  Generally,
157 ambiguous grammars can be rewritten to be non-ambiguous.  For example,
158 consider the following (partial) grammar, which is ambiguous:
159 
160     Expr: number
161     Expr: Expr '+' Expr
162 
163 It can be rewritten to be either left-recursive,
164 
165     Expr: number
166     Expr: Expr '+' number
167 
168 or right-recursive:
169 
170     Expr: number
171     Expr: number '+' Expr
172 
173 Neither rewritten grammar is ambiguous, and both explicitly specify the order
174 in which expressions are to be evaluated: from left to right, or from right to
175 left, respectively.  If the order doesn't matter, left recursion should be
176 preferred, as this is parsed more efficiently.

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~ [ file search ] ~

This page was automatically generated by the LXR engine.
Visit the LXR main site for more information.