Universität Stuttgart
Fakultät Informatik

A New Parsing Strategy for Context-Free Grammars

Thomas Schöbel

Computer Science Report No. 7/91
May 1991

1st printing May 1991. All rights reserved.
Dekanat der Fakultät Informatik, Universität Stuttgart
Breitwiesenstr. 20-22, D-7000 Stuttgart 80, Germany

A New Parsing Strategy for Context-Free Grammars
Extended abstract from a paper currently in preparation
Thomas Schöbel
2 May 1991

Abstract

We present a family of new top-down parsing algorithms for arbitrary context-free grammars, based on the notion of a "tree of possible left-derivations". When coded in a special way, this tree can be compressed to a graph. While the time complexity on uniform RAMs is O(n³) in the worst case, we need only O(n²) space. We show that the asymptotic time complexities are equal to those of Earley's algorithm, and give some arguments indicating that our algorithm will be better in almost all cases.

1 Introduction

Since Cocke, Younger and Kasami (CYK) presented the first polynomial-time algorithms for the recognition of context-free languages, much research has been done in the field of recognition and parsing. Among the various parsers, Earley's famous algorithm [Ea70] is the most commonly used. Graham, Harrison and Ruzzo [GHR80] have shown that the CYK and Earley methods are "almost" the same, both being based on an upper triangular matrix principally containing triples as elements. Another approach is Tomita's method [To86], which uses a graph for compressing multiple stacks into a smaller structure.
In contrast to our proposed algorithm, he takes the various bottom-up algorithms (SLR, LR(0), LR(k), LALR, ...) as the basis for compressing the exponentially growing computation tree. Our method is founded on a theoretical basis from which we may derive multiple algorithms, so we think the primary contribution of this paper to the field will be another view of parsing. At the current state, however, we also feel that there are gains compared to Earley's algorithm and the many improvements of it which can be found in the literature (e.g. [BPS75], [GHR80], [KA89], [Lei90]). We are currently working on a detailed discussion of the related topics.

2 Basic Definitions, TPL and CTPL

We will use a notation similar to that of [HU79]. Kleene closure will be denoted by an asterisk *. A context-free grammar G (or simply grammar) is defined as G = (Σ, N, P, S), where Σ is the terminal alphabet, N the nonterminal alphabet, Σ ∩ N = ∅, V := Σ ∪ N the alphabet, P ⊆ N × V* the set of productions, and S ∈ N the start symbol. Terminals will be denoted by a, b, c, ..., nonterminals by A, B, C, ..., strings from V* by α, β, γ, ..., the empty string is ε, and productions will be written X → γ.

We introduce the notion of a left-derivation in terms of relations: Let G be a grammar; then the relation ⇒_L on V* is defined by

α ⇒_L β if and only if α = α₁Xα₂, β = α₁γα₂, X → γ ∈ P, α₁ ∈ Σ*, α₂ ∈ V*.

A left-derivation D_L is a (finite or infinite) sequence of strings D_L(α) := (α₀, α₁, α₂, ...) where αᵢ ⇒_L αᵢ₊₁ and α₀ = α. This definition directly leads to the notion of a "tree of possible left-derivations": we just write the left-derivation relation as a tree.

Definition 2.1 Let G be a grammar. The Tree of Possible Left-Derivations (TPL) of G is a (possibly infinite) tree Ġ = (V̇, Ė) with root ṙ(Ġ) := ṡ ∈ V̇, where V̇ is the set of nodes, labeled with strings by λ : V̇ → V*, λ(ṡ) = S, and Ė ⊆ V̇ × V̇ is the set of edges, such that

∀ v̇ ∈ V̇ ∀ α ∈ V* with λ(v̇) ⇒_L α : ∃ ẇ ∈ V̇, (v̇, ẇ) ∈ Ė : λ(ẇ) = α
∀ (v̇, ẇ₁), (v̇, ẇ₂) ∈ Ė : ẇ₁ ≠ ẇ₂ ⟹ λ(ẇ₁) ≠ λ(ẇ₂)

Note that we always write objects belonging to a tree or directed graph with a dot on them, to distinguish V̇ (the set of vertices) from V (the alphabet). Also, we regard a tree as a specialized directed graph that is acyclic and fully connected, with every node having at most one predecessor. The root of a tree has no predecessors.

Intuitively, the TPL shows all possible left-derivations, starting from the root ṡ labeled with S. Siblings always represent different alternatives of derivation. In general, the TPL will be infinite in depth and grow exponentially in breadth. It is clear that the concept of the TPL will not directly serve for efficient recognition or parsing, because of its excessive consumption of resources. Therefore, we introduce a more efficient coding of the TPL, called the CTPL (Compressed Tree of Possible Left-Derivations).

Every label α of the TPL is partitioned as follows: α = xXδ, where x ∈ Σ* is called the (terminal) prefix, X ∈ N ∪ {ε} the (nonterminal) infix, and δ ∈ V* the suffix. With the additional requirement that X = ε ⟹ δ = ε, this partitioning is unique.

Let us look at a derivation step in the TPL starting at v̇ ∈ V̇ with label λ(v̇) = α = xXδ: Let X → γ be a production; then xXδ ⇒_L xγδ, where xγδ labels a son of v̇. Derivation takes place only at the X; the other components (x and δ) remain unchanged in a derivation step (a property of context-freeness). The basic idea of the CTPL is to register only the modifications along a path in the TPL. This is accomplished by repeatedly partitioning γ into components x′X′δ′ by use of three different labeling functions. The same is then done with the direct derivation of X′, and so on. As long as X ≠ ε, this results in a sequence of derivation steps where only the labeling differences from the predecessor need to be registered.
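To make the TPL concrete, the following sketch (in Python; the toy grammar S → aSb | ab and the uppercase-nonterminal convention are our illustrative assumptions, not part of the paper) enumerates the levels of the tree by expanding the leftmost nonterminal in all possible ways:

```python
# Sketch: enumerate the TPL of a toy grammar level by level.
# Sentential forms are tuples of symbols; single uppercase letters
# stand for nonterminals (an encoding convention of this sketch).

def is_nonterminal(sym):
    return sym.isupper()

def left_derive(alpha, productions):
    """All beta with alpha =>_L beta: rewrite the leftmost nonterminal."""
    for i, sym in enumerate(alpha):
        if is_nonterminal(sym):
            # alpha = alpha1 X alpha2 with alpha1 containing only terminals
            return [alpha[:i] + gamma + alpha[i + 1:]
                    for (lhs, gamma) in productions if lhs == sym]
    return []  # a sentence: a leaf of the TPL, no further derivation

# Toy grammar: S -> aSb | ab   (generates a^n b^n, n >= 1)
P = [('S', ('a', 'S', 'b')), ('S', ('a', 'b'))]

level = [('S',)]          # root of the TPL, labeled S
for depth in range(3):    # the TPL is infinite; cut it off at depth 3
    level = [beta for alpha in level for beta in left_derive(alpha, P)]
```

Each iteration produces one generation of siblings; fully derived sentences drop out of the frontier as leaves. For this unambiguous toy grammar the breadth stays small, but for ambiguous grammars it grows exponentially, as noted above.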
For the CTPL, we need a prefix labeling function π assigned to the edges of the tree, registering the local differences of the terminal prefix x. The infix labeling function μ contains the nonterminal X, and the suffix labeling function σ contains the rest δ; both are assigned to the nodes of the tree. As we shall see, every node v̇ in the TPL has a corresponding node v̇′ in the CTPL. To get the full prefix of v̇, we concatenate all prefixes along the path in the CTPL from the root ṡ′ to v̇′. Similarly, the full suffix is obtained by concatenating all suffixes along the path upwards to the root. So the CTPL contains the same information as the TPL.

However, there are difficulties: when deriving a sentential form, a γ may be empty or may consist only of terminals. In other words, the existence of a nonterminal X, and therefore of a non-empty infix, is not guaranteed, and further derivations would be impossible. In this case, we have to perform some transformations on the tree, producing an equivalent subtree for the "dead end". Normally, the suffixes δ of other γ's contain nonterminals, which have to be expanded, so it is clear that our tree transformation has to activate these nonterminals.

Definition 2.2 Let G be a grammar. A Compressed Tree of Possible Left-Derivations (CTPL) is a tree Ġ = (V̇, Ė) with root ṙ(Ġ) := ṡ and labeling functions

π : Ė → Σ*  (prefix labeling)
μ : V̇ → N ∪ {ε}  (infix labeling)
σ : V̇ → V*  (suffix labeling)

satisfying the following conditions:

(1) μ(ṡ) = S, σ(ṡ) = ε
(2) ∀ v̇ ∈ V̇ ∀ μ(v̇) → γ ∈ P ∃ (v̇, ẇ) ∈ Ė : γ = π((v̇, ẇ)) μ(ẇ) σ(ẇ)
(3) μ(v̇) = ε ⟹ σ(v̇) = ε
(4) (u̇, v̇), (v̇, ẇ) ∈ Ė ∧ μ(ẇ)σ(ẇ) = ε if and only if ∃ (u̇, ḃ) ∈ Ė : π((u̇, ḃ)) = π((u̇, v̇)) π((v̇, ẇ)) z, μ(ḃ) = X′, σ(ḃ) = δ′, where σ(v̇) = zX′δ′, z ∈ Σ*, X′ ∈ N ∪ {ε}, δ′ ∈ V*
(5) Ġ contains nothing else.
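The unique partitioning α = xXδ underlying these labeling functions can be sketched as follows (our illustration; strings are again tuples of symbols, with uppercase letters standing for nonterminals):

```python
# Sketch: unique partition alpha = x X delta, with x containing only
# terminals, X the leftmost nonterminal (or epsilon), delta the rest.
# The convention X = epsilon => delta = epsilon makes the split unique.

def partition(alpha):
    for i, sym in enumerate(alpha):
        if sym.isupper():                       # leftmost nonterminal found
            return alpha[:i], sym, alpha[i + 1:]  # (x, X, delta)
    return alpha, '', ()                        # X = epsilon => delta = epsilon
```

A derivation step rewrites only the X component while x and δ stay untouched, which is exactly the redundancy the CTPL factors out.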
Conditions (1) and (2) give an inductive definition of the directly derivable nodes in the CTPL. Condition (3) ensures that existing nonterminals will always appear in μ rather than in σ. Condition (4) is the tree transformation applied when some μ(ẇ) is ε: in this case, we generate a brother ḃ of the father v̇, which receives the suffix σ(v̇) for partitioning into three components. As we will see, ẇ and ḃ have the same meaning in denoting a derivation. Furthermore, note that ḃ may have an empty μ(ḃ) as well; then condition (4) generates another "uncle" denoting the same derivation.

To show the equivalence of TPL and CTPL, we introduce the notions of total prefixes and total suffixes.

Definition 2.3 Let Ġ be a CTPL with root ṙ(Ġ) := ṡ over a grammar G. Let â be a path (ṡ, v̇₁, v̇₂, ..., v̇ₖ, v̇) from the root ṡ to a node v̇. We define the following functions:

1. The total prefix π̂(â) of the path â is
   π̂(â) := π((ṡ, v̇₁)) π((v̇₁, v̇₂)) ... π((v̇ₖ₋₁, v̇ₖ)) π((v̇ₖ, v̇))
2. The total suffix σ̂(â) of the path â is
   σ̂(â) := σ(v̇) σ(v̇ₖ) ... σ(v̇₁) σ(ṡ)
3. The sentential form Φ̂(â) of the path â is
   Φ̂(â) := π̂(â) μ(v̇) σ̂(â)

When dealing with trees (rather than, later, with less restricted graphs), every node v̇ has a unique path â from the root ṡ to v̇. Therefore, we may write Φ̂(v̇) as a short form instead of explicitly giving the path â as the argument of Φ̂(â).

Lemma 2.4 Let Ġ be a CTPL with root ṙ(Ġ) := ṡ. Let u̇, v̇, ẇ, ḃ ∈ V̇ satisfy condition (4) of Definition 2.2. Then Φ̂(ẇ) = Φ̂(ḃ).

Proofs are omitted here.

Lemma 2.5 Let Ġ be a CTPL. The relation ≅ on V̇, defined by v̇ ≅ ẇ if and only if Φ̂(v̇) = Φ̂(ẇ), is an equivalence relation.

The proof is trivial, for ≅ is directly defined via the equivalence relation =. Now we are ready to show the equivalence of TPL and CTPL:

Theorem 2.6 Let G be a grammar, Ġ = (V̇, Ė) the CTPL and Ġ′ = (V̇′, Ė′) the TPL of G. Then

(1) ∀ v̇ ∈ V̇ ∃ v̇′ ∈ V̇′ : Φ̂(v̇) = λ(v̇′)
(2) ∀ v̇′ ∈ V̇′ ∃ v̇ ∈ V̇ : Φ̂(v̇) = λ(v̇′)
(3) ∀ (v̇′, ẇ′) ∈ Ė′ ∃ (v̇, ẇ) ∈ Ė : Φ̂(v̇) = λ(v̇′), Φ̂(ẇ) = λ(ẇ′)

Informally, we see that TPL and CTPL are equivalent in a certain sense. The CTPL may have more nodes because of its internal equivalence relation (Lemma 2.5), so strictly speaking only the equivalence classes of the CTPL are equivalent to the TPL.

3 The CGPL

Until now we were dealing with structures which, in the worst case, may grow exponentially in breadth. Furthermore, there may be infinite recursions in the CTPL when the grammar is left-recursive. Lastly, both structures generate all sentences of the language L(G), without bound. The last problem can be handled easily by a parsing algorithm: we just have to prune all branches whose total prefixes do not match the given input string x ∈ L(G). The former problems can be solved by turning the CTPL into a true directed graph called the CGPL, which allows melting nodes with common subtrees.

Some difficulties arise from dealing with labeled infinite trees. Hence, we need a definition of equivalence between infinite trees. Additionally, we want to compress trees to graphs, and therefore we need all the definitions for graphs as well, regarding trees as a special case. For convenience, we will also call the source (a single particular node) of a directed graph Ġ the root ṙ(Ġ).[1]

[1] Note that the root of a tree has no predecessors; in contrast, the root of a graph can have one. In our algorithm, the root of the main graph will never have predecessors, in contrast to subgraphs, where cycles may cause this.

Definition 3.1 Let Ġ = (V̇, Ė) be an arbitrary graph with root ṙ(Ġ) := ṡ. The set of successors Ṙ(v̇) of a node v̇ ∈ V̇ is defined as

Ṙ(v̇) := Ṙ¹(v̇) := {ẇ | (v̇, ẇ) ∈ Ė}

The extension of Ṙ to a set of nodes Ȧ is defined by

Ṙ(Ȧ) := Ṙ¹(Ȧ) := ⋃_{v̇∈Ȧ} Ṙ(v̇)

The set of i-reachable nodes Ṙⁱ(v̇) resp.
Ṙⁱ(Ȧ) is defined by

Ṙⁱ(v̇) := Ṙ(Ṙⁱ⁻¹(v̇)),  Ṙⁱ(Ȧ) := ⋃_{v̇∈Ȧ} Ṙⁱ(v̇)

The set of successors is the same as the set of 1-reachable nodes. Note also that if the graph has cycles, a node ẇ may occur infinitely often in the sequence {Ṙⁱ(v̇)}. Membership ẇ ∈ Ṙⁱ(v̇) means that there exists a path of length exactly i from v̇ to ẇ. We close the definition for i = 0 by defining Ṙ⁰(v̇) := {v̇}, and also for negative i, denoting predecessors, by Ṙ⁻¹(ẇ) := {v̇ | (v̇, ẇ) ∈ Ė} and analogously for i ∈ {−2, −3, ...}.

Definition 3.2 Let Ġ = (V̇, Ė) be a (possibly infinite) graph (tree). A subgraph (subtree) Ṡ(v̇) of a node v̇ ∈ V̇ is a graph (tree) Ṡ(v̇) := (V̇′, Ė′), where V̇′ ⊆ V̇, Ė′ ⊆ Ė, and both subsets are defined by

V̇′ := ⋃_{i∈ℕ} Ṙⁱ(v̇),  Ė′ := {(u̇, ẇ) ∈ Ė | u̇ ∈ V̇′},  ṙ((V̇′, Ė′)) := v̇

Intuitively, this definition of a subgraph encompasses all reachable nodes. Note that in Ṡ(v̇), v̇ is the new root of the subgraph. When the subgraph has no cycles, the root has (as in a tree) no predecessors; in cases where v̇ is itself a member of some Ṙⁱ(v̇), the root has predecessors as well.

Now we define the equivalence of arbitrary labeled graphs:

Definition 3.3 Let Ġ = (V̇, Ė) and Ġ′ = (V̇′, Ė′) be (possibly infinite) graphs with arbitrary labeling functions λ₁, ..., λₖ resp. λ′₁, ..., λ′ₖ mapping V̇ resp. V̇′ into an arbitrary free monoid Γ*, and μ₁, ..., μₗ resp. μ′₁, ..., μ′ₗ arbitrary labeling functions from Ė resp. Ė′ into Γ*. Then

Ġ ≡ Ġ′ if and only if ∀ h ∈ ℕ : Ġ ≡ₕ Ġ′

where ≡ₕ is the finite-depth equivalence defined by

(1) Ġ ≡₀ Ġ′ if and only if λᵢ(ṙ(Ġ)) = λ′ᵢ(ṙ(Ġ′)) for all i
(2) Ġ ≡ₕ Ġ′ (h > 0) if and only if Ġ ≡₀ Ġ′ and
    ∀ ė = (ṙ(Ġ), v̇) ∈ Ė ∃ ė′ = (ṙ(Ġ′), v̇′) ∈ Ė′ : μⱼ(ė) = μ′ⱼ(ė′) for all j, and Ṡ(v̇) ≡ₕ₋₁ Ṡ(v̇′)
    ∀ ė′ = (ṙ(Ġ′), v̇′) ∈ Ė′ ∃ ė = (ṙ(Ġ), v̇) ∈ Ė : μ′ⱼ(ė′) = μⱼ(ė) for all j, and Ṡ(v̇′) ≡ₕ₋₁ Ṡ(v̇)

Note that this is not a structural equivalence; all we demand is that every edge of Ġ has an equivalent edge in Ġ′ leading to a node with the same labels and an equivalent subtree, and vice versa. As we shall see later, there are cases in which the nodes have no 1:1 correspondence but the graphs are nevertheless equivalent.

Definition 3.4 Let Ġ be a graph labeled as in Definition 3.3. We extend the definition of all labeling functions λᵢ and μⱼ from nodes resp. edges to paths as follows, yielding sequences of strings: Let â = (v̇₁, ..., v̇ₖ) be an arbitrary path in Ġ. Then

λᵢ(â) := (λᵢ(v̇₁), ..., λᵢ(v̇ₖ)),  μⱼ(â) := (μⱼ((v̇₁, v̇₂)), ..., μⱼ((v̇ₖ₋₁, v̇ₖ)))

Now we present a simple lemma which justifies the intuitive notion of the equivalence ≡ by the observation that every labeled path in Ġ has a correspondingly labeled path in Ġ′ and vice versa (though the underlying paths may be different):

Lemma 3.5 Let Ġ and Ġ′ be labeled graphs with Ġ ≡ Ġ′. For every v̇ ∈ V̇ there exists a v̇′ ∈ V̇′ such that

λᵢ((ṙ(Ġ), ..., v̇)) = λ′ᵢ((ṙ(Ġ′), ..., v̇′)),  μⱼ((ṙ(Ġ), ..., v̇)) = μ′ⱼ((ṙ(Ġ′), ..., v̇′))

Now we arrive at the main lemma of this paper, showing equivalences within the CTPL:

Lemma 3.6 Let G be a grammar and Ġ = (V̇, Ė) the corresponding CTPL. Then

∀ v̇, ẇ ∈ V̇ with μ(v̇) = μ(ẇ), σ(v̇) = σ(ẇ) : Ṡ(v̇) ≡ Ṡ(ẇ)

Now we have the tools for compressing the CTPL to a graph. However, if we applied this lemma to the full CTPL, we would compress it to a finite graph reflecting all possible left-derivations as a static structure, which is not what we want.
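The finite-depth equivalence of Definition 3.3 can be checked mechanically. Here is a minimal sketch for graphs with one node labeling and one edge labeling (the dict-based graph encoding and the example graphs are our assumptions, added for illustration):

```python
# Sketch: the depth-bounded equivalence of Definition 3.3.
# A graph is (root, nodelabel, edges), where edges[v] is a list of
# (edge_label, target) pairs and nodelabel[v] is the node's label.

def equiv_h(G, Gp, h):
    r, lab, E = G
    rp, labp, Ep = Gp
    if lab[r] != labp[rp]:
        return False              # condition (1): root labels must agree
    if h == 0:
        return True
    def covered(out, outp, lab1, E1, lab2, E2):
        # every edge on one side has an equally labeled edge on the
        # other side whose subgraph is equivalent to depth h-1
        return all(any(m == mp and
                       equiv_h((v, lab1, E1), (vp, lab2, E2), h - 1)
                       for (mp, vp) in outp)
                   for (m, v) in out)
    return (covered(E.get(r, []), Ep.get(rp, []), lab, E, labp, Ep) and
            covered(Ep.get(rp, []), E.get(r, []), labp, Ep, lab, E))

# Equivalent but not isomorphic: a self-loop vs. a two-node cycle.
G  = (1, {1: 'A'}, {1: [('x', 1)]})
Gp = (1, {1: 'A', 2: 'A'}, {1: [('x', 2)], 2: [('x', 1)]})
```

The two example graphs are equivalent for every depth h although they have different numbers of nodes, illustrating the promised case without a 1:1 node correspondence.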
When concerned with parsing, we prune all branches of the CTPL not matching the input, so we cannot apply this lemma to parts of the pruned CTPL: different subtrees beginning with the same labeling may derive to different parts of the input and may therefore be pruned in different ways. First, we have to state a model of how the pruning of the CTPL is done. In this paper, we only consider the case of online parsing from left to right. In this model, we have sequential access to the input symbols aᵢ for i from 1 to n. After accessing each symbol, we update our CGPL in such a way that it reflects the known parts of the CTPL. The process of updating the CGPL can be described as a sequence of graphs Ġᵢ = (V̇ᵢ, Ėᵢ) where

∀ v̇ ∈ V̇ᵢ : π̂(v̇) = a₁ ... aⱼ with j ≤ i.

Those nodes where j = i are called cursor nodes, denoted by Ċᵢ. This means keeping in the graph only those parts of the CTPL in which each node has a total prefix matching a prefix a₁ ... aⱼ of the current input; in other words, the cursors at step i are exactly the nodes having total prefix a₁ ... aᵢ. What could currently be derived from them are the unpruned subtrees of the CTPL, for which we would need additional input symbols in order to prune them as well. So it suffices simply to leave these subtrees out of the graph: they can be completed when reading the next input symbol, so that only their matching parts will be added.

At first glance, this situation seems strange: for the known parts of our graph, Lemma 3.6 cannot be applied in general, in contrast to the unknown parts. However, just this is the key for compressing the pruned CTPL to a graph with polynomial space complexity: all cursors having the same μ and σ labels are equivalent by Lemma 3.6, so they can be melted into one single node.
In subsequent input steps, the pruning of the different original subtrees is done in parallel; this means that after pruning has been consistently done, the different subtrees remain equivalent, so we are entitled to melt them. Now we are ready to state the main theorem of this paper, showing that all identically labeled cursor nodes produced in one input step can be melted.

Theorem 3.7 Let G be a grammar, Ġ_T the corresponding CTPL, and let x = a₁ ... aₙ be an input word. Let {Ġᵢ} be an arbitrary sequence of graphs for i = 0, ..., n with the property Ġᵢ ≡ (V̇_Tᵢ, Ė_Tᵢ), where (V̇_Tᵢ, Ė_Tᵢ) is the subtree of Ġ_T containing exactly all nodes with

∀ v̇ ∈ V̇_Tᵢ : π̂(v̇) = a₁ ... aⱼ, 0 ≤ j ≤ i.[2]

Then for every Ġᵢ there exists an equivalent graph Ġ′ᵢ ≡ Ġᵢ such that

∀ v̇, ẇ ∈ Ċᵢ(Ġ′ᵢ), v̇ ≠ ẇ : μ(v̇) ≠ μ(ẇ) ∨ σ(v̇) ≠ σ(ẇ)

In other words, the graph Ġ′ᵢ is minimal with respect to the number of cursor nodes. From this theorem, we can directly construct a parser for L(G). In step 0, we start with Ġ′₀ containing only the root with infix label S. In step i we have a graph Ġ′ᵢ, and step i+1 begins with reading the input symbol aᵢ₊₁. We produce Ġᵢ₊₁ from Ġ′ᵢ by appending the additional cursor nodes Ċᵢ₊₁ using Definition 2.2 without rule (3), and finally produce Ġ′ᵢ₊₁ by melting together all cursor nodes having the same labels, using Lemma 3.6.

We can see at once that in every input step i the graph Ġ′ᵢ has at most k cursor nodes, where k is a constant depending on the grammar G. The difference between Ġ′ᵢ and Ġ′ᵢ₊₁ lies only in the added cursor nodes, so we can use a single growing graph for our algorithm, and the total number of nodes is bounded by O(n). Since Ė ⊆ V̇ × V̇, each node is connected with at most O(n) other nodes, and therefore the total space complexity is only O(n²).
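The melting step amounts to keying cursor nodes by their (infix, suffix) label pair. A minimal sketch (the dict encoding of nodes and predecessor sets is a hypothetical simplification of ours):

```python
# Sketch: melting cursor nodes with identical (infix, suffix) labels.
# 'cursors' maps node ids to (infix, suffix) labels; 'parents' maps a
# node id to the set of its predecessor node ids.

def melt(cursors, parents):
    merged, representative = {}, {}
    for node, (infix, suffix) in cursors.items():
        key = (infix, suffix)
        if key not in representative:
            representative[key] = node          # first node with this label
            merged[node] = (infix, suffix)      # ... survives the melt
        else:
            # redirect the melted node's incoming edges to the survivor
            parents[representative[key]] |= parents.pop(node)
    return merged, parents

cursors = {1: ('S', 'b'), 2: ('S', 'b'), 3: ('A', '')}
parents = {1: {10}, 2: {11}, 3: {12}}
merged, parents = melt(cursors, parents)
```

After melting, the number of surviving cursors is bounded by the number of distinct label pairs, i.e. by the grammar-dependent constant k from the text.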
By construction, this growing graph is equivalent to (the matching parts of) the CTPL and therefore contains the same information; at the least, it contains the information of all successful parses when the algorithm terminates after reading the last input symbol.

[2] For j = 0, we define a₁ ... aⱼ := ε as a straightforward extension.

4 A Recognizer Algorithm

We present a recognizer based on Theorem 3.7 which is as simple as possible. The simplifications are:

- The prefix labels contain redundant information, for the prefix always matches the input symbols already processed. So we leave the prefix labels out.
- The labeling functions μ and σ are concatenated into one labeling function χ with χ(v̇) := μ(v̇)σ(v̇), and we relax rule (3) of Definition 2.2. When χ(v̇) begins with a nonterminal, this nonterminal denotes the infix; in all other cases the infix is empty.
- We introduce a new terminal symbol ⊣ as an end-of-file marker and accordingly augment the grammar by a new production S′ → S⊣, without changing the generated language, using S′ as the new start symbol. This prevents the root from appearing in a cycle.

The set of all cursor nodes at step i is denoted by Ċᵢ. We indicate the structure of the algorithm by indentation.

Algorithm 4.1 Let G = (Σ, N, P, S) be a grammar and x = a₁...aₙ⊣ an input word. Let G′ := (Σ ∪ {⊣}, N ∪ {S′}, P′, S′) be the modified grammar with new start symbol S′ and P′ = P ∪ {S′ → S⊣}. The CGPL Ġ is initialized with the root ṡ labeled χ(ṡ) := S′ and an edge (ṡ, ṫ) with χ(ṫ) := S⊣. Initialize Ċ₀ := {ṫ}.

Program MAIN
FOR i := 1, ..., n+1 DO:
  1) ∀ cursors ẇ ∈ Ċᵢ₋₁ which have not yet been processed:
     a) IF χ(ẇ) = Xδ, where X ∈ N:
          remove ẇ from Ċᵢ₋₁
          ∀ productions X → γ ∈ P:
            IF γ = ε: MAKEBROTHER(ẇ)
            ELSE GENERATE(ẇ, γ)
     b) IF χ(ẇ) = ε:
          ∀ fathers v̇ of ẇ: MAKEBROTHER(v̇)
  2) ∀ cursors ẇ ∈ Ċᵢ₋₁ with χ(ẇ) = aδ:
       IF a = aᵢ:  /* input matches */
         replace χ(ẇ) := δ
         add ẇ to Ċᵢ  /* let only matching nodes survive */
  3) IF Ċᵢ = ∅: reject the input word and stop.
  4) IF aᵢ = ⊣: accept the input word and stop.

Procedure MAKEBROTHER(ẇ):
  /* Apply rule (4) of Definition 2.2 for those nodes where this has not been done yet. */
  IF ẇ has not been considered in step i for this test:
    ∀ fathers v̇ of ẇ:
      IF σ(ẇ) = ε: MAKEBROTHER(v̇)  /* don't generate ε-brothers */
      ELSE GENERATE(v̇, σ(ẇ))  /* generate a brother of ẇ */

Procedure GENERATE(v̇, α):
  /* Generate a new node with label α and connect it to the father v̇. */
  /* If such a node was already generated in step i, use it instead. */
  IF ∄ ẇ generated in step i with χ(ẇ) = α:
    generate a new node ẇ, χ(ẇ) := α
    add ẇ to Ċᵢ₋₁
  ELSIF MAKEBROTHER(ẇ) had been completed previously in step i:  /* make up leeway */
    let α = Xδ for the following section:
    IF δ = ε: MAKEBROTHER(v̇)  /* don't generate ε-brothers */
    ELSE GENERATE(v̇, δ)  /* generate/connect also to the brother of ẇ */
  add a new edge (v̇, ẇ) to Ė

Explanations:

- The two steps of Theorem 3.7, generating Ġᵢ₊₁ and therefrom Ġ′ᵢ₊₁, are combined. Procedure GENERATE automatically unifies all nodes which would be generated in Ġᵢ₊₁ in one input step, so we directly produce Ġ′ᵢ₊₁.
- GENERATE never produces an ε-node, because it is always called with a non-empty argument α. Instead, rule (4) of Definition 2.2 is applied at once whenever a call with ε would be possible.
- The quantifier ∀ denotes a loop with nondeterministic choice of the quantified elements. In case 1) of MAIN, we can select any element from the cursor set until all elements have been processed. Note that while an element is processed, new members may be added to the cursor set by GENERATE.
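For intuition only, the control flow of MAIN can be imitated by a much-simplified recognizer that keeps the cursor remainders in a plain set instead of a graph. This sketch (ours, not the algorithm above) loses the CGPL compression entirely: it may blow up exponentially and it diverges on left-recursive grammars, which is precisely what the graph structure avoids.

```python
# Sketch of the pruning model only (NOT the CGPL algorithm): cursor
# "states" are remainders of sentential forms whose terminal prefix
# matches the input read so far. Assumes a grammar WITHOUT left
# recursion; '$' stands in for the end-of-file marker of Section 4.

END = '$'

def recognize(productions, start, word):
    word = word + END
    cursors = {(start, END)}          # the initial cursor, label S followed by the marker
    for a in word:
        # step 1): expand leading nonterminals until every remainder
        # starts with a terminal (a seen-set guards against unit cycles)
        work, seen, terminal_led = list(cursors), set(cursors), set()
        while work:
            rem = work.pop()
            if rem and rem[0].isupper():          # leading nonterminal
                for lhs, gamma in productions:
                    if lhs == rem[0]:
                        new = tuple(gamma) + rem[1:]
                        if new not in seen:
                            seen.add(new)
                            work.append(new)
            elif rem:
                terminal_led.add(rem)
        # step 2): let only matching cursors survive
        cursors = {rem[1:] for rem in terminal_led if rem[0] == a}
        if not cursors:
            return False                          # step 3): reject
    return True                                   # marker consumed: accept

# toy grammar a^n b^n (our example, not from the paper)
P = [('S', ('a', 'S', 'b')), ('S', ('a', 'b'))]
```

The step numbers in the comments refer to MAIN; acceptance simply means that the marker step did not empty the cursor set.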
- MAKEBROTHER will "generate" a brother of a node, or call itself recursively in the case of ε-brothers, only if this has not already been done in the same input step. In this way, cycles in the graph and/or multiple copies caused by different paths to the same node do not matter.[3] Whenever a brother of some ẇ has to be "generated", it has to be connected to all of ẇ's fathers, which is done automatically by the unification in GENERATE. This results in (at most) one brother connected to all fathers. However, in some cases not all fathers are known to MAKEBROTHER, because new fathers may be added later. If new fathers are later added to such a node ẇ via GENERATE, we have to make up this leeway by doing the same things as had been done previously in MAKEBROTHER.
- All really new generated nodes are added to the set of cursor nodes Ċᵢ₋₁, without regard to whether their total prefix really matches all input symbols read so far. When the generated γ's begin with another nonterminal, they are treated again and will therefore be removed from Ċᵢ₋₁. At the end of step 1) of MAIN, all nodes in Ċᵢ₋₁ begin with a terminal. The subsequent step 2) then takes over to Ċᵢ only the matching nodes.

5 Complexity Considerations

By construction, algorithm 4.1 accepts any context-free grammar[4] G. First we estimate the space and time complexities in the worst case, when the grammar is unrestricted: In the worst case, procedure GENERATE can produce at most c different nodes in any input step, where c is a constant bounded by the grammar size. So the total number of nodes is O(n), and the total space complexity is O(n²) for the edges, assuming that nodes may have O(n) predecessors.

(a) Seeing the general-case time complexity at once is a little more difficult. First, we leave out the ELSIF part of GENERATE, which complicates things. Once we have shown the complexity without it, we will find that its presence does not change anything.
We can state the following:

- The inner complexity of GENERATE (leaving out the ELSIF part) is bounded by a constant, for it has to compare α with at most c nodes generated in the same input step. In a smart implementation, this comparison may be achieved by a simple table lookup.
- The sets Ċᵢ₋₁ never contain more than 2c elements, because at most c new elements may be added by GENERATE to at most c old members.
- Step 1) of MAIN may execute the subcases a) and b) at most 2c times in every pass of the i-loop. The inner complexity of case b) is O(n), yielding an overall time complexity of O(n²) caused by step 1).
- Regardless of the number of times MAKEBROTHER is called, it performs its inner body only for at most O(n) different nodes in one input step. When the node has been tested before, the inner complexity is bounded by a constant. Otherwise, when the inner body is executed, its inner complexity is O(n), because ẇ has at most O(n) fathers. So the complexity caused by MAKEBROTHER in one input step is O(n²), in spite of the recursion; the trick is that the test prevents cycles and multiple visits.
- Thus follows the overall time complexity of O(n³).

[3] In fact, this is basically a transitive closure algorithm.
[4] There are no restrictions on the grammar; it need not be reduced, may have ε-productions, and may contain useless and unit productions, or cycles X ⇒⁺ X.

Now we have to drop our assumption that the ELSIF part of GENERATE is left out. The only purpose of the ELSIF part is to do the same things which would have been done by MAKEBROTHER if all fathers had been present from the beginning. So our argument also holds when the ELSIF part is taken into account.

(b) Now we are interested in the time complexities when the grammar is restricted in some way. First we investigate the case of grammars of bounded ambiguity. A grammar is said to be of bounded ambiguity k when all possible derivations lead to at most k different derivation trees.
The following arguments show that the overall complexity is O(n²) in this case: By the definition of bounded ambiguity, every node v̇ in the CTPL has at most k successors which will "survive" until rule (4) of Definition 2.2 is applied to v̇ for the last time. Note that as long as the subgraph Ṡ(v̇) is not completed, there may be more successors, but they are guaranteed to produce no more than k brothers of v̇ in at most k different input steps, because the sublanguage L(μ(v̇)) is also of bounded ambiguity k. Therefore the maximum number of brothers v̇ can get by rule (4) is k. As a consequence, the total number of successors of each node is bounded by some constant k′. This is because successors are generated either in step 1)a) of MAIN (at most |P|) or by MAKEBROTHER, which will be called at most k times for each differently labeled brother,[5] so the maximum number of successors depends only on the grammar. It follows that the total number of edges in our graph is now O(n). Each time MAKEBROTHER examines all fathers of ẇ, it goes through the incoming edges of ẇ. In one input step, any edge can be touched only once by MAKEBROTHER (or substitutively by GENERATE), so MAKEBROTHER has complexity O(n) in this input step, resulting in an overall complexity of the algorithm of O(n²).

(c) Finally, we investigate the case where the grammar is LR(k). By the definition of the LR(k) property, multiple branches occurring in the CTPL will lead to a dead end after at most k input steps. Now we argue in a simple way: Let v̇ be in Ċᵢ₋₁; then for i < j < i+k there may occur multiple descendants of v̇ in Ċⱼ₋₁, but in Ċᵢ₊ₖ₋₁ there will remain at most one descendant of v̇. So the number of brothers produced by MAKEBROTHER in each input step is bounded by some constant k′, for each brother generated at a certain input step j must be a descendant of a bounded number of cursors of previous cursor sets.
As can easily be seen, MAKEBROTHER now has constant bounded complexity in each input step; the same is true for step 1) of MAIN and for all other parts of the algorithm. So we have an overall time complexity of O(n).

As a result of this section, we find that the worst-case asymptotic time complexity of our algorithm is as good as Earley's. We are currently investigating the practical aspects of our algorithm, to see the behaviour of the constant factors. Also, the nondeterministic choices in step 1) of MAIN give the opportunity to use several heuristic selection strategies. We recommend a queue structure for the Ċᵢ₋₁, where elements are removed at the front and added at the end. However, we currently have no idea whether other strategies (like sorting by number of successors, etc.) will give some gains, at least with certain grammars. This may be a wide research area.

6 Possible Modifications, Conclusion

Our algorithm could be classified as an efficient general directional top-down method with maximally restricted breadth-first search. There are several ways to implement it; one has been carried out by the author. However, there are more interesting points regarding possible modifications of the algorithm itself. Several techniques known from improvements of Earley's algorithm (see, e.g., [GHR80], [GJ90], [Lei90]) could be applied to our algorithm in a modified form. For example, we could introduce lookahead in several ways. A simple way would be to compute FIRST and FOLLOW sets in order to restrict case 1)a) of MAIN to those nodes matching the lookahead. Similarly, the reduction in case 2) could be restricted by lookahead techniques. Other kinds of precomputation could be useful, too. On the lowest level, it would be interesting to compare the running time of our algorithm with the other known general context-free parsers. This seems to be very difficult, for there are no commonly defined micro-operations allowing a hardware-independent comparison.
Earley's algorithm has been adapted to many grammar classes used in the field of natural language processing. So it is interesting whether adapting our algorithm to, say, ID/LP grammars or various unification methods would give any gains. Also, it can easily be seen that there are efficient bit-vector implementations: just view the graph as an adjacency matrix, then operate on sets of edges by representing them as bit vectors. We expect this to increase speed on real machines and real problems considerably.

[5] However, each generated brother may produce new brothers in turn, but they have shorter labels. This process will terminate after at most c′ generations of brothers, where c′ is the maximum number of nonterminals on the right-hand side of the productions.

The interrelations to Earley's algorithm and its varieties will be explained in a separate paper. To our current knowledge, there is an argument that gives evidence that our algorithm is at least as good as Earley's: just partition Earley's state sets Sᵢ into groups where all group members have the same dotted rule X → α.β (but different back-pointers). Then each group roughly corresponds to a node in our algorithm, labeled with β. The main difference is that Earley's pointers point to state sets, whereas our edges point to "states", i.e. to such "groups". At first glance, Earley's structure seems to be less expensive than ours, because one particular Earley pointer pointing to a state set may address multiple states, while we need an extra edge for each of them. However, Earley's completer has to do exhaustive search to find all of these nodes within the addressed state set, while our algorithm simply follows an edge backwards. Another difference to Earley's algorithm is that it can only unify dotted rules resulting from the same production, while our algorithm takes only the remainder behind the dot into account for melting nodes.
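The grouping argument can be sketched as follows (our illustration; the item representation is hypothetical): Earley items are keyed by the remainder behind the dot, so items from different productions, and even with different left-hand sides, can share a group.

```python
# Sketch: grouping Earley items by the remainder behind the dot.
# An item is (lhs, rhs, dot, back); a group collects all items of a
# state set sharing rhs[dot:], regardless of lhs and of rhs[:dot].

from collections import defaultdict

def group_items(state_set):
    groups = defaultdict(list)
    for (lhs, rhs, dot, back) in state_set:
        groups[rhs[dot:]].append((lhs, rhs, dot, back))
    return dict(groups)

# Items from different productions with the same remainder ('B','c'):
state_set = [('X', ('a', 'B', 'c'), 1, 0),
             ('Y', ('d', 'B', 'c'), 1, 2),
             ('X', ('a', 'B', 'c'), 2, 0)]
groups = group_items(state_set)
```

The third item, with a different dot position, lands in its own group, while the first two are unified although they stem from different productions with different left-hand sides.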
Conversely, we have thereby discovered an improvement of Earley's algorithm: just view states ⟨X → α.β, j⟩ and ⟨X → α′.β, j⟩ as equal even if α ≠ α′. Our algorithm would, moreover, allow unifying ⟨X → α.β, j⟩ and ⟨Y → α′.β, j⟩ where X ≠ Y, speaking analogously in Earley's terms.

The interrelations to Tomita's algorithms are not clear at present. When applying graph-structured stacks [To88] to a top-down algorithm (e.g. applying graph structuring to the LL(1) method), we get an algorithm which seems similar to ours. The main difference will be that the labels are symbols rather than strings. A similar behaviour could be obtained in our algorithm when the grammar is transformed to a special left-recursion-free Chomsky normal form in which some redundant information is thrown out by sharing common suffixes of the original grammar through the same chains of rules. But we feel our algorithm will be more practical than this.

References

[BPS75] Bouckaert, M., Pirotte, A., Snelling, M.: Efficient Parsing Algorithms for General Context-free Parsers. Inf. Sciences, vol. 8, no. 1, p. 1-26 (Jan. 1975)
[Ea70] Earley, J.: An Efficient Context-Free Parsing Algorithm. Comm. ACM 13, 2 (Feb. 1970), 94-102
[GJ90] Grune, D., Jacobs, C. J. H.: Parsing Techniques. A Practical Guide. Ellis Horwood 1990, ISBN 0-13-651431-6
[GHR80] Graham, S. L., Harrison, M. L., Ruzzo, W. L.: An Improved Context-Free Recognizer. ACM Trans. Prog. Lang. Syst. 2, 3 (1980), 415-462
[HU79] Hopcroft, J. E., Ullman, J. D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley 1979, ISBN 0-201-02988-X
[KA89] Kruseman Aretz, F. E. J.: A new approach to Earley's parsing algorithm. Sci. Comput. Program., vol. 12, no. 2, p. 105-121 (July 1989)
[Lei90] Leiss, H.: On Kilbury's Modification of Earley's Algorithm. ACM Trans. Prog. Lang. Syst., vol. 12, no. 4, p. 610-640 (Oct. 1990)
[To86] Tomita, M.: Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston, 1986
[To88] Tomita, M.: Graph-structured stack and natural language parsing. Proc. 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, NY, June 1988, p. 249-257