Universität Stuttgart
Fakultät Informatik

A New Parsing Strategy for Context-Free Grammars

Thomas Schöbel

Computer Science Report No. 7/91
May 1991

1st printing May 1991. All rights reserved.
Dekanat der Fakultät Informatik, Universität Stuttgart
Breitwiesenstr. 20-22, D-7000 Stuttgart 80, Germany

A New Parsing Strategy for Context-Free Grammars
Extended abstract from a paper currently in preparation
Thomas Schöbel
2 May 1991

Abstract

We present a family of new top-down parsing algorithms for arbitrary context-free grammars, based on the notion of a "tree of possible left-derivations". When coded in a special way, this tree can be compressed to a graph. While the time complexity on uniform RAMs is O(n³) in the worst case, we need only O(n²) space. We show that the asymptotic time complexities are equal to those of Earley's algorithm, and give some arguments indicating that our algorithm will be better in almost all cases.

1 Introduction

Since Cocke, Younger and Kasami (CYK) presented the first polynomial-time algorithms for the recognition of context-free languages, much research has been done in the field of recognition and parsing. Among the various parsers, Earley's famous algorithm [Ea70] is the most commonly used. Graham, Harrison and Ruzzo [GHR80] have shown that the CYK and Earley methods are "almost" the same, both being based on an upper triangular matrix principally containing triples as elements. Another approach is Tomita's method [To86], which uses a graph for compressing multiple stacks into a smaller structure.
In contrast to our proposed algorithm, he takes the various bottom-up algorithms (SLR, LR(0), LR(k), LALR, ...) as the basis for compressing the exponentially growing computation tree. Our method is founded on a theoretical basis from which we may derive multiple algorithms, so we think the primary contribution of this paper to the field will be another view of parsing. At the current state, however, we also feel that there are gains compared to Earley's algorithm and the many improvements of it which can be found in the literature (e.g. [BPS75], [GHR80], [KA89], [Lei90]). We are currently working on a detailed discussion of the related topics.

2 Basic Definitions, TPL and CTPL

We will use a notation similar to that of [HU79]. Kleene closure will be denoted by an asterisk *. A context-free grammar G (or simply grammar) is defined as G = (Σ, N, P, S), where Σ is the terminal alphabet, N the nonterminal alphabet, Σ ∩ N = ∅, V := Σ ∪ N the alphabet, P ⊆ N × V* the set of productions, and S ∈ N the start symbol. Terminals will be denoted by a, b, c, ..., nonterminals by A, B, C, ..., strings from V* by α, β, γ, ..., the empty string is ε, and productions will be written X → γ.

We introduce the notion of a left-derivation in terms of relations: Let G be a grammar; then the relation ⇒_L on V* is defined by

α ⇒_L β if and only if α = α₁Xα₂, β = α₁γα₂, X → γ ∈ P, α₁ ∈ Σ*, α₂ ∈ V*.

A left-derivation D_L is a (finite or infinite) sequence of strings D_L(α) := (α₀, α₁, α₂, ...) where αᵢ ⇒_L αᵢ₊₁ and α₀ = α. This definition directly leads to the notion of a "tree of possible left-derivations": we just write the left-derivation relation as a tree.

Definition 2.1 Let G be a grammar. The Tree of Possible Left-Derivations (TPL) of G is a (possibly infinite) tree Ġ = (V̇, Ė) with root ṙ(Ġ) := ṡ ∈ V̇, where V̇ is the set of nodes, labeled with strings by λ : V̇ → V*, λ(ṡ) = S, and Ė ⊆ V̇ × V̇ is the set of edges, such that

∀ v̇ ∈ V̇ ∀ α ∈ V* with λ(v̇) ⇒_L α : ∃ ẇ ∈ V̇, (v̇, ẇ) ∈ Ė : λ(ẇ) = α
∀ (v̇, ẇ₁), (v̇, ẇ₂) ∈ Ė : ẇ₁ ≠ ẇ₂ ⟹ λ(ẇ₁) ≠ λ(ẇ₂)

Note that we always write objects belonging to a tree or directed graph with a dot on them, to distinguish V̇ (the set of vertices) from V (the alphabet). Also, we regard a tree as a specialized directed graph that is acyclic and fully connected, with every node having at most one predecessor. The root of a tree has no predecessors.

Intuitively, the TPL shows all possible left-derivations, starting from the root ṡ labeled with S. Siblings always represent different alternatives of derivation. In general, the TPL will be infinite in depth and grow exponentially in breadth. It is clear that the concept of the TPL will not directly serve for efficient recognition or parsing, because of its excessive consumption of resources. Therefore, we introduce a more efficient coding of the TPL, called the CTPL (Compressed Tree of Possible Left-Derivations).

Every label α of the TPL is partitioned as follows: α = xXδ, where x ∈ Σ* is called the (terminal) prefix, X ∈ N ∪ {ε} the (nonterminal) infix, and δ ∈ V* the suffix. With the additional requirement that X = ε ⟹ δ = ε, this partitioning is unique.

Let us look at a derivation step in the TPL starting at v̇ ∈ V̇ with label λ(v̇) = α = xXδ: Let X → γ be a production; then xXδ ⇒_L xγδ, where xγδ labels a son of v̇. Derivation takes place only at the X; the other components (x and δ) remain unchanged in a derivation step (a property of context-freeness). The basic idea of the CTPL is to register only the modifications along a path in the TPL. This is accomplished by repeatedly partitioning γ into components x′X′δ′ by use of three different labeling functions. The same is then done with the direct derivation of X′, and so on. As long as X ≠ ε, this results in a sequence of derivation steps where only the labeling differences from the predecessor need to be registered.
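To make the TPL concrete, the following sketch (in Python; the toy grammar S → aSb | ab and the uppercase-nonterminal convention are our illustrative assumptions, not part of the paper) enumerates the levels of the tree by expanding the leftmost nonterminal in all possible ways:

```python
# Sketch: enumerate the TPL of a toy grammar level by level.
# Sentential forms are tuples of symbols; single uppercase letters
# stand for nonterminals (an encoding convention of this sketch).

def is_nonterminal(sym):
    return sym.isupper()

def left_derive(alpha, productions):
    """All beta with alpha =>_L beta: rewrite the leftmost nonterminal."""
    for i, sym in enumerate(alpha):
        if is_nonterminal(sym):
            # alpha = alpha1 X alpha2 with alpha1 containing only terminals
            return [alpha[:i] + gamma + alpha[i + 1:]
                    for (lhs, gamma) in productions if lhs == sym]
    return []  # a sentence: a leaf of the TPL, no further derivation

# Toy grammar: S -> aSb | ab   (generates a^n b^n, n >= 1)
P = [('S', ('a', 'S', 'b')), ('S', ('a', 'b'))]

level = [('S',)]          # root of the TPL, labeled S
for depth in range(3):    # the TPL is infinite; cut it off at depth 3
    level = [beta for alpha in level for beta in left_derive(alpha, P)]
```

Each iteration produces one generation of siblings; fully derived sentences drop out of the frontier as leaves. For this unambiguous toy grammar the breadth stays small, but for ambiguous grammars it grows exponentially, as noted above.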
For the CTPL, we need a prefix labeling function π assigned to the edges of the tree, registering the local differences of the terminal prefix x. The infix labeling function μ contains the nonterminal X, and the suffix labeling function σ contains the rest δ; both are assigned to the nodes of the tree. As we shall see, every node v̇ in the TPL has a corresponding node v̇′ in the CTPL. To get the full prefix of v̇, we concatenate all prefixes along the path in the CTPL from the root ṡ′ to v̇′. Similarly, the full suffix is obtained by concatenating all suffixes along the path upwards to the root. So the CTPL contains the same information as the TPL.

However, there are difficulties: when deriving a sentential form, a γ may be empty or may consist only of terminals. In other words, the existence of a nonterminal X, and therefore of a non-empty infix, is not guaranteed, and further derivations would be impossible. In this case, we have to perform some transformations on the tree, producing an equivalent subtree for the "dead end". Normally, the suffixes δ of other γ's contain nonterminals, which have to be expanded, so it is clear that our tree transformation has to activate these nonterminals.

Definition 2.2 Let G be a grammar. A Compressed Tree of Possible Left-Derivations (CTPL) is a tree Ġ = (V̇, Ė) with root ṙ(Ġ) := ṡ and labeling functions

π : Ė → Σ*  (prefix labeling)
μ : V̇ → N ∪ {ε}  (infix labeling)
σ : V̇ → V*  (suffix labeling)

satisfying the following conditions:

(1) μ(ṡ) = S, σ(ṡ) = ε
(2) ∀ v̇ ∈ V̇ ∀ μ(v̇) → γ ∈ P ∃ (v̇, ẇ) ∈ Ė : γ = π((v̇, ẇ)) μ(ẇ) σ(ẇ)
(3) μ(v̇) = ε ⟹ σ(v̇) = ε
(4) (u̇, v̇), (v̇, ẇ) ∈ Ė ∧ μ(ẇ)σ(ẇ) = ε if and only if ∃ (u̇, ḃ) ∈ Ė : π((u̇, ḃ)) = π((u̇, v̇)) π((v̇, ẇ)) z, μ(ḃ) = X′, σ(ḃ) = δ′, where σ(v̇) = zX′δ′, z ∈ Σ*, X′ ∈ N ∪ {ε}, δ′ ∈ V*
(5) Ġ contains nothing else.
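The unique partitioning α = xXδ underlying these labeling functions can be sketched as follows (our illustration; strings are again tuples of symbols, with uppercase letters standing for nonterminals):

```python
# Sketch: unique partition alpha = x X delta, with x containing only
# terminals, X the leftmost nonterminal (or epsilon), delta the rest.
# The convention X = epsilon => delta = epsilon makes the split unique.

def partition(alpha):
    for i, sym in enumerate(alpha):
        if sym.isupper():                       # leftmost nonterminal found
            return alpha[:i], sym, alpha[i + 1:]  # (x, X, delta)
    return alpha, '', ()                        # X = epsilon => delta = epsilon
```

A derivation step rewrites only the X component while x and δ stay untouched, which is exactly the redundancy the CTPL factors out.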
Conditions (1) and (2) give an inductive definition of the directly derivable nodes in the CTPL. Condition (3) ensures that existing nonterminals will always appear in μ rather than in σ. Condition (4) is the tree transformation applied when some μ(ẇ) is ε: in this case, we generate a brother ḃ of the father v̇, which receives the suffix σ(v̇) for partitioning into three components. As we will see, ẇ and ḃ have the same meaning in denoting a derivation. Furthermore, note that ḃ may have an empty μ(ḃ) as well; then condition (4) generates another "uncle" denoting the same derivation.

To show the equivalence of TPL and CTPL, we introduce the notions of total prefixes and total suffixes.

Definition 2.3 Let Ġ be a CTPL with root ṙ(Ġ) := ṡ over a grammar G. Let â be a path (ṡ, v̇₁, v̇₂, ..., v̇ₖ, v̇) from the root ṡ to a node v̇. We define the following functions:

1. The total prefix π̂(â) of the path â is
   π̂(â) := π((ṡ, v̇₁)) π((v̇₁, v̇₂)) ... π((v̇ₖ₋₁, v̇ₖ)) π((v̇ₖ, v̇))
2. The total suffix σ̂(â) of the path â is
   σ̂(â) := σ(v̇) σ(v̇ₖ) ... σ(v̇₁) σ(ṡ)
3. The sentential form Φ̂(â) of the path â is
   Φ̂(â) := π̂(â) μ(v̇) σ̂(â)

When dealing with trees (rather than, later, with less restricted graphs), every node v̇ has a unique path â from the root ṡ to v̇. Therefore, we may write Φ̂(v̇) as a short form instead of explicitly giving the path â as the argument of Φ̂(â).

Lemma 2.4 Let Ġ be a CTPL with root ṙ(Ġ) := ṡ. Let u̇, v̇, ẇ, ḃ ∈ V̇ satisfy condition (4) of Definition 2.2. Then Φ̂(ẇ) = Φ̂(ḃ).

Proofs are omitted here.

Lemma 2.5 Let Ġ be a CTPL. The relation ≅ on V̇, defined by v̇ ≅ ẇ if and only if Φ̂(v̇) = Φ̂(ẇ), is an equivalence relation.

The proof is trivial, for ≅ is directly defined via the equivalence relation =. Now we are ready to show the equivalence of TPL and CTPL:

Theorem 2.6 Let G be a grammar, Ġ = (V̇, Ė) the CTPL and Ġ′ = (V̇′, Ė′) the TPL of G. Then

(1) ∀ v̇ ∈ V̇ ∃ v̇′ ∈ V̇′ : Φ̂(v̇) = λ(v̇′)
(2) ∀ v̇′ ∈ V̇′ ∃ v̇ ∈ V̇ : Φ̂(v̇) = λ(v̇′)
(3) ∀ (v̇′, ẇ′) ∈ Ė′ ∃ (v̇, ẇ) ∈ Ė : Φ̂(v̇) = λ(v̇′), Φ̂(ẇ) = λ(ẇ′)

Informally, we see that TPL and CTPL are equivalent in a certain sense. The CTPL may have more nodes because of its internal equivalence relation (Lemma 2.5), so strictly speaking only the equivalence classes of the CTPL are equivalent to the TPL.

3 The CGPL

Until now we were dealing with structures which, in the worst case, may grow exponentially in breadth. Furthermore, there may be infinite recursions in the CTPL when the grammar is left-recursive. Lastly, both structures generate all sentences of the language L(G), without bound. The last problem can be handled easily by a parsing algorithm: we just have to prune all branches whose total prefixes do not match the given input string x ∈ L(G). The former problems can be solved by turning the CTPL into a true directed graph called the CGPL, which allows melting nodes with common subtrees.

Some difficulties arise from dealing with labeled infinite trees. Hence, we need a definition of equivalence between infinite trees. Additionally, we want to compress trees to graphs, and therefore we need all the definitions for graphs as well, regarding trees as a special case. For convenience, we will also call the source (a single particular node) of a directed graph Ġ the root ṙ(Ġ).[1]

[1] Note that the root of a tree has no predecessors; in contrast, the root of a graph can have one. In our algorithm, the root of the main graph will never have predecessors, in contrast to subgraphs, where cycles may cause this.

Definition 3.1 Let Ġ = (V̇, Ė) be an arbitrary graph with root ṙ(Ġ) := ṡ. The set of successors Ṙ(v̇) of a node v̇ ∈ V̇ is defined as

Ṙ(v̇) := Ṙ¹(v̇) := {ẇ | (v̇, ẇ) ∈ Ė}

The extension of Ṙ to a set of nodes Ȧ is defined by

Ṙ(Ȧ) := Ṙ¹(Ȧ) := ⋃_{v̇∈Ȧ} Ṙ(v̇)

The set of i-reachable nodes Ṙⁱ(v̇) resp.
Ṙⁱ(Ȧ) is defined by

Ṙⁱ(v̇) := Ṙ(Ṙⁱ⁻¹(v̇)),  Ṙⁱ(Ȧ) := ⋃_{v̇∈Ȧ} Ṙⁱ(v̇)

The set of successors is the same as the set of 1-reachable nodes. Note also that if the graph has cycles, a node ẇ may occur infinitely often in the sequence {Ṙⁱ(v̇)}. Membership ẇ ∈ Ṙⁱ(v̇) means that there exists a path of length exactly i from v̇ to ẇ. We close the definition for i = 0 by defining Ṙ⁰(v̇) := {v̇}, and also for negative i, denoting predecessors, by Ṙ⁻¹(ẇ) := {v̇ | (v̇, ẇ) ∈ Ė} and analogously for i ∈ {−2, −3, ...}.

Definition 3.2 Let Ġ = (V̇, Ė) be a (possibly infinite) graph (tree). A subgraph (subtree) Ṡ(v̇) of a node v̇ ∈ V̇ is a graph (tree) Ṡ(v̇) := (V̇′, Ė′), where V̇′ ⊆ V̇, Ė′ ⊆ Ė, and both subsets are defined by

V̇′ := ⋃_{i∈ℕ} Ṙⁱ(v̇),  Ė′ := {(u̇, ẇ) ∈ Ė | u̇ ∈ V̇′},  ṙ((V̇′, Ė′)) := v̇

Intuitively, this definition of a subgraph encompasses all reachable nodes. Note that in Ṡ(v̇), v̇ is the new root of the subgraph. When the subgraph has no cycles, the root has (as in a tree) no predecessors; in cases where v̇ is itself a member of some Ṙⁱ(v̇), the root has predecessors as well.

Now we define the equivalence of arbitrary labeled graphs:

Definition 3.3 Let Ġ = (V̇, Ė) and Ġ′ = (V̇′, Ė′) be (possibly infinite) graphs with arbitrary labeling functions λ₁, ..., λₖ resp. λ′₁, ..., λ′ₖ mapping V̇ resp. V̇′ into an arbitrary free monoid Γ*, and μ₁, ..., μₗ resp. μ′₁, ..., μ′ₗ arbitrary labeling functions from Ė resp. Ė′ into Γ*. Then

Ġ ≡ Ġ′ if and only if ∀ h ∈ ℕ : Ġ ≡ₕ Ġ′

where ≡ₕ is the finite-depth equivalence defined by

(1) Ġ ≡₀ Ġ′ if and only if λᵢ(ṙ(Ġ)) = λ′ᵢ(ṙ(Ġ′)) for all i
(2) Ġ ≡ₕ Ġ′ (h > 0) if and only if Ġ ≡₀ Ġ′ and
    ∀ ė = (ṙ(Ġ), v̇) ∈ Ė ∃ ė′ = (ṙ(Ġ′), v̇′) ∈ Ė′ : μⱼ(ė) = μ′ⱼ(ė′) for all j, and Ṡ(v̇) ≡ₕ₋₁ Ṡ(v̇′)
    ∀ ė′ = (ṙ(Ġ′), v̇′) ∈ Ė′ ∃ ė = (ṙ(Ġ), v̇) ∈ Ė : μ′ⱼ(ė′) = μⱼ(ė) for all j, and Ṡ(v̇′) ≡ₕ₋₁ Ṡ(v̇)

Note that this is not a structural equivalence; all we demand is that every edge of Ġ has an equivalent edge in Ġ′ leading to a node with the same labels and an equivalent subtree, and vice versa. As we shall see later, there are cases in which the nodes have no 1:1 correspondence but the graphs are nevertheless equivalent.

Definition 3.4 Let Ġ be a graph labeled as in Definition 3.3. We extend the definition of all labeling functions λᵢ and μⱼ from nodes resp. edges to paths as follows, yielding sequences of strings: Let â = (v̇₁, ..., v̇ₖ) be an arbitrary path in Ġ. Then

λᵢ(â) := (λᵢ(v̇₁), ..., λᵢ(v̇ₖ)),  μⱼ(â) := (μⱼ((v̇₁, v̇₂)), ..., μⱼ((v̇ₖ₋₁, v̇ₖ)))

Now we present a simple lemma which justifies the intuitive notion of the equivalence ≡ by the observation that every labeled path in Ġ has a correspondingly labeled path in Ġ′ and vice versa (though the underlying paths may be different):

Lemma 3.5 Let Ġ and Ġ′ be labeled graphs with Ġ ≡ Ġ′. For every v̇ ∈ V̇ there exists a v̇′ ∈ V̇′ such that

λᵢ((ṙ(Ġ), ..., v̇)) = λ′ᵢ((ṙ(Ġ′), ..., v̇′)),  μⱼ((ṙ(Ġ), ..., v̇)) = μ′ⱼ((ṙ(Ġ′), ..., v̇′))

Now we arrive at the main lemma of this paper, showing equivalences within the CTPL:

Lemma 3.6 Let G be a grammar and Ġ = (V̇, Ė) the corresponding CTPL. Then

∀ v̇, ẇ ∈ V̇ with μ(v̇) = μ(ẇ), σ(v̇) = σ(ẇ) : Ṡ(v̇) ≡ Ṡ(ẇ)

Now we have the tools for compressing the CTPL to a graph. However, if we applied this lemma to the full CTPL, we would compress it to a finite graph reflecting all possible left-derivations as a static structure, which is not what we want.
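The finite-depth equivalence of Definition 3.3 can be checked mechanically. Here is a minimal sketch for graphs with one node labeling and one edge labeling (the dict-based graph encoding and the example graphs are our assumptions, added for illustration):

```python
# Sketch: the depth-bounded equivalence of Definition 3.3.
# A graph is (root, nodelabel, edges), where edges[v] is a list of
# (edge_label, target) pairs and nodelabel[v] is the node's label.

def equiv_h(G, Gp, h):
    r, lab, E = G
    rp, labp, Ep = Gp
    if lab[r] != labp[rp]:
        return False              # condition (1): root labels must agree
    if h == 0:
        return True
    def covered(out, outp, lab1, E1, lab2, E2):
        # every edge on one side has an equally labeled edge on the
        # other side whose subgraph is equivalent to depth h-1
        return all(any(m == mp and
                       equiv_h((v, lab1, E1), (vp, lab2, E2), h - 1)
                       for (mp, vp) in outp)
                   for (m, v) in out)
    return (covered(E.get(r, []), Ep.get(rp, []), lab, E, labp, Ep) and
            covered(Ep.get(rp, []), E.get(r, []), labp, Ep, lab, E))

# Equivalent but not isomorphic: a self-loop vs. a two-node cycle.
G  = (1, {1: 'A'}, {1: [('x', 1)]})
Gp = (1, {1: 'A', 2: 'A'}, {1: [('x', 2)], 2: [('x', 1)]})
```

The two example graphs are equivalent for every depth h although they have different numbers of nodes, illustrating the promised case without a 1:1 node correspondence.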
When concerned with parsing, we prune all branches of the CTPL not matching the input, so we cannot apply this lemma to parts of the pruned CTPL: different subtrees beginning with the same labeling may derive to different parts of the input and may therefore be pruned in different ways. First, we have to state a model of how the pruning of the CTPL is done. In this paper, we only consider the case of online parsing from left to right. In this model, we have sequential access to the input symbols aᵢ for i from 1 to n. After accessing each symbol, we update our CGPL in such a way that it reflects the known parts of the CTPL. The process of updating the CGPL can be described as a sequence of graphs Ġᵢ = (V̇ᵢ, Ėᵢ) where

∀ v̇ ∈ V̇ᵢ : π̂(v̇) = a₁ ... aⱼ with j ≤ i.

Those nodes where j = i are called cursor nodes, denoted by Ċᵢ. This means keeping in the graph only those parts of the CTPL in which each node has a total prefix matching a prefix a₁ ... aⱼ of the current input; in other words, the cursors at step i are exactly the nodes having total prefix a₁ ... aᵢ. What could currently be derived from them are the unpruned subtrees of the CTPL, for which we would need additional input symbols in order to prune them as well. So it suffices simply to leave these subtrees out of the graph: they can be completed when reading the next input symbol, so that only their matching parts will be added.

At first glance, this situation seems strange: for the known parts of our graph, Lemma 3.6 cannot be applied in general, in contrast to the unknown parts. However, just this is the key for compressing the pruned CTPL to a graph with polynomial space complexity: all cursors having the same μ and σ labels are equivalent by Lemma 3.6, so they can be melted into one single node.
In subsequent input steps, the pruning of the different original subtrees is done in parallel; this means that after pruning has been consistently done, the different subtrees remain equivalent, so we are entitled to melt them. Now we are ready to state the main theorem of this paper, showing that all identically labeled cursor nodes produced in one input step can be melted.

Theorem 3.7 Let G be a grammar, Ġ_T the corresponding CTPL, and let x = a₁ ... aₙ be an input word. Let {Ġᵢ} be an arbitrary sequence of graphs for i = 0, ..., n with the property Ġᵢ ≡ (V̇_Tᵢ, Ė_Tᵢ), where (V̇_Tᵢ, Ė_Tᵢ) is the subtree of Ġ_T containing exactly all nodes with

∀ v̇ ∈ V̇_Tᵢ : π̂(v̇) = a₁ ... aⱼ, 0 ≤ j ≤ i.[2]

Then for every Ġᵢ there exists an equivalent graph Ġ′ᵢ ≡ Ġᵢ such that

∀ v̇, ẇ ∈ Ċᵢ(Ġ′ᵢ), v̇ ≠ ẇ : μ(v̇) ≠ μ(ẇ) ∨ σ(v̇) ≠ σ(ẇ)

In other words, the graph Ġ′ᵢ is minimal with respect to the number of cursor nodes. From this theorem, we can directly construct a parser for L(G). In step 0, we start with Ġ′₀ containing only the root with infix label S. In step i we have a graph Ġ′ᵢ, and step i+1 begins with reading the input symbol aᵢ₊₁. We produce Ġᵢ₊₁ from Ġ′ᵢ by appending the additional cursor nodes Ċᵢ₊₁ using Definition 2.2 without rule (3), and finally produce Ġ′ᵢ₊₁ by melting together all cursor nodes having the same labels, using Lemma 3.6.

We can see at once that in every input step i the graph Ġ′ᵢ has at most k cursor nodes, where k is a constant depending on the grammar G. The difference between Ġ′ᵢ and Ġ′ᵢ₊₁ lies only in the added cursor nodes, so we can use a single growing graph for our algorithm, and the total number of nodes is bounded by O(n). Since Ė ⊆ V̇ × V̇, each node is connected with at most O(n) other nodes, and therefore the total space complexity is only O(n²).
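The melting step amounts to keying cursor nodes by their (infix, suffix) label pair. A minimal sketch (the dict encoding of nodes and predecessor sets is a hypothetical simplification of ours):

```python
# Sketch: melting cursor nodes with identical (infix, suffix) labels.
# 'cursors' maps node ids to (infix, suffix) labels; 'parents' maps a
# node id to the set of its predecessor node ids.

def melt(cursors, parents):
    merged, representative = {}, {}
    for node, (infix, suffix) in cursors.items():
        key = (infix, suffix)
        if key not in representative:
            representative[key] = node          # first node with this label
            merged[node] = (infix, suffix)      # ... survives the melt
        else:
            # redirect the melted node's incoming edges to the survivor
            parents[representative[key]] |= parents.pop(node)
    return merged, parents

cursors = {1: ('S', 'b'), 2: ('S', 'b'), 3: ('A', '')}
parents = {1: {10}, 2: {11}, 3: {12}}
merged, parents = melt(cursors, parents)
```

After melting, the number of surviving cursors is bounded by the number of distinct label pairs, i.e. by the grammar-dependent constant k from the text.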
By construction, this growing graph is equivalent to (the matching parts of) the CTPL and therefore contains the same information; at the least, it contains the information of all successful parses when the algorithm terminates after reading the last input symbol.

[2] For j = 0, we define a₁ ... aⱼ := ε as a straightforward extension.

4 A Recognizer Algorithm

We present a recognizer based on Theorem 3.7 which is as simple as possible. The simplifications are:

- The prefix labels contain redundant information, for the prefix always matches the input symbols already processed. So we leave the prefix labels out.
- The labeling functions μ and σ are concatenated into one labeling function χ with χ(v̇) := μ(v̇)σ(v̇), and we relax rule (3) of Definition 2.2. When χ(v̇) begins with a nonterminal, this nonterminal denotes the infix; in all other cases the infix is empty.
- We introduce a new terminal symbol ⊣ as an end-of-file marker and accordingly augment the grammar by a new production S′ → S⊣, without changing the generated language, using S′ as the new start symbol. This prevents the root from appearing in a cycle.

The set of all cursor nodes at step i is denoted by Ċᵢ. We indicate the structure of the algorithm by indentation.

Algorithm 4.1 Let G = (Σ, N, P, S) be a grammar and x = a₁...aₙ⊣ an input word. Let G′ := (Σ ∪ {⊣}, N ∪ {S′}, P′, S′) be the modified grammar with new start symbol S′ and P′ = P ∪ {S′ → S⊣}. The CGPL Ġ is initialized with the root ṡ labeled χ(ṡ) := S′ and an edge (ṡ, ṫ) with χ(ṫ) := S⊣. Initialize Ċ₀ := {ṫ}.

Program MAIN
FOR i := 1, ..., n+1 DO:
  1) ∀ cursors ẇ ∈ Ċᵢ₋₁ which have not yet been processed:
     a) IF χ(ẇ) = Xδ, where X ∈ N:
          remove ẇ from Ċᵢ₋₁
          ∀ productions X → γ ∈ P:
            IF γ = ε: MAKEBROTHER(ẇ)
            ELSE GENERATE(ẇ, γ)
     b) IF χ(ẇ) = ε:
          ∀ fathers v̇ of ẇ: MAKEBROTHER(v̇)
  2) ∀ cursors ẇ ∈ Ċᵢ₋₁ with χ(ẇ) = aδ:
       IF a = aᵢ:  /* input matches */
         replace χ(ẇ) := δ
         add ẇ to Ċᵢ  /* let only matching nodes survive */
  3) IF Ċᵢ = ∅: reject the input word and stop.
  4) IF aᵢ = ⊣: accept the input word and stop.

Procedure MAKEBROTHER(ẇ):
  /* Apply rule (4) of Definition 2.2 for those nodes where this has not been done yet. */
  IF ẇ has not been considered in step i for this test:
    ∀ fathers v̇ of ẇ:
      IF σ(ẇ) = ε: MAKEBROTHER(v̇)  /* don't generate ε-brothers */
      ELSE GENERATE(v̇, σ(ẇ))  /* generate a brother of ẇ */

Procedure GENERATE(v̇, α):
  /* Generate a new node with label α and connect it to the father v̇. */
  /* If such a node was already generated in step i, use it instead. */
  IF ∄ ẇ generated in step i with χ(ẇ) = α:
    generate a new node ẇ, χ(ẇ) := α
    add ẇ to Ċᵢ₋₁
  ELSIF MAKEBROTHER(ẇ) had been completed previously in step i:  /* make up leeway */
    let α = Xδ for the following section:
    IF δ = ε: MAKEBROTHER(v̇)  /* don't generate ε-brothers */
    ELSE GENERATE(v̇, δ)  /* generate/connect also to the brother of ẇ */
  add a new edge (v̇, ẇ) to Ė

Explanations:

- The two steps of Theorem 3.7, generating Ġᵢ₊₁ and therefrom Ġ′ᵢ₊₁, are combined. Procedure GENERATE automatically unifies all nodes which would be generated in Ġᵢ₊₁ in one input step, so we directly produce Ġ′ᵢ₊₁.
- GENERATE never produces an ε-node, because it is always called with a non-empty argument α. Instead, rule (4) of Definition 2.2 is applied at once whenever a call with ε would be possible.
- The quantifier ∀ denotes a loop with nondeterministic choice of the quantified elements. In case 1) of MAIN, we can select any element from the cursor set until all elements have been processed. Note that while an element is processed, new members may be added to the cursor set by GENERATE.
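For intuition only, the control flow of MAIN can be imitated by a much-simplified recognizer that keeps the cursor remainders in a plain set instead of a graph. This sketch (ours, not the algorithm above) loses the CGPL compression entirely: it may blow up exponentially and it diverges on left-recursive grammars, which is precisely what the graph structure avoids.

```python
# Sketch of the pruning model only (NOT the CGPL algorithm): cursor
# "states" are remainders of sentential forms whose terminal prefix
# matches the input read so far. Assumes a grammar WITHOUT left
# recursion; '$' stands in for the end-of-file marker of Section 4.

END = '$'

def recognize(productions, start, word):
    word = word + END
    cursors = {(start, END)}          # the initial cursor, label S followed by the marker
    for a in word:
        # step 1): expand leading nonterminals until every remainder
        # starts with a terminal (a seen-set guards against unit cycles)
        work, seen, terminal_led = list(cursors), set(cursors), set()
        while work:
            rem = work.pop()
            if rem and rem[0].isupper():          # leading nonterminal
                for lhs, gamma in productions:
                    if lhs == rem[0]:
                        new = tuple(gamma) + rem[1:]
                        if new not in seen:
                            seen.add(new)
                            work.append(new)
            elif rem:
                terminal_led.add(rem)
        # step 2): let only matching cursors survive
        cursors = {rem[1:] for rem in terminal_led if rem[0] == a}
        if not cursors:
            return False                          # step 3): reject
    return True                                   # marker consumed: accept

# toy grammar a^n b^n (our example, not from the paper)
P = [('S', ('a', 'S', 'b')), ('S', ('a', 'b'))]
```

The step numbers in the comments refer to MAIN; acceptance simply means that the marker step did not empty the cursor set.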
- MAKEBROTHER will "generate" a brother of a node, or call itself recursively in the case of ε-brothers, only if this has not already been done in the same input step. In this way, cycles in the graph and/or multiple copies caused by different paths to the same node do not matter.[3] Whenever a brother of some ẇ has to be "generated", it has to be connected to all of ẇ's fathers, which is done automatically by the unification in GENERATE. This results in (at most) one brother connected to all fathers. However, in some cases not all fathers are known to MAKEBROTHER, because new fathers may be added later. If new fathers are later added to such a node ẇ via GENERATE, we have to make up this leeway by doing the same things as had been done previously in MAKEBROTHER.
- All really new generated nodes are added to the set of cursor nodes Ċᵢ₋₁, without regard to whether their total prefix really matches all input symbols read so far. When the generated γ's begin with another nonterminal, they are treated again and will therefore be removed from Ċᵢ₋₁. At the end of step 1) of MAIN, all nodes in Ċᵢ₋₁ begin with a terminal. The subsequent step 2) then takes over to Ċᵢ only the matching nodes.

5 Complexity Considerations

By construction, algorithm 4.1 accepts any context-free grammar[4] G. First we estimate the space and time complexities in the worst case, when the grammar is unrestricted: In the worst case, procedure GENERATE can produce at most c different nodes in any input step, where c is a constant bounded by the grammar size. So the total number of nodes is O(n), and the total space complexity is O(n²) for the edges, assuming that nodes may have O(n) predecessors.

(a) Seeing the general-case time complexity at once is a little more difficult. First, we leave out the ELSIF part of GENERATE, which complicates things. Once we have shown the complexity without it, we will find that its presence does not change anything.
We can state the following:

- The inner complexity of GENERATE (leaving out the ELSIF part) is bounded by a constant, for it has to compare α with at most c nodes generated in the same input step. In a smart implementation, this comparison may be achieved by a simple table lookup.
- The sets Ċᵢ₋₁ never contain more than 2c elements, because at most c new elements may be added by GENERATE to at most c old members.
- Step 1) of MAIN may execute the subcases a) and b) at most 2c times in every pass of the i-loop. The inner complexity of case b) is O(n), yielding an overall time complexity of O(n²) caused by step 1).
- Regardless of the number of times MAKEBROTHER is called, it performs its inner body only for at most O(n) different nodes in one input step. When the node has been tested before, the inner complexity is bounded by a constant. Otherwise, when the inner body is executed, its inner complexity is O(n), because ẇ has at most O(n) fathers. So the complexity caused by MAKEBROTHER in one input step is O(n²), in spite of the recursion; the trick is that the test prevents cycles and multiple visits.
- Thus follows the overall time complexity of O(n³).

[3] In fact, this is basically a transitive closure algorithm.
[4] There are no restrictions on the grammar; it need not be reduced, may have ε-productions, and may contain useless and unit productions, or cycles X ⇒⁺ X.

Now we have to drop our assumption that the ELSIF part of GENERATE is left out. The only purpose of the ELSIF part is to do the same things which would have been done by MAKEBROTHER if all fathers had been present from the beginning. So our argument also holds when the ELSIF part is taken into account.

(b) Now we are interested in the time complexities when the grammar is restricted in some way. First we investigate the case of grammars of bounded ambiguity. A grammar is said to be of bounded ambiguity k when all possible derivations lead to at most k different derivation trees.
The following arguments show that the overall complexity is O(n²) in this case: By the definition of bounded ambiguity, every node v̇ in the CTPL has at most k successors which will "survive" until rule (4) of Definition 2.2 is applied to v̇ for the last time. Note that as long as the subgraph Ṡ(v̇) is not completed, there may be more successors, but they are guaranteed to produce no more than k brothers of v̇ in at most k different input steps, because the sublanguage L(μ(v̇)) is also of bounded ambiguity k. Therefore the maximum number of brothers v̇ can get by rule (4) is k. As a consequence, the total number of successors of each node is bounded by some constant k′. This is because successors are generated either in step 1)a) of MAIN (at most |P|) or by MAKEBROTHER, which will be called at most k times for each differently labeled brother,[5] so the maximum number of successors depends only on the grammar. It follows that the total number of edges in our graph is now O(n). Each time MAKEBROTHER examines all fathers of ẇ, it goes through the incoming edges of ẇ. In one input step, any edge can be touched only once by MAKEBROTHER (or substitutively by GENERATE), so MAKEBROTHER has complexity O(n) in this input step, resulting in an overall complexity of the algorithm of O(n²).

(c) Finally, we investigate the case where the grammar is LR(k). By the definition of the LR(k) property, multiple branches occurring in the CTPL will lead to a dead end after at most k input steps. Now we argue in a simple way: Let v̇ be in Ċᵢ₋₁; then for i < j < i+k there may occur multiple descendants of v̇ in Ċⱼ₋₁, but in Ċᵢ₊ₖ₋₁ there will remain at most one descendant of v̇. So the number of brothers produced by MAKEBROTHER in each input step is bounded by some constant k′, for each brother generated at a certain input step j must be a descendant of a bounded number of cursors of previous cursor sets.
As can easily be seen, MAKEBROTHER now has constant bounded complexity in each input step; the same is true for step 1) of MAIN and for all other parts of the algorithm. So we have an overall time complexity of O(n).

As a result of this section, we find that the worst-case asymptotic time complexity of our algorithm is as good as Earley's. We are currently investigating the practical aspects of our algorithm, to see the behaviour of the constant factors. Also, the nondeterministic choices in step 1) of MAIN give the opportunity to use several heuristic selection strategies. We recommend a queue structure for the Ċᵢ₋₁, where elements are removed at the front and added at the end. However, we currently have no idea whether other strategies (like sorting by number of successors, etc.) will give some gains, at least with certain grammars. This may be a wide research area.

6 Possible Modifications, Conclusion

Our algorithm could be classified as an efficient general directional top-down method with maximally restricted breadth-first search. There are several ways to implement it; one has been carried out by the author. However, there are more interesting points regarding possible modifications of the algorithm itself. Several techniques known from improvements of Earley's algorithm (see, e.g., [GHR80], [GJ90], [Lei90]) could be applied to our algorithm in a modified form. For example, we could introduce lookahead in several ways. A simple way would be to compute FIRST and FOLLOW sets in order to restrict case 1)a) of MAIN to those nodes matching the lookahead. Similarly, the reduction in case 2) could be restricted by lookahead techniques. Other kinds of precomputation could be useful, too. On the lowest level, it would be interesting to compare the running time of our algorithm with the other known general context-free parsers. This seems to be very difficult, for there are no commonly defined micro-operations allowing a hardware-independent comparison.
Earley's algorithm has been adapted to many grammar classes used in the field of natural language processing. So it is interesting whether adapting our algorithm to, say, ID/LP grammars or various unification methods would give any gains. Also, it can easily be seen that there are efficient bit-vector implementations: just view the graph as an adjacency matrix, then operate on sets of edges by representing them as bit vectors. We expect this to increase speed on real machines and real problems considerably.

[5] However, each generated brother may produce new brothers in turn, but they have shorter labels. This process will terminate after at most c′ generations of brothers, where c′ is the maximum number of nonterminals on the right-hand side of the productions.

The interrelations to Earley's algorithm and its varieties will be explained in a separate paper. To our current knowledge, there is an argument that gives evidence that our algorithm is at least as good as Earley's: just partition Earley's state sets Sᵢ into groups where all group members have the same dotted rule X → α.β (but different back-pointers). Then each group roughly corresponds to a node in our algorithm, labeled with β. The main difference is that Earley's pointers point to state sets, whereas our edges point to "states", i.e. to such "groups". At first glance, Earley's structure seems to be less expensive than ours, because one particular Earley pointer pointing to a state set may address multiple states, while we need an extra edge for each of them. However, Earley's completer has to do exhaustive search to find all of these nodes within the addressed state set, while our algorithm simply follows an edge backwards. Another difference to Earley's algorithm is that it can only unify dotted rules resulting from the same production, while our algorithm takes only the remainder behind the dot into account for melting nodes.
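The grouping argument can be sketched as follows (our illustration; the item representation is hypothetical): Earley items are keyed by the remainder behind the dot, so items from different productions, and even with different left-hand sides, can share a group.

```python
# Sketch: grouping Earley items by the remainder behind the dot.
# An item is (lhs, rhs, dot, back); a group collects all items of a
# state set sharing rhs[dot:], regardless of lhs and of rhs[:dot].

from collections import defaultdict

def group_items(state_set):
    groups = defaultdict(list)
    for (lhs, rhs, dot, back) in state_set:
        groups[rhs[dot:]].append((lhs, rhs, dot, back))
    return dict(groups)

# Items from different productions with the same remainder ('B','c'):
state_set = [('X', ('a', 'B', 'c'), 1, 0),
             ('Y', ('d', 'B', 'c'), 1, 2),
             ('X', ('a', 'B', 'c'), 2, 0)]
groups = group_items(state_set)
```

The third item, with a different dot position, lands in its own group, while the first two are unified although they stem from different productions with different left-hand sides.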
Conversely, we have thereby discovered an improvement of Earley's algorithm: just view states ⟨X → α.β, j⟩ and ⟨X → α′.β, j⟩ as equal even if α ≠ α′. Our algorithm would, moreover, allow unifying ⟨X → α.β, j⟩ and ⟨Y → α′.β, j⟩ where X ≠ Y, speaking analogously in Earley's terms.

The interrelations to Tomita's algorithms are not clear at present. When applying graph-structured stacks [To88] to a top-down algorithm (e.g. applying graph structuring to the LL(1) method), we get an algorithm which seems similar to ours. The main difference will be that the labels are symbols rather than strings. A similar behaviour could be obtained in our algorithm when the grammar is transformed to a special left-recursion-free Chomsky normal form in which some redundant information is thrown out by sharing common suffixes of the original grammar through the same chains of rules. But we feel our algorithm will be more practical than this.

References

[BPS75] Bouckaert, M., Pirotte, A., Snelling, M.: Efficient Parsing Algorithms for General Context-free Parsers. Inf. Sciences, vol. 8, no. 1, p. 1-26 (Jan. 1975)
[Ea70] Earley, J.: An Efficient Context-Free Parsing Algorithm. Comm. ACM 13, 2 (Feb. 1970), 94-102
[GJ90] Grune, D., Jacobs, C. J. H.: Parsing Techniques. A Practical Guide. Ellis Horwood 1990, ISBN 0-13-651431-6
[GHR80] Graham, S. L., Harrison, M. L., Ruzzo, W. L.: An Improved Context-Free Recognizer. ACM Trans. Prog. Lang. Syst. 2, 3 (1980), 415-462
[HU79] Hopcroft, J. E., Ullman, J. D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley 1979, ISBN 0-201-02988-X
[KA89] Kruseman Aretz, F. E. J.: A new approach to Earley's parsing algorithm. Sci. Comput. Program., vol. 12, no. 2, p. 105-121 (July 1989)
[Lei90] Leiss, H.: On Kilbury's Modification of Earley's Algorithm. ACM Trans. Prog. Lang. Syst., vol. 12, no. 4, p. 610-640 (Oct. 1990)
[To86] Tomita, M.: Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston, 1986
[To88] Tomita, M.: Graph-structured stack and natural language parsing. Proc. 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, NY, June 1988, p. 249-257