Seite 1  Seite      Textversion  Grafikversion    Übersicht

Universität Stuttgart

Fakultät Informatik

Constructing Height-Balanced

Multicast Acknowledgment Trees

with the Token Repository Service

C. Maihöfer, K. Rothermel

Bericht 1999/15
November 1999

Constructing Height-Balanced

Multicast Acknowledgment Trees

with the Token Repository Service

Authors:
Dipl.-Inf. C. Maihöfer
Prof. Dr. K. Rothermel

Institut für Parallele und Verteilte
Höchstleistungsrechner (IPVR)
Fakultät Informatik
Universität Stuttgart
Breitwiesenstr. 20 - 22
D-70565 Stuttgart


Seite 1  Seite      Textversion  Grafikversion    Übersicht
Seite 2  Seite      Textversion  Grafikversion    Übersicht

Abstract -- Many reliable multicast protocols use socalled ACK-trees to avoid the well-known acknowledgment implosion problem in case of large multicast groups. For constructing ACK trees, usually expanding ring search techniques are applied. Our simulation results show that those techniques have scalability problems itself.
In this paper, we propose a novel approach for building ACK trees, the token repository service (TRS). The basic concept of our approach is a token, which represents the right to connect to a certain node in the corresponding ACK tree. For each node in the ACK tree TRS stores a token for each potential successor this node can accept. A node that wants to join a group requests TRS for an appropriate token. The TRS protocol described in this paper provides for height-balanced ACK trees.
Our simulation results show that the created heightbalanced ACK trees have significant benefits. They reduce round trip delay and optimize reliability in case of node failures. Moreover, compared to expanding ring search, TRS results in a much lower message overhead.

I. INTRODUCTION

A great variety of todays networked applications require reliable one-to-many or many-to-many communication. For those applications a number of reliable multicast protocols [1,2,3,4,5,6] have been proposed. Most of them are based on IP multicast [7,8] and use positive or negative acknowledgments (ACKs) to confirm delivery. In large multicast groups, the sender may be overwhelmed by the amount of acknowledgments received, which is the well-known ACK implosion problem. The most promising approach to address this problem is to organize the group of receivers in a so-called ACK tree [3,4,5,6]. While the multicast messages are sent directly to the receivers (e.g., by using IP multicast), the responded ACK messages are aggregated and propagated along the edges of the ACK tree.
When a new member wants to join the multicast group, it has to be connected to the group's ACK tree. Several techniques have been proposed in the literature for maintaining the ACK tree, most of them are based on expand-

ing ring search (ERS) [9]. The advantage of ERS is its simplicity and fault tolerance. On the other hand, our simulations results will reveal several drawbacks of ERS, like poor scalability and problems correlated with the various multicast routing protocols.
In this paper, we propose the token repository service (TRS) as an alternative approach for constructing ACK trees. It is based on the concept of a distributed token repository, where a token represents the right to connect to a certain node in an existing ACK tree. A new member willing to join the group's ACK tree asks the TRS for a token of that group. TRS selects and delivers a token, which identifies the node in the group's ACK tree the new member can connect to. For each successor the new member can accept, a token is generated and stored by TRS.
It is important to mention that TRS achieves the same level of fault tolerance as ERS. Even if all token information becomes unavailable due to node and communication failures TRS is still operational. In [10] we have already proposed TRS with the random-choice strategy (TRS-RC). Simulations have shown that in terms of message overhead and reliability of the created ACK trees, TRS-RC is a substantial improvement over ERS. However, in terms of round trip delay TRS- RC performs only with core based routing approaches better than ERS. Moreover, the created ACK trees are not completely height-balanced, which means that they are not optimal in terms of reliability.
The scheme presented in this paper, TRS with minimal-height strategy (TRS-MH), allows to create heightbalanced ACK trees. A configurable deviation parameter specifies the maximum acceptable height deviation from strictly height-balanced trees. A deviation of zero results in strictly height-balanced trees, which leads to significantly lower round trip delays as compared to TRS-RC and ERS. On the other hand, creating strictly height-balanced ACK trees causes more message overhead than constructing less balanced trees. Therefore, the deviation parameter determines the trade-off between message overhead and delay/reliability. The effects of this parameter will be illustrated by our simulation results. The remainder of this paper is structured as follows. The next section discusses related work and provides for

Constructing Height-Balanced Multicast Acknowledgment Trees

with the Token Repository Service

Christian Maihöfer and Kurt Rothermel
Institute of Parallel and Distributed High-Performance Systems (IPVR), University of Stuttgart, Germany
{maihoefer,rothermel}@informatik.uni-stuttgart.de
Seite 2  Seite      Textversion  Grafikversion    Übersicht
Seite 3  Seite      Textversion  Grafikversion    Übersicht

the necessary background. In Section III, the TRS approach based on the minimal-height strategy will be described in detail. In Section IV, we compare TRS with the ERS based approaches in terms of performance. Finally, we conclude with a brief summary.

II. PROBLEM STATEMENT AND RELATED WORK

As stated above, the majority of reliable multicast protocols use the concept of ACK trees to avoid the wellknown implosion problem. For each multicast group there exists an ACK tree, whose nodes are the members of the multicast group. ACK messages flow in a leaf-toroot direction, where an ACK sent by a node acknowledges receipt for the entire subhierarchy of this node. To avoid the implosion problem, each node in the ACK tree is assumed to have an upper bound on the number of its successors. We will call a node k-bounded if the number of successors never exceeds k. A k-bounded node is defined to be occupied if it has k successors. The bound chosen for a node depends on various characteristics, such as the node's reliability, load, and performance. When a new member joins the group, it must be connected to a non-occupied node in the corresponding ACK tree. In order to allow for large multicast groups, the underlying join mechanism must be scalable and efficient. Moreover, it is desirable that this mechanism constructs well-formed ACK trees, i.e. those with minimal height.
The height of the ACK tree affects both delay and reliability. Low delays are desirable since the average throughput of a reliable (multicast) channel based on the PAR scheme [11] is limited by the quotient of buffer size / round trip delay. The lower the delay, the higher is the throughput respectively the lower is the required buffer size.
The multicast service may be disrupted for a node if one of its predecessors in the ACK tree becomes unavailable. Therefore, the lower the number of predecessors the higher the reliability from this node's perspective. So, the average path length of the ACK tree can be taken as a quality criteria for reliability.
Most approaches to establish ACK trees are variations of ERS, which was first proposed in [9]. When a node wants to join the ACK tree, basic ERS [3] simply multicasts a join request to the members of this group. In order to decrease network load and to find a predecessor as close as possible to the searching node, the search is limited by a search scope. The initial request is sent with an time-to-live (TTL) of 1 and thus is limited to the sender's LAN. When a non-occupied node in the group's ACK tree receives this message, it returns an answer. If no node answers within a certain time, the requestor multi-

casts the join request with an increased TTL. It repeats this process until an answer arrives or the maximum TTL of 255 is reached. The node that answers first becomes the predecessor of the new member.
In a variation of ERS, called expanding ring advertisement (ERA), the nodes that are already in the ACK tree actively search for successors [4,6]. Non-occupied ACK tree nodes send multicast invitation messages to gain further successor nodes. Since the receivers need not send multicast messages to join the ACK tree, no support for bidirectional multicast is required. Some protocols use a combination of ERS and ERA (e.g., see [5]).
The great advantage of ERS and ERA is its implicit fault tolerance. On the other hand, our simulation results in Section IV will show that ERS as well as ERA may cause a huge message overhead. Another shortcoming of ERS is its dependency on the various routing protocols, each has its own problems with respect to ERS. ERS with DVMRP [12] routing leads to a vast overhead at all involved routers because a new multicast routing tree is to be build for each node joining a group. For pure receivers, these trees are only used for running ERS. With shared tree approaches like PIM-SM [13], the use of ERS results in a traffic concentration at the core, an even higher message overhead, and ACK trees of poor quality. A serious drawback of ERA is the message overhead due to the invitation messages, which are sent even if no node wants to join.
In [10], we have proposed the basic TRS with randomchoice strategy (TRS-RC) for constructing multicast ACK trees. We have shown that TRS-RC causes a much lower message overhead than the ERS or ERA schemes. Moreover, in terms of reliability and round trip delay TRS-RC performs in many cases better than ERS and ERA (see Section IV for details).
TRS with minimal-height strategy (TRS-MH), presented in this paper, constructs height-balanced ACK trees. Simulation results in Section IV show that in terms of delay and reliability, TRS-MH is superior to TRS-RC.

III. THE TOKEN REPOSITORY SERVICE WITH MINIMAL- HEIGHT STRATEGY (TRS-MH)

A. Basic Principle and Interface

The core concept of our approach are tokens, where a token represents the right to connect to a particular node in a given ACK tree. When a k-bounded node joins the ACK tree, k tokens are generated and stored in the TRS. The joining node is called the tokens' owner. A token is therefore identified by its group and owner. We define the height of a token to be the height of its owner in the corresponding ACK tree.
Seite 3  Seite      Textversion  Grafikversion    Übersicht
Seite 4  Seite      Textversion  Grafikversion    Übersicht

When a node wants to join a given group, it asks the token repository service for a token belonging to this group. The repository service selects a token of this group and delivers it. The node receiving this token can then connect to the token's owner in the corresponding ACK tree. When a node leaves a group, it returns the token to the repository service, which then can be reused by some other node joining this group later. Since only k tokens are generated for a k-bounded node, no more than k successors can be connected to it in the corresponding ACK tree. Table 1 shows the operations provided by the TRS.

B. Implementation as a Distributed Server Hierarchy

In order to achieve scalability, the TRS is implemented as a distributed system consisting of a hierarchy of token repository servers, or repServers for short. In our approach, the network is structured into hierarchical domains. Leaf domains encompasses a disjunct set of nodes, inner domains include all successor domains and finally the root domain includes all nodes of the network. We assume that domains group the network by communication distance, i.e. the communication distance between two nodes belonging to the same domain is typically smaller than between two nodes in different sibling

domains. Figure 1 depicts the hierarchical domain structure. Each domain is administered by a repServer. For example repServer S1 in Figure 1 is responsible for domain 1 consisting of nodes N1x and repServer S5 is responsible for domain 5, which consists of several subdomains. A leaf repServer, responsible for a certain node in its domain, is called this node's home repServer. For example in Figure 1, S1 is the home repServer for all nodes N1x of domain 1. Nodes access the TRS only via their home repServer, which are always leafs in the repServer hierarchy. Non-leaf repServers are responsible for token searching (see Section III.F for details).
As mentioned before, the TRS stores tokens, representing the right to connect to a certain predecessor node in the ACK tree. These tokens are stored on leaf repServers in so-called token baskets. Each leaf repServer has a basket for each known group, which includes all tokens of the corresponding group.
When a leaf repServer is asked for a token of a given group and no suitable token is locally available for that group, it starts the token search procedure. A token is defined to be suitable if there exist no tokens of the same group with a lower height. To facilitate searching for each group a so-called group tree is maintained, which is a subtree of the hierarchy of repServers. A group's group tree contains all leaf repServers that store a token basket of this group and all ascendants of these nodes in the rep- Server hierarchy. repServers store group tree information in so-called group records. With TRS-MH, a group record stored at a repServer includes the minimal height of the tokens available in this repServer's subhierarchy for the corresponding group.
Let us briefly sketch the search procedure for TRS- MH. A repServer starts a token search by sending a search request to its predecessor. If the receiver of this request is not part of the group tree, it just forwards the request to its predecessor. Otherwise, it determines -

TABLE 1 Operations provided by the TRS

repCreateGroup (Group, AckRoot, K):
This operation announces the group identified by Group to the TRS. The root of Group's ACK tree is identified by AckRoot, which is K-bound.

repDeleteGroup (Group):
This operation deletes all state information associated with Group in the TRS.

repJoinGroup(Group, NewMember, K) ret. Token This operation is called when the node identified by NewMember wants to join Group, where NewMember is K-bounded. The operation returns a token that identifies the node, NewMember is supposed to connect to.

repLeaveGroup(Group, Member):
This operation deletes all tokens owned by Member for Group in the TRS.

repAddToken(Group, Owner):
This operation is called when a successor disconnects from Owner in Group's ACK tree. This operation adds a new token owned by Owner into the TRS.

repRemoveToken(Group, Owner):
This operation is called when a node identified by Owner in Group's ACK tree has accepted a new successor via ERS. It removes one of Owner's tokens.

Figure 1: Domain structure

Domain 1

N13 N14

N17
N16
N15

N12
N11

N23 N24

N25

N22
N21

S1 S2

S3

Domain 5

Domain 2
...

S5

... ...

S4

...
...

...
...

repServer


Seite 4  Seite      Textversion  Grafikversion    Übersicht
Seite 5  Seite      Textversion  Grafikversion    Übersicht

depending on the height information included in the local group record - whether its subhierarchy contains a suitable token. If no suitable token is available, the search domain is enlarged by forwarding the token request to its predecessor. Otherwise, the search request is forwarded in a root-to-leaf direction along the edges of the group tree until it reaches a leaf repServer storing a suitable token. Note that this search procedure together with our assumption that a node within the same domain is closer in terms of communication distance than a node in a sibling domain ensures the following: (a) From all tokens available for a given group always one with the lowest height is selected, and
(b) if there are several tokens with the lowest height, one whose owner is as close as possible to the joining node is selected.
Note that the above mentioned deviation parameter can be used to trade-off between the communication distance between nodes in the ACK tree and the height of this tree. If, for example, the acceptable deviation is 2, then a leaf repServer may select a local token (i.e., whose owner is close) even if this token exceeds the minimal height of the globally available tokens by 2.
In contrast, TRS-RC [10] does not store height information in non-leaf group records. If the random choice strategy is applied, the token is forwarded in a leaf-toroot-direction until a repServer is reached whose subhierarchy holds some token of the corresponding group. Then the token request is forwarded to some leaf repServer in the subhierarchy holding such a token, which is chosen randomly if there are several of them. Consequently, with TRS-RC, the token is selected whose owner is as close as possible to the joining node, independent of this token's height.

C. Group State Information

The group record structure at leaf repServers is depicted in Table 2. Each group record contains a token basket, which is a set of token packets belonging to the same group. The token packet structure is given in table Table 4. Each token packet contains a number of tokens belonging to the same owner.
Besides the token basket, a leaf repServer's group record contains the fields MinHeightGlobal and Min- HeightLocal. The first specifies the minimal height of all tokens in the entire repository hierarchy, while the latter records the minimal height of all locally available tokens. As we will see in Section III.F, MinHeightGlobal is only a lower bound of the group's minimal token height and not necessarily the exact value.
The group record structure for non-leaf repServers is given in Table 3. Besides MinHeightGlobal, vector Min-

HeightSub is included, which determines the group tree's successor nodes. MinHeightSub[s] denotes the vector's entry for successor s. If a successor s does not belong to the group's tree, MinHeightSub[s] contains a null entry. Otherwise, it specifies the minimal height of the group's tokens available in domain s. In the following, we will use min(MinHeightSub) to denote the minimal height of all entries in MinHeightSub.

D. Group Record Updates

A group tree may grow and shrink during its lifetime, and the contents of group records may change. A group record at a leaf repServer is to be established when the first set of tokens associated with the group are created locally, and it is deleted when the last token has been removed from it. If a new group record is created at a leaf repServer, this leaf repServer is linked to the group tree

TABLE 2 Leaf node group record

Attribute Description

Group Unique multicast group identifier

TokenBasket Set of token packets

MinHeightGlobal Minimal height of tokens in the entire repository hierarchy

MinHeightLocal Minimal height of all local tokens belonging to Group

TABLE 3 Non-leaf node group record

Attribute Description

Group Unique multicast group identifier

MinHeightGlobal Minimal height of tokens in the entire repository hierarchy

MinHeightSub Vector determining the minimal token height for each successor
domain

ExpDate Date when the group record expires

TABLE 4 Token packet structure

Attribute Description

Owner The Owner of the tokens in this packet

Height Height of Owner in the corresponding ACK tree

Tokens Amount of tokens in this packet

ExpDate Date when the packet expires
Seite 5  Seite      Textversion  Grafikversion    Übersicht
Seite 6  Seite      Textversion  Grafikversion    Übersicht

by sending its predecessor a HeightUpdate message, containing the group's identifier as well as the sender's Min- HeightLocal and MinHeightGlobal values.
A non-leaf repServer receiving a HeightUpdate message checks whether the corresponding group record already exists. If it already exists, the receiver updates MinHeightSub and MinHeightGlobal accordingly. Otherwise, it creates a new group record for this group and sets the MinHeightGlobal and corresponding MinHeightSub entry to the MinHeightGlobal and MinHeightLocal value included in the message.
If a group record is removed, a HeightUpdate message is sent to the repServer's predecessor, indicating that no local token is available anymore. If a non-leaf repServer receives such a message, the corresponding MinHeight- Sub entry is set to null. If all entries are null, this rep- Server does no longer belong to the group tree, and hence its group record can be removed also, causing a Height- Update message to be sent to its predecessor.
If we want to built a strictly height-balanced ACK tree, each time a leaf repServer delivers a new token, whose height is greater than its local MinHeightGlobal, this value is updated to the height of this token. As we will see in Section III.F this can only happen after a token search has been performed. If we take into account the deviation parameter Dev, MinHeightGlobal is only updated to the token's height if a global token search has resulted in a token with height greater than MinHeight- Global + Dev. Remember that Dev specifies the maximum allowed height deviation from a strictly heightbalanced tree. With this updating strategy we make sure that MinHeightGlobal is always a lower bound of the minimal token height, which is needed to determine when a global token search is required.
Changes of group records in non-leaf repServers are propagated accordingly to predecessor nodes. Now let us briefly describe the expiration date mechanism. All state information are maintained following the soft state principle [14]. Both, group records and token packets are associated with an expiration date. If an expiration date is not extended when it expires, the corresponding piece of information will be discarded automatically. Although our protocols allow for discarding state information explicitly, this mechanism is needed to ensure that all state information will be eventually removed even in the presence of node and communication failures. In addition, having such a mechanism in place allows to use light-weight protocols for explicitly deleting state information.
How can expiration dates be extended? Clearly, the lifetime of a token packet depends on the lifetime of its owner. When a token packet expires, the repServer stor-

ing this packet asks the packet's owner to extend the expiration date. If the owner responds, the expiration date is updated accordingly, otherwise the entire package is removed. If the token basket of the corresponding group becomes empty, the token basket's group record is deleted and the leaf repServer removed from the group tree.
When a group record expires, the repServer storing this record asks the successor repServers whether they are still part of the corresponding group tree. If at least one of them responds positively, then the record's expiration date is extended accordingly, otherwise it is deleted and removed from the group tree.

E. Creating and Deleting Groups

In this and the following section we describe the group management operation, i.e. create group, delete group, join group and leave group in detail. When describing the protocol we assume the absence of failures. We will consider communication and node failures in Section III.H. When a new group is created, tokens must be created and an initial group tree must be established. Figure 2 illustrates a scenario, where a node creates a group and two other nodes join this group. Assume that S1 is the home leaf repServer at which the repCreate- Group(Group1, AckRoot, K) operation of node N11 was issued. S1 creates a group record with token basket for Group with the amount of tokens specified by K, where AckRoot becomes the owner of these tokens.
The group tree to be established consists of S1 and all ancestors of S1 in the repServer hierarchy. After creating the token basket, S1 sends a CreateGroup request to its predecessor. On receiving a CreateGroup request, a rep- Server creates and initializes a group record and again forwards the request until the root repServer is reached. The MinHeightSub entry is initialized with 2, since all tokens in this domain are of height 2; all other entries are initialized with null, indicating that these successor domains do not belong to the group tree. MinHeightGlobal is initialized with 2 because the lowest height of the group's globally available tokens is 2, too.
When the operation repDeleteGroup(Group) is issued at a leaf repServer, this server deletes the Group's group record with all token packets and sends a DeleteGroup request to its predecessor. Non-leaf repServers forward this request along the edges of Group's group tree, and each repServer receiving this request deletes all state information associated with Group. Note that this explicit discarding of state information is only an optimization

1. Here we assume an external mechanism that
ensures group ids to be unique. See [10] for details.
Seite 6  Seite      Textversion  Grafikversion    Übersicht
Seite 7  Seite      Textversion  Grafikversion    Übersicht

since the expiration date mechanism ensures that all state information is removed eventually.

F. Joining and Leaving Groups

When a repJoinGroup(Group, NewMember, K) returns (Token) operation is called, the called leaf repServer checks its group record to see, whether it has a suitable token in the Group's token basket. A suitable token is available at a leaf repServer if the following condition is fulfilled:
MinHeightLocal ? MinHeightGlobal + Dev,
where Dev is the deviation parameter mentioned above, i.e. strictly height-balanced trees are constructed if Dev is set to zero.
If a suitable token is locally available, a token with height MinHeightLocal is removed from the token basket and returned to the caller of repJoinGroup. The home repServer also generates a new token packet for New- Member with K tokens and puts it into Group's token basket.
In the example depicted in Figure 2 and the following ones, we will assume Dev=0. Figure 2 shows the outcome of a repJoinGroup performed by node N12. Its home repServer S1's check results in MinHeightGlobal is equal to MinHeightLocal. It removes a local token with height 2, returns it to N12 and generates a new token packet with the owner N12.
If a suitable token is not available at a leaf repServer,

this server starts the search procedure. In the first phase, a TokenRequest request is forwarded in leaf-to-root direction until a repServer is found that is part of Group's tree and the following condition is satisfied:
min(MinHeightSub) ? MinHeightGlobal + Dev. If this condition holds at a repServer, a suitable token can be found in this repServer's subhierarchy, and hence the search domain needs no further enlargement. Such a situation is depicted in Figure 2. N21 initiates a repJoinGroup operation at its home repServer S2. S2 has no token basket for the requested group. Therefore, it starts a global token search. S5 is the first repServer that is part of the group tree. Since MinHeightGlobal is equal to min(MinHeightSub), the first search phase stops and the second, root-to-leaf directed search phase, is initiated by S5.
In the second search phase, each repServer forwards the TokenRequest message to a subhierarchy s with Min- HeightSub[s] = min(MinHeightSub) until a leaf repServer is reached. If more than one subhierarchy satisfies this condition, the repServer randomly selects one of them. Finally, the found leaf repServer removes a token with height MinHeightLocal from Group's token basket and delivers it directly to the searching leaf repServer. In the example depicted in Figure 2, S5 has to choose S3, which itself choose leaf repServer S1.
When receiving the token, the searching leaf repServer establishes a new group record with a token basket, including a token packet owned by NewMember and finally delivers the received token to the caller of rep- JoinGroup. In order to connect itself to Group's group tree, it sends a HeightUpdate message to its predecessor (see Section III.D).
Figure 3 depicts the group state information after N21 has received the token. MinHeightGlobal has still the value 2 since S1 has still tokens with this height. Min- HeightSub[S4] at S5 and MinHeightSub[S2] at S4 are set to 3, the height of the tokens at S2.
Subsequent repJoinGroup operations will first consume S1's tokens with height 2. If the last token with height 2 has been consumed, the next repJoinGroup will result in a global token search, because MinHeightGlobal still indicates the existence of a token with height 2 at most group tree's repServers. Of course, the global token search cannot find such a token, hence the token with the lowest height of all remaining ones is chosen and Min- HeightGlobal of the searching repServer is updated. The predecessor is informed of this update by a HeightUpdate message as already explained in Section III.D. Assume that in Figure 3, S1 has delivered its last token with height 2. Then S1 sends a HeightUpdate message to its predecessor with the new minimal token height of 3. This

Figure 2: Group state information after creating a group

Token Basket

repServer

N11 Client

S1

S3

S5

S4

S2

N12 N21

repJoinGroup repJoinGroup
repCreateGroup

2

2

2

2

2
2

S6root
MinHeightGlobal
MinHeightSub

Group Record

2 2
2

MinHeight-
Local

MinHeight-
Global

- - -

- - -

-
2
-
2


Seite 7  Seite      Textversion  Grafikversion    Übersicht
Seite 8  Seite      Textversion  Grafikversion    Übersicht

message is forwarded to the ancestors until the root rep- Server S6 receives it. Now a repJoinGroup is initiated at S2 and hence S2 starts a token search since MinHeight- Global denotes that possibly a better token could be found globally. The search request is forwarded until S6 is reached, where it also finds no successors with a suitable token of height 2. Finally S2 chooses a local token. Figure 4 shows the group state information after the processed updates initiated by S2. Note that all MinHeight- Global entries on the path from S2 to S6 are updated to token height 3, but not those on other paths. For example S1 and S3 still have an MinHeightGlobal entry of 2, indicating that possibly a token with height 2 could be found, which obviously is not the case.
Now let us consider these exceptional cases in more detail. First of all, if the first search phase discovers no token with a smaller height than the searching rep- Server's MinHeightLocal value, then the searching leaf repServer chooses a local token. The search can be stopped if at a repServer's group record MinHeightGlobal is equal to MinHeightLocal. If in our example depicted in Figure 4 S1 initiates a token search as its MinHeightGlobal value indicates that a better token might be available, the search process can be terminated at S5. Since MinHeightGlobal is 3, and there are also local tokens at S1 with the same height, S1 chooses a local token. Of course, S1 initiates the update of MinHeightGlobal after delivering the local token. If a search request arrives at the root repServer, the root checks if min(MinHeightSub) is smaller than the searching repServer's MinHeightLocal value included in the TokenRequest message. If this is the case, the root rep-

Server sends the TokenRequest to the appropriate successor repServer. The search algorithm is summarized in Figure 5.

During the token search, inconsistencies in the search records may occur due to network partitioning and/or node crashes (see Section III.H.). Temporary inconsistencies can also occur due to the time needed to propagate updates.
Finally we describe the repLeaveGroup operation.

Figure 3: Group state information after a
join operation at S2

repServer

S1

S3

S5

S4

S2

2

2

2

2

2
2

S6root

Group Record

2 2

MinHeight-
Local

MinHeight-
Global

3
2

3 3
3

3
2

3
-
-

-
-
-
-
-

MinHeightGlobal

MinHeightSub -
2
-
2

3

H receives repJoinGroup (Group, NewMember, K) from N:
if MinHeightLocal > MinHeightGlobal + Dev
send TokenRequest(Group, MinHeightLocal, H) to predecessor

elsesend token with height == MinHeightLocal to N

R receives TokenRequest(Group, MHL, H) from successor:
/* MHL is the MinHeightLocal of the searching leave repServer H */ if (MHL <= MinHeightGlobal + Dev) or
(R == root and MHL <= min(MinHeightSub))
take local token at H
elseif (min(MinHeightSub) <= MinHeightGlobal + Dev) or
(R == root)
send TokenRequest(Group, MHL, min(MinHeightSub), H)
to a succ. S with MinHeightSub[S] == min(MinHeightSub)

elsesend TokenRequest(Group, MHL, H) to predecessor

R receives TokenRequest(Group, MHL, Height, H) from a predecessor: /* MHL is the MinHeightLocal of the searching leaf repServer H Height is the height of the searched token */
if min(MinHeightSub) <= Height
send TokenRequest(Group, MHL, Height, H) to a succ. S
with MinHeightSub[S] == min(MinHeightSub)

elsesend HeightUpdate and TokenRequest(Group, MHL, H) to pre.

Figure 5: Search algorithm

Figure 4: Group state information after S1 has delivered the last token with height 2

repServer

S1

S3

S5

S4

S2

3

3

3

2

3
2

S6root

Group Record

3 3

MinHeight-
Local

MinHeight-
Global

3
3

3 3
3

3
3

3
-
-

-
-
-
-
-

MinHeightGlobal

MinHeightSub -
3
-
3


Seite 8  Seite      Textversion  Grafikversion    Übersicht
Seite 9  Seite      Textversion  Grafikversion    Übersicht

When a node, say N, leaves a group, it conceptually releases a token owned by its predecessor. Hence, N's predecessor is requested to add this token by means of the repAddToken operation to the token basket of its home repServer when it recognizes that N leaves the group. As we assume that a node has no descendants in the ACK tree when it leaves the group, all tokens owned by N should be in the group's token basket stored on N's home repServer at the time the repLeaveGroup operation is called. When receiving this call, N's repServer removes N's token packet from the group's token basket. If the token basket becomes empty, it's group record is removed and a HeightUpdate message is sent to N's predecessor.

G. Cashing of Group Information

In our descriptions so far, we have not considered that a repServer's memory is a limited resource. Since we cannot guarantee that a repServer can store all created tokens, we must introduce a token caching mechanism. We assume that a repServer has a token cache, i.e. it stores all tokens as long as memory is available and throws away tokens if it runs out of memory. The design goal of the caching strategy is to distinguish between more and less valuable tokens. If memory runs short, the less valuable ones are the first to be discarded. The value of a token is determined by three factors: the effort it takes to get another token for the same group, the probability that the token is needed for following join operations and the height of the token, which determines its quality. The higher these factors are, the more valuable is a token and hence should be kept as long as possible. We divide token packets into three categories. A token packet is a first class token packet, i.e. most valuable, if it is the only token packet for the corresponding group at this repServer. This refers to the first value factor; if a first class token packet is discarded, the effort to get another token is high, since a global token search must be started.
A second class token packet is a token packet of a multicast group that is highly dynamic and/or grows very fast. In particular, this can be the case if the group has been created only recently. The definition of this second class recognizes the second value factor, the probability that a token is needed for a following join operation. However, to keep tokens of recently created groups has a second reason, too. In general, the ACK tree of a recently created group has only a few hierarchy levels. If tokens are discarded in this phase, this may result in highly unbalanced ACK trees.
Third class token packets are all remaining ones not

belonging to class one or two. Third class tokens should be removed at first, then second class tokens and first class tokens last. If a class contains several token packets, always the token packet with the greatest height should be removed.
With the described caching mechanism we ensure, that if memory runs short, we keep valuable tokens as long as possible. This reduces the message overhead of the TRS and leads to well-shaped ACK trees with low height and low delays between successor and predecessor nodes. As we will be seen in the next section, our protocol is still operational even if all tokens are discarded. Hence, limited memory does not affect the robustness of the proposed protocol.

H. The Protocol in the Presence of Failures

RepServers may become unavailable due to crashes or network partitioning. Since we assume, that token basket as well as group records are stored in volatile memory due to performance reasons, these information is lost in case of crashes and will not be recovered.
When the home repServer is not available when rep- CreateGroup, repDeleteGroup or repJoinGroup is to be performed, some other leaf repServer can be selected, preferably in a close domain. The caller of repLeave- Group can just give up in this case as the expiration date mechanism will ensure that the corresponding token packet will be deleted.
If the home repServer is not available, alternatively, the issuer of repCreateGroup and repDeleteGroup can just give up. In the case of repCreateGroup no token information is established, which is treated the same way as the loss of tokens due to crashes (see below). In the case of repDeleteGroup the expiration date mechanism will eventually removes all outdated tokens. To perform joining a group, the joining node can alternatively initiate ERS if its home repServer is down (see below). If a non-leaf repServer crashes, it looses all group records. Although possible, we believe that recovering the group records after restarting is not worth the effort. Due to an unavailable or recently restarted repServer, a group's tree may be in an inconsistent state. However, the expiration date and update mechanism take care of that and after some time, each group tree is recovered. If a join group operation fails due to an unavailable repServer or inconsistent group tree, ERS must be used instead. When ERS delivers a predecessor node, the joining node as the owner, creates a new token packet by issuing a repAddToken, preferably at its home repServer. With this mechanism we ensure that the next join operation concerning this group can be processed by the TRS again. After a node has joined via ERS, the predecessor
Seite 9  Seite      Textversion  Grafikversion    Übersicht
Seite 10  Seite      Textversion  Grafikversion    Übersicht

node (discovered by ERS) calls a repRemoveToken operation at its home repServer to remove one of its tokens in the repository.
In summary, the TRS is as robust as ERS, since it is still operational even if all repServers are unavailable.

IV. SIMULATIONS

In this section, we will present simulation results that compare TRS with expanding ring search strategies in terms of message overhead, round trip delay and reliability of the generated ACK trees.

A. Simulation Scenario

Our simulations were performed using the NS2 [15] network simulator. The networks were generated with Tiers [16] and consists of up to almost 2000 nodes. The links' bandwidth is 10Mbps for LAN links, 100Mbps for MAN links and 1000Mbps for WAN links; the link delays are chosen randomly for each link from 1ms to 3ms for LANs, 1ms to 8ms for MANs and 5 to 19ms for WANs.
Since the multicast routing protocol significantly affects the measured results, we have run most simulations with both, the distance vector multicast routing protocol (DVMRP) [12] and protocol independent multicast - sparse mode (PIM-SM) [13].
Some simulations are performed with various background traffic conditions. The background traffic is generated by randomly placed senders and receivers of TCP streams. The traffic generated by a sender is distributed exponentially. Since the background traffic consumes a lot of CPU and memory resources we were not able to simulate high background load with the given network bandwidths. Therefore, we had to decrease the bandwidth by factor 100 to be able to run the simulations with background traffic. The TRS is configured with 8 leaf repServers and a branching factor of 2, which results in 15 repServers in total.

B. Simulation Results

Figure 6 shows the average received number of messages per repServer. Since the minimal-height strategy results in more global token searchs than the randomchoice strategy, more messages must be sent within the repository tree. However, if a maximum height deviation of 2 is allowed, the message overhead already decreases significantly. For example, to process 200 join operations with deviation 0, Figure 6 depicts 59 messages for the root repServer to be received. If a deviation of 2 is allowed, the root repServer must process only 6 mes-

sages. If the repServers should be overloaded, the number of repServers must be increased to distribute the load. Figure 7 shows the dependency between the message overhead and various levels of background load. The background load is defined to be the percentage of busy links during simulation time, i.e. if the background load is 100% each network link is busy during the entire simulation. The timeout parameter for ERS specifies the time per hop a node waits for an answer to arrive, before it sends a new search message with an increased TTL. For example, if the timeout is 1 second and the search scope 10 hops, then the node issuing ERS waits 10 seconds for an answer before it starts a new search.
The results show that ERS scales poorly with the background load. If the background load exceeds a certain level, the number of received messages rises exponentially. This behavior is caused by increased message delays due to high background load. When the delay of a search and the resulting answer message exceeds the timeout interval, the issuer of ERS sends a new multicast message with increased TTL. The smaller the timeout interval, the earlier occurs this effect. However, the timeout parameter can only be increased within a certain range, since this affects the delay of a join operation. Moreover, as it can be seen in the chart, increasing the timeout interval also increases the message overhead in

Figure 6: Number of received messages per repServer

0

10

20

30

40

50

60

70

20 50 70 100 120 150 170 200 220
number of join operations

averagenumberofreceivedmessages TRS-MH (dev. 0) / root
TRS-MH (dev. 0) / inner
TRS-MH (dev. 0) / leaf
TRS-MH (dev. 2) / root
TRS-MH (dev. 2) / inner
TRS-MH (dev. 2) / leaf
TRS-RC / root
TRS-RC / inner
TRS-RC / leaf

Figure 7: Scalability in terms of network load

0

2000

4000

6000

8000

10000

12000

14000

0 3 6 13 17 18 19
background load [%]

averagenumberofreceivedmessages

TRS-RC
TRS-MH (dev. 0)
ERS 0.1s timeout
ERS 0.25s timeout
ERS 0.5s timeout
ERS 1s timeout
ERS 2s timeout
ERA 20s timeout
ERA 30s timeout


Seite 10  Seite      Textversion  Grafikversion    Übersicht