Universität Stuttgart Fakultät Informatik I V P R Fakultät Informatik Institut für Parallele und Verteilte Höchstleistungsrechner Universität Stuttgart Breitwiesenstraße 20 - 22 D-70565 Stuttgart Partitioning and Mapping Techniques for Distributed Multimedia Applications Partitioning and Mapping Techniques for Distributed Multimedia Applications M. Ashraf Iqbal, Alexander Hagin CR-Klassification: C.2.4, C.4, G.1.6, G2.2, I6 Fakultätsbericht 14/1996 Technical Report Dezember 1996 Partitioning and Mapping Techniques for Distributed Multimedia Applications M. Ashraf Iqbal1 Institute of Parallel & Distributed Systems (IPVR) Breitwiesenstr. 20-22, D-70565 Stuttgart, Germany University of Engineering & Technology, Lahore 54890, Pakistan Alexander Hagin Institute of Parallel & Distributed Systems (IPVR) Breitwiesenstr. 20-22, D-70565 Stuttgart, Germany St. Petersburg State Technical University, St. Petersburg, Russia Abstract Distributed Computer Systems have become competitive in providing large amounts of computational power at a very low cost. The system consisting of personal computers, mini or main frames, or high performance multiprocessors, integrated into a high speed computer network is capable of providing any organization the power of a super machine with only a small initial cost. Such a system has the further benefit of providing industry with an easy and modular upgrade path because increasing the power of the system simply involves increasing the number of networked computers. Distributed Multimedia Applications (DMA) are so rich in their diversity of methodology and so inherently computational intensive that they naturally require a very heterogeneous mix of distributed processors interconnected by an efficient interconnection network. A DMA can be represented by a precedence graph, where nodes represent components (or modules) of the application, interconnected by arcs representing data streams flowing between different components. In this paper we study the problem of partitioning and mapping Distributed Multimedia Application graphs onto a heterogeneous distributed computer system. Such applications require a compromise between quality of service and cost of the utilized resources of the distributed computer system. As the problem is difficult to solve, in general, we present an approximate scheme which optimally assigns task modules of the application onto the processors of the distributed system. 1 This research was supported by a research grant of the University of Engineering & Technology, Lahore, Pakistan. Additional support was provided by the German Academic Exchange Service (DAAD). 1 Introduction 2 1 Introduction Recent research in parallel and distributed processing technology has resulted in many advances in all directions of computing technology, e.g., in device technology, networking capability, and in software engineering [1]. Research in device technology has resulted in faster and more powerful processors. Advances in computer network capability have introduced powerful, faster, and more reliable networks. Research in software has provided user friendly tools and environments. These advances have the potential to satisfy the ever growing computation needs of many scientific and engineering applications, e.g., now it has become economically possible to process digital audio and video signals in real time leading to the development of distributed multimedia systems. Distributed Multimedia Applications (DMA) demands an efficient framework for its implementation over a Distributed Computer System (DCS). In fact DMA problems are so rich in their diversity of methodology and so inherently computational intensive that they naturally require a very heterogeneous mix of distributed processors interconnected by an efficient network. A DMA can be represented by a precedence graph, where nodes represent components (or modules), interconnected by arcs representing data streams flowing between different components of the application. Each node of the graph is weighted by computation requirements of the corresponding component and each arc is weighted by the channel capacity needed for the communication between adjacent components. Multimedia streams can originate at multiple sources, traverse a number of intermediate components and end at multiple sinks [2,3]. The Distributed Computer System (DCS) is a heterogeneous mix of mini, microcomputers, or workstations interconnected through a point to point physical link, Local Area Network (LAN) and or a Wide Area Network (WAN). The DCS is also represented by a graph where nodes represents individual machines while edges represents virtual channels of the DCS. Each node of this graph has a weight associated with the available computation capacity. Similarly there is a weight associated with each edge, it signifies the available capacity of the corresponding channel [3, 4, 5]. By partitioning the application task onto different machines that communicate over the network, components or stages of DMA can be executed simultaneously on the machines to which they are best suited in terms of cost and time constraints of execution [6, 7, 8]. Thus a network of distributed machines may be able to provide an optimal performance for such an applications. Efficient utilization of such an approach depends on a number of issues like modeling the DCS as well as the DMA for algorithm development, design of partitioning and mapping strategies, 1 Introduction 3 and integrating these strategies into existing programming systems to solve computationally intensive problems [1, 6, 7]. We would like to address some of these problems in this paper, in particular we shall attack the following problem: Given a set of M components of the multimedia application connected in some fashion, and a distributed computer system consisting of different machines, find an assignment of components to processors that minimizes the cost of using the computational as well as communicational resources such that the load on every machine is bounded by a fixed number while the total communication overhead on the network is also kept below its capacity. Note that the last constraint, i.e., the total communication requirement should be kept below the total capacity of the network, makes it possible for us to model the DCS graph as a completely connected structure. We should also try to keep the load on every machine in DCS below a certain level in order to provide an adequate quality of service for the multimedia applications. If the number of processors are only two and we want to minimize the total cost of execution plus communication ( i.e., without any other constraints) then it is possible to solve this problem using the network flow approach pioneered by Stone [7, 15]. If the interconnection structure of the DMA graph is chain or tree like, it is still possible to solve the problem for an arbitrary number of processors using a shortest tree algorithm designed by Bokhari [6, 7]. Towsley [7, 19] has shown how to find an optimal partitioning for a series-parallel graph using a series of graph transformations. It is important to notice that the general partitioning and mapping problem is very difficult to solve. Some of the researchers have thus solved the partitioning problem by putting constraints on the structure of the application graph while others have found an optimal solution by restricting the number of processors. In [3] an approach based on branch and bound method is proposed, however, the algorithm complexity restricts the dimensions of DMA and DCS that can be handled by the algorithm. If, on the other hand, the problem is to minimize the load on the most heavily loaded processor in a conventional parallel processing environment then the problem, in general, is very difficult to solve using exact algorithms and this explains why most of the work in this field focused on heuristic techniques [5, 9]. If the structure of the application graph is as simple as a chain and the number of processors are confined to only two even then the problem is difficult to handle as reported in [10, 11, 13, 14]. A number of researchers have, however, attempted to solve the problem under certain restrictions on possible partitionings as described in [7, 10, 13]. Some of the graph theoretical research, conducted in the past was directed to find a general method for partitioning the vertices of a graph into two sets of prescribed sizes by the removal of minimum number of edges [16]. A number of researchers have also developed parallel algorithms for partitioning series-parallel and bandwidth-k graphs [17]. Solutions to such problems are also help- 1 Introduction 4 ful in designing partitioning algorithms or approximate schemes for parallel and distributed computer systems. Approximate techniques for partitioning chain and tree structured image processing tasks onto heterogeneous distributed computer systems have also been reported in [13, 14]. Iqbal [8, 11] have devised fully polynomial time approximation schemes in order to solve the partitioning and mapping problem approximately. The time complexity of such an scheme is polynomial in both the size of the problem as well as in 1/e, where e is the relative error bound for the approximate scheme. In order to appreciate the usefulness of the approximate solutions, one should bear in mind that data for the problem being solved is often only approximately known as the procedures of code type profiling and analytical benchmarking are still in their infancy. Hence approximate solutions may be as meaningful as an exact solution for many of the practical problems where the extra accuracy of the exact solution is not needed and where the approximate solution can be obtained in a relatively short time [8, 11]. In this paper we would consider the problem of partitioning DMA whose graphs are restricted to chain or tree like structures, and would present approximate solutions. The algorithms, that we present here are derived from the earlier work of Bokhari [4, 6, 7] and Iqbal [8, 10, 13, 14] who solved partitioning problems in sequential as well as in parallel processing environments using distributed machines with dedicated communication links. It is important to note that now we are concentrating on distributed computers interconnected using a general purpose LAN interconnection which, according to our opinion, would become a cost effective, efficient and commonly used resource in the coming future. The paper is organized as follows. The next section formulates the partitioning and mapping model and defines the cost functions addressed in this paper. We also model the DMA and the DCS in this section. Section 3 addresses the partitioning problem. The results and conclusions are summarized in Section 4. 2 Problem Formulation 5 2 Problem Formulation 2.1 Model of the Distributed Multimedia Applications Distributed multimedia applications are employed to generate, process, and consume continuous (e.g. audio, video) data streams. DMA topology can be constructed by specifying components interconnected via links. Components encapsulate processing of multimedia data, e.g., for generating (source components), consuming (sink components) or manipulating (filters and mixers) data. A component is an individually schedulable unit (e.g., by mapping to a thread). A link provides an abstraction from underlying communication mechanisms which may be used to perform the transport of data units. To provide a uniform data access point for the components, ports are used that deliver data units to the component (input port) or take the data units from the component (output port). A component designer has to associate with each component port the streamtype to be used, thus making all related information available at the port. A DMA can be represented by one or more precedence graphs [3]. In a DMA graph, nodes represent components that are interconnected by arcs representing data streams between components. Each component is associated with at least one device that produces (a source component) or processes (an intermediate component - filter or mixer) or consumes (a sink component) data streams. Media streams can originate at multiple sources, traverse a number of intermediate components and end at multiple sinks. Before using an application, desired user QoS (Quality of Service) is specified with respect to output data generated by sink components (e.g. presented video frame size and rate). To guarantee the specified QoS requirements, corresponding resources for DMA components and links mapped to a distributed computer system have to be reserved. Thus, each node of the application graph is weighted by computational requirements of the corresponding component and each arc is weighted by the channel capacity needed for remote communication between adjacent components. Let us consider a DMA graph and determine the values of node and arc weights. Every node is weighted by computational requirements of the corresponding component of the DMA. Let denote the arrival rate (messages per second) of input data streams to component in the DMA graph. To process every message, component needs processor operations. Let denotes the component computational requirement (operations per second). To exclude unlimited queue of the messages, it is necessary that . l i i i V i Ci C i liV i > 2 Problem Formulation 6 Every arc in the DMA graph is weighted by capacity requirement (bits per second) that must satisfy the inequality , where is the length of the message (bits) arrived at the corresponding link of the DMA graph. An example of a DMA graph is presented in Figure 2.1. The topology of DMA is composed of three source-components and connected to two mixing components and , last of which provides data streams to two sink components and . Weights at nodes and arcs denote computational and communication resource requirements respectively. 2.2 Model of the Distributed Computer System A DMA graph can be arbitrarily distributed over several nodes of a distributed computer system. Generally, set of computers on which a component can be assigned depends on whether the computer configuration has devices and resources required to perform multimedia functions needed by a certain component type. On the other hand, some of the source-components and/or sink-components can only be assigned to certain computers in advance, these components are called pre-attached ones. Let us consider a graph of a DCS. Every node is weighted by available computational resource (operations per second). If is the total computational resource of computer in the DCS and is the computational resource already used by all other applications processed in the DCS, then the available computational resource of computer is The graph representation of the DCS shows possible virtual channel connections (VC) between the computers of the DCS. A VC is a direct oriented logical connection between two computers b a c d e f g 1 3 2 4 1 3 2 3 4 3 3 3 3 Figure 2.1. An example of DMA graph i j , ( ) C ij Cij l i Li > Li a b , c d e f g n Rn Bn n bn n Rn Bn bn ? = 2 Problem Formulation 7 (endsystems) with some assigned capacity. A VC is routed over one or more communication resources of the DCS (physical links, networks) to achieve sender-computer to receiver-computer connectivity. The available capacity of a VC is equal to minimum available capacities of all DCS communication resources over which the VC is routed. Let be the available capacity of DCS communication resource ; be the set of DCS communication resources used by VC (n,m) from computer to computer of the DCS. Then the available capacity of the VC (n,m) is given by (2.1) Figure 2.2(a) illustrates an example of a DCS structure that is represented by the logical system graph shown in Figure 2.2(b). Here the capacities available for every computer pair connection are as follows: , , etc., where are available capacities of the output and input interfaces of computer , Further every arc (n,m) of the system graph represents corresponding VC (n,m) of the DCS and is weighted by available capacity (bits per second) of the VC . It is important to note, that communication resources in the DCS can be shared between different VCs. For example, the capacity of the LAN Ethernet does not belong to any pair of computers but is distributed among all computers of the LAN. The LAN provides a virtual channel between any pair of computers. Therefore, a system graph for the LAN is a logical graph representing all possible VCs between computers connected to the LAN (see Figure 2.3). Available capacity of the LAN transmission line is distributed among all data-exchanging computer pairs. Therefore the following inequality has to be satisfied: (2.2) The capacity is available for every possible VC in the system graph. It means that if the available capacity of the LAN, for example, is decreased by , then the available capacity of every possible virtual channel with decreases by the same amount. Equations (2.1) and (2.2) specifies the relationship between capacities of the communication resources of the DCS virtual channels, represented by arcs in the system graph. A s s rnm n m Rnm m i n A s s rnm ? , { } = R12 min A1 out A2 in ALAN1 , , { } = R13 m i n A1 ou t A3 i n ALAN1 AWAN ALAN2 , , , , { } = An out An i n , n Rnm n m , ( ) A 0 Rnm A ? n m , ( ) ? ? A a Rnm A = 2 Problem Formulation 8 WAN 1 2 3 4 a) LAN1 LAN2 1 3 2 4 LAN1 WAN LAN2 b) Figure 2.2, a) Communications in the DCS take place through various networks, b) Representation of computer communications through VCs in the system graph 1 2 3 4 LAN A 1 4 2 3 LAN A a) b) Figure 2.3,a,b) Connections in the DCS through local area network, 1 2 3 4 c) c) Representation of computer communications through VCs in the system graph R12 R21 2 Problem Formulation 9 2.3 Cost functions There are different ways to partition and map a DMA graph over the DCS graph. We should select the one that meets QoS requirement at minimal cost. In order to calculate the cost functions we must take into account the following: 1. The DCS is heterogeneous, i.e., the computers can differ with respect to their power and capacity of available resources, the virtual channels between the computers can be provided by various mediums of communication in the DCS. 2. Each component of the DMA can be implemented in different ways on different computers, e.g., by hardware, software, or a mixture of the two. Moreover different kinds of components can differ from each other in types and sizes of required computer resources (CPU slots, catche and disk memory, bus capacity, etc.) Thus cost of different permissible component allocations to computers will be represented by cost matrix f = {fni} with entries fni denoting cost of allocation of component i to computer n. Suppose a cost function gs(x) for every communication resource s of the DCS is given. Then the cost of mapping a DMA link (i,j) to virtual channel (n,m) of the DCS can be computed by the formula: where is set of communication resources, which channel (n,m) is routed over; is required communication capacity of the DMA link (i,j). 2.4 Problem Statement The general problem formulation of partitioning and mapping a DMA graph to a DCS graph is as follows. We are given the following information [2]: 1. A DMA graph with - a set of nodes (components or modules), - a set of directed arcs connecting components with each other, , - a required computational resource for every component - a required communication capacity for every link , 2. A DCS graph with - a set of nodes (each node represent a computer) , - a set of VCs (or simply channels) connecting computers with each other, , gnm i j gs d i j ( ) s p nm ? ? = pnm d i j h l l i j , ( ) i j h ? , , { } = d i i h ? dij i j , ( ) l ? z p p n m , ( ) n m z ? , , { } = 2 Problem Formulation 10 - an available (vacant) computational resource of every computer , - a set of communication resources in the DCS1, - a set of communication resources of the DCS used by channel , , - a set of channels routed over shared communication resource , , - a capacity of a communication resource s available to the mapped DMA, , - a set of acceptable locations of every component in the DCS , 3. Cost functions f - a cost matrix, an element fni of f specifies the cost of mapping component i to computer n, g - a cost matrix, an element gij of g specifies the cost of mapping DMA link (i,j) to virtual channels (n,m) of the DCS. The solution variables are such that , if component is assigned to computer , and , otherwise. The Partitioning and Mapping Problem can then be defined as: (2.3) subject to (2.4) (2.5) (2.6) where =0 if ; if and ; if . In this formulation, objective function minimizes the total cost of computational and communication resources used for the DMA assignment onto the DCS. The first term in the objective function identifies the cost of computer resources that are used to execute components of the DMA. The second term represents the cost of communication resources of channels on which DMA arcs are placed. Constraint set (2.4) guarantees that every component will be placed into the DCS and only onto one computer. Constraint set (2.5) guarantees that resources used by components assigned to a computer do not exceed the available resource of the computer. 1 Computer interfaces in the DCS can also be included into the set . Rn n z ? r r rnm n m , ( ) rnm r rnm n m , ( ) ? ? ? , ? ps s r ? p s s r ? ? p = As s r ? z i i h ? x i n xin 1 = i n xin 0 = F xin ( ) m i nxin x i n f n i xin n m , ( ) p ? ? i j, ( ) l ? ? x j mgnm i j + n z i ? ? i h ? ? ? ? ? ? ? ü = xin n zi ? ? 1 i h ? " , = x i nd i Rn n z ? " , ? i h ? ? xin i j, ( ) l ? ? x j md i j As s r ? " , ? n m , ( ) p s ? ? gnm i j n m = gnm i j 0 ? n m ? n m , ( ) p ? gnm - = n m , ( ) p ? F i h ? 3 Partitioning & Mapping Schemes 11 Constraint set (2.6) guarantees that capacity of communication resource in the DCS used by all DMA arcs placed on resource do not exceed the available capacity of the resource. 3 Partitioning & Mapping Schemes In this section we would describe efficient partitioning and mapping schemes which can be used for partitioning chains or tree structured distributed multimedia applications. We intend to extend the approach of our algorithm for series-parallel graphs which is another important class of applications in multimedia systems. The algorithm is derived from the earlier work of Bokhari [4, 5, 6, 7] and Iqbal [8, 10, 13, 14], who solved partitioning problems in sequential as well as parallel processing environments using distributed machines with dedicated communication links. Here we are concentrating on distributed computers interconnected using a general purpose LAN interconnection. As discussed before our objective is to minimize the sum of execution and communication costs with the constraint that the total communication requirement over the network is bounded. We work under the assumption that the available computing resources of every machine in the DCS are enough to satisfy the requirements, we provide an approximate solution to the partitioning and mapping problem taking into account only the communication constraint and minimizing the total cost. It is possible to design efficient heuristics based on this approach which can take into account the additional resource constraint on the load assigned to machines in the distributed computer system. 3.1 Partitioning Chain Structured Applications We show a chain structured application graph in Fig. 3.1. It consists of a source, a sink, and one or more intermediate components. Every data unit (video frame or audio sample) of media stream is generated, processed, and consumed by source, intermediate (e.g., a filter), and sink components respectively. Should a module resident on one machine communicates with a module resident on another computer, there will be an overhead of inter computer communication through the network. This will not only burden the communication resource but would also incur a cost proportional to the amounts of data transmitted between the two computers. The cost of transmitting data between two coresident modules is assumed to be negligible. This assumption can also be relaxed to take into account nonnegligible intracomputer communication costs. s s 3 Partitioning & Mapping Schemes 12 The Assignment Graph Given the invocation chain of Fig. 3.1, and the execution and communication costs, we can draw an assignment graph as shown in Fig. 3. 2. This figure assumes a three processor system. It is important to note that the following observations apply to the assignment graph: 1. There is one special node called the start node, denoted by s, and there is an end node denoted by t. 2. In addition to the start and the end nodes there are further nodes in the assignment graph, where each node is labelled with a pair of numbers (i,p), and represents the assignment of module i to processor p. 3. A directed path from the start node to the end node in this graph corresponds to a partitioning of the chain structured program over the distributed computer system. Similarly any partitioning of the chain can also be represented by a path in this graph as shown in Fig. 3.2 in bold. 4. There is an ordered pair, of weights attached with each edge. 5. All edges incident on the end node have zero weights on them, i.e., both elements of the ordered pair are zero. 6. The edge joining the start node to node (1,p) have an ordered pair, , associated with each edge where and . 1 2 3 4 5 6 Fig. 3.1 A chain structured Distributed Multimedia Application. M N ? a f p 1 = b 0 = 3 Partitioning & Mapping Schemes 13 11 21 31 41 51 61 12 13 22 23 32 33 42 43 52 53 62 63 Fig. 3.2 An assignment graph for the chain of Fig. 3.1 and a three computer system. Start Node End Node 3 Partitioning & Mapping Schemes 14 7. The edge joining node (i,p) to node (j,q) has an ordered pair, , where and . 8. It is important to note that each path from the start node to a node (i,p) in the assignment graph corresponds to an assignment of module 1 to module i onto the machines of the distributed computer system. The path is composed of a number of edges and if we now sum up the first elements (i.e. a's) of the ordered pair associated with each edge in the path then it would specify the total cost of assigning the first i modules of the chain structured application, this will include execution as well as communication costs. We denote this sum by F. 9. If we sum up the second elements (i.e. b's) of the ordered pairs associated with each edge in the path then the corresponding sum would specify the total communication requirements of assigning the first i modules of the application onto the distributed network system. We denote this sum by D. 10. Consider two distinct but partial assignments (or two paths between two nodes in the assignment graph) of the chain structured application. Let the two assignments are denoted by P1 and P2. If D(P1) as well as F(P1) is less than or equal to the respective values of D(P2) and F(P2) then the path P2 is just redundant and there is no need to consider the assignment corresponding to this path. So we just ignore the so called redundant path P2. Note that we are in a position to reject a number of partial paths which satisfy the above constraint without extending these paths to the start or the end nodes. This facility provides us to cut down the complexity of the problem and reduce our search from an exponential number of paths to a clean polynomial expression as described in the partitioning scheme. 11. Consider the possibility that D(P1) is less than D(P2) but F(P1) is larger than F(P2) then it is not possible to reject one path or the other as we are not sure which assignment would provide us best results, i.e., minimal cost of assignment such that the total communication requirement is bounded. We have to extend these paths to the start node on one side and the end node on the other side to make a final judgement, the information available until now is not sufficient to make a comparison. The Partitioning Scheme Let CT represents the maximum possible value of the communication requirement of assigning the chain structured application on the distributed computer network. It is evident that it would be equal to the sum of all the 's in the chain structured application. We now resolve CT to a f q j gpq i j + = b dij = d i j 3 Partitioning & Mapping Schemes 15 an accuracy of e, i.e., two adjacent levels for the communication requirement are separated by e. In other words the total communication requirement is restricted to have only CT /e distinct values in the range of zero to CT . Now consider two paths P1 and P2 arriving at a node (j,q) in the assignment graph. Two possibilities exist: 1. The total communication requirements of the two paths lie into two distinct levels. If D(P1) as well as F(P1) is less than or equal to the respective values of D(P2) and F(P2) then path P2 is just redundant and there is no need to consider the assignment corresponding to this path. If, however, D(P1) is less than D(P2) but F(P1) is larger than F(P2) then it is not possible to reject one path or the other as we are not sure which assignment would provide us best results, i.e., a minimal cost of assignment such that the total communication requirement is bounded. 2. The total communication requirement of the two paths lie in between two successive permissible levels. If D(P1) as well as F(P1) is less than or equal to the respective values of D(P2) and F(P2) then path P2 is just redundant and there is no need to consider the assignment corresponding to this path. But if D(P1) is less than D(P2) and F(P1) is larger than F(P2) even then it is possible to reject one path as follows. Here we reject the path P1 as the cost of assignment of P1 would always be larger than the cost of assignment corresponding to P2 with a guarantee that the error in the communication requirement of the selected path, i.e., P2 would be bounded by e. With the rejection techniques described above, the number of outgoing paths from a node would be restricted to CT/e. Thus the maximum number of incoming paths would also be restricted to CT/e from each processor. The total comparisons to be made at each node (j,q) of the assignment graph in order to reject unnecessary paths would be proportional to O(NCT/e) with the surety that the outgoing paths from the node would be restricted to at the most CT/e under worst case circumstances. As the number of modules in the chain structured application are equal to M then the total time complexity of the approximate scheme would be bounded by O(N2M CT /e) with the guarantee that the maximum difference between the total communication requirement of the partitioning we found and the actual communication requirement of the optimal partitioning is at the most equal to Me. If the relative error bound for the user interested to use the approximate scheme is then the time complexity of the algorithm would be equal to . The approximate technique that we have described is thus a fully polynomial time approximation e Me = O N2M2 CT e ? ( ) ( ) 3 Partitioning & Mapping Schemes 16 scheme in which the time complexity is a polynomial function of the size of the problem as well as the reciprocal of the total error . 3. The output of our algorithm would provide CT/e partitionings of the chain structured application over the distributed computer system with different costs and communication requirements. It will, for example, include the special case when the communication requirement is very small but then it might be very expensive, similarly it will also accommodate a case when the communication requirement is large but the total cost is relatively small. The user will have the freedom to select the partitioning which he can afford provided the communication requirement is below the capacity of the network. The accuracy of communication requirement resolution is determined by the relationship , where the relative error bound can be selected on the basis of acceptable quality of service degradation permissable by the user. The Resource Constraint on Load The additional resource constraint on the load assigned to every machine can easily be plugged into the approximate scheme but then it no longer remains a fully polynomial time approximation scheme but would become a heuristic. Consider the assignment graph once again as shown in Fig. 3.2. Suppose that the available computational resource for a computer x is Rx , i.e., we should not assign load to computer x more than Rx . Now consider only those paths in the assignment graphs between the start node and any intermediate node in which the load assigned to every machine is below or just equal to its capacity. There exists two possibilities: 1. If the number of paths, inspite of the resource constraint on load, grows exponentially then it justifies our assumption used in this paper that the available computing resources of every machine is adequate enough to satisfy the requirements. Under such conditions the number of paths or assignments can be limited by selecting the one with minimal cost (with some fast priority scheme) as described in the assignment scheme. 2. If, on the other hand, the number of paths which satisfies the load constraint does not grow exponentially then it is some thing to be happy about. Under such conditions we can easily select the assignemnt of minimal cost. In practise, however, the number of paths would initially have the tendency to grow exponentially but then the number would be limited due to the resource constraint on load. The actual e e e M ? = e 3 Partitioning & Mapping Schemes 17 behaviour would be determined by the particular application and how the heuristic is actually implemented. 3.2 Partitioning Tree Structured Applications We show a tree structured application graph in Fig. 3.3. Two media streams starts from source component 5 and 6. The first one is processed by filter 4 and then they are mixed by component 3 which provides the mixed stream through filter 2 to sink component 1. It is also possible to consider an application with more than one sink components, what is essential at this point is that the graph of the multimedia application should be a tree. It is, however, possible to extend the approach to series-parallel graphs [7] and perhaps to some other restricted structures. The Assignment Graph The assignment graph of the tree structured application of Fig. 3.3, is shown in Fig. 3.4, we assume a three processor distributed computer system. There are a number of similarities and a number of important differences between this assignment graph and the one shown in Fig. 3.2. We would concentrate on important similarities as well as differences: 1. There are multiple number of start or source nodes and similarly there are multiple sink or end nodes. 2. Each assignment tree corresponds to an application assignment and each application assignment corresponds to an assignment tree. One such assignment tree in the assignment graph is shown in bold in Fig. 3.4. 3. If we sum up the first elements (i.e. a's) of the ordered pair associated with each edge in the assignment tree then it would specify the total cost of assignment of the tree structured application, this will include execution as well as communication costs. We denote this sum by F. 4. If we sum up the second elements (i.e. b's) of the ordered pairs associated with each edge in the assignment tree then the corresponding sum would specify the total communication requirement of the said assignment. We denote this sum by D. 3 Partitioning & Mapping Schemes 18 The problem is then to find an assignment tree in the assignment graph with minimal cost in which the total communication requirement is bounded. The Procedure Merge We have already described how to partition a chain structured application over a distributed computer system such that the total cost of execution plus communication is minimal while the total communication requirement is within a certain bound. Here in this procedure we will discuss ways of utilizing the chain partitioning procedure to partition tree structured applications. Consider the tree shown in Fig. 3. 3, it is a tree and not a chain because the indegree of module 3 is 2. We will use the procedure Merge to handle such a situation. Consider the assignment graph again reproduced in Fig. 3.5. We can find approximate partitionings from the start node to every node in the layer, which corresponds to the module with an indegree more than 1. Here this would be layer 3, thus we have to find paths or partitionings from to each 2 1 4 5 3 6 Fig. 3.3 A tree structured Distributed Multimedia Application. s1 CT e ? s1 3 Partitioning & Mapping Schemes 19 node in the 3rd layer, i.e., node 31, 32, and to node 33 in the assignment graph. One such path or partitioning is shown, in bold, in Fig. 3.5 (top). We would have to make comparisons to find all the approximate partitionings from to every node in layer number 3 where k is the number of layers between and the layer number 3. 11 21 31 41 51 61 12 13 22 23 32 33 42 43 52 53 62 63 Fig. 3.4 An assignment graph for the tree of Fig. 3.3 and a three processor system. s1 s2 End Node N 2k CT e ? ( ) s1 s1 3 Partitioning & Mapping Schemes 20 Similarly we can find paths or partitionings from to each node in the 3rd layer, i.e., node 31, 32, and to node 33 in the assignment graph. One such path or partitioning is shown, in bold, in Fig. 3.5 (top). Still we have to make comparisons each time we find approximate partitionings from to node 31, 32, and to node 33 in the assignment graph. Node 32, for example, will have a number of paths coming from and a similar number of approximate paths coming from start node . Each path from to node 32 can be combined with a path from to node 32. By combination we mean that the two paths can be represented by an edge from a pseudo node s to node 32 as shown in Fig. 3.5 (bottom). Let L1 represents a path from to node 32, and let us represent a path from to node 32 by L2. The path (it is only an edge) from the pseudo node s to node 32 is represented by L3. Then we have the following equations: The number of edges (or paths) between the pseudo node s and the node 32 would be equal to , each edge corresponds to a combination of two paths, one path coming from and the other coming from . We can reduce these paths into by a technique very similar to the one we already used in the partitioning of chain structured programs. Let P1 and P2 are two edges (or two paths) between s and node 32. Assume that the total communication requirement of the two paths lie in between two successive permissible levels. If a(P1) as well as b(P1) is less than or equal to the respective values of a(P2) and b(P2) then the path P2 is just redundant and there is no need to consider the assignment corresponding to this path. But if b(P1) is less than b(P2) and a(P1) is larger than a(P2) then also it is possible to reject one path as follows. Here we reject the path P1 as the cost of assignment of P1 would always be larger than the cost of assignment corresponding to P2 with a guarantee that the error in the communication requirement of the selected path, i.e., P2 would be bounded by e. Each application of the procedure Merge will take time not larger than O(MN2 (CT/e)2) with the guarantee that the maximum difference between the total communication requirement of the partitioning we found and the actual communication requirement of the optimal partitioning is at the most equal to Me. If the relative error bound for the user interested to use the approximate scheme is then the time complexity of the algorithm would be equal to . CT e ? s2 N2k CT e ? ( ) s2 CT e ? s1 s2 s1 s2 s1 s2 a L3 ( ) F L1 ( ) F L2 ( ) + = b L3 ) ( ) D L1 ( ) D L2 ( ) + = CT e ? ( )2 s1 s2 CT e ? ( )2 CT e ? e O CT e ? ( )2N2M3 ( ) 3 Partitioning & Mapping Schemes 21 11 21 31 41 51 61 12 13 22 23 32 33 42 43 52 53 62 63 11 21 31 12 13 22 23 32 33 Fig. 3.5 A path from s1 to node 32 and a path from s2 to node 32 (top) are combined into a single path or edge from s to node 32 (bottom). s1 s s2 End Node 4 Conclusions 22 The approximate technique that we have described is thus a fully polynomial time approxima- tion scheme in which the time complexity is a polynomial function of the size of the problem as well as the reciprocal of the total error . 4 Conclusions Distributed Computing offers a solution where a network of heterogeneous computers can be used to solve large scale scientific and engineering problems. In theory, this approach can dramatically improve the performance because each component of the application is executed on an architecture that is best suited for it. In reality, however, a number of problems should be solved in order to utilize the full potential of the distributed system. The distributed computer system and the distributed application should be modeled in such a fashion that analytical research can provide solutions to the fundamental question of how to optimally partition the application across the machines in the distributed computer system. In this paper we have presented efficient schemes which can partition chain or tree structured Distributed Multimedia Applications consisting of several heterogeneous components across the distributed machines in the network. We have provided approximate solutions to the problem of partitioning chain and tree structures taking into account only the communication constraint and minimizing the total cost. It is possible to extend these techniques to less restricted structures and currently we are working on partitioning series-parallel structures which belongs to a yet another useful class of distributed applications. We also intend to find exact or approximate solutions to the general partitioning problem under further restrictions on the application problem. It is also possible to design efficient heuristics based on this approach which can take into account the additional resource constraint on the load assigned to machines in the distributed computer system. Acknowledgment: We acknowledge the motivation, encouragement and support provided by Asima Ashraf, Gabriel Dermler, Thomas Braunl, and Kurt Rothermel. e 5 References 23 5 References [1] A. Khokhar, V. K. Prasanna, M. Shabaan, and C. Wang, ?Heterogeneous Supercomputing: Problems & Issues,'' IEEE Computer, June 1993.. [2] Rothermel K., Barth I., Helbig T., ?CINEMA - an architecture for configurable distributed multimedia applications,'' Tech. Report 3/1994, Universität Stuttgart, Fakultät Informatik. [3] Hagin A., Dermler G., Rothermel K., ?Problem formulations, models and algorithms for mapping distributed multimedia applications to distributed computer systems,'' Tech. Report 3/ 1996, Universität Stuttgart, Fakultät Informatik. [4] S. H. Bokhari, ?Dual processor scheduling with dynamic reassignment,? IEEE Transactions on Software Engineering,? July 1979, Pages = 341-349. [5] S. H. Bokhari, ?On the mapping problem?, IEEE Transactions on Computers, March 1981, Pages 207-214. [6] S. H. Bokhari, ?A shortest tree algorithm for optimal assignments across space and time in a distributed processor system?, IEEE Transactions on Software Engineering, November 198 1, Pages=583-589. [7] S. H. Bokhari, ?Assignment problems in parallel and distributed computing?, Kluwer Academic Publishers, 1987. [8] M. Ashraf Iqbal,?Approximate algorithms for partitioning and assignment problems?, ICASE Report Number = ?86-40?, NASA Contractor Report 178130, June 1986. [9] M. Ashraf Iqbal, Joel H. Saltz and S. H. Bokhari, ?A comparative analysis of static and dynamic load balancing strategies?, Proceedings of the 1986 International Conference on Parallel Processing, August 1986. [10] M. Ashraf Iqbal and S.H. Bokhari, ?Efficient Algorithms for a class of Partitioning Problems,'' IEEE Trans. on Parallel & Distributed Systems, 1995. [11] M. Ashraf Iqbal, ?Approximate Algorithms for Partitioning Problems'', International Journal of Parallel Programming, October 1991. 5 References 24 [12] M. Ashraf Iqbal, ?Efficient Algorithms for Dilated Mapping of Binary Trees'', Journal of Parallel & Distributed Computing (JPDC), 1992. [13] M. Ashraf Iqbal, Saeed Iqbal, and M. E. Shabaan, ?Partitioning Image Processing Tasks on Heterogeneous Computer Systems, ''Proceedings of the Workshop on Heterogeneous Processing, April 1994. [14] M.Ashraf Iqbal & M.E. Shabaan, ?Heterogeneous Partitioning of Chain Structured Image Processing Tasks,''Worshop on Computer Architectures for Machine Perception (CAMP'93), NewOrleans, Louisiana. [15] Harold S. Stone, ?Multiprocessor scheduling with the aid of network flow algorithms?, IEEE Transactions on Software Engineering, January 1977, Pages = 85-93. [16] T.N. Bui & C. Jones, ?Parallel Algorithms for Partitioning Simple Classes of Graphs,'' International Conference on Parallel Processing, August 1990. [17] R.M. MacGregor, ?On Partitioning a Graph: a theoretical & empiric study'', Ph.D Thesis, University of California, Berkeley, 1978. [18] D. M. Nicol and D. R. O'Hallaron, ?Improved Algorithms for Mapping Pipelined and Parallel Computations,'' IEEE Trans., Computers, vol. 40, No. 3, 1990. [19] D. F. Towsley, ?Allocating programs containing branches and loops within a multiple proccesser system, '' IEEE Trans. on Software Engineering, Oct., 1986, pp. 272-277.