Universität Stuttgart

Fakultät Informatik

I V P R

Fakultät Informatik
Institut für Parallele und
Verteilte Höchstleistungsrechner
Universität Stuttgart
Breitwiesenstraße 20 - 22
D-70565 Stuttgart

Partitioning and Mapping
Techniques for Distributed
Multimedia Applications

Partitioning and Mapping Techniques for

Distributed Multimedia Applications

M. Ashraf Iqbal, Alexander Hagin

CR-Klassification: C.2.4, C.4, G.1.6, G2.2, I6

Fakultätsbericht 14/1996
Technical Report
Dezember 1996


Partitioning and Mapping Techniques for

Distributed Multimedia Applications

M. Ashraf Iqbal1
Institute of Parallel & Distributed Systems (IPVR)
Breitwiesenstr. 20-22, D-70565 Stuttgart, Germany
University of Engineering & Technology, Lahore 54890, Pakistan

Alexander Hagin
Institute of Parallel & Distributed Systems (IPVR)
Breitwiesenstr. 20-22, D-70565 Stuttgart, Germany
St. Petersburg State Technical University, St. Petersburg, Russia

Abstract

Distributed Computer Systems have become competitive in providing large 
amounts of computational power at a very low cost. The system consisting 
of personal computers, mini or main frames, or high performance 
multiprocessors, integrated into a high speed computer network is 
capable of providing any organization the power of a super machine with 
only a small initial cost. Such a system has the further benefit of 
providing industry with an easy and modular upgrade path because 
increasing the power of the system simply involves increasing the number 
of networked computers. Distributed Multimedia Applications (DMA) are so 
rich in their diversity of methodology and so inherently computational 
intensive that they naturally require a very heterogeneous mix of 
distributed processors interconnected by an efficient interconnection 
network. A DMA can be represented by a precedence graph, where nodes 
represent components (or modules) of the application, interconnected by 
arcs representing data streams flowing between different components. In 
this paper we study the problem of partitioning and mapping Distributed 
Multimedia Application graphs onto a heterogeneous distributed computer 
system. Such applications require a compromise between quality of 
service and cost of the utilized resources of the distributed computer 
system. As the problem is difficult to solve, in general, we present an 
approximate scheme which optimally assigns task modules of the 
application onto the processors of the distributed system.

1 This research was supported by a research grant of the University of 
Engineering & Technology, Lahore, Pakistan. Additional support was  
provided by the German Academic Exchange Service (DAAD).

1 Introduction 2

1 Introduction

Recent research in parallel and distributed processing technology has 
resulted in many advances in all directions of computing technology, 
e.g., in device technology, networking capability, and in software 
engineering [1]. Research in device technology has resulted in faster 
and more powerful processors. Advances in computer network capability 
have introduced powerful, faster, and more reliable networks. Research 
in software has provided user friendly tools and environments. These 
advances have the potential to satisfy the ever growing computation 
needs of many scientific and engineering applications, e.g., now it has 
become economically possible to process digital audio and video signals 
in real time leading to the development of distributed multimedia 
systems.

Distributed Multimedia Applications (DMA) demands an efficient framework 
for its implementation over a Distributed Computer System (DCS). In fact 
DMA problems are so rich in their diversity of methodology and so 
inherently computational intensive that they naturally require a very 
heterogeneous mix of distributed processors interconnected by an 
efficient network. A DMA can be represented by a precedence graph, where 
nodes represent components (or modules), interconnected by arcs 
representing data streams flowing between different components of the 
application. Each node of the graph is weighted by computation 
requirements of the corresponding component and each arc is weighted by 
the channel capacity needed for the communication between adjacent 
components. Multimedia streams can originate at multiple sources, 
traverse a number of intermediate components and end at multiple sinks 
[2,3].

The Distributed Computer System (DCS) is a heterogeneous mix of mini, 
microcomputers, or workstations interconnected through a point to point 
physical link, Local Area Network (LAN) and or a Wide Area Network 
(WAN). The DCS is also represented by a graph where nodes represents 
individual machines while edges represents virtual channels of the DCS. 
Each node of this graph has a weight associated with the available 
computation capacity. Similarly there is a weight associated with each 
edge, it signifies the available capacity of the corresponding channel 
[3, 4, 5].

By partitioning the application task onto different machines that 
communicate over the network, components or stages of DMA can be 
executed simultaneously on the machines to which they are best suited in 
terms of cost and time constraints of execution [6, 7, 8]. Thus a 
network of distributed machines may be able to provide an optimal 
performance for such an applications. Efficient utilization of such an 
approach depends on a number of issues like modeling the DCS as well as 
the DMA for algorithm development, design of partitioning and mapping 
strategies,

1 Introduction 3

and integrating these strategies into existing programming systems to 
solve computationally intensive problems [1, 6, 7]. We would like to 
address some of these problems in this paper, in particular we shall 
attack the following problem: Given a set of M components of the 
multimedia application connected in some fashion, and a distributed 
computer system consisting of different machines, find an assignment of 
components to processors that minimizes the cost of using the 
computational as well as communicational resources such that the load on 
every machine is bounded by a fixed number while the total communication 
overhead on the network is also kept below its capacity. Note that the 
last constraint, i.e., the total communication requirement should be 
kept below the total capacity of the network, makes it possible for us 
to model the DCS graph as a completely connected structure. We should 
also try to keep the load on every machine in DCS below a certain level 
in order to provide an adequate quality of service for the multimedia 
applications.

If the number of processors are only two and we want to minimize the 
total cost of execution plus communication ( i.e., without any other 
constraints) then it is possible to solve this problem using the network 
flow approach pioneered by Stone [7, 15]. If the interconnection 
structure of the DMA graph is chain or tree like, it is still possible 
to solve the problem for an arbitrary number of processors using a 
shortest tree algorithm designed by Bokhari [6, 7]. Towsley [7, 19] has 
shown how to find an optimal partitioning for a series-parallel graph 
using a series of graph transformations. It is important to notice that 
the general partitioning and mapping problem is very difficult to solve. 
Some of the researchers have thus solved the partitioning problem by 
putting constraints on the structure of the application graph while 
others have found an optimal solution by restricting the number of 
processors. In [3] an approach based on branch and bound method is 
proposed, however, the algorithm complexity restricts the dimensions of 
DMA and DCS that can be handled by the algorithm.

If, on the other hand, the problem is to minimize the load on the most 
heavily loaded processor in a conventional parallel processing 
environment then the problem, in general, is very difficult to solve 
using exact algorithms and this explains why most of the work in this 
field focused on heuristic techniques [5, 9]. If the structure of the 
application graph is as simple as a chain and the number of processors 
are confined to only two even then the problem is difficult to handle as 
reported in [10, 11, 13, 14]. A number of researchers have, however, 
attempted to solve the problem under certain restrictions on possible 
partitionings as described in [7, 10, 13]. Some of the graph theoretical 
research, conducted in the past was directed to find a general method 
for partitioning the vertices of a graph into two sets of prescribed 
sizes by the removal of minimum number of edges [16]. A number of 
researchers have also developed parallel algorithms for partitioning 
series-parallel and bandwidth-k graphs [17]. Solutions to such problems 
are also help-

1 Introduction 4

ful in designing partitioning algorithms or approximate schemes for 
parallel and distributed computer systems.

Approximate techniques for partitioning chain and tree structured image 
processing tasks onto heterogeneous distributed computer systems have 
also been reported in [13, 14]. Iqbal [8, 11] have devised fully 
polynomial time approximation schemes in order to solve the partitioning 
and mapping problem approximately. The time complexity of such an scheme 
is polynomial in both the size of the problem as well as in 1/e, where e 
is the relative error bound for the approximate scheme. In order to 
appreciate the usefulness of the approximate solutions, one should bear 
in mind that data for the problem being solved is often only 
approximately known as the procedures of code type profiling and 
analytical benchmarking are still in their infancy. Hence approximate 
solutions may be as meaningful as an exact solution for many of the 
practical problems where the extra accuracy of the exact solution is not 
needed and where the approximate solution can be obtained in a 
relatively short time [8, 11].

In this paper we would consider the problem of partitioning DMA whose 
graphs are restricted to chain or tree like structures, and would 
present approximate solutions. The algorithms, that we present here are 
derived from the earlier work of Bokhari [4, 6, 7] and Iqbal [8, 10, 13, 
14] who solved partitioning problems in sequential as well as in 
parallel processing environments using distributed machines with 
dedicated communication links. It is important to note that now we are 
concentrating on distributed computers interconnected using a general 
purpose LAN interconnection which, according to our opinion, would 
become a cost effective, efficient and commonly used resource in the 
coming future.

The paper is organized as follows. The next section formulates the 
partitioning and mapping model and defines the cost functions addressed 
in this paper. We also model the DMA and the DCS in this section. 
Section 3 addresses the partitioning problem. The results and 
conclusions are summarized in Section 4.


2 Problem Formulation 5

2 Problem Formulation

2.1  Model of the Distributed Multimedia Applications

Distributed multimedia applications are employed to generate, process, 
and consume continuous (e.g. audio, video) data streams. DMA topology 
can be constructed by specifying components interconnected via links. 
Components encapsulate processing of multimedia data, e.g., for 
generating (source components), consuming (sink components) or 
manipulating (filters and mixers) data. A component is an individually 
schedulable unit (e.g., by mapping to a thread). A link provides an 
abstraction from underlying communication mechanisms which may be used 
to perform the transport of data units.

To provide a uniform data access point for the components, ports are 
used that deliver data units to the component (input port) or take the 
data units from the component (output port). A component designer has to 
associate with each component port the streamtype to be used, thus 
making all related information available at the port.

A DMA can be represented by one or more precedence graphs [3]. In a DMA 
graph, nodes represent components that are interconnected by arcs 
representing data streams between components. Each component is 
associated with at least one device that produces (a source component) 
or processes (an intermediate component - filter or mixer) or consumes 
(a sink component) data streams. Media streams can originate at multiple 
sources, traverse a number of intermediate components and end at 
multiple sinks.

Before using an application, desired user QoS (Quality of Service) is 
specified with respect to output data generated by sink components (e.g. 
presented video frame size and rate). To guarantee the specified QoS 
requirements, corresponding resources for DMA components and links 
mapped to a distributed computer system have to be reserved. Thus, each 
node of the application graph is weighted by computational requirements 
of the corresponding component and each arc is weighted by the channel 
capacity needed for remote communication between adjacent components.

Let us consider a DMA graph and determine the values of node and arc 
weights. Every node is weighted by computational requirements of the 
corresponding component of the DMA. Let denote the arrival rate 
(messages per second) of input data streams to component  in the DMA 
graph. To process every message, component  needs  processor operations. 
Let  denotes the component computational requirement (operations per 
second). To exclude unlimited queue of the messages, it is necessary 
that .

l i
i
i V i Ci

C i liV i
>


2 Problem Formulation 6

Every arc  in the DMA graph is weighted by capacity requirement  (bits 
per second) that must satisfy the inequality , where  is the length of 
the message (bits) arrived at the corresponding link of the DMA graph.

An example of a DMA graph is presented in Figure 2.1. The topology of 
DMA is composed of three source-components  and  connected to two mixing 
components  and , last of which provides data streams to two sink 
components  and . Weights at nodes and arcs denote computational and 
communication resource requirements respectively.

2.2  Model of the Distributed Computer System

A DMA graph can be arbitrarily distributed over several nodes of a 
distributed computer system. Generally, set of computers on which a 
component  can be assigned depends on whether the computer configuration 
has devices and resources required to perform multimedia functions 
needed by a certain component type. On the other hand, some of the 
source-components and/or sink-components can only be assigned to certain 
computers in advance, these components are called pre-attached ones.

Let us consider a graph of a DCS. Every node  is weighted by available 
computational resource  (operations per second). If  is the total 
computational resource of computer in the DCS and  is the computational 
resource already used by all other applications processed in the DCS, 
then the available computational resource of computer  is

The graph representation of the DCS shows possible virtual channel 
connections (VC) between the computers of the DCS. A VC is a direct 
oriented logical connection between two computers

b

a

c

d

e

f

g

1

3

2

4

1

3

2

3
4 3

3

3

3

Figure 2.1. An example of DMA graph

i j
,
( ) C ij

Cij l i Li
> Li

a b
, c d e
f g

n
Rn Bn n

bn
n Rn Bn bn
?
=


2 Problem Formulation 7

(endsystems) with some assigned capacity. A VC is routed over one or 
more communication resources of the DCS (physical links, networks) to 
achieve sender-computer to receiver-computer connectivity. The available 
capacity of a VC is equal to minimum available capacities of all DCS 
communication resources over which the VC is routed. Let  be the 
available capacity of DCS communication resource ;  be the set of DCS 
communication resources used by VC (n,m) from computer  to computer  of 
the DCS. Then the available capacity of the VC (n,m) is given by

(2.1)

Figure 2.2(a) illustrates an example of a DCS structure that is 
represented by the logical system graph shown in Figure 2.2(b). Here the 
capacities available for every computer pair connection are as follows:

, , etc.,

where  are available capacities of the output and input interfaces of 
computer ,

Further every arc (n,m) of the system graph represents corresponding VC 
(n,m) of the DCS and is weighted by available capacity  (bits per 
second) of the VC .

It is important to note, that communication resources in the DCS can be 
shared between different VCs. For example, the capacity of the LAN 
Ethernet does not belong to any pair of computers but is distributed 
among all computers of the LAN. The LAN provides a virtual channel 
between any pair of computers. Therefore, a system graph for the LAN is 
a logical graph representing all possible VCs between computers 
connected to the LAN (see Figure 2.3). Available capacity  of the LAN 
transmission line is distributed among all data-exchanging computer 
pairs. Therefore the following inequality has to be satisfied:

(2.2)

The capacity  is available for every possible VC in the system graph. It 
means that if the available capacity of the LAN, for example, is 
decreased by , then the available capacity of every possible virtual 
channel with  decreases by the same amount. Equations (2.1) and (2.2) 
specifies the relationship between capacities of the communication 
resources of the DCS virtual channels, represented by arcs in the system 
graph.

A s
s rnm
n m

Rnm m i n A s s rnm
?
,
{ }
=

R12 min A1
out A2
in ALAN1
, ,
{ }
= R13 m i n A1
ou t A3
i n ALAN1 AWAN ALAN2
, ,
, ,
{ }
=

An
out An
i n
, n

Rnm n m
,
( )

A

0 Rnm A
?
n m
,
( )
?
?

A
a
Rnm A
=


2 Problem Formulation 8

WAN

1 2

3 4

a)

LAN1

LAN2

1 3

2 4

LAN1 WAN LAN2

b)

Figure 2.2, a) Communications in the DCS take place through various 
networks, b) Representation of computer communications through VCs in 
the system graph

1 2 3 4

LAN
A

1

4 2

3

LAN

A

a) b)

Figure 2.3,a,b) Connections in the DCS through local area network,

1

2

3

4

c)

c) Representation of computer communications through VCs in the system 
graph

R12

R21


2 Problem Formulation 9

2.3  Cost functions

There are different ways to partition and map a DMA graph over the DCS 
graph. We should select the one that meets QoS requirement at minimal 
cost. In order to calculate the cost functions we must take into account 
the following:

1. The DCS is heterogeneous, i.e., the computers can differ with respect 
to their power and capacity of available resources, the virtual channels 
between the computers can be provided by various mediums of 
communication in the DCS.

2. Each component of the DMA can be implemented in different ways on 
different computers, e.g., by hardware, software, or a mixture of the 
two. Moreover different kinds of components can differ from each other 
in types and sizes of required computer resources (CPU slots, catche and 
disk memory, bus capacity, etc.)

Thus cost of different permissible component allocations to computers 
will be represented by cost matrix f = {fni} with entries fni  denoting 
cost of allocation of component i to computer n.

Suppose a cost function gs(x) for every communication resource s of the 
DCS is given. Then the cost of mapping a DMA link (i,j) to virtual 
channel (n,m) of the DCS can be computed by the formula:

where  is set of communication resources, which channel (n,m) is routed 
over;  is required communication capacity of the DMA link (i,j).

2.4  Problem Statement

The general problem formulation of partitioning and mapping a DMA graph 
to a DCS graph is as follows. We are given the following information 
[2]:

1. A DMA graph with

- a set of nodes (components or modules),

- a set of directed arcs connecting components with each other,
,

- a required computational resource for every component

- a required communication capacity for every link ,

2. A DCS graph with

- a set of nodes (each node represent a computer) ,

- a set of VCs (or simply channels) connecting computers with each 
other, ,

gnm
i j gs d i j
( )
s p nm
?
?
=

pnm d i j

h

l
l i j
,
( ) i j h
?
, ,
{ }
=

d i i h
?

dij i j
,
( ) l
?

z

p
p n m
,
( ) n m z
?
, ,
{ }
=


2 Problem Formulation 10

- an available (vacant) computational resource of every computer ,

- a set of communication resources in the DCS1,

- a set of communication resources of the DCS  used by channel , ,

- a set of channels routed over shared communication resource , ,

- a capacity of a communication resource s available to the mapped DMA, 
,

- a set of acceptable locations of every component  in the DCS ,

3. Cost functions

f - a cost matrix, an element fni of f specifies the cost of mapping 
component i to computer n,

g - a cost matrix, an element gij of g specifies the cost of mapping DMA 
link (i,j) to virtual channels (n,m) of the DCS.

The solution variables are  such that , if component  is assigned to 
computer , and , otherwise.

The Partitioning and Mapping Problem can then be defined as:

 (2.3)

subject to

(2.4)

(2.5)

(2.6)

where =0 if ;  if  and ;  if .

In this formulation, objective function  minimizes the total cost of 
computational and communication resources used for the DMA assignment 
onto the DCS. The first term in the objective function identifies the 
cost of computer resources that are used to execute components of the 
DMA. The second term represents the cost of communication resources of 
channels on which DMA arcs are placed.

Constraint set (2.4) guarantees that every component  will be placed 
into the DCS and only onto one computer. Constraint set (2.5) guarantees 
that resources used by components assigned to a computer do not exceed 
the available resource of the computer.

1 Computer interfaces in the DCS can also be included into the set .

Rn n z ?

r

r

rnm n m ,
( )
rnm r rnm
n m
,
( ) ?
?
?
,
?

ps s r ? p s
s r
?
? p
=

As s r ?

z i i h
?

x i n xin 1
= i n
xin 0
=

F xin
( ) m i nxin x i n f n
i xin
n m
,
( ) p
?
?
i j,
( ) l
?
? x j mgnm
i j
+
n z i
?
?
i h
?
?
? ?
? ?
? ü
=

xin
n zi
?
? 1 i h
?
"
,
=

x i nd i Rn n z
?
"
,
?
i h
?
?

xin
i j,
( ) l
?
? x j md i j As s r
?
"
,
?
n m
,
( ) p s
?
?

gnm
i j n m
= gnm
i j 0
? n m
? n m
,
( ) p
? gnm -
= n m
,
( ) p
?

F

i h
?


3 Partitioning & Mapping Schemes 11

Constraint set (2.6) guarantees that capacity of communication resource  
in the DCS used by all DMA arcs placed on resource  do not exceed the 
available capacity of the resource.

3 Partitioning & Mapping Schemes

In this section we would describe efficient partitioning and mapping 
schemes which can be used for partitioning chains or tree structured 
distributed multimedia applications. We intend to extend the approach of 
our algorithm for series-parallel graphs which is another important 
class of applications in multimedia systems. The algorithm is derived 
from the earlier work of Bokhari [4, 5, 6, 7] and Iqbal [8, 10, 13, 14], 
who solved partitioning problems in sequential as well as parallel 
processing environments using distributed machines with dedicated 
communication links. Here we are concentrating on distributed computers 
interconnected using a general purpose LAN interconnection. As discussed 
before our objective is to minimize the sum of execution and 
communication costs with the constraint that the total communication 
requirement over the network is bounded. We work under the assumption 
that the available computing resources of every machine in the DCS are 
enough to satisfy the requirements, we provide an approximate solution 
to the partitioning and mapping problem taking into account only the 
communication constraint and minimizing the total cost. It is possible 
to design efficient heuristics based on this approach which can take 
into account the additional resource constraint on the load assigned to 
machines in the distributed computer system.

3.1  Partitioning Chain Structured Applications

We show a chain structured application graph in Fig. 3.1. It consists of 
a source, a sink, and one or more intermediate components. Every data 
unit (video frame or audio sample) of media stream is generated, 
processed, and consumed by source, intermediate (e.g., a filter), and 
sink components respectively. Should a module resident on one machine 
communicates with a module resident on another computer, there will be 
an overhead of inter computer communication through the network. This 
will not only burden the communication resource but would also incur a 
cost proportional to the amounts of data transmitted between the two 
computers. The cost of transmitting data between two coresident modules 
is assumed to be negligible. This assumption can also be relaxed to take 
into account nonnegligible intracomputer communication costs.

s
s


3 Partitioning & Mapping Schemes 12

The Assignment Graph

Given the invocation chain of Fig. 3.1, and the execution and 
communication costs, we can draw an assignment graph as shown in Fig. 3. 
2. This figure assumes a three processor system. It is important to note 
that the following observations apply to the assignment graph:

1. There is one special node called the start node, denoted by s, and 
there is an end node denoted by t.

2. In addition to the start and the end nodes there are  further nodes 
in the assignment graph, where each node is labelled with a pair of 
numbers (i,p), and represents the assignment of module i to processor p.

3. A directed path from the start node to the end node in this graph 
corresponds to a partitioning of the chain structured program over the 
distributed computer system. Similarly any partitioning of the chain can 
also be represented by a path in this graph as shown in Fig. 3.2 in 
bold.

4. There is an ordered pair, <a,b> of weights attached with each edge.

5. All edges incident on the end node have zero weights on them, i.e., 
both elements of the ordered pair are zero.

6. The edge joining the start node to node (1,p) have an ordered pair, 
<a,b>, associated with each edge where  and .

1 2 3 4 5 6

Fig. 3.1 A chain structured Distributed Multimedia Application.

M N
?

a f p
1
= b 0
=


3 Partitioning & Mapping Schemes 13

11

21

31

41

51

61

12 13

22 23

32 33

42 43

52 53

62 63

Fig. 3.2 An assignment graph for the chain of Fig. 3.1 and a three 
computer system.

Start Node

End Node


3 Partitioning & Mapping Schemes 14

7. The edge joining node (i,p) to node (j,q) has an ordered pair, <a,b>, 
where and .

8. It is important to note that each path from the start node to a node 
(i,p) in the assignment graph corresponds to an assignment of module 1 
to module i onto the machines of the distributed computer system. The 
path is composed of a number of edges and if we now sum up the first 
elements (i.e. a's) of the ordered pair associated with each edge in the 
path then it would specify the total cost of assigning the first i 
modules of the chain structured application, this will include execution 
as well as communication costs. We denote this sum by F.

9. If we sum up the second elements (i.e. b's) of the ordered pairs 
associated with each edge in the path then the corresponding sum would 
specify the total communication requirements of assigning the first i 
modules of the application onto the distributed network system. We 
denote this sum by D.

10. Consider two distinct but partial assignments (or two paths between 
two nodes in the assignment graph) of the chain structured application. 
Let the two assignments are denoted by P1 and P2. If D(P1) as well as 
F(P1) is less than or equal to the respective values of D(P2) and F(P2) 
then the path P2 is just redundant and there is no need to consider the 
assignment corresponding to this path. So we just ignore the so called 
redundant path P2. Note that we are in a position to reject a number of 
partial paths which satisfy the above constraint without extending these 
paths to the start or the end nodes. This facility provides us to cut 
down the complexity of the problem and reduce our search from an 
exponential number of paths to a clean polynomial expression as 
described in the partitioning scheme.

11. Consider the possibility that D(P1) is less than D(P2) but F(P1) is 
larger than F(P2) then it is not possible to reject one path or the 
other as we are not sure which assignment would provide us best results, 
i.e., minimal cost of assignment such that the total communication 
requirement is bounded. We have to extend these paths to the start node 
on one side and the end node on the other side to make a final 
judgement, the information available until now is not sufficient to make 
a comparison.

The Partitioning Scheme

Let CT represents the maximum possible value of the communication 
requirement of assigning the chain structured application on the 
distributed computer network. It is evident that it would be equal to 
the sum of all the  's in the chain structured application. We now 
resolve CT to

a f q
j gpq
i j
+
=
b dij
=

d i j


3 Partitioning & Mapping Schemes 15

an accuracy of e, i.e., two adjacent levels for the communication 
requirement are separated by e. In other words the total communication 
requirement is restricted to have only CT /e distinct values in the 
range of zero to CT . Now consider two paths P1 and P2 arriving at a 
node (j,q) in the assignment graph. Two possibilities exist:

1. The total communication requirements of the two paths lie into two 
distinct levels. If D(P1) as well as F(P1) is less than or equal to the 
respective values of D(P2) and F(P2) then path P2 is just redundant and 
there is no need to consider the assignment corresponding to this path. 
If, however, D(P1) is less than D(P2) but F(P1) is larger than F(P2) 
then it is not possible to reject one path or the other as we are not 
sure which assignment would provide us best results, i.e., a minimal 
cost of assignment such that the total communication requirement is 
bounded.

2. The total communication requirement of the two paths lie in between 
two successive permissible levels. If D(P1) as well as F(P1) is less 
than or equal to the respective values of D(P2) and F(P2) then path P2 
is just redundant and there is no need to consider the assignment 
corresponding to this path. But if D(P1) is less than D(P2) and F(P1) is 
larger than F(P2) even then it is possible to reject one path as 
follows. Here we reject the path P1 as the cost of assignment of P1 
would always be larger than the cost of assignment corresponding to P2 
with a guarantee that the error in the communication requirement of the 
selected path, i.e., P2 would be bounded by e.

With the rejection techniques described above, the number of outgoing 
paths from a node would be restricted to CT/e. Thus the maximum number 
of incoming paths would also be restricted to CT/e from each processor. 
The total comparisons to be made at each node (j,q) of the assignment 
graph in order to reject unnecessary paths would be proportional to 
O(NCT/e) with the surety that the outgoing paths from the node would be 
restricted to at the most CT/e under worst case circumstances.

As the number of modules in the chain structured application are equal 
to M then the total time complexity of the approximate scheme would be 
bounded by O(N2M CT /e) with the guarantee that the maximum difference 
between the total communication requirement of the partitioning we found 
and the actual communication requirement of the optimal partitioning is 
at the most equal to Me. If the relative error bound for the user 
interested to use the approximate scheme is  then the time complexity of 
the algorithm would be equal to . The approximate technique that we have 
described is thus a fully polynomial time approximation e Me
= O N2M2 CT e ?
( )
( )


3 Partitioning & Mapping Schemes 16

scheme in which the time complexity is a polynomial function of the size 
of the problem as well as the reciprocal of the total error .

3. The output of our algorithm would provide CT/e partitionings of the 
chain structured application over the distributed computer system with 
different costs and communication requirements. It will, for example, 
include the special case when the communication requirement is very 
small but then it might be very expensive, similarly it will also 
accommodate a case when the communication requirement is large but the 
total cost is relatively small. The user will have the freedom to select 
the partitioning which he can afford provided the communication 
requirement is below the capacity of the network. The accuracy of 
communication requirement resolution is determined by the relationship , 
where the relative error bound  can be selected on the basis of 
acceptable quality of service degradation permissable by the user.

The Resource Constraint on Load

The additional resource constraint on the load assigned to every machine 
can easily be plugged into the approximate scheme but then it no longer 
remains a fully polynomial time approximation scheme but would become a 
heuristic. Consider the assignment graph once again as shown in Fig. 
3.2. Suppose that the available computational resource for a computer x 
is Rx , i.e., we should not assign load to computer x more than Rx . Now 
consider only those paths in the assignment graphs between the start 
node and any intermediate node in which the load assigned to every 
machine is below or just equal to its capacity. There exists two 
possibilities:

1. If the number of paths, inspite of the resource constraint on load, 
grows exponentially then it justifies our assumption used in this paper 
that the available computing resources of every machine is adequate 
enough to satisfy the requirements. Under such conditions the number of 
paths or assignments can be limited by selecting the one with minimal 
cost (with some fast priority scheme) as described in the assignment 
scheme.

2. If, on the other hand, the number of paths which satisfies the load 
constraint does not grow exponentially then it is some thing to be happy 
about. Under such conditions we can easily select the assignemnt of 
minimal cost.

In practise, however, the number of paths would initially have the 
tendency to grow exponentially but then the number would be limited due 
to the resource constraint on load. The actual

e

e e M
?
=
e


3 Partitioning & Mapping Schemes 17

behaviour would be determined by the particular application and how the 
heuristic is actually implemented.

3.2  Partitioning Tree Structured Applications

We show a tree structured application graph in Fig. 3.3. Two media 
streams starts from source component 5 and 6. The first one is processed 
by filter 4 and then they are mixed by component 3 which provides the 
mixed stream through filter 2 to sink component 1. It is also possible 
to consider an application with more than one sink components, what is 
essential at this point is that the graph of the multimedia application 
should be a tree. It is, however, possible to extend the approach to 
series-parallel graphs [7] and perhaps to some other restricted 
structures.

The Assignment Graph

The assignment graph of the tree structured application of Fig. 3.3, is 
shown in Fig. 3.4, we assume a three processor distributed computer 
system. There are a number of similarities and a number of important 
differences between this assignment graph and the one shown in Fig. 3.2. 
We would concentrate on important similarities as well as differences:

1. There are multiple number of start or source nodes and similarly 
there are multiple sink or end nodes.

2. Each assignment tree corresponds to an application assignment and 
each application assignment corresponds to an assignment tree. One such 
assignment tree in the assignment graph is shown in bold in Fig. 3.4.

3. If we sum up the first elements (i.e. a's) of the ordered pair 
associated with each edge in the assignment tree then it would specify 
the total cost of assignment of the tree structured application, this 
will include execution as well as communication costs. We denote this 
sum by F.

4. If we sum up the second elements (i.e. b's) of the ordered pairs 
associated with each edge in the assignment tree then the corresponding 
sum would specify the total communication requirement of the said 
assignment. We denote this sum by D.


3 Partitioning & Mapping Schemes 18

The problem is then to find an assignment tree in the assignment graph 
with minimal cost in which the total communication requirement is 
bounded.

The Procedure Merge

We have already described how to partition a chain structured 
application over a distributed computer system such that the total cost 
of execution plus communication is minimal while the total communication 
requirement is within a certain bound. Here in this procedure we will 
discuss ways of utilizing the chain partitioning procedure to partition 
tree structured applications. Consider the tree shown in Fig. 3. 3, it 
is a tree and not a chain because the indegree of module 3 is 2. We will 
use the procedure Merge to handle such a situation. Consider the 
assignment graph again reproduced in Fig. 3.5. We can find approximate 
partitionings from the start node  to every node in the layer, which 
corresponds to the module with an indegree more than 1. Here this would 
be layer 3, thus we have to find paths or partitionings from  to each

2

1

4

5

3

6

Fig. 3.3 A tree structured Distributed Multimedia Application.

s1
CT e
? s1


3 Partitioning & Mapping Schemes 19

node in the 3rd layer, i.e., node 31, 32, and to node 33 in the 
assignment graph. One such path or partitioning is shown, in bold, in 
Fig. 3.5 (top). We would have to make  comparisons to find all the 
approximate partitionings from  to every node in layer number 3 where k 
is the number of layers between  and the layer number 3.

11

21

31

41

51

61

12 13

22 23

32 33

42

43
52

53

62

63

Fig. 3.4 An assignment graph for the tree of Fig. 3.3 and a three 
processor system.

s1

s2

End Node

N 2k CT e
?
( )

s1
s1


3 Partitioning & Mapping Schemes 20

Similarly we can find  paths or partitionings from  to each node in the 
3rd layer, i.e., node 31, 32, and to node 33 in the assignment graph. 
One such path or partitioning is shown, in bold, in Fig. 3.5 (top). 
Still we have to make  comparisons each time we find approximate 
partitionings from  to node 31, 32, and to node 33 in the assignment 
graph. Node 32, for example, will have a number of  paths coming from  
and a similar number of approximate paths coming from start node . Each 
path from  to node 32 can be combined with a path from  to node 32. By 
combination we mean that the two paths can be represented by an edge 
from a pseudo node s to node 32 as shown in Fig. 3.5 (bottom). Let L1 
represents a path from  to node 32, and let us represent a path from  to 
node 32 by L2. The path (it is only an edge) from the pseudo node s to 
node 32 is represented by L3. Then we have the following equations:

The number of edges (or paths) between the pseudo node s and the node 32 
would be equal to , each edge corresponds to a combination of two paths, 
one path coming from  and the other coming from . We can reduce these  
paths into  by a technique very similar to the one we already used in 
the partitioning of chain structured programs. Let P1 and P2 are two 
edges (or two paths) between s and node 32. Assume that the total 
communication requirement of the two paths lie in between two successive 
permissible levels. If a(P1) as well as b(P1) is less than or equal to 
the respective values of a(P2) and b(P2) then the path P2 is just 
redundant and there is no need to consider the assignment corresponding 
to this path. But if b(P1) is less than b(P2) and a(P1) is larger than 
a(P2) then also it is possible to reject one path as follows. Here we 
reject the path P1 as the cost of assignment of P1 would always be 
larger than the cost of assignment corresponding to P2 with a guarantee 
that the error in the communication requirement of the selected path, 
i.e., P2 would be bounded by e.

Each application of the procedure Merge will take time not larger than 
O(MN2 (CT/e)2) with the

guarantee that the maximum difference between the total communication 
requirement of the

partitioning we found and the actual communication requirement of the 
optimal partitioning is

at the most equal to Me. If the relative error bound for the user 
interested to use the approximate

scheme is  then the time complexity of the algorithm would be equal to .

CT e
? s2

N2k CT e
?
( )

s2
CT e
? s1

s2 s1
s2

s1 s2

a L3
( ) F L1
( ) F L2
( )
+
=

b L3 )
( ) D L1
( ) D L2
( )
+
=

CT e
?
( )2 s1 s2 CT e
?
( )2 CT e
?

e O CT e ?
( )2N2M3
( )


3 Partitioning & Mapping Schemes 21

11

21

31

41

51

61

12 13

22 23

32 33

42

43
52

53

62

63

11

21

31

12 13

22 23

32 33

Fig. 3.5 A path from s1 to node 32 and a path from s2 to node 32 (top) 
are combined into a single path or edge from s to node 32 (bottom).

s1

s

s2

End Node


4 Conclusions 22

The approximate technique that we have described is thus a fully 
polynomial time approxima-

tion scheme in which the time complexity is a polynomial function of the 
size of the problem

as well as the reciprocal of the total error .

4 Conclusions

Distributed Computing offers a solution where a network of heterogeneous 
computers can be used to solve large scale scientific and engineering 
problems. In theory, this approach can dramatically improve the 
performance because each component of the application is executed on an 
architecture that is best suited for it. In reality, however, a number 
of problems should be solved in order to utilize the full potential of 
the distributed system. The distributed computer system and the 
distributed application should be modeled in such a fashion that 
analytical research can provide solutions to the fundamental question of 
how to optimally partition the application across the machines in the 
distributed computer system.

In this paper we have presented efficient schemes which can partition 
chain or tree structured Distributed Multimedia Applications consisting 
of several heterogeneous components across the distributed machines in 
the network. We have provided approximate solutions to the problem of 
partitioning chain and tree structures taking into account only the 
communication constraint and minimizing the total cost. It is possible 
to extend these techniques to less restricted structures and currently 
we are working on partitioning series-parallel structures which belongs 
to a yet another useful class of distributed applications. We also 
intend to find exact or approximate solutions to the general 
partitioning problem under further restrictions on the application 
problem. It is also possible to design efficient heuristics based on 
this approach which can take into account the additional resource 
constraint on the load assigned to machines in the distributed computer 
system.

Acknowledgment: We acknowledge the motivation, encouragement and support 
provided by Asima Ashraf, Gabriel Dermler, Thomas Braunl, and Kurt 
Rothermel.

e


5 References 23

5 References

[1] A. Khokhar, V. K. Prasanna, M. Shabaan, and C. Wang, ?Heterogeneous 
Supercomputing: Problems & Issues,'' IEEE Computer, June 1993..

[2] Rothermel K., Barth I., Helbig T., ?CINEMA - an architecture for 
configurable distributed multimedia applications,'' Tech. Report 3/1994, 
Universität Stuttgart, Fakultät Informatik.

[3] Hagin A., Dermler G., Rothermel K., ?Problem formulations, models 
and algorithms for mapping distributed multimedia applications to 
distributed computer systems,'' Tech. Report 3/ 1996, Universität 
Stuttgart, Fakultät Informatik.

[4] S. H. Bokhari, ?Dual processor scheduling with dynamic 
reassignment,? IEEE Transactions on Software Engineering,? July 1979, 
Pages = 341-349.

[5] S. H. Bokhari, ?On the mapping problem?, IEEE Transactions on 
Computers, March 1981, Pages 207-214.

[6] S. H. Bokhari, ?A shortest tree algorithm for optimal assignments 
across space and time in a distributed processor system?, IEEE 
Transactions on Software Engineering, November 198 1, Pages=583-589.

[7] S. H. Bokhari, ?Assignment problems in parallel and distributed 
computing?, Kluwer Academic Publishers, 1987.

[8] M. Ashraf Iqbal,?Approximate algorithms for partitioning and 
assignment problems?, ICASE Report Number = ?86-40?, NASA Contractor 
Report 178130, June 1986.

[9] M. Ashraf Iqbal, Joel H. Saltz and S. H. Bokhari, ?A comparative 
analysis of static and dynamic load balancing strategies?, Proceedings 
of the 1986 International Conference on Parallel Processing, August 
1986.

[10] M. Ashraf Iqbal and S.H. Bokhari, ?Efficient Algorithms for a class 
of Partitioning Problems,'' IEEE Trans. on Parallel & Distributed 
Systems, 1995.

[11] M. Ashraf Iqbal, ?Approximate Algorithms for Partitioning 
Problems'', International Journal of Parallel Programming, October 1991.

5 References 24

[12] M. Ashraf Iqbal, ?Efficient Algorithms for Dilated Mapping of 
Binary Trees'', Journal of Parallel & Distributed Computing (JPDC), 
1992.

[13] M. Ashraf Iqbal, Saeed Iqbal, and M. E. Shabaan, ?Partitioning 
Image Processing Tasks on Heterogeneous Computer Systems, ''Proceedings 
of the Workshop on Heterogeneous Processing, April 1994.

[14] M.Ashraf Iqbal & M.E. Shabaan, ?Heterogeneous Partitioning of Chain 
Structured Image Processing Tasks,''Worshop on Computer Architectures 
for Machine Perception (CAMP'93), NewOrleans, Louisiana.

[15] Harold S. Stone, ?Multiprocessor scheduling with the aid of network 
flow algorithms?, IEEE Transactions on Software Engineering, January 
1977, Pages = 85-93.

[16] T.N. Bui & C. Jones, ?Parallel Algorithms for Partitioning Simple 
Classes of Graphs,'' International Conference on Parallel Processing, 
August 1990.

[17] R.M. MacGregor, ?On Partitioning a Graph: a theoretical & empiric 
study'', Ph.D Thesis, University of California, Berkeley, 1978.

[18] D. M. Nicol and D. R. O'Hallaron, ?Improved Algorithms for Mapping 
Pipelined and Parallel Computations,'' IEEE Trans., Computers, vol. 40, 
No. 3, 1990.

[19] D. F. Towsley, ?Allocating programs containing branches and loops 
within a multiple proccesser system, '' IEEE Trans. on Software 
Engineering, Oct., 1986, pp. 272-277.