Fine Grained Workload Distribution Across Workstation Clusters of European Computing Centers Coupled by Broadband Networks

Wolfgang Becker
Faculty Report No. 1995 / 9
Institute of Parallel and Distributed High-Performance Systems (IPVR)
University of Stuttgart, Breitwiesenstr. 20-22, D-70565 Stuttgart, Germany
Phone: +49 711 7816 433, Fax: +49 711 7816 424
wbecker@informatik.uni-stuttgart.de
http://www.informatik.uni-stuttgart.de/ipvr/ipvr.html

This work was partially supported by the European Commission, Project B2010 TEN-IBC E=MC2

Abstract

This report presents measurements performed between four European high performance computing (HPC) centers to investigate the possibilities and challenges of European Meta Computing using broadband connections. A high speed, low latency pilot network consisting of Ethernet, FDDI, Datex-M and ATM sections was set up between workstation clusters at the computing centers. The dynamic load balancing environment HiCon [Beck95e] was installed on the clusters, and several complex, parallelized applications - image recognition, finite element analysis and database processing - were executed and observed. Additionally, several of these applications were executed concurrently in the system. The load balancing environment matched the trade-off between resource exploitation and communication overhead according to the system and network behavior. Performance was compared to the Internet connections available today. The trials were supported by the European Commission as one of a set of distributed computing trials within the E=MC2 project [Horn94], [EMC95c]. The results strengthen common promising expectations and show several interesting challenges, limitations and guidelines for European Meta Computing - resource sharing between distant high performance computing centers by high speed networks and flexible load distribution services.

Table of Contents

1 Introduction - The TEN-IBC E=MC2 Project
2 Purpose and Rationale of the Trials
3 Load Balancing Environment
4 The Trials: Application Scenarios
5 Computing Center, Network and Application Configurations
6 Measurement Results
7 Feedback from Research Community and End Users
8 Conclusion

1 Introduction - The TEN-IBC E=MC2 Project

The TEN-IBC (Trans European Networks - Integrated Broadband Communication) program [Roy94], [TEN94a], [TEN94b] shall provide detailed guidelines that help identify the missing elements in the harmonious development and functioning of the infrastructure for the provision of the Trans-European Networks - Integrated Broadband Communications. The guidelines shall set the network development approach and yield sound and economic investments for implementing the networks.
TEN-IBC defines generic wide-area broadband services and identifies favorable conditions for the deployment of broadband communications. TEN-IBC investigates and evaluates different application types for a European high speed network. The results will be disseminated to related interest groups and user communities. Application domains investigated within TEN-IBC are cooperative work, distributed information services and parallel / distributed high performance computing (E=MC2). TEN-IBC is focussed on the application level rather than on the network level.

The E=MC2 (European Meta Computing Utilizing Integrated Broadband Communications) project within the TEN-IBC program [EMC94a/b], [EMC95a/b/c], [Horn94] addresses broadband interconnection of High Performance Computing Centres for remote and distributed use of super computers, to meet the challenge of major applications and optimize their use across dispersed research teams. Between autumn 1993 and summer 1996 it identifies user interests and evaluates the impact of Europe-wide broadband network availability on the use of supercomputers and computing clusters by research agencies and commercial users. High Performance Computing (HPC) is severely limited by network bandwidth in the degree to which the resources available at HPC centres throughout Europe can be harnessed. The Rubbia report identified the need for high bandwidth connectivity as a key factor constraining the economic benefits of co-operative uses of HPC. E=MC2 set up a trans-national ATM based broadband network and exploits it for applications which demand high network performance for their scientific or commercial success, or even for them to be feasible at all. The project involves several HPC centres, a telecommunications value added services company and a commercial HPC manufacturer (section 5). E=MC2 runs several trials, each concentrating on a different aspect of network requirements and reflecting applications of strong interest in the HPC community. The trials concentrate on 1) coupled computing in modelling and simulation using an atmospheric and oceanographic simulation as its application, 2) distributed supercomputing and workstation clustering between the centres and 3) remote submission and execution of applications to investigate the potential for commercial brokerage of HPC services. This report concentrates on the second area. For more detailed information about the whole project and the results obtained so far, we refer to [EMC94a], [EMC95b].

The TEN-IBC E=MC2 issues are a hot research topic and will become a commercially promising technology in the next years. The concepts for distributed parallel computing mainly come from distributed operating system research [Tane85], [Gosc91] and database processing [Rahm93]. Recent network technologies enable local and wide area networking for HPC with very high performance [Hand91], [Lin94], [McCa94], [Wolm94], [Stru95], [Tolm95], [Pozz95]. Based on this, network computing environment prototypes are being developed [Cap93], [Blum94]. Recently, work has even started on the parallelization and distribution of relevant industrial codes [Mech95], [Buba94], [Mier95], [Colb95].

2 Purpose and Rationale of the Trials

The main issues and purposes of E=MC2 can be summarized as follows:
- Investigation of the technical feasibility of distributing compute load across national boundaries over wide area networks.
- Identification of software support requirements for proper distribution of large compute workloads.
- Observation of bandwidth requirements and bandwidth utilization for distributed execution of different parallel application classes and multiuser load. Monitoring of application behavior and network utilization.
- Review of potential benefits from trans European high speed networks for commercial and scientific HPC. Obtaining user requirements and user satisfaction to evaluate marketability.
- Identification of economically important and promising HPC application domains that can utilize European high speed networks, showing completely different profiles and challenges than the more common and better understood domain of video/voice transmission and cooperative work.

The project is not expected to yield exciting discoveries, but to provide a quantitative base for future planning and activities in this area, which can be of important commercial benefit in the near future. The E=MC2 trials are planned and set up according to the following principles:
- Take existing applications. Adapt them as necessary for wide area distribution and heterogeneous network capacities and processing nodes.
- Use IP over ATM rather than native ATM protocols, because all existing parallel / distributed applications are based on proprietary or on IP mechanisms.
- Couple several distant High Performance Computing (HPC) centers. These centers are currently connected by narrowband Ethernet, and were connected by high speed ATM during the broadband trials.
- Choose representatives for relevant and typical application classes:
  - European weather forecast simulation coupled with oceanographic simulation.
  - Parallel grid based numerical simulations.
  - Workstation cluster load distribution services with different sample application loads.
  - Client-server structured parallel applications with automatic load balancing support.
- Network monitoring characterizes the application behavior and network utilization patterns.

The main topics for investigation in the load balancing trials will be the following:
- Will the introduction of broadband connections enable a shared, increased utilization of the huge computing resources distributed among the various high performance computing centers within Europe?
- To what degree can these distributed resources be effectively exploited? The granularity and flexibility depend on the availability and capacity of trans-European network connections, on the capabilities of the underlying load balancing services and runtime environments, and also on the applications' flexibility with regard to migration and task decomposition across heterogeneously structured networks of dissimilar computer architectures. The coarsest level of load distribution is the submission of whole large batch jobs to suitable computing centers. This requires neither sophisticated software support nor special network capabilities, except that all services must be available at each site and input data must be copied to the executing site. A finer distribution granularity is the assignment / migration of processes throughout the network. Due to unpredictable arrival rates and processing demands in interactive operation this should be done dynamically and hence requires operating system support for transparent file access, service availability and process migration. On-line network connections with sufficient bandwidth are required. The finest distribution level assigns even small subtasks of large complex parallel applications to processors within and between computing clusters to increase the available parallel processing power.
This approach additionally requires advanced system services and powerful point to point communication facilities with low latency for synchronization / cooperation within parallel applications and remote access to fine grained data items. Overall, the question is whether it is feasible to use the European computing centers as cooperating clusters or even as one large parallel metacomputer.
- Under which circumstances is it worthwhile to distribute and shift load mixes of several concurrent large parallel applications among computing centres to improve performance by load balancing? What connectivity is required to enable a significant and effective reduction of load skews between distant computing centers, provided a suitable load balancing service is available?
- Under which circumstances is it worthwhile to distribute one single large parallel application among computing centres to improve performance by load balancing? How far can coupled computing centers be viewed as one large resource pool, increasing the amount of available resources for HPC?
- What technological challenges and economic constraints limit automatic load distribution for the exploitation of the European computing resources? By adapting and upgrading the load balancing environment within these trials and porting different application types of common interest, we try to identify the most critical issues to be improved in today's operating systems and load distribution environments.

Which performance criteria apply to the investigated objects?
- The first object type, the workstation clusters at the computing centers, is observed as to how much of the available processing power can be fruitfully utilized within the clusters and globally.
- The second object type is the network itself. The trials will focus on the average latency of short messages (packets) that are exchanged to synchronize within parallel applications and to ship control information throughout the distributed runtime environment, and on the average latency for mixed short and long messages that are sent for data communication and data movement purposes. The overall utilized network bandwidth is not of primary interest in the trials. The measurements obtained from the narrowband connection will be compared to the ATM results.
- The third object of investigation is the load balancing service. The performance evaluation criteria comprise the achieved resource utilization, the average achieved proportion of computing time to communication time per task execution, and the throughput gain by coupling several unequally loaded remote clusters. An important issue is the effectiveness of load balancing decisions, i.e. how well the trade-off between exploitation of idle remote CPU cycles and increased data communication overhead can be handled (a small numerical sketch of this trade-off follows after this list).
- Finally, the applications are subject to evaluation. The users' performance criterion is simply the elapsed time of a whole application. The trials evaluate the potential of three different application types for automatic distribution across national boundaries by load balancing services. The potential depends on the applications' work load profiles (task granularity and potential parallelism), their communication intensity and also on the complexity of properly distributing their tasks across the nodes.

Overall, the different objects and their evaluation criteria capture the interests of users, application developers, operating system developers, computing center operators and network providers.
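To make this trade-off concrete, the following minimal sketch (in Python, with hypothetical numbers and function names that are not part of the HiCon environment) shows the kind of estimate a load balancer can base such a decision on: shipping a task to an idle remote node only pays off if the saved queueing time outweighs the added latency and data transfer cost. The bandwidth and latency values are of the order reported for the narrowband and broadband connections in section 6.

    # Hedged sketch: estimate where a task finishes earliest, following the
    # general cost model "queue wait + compute time + data communication delays".
    # All numbers are illustrative assumptions, not measured HiCon parameters.

    def transfer_time(data_bytes, bandwidth_bit_s, latency_s):
        """Time to move the task's input and result data across one link."""
        return latency_s + (data_bytes * 8) / bandwidth_bit_s

    def remote_response_time(compute_s, data_bytes, bandwidth_bit_s, latency_s):
        """Idle remote node: no queueing, but the data must cross the network."""
        return compute_s + transfer_time(data_bytes, bandwidth_bit_s, latency_s)

    # Hypothetical task: 2 s of CPU work, 200 KB of input and result data.
    local = 4.0 + 2.0                    # ~4 s wait in the busy local queue, then compute
    narrow = remote_response_time(2.0, 200_000, 250_000, 0.380)    # ~250 Kbit/s, ~380 ms
    broad = remote_response_time(2.0, 200_000, 8_000_000, 0.044)   # ~8 Mbit/s, ~44 ms

    print(f"local {local:.1f} s, remote narrowband {narrow:.1f} s, "
          f"remote broadband {broad:.1f} s")
    # -> local 6.0 s, remote narrowband 8.8 s, remote broadband 2.2 s:
    #    on the narrowband link the communication overhead eats up the gain,
    #    on the broadband link offloading clearly pays off.

Under these assumed numbers the narrowband link makes remote execution slower than waiting locally, while the broadband link turns the same decision into a clear win; this is exactly the kind of behavior the trials set out to quantify.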
The following technical challenges had to be considered for the trials:
- Current parallel and distributed computing on workstation clusters suffers from insufficient concepts for naming, security, vendor independence and availability, which make remote processing, global file access, communication and management difficult: There is no useful or complete hierarchical naming scheme for countries, computing centers, hosts or network interfaces. Default directory paths for executables and users' homes are different, and user names or identifiers often must be different. Illegal data access and resource usage across the network is prohibited by inflexible 'firewall' routers at the computing centers, which render remote execution and communication more difficult. Heterogeneous computer architectures and versions of the UNIX operating system differ in definition files and library functionality; they use different byte orders and byte alignments for data structure representation. Differing executable and data file formats must be managed in addition to the version management of programs, configurations and data across the network, and complex scripts are necessary to manage and distribute the versions across the network. Each time network lines or participating hosts become unavailable, configuration changes are necessary.
- On the application and load balancing level, the IPVR provided an environment including several applications that turned out to be suitable for low bandwidth as well as for broadband trials. For other existing relevant applications, proper parallelization, restructuring and tuning as well as enhancing their portability and flexibility requires huge efforts that could hardly be afforded within this project.
- The proper exploitation of distributed system resources across the network requires not only equalization of processors' load, but also consideration of the data communication within parallel applications and the cost of accessing non-local common data items. The load balancing structure can be clustered according to the geographic distribution of the workstation clusters.
- It was not clear whether the network availability and reliability necessary to perform the trials would be achieved in time. Missing experience with the new technology at the manufacturers, network providers and administrators could also cause serious problems with regard to network management and costs.

3 Load Balancing Environment

Clusters of high performance workstations not only provide individuals with personal computing resources, but can be exploited as parallel computing systems. The workstations within clusters are usually not used by their owners all the time but are idle for about 90% of the time. Hence, for batch and interactive processing demands of the users, the resources of idle workstations within the cluster can be exploited. Further, workstation clusters can be used as a parallel system for large compute intensive applications. To achieve a reasonable exploitation of the available resources and leave programmers and users almost unaware of distribution aspects, dynamic load balancing is required to schedule arriving tasks within the cluster towards optimal throughput. Different computing centers supply clusters of workstations that meet a relevant, growing portion of the scientific and commercial processing demands.
These workstation clusters can also be connected into one huge processing array for large computations, or yield an optimal utilization of system resources by different users and applications entering the system from local or remote sites. However, to exploit this metacomputing resource for the European research and commercial community, suitable broadband connections between European computing centers as well as automatic dynamic load balancing facilities operating within and between clusters are basic requirements.

In contrast to the remote execution trials within E=MC2, the load balancing trials do not focus on a general purpose environment for mixed load by anonymous processes, but on load balancing facilities for increasing the overall system throughput of several big, heterogeneous applications typical for commercial and scientific production environments. Here, the proper exploitation of distributed system resources across the network additionally requires consideration of the data communication within parallel applications and the cost of accessing non-local, common data items. Further, to efficiently equalize load for large, important applications it is worthwhile to provide load profile estimations of the applications' tasks and data to the load balancing facility. Unlike other common load balancing and remote execution environments, the HiCon load balancing environment (see below) can deal with parallelized applications with heavily communicating tasks, synchronization relationships and cooperation on common data structures. Exploitation of distributed system resources across the network requires, besides full utilization of the CPUs, also consideration of data affinity and data communication overhead when spawning tasks that operate close together.

A full implementation of the dynamic, workstation based load balancing environment HiCon, developed by the IPVR, was employed. In this environment load balancing is expected to assign work load in a way that maximizes overall throughput. The execution environment offers advanced, flexible and adaptive dynamic load balancing for compute and communication intensive applications on parallel shared nothing architectures like workstation clusters. Although application independent, load balancing is able to exploit various resource information units for decision making, like CPU run queue lengths, task queue lengths, location of data or copies, and estimations about task profiles. It can operate different complex, reactive and predictive strategies. Further, load balancing is able to dynamically adapt its decision parameters to the current application and system state. Different clustering structures between centralized and completely distributed load balancing structures can be configured. For the trials it is essential that the load balancing system is scalable and flexible enough to span across clusters of heterogeneous workstations of high performance computing centers in Europe. Task and communication granularity as well as communication patterns can be observed at different levels of detail. Summarizing, the HiCon system provides the following features:
- Dynamic task assignment, application independent, adaptive. For heterogeneous workstation clusters, for large, parallel applications, for multiple concurrent applications.
- Central & local task queues
- Centralized per cluster, decentralized between clusters
- Decision cost model: minimum estimated task response time = wait time in local server queue + compute time on node + delays for data communication
- Decisions are based on measurements of nodes' load, data distribution, data exchange cost (including network bandwidth and latency) and pre-estimations of task size and data access pattern
- Applications observed under load balancing: task parallelized, client-server structured, fine granular communicating tasks within applications, cooperation by accessing virtual shared data (runtime system support)

Applications obey the client-server processing scheme: they are functionally decomposed, and one client per application, which controls the application flow, issues tasks to be worked off by some instances of a server class. Of each server class, multiple instances can be distributed among the processors. There is no explicit communication between servers; instead, they cooperate by using global data structures, which are automatically distributed and replicated among the computing nodes. Parallelism within applications is achieved by asynchronous server class calls. During the trials we will not look more closely at load balancing policies, load balancing overhead or the adaptability of dynamic load balancing strategies to network load patterns. We simply employ the most sophisticated and most suitably tuned load balancing structure and policy.

4 The Trials: Application Scenarios

A set of applications was chosen that is representative of the behavior of large, important applications in the areas of numerical simulations, image processing and commercial data processing. The applications are parallelized and can nevertheless be run concurrently as a homogeneous or heterogeneous load mix on arbitrary workstation networks within the load balancing environment. The following gives a brief characterization of the applications used within the load balancing trials.

Parallel Finite Element Analysis

The finite element method is mainly used in mechanics and thermodynamics to investigate the behavior of constructions and volumes under the influence of forces and temperatures by numerical simulation. The objects to be investigated are decomposed into a large number of small elements, between which the physical interactions are calculated by approximation of the respective partial differential equations. This element calculation process yields a large global linear equation system, which must be solved to get the resulting displacements, stress or temperature of the elements. The figures show an example result of a block under stress, and the execution scheme of a parallel finite element analysis.

Figure 1: Result example of a finite element calculation.

The algorithm is a major representative of the computational load generated by large scientific simulations. The element calculation is parallelized by element number ranges; a global stiffness matrix is established using a distributed storage format for sparse matrices. The following conjugate gradient solver contains three parallelized sections per iteration. Overall, this application type shows regular task parallelism, stable task sizes and stable data reference patterns. All phases of the computation require heavy data communication, because the tasks operate on common data elements.

Figure 2: Execution scheme of a parallel finite element analysis.
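The client-side control flow behind Figure 2 can be pictured as in the following schematic emulation in Python. It is not the HiCon API: a standard thread pool stands in for the distributed server instances and the load balancer, and all names and sizes are illustrative. The client issues asynchronous server class calls for element number ranges, and each conjugate gradient iteration issues three parallelized sections that act as synchronization points.

    # Schematic emulation of the client control flow of the parallel finite
    # element analysis; a thread pool stands in for the distributed server
    # instances and the load balancer. Names and sizes are illustrative.
    from concurrent.futures import ThreadPoolExecutor, wait

    N_ELEMENTS, RANGE_SIZE, N_SERVERS, N_ITERATIONS = 10_000, 500, 8, 100

    def calc_elements(first, last):   # server class: element calculation
        ...                           # assemble this range's part of the stiffness matrix

    def matvec(part): ...             # server classes called inside the solver
    def dot_product(part): ...
    def axpy(part): ...

    with ThreadPoolExecutor(max_workers=N_SERVERS) as servers:
        # Phase 1: element calculation, parallelized by element number ranges.
        wait([servers.submit(calc_elements, i, min(i + RANGE_SIZE, N_ELEMENTS))
              for i in range(0, N_ELEMENTS, RANGE_SIZE)])

        # Phase 2: conjugate gradient solver with three parallelized sections
        # per iteration; each wait() is a synchronization point for all tasks.
        for _ in range(N_ITERATIONS):
            for section in (matvec, dot_product, axpy):
                wait([servers.submit(section, part) for part in range(N_SERVERS)])

In the real environment the servers additionally cooperate on the distributed sparse matrix via the virtual shared data support mentioned above, which this sketch omits.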
Parallel Image Processing

Image segmentation converts a given pixel image into a set of homogeneous areas. This process usually is the first step of complete image recognition. Each area shall consist of similarly colored dots with few exceptions. The implemented algorithm takes four steps: Starting from an initial partitioning, the square merge phase tries to combine neighboring squares as long as the resulting squares still form a homogeneous area. In parallel to this process, square split operations refine squares that could not be merged, until each square represents a homogeneous area. Both processes are blockwise parallelized and work concurrently with irregular parallelism on the image. The next step, polygon merge, merges as many neighboring squares as possible into arbitrary polygons. The last step, boundary extraction, yields the edges surrounding the polygons. The split and merge operations are mostly very fine grained. Some can be bundled into larger calls for picture segments. Overall, the number of tasks and their granularity depends heavily on the respective structure of the image and influences the parallelism that can be fruitfully exploited. During the polygon merge phase it is no longer possible to target the calls to disjoint image partitions, because the polygons have arbitrary shapes. Depending on the image structure this causes intensive data communication and limits the useful parallelism. This observation also holds for the following calculation of the polygon boundaries.

Figure 3: Steps of the image segmentation algorithm.

This class of algorithms, typical for image processing, shows several computation stages, unstructured parallelism, differing task sizes and frequent data communication; load profiles depend heavily on the actual structure of the image.

Figure 4: Execution scheme of a parallel image recognition.

Parallel Relational Database Processing

In commercial data processing, a large portion of the effort consists of retrieval, combination, filtering and modification of data items within a huge set of semantically correlated data. Relational database management systems convert complex, descriptive operations into a set of simple basic executions. Here, complex query graphs, consisting of scans, projections, joins and loads, are performed exploiting functional and data parallelism on key range partitioned data, yielding execution profiles characteristic for commercial data processing. The figure shows a sample query graph along with a possible parallel execution scheme. In this application type, task sizes and data reference patterns can be pre-estimated.

Figure 5: Algebraic description and parallel execution scheme of a sample query.

Network Requirements

The feasibility of load equalization across computing centers depends on the capabilities of the underlying network connections. The requirements for the trials can be defined by the message traffic profile imposed by the load balancing environment.
Within one workstation cluster, clients and servers exchange call and result messages which are routed via the centralized load balancing component (point-to-point messages with a size of about 100 bytes and a frequency of 0.05 to 50 messages per second, depending on the task granularity and the parallelism within the applications). These messages are almost entirely latency bound. Between system partitions, the same type of messages occurs for task migrations and remote result returns between the load balancing components. Their frequency depends on the load skew tolerance of the intercluster load balancing policy (rough equalization yields low message traffic). Further, the load balancing components interchange system load information messages of 50 byte size at adjustable intervals (usually seconds). Data communications between servers involve short control messages (about 50 bytes) to the load balancing component or probable data owner and longer messages for the actual data transfer, which occur directly between the current owner of the data and the requester node. The size of the data messages is application dependent (usually between 100 and 10,000 bytes); the frequency depends on the data requirements of the applications' tasks and also heavily on the data affinity of the servers to the requested data, which can be increased by advanced load balancing strategies. These messages are almost entirely bandwidth bound. Further, there are short control messages for data copy invalidations, and for data transfer across clusters additional forwarding messages between the concerned load balancing components are necessary.

Overall, the message traffic in the load balancing environment consists, besides control, task and result messages, mainly of data communication messages between server processes. There are situations with thousands of data communication messages per second that easily congest low bandwidth networks. Messages transferring large data items demand high bandwidth; shorter ones are latency bound. Although message latency can partially be hidden by multiple servers per node and different concurrent applications, the progress of parallel applications heavily depends on fast message delivery. The load balancing environment uses the UDP message passing protocol (TCP/IP family) rather than ATM specific protocols for portability reasons.

5 Computing Center, Network and Application Configurations

E=MC2 couples several European HPC centers to exploit the aggregated computing power consisting of supercomputers, parallel systems and workstation clusters. Figure 6 shows the project partners, Figure 7 the network connectivity that was set up by the project.

Figure 6: Map of the E=MC2 project partners. The partners include CERFACS Toulouse, GMD Bonn, IPVR Stuttgart, Octacon Middlesborough, Queens Belfast, RAL, Eaton and Telmat Soultz; their roles cover computing centers, application engineering, network service provision and network evaluation.

Figure 7: Network connectivity for the E=MC2 trials. The pilot interconnects Stuttgart, Bonn, Belfast and Toulouse by ATM switches, IP routers and protocol converters, using SuperJanet sections (20 and 34 Mbit/s), a 34 Mbit/s Datex-M section via Cologne, the 34 Mbit/s European ATM pilot via Bagnolet, Paris, London and Milan, and ATM access links of down to 5 Mbit/s.

Figure 8 illustrates the general scenario for the load balancing trials: the computing centers participating in the project each provide workstation clusters. A central load balancing component was used for simplicity and to increase load balancing accuracy by a global overview.

Figure 8: Clustered environment for wide area networked parallel computing.

The general network and application topology for the load balancing trials consists of all four HPC centers CERFACS, GMD, IPVR and Queens. The environment and the applications can run across multiple platforms; here, System V and BSD UNIX derivatives from IBM, Sun and Silicon Graphics are involved. Different network capacities were investigated, and each of them was evaluated for different load profiles, i.e. application scenarios. Possible network configurations, as shown in the figure, are locally Ethernet connected workstations, locally ATM connected workstations, trans European narrowband connected workstation clusters and trans European high speed connected workstation clusters.

Figure 9: General arrangement of network and applications for the load balancing trials. The figure spans a matrix of measurement scenarios: the application types (finite element, image recognition, database operations), run as one parallel application or as N parallel applications concurrently, on in-house trials (LAN) and trans European trials (WAN) over the existing narrowband network during a workday, the existing narrowband network during the night, and the broadband pilot network set up for E=MC2; observed in each cell are network / CPU utilization, load balancing performance and application elapsed times.

Of course, due to time limitations and the unavailability / unreliability of the broadband connections, not all cells in this matrix could be filled by realizing corresponding measurement scenarios. Instead, the successfully realized trials are presented in the following.

6 Measurement Results

Although the E=MC2 project performed a variety of different measurements, it is impossible to investigate network and resource utilization and the application performance along a continuous scale of network speed. In the following, the trials are presented for the different topologies that were employed. Each topology is described in three aspects:
- the network routing,
- the network performance and
- the application performance.

For each topology we tried to compare network lines of differing capacity, namely the existing Internet and the broadband pilot lines set up especially for the project.

Network Routing

The routing of message packets across the existing narrowband network within Europe is rather complicated and not straightforward. The figures show typical routing lines between the participating computing centers within E=MC2.
The IP traceroute utility measures the average round trip delay for sending short UDP packets (40 bytes) across the network and traces these delays for each router along the path. Hence, it provides the whole route which the packets take and the time spent on each section. The latencies vary extremely, depending on the traffic profiles, which change during the day and over the week. The routing also changes dynamically and may thus contribute to the latency variations. For example, between Stuttgart and Toulouse, latencies in the range of 200 ms to 600 ms can be observed. The same degree of variation holds for the network performance measurements. Not only the long distance lines, but also the large number of hops within the switching stations, i.e. network provider exchanges or computing centers, add significant delays to the message delivery time. Hence, the broadband pilot connections established for the E=MC2 trials not only largely improve the throughput, but also reduce the latencies significantly. Further, the routers and switches are also the main reason for the packet loss rates, because they drop packets in congestion situations. Within the UK the research centers are already connected by SuperJanet broadband lines for the usual Internet services. In contrast, France and Germany are equipped with region wide broadband networks only, like the BelWue in Baden-Württemberg.

Network Performance

The main focus of the E=MC2 research is on the application level, where the application performance of course depends heavily on the performance and characteristics of the underlying networks. So it is essential to judge the network performance. The raw network level throughput is the most commonly used measure for network performance. This was also observed in depth [EMC95a], but the E=MC2 consortium identified further characteristics as important and tightly correlated with performance on the application level:
- Distributed HPC applications usually exchange data and synchronization messages between the participating processes. This yields no continuous data flow and hence no steady network utilization. However, it requires very high bandwidth during certain phases of an application execution.
- Most applications are parallelized in a way that the progress of the participating processes depends on the arrival of certain data or synchronization messages. Therefore, the elapsed time for a single, possibly short message is a limiting factor for the performance of distributed applications. This is why the network latency and the throughput for single / synchronous messages are important network characteristics. Synchronous messages are used frequently if the underlying delivery service is unreliable, or if the sender requires that the receiver really has received the message, possibly just to ensure that the partner has also reached a certain point within the common computation.
- The Internet protocol (IP) provides unreliable packet delivery. Packets can be lost on their way arbitrarily, due to corruption of the bit contents or due to congestion of the input / output queues of the intermediate routers and switches. There is no notification mechanism that tells about lost packets. Hence, the protocols based on IP usually optimistically send several message packets across the network while listening for acknowledgement packets. If, after some time or after a certain amount of data has been sent, no acknowledgement has returned, the sender protocol simply retransmits the packets. The average packet loss rate therefore also significantly influences the network throughput visible to the application, because lost packets entail large delays for the time-outs and additional traffic on the network due to retransmissions.
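As a concrete illustration of such a synchronous, acknowledgement-based exchange, the following minimal Python sketch (with a hypothetical peer address and port; it is not the protocol implementation used by the load balancing environment) sends a short UDP packet, waits for the acknowledgement and retransmits after a timeout. The measured round trip time corresponds to the single/synchronous message latency reported in the network performance figures below.

    # Minimal stop-and-wait UDP exchange with timeout-based retransmission.
    # The peer address and port are placeholders.
    import socket, time

    def sync_send(sock, payload, peer, timeout=0.5, max_retries=5):
        sock.settimeout(timeout)
        for _ in range(max_retries):
            start = time.monotonic()
            sock.sendto(payload, peer)
            try:
                sock.recvfrom(2048)                 # block until the ack arrives
                return time.monotonic() - start     # round trip time in seconds
            except socket.timeout:
                continue                            # packet or ack lost: retransmit
        raise RuntimeError("no acknowledgement after retries")

    # Usage sketch:
    # sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # rtt = sync_send(sock, b"x" * 40, ("peer.example.org", 9000))

Each lost packet costs at least one full timeout before the retransmission, which is why high packet loss rates hurt distributed applications far more than the raw bandwidth figures suggest.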
The network performance graphs present the relevant performance characteristics of the networks used by the trials. It should be mentioned that the performance on the existing narrowband wide area networks shows huge deviations in all aspects, so the numbers in the graphs merely reflect typical average behavior.

Application Level Measurement Results

As depicted in section 5, we tried to execute and observe as many different scenarios as possible on the different network topologies. The results on the application level give the elapsed time of a whole parallel application's execution or the elapsed time of a whole mix of concurrently executing parallel applications. This serves as a judgement of response time and of system throughput as well.

6.1 In-house Broadband Trials

Within the IPVR, some trials were performed to compare the performance of the Ethernet CSMA network with up to 10 Mbit/sec bandwidth to the ATM switched network providing 100 Mbit/sec (TAXI). Up to five hosts were connected by either Ethernet or ATM. Figure 10 gives a performance comparison between the Ethernet and the ATM broadband connection in-house within the IPVR. Latency is about the same, because it is mainly software protocol stack processing and context switching time. This holds for messages up to 3 KByte. The throughput, however, increases dramatically.

Figure 10: LAN network performance (latency, packet loss rate and throughput for single/synchronous and bulk/asynchronous messages).

Despite the throughput enhancement, the application level measurements showed no significant differences. Load balancing always exploits all available servers. But as this type of distributed computing does not heavily use large data streams, the latency of the LAN is a limiting factor. For illustration purposes, Figure 11 and Figure 12 observe one of the five participating hosts during the execution of three concurrent and of nine concurrent parallel finite element analysis applications on the ATM LAN. By tracing the ATM cells sent / received by this host and displaying the CPU utilization profile underneath, they show the intermittent usage pattern of the network, corresponding to the calculation phases of the applications. The network utilization increases as expected if more applications are executing concurrently. Still, they produce no continuous but rather irregular traffic, demanding very high bandwidth for short phases. Note that this is only the network load of one of the hosts; hence the message traffic crossing the switch is about five times as high, rising up to 5 Mbit/sec sustained throughput in peak situations.

Figure 11: Network utilization by one host during the in-house ATM trials.

Figure 12: Network utilization by one host during in-house ATM trials with high concurrency.

6.2 National Broadband Trials between Stuttgart and Bonn

Cooperating with the national research laboratory GMD near Bonn, it was possible to set up a Datex-M link with broadband end connections on both sides between GMD and Stuttgart for the E=MC2 trials.
After working through various problems and challenges which were typical for all the broadband pilots within the project, the line was up for several days with acceptable bandwidth and reliability. First, we look at the routes and message latencies on the existing national Internet and on the national broadband pilot connection established for the trials between the IPVR in Stuttgart and the GMD laboratories in Bonn (Figure 13). The Stuttgart router converts IP packets into SMDS packets and sends them over an HSSI interface via Datex-M to Cologne. This is based on DQDB technology. In Cologne, the protocol is transformed from DQDB towards ATM. The Bonn router then receives SMDS packets via ATM, extracts the IP packets and repacks them into AAL5 packets. These packets are sent via ATM to the target workstations.

Figure 13: Current Internet vs. broadband pilot routing Stuttgart - Bonn. On the usual Internet route via the BelWue and GMD gateways the round trip delay grows from about 2-3 ms within Stuttgart to about 44 ms at night and up to 99 ms during the day, while on the Datex-M / ATM pilot route (invisible on IP level between the BelWue and GMD gateways) it stays at about 19 ms.

A performance comparison between the existing national network and the broadband pilot connection between IPVR Stuttgart and GMD Bonn gives the following results:

Figure 14: National network performance between Bonn and Stuttgart (message latencies of roughly 53-177 ms on the Internet vs. about 17 ms on the pilot; rcp throughput of roughly 400-550 Kbit/s vs. about 1400 Kbit/s).

This time, network throughput was measured by observing the UNIX rcp command for remote file copying. Because a 1 MB file was copied, the numbers roughly correspond to the bulk / asynchronous values of the measurements in the other topologies. The broadband connection, consisting mainly of Datex-M and smaller ATM / FDDI / Ethernet parts, was said to provide a raw throughput of 8 Mbit/sec, as ordered from Telekom. The narrowband connection (WiN) was specified to offer nearly 1 Mbit/sec if completely undisturbed. During workdays, however, the network specialists said it could be arbitrarily slow. User level throughput is of course smaller due to rcp, IP and ATM protocol overhead. The average message latency could be reduced dramatically by the high speed link. Based on this broadband connection, the largest variety of different scenarios involving all three application types could be performed. Comparative measurements using the usual Internet connection could be obtained during the night and during workdays. The clusters consist of 4 workstations running SunOS (BSD Unix) or Solaris (System V Unix) at each site. At the IPVR, two-processor workstations were employed. First, each application was observed in dedicated mode on the cluster. Next, each of the applications was run several times concurrently. Here, all 5 application executions were started from one GMD host and appropriately distributed across the network by automatic load balancing. The results are summarized in the following figures. The parallel finite element analysis exploited nearly 100% of the IPVR servers and about 75% of the GMD servers.
The problem of the executions on the high capacity network is that the waiting times at the end of the iterations, caused by slightly unequal computing times, become more important as the network communication delays drop.

Figure 15: Application level performance of the national trials, dedicated applications (elapsed times [sec]: finite element 367 / 389 / 491, image processing 123 / 155 / 217, database processing 60 / 74 / 177).

The parallelized image recognition on the narrowband network was executed 70% on IPVR machines (quad phase). In the merge phase, load balancing exploited an average parallelism degree of 3 using IPVR machines. In the boundary search phase, however, all servers were used. On the broadband network, slightly more parallelism was exploited. The parallel database operation execution was always executed to 95% at the IPVR, because the internal parallelism was not that high and the IPVR machines are significantly faster. Overall, it must be stated that the rather fine granular distribution of a single parallel application across the wide area network does not show significant improvements in general. This indicates that wide area resource sharing is especially worthwhile to increase the computing potential if a local site cannot provide enough resources for the workload it receives, and not to achieve arbitrary speedups by further parallelizing applications.

Figure 16: Application level performance of the national trials, multiuser operation (elapsed times [sec]: finite element 1200 / 1447 / 2191, image processing 335 / 459 / 1500, database processing 209 / 225 / 348).

For the 5 concurrent image recognition computations, load balancing decided to exploit all servers for almost all computation phases. The same holds for the other 2 application types. So load balancing considers this network topology close enough to fully exploit the distant computing clusters as one meta computer. While in the following international topologies this can be achieved with the broadband ATM pilot only, here the broadband connection does not principally change the distribution policy but yields largely reduced communication delays and significantly better resource utilization. This can successfully be exploited to improve application level throughput.

6.3 Trans European Narrowband Trials at Different Times of Day

The performance of the existing narrowband WAN within Europe available to a user varies largely between night and day. During working days the network lines are completely overloaded - [EMC95a] provides extensive measurements. The network routing figures are not shown here, but in the following sections, which also used broadband lines. Figure 17 just shows a typical Internet routing between Queens University of Belfast in Northern Ireland and CERFACS Toulouse in Southern France. These trials on the narrowband network were performed to evaluate the sensitivity of the application performance to the underlying network performance. However, it must be mentioned that the network shows a very bad performance both at night and during the day, so that load balancing mostly refused to really exploit this network.
Hence, one major conclusion of these measurements is that the current network is really not capable of enabling suitable European meta computing. The configuration consists of all three distributed computing centers, each of them supplying a workstation cluster as indicated in Figure 17. The network characteristics that can be obtained for some of the connections under typical distributed HPC applications' communication profiles are shown in Figure 18.

Figure 17: Current Internet routing Belfast - Toulouse and trial topology. The route runs from CERFACS through the French Renater network via Montpellier, Marseille, Lyon and Paris, crosses to the UK ja.net gateway at about 340 ms and reaches Belfast via London and the SuperJanet at roughly 460 ms total round trip delay.

In order to motivate load balancing to utilize the distributed resources, five applications of the respective type were started concurrently at the IPVR. Although their initial tasks were performed at the IPVR, the applications soon began to unfold extensive parallelism, computational load and data communication. Load balancing had to decide whether it was worthwhile to execute some of the tasks at distant sites. One restriction of the centralized load balancing structure which was used for most of the trials is that it was not able to simply assign or migrate whole applications among the clusters, because it decides on a per task basis. Usually, the effects are similar to distributing whole applications if the network capacity differs widely within and between clusters. But sometimes it turns out that load balancing does not use any cluster other than the one where the respective application originated, because once the applications have grown within this cluster, distributing some of their tasks to a remote cluster would lead to huge data communication efforts.

Figure 18: Trans European narrowband performance Belfast - Stuttgart - Toulouse (typical values: Belfast - Stuttgart latencies of 120-200 ms with 25-45 Kbit/s single-message and 50-90 Kbit/s bulk throughput; Stuttgart - Toulouse latencies of 165-380 ms with 125-165 Kbit/s single-message and 250-320 Kbit/s bulk throughput).

Figure 19 shows the elapsed time for the whole application mixes, for each application type and for execution at night and during a workday respectively. To explain the execution time differences, it should be considered that the CPU resources were the same and the application load was the same; only the underlying network performance changed. Further, load balancing had to decide how much of the resources it could fruitfully exploit in order to improve the overall performance. Hence, in contrast to statically distributed applications, changing network characteristics yield completely different task assignments and schedules. Finally, it should be noted that load balancing does not primarily try to fully use up the available network bandwidth, but tries to get the workload done as efficiently as possible. This often results in intermittent network utilization.
Figure 19: Application level performance of European narrowband trials (elapsed times of the application mixes [sec]: 1680, 2210, 255, 665, 830, 255; 640 for the artificially forced image processing run at night).

For the finite element analysis applications, load balancing decided to use mostly Stuttgart machines; at night, Toulouse and Belfast were utilized slightly more. For the even more fine grained and more communication intensive image processing applications, load balancing decided to use Stuttgart only, regardless of the time of day. If load balancing was forced artificially to also use the other clusters, the application mix took 640 seconds at night. This is added only to show that the load balancing decisions are in fact reasonable. For the database processing applications, load balancing fully utilized all clusters in both cases. These measurements indicate that on existing wide area networks fine grained international meta computing is infeasible, and that HPC application performance is really sensitive to the underlying network capacity.

6.4 Trans European Broadband Trials: Stuttgart - Paris

During the Interop+Networld fair 1994 in Paris, the RUS of the University of Stuttgart set up an ATM / Datex-M broadband connection between Stuttgart and Paris. It was arranged that this connection could be used for E=MC2 load balancing trials at night. The message routing and latencies on the trans European broadband pilot connection established during the Interop fair between IPVR Stuttgart and the conference center in Paris are shown below.

Figure 20: Broadband pilot routing Paris - Stuttgart. The route leaves Stuttgart via the BelWue at about 4 ms, reaches the Datex-M / ATM transition at about 34 ms and arrives in Paris via Cologne and Bagnolet at about 63 ms (the Datex-M and ATM routing is invisible on IP level).

Between Stuttgart and Berlin a 34 Mbit/s Datex-M line was used. Between Paris and Berlin an ATM line was set up with intermediate switches in Bagnolet and Cologne, i.e. the European ATM pilot. At both ends of the ATM line, messages were converted from / to the Ethernet protocol, which limited the access bandwidth to the broadband connection to 10 Mbit/s. Further, Telekom supplied an OSI level 3 protocol router between DQDB (Datex-M) and HSSI/DXI (ATM) in Berlin, which increased message latencies, because packets were reassembled on IP level. Overall, several serious technical and organizational restrictions reduced the measurements that could be executed and the strength of the results. These problems, caused by the instability of the connection and the prototypical setup and configuration, turned out to be a characteristic problem for the E=MC2 project:
- The line was effectively available for small time slots only, allowing neither detailed observation, nor adaptation of the load balancing or the applications to increase the bandwidth exploitation, nor measurement repetitions.
- For days, all IP packets larger than 600 bytes were lost between Paris and Berlin.
- Only 10 Mbit/s access speed was available to the broadband lines.
- As described above, the latency was very large due to the many high level protocol conversions.
- Only two workstations in Paris and just one in Stuttgart could be coupled fruitfully for this trial. This does not represent really coupled workstation clusters. Hence, load balancing was encouraged artificially to distribute the computation over all hosts.

Figure 21: Network performance between Paris / Toulouse and Stuttgart (Stuttgart - Toulouse narrowband: latencies of 165-380 ms, 125-320 Kbit/s throughput, 5-20% packet loss; Stuttgart - Paris broadband: about 63 ms latency, roughly 350 Kbit/s single-message and up to about 3000 Kbit/s bulk throughput).

One parallel finite element analysis application was performed. For comparison with the existing narrowband network, a similar configuration was set up and measured between Stuttgart and Toulouse, as shown in Figure 21. It should be mentioned that the single, dedicated application, although parallelized, could not exploit the full bandwidth, but caused intermittent network traffic. The execution profiles of the broadband configuration show acceptable ratios of the data communication portion to the CPU processing portion, with full exploitation of all three machines and rather fine grained distribution of tasks. The workstation in Stuttgart initially owned all data, so large amounts of data had to be replicated and migrated across Europe during the element calculation. During equation solving it turned out that in-house data communication between the Interop machines was significantly less expensive due to smaller latencies. Load balancing was forced to distribute the tasks nevertheless. The execution profiles of the low speed network measurements show large communication overhead due to increased latencies and poor bandwidth. Figure 22 compares the elapsed execution times for the element calculation phase and the average elapsed times per equation solving iteration:

Figure 22: Application level performance of the first European broadband trials (element calculation phase: 290 / 455 / 850 seconds; average iteration time within the equation solver: 163 / 298 / 363 seconds).

Note that the evaluation of the broadband trials focussed on the load balancing and application level only. Network level performance evaluations to examine the effectively exploited bandwidth could not be performed due to lack of time. The performance gain actually obtained on the application level even under these adverse circumstances, together with the acceptable data communication overhead, shows the following: The increased bandwidth can be reflected in a corresponding speedup for distributed parallel applications; high speed networks can enable sufficiently fine grained load balancing between European computing centers, provided advanced load balancing support and suitably decomposed applications are available, while current networks are insufficient. For larger application scenarios and multi user concurrency, load balancing will be able to better exploit the network bandwidth.
More workstations per cluster and more computing centers participated in most other measurements, in order to enable real exploitation of the European computing resources and to match the trade-off between communication cost and the utilization of CPU cycles, memory, disks etc.

6.5 Trans European Broadband Trials: Stuttgart - Toulouse

After an unexpectedly long period of organizational and technical problems, the ATM link between Stuttgart and Toulouse became available at least for several days. Even then, the reliability was not good. The figure shows the typical message routing and latencies on the existing trans European network, and on the trans European broadband pilot connection established for the E=MC2 trials between IPVR Stuttgart and CERFACS Toulouse.

Figure 23: Current Internet vs. broadband pilot routing Stuttgart - Toulouse. The broadband pilot route reaches CERFACS at about 44 ms (the Datex-M and ATM routing is invisible on IP level), while the existing Internet route runs via Düsseldorf, Amsterdam, Geneva and the French Renater network, with delays rising to roughly 430-510 ms.

The performance comparison between the existing network and the trans European broadband pilot connection established for the E=MC2 trials between IPVR Stuttgart and CERFACS Toulouse is shown in Figure 24.

Figure 24: Trans European network performance between France and Stuttgart (narrowband: 165-380 ms latency, 125-165 Kbit/s single-message and 250-320 Kbit/s bulk throughput, 5-20% packet loss; broadband pilot: about 44 ms latency, about 700 Kbit/s single-message and 1130 Kbit/s bulk throughput, negligible packet loss).
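The measured values give a rough feel for the task granularity needed before trans European execution pays off. The following back-of-the-envelope sketch (Python, with a hypothetical helper function and an assumed data volume per task; the latency and throughput values are of the order shown in Figure 24) estimates the communication overhead per remotely executed task and the compute time a task needs so that communication stays below roughly ten percent.

    # Hedged estimate: per remote task, two synchronous call/result messages
    # (latency bound) plus the transfer of its data (bandwidth bound).
    def overhead_per_task(latency_s, bandwidth_bit_s, data_bytes, sync_msgs=2):
        return sync_msgs * latency_s + (data_bytes * 8) / bandwidth_bit_s

    narrow = overhead_per_task(0.380, 250_000, data_bytes=10_000)     # existing Internet
    broad = overhead_per_task(0.044, 1_130_000, data_bytes=10_000)    # ATM pilot

    for name, ovh in (("narrowband", narrow), ("broadband", broad)):
        # for a <=10% communication share, compute time must be >= 9x the overhead
        print(f"{name}: {ovh * 1000:.0f} ms overhead per task, "
              f"tasks should compute for at least {9 * ovh:.1f} s")
    # -> narrowband: ~1080 ms overhead, tasks of roughly 10 s and more;
    #    broadband:  ~159 ms overhead, tasks of roughly 1.4 s and more.

This rough bound is consistent with the observations below: only sufficiently coarse grained tasks, or larger problems, make trans European distribution of a single application worthwhile.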
larger problems, would have to be calculated, and the granularity of task interaction would have to be more coarse grained. Note that the trans European trials between Paris and Stuttgart (described above) also investigated a single application scenario, but used a load balancing policy that was encouraged to distribute tasks more aggressively.
Second, a multi user concurrence of 5 parallel finite element analysis applications was initiated at the IPVR cluster and executed under automatic load balancing support. The same scenario was then performed using the parallel image processing and the parallel database processing applications respectively. Figure 25 summarizes the application level performance for each of the application types, compared to the same trials running on the existing narrowband links at night or during a workday. These results constitute the most comprehensive evaluation of the question how far automatic dynamic load balancing can exploit such a European meta computer for typical HPC applications.
Figure 25: Application level performance and trial topology of second European broadband trials.
(Figure 25 data: elapsed times [sec] on the IPVR Stuttgart / CERFACS Toulouse topology - finite element analysis 1626 / 2288 / 2558, image processing 295 / 416 / 579, database processing 198 / 362 / 539 - for the broadband and the two narrowband scenarios.)
A more detailed look into the execution traces of the trials gives some interesting insights: The parallel finite element analysis was distributed over the whole system on the broadband network as well as on the narrowband network. Figure 26 shows that the communication portion of the tasks is considerably bigger on the slow network. Additionally, load balancing kept more whole applications within one cluster in the case of the poor network, sacrificing full compute resource exploitation in favor of data affinity, i.e. reducing expensive data communications. In the figure, each line represents the execution trace of one server, with time advancing from left to right. Black parts are task computations, the grey part of each box gives the percentage of time spent in communication, and white areas are idle times.
The parallel image recognition applications (Figure 27) execute more fine grained and heavily communicating tasks in their first phases and operate on large sets of potentially accessed common data in the later phases. Hence, load balancing did not dare to distribute applications across the WAN in the low bandwidth configuration. Moreover, with the existing Internet it did not use the Toulouse cluster at all: because all applications originated at the IPVR, by the time the IPVR cluster became overloaded by the applications exploiting their parallelism, they had already created large sets of common data. So the load balancing configuration, which decided on a per task basis, mostly did not consider it worthwhile to migrate tasks across Europe. In the broadband configuration, a better utilization of the meta computer was achieved. Still, only few applications were spread across national boundaries, and only for short phases. Mostly, load balancing tried to keep each application within one cluster, as far as a good overall system utilization was achievable.
Parallel database processing (Figure 28) again has rather different task granularity, parallelism and communication profiles. Load balancing exploited the distributed system in both cases. However, in the case of the poor network, the tasks within an application were kept more within one cluster each. Similarly to the finite element application type, this resulted in an overall worse resource utilization and nevertheless in significantly increased data communication portions for access to common global data.
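The per task decisions described above can be made concrete with a small sketch of a minimum response time placement rule (queue wait + compute time + communication time, cf. the load balancing concept summarized in section 7.2). The sketch below is purely illustrative and is not the HiCon implementation; the data layout, the host parameters and all numeric values are assumptions, except the 44 msec wide area latency quoted above.

# Illustrative placement rule, not the HiCon code: assign a task to the host
# with the smallest estimated response time
#   response = queue wait + compute time + data communication time.
def comm_time_s(bytes_to_move, latency_s, bandwidth_bit_s, messages=1):
    """Per-message latency plus serialization time for the data to be moved."""
    return messages * latency_s + bytes_to_move * 8 / bandwidth_bit_s

def best_placement(task, hosts):
    def estimate(host):
        return (host["queue_wait_s"]
                + task["work_s"]
                + comm_time_s(task["remote_bytes"][host["cluster"]],
                              host["latency_s"], host["bandwidth_bit_s"]))
    return min(hosts, key=estimate)

# Assumed example values; only the 44 ms WAN latency is taken from the trials.
task = {"work_s": 20.0, "remote_bytes": {"IPVR": 0, "CERFACS": 2_000_000}}
hosts = [
    {"name": "ipvr1", "cluster": "IPVR", "queue_wait_s": 30.0,
     "latency_s": 0.001, "bandwidth_bit_s": 8e6},
    {"name": "cerfacs1", "cluster": "CERFACS", "queue_wait_s": 0.0,
     "latency_s": 0.044, "bandwidth_bit_s": 1e6},
]
print(best_placement(task, hosts)["name"])   # the remote host wins only when the
                                             # local queue wait exceeds the cost
                                             # of moving the task's data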
Another application type specific observation can be made here: because the successor tasks within a query execution graph operate to a considerable degree on the intermediate data produced by their predecessor tasks, the migration of one task across cluster boundaries also pulls the execution of related, succeeding tasks towards the other cluster. In the figure, the low bandwidth network trace shows a shift from the IPVR, where the applications started, to the CERFACS cluster. This is an execution at night; during working days, load balancing separated the applications into the clusters.
Figure 26: Execution profiles of concurrent parallel finite element calculations.
Figure 27: Execution profiles of concurrent parallel image recognition applications.
(Figures 26 and 27 - legend: per-server traces over time with idle, busy and data communication portions, tasks colored per application; the panel annotations contrast task distribution, respectively full task distribution, across the network on the broadband runs with almost everything performed within one cluster, respectively cluster affinity based task distribution, on the narrowband runs.)
Figure 28: Execution profiles of concurrent parallel database processing applications.
(Figure 28 - legend as above; annotations: smooth resource utilization at night versus worse overall utilization.)
6.6 Trans European Broadband Trials: Belfast - Stuttgart - Toulouse
Part of the network routing and performance within the European triangle was already shown in the previous sections. Figure 30 therefore just depicts the typical message routing and latencies on the existing trans European network and on the trans European broadband pilot connection established for the E=MC2 trials between IPVR Stuttgart and Queens Belfast. The resulting network performance on this line is as follows:
Figure 29: Trans European network performance between Belfast and Stuttgart.
(Figure 29 chart data: latency [msec], packet loss rate and throughput [Kbit/sec] for single/synchronous and bulk/asynchronous messages between Belfast and Stuttgart.)
The significant packet loss rate of 30% indicates a misconfiguration of some switches along the broadband pilot connection.
Figure 30: Current Internet vs. broadband pilot routing Belfast - Stuttgart.
(Figure 30 data: per-hop delays on the two routes, one reaching sioux-atm.qub.ac.uk after 87 ms via the Datex-M/SMDS path, the other after 119 ms via Düsseldorf, London and Rutherford.)
Due to the late and unreliable availability of the pilot connections, only one broadband trial between Belfast, Stuttgart and Toulouse could be performed within E=MC2 (Figure 31).
Figure 31: Network topology for third European broadband trial (IPVR Stuttgart, CERFACS Toulouse, Queens Belfast).
For this configuration, no comparable execution on the existing narrowband network was achievable, because the large packet loss rates, the huge latencies and the poor and rapidly varying throughput always led to message queue congestion or time-outs within some of the participating application processes. So this measurement scenario mainly shows that, for heavy, real world load and if several distant European clusters are actually to be coupled into a meta computer, a certain quality of the network service is absolutely required. Figure 32 shows the broadband execution trace, mainly to illustrate the multitude of computational load within the distributed system. The lower part of the figure indicates to which application the tasks belong.
This gives an impression of how much load balancing shifts tasks across the hosts, and even across clusters, to achieve a continuous utilization of the resources. It must be stated that the load balancing policy employed here often distributed more than was suitable, i.e. it introduced data communication costs that were not outweighed by the exploited CPU time.
Figure 32: Execution profiles of third European broadband trial.
(Figure 32 legend: execution traces of the IPVR, CERFACS and Queens hosts over time, with idle, busy and data communication portions; tasks are colored per application 1-5, and the lower part indicates the task-to-application assignment.)
7 Feedback from Research Community and End Users
One important aspect of the project was the involvement of and feedback from end users and the wider interest group. Hence, E=MC2 attended international conferences, performed workshops and demonstrations, and disseminated and discussed the approach and results by questionnaires.
7.1 Presentation at International Conferences
Besides presentations within earlier project phases, the E=MC2 project presented and discussed its approach at the network fair `ATM developments' in Rennes, France, in March 1995 and at the HPC fair `High Performance Computing and Networking Europe' in Milan, Italy, in May 1995. In Milan, additionally, a half day workshop presenting and evaluating the project was given, and two papers concerning the project and the load balancing concepts were presented. Last but not least, an on-line demonstration visualized five concurrent parallel image recognition applications running distributed across Europe via an ATM based broadband wide area network established between Milan and Stuttgart during the conference. Figure 33 shows a screen dump: the on-line visualization displays the current network traffic within and between clusters (large boxes), the host utilizations (small boxes) and the applications' progress, by printing a bar per finished task in the upper part of the window, on one line per application.
Figure 33: Screen dump of E=MC2 on-line European Meta Computing demonstration during HPCN.
The general opinion of most of the people we talked to in Rennes and Milan was that the E=MC2 meta computing approach and the actually performed trans European trials with complex, relevant application codes are very interesting. Some people even expressed the view that the better known usage of broadband links, mainly for video transmission, is quite simple and well understood compared to efficiently distributing large HPC applications across coupled computing centers.
However, many people could not imagine that this way of distributed HPC will become technically or even commercially feasible within the next few years; it still sounds like science fiction to them. Another interesting observation was that some people were excited to see someone really demonstrating and investigating these issues, hoping that this could also help to change telecom pricing policies and to push the public and political focus onto these issues, which were generally expected to become, if not the most important, then at least another key factor of European technological competitiveness. Although some people were a bit disappointed that E=MC2 did not unveil dramatically new or unexpected discoveries or guidelines, all interlocutors agreed that it is worthwhile and interesting to substantiate the possibilities, challenges, circumstances, requirements and limitations of trans European HPC by a variety of relevant measurements, providing competent, quantitative evaluation results and experiences.
7.2 User & Interest Group Feedback to E=MC2
In addition to the questionnaire performed during the project definition phase, a questionnaire concerning the requirements for remote execution facilities and another questionnaire discussing the project approach and providing early measurement results were distributed. They disseminated the E=MC2 work in compact form to a variety of research and development people as well as application developers and end users around the world, and yielded interesting evaluations and feedback from competent researchers and also from the user side of HPC and HPC centers. In the following, the aggregated questionnaire response from experienced researchers and developers in this computer science and scientific / commercial computing domain is presented. The character `+' indicates that at least 80% of the responses confirmed the thesis, `?' stands for mixed feelings and `-' indicates prevailing disagreement.
- What investigations do users expect from E=MC2?
  + technical feasibility of European wide area distributed HPC
  + bandwidth requirements / bandwidth utilization (parallel applications / multi user load)
  + potential benefits from trans European high speed networks
  + observation of application behavior / network utilization
  + user requirements and user satisfaction to rate marketability
- Relevance and importance of the E=MC2 project?
  + wide area distributed HPC is of important commercial benefit in the near future
  + for scientific simulations
  + for database processing
  ? important compared to video conferencing, video transmission, cooperative work
- User community that can profit from distributed HPC?
  + HPC centers
  + universities and research centers
  + companies that need huge computing resources
  - single end users for private applications
- Major challenges for distributed HPC?
  + network availability and reliability
  + network management
  + network bandwidth
  + network latency
  + network costs
  + operating system support for addressing / communication within applications
  + operating system support for automatic remote execution / load balancing
  + parallelization / restructuring / tuning of relevant applications
  + portability / flexibility of relevant applications
- Are the E=MC2 trials suitable for the investigations?
  + heterogeneous computer architectures realistic
  + IP over ATM realistic
  ? IP over ATM is the only major way to use ATM for HPC in the next years
  + network and HPC topology realistic and interesting
  + application types relevant
  ? application types cover the major HPC patterns
  + network monitoring relevant
- Load balancing trials - load balancing concept suitable?
  + dynamic task assignment, central & local task queues
  + no preemptive migration due to node heterogeneity
  ? centralized per cluster, decentralized between clusters
  + decision cost model: minimum task response time (queue wait time + compute time + communication time)
  + observe nodes' load, data distribution, data communication cost
  + exploit pre-estimations of task sizes, data access patterns
  + load balancing explicitly considers communications
  + application tasks parallelized, client - server structured
  - distributing fine granular, communicating tasks can be profitable
  ? communication via virtually shared data is a suitable concept
- Load balancing trials - applications relevant?
  Parallel finite element analysis: + realistic, - typical
  Parallel image recognition: + realistic, ? typical
  Parallel complex database operations: + realistic, ? typical
- Value of the early E=MC2 measurement results?
  - Characterization of existing networks:
    + corresponds to common experiences
    + throughput and latency are the main characteristics
  - Comparison of the current network at night / during the day:
    + scenarios suitable
    + results plausible
  - Trans European broadband trials between IPVR Stuttgart and Interop Paris:
    + scenario suitable
    + results plausible
    + conclusions derivable
- Agreement to the E=MC2 general experiences:
  + Large, important existing applications are inflexible and not ready for network distribution; suitable application porting and tuning is necessary and expensive.
  + Internetworking configuration and addressing are still difficult, wide area ATM even more complicated.
  ? High speed networks are a new technology; availability and stability are prototype-like, not of commercial strength.
  + Operating system support has to be enhanced in terms of more transparent network management / addressing and automatic load balancing support.
  ? On the current low bandwidth networks across Europe, interactive remote work and coupled computing are infeasible.
  + Distributing coarse grained, large, isolated jobs is already possible.
  + Wide area coupled HPC is severely affected by concurrent multi user network traffic.
  + Trans European high speed networks can be utilized profitably; they enable trans European cooperative computations and suitable workload balancing, i.e. the exploitation of the distributed computing resources like one large meta computer.
  + Message latency is an important limiting factor for parallel and distributed HPC if fine granular data communication and synchronisation within parallel applications is inherent to the problem.
  + A major part of the important algorithms is bound to these closely coupled, communication intensive execution patterns.
  + European meta computing is technically feasible.
  ? European meta computing is commercially feasible with the approach investigated by E=MC2.
Some additional interesting comments are listed below:
- "E=MC2 evaluation criteria are mainly performance issues, but there are also price and usability issues. The project work seems to focus on scientific computing, which is less than 1% of the use of computers. I think that other applications (e.g. database, transaction processing, mail, interactive video) will be much more important."
- Explicit agreement was expressed with the motivation and importance of the E=MC2 project: "Investigation of another, economically very important and promising domain of applications using European high speed networks, with completely different profiles and challenges than the more common and better understood domain of video/voice transmission and cooperative work."
- Are network costs a major challenge? "Network costs are small, PTT prices and policies are HUGE."
- Main expected user community: "HPC centers, universities and research centers are less than 5% of the society. This is less than sales tax in California. This sales tax is ignored and we can ignore these users too. The real users are companies and individuals. We would not build a special phone system for universities and HPC centers. We will not build a special net for them."
- Americans view the network latency and throughput of the existing European network as astonishingly poor.
- Load balancing approach: "Load balancing should come from the application."
- Choice of applications: "Heterogeneous applications are missing - simulation and visualization."
- "The commercial feasibility and profitability are important questions to be answered by projects like E=MC2. If I knew the answers, the projects would not be necessary any more."
- Communication via access to virtually shared data for database processing: "If the database application using virtually shared data is unchanged from its non-distributed version, then it will perform poorly. However, a database designed explicitly to take advantage of virtually shared data, one that knows the cost, latency, etc., could perform well, possibly better than one based on a higher level client-server task decomposition model."
- Major challenges for distributed HPC: "The network hardware will soon be available with sufficient quality, only the prices may cause problems. Most important is an ingenious management concept for proper resource utilization."
- Wide area coupled HPC is affected by concurrent multi user traffic - solvable by ATM bandwidth reservation facilities? "Not quite sure: fixed reservations on such networks, which are used by so many endpoints concurrently, could drastically reduce the available bandwidth."
8 Conclusion
We quantified the feasibility and limitations of the current European networks in order to determine the real need for trans European broadband networks. To this end, the E=MC2 project prepared and performed the trials as follows:
- The project partners identified appropriate, relevant existing applications; they did not start new development or specific porting and parallelization of software packages. However, the chosen applications had to be adapted for wide area distribution. They had to be carefully enhanced and made flexible enough to run suitably on heterogeneous processing nodes and across machines connected by heterogeneous networks.
- For all broadband trials, the IP over ATM protocol was employed rather than native ATM protocols, because all existing parallel / distributed applications are based on proprietary or on IP mechanisms.
- The participating high performance computing centers, PPC at Queens University Belfast, RUS and IPVR at the University of Stuttgart, CERFACS at Toulouse, GMD at Bonn and RAL at Rutherford, are currently connected by low speed networks (Ethernet, FDDI and ATM LANs; X.25 and leased lines for the WAN).
One main effort of the preparation phase in the E=MC2 project was therefore to achieve a pilot broadband connection, mainly based on the promising new ATM protocol, in order to perform real international broadband trials across Europe.
- Compared to most other recent ATM broadband projects, the E=MC2 project did not use ATM to connect two machines within a room and transfer continuous data streams between them, but performed real, complex applications distributed across the European wide area broadband network, namely between Bonn - Stuttgart, Paris - Stuttgart, Belfast - Stuttgart - Toulouse and Milan - Stuttgart. Hence, in realizing European meta computing, the project had to face various new challenges.
Despite the various technical problems and time limitations, the project managed to obtain some early, encouraging results. The presented measurements can be viewed as a rather pessimistic analysis due to the adverse circumstances; that means that less prototypical conditions will provide even more satisfying and promising results. Several general observations and conclusions emerged from the project work:
- It is nearly impossible to work interactively on distant computing centers during day time. However, this is absolutely necessary to install software, to set up configurations and batch jobs, and to watch, manage and control remote executions. At night or on weekends it is possible, although slow, but this is not enough for real scientific and commercial use. Wide area distributed computing clusters connected by the existing low speed networks cannot be viewed as one large meta computer. There is a huge gap, in terms of cooperation, communication and task exchange for balanced resource utilization, between working within a computing center and working across centers. On the current low bandwidth WANs it is not possible to perform load balancing on the process, command or even task level within big, parallel, mission critical applications; it may be partially possible on the batch job level.
- The latency aspect was recognized as an important factor for parallel and distributed high performance computing on a fine grained level. At first glance, latency seems to depend on the number of hops and conversions between the distant sites. However, network experts point out that the number of hops is not the problem, neither for ATM nor for IP: ATM switches consume about 0.01 msec to forward an ATM cell, and IP routers take about 1 msec per packet. In fact, the problem is caused by the congestion of the narrowband networks, which results in long queues within the hops. The same problem is to be expected on the ATM networks once they are loaded similarly. Provided a suitable bandwidth reservation, latency will finally be dominated by the physical distance; a rough numerical sketch of this point follows below.
- The presented measurements can be viewed as a worst case analysis due to the adverse circumstances. For larger application scenarios and multi user concurrence, load balancing will be able to exploit the network bandwidth even better. More workstations per cluster and more computing centers have to participate in further measurements in order to enable a real exploitation of the European computing resources and to match the trade-off between communication cost and the utilization of CPU cycles, memory, disks etc. Different application types should also be observed. Hence, further ATM measurements, with direct high speed access of the machines to the broadband network, are indispensable.
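A rough numerical illustration of the latency point above: the hop counts and the distance in the sketch below are assumptions for illustration; the per-hop costs are those quoted above, and the 44 msec and roughly 500 msec values are the latencies reported earlier for the pilot and for the congested Internet route.

# Back-of-the-envelope latency estimate with illustrative, assumed numbers.
ip_hops, atm_switches = 8, 10                  # assumed hop counts on a pilot route
per_ip_hop_s, per_atm_switch_s = 1e-3, 1e-5    # ~1 ms per IP router, ~0.01 ms per ATM switch
distance_km = 1000                             # rough Stuttgart - Toulouse fibre distance
propagation_s = distance_km / 200_000          # signal speed in fibre: ~200,000 km/s

forwarding_s = ip_hops * per_ip_hop_s + atm_switches * per_atm_switch_s
print(f"propagation ~{propagation_s * 1e3:.1f} ms, forwarding ~{forwarding_s * 1e3:.1f} ms")
# Roughly 5 ms propagation plus 8 ms forwarding: well below the 44 ms measured
# on the broadband pilot and far below the ~500 ms seen on the congested
# Internet route, where queuing within the hops dominates.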
- On the application and load balancing level, the broadband trials confirm that parallel applications cannot be distributed in an arbitrarily fine grained way, especially algorithms showing intensive data communication and synchronization. Broadband connected, distant computing centers can actually be coupled to effectively increase the processing power, but it is important to decompose applications into coarse grained tasks that are as decoupled as possible. Latency bound, communicating tasks should still be kept within local area clusters, and data bound computations close to their data. Hence, in principle, load balancing can effectively distribute groups of related tasks within competing parallel applications across Europe, which can then be parallelized to a finer degree within the computing centers.
- Broadband connected computing resources throughout Europe can be fruitfully utilized given proper automatic, application independent load balancing support. Furthermore, the computing resources can be maximally exploited on a rather fine grained task level. This even enables large parallel applications to be spread dynamically across national boundaries according to the instantaneous availability of network and computing capacities. However, the problems encountered with one of the most advanced and flexible load balancing environments showed that much development work has to be done in this area to make suitable load balancing generally available. The difference between local area and wide area cooperation will not disappear.
- Preparing the complex, existing applications for distributed heterogeneous processing showed that current parallel and distributed computing on workstation clusters suffers from insufficient concepts for naming, security, vendor independence and availability, which make remote processing, global file access, communication and management difficult: There is no useful or complete hierarchical naming scheme for countries, computing centers, hosts or network interfaces. Default directory paths for executables and users' homes differ, and user names or identifiers often must differ as well. Transparent remote file access (e.g. NFS, AFS) is mostly not available across cluster boundaries, and where it is available, performance and reliability are poor. Illegal data access and resource usage across the network are prevented by inflexible `fire wall' routers at the computing centers, which also render legitimate remote execution and communication more difficult. Heterogeneous computer architectures and versions of the UNIX operating system differ in definition files and library functionality; they use different byte orders and byte alignments for data structure representation (a small illustration follows at the end of this item). Differing executable and data file formats must be managed in addition to the version management of programs, configurations and data, and complex scripts are necessary to manage and distribute the versions across the network. Each time network lines or participating hosts become unavailable, manual configuration changes are necessary. Because the UNIX operating system is not a distributed operating system, there is no transparent way for parallel computing and communication between distributed processes. Hence, due to poor operating system support for addressing and communication within applications, and for automatic remote execution and load balancing, the parallelization, restructuring and tuning of relevant applications is quite difficult nowadays.
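As a small illustration of the byte order problem (an illustrative example, not code from the trials): a 32-bit value written as raw memory on a big-endian node is misread on a little-endian node unless both sides agree on a common wire representation, e.g. network byte order via explicit packing.

# Illustrative only: exchanging a 32-bit integer between nodes with different
# byte orders requires an agreed representation (here big-endian network order).
import struct

value = 0x12345678
wire = struct.pack("!I", value)        # pack in network (big-endian) byte order
print(wire.hex())                      # 12345678 on every architecture

print(struct.unpack("<I", wire)[0])    # 2018915346 (0x78563412) if the receiver
                                       # wrongly assumes little-endian order
print(struct.unpack("!I", wire)[0])    # 305419896 (0x12345678) when both sides
                                       # agree on network byte order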
Environments like PVM or DCE provide more transparency for distributed computing, but they have become broadly available only in the last two years and therefore still show performance and flexibility problems as well as remaining platform dependencies, and very few relevant applications use these environments so far. The difficulties in the many projects, like EUROPORT, that distribute and parallelize commercial applications in a presumably platform independent way show that this is a subject of current development and that the existing applications are very hard to re-engineer. Even parallelized applications are fixed to run on a certain array of nodes within one homogeneous parallel machine or within one local cluster only. Applications are not easily portable and are too inflexible to be distributed arbitrarily. Finally, once an application is parallelized, it requires much time, effort, and application as well as operating system experience to adjust task sizes, communication patterns and data layout in order to make it run efficiently on a certain configuration.
- Specific problems come along with distributed HPC when using wide area international broadband connections; they can be summarized in the following items:
The comparably large latency, the unreliability of both the broadband and the narrowband network, the relatively low effectively visible bandwidth of the broadband lines, and the current network costs made the trials less comprehensive and leave a gap between the trial results and the circumstances of real commercial usage. Large distributed HPC applications demand high bandwidth networks. They are not just isolated computations, but large, parallelized, data communication intensive applications, showing heavily cooperating tasks and large portions of remote data access. A proper exploitation of the European computing resources requires not just keeping all processors busy, but also considering the imposed network load. A suitable task distribution granularity and clustering of tasks is necessary and must be achieved as dynamically and automatically as possible.
Network performance: Overall, the existing trans European network lines show very poor performance. The broadband links that were set up during the project show a large improvement in terms of latency and throughput. However, there are still many tuning parameters at the switches and routers which will have to be adjusted for optimum network performance; this can sometimes be seen in the packet loss rates caused by suboptimal buffer sizing and message decomposition into frames and cells. In terms of throughput, the broadband links may reach conditions similar to usual LANs, but the latency will remain larger by orders of magnitude. The near future will provide more stable and efficient international connections, which should be able to reduce the long distance latencies by a factor of 10 compared to what is available today. The broadband connection was intended to be available continuously for 7 days, but was actually available on two of these days. This is characteristic of the state and maturity of current broadband pilot connections beyond intra-campus networks. Not only taking commercial advantage, but also providing a useful service for research projects, requires significantly more stability and availability. Another practical problem is the high latency, as well as the huge gap between the raw throughput of the network and the sustained throughput on the application level (illustrated by the small example below).
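The gap between raw and sustained application level throughput can be illustrated with a simple request/response model; the numbers below are assumptions for illustration, except the 44 msec latency reported for the Stuttgart - Toulouse pilot. When messages are exchanged synchronously, each message pays the round trip latency, so the sustained rate stays far below the line rate unless messages are large or sent asynchronously in bulk.

# Illustrative model: sustained throughput of synchronous message exchange over
# a link with a given raw bandwidth and round trip latency (assumed values).
def sustained_kbit_s(message_bytes, raw_kbit_s, rtt_s):
    serialization_s = message_bytes * 8 / (raw_kbit_s * 1000)
    return message_bytes * 8 / (serialization_s + rtt_s) / 1000

raw_kbit_s = 2000     # assumed raw line rate
rtt_s = 0.044         # the 44 ms latency of the pilot, used as per-message round trip cost
for size in (1_000, 10_000, 100_000):
    print(size, "bytes ->", round(sustained_kbit_s(size, raw_kbit_s, rtt_s)), "Kbit/s")
# 1 KB synchronous messages reach only a small fraction of the raw rate; only
# large or asynchronous bulk transfers come close to it.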
Currently, the main reasons for the poor availability are the network providers' lack of experience with the new technologies, the difficulties of coordination between the network providers and administrators along the path, and the still immature hardware and software quality of the new network technology.
- For all trials the same set of hosts was used, and the same number and sizes of applications were executed under the very same load balancing environment. The only differences were the characteristics of the underlying network, which the load balancing facility itself had to learn through measurements and through feedback from application behavior at runtime. Load balancing was able to squeeze remarkable throughput enhancements out of this coupled system by a sophisticated policy that exploited the network to a degree matching the trade-off between full CPU utilization and data communication costs. This one strategy successfully managed all three application types, which show quite different profiles in terms of task granularity, potential parallelism, data communication frequency and message sizes, and possible reference locality for global data access. Comparing the latency and throughput differences between the narrowband and the broadband lines with the corresponding application throughput differences on these WAN coupled systems, it becomes obvious that, provided a suitable data, task and cooperation layout of the applications and a suitable dynamic load balancing facility, the feasibility of building and using European meta computers strongly depends on the performance of the interconnection.
Some global observations from the load balancing trials corroborate common, intuitive expectations by these real measurements:
- Multi user HPC scenarios, i.e. computational load profiles with different, unrelated, concurrent tasks and applications, are more closely correlated with the network bandwidth, while dedicated application executions depend more heavily on the network latency.
- In local networks, where the latency is lower by more than an order of magnitude, network improvements do not result in a corresponding speedup of the HPC applications, because the latencies do not differ that much and latency is the major limiting factor there.
- The strong correlation of the end user's application performance with the underlying network power, which still holds for really high bandwidth lines, underlines the usefulness and the absolute necessity of establishing trans European broadband lines for this domain of HPC. However, it must be kept in mind that propagating the network enhancement up to the user requires serious, non-negligible efforts to adjust the applications and to integrate load distribution support into operating systems.
The ongoing research within the E=MC2 project concentrates on more detailed measurements, extended scenarios, more detailed network monitoring and closer end user integration. A prototype of a flexible brokerage service for end users and computing center administrators is the next step towards a real exploitation of the European high performance computing resources for commercial and scientific purposes.
Acknowledgments
The measurements could only be achieved with the help of many networking and computing engineers and scientists within and associated with the project. In particular, Paul Christ, Holger Fahner and Peter Feil (RUS Stuttgart) as well as Lothar Klein, Petra Link and Roland Völpel (GMD Bonn) helped to organize and administer these trials. The project team - Kevin Brooks, Andrew Hilborne, David Horne, Paul Huish (Octacon Middlesborough), Rick Mc Connel, Richard Rankin, Paul Sage (Queens Belfast), Franz Fabian, Uwe Geuder, Walter Strommer (IPVR Stuttgart), Frederic Carbonnel, Patrick Laporte (CERFACS Toulouse), Ali Kabou, Fabrice Magnan and Hubert Zeyssolff (Telmat Soultz) - planned, organized and performed the trials, the management and the publications. Many researchers, scientists and engineers responded to the questionnaires and contributed significantly with interesting comments and discussions.
References
[Beck93] W. Becker, Globale dynamische Lastbalancierung in datenintensiven Anwendungen, Faculty Report 1993-1, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems, 1993
[Beck94a] W. Becker, Das HiCon-Modell: Dynamische Lastverteilung für datenintensive Anwendungen auf Rechnernetzen, Faculty Report 1994-4, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems, 1994
[Beck94b] W. Becker, R. Pollak, Efficiency of Server Task Queueing for Dynamic Load Balancing, Faculty Report 1994-9, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems, 1994
[Beck94a] W. Becker, G. Waldmann, Exploiting Inter Task Dependencies for Dynamic Load Balancing, Proc. IEEE 3rd Int. Symp. on High-Performance Distributed Computing (HPDC), San Francisco, 1994
[Beck94b] W. Becker, J. Zedelmayr, Scalability and Potential for Optimization in Dynamic Load Balancing - Centralized and Distributed Structures, Mitteilungen GI, Parallele Algorithmen und Rechnerstrukturen, GI/ITG Workshop Potsdam, 1994
[Beck95a] W. Becker, Lastverteilung in Workstation-Netzen, BI Special Issue on Parallel Computing, RUS, University of Stuttgart, 1995
[Beck95b] W. Becker, Das HiCon-Modell: Dynamische Lastverteilung für datenintensive Anwendungen auf Rechnernetzen, Informatik Forschung und Entwicklung Vol. 10 No. 1, Springer Verlag, 1995
[Beck95c] W. Becker, G. Waldmann, Adaption in Dynamic Load Balancing: Potential and Techniques, Tagungsband 3. Fachtagung Arbeitsplatz-Rechensysteme (APS), 1995
[Beck95d] W. Becker, Dynamic Balancing Complex Workload in Workstation Networks - Challenge, Concepts and Experiences, Proc. Int. Conf. High Performance Computing and Networking (HPCN) Europe, LNCS, Springer Verlag, 1995
[Beck95e] W. Becker, Dynamische adaptive Lastbalancierung für grosse, heterogen konkurrierende Anwendungen, Ph.D. Thesis, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems, 1995
[Blum94] R. Blumofe, D. Park, Scheduling large-scale parallel computations on networks of workstations, Proc. High Performance Distributed Computing, USA, 1994
[Buba94] M. Bubak, J. Moscinski, R. Slota, Implementation of parallel lattice gas program on workstations under PVM, Parallel Scientific Computing, LNCS 879, Springer Verlag, 1994
[Cap93] C. Cap, V. Strumpen, Efficient parallel computing in distributed workstation environments, Parallel Computing, Vol. 19, No. 11, 1993
[Colb95] A. Colbrook, J. Elliott, M. Lemke, F. Wray, EUROPORT2 - ESPRIT European Porting Action No 2, HPCN Europe, Italy, 1995
[EMC94a] P. Huish (Ed.), European Meta Computing Utilising Integrated Broadband Communications - Trials & Applications Descriptions, Deliverable CEC Project B2010 TEN-IBC E=MC2, 1994
[EMC94b] P. Huish (Ed.), European Meta Computing Utilising Integrated Broadband Communications - Evaluation Plan, Deliverable CEC Project B2010 TEN-IBC E=MC2, 1994
[EMC95a] P. Huish (Ed.), European Meta Computing Utilising Integrated Broadband Communications - Interim Report, Deliverable CEC Project B2010 TEN-IBC E=MC2, 1995
[EMC95b] P. Huish (Ed.), European Meta Computing Utilising Integrated Broadband Communications - Evaluation Report, Deliverable CEC Project B2010 TEN-IBC E=MC2, 1995
[EMC95c] E=MC2 consortium, The European Meta Computing Utilizing Integrated Broadband Communications (E=MC2) Project, Proc. Int. Conf. High Performance Computing and Networking (HPCN) Europe, LNCS, Springer Verlag, 1995
[Gosc91] A. Goscinski, Distributed Operating Systems: The Logical Design, Addison-Wesley, 1991
[Hand91] R. Handel, M. Huber, Integrated Broadband Networks - An Introduction to ATM-Based Networks, Addison-Wesley, England, 1991
[Horn94] D. Horne, The E=MC2 Project, ULCC NEWS, University of London Computing Centre, 1994
[Lin94] M. Lin, J. Hsieh, D. Du, J. Thomas, J. MacDonald, Distributed Network Computing over Local ATM Networks, Proc. Supercomputing '94, USA, 1994
[McCa94] J. McCabe, ATM in a Supercomputer Network Environment, Tutorial, Supercomputing '94, USA, 1994
[Mech95] C. Mechoso, High-Performance Computing and Networking for Climate Research, HPCN Europe, Italy, 1995
[Mier95] H. Mierendorff, K. Stüben, C. Thole, O. Thomas, Europort-1: Porting industrial codes to parallel architectures, HPCN Europe, Italy, 1995
[Pozz95] E. Pozzetti, G. Serazzi, Characterizing the resource demands of TCP/IP, HPCN Europe, Milan, 1995
[Rahm93] E. Rahm, R. Marek, Analysis of Dynamic Load Balancing Strategies for Parallel Shared Nothing Database Systems, Proc. 19th VLDB Conference, Ireland, 1993
[Roy94] M. Roy, TEN-IBC '95, European Commission Directorate XIII B Advanced Communications Technologies and Services, RA947242, 1994
[Stru95] V. Strumpen, T. Casavant, Exploiting communication latency hiding for parallel network computing, Int. Conf. on Parallel and Distributed Systems, Taiwan, 1995
[Tane85] A. Tanenbaum, R. van Renesse, Distributed Operating Systems, Computing Surveys, Vol. 17, No. 4, 1985
[TEN94a] TEN-IBC, TEN-IBC Concertation Proceedings, European Commission Directorate XIII B Advanced Communications Technologies and Services, 1994
[TEN94b] TEN-IBC, TEN-IBC Baseline Document, European Commission Directorate XIII B Advanced Communications Technologies and Services, RA947538, 1994
[Tolm95] D. Tolmie, Gigabit LAN issues - HIPPI, fibre channel, or ATM?, HPCN Europe, Italy, 1995
[Wolm94] A. Wolman, G. Voelker, C. Thekkath, Latency Analysis of TCP on an ATM Network, Proc. Winter USENIX Conference, USA, 1994