Universität Stuttgart Fakultät Informatik ATOMAS: A Transaction-oriented Open Multi Agent-System. Final Report M. Straßer, J. Baumann, F. Hohl, M. Schwehm, K. Rothermel Bericht 1998/11 Juni 1998 ATOMAS: A Transaction-oriented Open Multi Agent-System. Final Report Authors: Dipl.-Inform. M. Straßer Dipl.-Inform. J. Baumann Dipl.-Inform. F. Hohl Dr. M. Schwehm Prof. Dr. K. Rothermel Institut für Parallele und Verteilte Höchstleistungsrechner (IPVR) Fakultät Informatik Universität Stuttgart Breitwiesenstr. 20 - 22 D-70565 Stuttgart ATOMAS: A Transaction-oriented Open Multi Agent-System Final Report A Project Funded By Tandem Inc., Cupertino 22.6.1998 I V P R Authors: Prof. Dr. K. Rothermel Dr. M. Schwehm Dipl.- Inform. J. Baumann Dipl.- Inform. F. Hohl Dipl.-Inform. M. Straßer Institute for Parallel and Distributed High Performance Systems University of Stuttgart Breitwiesenstraße 20-22 D-70565 Stuttgart Contents 2 Contents Section 1: Introduction 6 Section 2: Workplan and Project State 7 WP 2.1: Design and Implementation of Agent Migration 7 WP 2.2: Requirement Analysis Concerning Security 7 WP 2.3 Recoverable Agents 7 WP 2.4 Developed Concepts and Implementation 7 Orphan Detection for Mobile Agent Systems 8 Section 3: WP 2.2: Security 9 Abstract 9 Introduction 9 The Problem of Malicious Hosts 10 Existing Approaches 13 Blackbox Security: The Idea 14 Mobile Cryptography 15 Time Limited Blackbox Protection 16 What Is Changing If the Blackbox Is Time Limited? 16 No communication with a third party 16 Communication only with trusted servers 17 Communication with untrusted servers 17 Migration of the agent 19 How Can We Reach Time Limited Blackbox Protection? 19 Agent mess-up algorithms 20 Agent attributes that can be modified 20 Statements 20 Data 21 Abilities and characteristics of the attacker 21 Examples for mess-up algorithms 21 Contents 3 Variable Recomposition 22 Conversion of Control Flow Elements into Value-Dependent Jumps 22 Deposited Keys 23 Counter attacks 23 Variable Recomposition 23 Conversion of Control Flow Elements into Value-dependent Jumps 24 Deposited Keys 24 Problems with mess-up algorithms 24 How can a blackbox protected agent be created? 25 Recharging of protected agents 25 Which Other Attacks by Malicious Hosts Can Be Prevented Using Blackbox Protection? 26 New Attacks: Sabotage and Blackbox Testing 26 What Blackbox Security Costs 27 Conclusions and Future Work 27 Bibliography 28 Section 4: WP 2.3: Recoverable Agents 30 A Fault-Tolerant Protocol for Providing the Exactly-Once Property of Mobile Agents 30 Introduction 30 Agent Execution Model 31 A Simple Solution 32 Protocol Overview 33 Voting Protocol 36 Monitoring and Selection Protocol 42 Blocking Probability and Message Complexity 43 Optimizing the Stage Construction 45 Related Work 49 Conclusion and Future Work 51 Appendix 52 Concepts for a Reliable and Scalable Agent Server 55 Requirements 55 Architecture of the Agent System 55 Lifecycle of an Agent 56 Contents 4 Messages 58 Remote Method Invocation 60 Migration 60 Current State 60 Bibliography 60 Section 5: WP 2.4: Developed Concepts and Implementation 64 Mole 3.0: A Middleware for Java-Based Mobile Software Agents 64 Introduction 64 Mole System Overview 65 Lifecycle of an Agent 66 Agent Migration 67 Agent Communication and the Session Concept 68 Badges 68 Sessions 68 Agent Infrastructure 71 Resource Manager 71 Directory Service 72 Security Model 72 Graphical Agent Monitor 72 Implementational Issues 73 Agent Identifiers and Name Resolution 73 Thread Management Unit 74 Using Java-Enabled Web Browsers to Run Mobile Agents 75 Installation 76 System Requirements 76 Configuration Files 77 Starting a Sample Agent 77 Conclusions and Future Work 78 Section 6: WP 2.5: An Orphan Detection Protocol for Mobile Agents 81 Introduction 81 Contents 5 The Agent Model 82 The Shadow Protocol 82 The Idea 82 The Protocol 84 Mobile Shadows 86 Optimizing the Communication 88 Fault Tolerance 89 Related Work 90 Conclusion and Future Work 91 1 Introduction 6 1 Introduction The electronic marketplace of the future will consist of a large number of services located on an open, distributed and heterogeneous platform, which will be used by an even larger number of clients. Mobile Agent Systems are considered to be a precondition for the evolution of such an electronic market. They can provide a flexible infrastructure for this market, i.e. for the installation of new services by service agents as well as for the utilization of these services by client agents. Mobile Agent Systems basically consist of a number of locations and agents (see Figure 1). Locations are (logical) abstractions for (physical) hosts in a computer network. The network of locations serves as a unique and homogeneous platform, while the underlying network of hosts may be heterogeneous and widely distributed. Locations therefore have to guarantee independence from the underlying hard- and software. To make the Mobile Agent System an open platform, the system furthermore has to guarantee security of hosts against malicious attacks. (User) Agents are active, autonomous software objects, that reside (and are processed) on locations. They can communicate with other agents either locally inside one location or globally with agents on other locations. Mobile Agents furthermore can migrate from one location to another. Mechanisms for the communication between agents and for the migration of agents have to be provided by the Mobile Agent System. Service Agents are interfaces to services. Next to the normal communication mechanisms between agents of the mobile agent system, they have access to services provided by the underlying host. Because of their machine dependent purpose, service agents are not mobile. The Atomas project aims in developing an open agent system as an enabling technology for the evolution of a electronic marketplace. This report documents the achieved results of the second year. Section 2 provides an overview over the objectives of the work packages for the first year and their completion state. Sections 3-??? present the results in detail. migration location 2 location 3 application service agent Hel lo, user agent Figure 1.1. Mobile Agent System 2 Workplan and Project State 7 2 Workplan and Project State In the second year, the agent system architecure developed in the first year had to be extended to support migration and to provide reliable agent execution. Furthermore, requirements concerning security issues had to be investigated and the concepts developed during the project had to be evaluated. Therefore, four work packages had been identified (Section 2.1 - Section 2.4 ). In addition orphan detection for agents has been identified as a very important functionality to be supported by mobile agent systems and thus an additional work package (Section 2.5) has been added that concentrates on this aspect. The following sections contain a short introduction into the work packages and the completion state of these work packages. 2.1 WP 2.1: Design and Implementation of Agent Migration In work package 1.2, a special form of migration, called weak migration, had already been developed during the first year. This form of migration has the advantage that it can be implemented without changing the Java virtual machine and therefore allows to run the developed agent system on each architecture for which a Java virtual machine is available. Our experience in using this form of migration in several student thesises proved the validity of this concept. Therefore, no further work has been invested in this topic in the second year. 2.2 WP 2.2: Requirement Analysis Concerning Security In the first year, driven by the importance of the security aspect of mobile agent systems, we already started this work package. Section 7 of the last years report investigates the security aspects of Mobile Agent Systems and identifies the problem areas which has to be handled. In Section 3 of this report, a approach to partially solve one of the most difficult aspects of security of mobile agents systems,the problem of malicious hosts, is presented. This problem consists in the possibility of attacks against a mobile agent by the party that maintains an agent system node, a host. 2.3 WP 2.3 Recoverable Agents An important prerequisite for the use of mobile agents in industrial environments is to provide reliable and fault tolerant execution of the agents. Section 4 describes two different approaches to provide the required reliability. In the first approach, fault-tolerant execution of agents on unreliable systems is provided. Section 4.1 summarizes two published papers on this topic. In the second approach, the reliability of agents is provided by using the TUXEDO platform on Tandem Himalaya to build the agent system. 2.4 WP 2.4 Developed Concepts and Implementation In this work package, the developed concepts are summarized and implemented within the mobile agent system Mole 3.0. Next to the concepts for agent communication and agent migration 2 Workplan and Project State 8 designed and implemeted in the previous year, version 3.0 of Mole now provides an extensive infrastructure for agents, including a resource manager, a directory service and a global naming scheme for agents. In order to support the design of agents, a graphical agent monitor allows to visualize the system behaviour as a whole or of a single agent in particular. Mole further provides a thread management unit to overcome some shortcomings of the Java virtual machine. Mole provides a simple means for installation and configuration of the system. 2.5 Orphan Detection for Mobile Agent Systems Orphan detection in an agent system is very important both from the user's and from the system side, because a running agent uses resources which are valuable to both user and system. The user has to pay for resources (at least in principle), and the system has only a limited amount of them. So if the user does not need the results of a distributed computation in progress anymore, he wants to be able to terminate the computation to minimize the resulting cost. With an orphan detection mechanism the user simply declares the agents to be terminated as orphans. Orphan detection guarantees that the now useless agents can be determined by the system and ended, thus freeing the resources they have bound. In this paper we will present a new protocol, the shadow protocol, that allows both control of mobile agents and orphan detection. 3 WP 2.2: Security 9 3 WP 2.2: Security This section summarizes a paper published in [Vig98]. 3.1 Abstract In this report, an approach to partially solve one of the most difficult aspects of security of mobile agents systems is presented, the problem of malicious hosts. This problem consists in the possibility of attacks against a mobile agent by the party that maintains an agent system node, a host. The idea to solve this problem is to create a blackbox out of an original agent. A blackbox is an agent that performs the same work as the original agent, but is of a different structure. This difference allows to assume a certain agent protection time interval, during which it is impossible for an attacker to discover relevant data or to manipulate the execution of the agent. After that time interval the agent and some associated data get invalid and the agent cannot migrate or interact anymore, which prevents the exploitation of attacks after the protection interval. 3.2 Introduction Mobile agent systems are expected to become a possible base platform for an electronic services framework (see e.g. [Mob96]), especially in the area of Electronic Commerce. In this application area, security is a crucial aspect since all parties involved require the confirmation that none of the other parties will break the rules without being punished. This requirement is not always fulfilled even in the traditional, non-electronic commerce. The anonymity of a worldwide communication network and the ease of automatic exploitation of security gaps in electronic applications make it necessary to meet this demand in the area of commercial transactions done by computers. Mobile agents are entities that consist of code, data and control information (e.g. thread states). Mobile agent systems are platforms that allow mobile agents to migrate between different nodes of the agent system. From a more technical view, mobile agents can be compared to programs that migrate to nodes autonomously, while nodes offer the run-time environment of these programs including the program interpreters. As in Mobile Code systems (e.g. the Java applet system), one aspect of security is the protection of the node, or host, against possible attacks of the mobile agent. Therefore, some of the security mechanisms developed in this field can also be applied to mobile agent systems. An example is sandbox security, i.e. the need of authorizing security-sensitive commands like the deletion of a file by a designated component. Other security mechanisms like authentication of single agents instances do not have a counterpart in mobile code systems and have to be designed using standard cryptographic techniques like encryption or digital signatures. The reverse security issue, the protection of a mobile agent from possible attacks by a malicious host, is new as there are barely other areas where this aspect is important. Nevertheless, the protection of mobile agents from malicious hosts is ? at least from the viewpoint of the owner of the agent ? as important as the protection of the host from malicious agents. As we will see, apart from organisational solutions, no technical approaches to solve this problem without spe- 3 WP 2.2: Security 10 cial secure hardware exist so far. The solubility of this problem which is called the problem of malicious hosts is even estimated to be very low [FGS96]. This report presents an approach to solve most of the aspects of the problem of malicious hosts. This approach will cost both execution time and communication bandwidth and will require some time-critical restrictions, but gives the agent the possibility to do some security sensitive work without the danger of an immediate exploitation of sensitive data by the host. 3.3 The Problem of Malicious Hosts The fact that the runtime environment (the host) may attack the program (the agent), plays hardly a role in existing computer systems. Normally, the party that maintains the hosts also employs the program. But in the area of open mobile agents systems, an agent is operated in most cases by another party, the agent owner. This environment leads to a problem, that is vital for the usage of mobile agents in open systems: the problem of malicious hosts. A malicious host can be defined in a general way as a party that is able execute an agent that belongs to another party and that tries to attack that agent in some way. The question of what action is considered to be an attack depends on the question which assurances an agent owner needs in order to use a mobile agent. If we try to achieve a protection level that is comparable to the one of agents that run on non-malicious, or trusted hosts, we can identify the following attacks: 1. spying out code 2. spying out data 3. spying out control flow 4. manipulation of code 5. manipulation of data 6. manipulation of control flow 7. incorrect execution of code 8. masquerading of the host 9. denial of execution 10.spying out interaction with other agents 11.manipulation of interaction with other agents 12.returning wrong results of system calls issued by the agent To illustrate these attacks we will use a small purchase agent as an example. The purchase agent contains a data and a code block. Entries in the data block may include: Address home = ?PDA, sweet PDA? Money wallet = 20$ float maximumprice = 20.00$ good flowers = 10 red roses Address shoplist[] = empty list int shoplistindex = 0 float bestprice = 20.00$ Address bestshop = empty 3 WP 2.2: Security 11 The central procedure startAgent, that is called by the host every time the agent arrives, could look like this: Using this example, the attacks listed above can be illustrated. 1. Spying out code The code of the agent has to be readable by the host. Although this requirement can be restricted to the next instruction at a single point of time, this does not solve the problem since some hosts see almost all of the code because they execute most of the commands. In our example the host visited last executes nearly all the code. If the agent code is characteristic not only for a single, but a whole class of agents, the whole code of the agent may be known even before execution time. If an agent is generated out of standard building blocks (which is a good idea regarding code migration costs and ease of agent construction), the detail specification is available for building blocks like libraries or classes. Furthermore, these blocks can be explored by blackbox tests. Knowing the code leads to knowledge about the execution strategy of the agent, knowledge about the exact physical structure of code and data in the memory of the host and sometimes (by using data statements like initial variable assignments) to knowledge about parts of the agent data. 2. Spying out data The threat of a host reading the private data of an agent is very severe as it leaves no trace that could be detected. This is not necessarily true for the consequences of this knowledge, but they can occur a long time after the visit of the agent on the malicious host. This is a special problem for data classes such as secret keys or electronic cash, where the simple knowledge of the data results in loss of privacy or money. In our example, the money variable would be security sen- 1 public void startAgent() { 2 3 if (shoplist == null) { 4 shoplist = getTrader(). 5 getProvidersOf(?BuyFlowers?); 6 go(shoplist[1]); 7 break; 8 } 9 if (shoplist[shoplistindex]. 10 askprice(flowers) < bestprice) { 11 bestprice = shoplist[shoplistindex]. 12 askprice(flowers); 13 bestshop = shoplist[shoplistindex]; 14 } 15 if (shoplistindex >= (shoplist.length - 1)) { 16 // remote buy 17 buy(bestshop,flowers,wallet); 18 // go home and deliver wallet 19 go(home); 20 if (location.getAddress() = home) { 21 location.put(wallet); 22 } 23 } 24 go(shoplist[++shoplistindex]); 25 }} 3 WP 2.2: Security 12 sitive when it is represented in a way that the binary number of the ?coin? is the money and therefore can be used as real world cash. But there are also other classes of data, which can be used for an attack although they have not the nature of classes like e-cash. In our example, the knowledge of the maximum price or the best price so far can be used by a malicious host to offer flowers for a slightly lower amount than the competitors, although the regular price is much lower. 3. Spying out control flow As soon as the host knows the entire code of the agent and its data, it can determine the next execution step at any time. Even if we could protect the used data somehow, it is rather difficult to protect the information about the actual control flow. This is a problem, because together with the knowledge of the code, a malicious host can deduce more information about the state of the agent. In our example, we can recognize whether an offer is better or worse than the best offer so far by simply watching the control flow, even if we could not read any data. 4. Manipulation of code If the host is able to read the code and if it has access to the code memory, it can normally modify the program of an agent. It could exploit this either by altering the code permanently, thus implanting a virus, worm or trojan horse. It could also temporarily alter the behaviour of the agent on that particular host only. The advantage of the latter approach consists in the fact, that the host to which the agent migrates cannot detect a manipulation of the code since it is not modified. Applied to our example, a malicious host could modify the code of the agent with the effect that it prefers the offer of a certain flower provider, regardless of the price. 5. Manipulation of data If the host knows the physical location of the data in the memory and the semantics of the single data elements, it can modify data as well. In our example, the host could cut down the shop list after setting the offer of the local flower provider as the best offer. 6. Manipulation of control flow Even if the host does not have access to the data of the agent, it can conduct the behaviour of the agent by manipulating the control flow. In our example, the host could simply alter the flow at the second or third if statement, forcing the agent to choose the offer of the shop preferred by the host as the best. 7. Incorrect execution of code Without changing the code or the flow of control, a host may also alter the way it executes the code of an agent, resulting in the same effects as above. 8. Masquerade It is the liability of a host that sends an agent to a receiver host to ensure the identity of that receiver. Still, a third party may intercept or copy an agent transfer and start the agent by masking itself as the correct receiver host. A masquerade will probably be followed by other attacks like read attacks. 9. Denial of execution As the agent is executed by the host, i.e. passive, the host can simply not execute the agent. This 3 WP 2.2: Security 13 can be used as an attack e.g. in the case that a host knows about a time limited special offer of another host. The host simply can prevent the detection of this offer by the agent by delaying its execution until the offer expires. 10. Spying out interaction with other agents The agent may buy the flowers remotely from a shop situated on another host. If the interaction between agent and the remote flower shop is not protected, the host of the agent is able to watch the buy interaction even in case the host cannot watch the execution of the agent. In our example, the host could read e.g. wallet and spend the stored money. 11. Manipulation of interaction with other agents If the host can also manipulate the interaction of the agent it can act with the identity of the agent or mask itself as the partner of the agent. In our example the host can e.g. redirect the buying interaction to another shop, or it can interrupt the interaction e.g. to prevent spending the money by the agent. 12. Returning wrong results of system calls issued by the agent In line 20 of the example code (?if (location.getAddress() = home)?), the agent requests the name of the current location. Here the host could mask itself as the agent's home location by returning the corresponding address. The agent then thinks that it is at home and delivers the wallet to the host. After stating the problem we will now have a look on possible solutions. First we will examine some approaches that try to prevent single attacks. In the next section we will see an approach that try to restore the autonomy of the agent, the so called blackbox approach. 3.4 Existing Approaches As mentioned above, a malicious host is defined as a party that is able execute an agent that belongs to another party and that tries to attack that agent in some way. This also means that malicious hosts are only a problem for agents that cannot trust a host in advance. In this case trust means, that the owner either knows or hopes that the operator will not attack. Therefore, some approaches (see e.g. [FGS96]) exist that try to circumvent the problem of potentially malicious hosts by not allowing agents to move to non-trusted hosts. There are also approaches that use a trust approach to protect hosts from agents by not allowing to accept agents that have been on non-trusted hosts before. The problem of these approaches are that trust in this context is absolute (you do not hide anything from a trusted node), and that it is not always clear in advance whether a host is trusted or not. This can severely reduce the number of hosts an agent might migrate to. Even if an owner trusts a big company when it comes e.g. to accounting, it may not want them to see its secret communication key. If an agent has to obtain prices for a flight, it cannot trust the host of an air line or any other host that is maintained by a company related to an air line and so forth. Another ?trust? approach is the organizational solution: the agent system is not open in the sense that everybody can open a host, but only trustworthy parties can operate hosts. This is the approach General Magic [GM96] used for its agent system application, e.g. PersonaLink1, that was operated by AT&T [Mob96]. 3 WP 2.2: Security 14 As trust is a relationship between agent and host which often cannot be determined in advance, a commonly used notion of trust, reputation, is used in another approach [RJ96]. This also is problematic, as we have seen that trust depends on the task an agent has to fulfill. A reputation approach, where betrayed agents can complain about malicious hosts, that in turn, lose reputation, can also result in a new security problem. Agents could attack hosts using a ?character assassination? attack, by simply complaining about being betrayed. Another approach [Vig97] enables an agent to detect and prove modification attacks in order to allow the owner to use legal or organizational ways to get its damage refunded. But this approach cannot prevent other attacks, and it assumes an organizational or legal framework for an agent system. In the first case, such an organizational framework may not exist in an open agent system without a central organization. In the second case it seems to be not realistic to assume such a legal framework on an international level, since also other laws required by new technologies, e.g. for data protection and privacy, are far from being homogeneous or even widespread. Since the problem is the wrong behaviour of the executing environment, in contrary to a behaviour that meets the specification, another class of approaches (e.g. [Pal94]) uses specialized, attack-proven hardware that can ensure its integrity. These approaches therefore require the usage of this hardware in every host, which is currently a too restricting assumption. As the presented approaches either do not protect from all the attacks, or do not allow open mobile agent systems, a more adequate approach is needed. 3.5 Blackbox Security: The Idea In this section we will discuss an approach that is able to protect an agent from most of the attacks mentioned in the first section. The central idea of this approach is to generate an executable agent from a given agent specification which cannot be attacked by read or manipulation attacks. This agent is considered to be a ?blackbox?, if the following applies: Def: Blackbox Property - an agent is a blackbox if: 1. at any time 2. code and data of the agent specification cannot be read 3. code and data of the agent specification cannot be modified If this definition can be applied to an agent, only input to and output from the blackbox can be observed. The ?conversion mechanism? that generates an agent with the blackbox property uses configuration parameters that allow to create different blackboxes out of the same specification (see Figure 3.2). These parameters allow to prevent dictionary attacks. Dictionary attacks guess the attributes of the blackbox by converting a number of agent specifications and compare the cre- 1. PersonaLink was a service that allowed users to send electronic mails that carried agents. It was based on the Telescript mobile agent system. Input Output agent alpha Figure 3.1. Blackbox property 3 WP 2.2: Security 15 ated blackboxes with the attacked one. If an agent fulfills the blackbox property defined above, it is autonomous in the sense that if a hosts executes that agent, the host cannot interfere with this execution in a directed way. If an agent reaches that level of autonomy, it can be protected from other attacks. Masking of the host or reading and manipulating the interaction of the agent with other parties can then be prevented by using conventional mechanisms from the area of stationary distributed systems. The problem now is to ensure the blackbox property. Currently, there is no known algorithm to fully provide blackbox protection even if one other approach exists that seems to proceed in this direction. It is called Mobile Cryptography. 3.6 Mobile Cryptography This approach does not call itself a blackbox approach, but it can be classified in this category. Sander and Tschudin describe in [ST98a] and [ST98b] a way to use encrypted programs as a means to protect agents from malicious hosts. Encrypted programs are programs that consist of operations that work on encrypted data. Agents are produced by converting a agent specification into some executable code plus initial, encrypted data. Since the attacker cannot break the encryption of the data, it cannot read or manipulate the original data. See [ST98a] for a detailed description of the Mobile Cryptography approach. The advantages of this approach over the one that will be presented in the next section are: - the protection of the agent is easily provable - the costs of the protection are probably small - the protection is not time-limited The current restrictions of the Mobile Cryptography approach are: - random programs cannot be used as the input specification; currently only polynomial and rational functions can be used for this purpose - the interaction model of the agent suffers the restriction that cleartext data can be sent only to trusted hosts The extension of the approach to recursive functions and Turing machine program equivalent mechanisms are subject to future work. As soon as the latter can be used as an input to the conversion function, encrypted programs have also the blackbox property. However, even now most of the aspects described in this article, which do not rely on the specific conversion mechanism, apply also for encrypted programs. agent specification conversion mechanism executable agent parameters (blackbox) Figure 3.2 Blackbox approach 3 WP 2.2: Security 16 The second restriction (cleartext data can be sent only to trusted hosts) is not mentioned explicitly. Still, receivers can only read encrypted output of the agent when they know the decryption function (which includes a potential key). If an attacker is able to decrypt the output of an protected agent, it is likely that it can also attack the agent itself. 3.7 Time Limited Blackbox Protection As we have seen, the only known approach that tries to provide fully blackbox protection is currently not applicable to every existing agent. In order to remove this restriction, we redefine the blackbox property definition in a way which differs in the statement about how long the blackbox property is valid. Now we do not assume that the protection holds forever, but only for a limited, known minimal time interval known in advance. Therefore the definition is now: Def: Time Limited Blackbox Property - an agent is a blackbox if: 1. for a certain known time interval 2. code and data of the agent specification cannot be read 3. code and data of the agent specification cannot be modified - attacks after the protection interval are possible 4. but these attacks do not have effects To make the protection time interval explicit, an expiration date is attached to the blackbox. Although this definition is weaker than the original blackbox property and results, as we will see, in more complex mechanisms, it has one big advantage: there is a way to achieve this. Before this way is sketched, we examine what changes if blackboxes are time-limited. 3.8 What Is Changing If the Blackbox Is Time Limited? For achieving the requirement that attacks after the protection interval do not have effects, we have to examine the circumstances under which time limitedness affects processing. To do this, four different interaction scenarios are introduced. It will be argued that effects of an attack can only occur when information of the agent is communicated to third parties. 3.8.1 No communication with a third party In this scenario neither the agent nor the host communicates with a third party. Although this is a merely academic setting, it demonstrates that the temporal aspect is of no importance in this Input Output agent alpha will expire in: 13305 ms Figure 3.3 Time l. blackbox property 3 WP 2.2: Security 17 context. Even if the host successfully attack the agent, nothing results from these attacks. 3.8.2 Communication only with trusted servers Here the agent communicates only with a trusted third party. A party can be considered as trusted if this party never attacks. These two partners can establish a secure communication channel to prevent attacks by the host. Time limitedness of the agent plays a role in this scenario since the communication partner has to know whether it can still trust the agent or not. If the host would have been able to attack the agent, the attacker could use the agent to mask itself as the agent. Since attacks can only take place after the protection interval, the trusted server has to know the expiration date associated with the agent before it starts communication. This can be done using an extended key certificate (see Figure 3.6). The resulting overhead is acceptable since secure communication requires already authentication of the partners 3.8.3 Communication with untrusted servers In this scenario the agent communicates with either an untrusted third party or with the host, which is by definition untrusted (see Figure 3.7). We have to distinguish two kinds of data that can be communicated: token and non-token data. Agent Host Figure 3.4 No communication Figure 3.5 Communication only with trusted server Figure 3.6 Extended key certificate Version Serial Number Certification Alg. Certification Par. Name of CA Validity of Certific. Name of Agent Expiration Date Public Key of Agent Signature of CA Agent Host Server Trusted Figure 3.5 Communication only with trusted servers Figure 3.6 Extended key certificate 3 WP 2.2: Security 18 Token data are self-contained documents that depend on the identity of the issuer. Therefore they often bear digital signatures. Examples for tokens are electronic money coins, secret keys and capabilities. The problem with tokens is, that an attacker may use or trade them without having obtained them regularly. Therefore, also tokens have to bear expiration dates to prevent the usage of tokens that could have been obtained by attacking the agent. Every party that receives a token by another party thus has to check whether the expiration date of the token has passed or not. To do that, this party has to be able to get the correct global time. This means that time limitedness always require synchronized clocks. Note that it is not necessary for the party that sends a token to know the current time. Only the party that issues a token and the party that receives a token have to have this information. The issuer needs it to add the protection interval to it. The receiver needs the current time since if a party accepts an outdated token, no other party will accept it in return. The drawback of the expiration date is, that a token cannot be protected after the expiration date. Thus, tokens which need a larger protection interval must not be transported by the agent. This can be the case for some existing token systems that do not include expiration aspects or which cannot be extended by this aspect. A good example for tokens that cannot be protected in agents are secret keys of an agent owner since they are valid normally for a long time. Non-token data is everything else. Examples for this category are simple values that do not need to be protected and values that are security sensitive like the maximum price. The blackbox property guarantees that they cannot be read or modified before the expiration date has passed. They cannot be used against the agent or its owner since they not depend on the identity of the issuer. Since non-token data cannot be used to interact with third parties, it does not need to be protected against modification attacks after the protection interval. Although nothing has to be done to protect non-token data, there is a restriction for these elements: an agent must not transport non-token data, that can be used to attack the owner of the agent and whose protection interval has to be larger than the lifetime of the agent. An example for such data could be a variable describing the maximum price for a good that is valid for all purchasing agents of a user ever used. Fortunately, data elements with a larger protection need do not seem to occur very frequently in reality. Note that this scenario does include both planned interaction of the agent with an untrusted party or unplanned interaction by an attack of the host. Since an ?unplanned interaction?, i.e. a read attack by the host can only take place after the expiration date, all allowed tokens are also Agent Host Server Untrusted Server Trusted Token Token Figure 3.7 Communication with untrusted servers signature Data Issuer Exp.date Figure 3.8 Token structure 3 WP 2.2: Security 19 outdated then and non-token data does not have to be protected any longer due to the mentioned restriction. 3.8.4 Migration of the agent The last scenario comprises the remaining possibility to explicitly communicate agent information: the migration of the whole agent to a new host. The problem here is, that the agent may have been ?overtaken? or tampered by the host after the expiration date. Although it is unlikely that the code of the agent was manipulated by the host since it is rather easy to protect constant code from manipulation attacks by using digital signature techniques, an attacker could have been altered values that are not protected by the signature, i.e. mainly variable data. Therefore, the receiving host has to ensure that the arriving agent is still valid, i.e. that its expiration date has not passed already. As we have seen the agent will probably be protected by a signature, and all we have to do is to either include the expiration date into the constant part of the agent allowing the signature to also protect this date or to use the extended key certificates we introduced above. The receiver then can simply check the signature of the agent and the validity of the agent by checking the expiration date. As we have seen, it is possible to compensate most of the effects that occur when agents are subject of time-limitedness. The next question to answer is how such a protection can be reached. 3.9 How Can We Reach Time Limited Blackbox Protection? The lack of approaches that protect agents from host attacks is based on the observation that a host is always able to read every bit of the memory and the content of every variable and to know the memory location of every line of code. Therefore, some authors conclude, it is impossible to prevent e.g. read attacks. While this observation is always true for the ?semantics in the small?, i.e. the meaning of these elements for the next execution step, it is not necessarily true for the ?semantics in the large?, i.e. the meaning of these elements to the overall semantics of the application. An example for this difference is the code in Figure 3.10 where you can of course put the finger on every statement and every variable, but to explain the meaning of a statement or a variable in relation to the overall result, you have to think about it (the code fragment computes the difference of two two-bytes-numbers). Host Agent Host A B Figure 3.9 Agent migration w[6] = b[3] - b[5]; w[7] = b[2] * 256; w[8] = w[7] + w[6]; w[5] = w[8] - b[4] * 256; b[0] = w[5] DIV 256; b[1] = w[5] MOD 256; Figure 3.10 A code fragment 3 WP 2.2: Security 20 This effect results from the fact, that this overall semantics is not expressed by code, but by the ?mental model? of the programmer or the reader of a program. To attack an agent, the human attacker has to have such a mental model of the code in order to find certain points in the code or values that are interesting for the attacker. The central idea now is not to allow an attacker to build such a mental model of the agent in advance, i.e. before the agent arrives, and to make the process of building this model a timeconsuming task. The first goal is reached by creating a new ?form? of the agent dynamically, in an unpredictable, manner at the start of the protection interval. The second goal is reached by using conversion algorithms that produce a new form that is hard to analyse. In this context hard means that the analysis should take as much as time as possible. These conversion algorithms are therefore called obfuscating or mess-up algorithms. Note that the approach does not assume that it is impossible for the attacker to analyse the agent, the analysis simply takes time. The assumption is that a lower bound of this time can be determined and that this time interval is large enough for most agent applications on one host. 3.9.1 Agent mess-up algorithms The task of a mess-up algorithm is to generate a new agent out of an original agent, which differs in code and data representation but yields the same results. This means, that the specification of the agent is given as an executable, unprotected agent. Agents consists of executable code and some data. To prevent dictionary attacks (see Section 2), it has to use a random parameter that allows the algorithm to create different new agents out of a single original one (see Figure 3.11). To achieve the requirement that a blackbox protected agent that is hard to analyse, the designer of a mess-up algorithm has to take into account two key aspects: the attributes of an agent that can be modified and the abilities and characteristics of the attacker 3.9.2 Agent attributes that can be modified Statements A statement has a type and a location in a program (it also consists of data, but this aspect is viewed below). The type of a statement can be hidden until the statement is executed by dynamically creating it at runtime. This is possible by using e.g. self-modifiable code. The location of a statement can also be hidden, either implicitly by using dynamic code creation or explicitly by hiding a certain statement into other statements. original agent mess-up algorithms blackboxed agent random parameters Figure 3.11 Time-limited blackbox approach 3 WP 2.2: Security 21 Data Data, i.e. variables and constants, consist of a type, a value and a location. The type of a data element can be hidden until the data is needed. This is even normal for languages that use dynamic typing as e.g. Smalltalk. The value of a data element can be hidden. One way to achieve this is to replace element accesses by accesses on subelements and to translate operations on the data elements by operations on the subelements. This results in an execution where the value of an data element never occurs as a whole. Finally, the location of a data element can be hidden either statically, e.g. by splitting up the element and distributing the parts, or dynamically by e.g. allowing the element to move around in the data area. 3.9.3 Abilities and characteristics of the attacker To model the properties of the attacker, we have to distinguish two cases. In the first case, the attacker does not know the original version of the agent in advance. Therefore, a human has to analyse the blackbox to build up a mental model. Although it can use the aid of computerized tools to do this, humans tend to be far too slow compared to the execution speed of computer. This slowness cannot be reduced fundamentally since it is not possible to speed up humans. Therefore, the next case seems to be much more relevant. In the second case, the attacker does know the exact specification of the agent in advance. This case is probably the common one if most agents in an agent system are instances of a set of standard agents. If it is possible to identify the type of an agent, i.e. the original agent, then the exact specification is accessible. If now an attacker knows the exact specification, it can automate the attack by generating a program that tries to compute only a few or even a single attribute of the agent, e.g. the current location of a certain variable in the blackboxed agent. In this case the attack can be accelerated by using faster computers or by employing several computers in parallel. In both cases the generated code has to be constructed in a way that standard program analysing techniques such as program slicing, data flow analysis or program abstraction, cannot be used to analyse the agent before the expiration date has passed. Good mess-up algorithms do not allow the complete analysis to be done statically, but also require to run the agent at least partially. Let us now have a look at three example algorithms. 3.9.4 Examples for mess-up algorithms The most important aspects are the structure and attributes of the used mess-up algorithms, as they decide about the protection strength of the security mechanism. Therefore, three mess-up algorithms will be sketched here. 3 WP 2.2: Security 22 Variable Recomposition This algorithm takes the set of program variables, cuts each variable content into segments and creates new variables that contain a recomposition of the original segments. The original variable accesses in the program code are then adapted correspondingly. In Figure 3.12a, you can see the original variable access, Figure 3.12b defines a scheme for recomposing two new variables v23 and v19 from the contents of three original variables. The access code for the new variables as displayed in Figure 3.12d can therefore be created automatically, given the recomposition scheme, by using conversion functions (see Figure 3.12c) that create the original values from the new variables. As a result, now there is no direct relationship between variables and processing model elements like the maximum price from our example. The variable names are now meaningless and the data representation is rather complicated. Conversion of Control Flow Elements into Value-Dependent Jumps The next presented mechanism is a conversion of compile-time control flow elements into run-time data dependent jumps. Control flow elements like if and while statements allow the programmer to imagine the potential control flow even at compile time as these statements make control flow explicit. If we convert these elements into a form that depends on the content of variables, the control flow cannot be determined as easily as before. This dependence can be achieved by the usage of jumps that are bound to variable contents, e.g. switch-statements. The effect can even be strengthened by using complex variable expressions instead of using simple variables. Figure 3.12a Original variable access Figure 3.12b Variable recomposition 5 buy(bestshop,flowers,wallet) 6 go(home) bestshop flowers wallet v23 v19 public Address c7(Bitstring b) public Good c4(Bitstring b) public Money c3(Bitstring b) public Address c34(Bitstring b) 5 buy(c7(v23[0]+v19[4]+v23[3]) ,c4(v19[0]+v19[3]+v23[1]), c3(v23[2]+v19[1]+v23[4])) 6 go(c34(v21[4]+v19[2]+v21[2])) Figure 3.12c Conversion functions Figure 3.12d New variable access if (a(b) < c) { b = s(d(e) + f); } z= 0 DO if (z=0) then t1 = a(b); z=1; continue; if (z=1) then t2 = t1 < c; z=2; continue; if (t2) then t3 = d(e); z=3; continue; if (z=3) then t4 = t3 + f; z=4; continue; if (z=4) then b = t4; z=5; continue; if (z=5) then break; LOOP Figure 3.13a Original code Figure 3.13b Converted Code 3 WP 2.2: Security 23 Deposited Keys If the whole protection information is included in the agent, an attacker is able to break that protection sooner or later. If we can encrypt parts of the agent or identify other information, that is both small and important for the execution of the agent, we can ?externalize? this information on another, trusted server. The idea is to let the agent request these information parts, or keys, from the trusted server by indicating the state of the agent. An example for one type of keys can be found in Figure 3.13b where the numbers printed in bold denote data that ?interconnects? the statements. If these numbers are not present in the code, the attacker is not able to analyse the agent and, therefore, to attack the agent before runtime. The trusted host will deliver them to the agent when the right state is indicated. There are, of course, more and better algorithms, but the above examples demonstrate some of the principles they have to follow: - the algorithm needs to be parametrizable with a very large parameter space in order to avoid dictionary attacks - it must not be possible to break the protection without running the code - it may be useful to take out parts of agent and to put these parts on trusted nodes Each single of these algorithms may be not that strong if they would be used alone. It can be expected that a combination of these algorithms is much stronger than the sum of the single strengths. Therefore, our approach uses a ?chain? of mess-up algorithms. To illustrate the effects of the mess-up algorithms, we want now to sketch possible algorithms that try to break the protection generated by the example algorithms. 3.9.5 Counter attacks Having the three example mess-up algorithms in mind, we want now have a look at possible counter attacks. Counter attacks are algorithms that try to break the protection, i.e. mess-up algorithms. Variable Recomposition There are at least two possible approaches: we can try to guess the variable layout by analysing the access to a byte statistically over the operations known to access certain original variables. The other approach tries to read the original variables simply by reading the parameters of the calls of the known procedures. This means in our example, that the attacker sees the values of bestshop, flowers and wallet as soon as buy is called as these are the contents of the parameters of this procedure. Both attacks assume knowledge about the original procedures. Fortunately, this attack is not that important as either the known procedure is a system call of the host or a call of an ?internal? procedure. In the first case, the parameters do not have to be protected as they are by definition not secret (they are delivered to the host). In the second case we can dissolve the internal procedures into code of the main procedure, so that they are not visible any more. Agent malicious host trusted host statei keyi Figure 3.14 Deposited keys 3 WP 2.2: Security 24 Conversion of Control Flow Elements into Value-dependent Jumps The presented version of this algorithm is rather easy to break as it can be analysed without having to run the code. All we have to do is to create the original statement out of the if-statements. The computational complexity of this is roughly proportional to the number of if-statements, which corresponds at most to the number of single language expression nodes of the syntax tree. We can prevent such an approach by replacing the constant numbers in the if-conditions by more complex expressions and by adding another algorithm that adds more dynamics to the computation of the conditions, e.g. the Deposited Keys mechanism. Deposited Keys We can attack this algorithm by creating every possible state of the agent and by requesting all the keys that are associated to these states. We then have all the runtime informations of the agent and can try to analyse it. The question is, how the attacker gains all the states of an agent. If the states can be associated to the execution of the agent (e.g. by computing a key that has to be delivered with the request), the host has to execute the agent. We then can control the attack by the trusted host since it can notice the execution of the agent. 3.9.6 Problems with mess-up algorithms The first main problem is that the protection intervals have to be of a ?useful? length. Useful in this context means the question of how long a protection interval has to be in order to allow the agent to do something useful. The answer depends of course on the task the agent has to fulfill on a host, but with an interval that allows two ?long-range? migrations, some execution time and enough time for the protection overhead, most applications should be in range. If the protection interval is longer, the agent can migrate to more hosts or compute a longer time on every host. If the protection interval is much smaller, the possible application areas of the protected agent is severely restricted. The second and even bigger problem is the question of how to determine these protection intervals from the used mess-up mechanism. Here, the usage of cryptography to protect data has a valuable advantage: it is possible to express the protection strength of the crypto algorithm in terms of the needed computational power. This is possible because there are known algorithms that are able to break the encryption. Sometimes (as with RSA), these algorithms are not necessarily the best possible mechanisms, but the best known, even after some decades of research. More often, the complexity class of the problem that breaks the encryption is known to be too hard to be computed, even in case of technological progress if the key is long enough. Compared to cryptography, blackbox protection has two advantages: - there is no receiver that has to apply the reverse encryption process - the identity (and - in limits - the specification) and the order of the used algorithms does not have to be known in advance We could now try to apply the same mechanism of determining the protection strength of crypto algorithms to mess-up algorithms. Unfortunately, this approach seems to be very difficult. The reason for that is the current lack of a formal model of the agent mess-up and the associated counter algorithms. While in traditional cryptography the problem of breaking an 3 WP 2.2: Security 25 encryption can be tracked down to a well-defined mathematical problem, the possible attacks against a blackbox protected agent are numerous, and different in nature. Future research has to develop a model that expresses e.g. the hiding of the location of a variable on several places and that computes the complexity to find that location. Fortunately we do not have to formalize the process of building up a mental model of a program by a human as we have excluded this possibility due to the lack of attack performance of humans in Section 8.3. However, the current lack of a formal model is not an immanent problem of the approach. It is an open problem that has to be solved in order to both estimate the strength of the protection and to compute the current protection interval for a specific agent. This computation will then take into account the average computational need for solving the problem of breaking the blackbox protection and estimate how much computational power an attacker will stake. Since the computation will be done before an agent migrates the first time, the estimation can be adjusted according to the existing technology. 3.9.7 How can a blackbox protected agent be created? To create a protected agent, token type data has to be converted into tokens that bear an expiration date and that are signed digitally. In the next step all security sensitive library calls (like calls of an encryption function) have to be replaced by the corresponding library code. Afterwards the code mess-up algorithms are applied to the code and the data of the agent. Finally, the agent has to be signed digitally after receiving the agent expiration date. Now the agent is ready to migrate. 3.9.8 Recharging of protected agents If the ?maximum distance? of the agent is determined by its expiration date, is it possible to ?recharge? the agent in order to allow it to migrate further? Due to the nature of the code messup algorithms, any host could convert the agent to a new form without having to know its internal structure or the contents of the original data. Therefore, we could assign this task to any host that does not cooperate with any malicious host the agent has visited or will visit. Unfortunately, the expiration dates of the agent and of the transported tokens are a problem as they cannot be modified that easily. The first problem is that the agent has to be assigned with a new expiration date and signed digitally by a party that the agent (or its owner) trusts. This also incurs, that the agent gets a new identity, as it differs at least in the expiration date. The second problem is, that the tokens have to be replaced by new ones. If you think of electronic money coins, you have to change them into new coins with a new expiration date, while tokens that have no real value like keys, can be easily created. All of this can only be done by a trusted host. If the agent has checked the identity of the trusted host, it delivers the tokens that have to be replaced and gets the new ones in return. An alternative to recharging an agent is to extract the state of a nearly expired agent and to ?inject? it into a new agent ?hull?, thus creating a new agent that contains the state of the old one. The advantage of this alternative is that it prevents the delay that would be needed to mess-up the old agent after its arrival. Now we have seen how to achieve time limited blackbox protection. But what can we do to prevent other attacks by the host? 3 WP 2.2: Security 26 3.10 Which Other Attacks by Malicious Hosts Can Be Prevented Using Blackbox Protection? Even if an agent is protected by time-limited blackbox security, there are still some possible attacks: - a malicious host can try to mask itself as another, perhaps trusted host - a malicious can try to read and manipulate the interaction of a hosted agent with a third party and - a malicious can return wrong results when the agent is calling system library procedures While there is no known protection from the latter attack apart from verifying the answer by another third party (but then using library code seems to be rather redundant), the first two attacks can be prevented. This is possible since a blackbox protected agent is autonomous again, i.e. that if a hosts executes that agent, the host cannot interfere in this execution. This allows us to use the same mechanisms to prevent the mentioned attacks as in distributed systems where the parties reside on different, and therefore autonomous, nodes. We can prevent masking of hosts by using existing authentication methods using symmetric or asymmetric encryption schemes. We can even strip down these protocols a little bit since no third party can read the local communication between the agent and the host. We can also prevent attacks against the interaction of an agent with another party by using secure channels between the interaction partners. These channels are obtained by exchanging session keys between the partners and by encrypting the traffic between them. Since in this scenario, the malicious host can be modeled as an attacker on a connection between two autonomous nodes, the protocols do not have to be modified. 3.11 New Attacks: Sabotage and Blackbox Testing If there is an agent protection scheme like the one described in this report, one can imagine attacks that rely on the characteristics of this scheme. One attack is sabotage, or the action of destroying parts of the agent without being detected. As an agent contains data that might change during execution, the attacker can simply modify single bits of the data area without knowing about the effects to the agent. Fortunately, this attack is very similar to the problem of data that is sent over an insecure network. Therefore, similar error detection or even correction mechanisms like CRC, computed by the agent itself, can be used as long as the attacker cannot detect the detail structure of the mechanism. It is easy to circumvent a CRC algorithm if the exact mechanism is known and if it the borders of the protected data elements can be seen. Another attack is the blackbox test. Its aim is to determine characteristics of the inside of the ?black box? by executing the box with different input parameters and by watching the effects. The recorded reactions can be formal results like output values or characteristic ?activity patterns?. In our example, the attacker could execute the agent until it tries to buy the flowers, starting over and over with the initial agent. The only value that is changed over the tests is the price for the flowers. When the agent finally wants to buy, the attacker knows the price that is both the lowest so far and that is below the maximum price. Even if the agent would not buy 3 WP 2.2: Security 27 the flowers immediately (it might want to ask at least three different providers), the attacker can watch whether the data of the agent changes. If this is the case, it is very likely, that this agent has memorized a better price. If it comes to countermeasures, two goals have to be reached: first, the parallel execution of the same agent has to be suppressed, e.g. by using a trusted third party that is informed by the agent about its execution. Second, the very fast execution of an agent has to be prevented, e.g. by using a similar interaction with a trusted host. Finally, activity patterns can be covered up by inserting and executing dummy code. 3.12 What Blackbox Security Costs Protecting agents using blackbox security is not for free. Since the costs mainly depend on the class of an agent, it has to be decided per class whether this kind of protection is appropriate or whether the agent should operate immobile from a trusted host via remote communication. For calculating the costs, we can distinguish four classes of costs that result from blackbox security: costs at creation time These are the costs for converting the original agent into the new form. These costs are not important for the execution time of the agent, only for the ?delay? of starting the execution. If we get an agent with low execution time overhead, we can accept higher creation time costs. costs at transmission time This is the size overhead of the agent, since the transmission time is determined by the size. The main problem here is the fact, that agents have to transport all library code that is security sensitive instead of using the corresponding system library at the target host. An example are the J/Crypto libraries from Baltimore Technologies that implement cryptographic functions like DES, RSA, SHA-1 and MD-5 and which consist of 200 KB of Java bytecode. costs at execution time The execution time overhead results on the one hand from the computations that are introduced by the mess-up algorithms and on the other hand from the execution time of the transported libraries if this time is longer than the execution time of a system library call. There are also costs if communication with remote trusted nodes is needed (e.g. in the case of Deposited Keys). ?costs? by not using efficiency enhancing mechanisms Due to the blackbox mechanism, it is possible, that mechanisms enhancing efficiency cannot be used by protected agents. One example is the fact, that blackbox agents are not modular and hence cannot use code caching mechanisms as the code is different for every agent even if providing exactly the same functionality. 3.13 Conclusions and Future Work Blackbox security is a new approach to solve the problem of malicious hosts, a problem in the area of mobile agent security, that has been rated as not solvable by software means. The pre- 3 WP 2.2: Security 28 sented approach does not prevent every possible attack. It is still possible for the host to deny the execution and to return wrong system call results to the agent. It is further still possible to read and to manipulate data and code, but as the attacker cannot determine the role of these elements for the application, the attack results are random. The approach is able to guarantee a certain protection time interval. Therefore, the agent and its transported data get invalid after this ?expiration date?. For the purpose of comparing the expiration dates with the current time, synchronized clocks are necessary. As the strength of blackbox security depends on algorithms that ?mess-up? code and data of the agent, these algorithms have to be constructed in a way that can guarantee the protection time interval, which also have to be of a useful length. As we have seen, this kind of security is not for free, but costs both in terms of execution and transmission speed. We expect therefore, that blackbox security will be applied only to agents that transport money-like values or security sensitive data such as secret keys. We will implement a framework for blackbox security for our own Java-based agent system, Mole [Mole]. At the moment, no overall implementation of the approach exists as it is a complex framework that needs a lot of modifications in an agent system. Currently, we are finishing the implementation of a first combination of code mess-up algorithms [Röh97], and we are starting to develop a formal model of the mess-up effects to be able to compute their protection strength. To prevent blackbox testing attacks, we are currently working on an extension of the blackbox mechanism, which will also allow agents to authenticate their hosts. 3.14 Bibliography [FGS96] Farmer, William; Guttmann, Joshua; Swarup, Vipin: Security for Mobile Agents: Authentication and State Appraisal, in: Proceedings of the European Symposium on Research in Computer Security (ESORICS), pp. 118-130, Springer LNCS 1146, 1996 [GM96] General Magic: The Telescript Reference Manual. 1996. http://www.genmagic.com/ Telescript/Documentation/TRM/ [Hoh97] Hohl, Fritz: An approach to solve the problem of malicious hosts. Universität Stuttgart, Fakultät Informatik, Fakultätsbericht Nr. 1997/03, 1997. http://www.informatik.unistuttgart.de/cgi-bin/ncstrl_rep_view.pl?/inf/ftp/pub/library/ncstrl.ustuttgart_fi/TR-1997- 03/TR-1997-03.bib [Vig98] Vigna, Giovanni (Ed.): Mobile Agents and Security, Springer-Verlag, to appear 1998 [Röh97] Röhrle, Klaus: Konzeption, Implementierung und Analyse von Verwürfelungsmechanismen für Quellcode, Diploma Thesis Nr. 1541, Faculty of Informatics, University of Stuttgart, Germany, 1997 [Mob96] Mobilis: Exploring Telescript - mobilis Reader Interview: General Magic's Jim White. Mobilis March 1996. http://www. volksware.com/mobilis/march.96/interv1.htm [Mole] Mole project page. http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole.html [Pal94] Palmer, E: An Introduction to Citadel - a secure crypto coprocessor for workstations, in: Proceedings of the IFIP SEC'94 Conference, 1994 [RJ96] Rasmusson, Lars; Jansson, Sverker: Simulated Social Control for Secure Internet Commerce, in: New Security Paradigms `96, ACM Press, 1996 3 WP 2.2: Security 29 [ST97a] Sander, Tomas: Security! or ?How to Avoid to Breath Life in Frankensteins Monster?. Slides of a talk at the ICSI Inhouse Workshop on Auto Mobile Code, ?Technology and Applications of Auto Mobile Code (AMC)?, September 1997. http:// www.icsi.berkeley.edu/~tschudin/amc/workshop97/security.html [ST98a] Sander,Tomas; Tschudin,Christian: Protecting Mobile Agents Against Malicious Hosts, in: Vigna, Giovanni (Ed.): Mobile Agents and Security, Springer-Verlag, 1998. http://www.icsi. berkeley.edu/~sander/publications/MA-protect.ps [ST97b] Sander,Tomas; Tschudin,Christian: Towards Mobile Cryptography. Technical Report 97-049, International Computer Science Institute, Berkeley. 1997. http:// www.icsi.berkeley.edu/~sander/publications/tr-97-049.ps [ST98b] Sander,Tomas; Tschudin,Christian: On Sofware Protection via Function Hiding. Submitted to the 2nd International Workshop on Information Hiding, Dec 1998. http:// www.icsi.berkeley.edu/~sander/publications/hiding.ps [Vig97] Vigna, Giovanni: Protecting Mobile Agents through Tracing, in: Proceedings of the Third ECOOP Workshop on Operating System support for Mobile Object Systems, 1997. To appear. 4 WP 2.3: Recoverable Agents 30 4 WP 2.3: Recoverable Agents An important prerequisite for the use of mobile agents in industrial environments is to provide reliable and fault tolerant execution of the agents. This chapter describes two different approaches to provide the required reliability. In the first approach, fault-tolerant execution of agents on unreliable systems is provided. Section 4.1 summarizes two published papers [RoSt98][StRoMa98] on this topic. In the second approach, the reliability of agents is provided by using the TUXEDO platform on Tandem Himalaya to build the agent system. 4.1 A Fault-Tolerant Protocol for Providing the Exactly-Once Property of Mobile Agents 4.1.1 Introduction Over the last few years, the concept of mobile agents has drawn a lot of attention in both academia and industry. Today many prototypes of mobile agent systems exist, most of them based on the Java programming language. Moreover, various efforts to standardize mobile agent technology are already underway (e.g., OMG MASIF, CSELT FIPA). However, despite of all these activities, only few ?real? applications based on mobile agents exist today. One reason for that might be that current mobile agent platforms are in a rather early stage. Application-critical functions, such as security mechanisms, are often incomplete or missing at all. Moreover, only little work has been done so far in studying the problem of integrating agent technology with legacy systems, such as TP-Monitors and transactional resource managers. In this paper, we will show how agent technology can be integrated with transactional technology to improve faulttolerance. Mobile agents are autonomous objects that are able to migrate from node to node in a computer network. When an agent decides to migrate to another node, the agent's code, data and execution state1 is captured and transferred to the next node, where it is initiated after arrival. Agent execution proceeds in stages [Sch97], where the operations of a stage are performed at a single node. Whenever an agent moves to a new node, this ends the current stage and begins a new one. The assignment of stages to nodes can be defined by means of a user-defined itinerary [LO97][GM] before the agent is launched, or on the fly by the agent logic taking into account the current system state [PS97][SBH96]. The use of mobile agents has been proposed for many application areas, including electronic commerce, systems management, or active messaging. In electronic commerce scenarios, for instance, agents autonomously go shopping on a user's behalf, do the reservations needed for a business trip, or monitor the stock market and trigger user-defined operations when certain conditions occur. Obviously, many of these applications require an agent to be executed exactly once. For example, assume a user that launches a mobile agent to make a flight and hotel reservation for a forthcoming business trip. The agent is expected to make both reservations if possible, and in any case return a status message back to the user. Of course, the user will only del- 1. Actually we distinguish between strong and weak migration [GV97]. While weak migration only transfers the code and data, strong migration also transfers the agent's execution state. 4 WP 2.3: Recoverable Agents 31 egate this job to an agent if it is guaranteed that the agent does it ?exactly once? and cannot be caught by a network partitioning or node failure. In other words, independent of node and communication failures it must be ensured that the agent is never lost and hence will get its job done eventually. Moreover, failures may not cause the agent to perform operations more than once (e.g., to reserve and pay two seats instead of one). The exactly once property has already been defined for RPC systems [Spe82], where it defines the failure semantics of a single remote procedure. In the context of mobile agents, a sequence of agent stages are to be considered rather than a single procedure. An agent execution is defined to be ?exactly once? if the entire sequence of its stages is eventually performed, and all operations of each stage are executed exactly once. In this paper, we will first describe a simple protocol based on transactional message queues (e.g. IBM MQSeries, see [BE97][Bla95]). This protocol already provides the ?exactly-once? semantics as defined above. However, for many applications it is not sufficient to get the job done ?eventually? but as fast as possible or even up to a certain deadline. In our reservation example above, the agent's status message should arrive at least before the date the business trip is scheduled. The problem with our simple protocol is that an agent may be blocked due to a node crash or network partitioning even if there are other nodes, where it could continue processing. Therefore, we propose an extension of this simple protocol to reduce the probability of agents to be blocked. The extended protocol allows a number of observer nodes to be assigned to each stage. The observers monitor the stage node currently executing the agent and take over agent execution when this node becomes unavailable. A voting procedure integrated in commit processing ensures the ?exactly-once? semantics. The protocol is currently implemented in Mole [Mole][BauEA98], a mobile agent system developed at Stuttgart University. The remainder of the paper is structured as follows. In the next section, we will describe our agent execution model. Section 4.1.3 presents the simple protocol and discusses the problems associated with it. Section 4.1.4 introduces an enhanced model for agent processing and gives an overview of the extended algorithm, which is subdivided in a voting protocol and a so-called selection protocol. These two protocols are described in detail in Section 4.1.5 and Section 4.1.6 . Section 4.1.7 discusses the gain in fault-tolerance obtained by the protocol and gives a short estimation on the costs introduced. Section 4.1.8 presents an approach to reduce the overhead introduced by the protocol. Related work is discussed in Section 4.1.9 , before the paper concludes with a brief summary. 4.1.2 Agent Execution Model In our agent execution model, tasks are assigned to agents, which perform them autonomously. To execute its task a mobile agent may exploit the services provided by the various nodes of a computer network. According to the mobile agents paradigm [GV97] an agent moves to a node before accessing the node's services, i.e., agents only interact with local services. Once launched an agent moves from node to node according to its itinerary, which may be determined before the agent is initiated or dynamically while agent execution is in progress. A more detailed description of the concept of an itinerary can be found in [StRoMa98]. Agent execution proceeds in steps, where a new step is initiated whenever an agent migrates to the next node. A step of an agent at a node is defined to be the set of operations performed by 4 WP 2.3: Recoverable Agents 32 the agent while it visits this node. Consequently, all operations of a given step are performed at the same node and access local resources only. In our model, we assume that resources are encapsulated in resource managers, which - depending on the actual system environment - may be represented by stationary agents, servers or (recoverable) objects. Each step may change the agent's state as well as the state of the local resources. For example, assume an agent buying a ticket from a ticket server. After this step, the agent's state would reflect the ticket information as well as the modified electronic wallet data, and the ticket server's database would have been updated accordingly. Note that this step has to be performed in an atomic manner. Let L(I) be the number of nodes in the agent's itinerary I=[N1,N2,...,NL(I)] and Si be the step to be performed on node Ni (1?i?L(I)). Then the execution of an agent is defined to be exactlyonce if - the agent executes step Si before step Si+1, 1?iN2->...- >Nk-1->Nk . As an agent may visit the same node several times, Ni and Nj (1?i,j?k) may denote the same or different nodes. Assume further that an agent is stored in a message queue when it is accepted by the agent system for execution. Once the agent has been stored in the initial queue 4 WP 2.3: Recoverable Agents 33 (Q1 in our example), the owner of the agent can be informed that this agent - provided that nodes, queues and the network recover - will be performed exactly once eventually. Except Nk each other node is performing the following sequence of operations: Begin_Transaction; Get(Agent); Execute(Agent); Put(Agent); Commit. Get removes an agent from the node's input queue, Execute performs the received agent locally, and Put places it directly into the input queue of the node to be visited next. All three operations are performed within a transaction and hence build an atomic unit of work. So, if for instance transaction Ti aborts due to a node or transaction failure, recovery undoes all of the agent's effects at Ni and restores the agent in its original state in Qi. Any effects in Qi+1 are undone also. After recovery is finished, Ni continues normal processing and will execute this agent eventually and then hand it over to its successor. Of course, the last node in the agent's itinerary, Nk, does not have to perform Put, it simply destroys the agent when the execution is finished. The problem with this simple solution stems from the autonomy of agents. Due to this property there is no ?natural? instance that monitors the progress of an agent. If a node crashes after the agent has been placed in its (local) input queue and before it is moved to the next queue, the agent is ?caught? as long as the node is down even if there are other stage nodes which could execute the agent. A partitioning of the underlying network may have similar effects. Note that this is different in client/server systems, where a client calling the operations of a server monitors the availability of this server. When it detects a server failure, the client can continue processing by using alternative servers offering the same or similar services. With the above protocol, there is no system entity that will notice that an agent is ?caught?. Of course, the end user might notice when the agent misses a deadline. This is a serious drawback since agent processing is blocked even if alternative nodes providing the needed services are available. Even if those nodes do not exist, some sort of exception handling should be performed, e.g., informing the user that the agent will most probably miss the deadline. In the next section we will extend the simple protocol described above to reduce the probability of agents to be blocked due to failures. 4.1.4 Protocol Overview The execution of an agent proceeds in a sequence of stages. The operations associated with a stage are entirely performed at a single node, and an agent enters a new stage whenever it moves to the next node. For each stage there exists a non-empty set of nodes which alternatively can perform that stage. Each stage initially includes a worker node, which is responsible for executing the agent in this stage. The other nodes are observers monitoring the availability of the stage's worker. When the worker fails, this will be detected by the observers, which then will elect a new worker from the set of available stage nodes. Each stage node is associated with a Q1 Q2 N1 Execute Put Get T1 Q3 N2 Execute Put Get T2 Qk Nk-1 Execute Put Get Tk-1 Nk Execute Get Tk Launch Figure 4.1 Simple implementation of exactly-once agents using message 4 WP 2.3: Recoverable Agents 34 priority, which defines a total ordering between the nodes belonging to the same stage. Node priorities are required for the voting and selection process. The node with the highest priority becomes the initial worker of a stage. Figure 4.2 shows a 3-stage execution of an agent. For example, stage S1 is associated with one worker, and 4 observers. In S2, the node with the highest priority (1) failed and the node with priority 2 was elected to be the new worker. What are the functional capabilities expected from observers? Ideally, an observer provides the same set of services an agent expects to find at the initial worker (e.g., a flight reservation service). However, an observer that offers no more than an environment for running agents is also acceptable. At such a node an agent can perform the exception handling mentioned above. For example, it can use the infrastructure services to find alternative servers, it can change its travel plans, or it can just move back to the user's machine to report the problems and receive new directions. To allow an observer to take over agent execution, it obviously needs a copy of the agent. Therefore, in our scheme, a worker sends the agent not only to the (initial) worker but to all nodes of the next stage when it has finished processing. However, only the worker initiates agent processing, while the observers just do the monitoring for this stage. As in our simple protocol above, we use transactional message queues to move agents from one stage to another. In contrast to the simple protocol, the Put-operation of the message queue has to ensure that the message has already arrived at the destination node at transaction commit. Stage processing has the following structure (see Figure 4.3): Begin_Transaction; Get(Agent); Execute(Agent); Put(Agent) to (All- NodesOfNextStage); Commit.2 The monitoring protocol (see Section 4.1.6 ) ensures that an observer eventually recognizes when a worker becomes unavailable. In such a case, the observers of that stage select a new worker, which initiates a new stage processing transaction comprising the sequence of operations described above. Now, there is an obvious problem with this approach. Since the observers in general cannot decide whether an unavailable worker has crashed or is still active in a different partition of the network, it may happen that two or more nodes of the same stage execute the 2. Note that the Get operations of the observer nodes are not part of the transaction. If they would be included in the transaction, this would require all nodes of a stage to be available to execute an agent at that stage. Clearly, this increases the probability that an agent becomes ?caught? rather than decreasing it. It is important to notice that having several Puts instead of one in the transaction does not increase the ?caught? probability since the observers for the next stage can be determined on the fly from the set of available nodes. 4 3 2 1 5 3 2 1 4 3 2 1 5 S1 S2 S3 Figure 4.2 Execution of an agent in 3 stages 4 WP 2.3: Recoverable Agents 35 agent at the same time. However, the exactly once property of agents requires that exactly one stage transaction is committed per stage. In order to achieve this, we integrate a voting protocol into the two-phase commit (2PC) processing [GR94] of stage transactions: a transaction can only commit if a majority of stage nodes agree. It is also the responsibility of this voting protocol to make sure that all observers of a stage remove all stage information when a worker's stage transaction commits (see the Get* operations in Figure 4.3). In Section 4.1.5 , we will present our voting protocol and show how it can be integrated in standard 2PC processing. We will assume an architecture similar to the X/Open Distributed Transaction Processing [X/O91] architecture, which consists of transaction managers running the 2PC protocol and resource managers maintaining the recoverable data. Figure 4.4 depicts the components and interactions relevant for the processing of stage Si. When the worker of Si calls Commit, the local TM initiates 2PC processing, which involves the worker itself and all nodes of Si+1. During the commit procedure, each involved TM interacts with those local resource managers that were involved in stage processing. For example, at the worker node this is the queue manager associated with the worker's input queue and the other local resource managers (e.g. a DBMS) that have been involved in agent execution. In addition, the worker's TM interacts with another type of resource manager called orchestrator. The orchestrator, which communicates with the so-called voters belonging to its stage, is responsible for orchestrating the voting procedure. Each stage node runs a voter, which determines and communicates the node's 3 2 1 3 2 1 Put Put Put Get Get Get* Get* Si Ti Si+1 Execute Figure 4.3 The transactional processing of an agent in a stage TM QM RM O V V TM QM RM Si Si+1 Agenda: QM: Queue Manager RM: Resource Manager W: Worker O: Orchestrator Figure 4.4 Components and interactions relevant to process a stage W 4 WP 2.3: Recoverable Agents 36 vote. The orchestrator and the voters of a stage communicate according to the voting protocol presented in the next section. It is important to notice that the proposed architecture nicely separates voting and 2PC processing. From a TM's point of view, the orchestrator is just another resource manager, which provides the same interface as all other resource managers (e.g., an XA interface [X/O91][BE97]). Consequently, the voting procedure can be easily integrated in existing middleware systems, such as CORBA [OMG96] or X/Open compliant systems, just by implementing a new resource manager, or a new recoverable server to use CORBA terminology. Besides the voting procedure, a selection protocol is needed, which allows the observers of a stage to select a new worker when they recognize that the old one failed. Since the voting during 2PC processing already ensures that only one stage transaction commits, the exactly once property is not jeopardized even if more than one new worker is selected. Actually, each observer that recognizes a worker failure could select itself without talking to the other observers. Consequently, the problem of selecting a new worker differs from the well-known election problem as defined in the literature (e.g., see [GM82]). For that reason we are using the term ?selection? rather than ?election? throughout this paper. The selection protocol proposed in Section 4.1.6 is a ?light-weight? protocol, which usually selects one new worker, but also can end up with multiple workers in rare situations. Each worker and observer node runs monitor processes that do the monitoring and the selection of new workers if needed. 4.1.5 Voting Protocol In this section, we will focus on the voting protocol and its integration into 2PC processing. Instead of describing the well-known 2PC procedure, we will confine ourselves on presenting the interactions between the transaction manager (TM) and the local voting orchestrator (see Figure 4.5). The voting protocol used here is based on the fault tolerant majority quorum algorithm [Thom79][Giffor79] and is extended by an algorithm similar to the one described in [Maek85] to resolve concurrent invocations of the algorithm using priorities. In terms of message complexity more efficient quorum based fault-tolerant algorithms for mutual exclusion have been proposed (e.g. [AgrAbb91], [ChaCha97]). Although these algorithms have the same time complexity in the error-free case (O(1)), they have a higher time complexity in the presence of failures. In addition, the difference in number of messages for small stages (5 to 7 nodes) is marginal in the error free case. Therefore, we chose to use the simple algorithm. As already stated in the previous section, from the TM's point of view the orchestrator looks like an ordinary resource manager. We assume that resource managers implement an XA-like interface with the following operations: rm_prepare, rm_commit and rm_rollback. The first operation, called in the first phase of 2PC, returns either rm_yes or rm_no, depending on whether or not the resource manager is able to prepare for commitment. In the second phase, the TM issues either rm_commit or rm_rollback depending on the transaction's outcome. Upon such a call a resource manager terminates the transaction accordingly and returns rm_ack to the TM. 4 WP 2.3: Recoverable Agents 37 The interactions between the 2PC of the stage transaction and the voting protocol are shown in Figure 4.5. The voting protocol is run between the orchestrator and the voters of a stage. While the orchestrator is located at a worker node only, there exists a voter at each stage node. When 2PC processing is started at the orchestrator (i.e., when rm_prepare is called), it issues vote requests to the voters of its stage and then collects the returned votes. Only if it receives a majority of yes votes, the orchestrator returns a rm_yes to its local TM, and a rm_no otherwise. In other words, only if a majority of voters vote yes, the transaction can be committed. That is why only one transaction can commit per stage even if there is more than one worker. We distinguish between two types of stable states, namely transaction states and stage states. Both are stored on stable storage and thus are supposed to survive node failures. Transaction states are maintained by orchestrators, while voters maintain stage states. A transaction's state can be ?Unknown?, ?Ready? or ?Committed?, while a stage's state may be ?Unknown? or ?Active?. For both types of states ?Unknown? means that no state information is stored on stable storage for the corresponding transaction or stage. The state information of an ?Active? stage is stored in a so-called stage record on stable storage. It contains the following information: - An identifier of the stage, which consists of an (AgentId, HopCount) pair. AgentId is a globally unique agent identifier and HopCount is incremented whenever the agent is moved to the next stage. - A list of nodes participating in the stage. For each node the node's identifier and priority is included. When an agent moves to the next stage, not only the agent itself but also the stage record of the next stage is Put into the input queues of the nodes associated with the next stage. Each stage node reads the stage record without actually removing it. Once the stage record has been Put into the message queue (on stable storage), the stage becomes ?Active? at the corresponding node. Since all Put operations are performed in a single transaction (see Figure 4.3), either all stage nodes are ?Active?, or none of them. Initially, the stage node with the highest priority becomes the worker, while all other nodes take over the observer role. Voters and orchestrators are identified by globally unique node identifiers (ids), i.e., a voter and orchestrator residing on the same node have the same id. In analogy to 2PC processing, our voting protocol proceeds in two phases which are described in the following paragraphs. Figure 4.6 shows two possible scenarios in a stage with two stage nodes to illustrate the voting algorithm. TM O V TM O rm_prepare rm_yes voting rm_commit voting rm_ack Stage i V Stage i+1 phase 1 phase 2 Figure 4.5 Integration of voting into 2 PC voting 2PC 4 WP 2.3: Recoverable Agents 38 In Figure 4.6a only one orchestrator initiates the voting procedure while Figure 4.6b shows a scenario where both orchestrators initiate the voting procedure concurrently. Normal Processing: Phase 1 Phase 1 of the voting protocol is initiated when an orchestrator receives a rm_prepare call from its local TM. First, the orchestrator sends a VOTE request to each voter of its stage. This request includes several globally unique identifiers: the id of the stage currently processed, the orchestrator's id, and the id of the transaction the orchestrator is currently involved in3. Then, the orchestrator waits for the answers, periodically resending the VOTE request to all stage nodes that have not yet answered. To record its votes already given to orchestrators, a voter maintains a list called OrchSet on stable storage. Whenever the voter returns a YES vote, the identifier of the receiving orchestrator is recorded in OrchSet. Normally, OrchSet ends up with one node identifier. In the presence of failures, however, there might be several orchestrators competing for a node's vote. A voter receiving a VOTE(StageId, TId, OrchId) request for stage StageId determines its reply based on its OrchSet. If OrchSet is empty, the voter has not voted YES before. In this case, OrchId is added to OrchSet and a YES(StageId, TId, VoterId) reply is sent back to the orchestrator, where VoterId identifies the voter (replies of the voters in Figure 4.6a, the first replies of the voters in Figure 4.6b). If OrchSet is not empty instead, there are obviously several orchestrators competing for the vote. To make sure that one of the them will eventually receive a majority of votes, our voting protocol prefers the orchestrator with the highest priority. Assume that N is the node with the highest priority in OrchSet. If OrchSet is not empty and OrchId has a lower priority than N, then the voter 3. The transaction identifier is received in the rm_prepare call and is used here to match VOTE requests with the corresponding votes. Due to node failures it may happen that the same orchestrator starts several rounds of voting. Figure 4.6 Figure 4.6a One orchestrator initiating voting Figure 4.6b Two orchestrators initiating voting concurrently V1 O1 V2 VOTE VOTE rm_prepare YES YES rm_yes rm_commit FORGET FORGET ACK ACK rm_yes O1 V1 V2 O2 VOTE VOTE YES YES H_PRIO GAVE_UP UN_VOTE NO YES rm_prepare rm_no rm_commit rm_prepare rm_yes FORGET ACK ACK rm_ack 4 WP 2.3: Recoverable Agents 39 has already voted YES for a node with a higher priority. In this case, the voter replies with NO(StageId, TId, VoterId), i.e., OrchId loses the competition. In our scenario in Figure 4.6b, V1 already gave its vote to orchestrator O1 which has a higher priority than orchestrator O2. That is why V1 sends a NO to O2. If OrchSet is not empty and OrchId has a higher priority than N, then the voter has already voted but only for orchestrators with a lower priority. If N is not the voter's node, the voter immediately sends back a COND_YES(StageId, TId, OrchSet, VoterId) and then adds OrchId to its OrchSet. The semantics of this vote is that VoterId votes YES, provided that all nodes in OrchSet also vote YES. If N equals the voter's node, there exists a local orchestrator, which has already initiated a competing voting procedure. Since OrchId has a higher priority than the local orchestrator, the latter one is supposed to give up. This, however, is only possible (and desirable) before the stage transaction at the orchestrator has entered the ?Ready? state (i.e., before the orchestrator has got a majority of votes). To check the transaction's state, the voter sends a HIGHER_PRIO request to the local orchestrator, which returns either GAVE_UP to indicate its stage transaction has been aborted, or ALREADY_DONE if the transaction state is already ?Committed? or ?Ready?. If ALREADY_DONE is returned, the voter sends a NO(StageId, TId, VoterId) message to OrchId, and a COND_YES(StageId, TId, OrchSet-{N}, VoterId) message (or YES(StageId, TId, VoterId) if OrchSet-{N} is empty) otherwise. In Figure 4.6b, V2 receives a vote request from O1 after already having given a YES vote to O2 which has a smaller priority than O1. Therefore, V2 sends a HIGHER_PRIO request to O2 which, still being in ?Unknown? state, replies with GAVE_UP, enabling V2 to reply a YES vote (COND_YES with empty OrchSet) to O1. To record the received votes matching the current TId, the orchestrator maintains three sets in volatile storage: YesVotes, NoVotes and CondYesVotes. When it receives a YES or NO vote, it includes the voter's id in YesVotes or NoVotes, respectively. If it receives a COND_YES, it adds the (VoterId, OrchSet) pair included in this message to CondYesVotes. Note that this conditional YES becomes a ?real? YES after all nodes in OrchSet voted YES. In other words, if OrchSet- YesVotes equals the empty set, VoterId can be added to YesVotes and (VoterId, OrchSet) can removed from CondYesVotes. Obviously, this check has to be performed when the (VoterId, Orch- Set) pair is added to CondYesVotes and whenever YesVotes is changed. Once YesVotes contains a majority of votes, the orchestrator moves into the ?Ready? state and then returns rm_yes to its local TM (see orchestrator O1 in our scenarios). Then it waits for the TM's commit or abort decision. Note that the rm_yes response is only a prerequisite for commitment rather than a commit decision. If a majority becomes impossible (i.e., at least half of the voters voted NO), the orchestrator returns a rm_no to its local TM, sends an UN_VOTE(StageId, TId, OrchId) message to all voters recorded in its YesVotes and CondYes- Votes set, and then forgets the transaction (see orchestrator O2 in Figure 4.6b). Note that the rm_no response forces the TM to abort the stage transaction. The orchestrator's node then changes from the worker to the observer role (see Section 4.1.6 ). When the orchestrator receives a HIGHER_PRIO message, it replies ALREADY_DONE if its stage transaction is already ?Ready? or ?Committed?. If the transaction is still in the ?Unknown? state, it sends back GAVE_UP to the local voter and returns rm_no to the local TM (see O2 in Figure 4.6a). Furthermore, it sends UN_VOTE messages to all voters recorded in its Yes- 4 WP 2.3: Recoverable Agents 40 Votes and CondYesVotes set before it forgets the transaction. As above, orchestrator's node then changes from the worker to the observer role. An orchestrator receives a GIVE_UP request from the local voter if another stage node already committed its stage transaction (see below). Clearly, this message can only arrive while the receiving orchestrator resides in the ?Unknown? transaction state. When GIVE_UP arrives, it immediately forgets the transaction, and returns rm_no to the local TM. Normal Operation: Phase 2 If the TM commits the transaction, it issues rm_commit for each local participating resource manager. When rm_commit is called, the orchestrator atomically enters the ?Committed? state and returns an rm_ack to the local TM. Subsequently, it sends a FORGET(StageId, OrchId) message to all voters of its stage and then waits for the acknowledgements to arrive. It periodically resends FORGET until it received an ACK from each voter. When the ACKs are complete, it moves to the ?Unknown? transaction state before it forgets the transaction. A voter receiving FORGET atomically goes into the ?Unknown? stage state, i.e. the stage's stage record (together with the agent) is removed from the voter's transactional input queue in an atomic fashion. Subsequently, the voter removes the stage's OrchSet from stable storage and sends back an ACK(StageId) message to the sender of FORGET. If there happens to be a local orchestrator different from OrchId, the voter sends GIVE_UP to this orchestrator, causing the locally initiated stage transaction to be aborted. If the orchestrator receives rm_abort instead of rm_commit from its TM, it enters the transaction state ?Unknown? and then sends UN_VOTE requests to all voters recorded in its YesVotes or CondYesVotes set. Then the orchestrator's node restarts the transaction. Voters receiving an UN_VOTE remove OrchId from their OrchSet, i.e., they withdraw their votes previously given to OrchId. Note that this ?unvote? mechanism is needed to allow a lower priority node to achieve a majority after some higher priority node gave up. Remember that a voter only votes YES (or COND_YES) if it has not already voted YES (or COND_YES) for some other node with a higher priority Failure Recovery Once a voter has returned a vote to an orchestrator, it can expect either a FORGET or UN-VOTE response. When the voter times out while waiting on the response, it sends an INQUIRY message to the corresponding orchestrator. INQUIRY messages are sent periodically until FORGET (or GAVE_UP) is received, or each orchestrator recorded in the voter's OrchSet returned an UN_VOTE response. An orchestrator's response on an incoming INQUIRY(StageId, VoterId) request depends on its current transaction state. If the transaction state is ?Ready?, the orchestrator adds VoterId to its YesVotes set if it is not already included. This ensures that the identified voter will be notified accordingly as soon as the TM issues rm_commit or rm_abort. If the orchestrator is in the ?Committed? state when receiving an inquiry, it responds with a FORGET message. If the orchestrator resides in the ?Unknown? state, two cases must be distinguished: If there is no active 4 WP 2.3: Recoverable Agents 41 transaction belonging to the stage identified by StageId, the orchestrator returns an UN_VOTE message. If there is an active transaction instead (i.e., voting is still in progress for the stage) the INQUIRY can be ignored. Let us briefly argue why. If VoterId is already in YesVotes or CondYesVotes or will be included at a later point in time, the identified voter will be informed during phase 2. Even if this is not the case, a future INQUIRY will eventually find no locally active transaction, causing an UN_VOTE to be returned to the voter. When a node recovers from a failure, it reads the transaction and stage states recorded in stable storage. Orchestrator recovery only takes place if the transaction state is ?Committed? or ?Ready?. If the transaction is ?Ready? after restart, the orchestrator waits until it is informed by the local TM about the transaction's outcome, and then proceeds as described above. If the transaction is already ?Committed? instead, the orchestrator sends FORGET to all voters of the stage and collects the ACKs. After having received all ACKs, it can enter the ?Unknown? transaction state. A voter only performs recovery if its stage state is ?Active?. In this case, the voter periodically sends INQUIRY request to all orchestrators recorded in its OrchSet. It continues to send inquiries until it receives FORGET from some orchestrator, or it got an UN_VOTE from each orchestrator in OrchSet. It acts upon the received responses as described above. Correctness Arguments In the following, we will give some informal correctness arguments for the voting protocol described above. We will assume that the selection protocol ensures that there eventually exists a non-empty set of orchestrators (or workers). The objective of the voting protocol is to guarantee that exactly one of these orchestrators will commit its state transaction. Let us first show that - given a non-empty set of orchestrators - exactly one of them will eventually enter the ?Ready? transaction state. If there is only one orchestrator, it will get YES votes from all available voters. As soon as a majority of voters is available, it can enter the ?Ready? state. Now assume that there are several competing orchestrators, and O1 is the one with the highest priority. When another orchestrator, say O2, receives O1's vote request, it has either already entered the ?Ready? or ?Committed? state, or it gives up. In the first case, O2 has got a majority of votes, which allows no other orchestrator to move into the ?Ready? state. In the latter case, O2 sends UN_VOTE to its voters, allowing its local voter to send a YES vote back to O1. All other voters either return a YES or COND_YES (O2) message back to O1, depending on the sequence O1's and O2's vote requests arrived. Since O2 returned a YES vote, the COND_YES(O2) votes can be interpreted as yes votes. Consequently, O1 will eventually receive a majority of votes and thus can enter the ?Ready? state. If O1 aborts its transaction, the ?unvote? mechanism ensures that all voters will eventually withdraw their votes given to O1 (or forget the stage). Therefore, also a lower priority orchestrator will get the chance to collect a majority of votes. As shown above, if there are several orchestrators, exactly one of them will eventually enter the ?Ready? state. This orchestrator's transaction will either commit or abort. In the case of commitment, all stage nodes forget the stage, and thus no other transaction of this stage will be able 4 WP 2.3: Recoverable Agents 42 to commit any more. In the case of abort, the transaction becomes ?Unknown? and its orchestrator starts a new transaction. In the latter case, as shown above, this or another orchestrator will eventually become ?Ready?. Consequently, exactly one orchestrator will eventually perform commitment. 4.1.6 Monitoring and Selection Protocol In the previous section, we already pointed out that in addition to the agent also the stage record of the stage to be performed next, say S, is Put into the input queues of the nodes associated with S. Remember that all these Put operations are performed within the transaction of the previous stage and thus are ?all or nothing?. Each stage node reads the stage record without removing it from its input queue and decides its initial role depending on the priorities recorded in the stage record. The node with the highest priority becomes the worker node, which then performs the sequence of operations already outlined in Section 4.1.4 : Begin_Transaction; Get(Agent); Execute(Agent); Put(Agent, StageRecord) to (AllNodesOfNext-Stage); Commit. The other stage nodes are observers, which monitor the worker. A worker, say W, periodically sends I_AM_ALIVE messages to the observers of its stage. If it receives an I_AM_ALIVE or I_AM_SELECTED (see below) message from another node, then there obviously exists a competing worker, say W'. If W' has a higher priority than W, W sends a HIGHER_PRIO request to the local orchestrator. If the response is GAVE_UP (see Section 4.1.6 ), W becomes an observer monitoring W'. When an observer times out while waiting on the worker's I_AM_ALIVE messages, it assumes that the worker is not available any more and initiates the procedure for selecting a new worker. The selection protocol described below adopts the basic principles of the fault-tolerant bully algorithm [GM82], which allows elections taking place even in presence of network partitioning. As already mentioned above, this protocol may end up with several workers in case of network partitioning. The voting protocol described in 4.1.5 ensures that only one worker commits its transaction. A node initiating the selection procedure sends ARE_YOU_THERE messages to all stage nodes with a higher priority. Available nodes (observers as well as workers) reply to this message with an I_AM_THERE message. If no reply arrives within a reasonable time, the initiator is selected to be the new worker (Figure 4.7a). The newly selected worker sends an I_AM_SELECTED message to all other stage nodes, and starts a new stage transaction comprising the sequence of operations sketched above. If the initiator receives a reply instead, it waits for the I_AM_SELECTED (or I_AM_ALIVE) of the new worker to arrive (Figure 4.7b). When this message arrives, it starts monitoring the new worker. Figure 4.7a Successful selection Figure 4.7b Unsuccessful selection 1 I_AM_THERE ARE_YOU_THERE 2 3 1 ARE_YOU_THERE 2 3 1 I_AM_SELECTED 2 3 Timeout 4 WP 2.3: Recoverable Agents 43 In the presence of network partitioning, the protocol presented so far selects a worker in each partition, if two partitions are joined, two workers remain in the resulting partition. Note that this is not a problem since our voting protocol ensures that only one worker will commit. Starting a transaction in a partition that does not include a majority of nodes is at least questionable. With a little modification of the protocol, starting transactions in partitions without a majority of nodes can be avoided: Observers getting an I_AM_SELECTED message are supposed to reply with an ACK, and the initiator of the selection protocol becomes the new worker only if it receives a majority of ACKs. Therefore, the initiator periodically sends I_AM_SELECTED messages until it receives either a majority of ACKs or an I_AM_SELECTED from a higher priority node. In the first case, it becomes the new worker, while it continues to be an observer in the latter case. The worst case message complexity of the chosen algorithm is O(n2) (all n stage nodes initiate the algorithm at the same time), the time complexity of the algorithm is O(1). A lot of other, more efficient election algorithms have been proposed (e.g. [AA88], [MNHT89], [Singh96], [Singh97]). Most of these algorithms do not terminate in the presence of network partitioning, but the main idea of some of these algorithms, to reduce message complexity by ?eliminating? the other nodes sequentially, could be applied to our selection problem. Using this approach enhances the time complexity considerably to O(n). Therefore, as long as stages are relatively small (max. 5 to 7 nodes), the chosen algorithm is considered to be convenient. 4.1.7 Blocking Probability and Message Complexity The worker node needs to collect a majority of votes during 2PC processing to be able to commit the transaction of a stage. Therefore, a transaction can only be committed if more than half of the stage nodes (including the worker) are available. This fact can be used to give a (simple) metric for the availability As of a stage which is the probability that a majority of stage nodes is available so that an agent can finish a step and proceed with the next step. Let n be the number of the nodes of a stage and p be the availability of an individual node (i.e. the probability that the node is available). Then the probability that exactly m out of these n nodes are available can be calculated using the binomial probability function [HuGr71]. The availability As(n, p) of a stage S can then be calculated by The blocking probability is defined to be the probability that the agent is blocked in the stage. It is calculated by . The relative blocking probability Br(n, p) is calculated by , where a relative blocking probability of Br(n, p)=0.4 means for example that the probability of an agent blocking in a stage with n nodes (node availability p) is only 40% of the probability of an agent blocking on one node with availability p. f n m , ( ) n m ? ? ? ö pm 1 p ? ( ) n m ? ( ) = A s n p , ( ) n i ? ? ? ö p i 1 p ? ( ) n i ? ( ) . i n 1 + 2 ------------ = n ? = Bs n p , ( ) 1 As n p , ( ) ? = B r n p , ( ) B s n p , ( ) B s 1 p , ( ) ? = 4 WP 2.3: Recoverable Agents 44 Table 4.1 and Figure 4.8 show the relative blocking probability Br depending on the availability p of a node and the number n of stage nodes. It shows that an odd number of nodes bigger or equal to 3 reduces the relative blocking probability dramatically. Obviously, the fault-tolerant protocol considerably reduces the blocking probability of an agent compared to the simple protocol of Section 4.1.3 . The price one has to pay for this is the overhead introduced by the protocol. In this section, we compare the number of inter-node messages necessary for the simple protocol with the number of intra-node messages necessary for the fault tolerant protocol. Assuming that errors are the exception, we only examine the error-free case. The simple protocol introduced in Section 4.1.3 is the ?cheapest? possibility to provide the exactly-once property for mobile agents. The amount of messages used depends heavily on the implementation of the message queues. An optimized version of a message queue only needs 4 messages, piggybacking the PREPARE to the put message: (data + PREPARE), PREPARED, COMMIT and ACK The communications potentially taking place in the fault-tolerant protocol are shown in Figure 4.9. Using optimized message queues as described above, all in all 4n messages (n=number of nodes) are necessary for the transport of the agent to the next stage and the 2PC (solid arrows). If the worker of the current stage is also member of the next stage, the number of messages reduces to 4(n-1). The two phases of the voting protocol (dashed arrows) need 2(n-1) messages for the voting and 2(n-1) messages for the termination of the stage. In addition, (n-1) messages are sent periodically during stage execution for monitoring purposes. Further optimizations are possible. If the timeout for monitoring is bigger than the execution time of an agent in a stage, no monitoring messages are necessary (the FORGET replaces the I_AM_ALIVE). Additionally, the FORGET message may be delayed (and piggybacked onto another message) until the next monitoring message would have to be sent. Finally, the ACK message acknowledging the FORGET can be delayed some time (it just has to be in time before 1 3 5 7 9 0,7 0,8 0,9 0,95 0,99 0% 50% 100% B r n p Table 4.1 Relative blocking probability of a stage Figure 4.8 Relative blocking probability of a stagef n p 0.75 0.9 0.99 1 100% 100% 100% 2 175% 190% 199% 3 62% 28% 3% 4 105% 52% 6% 5 41% 9% ~0% 6 68% 16% ~0% 7 28% 3% ~0% i i+1 2PC Agents Monitoring Voting Termination Figure 4.9 Communication patterns 4 WP 2.3: Recoverable Agents 45 the receiver resends the FORGET). Another possibility to reduce the overhead based on the knowledge on the agents itinerary is presented in the next section. Altogether, the number of messages for the fault-tolerant protocol without monitoring messages is between 6(n-1) if the worker is member of the next stage and all FORGETs as well as all ACKs could be piggybacked and 4n+4(n-1) if none of the optimizations is possible. 4.1.8 Optimizing the Stage Construction The fault-tolerant protocol described above introduces some overhead. This overhead can be reduced if some information about the agent's travel plan is known in advance. The first part of this section introduces a facility called itinerary which allows a very flexible description of the agent's travel plan. Then the notion of the exactly-once property is extended to the itinerary concept and a classification of stage node types is introduced. Finally, an algorithm reducing the protocol overhead is presented. The Itinerary While performing a job, a mobile agent often has to visit several nodes to use services offered locally. In many cases, some (or all) of these nodes are either known before agent initialization or can be determined by the agent several steps in advance. However, as in real life, no strict order exists in which the nodes have to be visited. For example, an agent having to buy a CD, one pound of beef and a theatre ticket, may perform these tasks in any sequence. On the other hand, if there are several branches of a music shop, the agent needs only to visit one of these branches. To exploit the possible benefits given by a flexible travel plan (e.g. by calculating the shortest path) and to provide a powerful facility to the agent developer, an itinerary concept is provided. This itinerary concept allows a very flexible specification of an agent's travel plan as well as the dynamic adaptation and expansion of the travel plan during the execution of the agent. The itinerary is composed using different types of itinerary entries. The simplest form of an entry is a simple pair (node, method) specifying a node which has to be visited and the step (defined by method) which has to be executed on this node (see [WoEA97]). The other possible entries, called sequence, set and alternative contain several other entries (recursively). A sequence is a list [e1,...,en] of n entries (n?1) defining that the nodes specified by entry ei (1?i|Si\Nextw|, then all nodes in Si\Nextw are used as exception nodes. In order to select the yet missing exception nodes, future destinations specified in the itinerary can be taken into account. In sum, Si+1 includes Nextw and a set of exception nodes selected as described above. Since Nextw includes regular nodes, the exception nodes are assigned a lower priority than the ones in Nextw . Priorities: To determine the priorities of the nodes, several possibilities exist. A simple approach is to randomly assign the priorities just ensuring unique priorities per stage. A more effective strategy is to exploit knowledge about node reliability. In this case, the priorities are assigned in accordance with the reliability, i.e., the higher the reliability, the higher the priority. Priorities between nodes with equal reliability can be chosen randomly. Obviously, with this heuristic more reliable nodes are preferred for agent execution. A third strategy to determine the priorities is to take into account not only the next stage but also the ones following the next stage. Unfortunately, the nodes which can be visited in a stage depend on the worker of the previous stage, making the computations rather complex and time consuming. Therefore, this path hasn't been investigated further here. The possible reductions of the protocol overhead gained by the presented algorithm are shown in Table 4.2. Assuming a number of stage nodes of n=3, the use of e.g. one node of the current stage as stage node of the next stage (regular or exception node) already reduces the number of code transports by one third. The example itinerary shown in Figure 4.11 represents the optimal case for the stage construction algorithm. The first stage (n=3) contains nodes N1 to N3, ordered by their priority. For this first stage, only the code has to be transported to the nodes (3 code transports) and the transaction of putting the code onto the first stage nodes has to be committed (4*3 mes- (N3, m) (N2, m) (N1, m) set Figure 4.11 Simple itinerary 4 WP 2.3: Recoverable Agents 49 sages). In the second stage, the worker of the first stage, (assume N1), acts as exception node. Here, no code transports, only two state transports and 4*2 messages for the 2PC are necessary. In the third stage, N1 and the worker of the second stage (assume N2) act as exception nodes. The number of data transports is the same as in the last stage. Therefore, a total of 3 code transports (no overhead at all!), 4 state transports (overhead: 2 transports) and 28 messages for the 2PC are needed. Without the flexible definition of the itinerary and the optimization of the stage construction algorithm, 9 code transports, 9 state transports and 36 messages for 2PC would have been used. In our example of Section , the first stage (n=3) contains the two ticket service shops and the flower shop (3 code transports + 4*3 messages). If the concierge agent is executed on the flower shop node first, the second stage consists of the two ticket service nodes (as regular nodes) and the flower shop node as exception node (two state transports and 4*2 messages). Then, the third stage consists of one of the restaurant nodes (depending on the ticket node used as worker in stage two) as regular node and the worker node of stage two and one other node of stage two as exception nodes (one code transport, two state transports and 4*2 messages). This results in a total of 4 code transports (overhead: one transport), 4 state transports (overhead: two transports) and 28 messages for 2PC. In this case, each stage contains as much as possible regular nodes. If, on the other side, the worker node of the first stage is one of the ticket service nodes, there can be an additional code transport if the flower shop is not contained in the second stage (the flower shop node is not in Nextw in this case, which is rather an inadequacy of the itinerary concept than of the stage construction algorithm). 4.1.9 Related Work In the field of mobile agents, only few research groups have considered aspects of transaction management and fault tolerance so far. In [MinsEA96] and [Sch97], a stage model similar to the one in this paper is proposed. However, the papers focus on a different aspect of fault-tolerance. Fault tolerance is achieved by processing the agent on each stage node (in parallel) and to send the migrating agent to all nodes of the following stage. Stage nodes perform voting on incoming agents to determine a majority of equal agents. Only an agent from this majority is processed further. Our approach assumes alternative services in a stage instead of replicated ones. Even in the case of failures, the agent is performed exactly once, e.g. in a car rental stage consisting of a Hertz, Avis and Budget server, the step ?rent a car? is eventually performed once at one server. In [SK97], an agent- based transaction model is presented. Similar to our model, an agent r Code Transports State Transports Messages for 2PC Si ? Si+1 = {} n n 4n Si ? Si+1 ? {} n - |Si ? Si+1| n 4n Si ? Si+1 ? {} wi ? Si+1 n - |Si ? Si+1| n-1 4(n-1) Table 4.2 Possible reductions of the overhead 4 WP 2.3: Recoverable Agents 50 executes a transaction while moving from node to node. To prevent the blocking of agents due to long-lasting failures, the use of monitoring components is proposed. However, this paper purely concentrates on modelling aspects, protocols or algorithms are not given. [DalmEA98] introduces a mechanism based on a special system model. A site consists of several nodes interconnected by a reliable LAN. Agents within a site are monitored by a recoverable checkpoint manager (CM). The CM, which is assumed to be very reliable, is responsible to keep track of the agents and to restart the agents after a node has crashed and recovered. Unfortunately, this approach is very sensitive against CM crashes (no migration is possible during a CM crash) and long lasting node crashes (an agent cannot be restarted while a node is down). [VoKuMo97a] [VoKuMo97b] claim to realize exactly-once semantics for mobile agents by executing the migration between nodes in a distributed transaction. According to the description, the agents are not stored in stable storage. Considering this fact and the fact that there is no monitoring of agents, it can be assumed that agents are lost in case of a node crash and that agents are caught in case of a network partitioning. [JoReSc95] proposes an implementation of fault-tolerance using rear-guard agents, which are instantiated at migration time on the system the agents are leaving and which monitor the progress of the agents. Unfortunately, no details are given. Particularly, it is unclear how the blocking of agents on nodes can be prevented during a partitioning between the rear-guard and the agent. [BagcEA98] presents a mechanism to provide (non agent) applications with fault-tolerance using mobile agents. In addition, there has been a lot of related work in the fields of transaction processing and faulttolerant computing. The ConTract model [WR92][RSS97] also aims at the exactly-once property and similar to our approach only allows for forward recovery. A ConTract is defined by a script which is performed by a ConTract manager. A first prototype implementation, APRI- COTS [Sch93], will be extended to allow the migration of scripts between ConTract managers, even in the case that the ConTract manager processing the script crashed, by using logging information written during the execution of the script to recover the state of the script on another node. However, there is currently no component which autonomously (and reliably) initiates the migration of a script to another ConTract manager if the ConTract manager executing the script crashes. Another approach is the use of process pairs [GR94] (also called hot backups [BE97]), where a primary process executes the program and sends checkpoint messages to a backup process. The backup process monitors the primary and takes over the execution using the latest checkpoint information received. In this approach, the communication between process pairs is assumed to be reliable. An extension to this approach are system pairs [GR94], where the processors can be geographically remote. This approach also cannot deal automatically with network partitioning. An approach based on replicated objects is presented in [BeedEA95]. Each replica of an object gets all messages sent to the object and executes all methods that are invoked. Only the primary sends messages (method calls, replies on method calls) to other objects. The system handles failures of primaries by selecting one of the replicas as primary. The communication system is assumed to be reliable and to offer an atomic broadcast. Despite the extraordinary overhead of executing all calls on an object on all its replicas, no byzantine errors are detected. Voting algorithms are mainly used in the area of replicated data (e.g. [Thom79][Giffor79]) and mutual exclusion in distributed systems (e.g. [Maek85]). In our protocol, voting is used as a mutual exclusion mechanism preventing multiple stage nodes to commit. 4 WP 2.3: Recoverable Agents 51 4.1.10 Conclusion and Future Work We have investigated how the exactly once property can be provided in mobile agent systems in a fault-tolerant way. We presented a protocol guaranteeing this property, while reducing the probability for agent blocking. Moreover, we proposed an architecture that allows to integrate the protocol in standard transactional technology. In other words, the proposed mechanism can be realized on top of conventional TP-monitors and transactional message queues. Currently, the protocol is under implementation in the Mole system [MOLE] and will be evaluated in terms of performance. Results are expected to be available in autumn 1998. Future work will concentrate on enhancing our agent execution model. A straightforward extension is to allow the agent to access remote services resulting in distributed transactions which include arbitrary nodes. More sophisticated problems to solve are the communication with other mobile agents and the support of enhanced transaction models. One limitation of our current protocol is e.g. that, on system level, it allows for forward recovery only. In other words, if a user wants to abort an agent, the potentially required compensation operations are not automatically triggered on the system level. Instead, the logic to perform compensations must be provided in the agent by the agent programmer. This goes in line with the experiences made with today's workflow systems that many operations can only be compensated in an application-specific manner and often require the intervention of human users. However, some compensation can be done automatically, and we will investigate what concepts and protocols are needed to support compensation on the system level. Related problems also to be investigated are support of long-lasting actions including the recovery on other stage nodes without losing all work already done in the stage (e.g. by using safe points) and ?atomic? actions over several nodes. We are confident that we can learn from the research conducted in the field of transaction models (e.g., Sagas [GMS87], open nested transaction [Wei91][WS91], etc.). 4 WP 2.3: Recoverable Agents 52 4.1.11 Appendix The voting protocol described in the Section 4.1.5 is illustrated in this appendix using a pseudocode notation. To keep the code as small as possible, several simplifications are performed. Most of the variables used in the code depend on the current StageId and should be read as x[StageId] instead of x. VoterId and OrchId are synonyms for the NodeId of the node the voter/orchestrator reside on. The orchestrator gets the information about a stage (StageId, TransactionId) from the entity performing the agent and is able to determine the StageId corresponding to the TransactionId (short: TId) and vice versa. All communication primitives are executed asynchronous. The tests using the TId to determine if votes are current are omitted. Orchestrator RECEIVE rm_prepare(TId){ // from TM YV=CYV=NV={}; // Vote Sets REPEAT SEND VOTE(StageId, TId, OrchId) TO StageVoters\(YV+CYW+NW); WAIT(sometime) UNTIL (majority achieved) } RECEIVE YES(StageId, TId, VoterId){ YV = YV + VoterId; CheckCondYes(CYV, YV); IF YV contains majority THEN TA_State = Ready; SEND(rm_yes) TO TM; } RECEIVE NO(StageId, TId, VoterId){ NV = NV + VoterId; IF NV contains majority THEN SEND(rm_no) TO TM; SEND(UN_VOTE(StageId, TId, OrchId)) TO All Voters in YW+CYW } RECEIVE COND_YES(StageId, TId OrchSet, VoterId){ CYW = CYW + (VoterId,OrchSet); CheckCondYes(CYV, YV); IF YV contains majority THEN TA_State = Ready; SEND(rm_yes) TO TM; } RECEIVE HIGHER_PRIO(StageId){ IF TA_State==Unknown SEND(rm_no) TO TM; SEND(UN_VOTE(StageId, TId, OrchId)) TO All Voters in YW+CYW REPLY GAVE_UP; ELSE REPLY ALREADY_DONE; } PROCEDURE CheckCondYes(CYV, YV){ FORALL (vid,orchSet) in CYW DO IF orchSet\YV == {} THEN CYV = CYV\(vid,orchSet)Set); YV = YV + vid; } RECEIVE GIVE_UP(StageId){ SEND(rm_no) TO TM; } 4 WP 2.3: Recoverable Agents 53 RECEIVE rm_commit(TId){ // from TM TA_State = Committed; SEND(rm_ack) TO TM; REPEAT SEND FORGET(StageId, OrchId) TO StageVoters; WAIT (sometime); UNTIL (all ACKs received) TA_State = Unknown; } RECEIVE rm_abort(TId){ // from TM TA_State = Unknown; SEND(UN_VOTE(StageId, TId, OrchId)) TO All Voters in YW+CYW } RECEIVE INQUIRY(StageId, VoterId){ IF TA_State == Ready THEN YV = YV + VoterId; ELSIF TA_State == Committed THEN SEND FORGET(StageId, OrchId) to VoterId ELSE // Unknown IF (no active TA for StageId) THEN SEND UN_VOTE(StageId, Null) TO VoterId; ELSE // ignore } PROCEDURE RECOVERY(){ FORALL TA with TA_state==Committed DO REPEAT SEND FORGET(StageId, OrchId) TO StageVoters; WAIT (sometime); UNTIL (all ACKs received) TA_State = Unknown; } 4 WP 2.3: Recoverable Agents 54 Voter RECEIVE VOTE(StageId,TId, OrchId){ IF OrchSet=={} THEN OrchSet = OrchSet + OrchId; REPLY YES(StageId, TId, VoterId); ELSE N=node with highest priority in OrchSet; IF prio(OrchId) < prio(N) THEN REPLY NO(StageId, TId, VoterId); ELSE IF N not VoterId THEN REPLY COND_YES(StageId, TId, OrchSet, VoterId); ELSE SEND HIGHER_PRIO(StageId); RECEIVE ANSWER; IF ANSWER = GAVE_UP THEN IF OrchSet\{N} =={} THEN REPLY YES(StageId, TId, VoterId); ELSE REPLY COND_YES(StageId, TId, OrchSet\{N}, VoterId); ELSE // ALREADY_DONE REPLY NO(StageId, TId, VoterId); IF voted YES or COND_YES THEN REPEAT WAIT(sometime); IF NOT RECEIVED (FORGET or UNVOTE) THEN SEND INQUIRY (StageId, VoterId) TO OrchId; UNTIL RECEIVED (FORGET or UNVOTE) } RECEIVE UN_VOTE(StageId,TId) FROM OrchId{ OrchSet = OrchSet\{OrchId}; } RECEIVE FORGET(StageId, OrchId){ remove stage record and OrchSet; REPLY ACK(StageId); SEND GIVE_UP TO local orchestrator; } PROCEDURE RECOVERY(){ FOR all OrchSets DO FOR all OrchIds in OrchSet DO REPEAT SEND INQUIRY (StageId, VoterId) TO OrchId; UNTIL RECEIVED (FORGET or UNVOTE) } 4 WP 2.3: Recoverable Agents 55 4.2 Concepts for a Reliable and Scalable Agent Server 4.2.1 Requirements Reliability: The agent system ensures, that the agents residing at the system don't get lost if some hardware or software components fail but that they get restarted in a well defined state. Especially in view of the use of mobile agent systems in the commercial environment, this requirement occupies a central role: nobody wants to lose money only because an agent carrying electronic cash gets lost. Scalability: With the increasing use of mobile agents to access services, it is of first importance for service providers, that the agent execution environment scales to be able to flexibly react on an increased demand. Communication and migration functionality should follow closely the functionality of the agent system Mole, which was developed at university of Stuttgart. The agent system offers functionality to synchronously and asynchronously send messages. If a message can't be delivered to its recipient, it is ether queued and delivered at a later time or the sender gets an error message (depending of the specified failure semantics of the message). To deliver a message to an agent, the location calls the receiveMessage() method of the agent. Besides the message mechanism, the agent system offers a remote method invocation mechanism which enables the agent to invoke methods of other agents. Additional to the direct addressing using unique agent names, it is able to use so-called badges for anonymous communication. Badges are tags (simple strings), which an agent may put on e.g. to announce that it offers a certain service. 4.2.2 Architecture of the Agent System The agent system consists of several components. Figure 4.12 depicts these components and their interactions. The basic idea of the agent system is to store all agent in a serialized form in a Tuxedo message queue. In the queue they are protected against software and hardware failures. To execute an agent ready for execution, the Q-Manager process gets the agent out of the message queue and sends the agent to a free server of the Tuxedo server class Sublocation using an asynchronous tpacall. The Sublocation server processes offer the ExecuteAgent function as its only service. This function takes a serialized agent as its parameter and executes this agent in a local (to the process) Java virtual machine. The ExecuteAgent function terminates if the agent terminates, if it migrates to another location or its execution state changes from ?executable? to ?blocked? (communication, sleep-functionality). In the last case, the Q-Manager re-inserts the serialized agent (and some additional state information) into the local message queue. All actions beginning from getting the agent out of the queue up to re-inserting the agent into the queue are performed within a transaction. In the case of an (hard-/soft-)failure of the Sublocation, the Q-Manager rolls the transaction back, if the Q-Manager fails, the transaction is rolled back automatically. If the transaction rolls back, the agent still resides in the input queue in its former state. 4 WP 2.3: Recoverable Agents 56 Locations are places where agent are instantiated, to which agents can migrate and where agents ?live? logically. Locations manage the agents residing at this location. One of their tasks is to make sure that blocked agents residing in the message queue are executed again if an communication request (message, remote method invocation) concerning this agent arrives. For performance reasons, the real communication takes place directly between the involved agents. Hereby, the agents are supported by their sublocation. A sublocation is some kind of runtime environment which among others provides the agent with the ability to communicate and to migrate. A sublocation is launched each time the ExecuteAgent service is invoked. The sublocation registers the agent at the local RMI registry to be able to receive messages and remote method invocations for the agent. Afterwards, the agent is registered at its location by passing the RMI-URL under which the agent can be reached. Before the ExecuteAgent service terminates, the agent is removed at the local RMI registry and its location. 4.2.3 Lifecycle of an Agent The state diagram in Figure 4.13 shows the life cycle of an agent, which will be explained in detail in the next paragraphs. An agent can be started in two different way. On the one hand, an agent can be started from the command line of a location using the new command (syntax: new [[, [,...]]];). On the other hand, an agent can be launched by an other agent using the createAgent() method provided by the location. After the creation and the ini- Location Q-Manager Sublocation Sublocation Sublocation Sublocation Server Class Message Queue Location tpacall / tpreturn tpqueue / tpdequeue tpqueue / tpdequeue Java-RMI Figure 4.12 Architecture of the agent system 4 WP 2.3: Recoverable Agents 57 tialization of the agent by the location, the agent including the information that the agent is ready to run is put into the message queue. The location additionally stores some other informations about the agent in a list containing all agents residing on this location. As soon as a server of the class ExecuteAgent is available, the Q-Manager starts a new transaction, reads the agent from the message queue and invokes its execution using the asynchronous call (tpacall) of the ExecuteAgent service. The ExecuteAgent service now starts a new sublocation. This sublocation registers itself (using the agent's name) at the local RMI registry to be able to receive (and to deal with) messages and remote methods invocations for the agent. Then, the sublocation registers the agent at its location using the registerAgent() method of the location. The location updates its informations about the agent (e.g. ?agent xy now being performed in service process z?) and returns state information to the sublocation. This state information tells the sublocation what to do with the agent. If the agent is newly started or if it is just migrated to the location, the start() method of the agent has to be invoked. If the agent stayed in the queue while being blocked (sleeping, waiting for communication), according steps have to be taken (see below). The start() method of an agent implements the desired functionality of the agent. During the execution of start(), the agent may communicate to other agents (messages, remote method invocations), use local services and invoke the migration to other locations. It is important to note that, because the agent always starts its execution on a location with the start() method, the agent itself always has to know what to do on the location it just arrived (e.g. by using some information stored in the agents data). If the agent intends to terminate itself, it calls the die() method. This method removes the agent at the location and the local RMI registry. Then, the execution of the ExecuteAgent service is terminated using the tpreturn call. The Q-Manager, which periodically checks if a service request returned (using tpgetrply), now commits the transaction (which was started before the agent was read from the queue). If, during the execution, an agent has no more active threads (e.g. if it sleeps or waits for communication requests), the agent can be put back into the message queue to enable other agents to be executed. In this case, the sublocation informs the location on the agent's state (waiting Queued State: Ready Queued State: Blocked Executing new ... createAgent() die() ExecuteAgent Migration agent terminates his threads RMI of / message to the blocked agent Figure 4.13 Life cycle of an agent 4 WP 2.3: Recoverable Agents 58 for communication, sleeping) and remove the agent (more exact: itself) from the local RMI registry. Then, the execution of the ExecuteAgent service is terminated using the tpreturn call, informing the Q-Manager to put the agent (and its state ?blocked?) into the message queue and to commit the transaction. If the reason that the agent is blocked doesn't hold any longer (communication request or time to sleep is exceeded), the location has to reactivate the agent. Therefore, it gets the agent from the message queue and puts the agent back in the queue with the information that the agent is ready to be executed. As soon as a server from the ExecuteAgent class is available, the Q-Manager invokes the agent execution as described above. If the sublocation executing the agent now registers itself at the location, the location provides the information why the agent has been activated (message, remote method invocation, time to sleep exceeded, combinations are possible) enabling the sublocation to invoke the required functionalities (receive the message, call the desired method or activate the agent after its sleep). 4.2.4 Messages The sublocation offers two different methods for message transmission for the agent. The synchmessage() method transfers messages synchronously using the thread in which the method has been called. The message() method transfers messages asynchronously by starting a new thread in which synchmessage() is executed. This enables the agent to continue its execution while the message is being transferred. Both methods require as a parameter an error sematics specification defining their behaviour in the case that the message can't be delivered to the receiver agent. Currently, three different possibilities exist. In the first case, the message is only discarded if the receiver agent does not reside at the specified location, no error message is generated. In the second case, the message is discarded and an error message is sent to the sender of the message. In the third case, the message is stored at the specified receiver location and is delivered when the receiver agent arrives at the location. Other parameters of the two methods are the receiver of the message and, of course, the content of the message. The receiver of the message is specified by providing the receiver agent's unique agent name and the location where the receiver agent resides. The location has to be specified because there is currently no service to map an agent's name to its current location. In the following paragraphs, a short description of the synchmessage() method is given. If the synchmessage() is invoked, the sublocation of the sender first invokes the getAgentRMIURL() method of the sender location to get the RMI-URL of the receivers sublocation. If the sender and the receiver reside at different location, the senders location passes the method invocation to the receiver location. If the receiver is currently being executed by the ExecuteAgent service, getAgentRMIURL() immediately returns the existing RMI-URL of the receiver agent's sublocation. If the receiver agent currently stays in the message queue, its location reactivates the agent by reading it out of the message queue and re-inserting it with the information that the agent is now ready for execution. Now the thread executing getAgentRMIURL() is suspended until a sublocation registers the receiver agent with the location via registerAgent(). Now, getA- gentRMIURL() is able to return the agent's RMI-URL. 4 WP 2.3: Recoverable Agents 59 The sending sublocation is now able to get a reference to the receiver's sublocation using the RMI lookup (with the RMI-URL received by getAgentRMIURL()). This reference is used to invoke the deliverMessage() method of the receivers sublocation. This method now invokes the best-fitting receiveMessage() method of the agent using the java reflection API (the agent may contain several methods having different message types (classes) as (its only) parameter). Figure 4.14 shows the synchronous message transfer in the case that sender and receiver agent both reside at the same location (black bars symbolize executing threads). In general, two communicating agents send more than one message in both directions. To enhance the efficiency of the communication by avoiding to resolve the receiver agents sublocation (by calling getAgentRMIURL() and resolving the reference by RMI lookup) for each communication, a cache is used. This cache maps agent names to the last known sublocation reference. At each invocation of synchmessage(), the cache is searched for an entry for the receiver agent. If the cache contains an entry, deliverMessage() is invoked with this reference as a parameter. If the reference contained in the cache is obsolete (e.g. because the agent stays in the message queue), an exception occurs. In this case as well as the case that there is no cache entry for the agent, getAgentRMIURL() and a RMI lookup are invoked as described above and the obtained reference is stored in the cache. The receiver agents may not only be addressed using their (unique) agent name, but it is also possible to send messages to anonymous recipients by using a badge. An agent may pin on a badge (containing a string) which can then be used to address the agent (instead of the agents name). If several agents on a location have pinned on the same badge, messages being addressed to agents wearing this badge are distributed ?round-robin? to the agents wearing this badge. Because agents are able to remove a badge they have pinned on and because badges can be used to realize a rudimentary load balancing between several agents, agent sublocations which are resolved from a badge are not stored in the cache. Sender Location Receiver getAgentRMIURL() deliverMessage(), receiveMessage() Figure 4.14 The synchmessage() method 4 WP 2.3: Recoverable Agents 60 4.2.5 Remote Method Invocation To invoke a method offered by another agent, the sublocation offers a call()-method to its agent. This method takes as parameters the name and the parameters of the method which should be called and the name (or badge) and location of the agent offering the method. The execution of the call() method is very similar to the execution of synchmessage(). By means of getAgentRMIURL() and RMI lookup, the reference of the sublocation of the receiver agent is resolved. This reference is then used to invoke the dispatchRPC() method of the receiver agent. This method uses the Java reflection API to invoke the desired method. Figure x may also be used to show the execution of a remote method invocation (substitute synchmessage() by call() and deliverMessage()/receiveMessage() by dispatchRPC()) 4.2.6 Migration If an agent wants to continue its execution on another location, it calls the migrateTo(destinationLocation) method of its sublocation. This method first suspends all active threads of the agent (of course all but the thread executing the migrateTo()). Then, the RMI-URL of and the reference to the destination location is determined. This reference is used to invoke the handleMigration() method of the destination location, passing the serialized agent as parameter. The handleMigration() method inserts the agent in its message queue. If the agent is passed successfully to the other location, the ?old? sublocation deletes its entry in the RMI registry and its location. Then, the ExecuteAgent service terminates and the Q-Manager commits the affiliated transaction. If the agent cannot be passed successfully to the destination location, the suspended threads of the agent are reactivated 4.2.7 Current State The agent server is currently under implementation on a Tandem Himalaya under OSS. First results are expected in June 1998. 4.3 Bibliography [AA88] Abu-Amara, H.H. 1988. ?Fault-tolerant distributed algorithm for election in complete networks.? IEEE Transactions on Computers, 37(4). 449-453 [AgrAbb91] Agrawal, D.; El Abbadi, A.: ?An Efficient and Fault-Tolerant Solution for Distributed Mutual Exclusion?. In: ACM Transactions on Computer Systems, Vol. 9, No. 1. February 1991, pp. 1-20 [BauEA98] Baumann, J.; Hohl, F.; Rothermel, K.; Straßer, M.: ?Mole - Concepts of a Mobile Agent System.? accepted for ?WWW Journal, Special issue on Applications and Techniques of Web Agents?, Baltzer Science Publishers, 1998. [BE97] Bernstein, P.; Newcomer, E.: ?Principles of Transaction Processing?. Morgan Kaufmann Publishers Inc, 1997 [BeedEA95] Beedubail, G., A. Karmarkar, A. Gurijala, W. Marti, and Udo Pooch. 1995. ?Fault Tolerant Objects in Distributed Systems Using Hot Replication?. Technical Report 4 WP 2.3: Recoverable Agents 61 TR_95-023. Department of Computer Science. Texas A&M University [BeHaGo87] Bernstein, P.; Hadzilacos, V.; Goodman, N.: ?Concurrency Control and Recovery in Databas Systems.? Addison-Wesley Publishing Company. 1987 [Blak95] Blakeley, B.: ?Messaging and Queuing Using the MQI?, McGraw-Hill series on computer communications, 1995 [ChaCha97] Chang, Y., and Chang, Y.: ?A Fault-Tolerant Triangular Mesh Protocol for Distributed Mutual Exclusion.? In: Operating Systems Review, Vol. 31, No. 3, July 1997. ACM Press. pp 29-44 [Giffor79] Gifford, D.K. 1979. ?Weighted Voting for Replicated Data.? Proc. 7th Symp. on Operating System Principles 1979 (SOSP''79). ACM, New York. pp. 150-162 [GM] General Magic: ?Agent Technology?, http://www.genmagic.com/agents/ [GM82] Garcia-Molina, H.: ?Elections in a Distributed Computing System?. IEEE Transactions on Computers, Vol. C-31(1), 1982 [GMS87] Garcia-Molina, H.; Salem, K.: ?Sagas?. In: Proc. ACM Conf. on Management of Data, pp. 249-259, 1987 [GR94] Gray, J.; Reuter, A.: ?Transaction Processing - Concepts and Techniques?. Morgan Kaufmann Publishers Inc, 1994 [GV97] Ghezzi, C.; Vigna, G.: ?Mobile Code Paradigms and Technologies: A Case Study?. In: Mobile Agents, Proc. 1st Int. Workshop, MA'97. Springer, 1997 [HR93] Haerder, T.; Reuter, A.: ?Principles of Transaction-Oriented Database Recovery.?. ACM Computing Surveys, 15(4), 1993 [HuGr71] Hughes, A.; Grawoig, D.: ?Statistics: A Foundation for Analysis.? Addison-Wesley Publishing Company, 1971. [IyKaBaA97] Iyer, R.K.; Kalbarczyk, Z.; Bagchi, S.: ?Chameleon: A Software Infrastructure and Testbed for Reliable High-Speed Networked Computing?. UIUC (University of Illinois at Urbana-Champaign) Technical Report No.UILU-ENG-97-2218, July 97. [Jal94] Jalote, P.: ?Fault Tolerance in Distributed Systems?. Prentice Hall Inc., 1994 [JoReSc95] Johansen, D.; van Renesse, R.; Schneider, F.B.: ?Operating system support for mobile agents.? Proceedings of the 5th. IEEE Workshop on Hot Topics in Operating Systems. IEEE. 1995 [Lam81] Lampson, B.: ?Atomic Transactions?. In: Lampson, B. et al (eds): ?Distributed Systems - Architecture and Implementation?, Springer-Verlag, 1981 [LO97] Lange, D.; Oshima, M.: ?Java Agent API: Programming and Deploying Aglets with Java?. To be published by Addison-Wesley, Fall 1997; a working draft, ?Programming Mobile Agents in Java, is available at http://www.trl.ibm.co.jp/aglets/whitepaper.htm. [Maek85] Maekawa, M.: ?A Algorithm for Mutual Exclusion in Decentralized Systems.? In: ACM Transactions on Computer Systems, Vol. 3, No. 2, May 1985, pp 145-159 [MinsEA96] Minsky, Y.; van Renesse, R.; Schneider, F.B.; Stoller, S.D.: ?Cryptographic Support for Fault-Tolerant Distributed Computing.? In:Proceedings of the Seventh ACM SIGOPS European Workshop. 1996. pp 109-114. [MNHT89] Masuzawa, T., N. Nishikawa, K. Hagihara, and N. Tokura. 1989. ?Optimal faulttolerant distributed algorithms for election in complete networks with a global sense of direction.? In Proceedings of the 3rd Int'l Workshop on Distributed Algorithms. [Mole] Project Mole, http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole.html [OMG96] Object Management Group: ?CORBA 2.0 specification?, ptc/96-03-04, 1996 N 4 WP 2.3: Recoverable Agents 62 [PS97] Peine, H.; Stolpmann, T.: ?The architecture of the Ara platform for mobile agents.?. In: Mobile Agents, Proc. 1st Int. Workshop, MA'97. Springer, 1997 [RoSt98] Rothermel, K.; Straßer, M.:?A Fault-Tolerant Protocol for Providing the Exactly- Once Property of Mobile Agents?. Accepted for 17th IEEE Symposium on Reliable Distributed Systems. 1998 [RSS97] Reuter, A.; Schneider, K.; Schwenkreis, F.: ?ConTracts Revisited?. In: S. Jajodia and L. Kerschberg (ed.): Advanced Transaction Models and Architectures (ATMA), Kluwer Verlag, 1997 [SBH96] Straßer, M.; Baumann, J.; Hohl, F.: ?Mole - A Java Based Mobile Agent System?. In: Mühlhäuser, M.: ?Special Issues in Object-Oriented Programming?, Workshop Reader ECOOP'96, p327-334, dpunkt.verlag, 1996. pp 327-334 [Sch93] Schwenkreis, F.: ?APRICOTS - Management of the Control Flow and the Communication System?. In Proc. of the 12th IEEE Symposium on Reliable Distributed Systems, Princeton, October 1993 [Sch97] Schneider, F.: ?Towards Fault-tolerant and Secure Agentry?. In: Proc. 11th Int. Workshop on Distributed Algorithms, 1997 [Singh96] Singh, G. 1996. ?Leader Election in the Presence of Link Failures.? IEEE Transact. on Parallel and Distributed Computing, 7(3) [Singh97] Singh, G. 1997. ?Leader Election in Complete Networks.? SIAM Journal on Computing, 26(3). June 1997 [SK97] Morais de Assis Silva, F.; Krause, S.:?A Distributed Transaction Model based on Mobile Agents?. In: Mobile Agents, Proc. 1st Int. Workshop, MA'97. Springer, 1997 [Spe82] Spector, A.: ?Performing remote operations efficiently on a local computer network?, Communications ACM, vol. 25, pp 246-260, Apr. 1982 [StRoMa98] Straßer, M.; Rothermel, K.; Maihöfer, C.: ?Providing Reliable Agents for Electronic Commerce.? In: ?Trends in Distributed Systems for Electronic Commerce?, International IFIP/GI Working Conference (TREC'98), Hamburg, Germany, June 1998. Lamersdorf, Winfried; Merz, Michael (Eds.), Lecture Notes in Computer Science 1402, Springer Verlag, Berlin, Pages 241-253. 1988 [Thom79] Thomas, R.H. 1979. ?A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases.? ACM Transactions on Database Systems, Vol. 4, No. 2., June 79.pp. 180-209. [VoKuMo97a] Vogler, H.; Kunkelmann, T.; Moschgath, M.L.: ?Distributed Transaction Processing as a Reliability Concept for Mobile Agents.? In: 6th IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS'97). IEEE Computer Society. 1997. ISBN 0-8186-8153-5. [VoKuMo97b] Vogler, H.; Kunkelmann, T.; Moschgath, M.L.: ?An Approach for Mobile Agent Security and Fault Tolerance using Distributed Transactions.? In; Proc. 1997 Int'l Conference on Parallel and Distributed Systems (ICPADS'97). IEEE Computer Society. 1997 ISBN 0-8186-8227-2 [Wei91] Weikum, G.: ?Principles and realization strategies of multi-level transaction management?. ACM Transactions on Database Systems, 16(1): pp. 132-180, March 1991 [WoEA97] Wong, D.; Paciorek, N.; Walsh, T.; DiCelie, J.; Young, M.; Peet, B.: ?Concordia: An Infrastructure for Collaborating Mobile Agents.? In: Rothermel, K.; Popescu-Zeletin, R. (eds.): ?Mobile Agents. First International Workshop MA `97.? Lecture Notes in 4 WP 2.3: Recoverable Agents 63 Computer Science, Vol. 1219, Springer. 1997, pp. 86-97. [WR92] Wächter, H.; Reuter, A.: ?The ConTract Model?. In: A. Elmagarmid (ed), Database Transaction Models for Advanced Applications, Morgan-Kaufmann, 1992 [WS91] Weikum, G.; Scheck, H.: ?Multi-level transactions and open nested transactions?. IEEE Data Engineering Bulletin, March 1991 [X/O91] X/Open DTP:?X/Open Common Application Environment?, ?Distributed Transaction Processing:Reference Model?, ?Distributed Transaction Processing: The XA Specification? Reading, Berkshire, England: X/open Ltd, 1991 5 WP 2.4: Developed Concepts and Implementation 64 5 WP 2.4: Developed Concepts and Implementation Version 3.0 of Mole has been strongly revised and several requests and proposals from users of the earlier versions of Mole were integrated into the new release. In particular Mole supports communication between agent groups and concept of sessions. The infrastructure of Mole includes a resource manager, a directory service and a global naming scheme for agents. In order to support the design of agents, a graphical agent monitor allows to visualize the system behaviour as a whole or of a single agent in particular. Mole further provides a thread management unit to overcome some shortcomings of the Java virtual machine. Mole provides a simple means for installation and configuration of the system. This chapter summarizes a paper that will be published at the Middleware'98 conference by Baumann, Hohl, Rothermel, Schwehm and Straßer (1998). 5.1 Mole 3.0: A Middleware for Java-Based Mobile Software Agents 5.1.1 Introduction Mobile agents are a new programming model for distributed systems. Generally a mobile agent is a process that can act within a distributed system on behalf of its user. In particular such processes must be able to move freely from node to node in a distributed system and to continue processing asynchronously even if its user is (temporarily) not connected with the system any more. To allow such a functionality, each computing node of the distributed system must provide a suitable infrastructure. Such a platform for mobile agents consists basically of a virtual machine (engine) that must run on each participating node of the distributed system. The engine manages two types of objects: places and agents. A place is an object that provides an infrastructure for executing agents. An agent is a process that can occupy a place and that can communicate with other agents. Stationary agents can provide services and system resources to other agents. A mobile agent can actively move from one place to another (agent migration) and can access or provide services by communication with other agents. Applications of mobile agents can be found among others in the areas of network management, electronic commerce and mobile computing. The idea to send machine independent executable messages via a network can be traced back to the very beginning of the Internet, where the Decode-Encode-Language (DEL) was considered to run interactive programs on remote consoles of a networked system (RULIFSON 1969). Later the idea emerged independently in the area of radio network communication, where the SOFT- NET project used Forth-messages to transmit data as well as to reprogram the underlying network (ZANDER 1981). Another early approach was the Network Command Language by FAL- CONE (1987). In the nineties the term `messengers' was used by TSCHUDIN (1993) to denote active messages programmed in his Postscript-like language M0. The term `mobile agent' was coined in a white paper by General Magic Inc. (1994, republished by WHITE 1997a, 1997b). General Magic's Telescript language was specifically designed for mobile agent programming and already included most of the concepts of later mobile agent systems, but it was dropped when it became clear that it could not compete with Java as a commercial product. In the sequel several mobile agent systems have been developed in the research community. The research sys- 5 WP 2.4: Developed Concepts and Implementation 65 tems were based on such diverse programming languages like untyped scripting languages, e.g. Agent Tcl, an extension of Tcl by GRAY (1995, 1996), or strongly typed functional languages like MAP based on Scheme by PERRET and DUDA (1996), but the mainstream of the systems today are based on Java. Examples are Mole by STRAßER, BAUMANN and HOHL (1996) or Aglets by the IBM AGLETS WORKBENCH TEAM (1997). Other approaches use a language independent approach like TACOMA by JOHANSEN, VANRENESSE and SCHNEIDER (1994, 1995) or Ara by PEINE and STOLPMANN (1997). Recently, several companies and research groups have presented MASIF, a proposal for the standardisation of mobile agent systems (MILOJICIC et al. (1998)). This paper presents Version 3.0 of Mole, one of the first Java-based mobile agent systems. The paper proceeds with an overview of the Mole system in Section 2. Then basic concepts of the Mole system like agent migration in Section 3 and agent communication in Section 4 are presented. Section 5 continues with a description of the agent infrastructure provided by Mole and Section 6 introduces the graphical agent monitor of the system. Implementational issues are discussed in Section 7 and the installation procedure is sketched in Section 8. 5.2 Mole System Overview Our model of an agent-based system - as various other models - is mainly based on the concepts of agents and places. Places provide the environment for safely executing local as well as visiting agents. An agent system consists of a number of (abstract) places, being the home of various services. Agents are active entities, which may move from place to place to meet other agents and access the places' services. In our model (see Figure 5.1)., agents may be multi-threaded entities, whose state and code is transferred to the new place when agent migration takes place. Each agent is identified by a globally unique agent identifier. An agent's identifier is generated by the system at agent creation time. The creating place can be derived from this name. It is independent of the agent's current place, i.e. it does not change when the agent moves to a new place. In other words, the applied identifier scheme provides location transparency. A place is entirely located at a single node of the underlying network, but multiple places may be running on a given node. For example, a node may provide a number of places, each one assigned to a certain agent community, allowing access to a certain set of services or implementing a certain prizing policy. Places are divided into two types, depending on the connectivity of the underly- Mobile Agent Service Agent Place Data Base Agent Migration Desktop Server Legacy Software System Resources Application Local Communication Global Communication User Application Laptop Figure 5.1 Mole System Overview 5 WP 2.4: Developed Concepts and Implementation 66 ing system. If a system is connected to the network all the time (exept for network failures and system crashes), a place on this system is called connected. If a system is only temporarily connected to the network, e.g. a user's PDA (Personal Digital Assistant), the place is called associatedAgent Lifecycle and Agent Mobility 5.2.1 Lifecycle of an Agent After an agent is created, its initialization routine init() is processed (see Figure 5.2). The arguments given to the agent at creation time are passed to this routine. The programmer can set up the internal state and initialize the agent attributes. At this time the agent is still outside any place. Now the agent system injects the agent into the system as if it had just arrived after a migration. First the agent is made known to the place, but other agents are not yet allowed to communicate with it. The prepare()-method is called, allowing the agent to do its place-specific setup, e.g. identifying local services. Finally the agent is started by calling the start()-method. After that normal processing takes place, the agent can start its own threads, use local services, and can communicate locally or remotely. The agent stays on the place until it decides to either migrate or to terminate. If the agent wants to migrate it calls the method migrateTo() (described in more detail in the next section). The system suspends the agent's threads, serializes the agent and sends it to the target place. If the target place accepts the agent, the agent is injected into the system and started again via the methods prepare() and start(). Now the target place sends an acknowledgement back to the source place which removes the suspended agent from the system. If the target place does not accept the agent, an error message is sent back and the agent resumes its work on the original place. It receives an exception as the result of the failed migration, and can react e.g. by trying to migrate to another place. If the agent has reached the end of its life, it calls the method die(). The system now stops all threads of the agent, removes it from the place, and deletes the agent. Additionally the system supports periodic operation. This provides a simple mechanism for the programmer to implement recurring tasks, e.g. checking a database for changes if no trigger mechanism exists. If an agent implements the interface Periodical, then the system, in addition to calling the init()-method, executes the method heartbeatInit(). Now as soon as the start()-method has been called for the first time, the system begins executing a method called heartbeat() in regular intervals. This continues until the die()-method is called. 5.2.2 Agent Migration init() prepare() start() die() migrate() Figure 5.2 Lifecycle of a Mobile Agent 5 WP 2.4: Developed Concepts and Implementation 67 The concept of a mobile agent supports process mobility, i.e., program executions may migrate from node to node of a computer network. Obviously, for migrating agents not only code but also the state information of the agent has to be transferred to the destination. An agent's state is subdivided into data state and execution state. While the first includes the agent's global and instance variables, the latter comprises the local variables and the active threads. According to GHEZZI and VIGNA (1997) two types of agent migration can be distinguished: weak migration and strong migration. With strong migration, the underlying system captures the entire agent state (consisting of data and execution state) and transfers it together with the code to the next location. Once the agent is received at its new location, its state is automatically restored. From a programmer's perspective, this scheme is very attractive since capturing, transfer and restoration of the complete agent state is done transparently by the underlying system. On the other hand, providing this degree of transparency in heterogeneous environments at least requires a global model of agent state as well as a transfer syntax for this information. Moreover, an agent system must provide functions to externalize and internalize agent state. Since the complete agent state (including data and execution state) can be large - in particular for multi-threaded agents - strong migration might be a very time-consuming and expensive operation. These difficulties have led to the development of the so-called weak migration scheme, where only data state information is transferred. The size of the transferred state information can be limited even more by letting the programmer select the variables making up the agent state. As a consequence, the programmer is responsible for encoding the agent's relevant execution states in the program variables. Moreover, the programmer must specify a start method that decides on the basis of the encoded state information where to continue execution after migration. While this method may substantially reduce the amount of state to be communicated, it puts additional burden on the programmer and makes agent programs more complex. The difference between weak and strong migration is a change in semantics, but not in expressive power. One of the design goals of Mole is the ability to run on every Java Virtual Machine (VM). A normal Java VM does not support capturing the state of a thread, which would be a prerequisite for capturing the execution state. Thus our decision was to choose the changed semantics of weak migration and with it the ability to run Mole on unchanged Java interpreters. This includes that, while agents in Mole can be multithreaded, after a migration only one thread is started. If more threads are necessary the agent has to start them explicitly. Weak migration in Mole is implemented by using a part of the Remote Method Invocation package RMI, the object serialization provided as part of Java 1.1. After an agent thread calls the migrateTo()-method, all threads belonging to the agent are suspended (not stopped). No new messages and calls (RPC) to the agent are accepted. After all pending messages to the agent have been delivered, the agent is removed from the list of active agents. Now the agent is serialized using the object serialization. The object serialization computes the transitive closure of all objects belonging to the agent (ignoring transient objects and threads), and creates a systemindependent representation of the agent. This serialized version of the agent is sent to the target place that reinstantiates the agent. If any of the Java classes needed are not available locally, the target place requests these classes either from a code server (HOHL et al., 1997), or from the source place. Now the agent is reinstantiated. One new agent thread is started. First the prepare()-method is called to initialize the place-specific attributes. Then this thread begins its work at the start()-method. As soon as the thread assumes control of the agent, a success message is sent back to the source place. The source place now terminates all threads pertaining 5 WP 2.4: Developed Concepts and Implementation 68 to the agent and removes it from the system at any stage of the migration an error occurs, the migration is stopped and the agent threads at the source place are resumed. The control flow continues after the migrateTo() statement, where error handling can be implemented (an exception is thrown in the case of failure). 5.3 Agent Communication and the Session Concept As will be seen below, a session between agents can be established only if the agents can identify each other. In our model, there are basically two ways how agents can be identified, the unique agent identifiers - comparable to object identifiers - and the so-called badges. Agent identifiers are well-suited for identifying service agents, as long as there exists a directory system, that maps user-defined service names to service agent identifiers. Note, however, that the directory service is not part of our base system, i.e., we clearly separate the mechanism for identifying services from the one for finding services. As a consequence, different naming schemes and directory systems can be used on top of this system. The directory service provided by Mole is described in Section 5.2. 5.3.1 Badges In the case of mobile agents the concept of agent identifiers is not always sufficient. Assume for example, that an agent wants to meet some other agent participating in the same task at a given place. If only agent identifiers were available, both agents would have to know each others identifiers. Actually, for identification it would be sufficient to say ?At place XYZ I want to meet an agent participating in task ABC?. This type of identification is supported by the concept of badges. A badge is an application-generated identifier, such as ?task ABC?, which agents can ?pin on? and ?pin off?. This badge does not necessarily have to be unique, it simply represents a role of an agent at a given time. As long as the agent provides the functionality associated with this role, it wears the badge. An agent may have several badges pinned on at the same time. Badges may be copied and passed on from agent to agent, and hence multiple agents can wear the same badge. For example, all agents participating in a subtask may wear a badge for the subtask and another one for the overall task. The agent that carries the result of the subtask may have an additional badge saying ?CarryResult?. Using badges, an agent is identified by a (place identifier, badge predicate)-tuple, which identifies all agents fulfilling the badge predicate at the place identified by the place identifier. A badge predicate is a logical expression, such as (?task ABC? AND (?CarryResult? OR ?Coordinator?)). Obviously, this is a very flexible naming scheme, which allows to assign any number of application-specific names to agents. To change the name assignments two functions are provided, PinOnBadge(badge) and PinOff- Badge(badge). 5.3.2 Sessions A session defines a communication relationship between a pair of agents. Agents that want to communicate with each other, must establish a session before the actual communication can be started. After session setup, the agents can interact by remote method invocation or by message passing. When all information has been communicated, the session is terminated. Sessions have 5 WP 2.4: Developed Concepts and Implementation 69 the following characteristics: Sessions may be intra-place as well as inter-place communication relationships, i.e., two agents participating in a session are not required to reside at the same place. In order to preserve the autonomy of agents, each session peer must explicitly agree to participate in the session. Further, an agent may unilaterally terminate the sessions it is involved in at any point in time. Consequently, agents cannot be trapped in sessions. While an agent is involved in a session, it is not supposed to move to another place. However, if it decides to move anyway, the session is terminated implicitly. The main reason for this property of sessions is to simplify the underlying communication mechanism, e.g., to avoid the need for message forwarding. The question may arise, why sessions are needed at all. There are basically two reasons: Firstly, the concept of a session can be used to synchronize agents that want to meet for cooperation. Note that the first property stated above allows agents to meet even if they stay on different places. The concept of a session is introduced to allow agents to specify which other agents they are interested to meet at which places. Furthermore, it allows agents to wait until the desired cooperation partner arrives at the place and indicates its willingness to participate. Secondly, we want to support both ?stateless? and ?stateful? interactions. In contrast to the first, the latter maintain state information for a sequence of requests. Obviously, if they encapsulate ?stateful? servers, service agents have to be ?stateful? also. A prerequisite for building ?stateful? entities are explicit communication relationships, such as sessions. In order to establish sessions, two methods are provided, PassiveSetUp() and ActiveS- etUp() (see Figure 5.3). The first operation is non-blocking and is used by agents to express that they are willing to participate in a session. In contrast, ActiveSetUp() is used to issue a synchronous setup request, i.e., the caller is blocked until either the session is successfully established or a timeout occurs. If ActiveSetUp() succeeds, it returns the reference of the newly created session object to the caller. Input parameter PlaceId identifies the place, where the desired session peer is expected, and PeerQualifier qualifies the peer at the specified place. A PeerQualifier is either an agent identifier or a badge predicate. Note that at most one agent qualifies in the case of a single agent identifier, while several agents may qualify if a single badge predicate is specified. In that case a randomly picked agent is chosen. To avoid infinite blocking, the parameter TimeOut can be used to specify a timeout interval. The operation blocks until the session is established or a timeout occurs, whatever happens first. The parameters PeerQualifier and PlaceId of the operation PassiveSetUp() are optional. If neither of both parameters is specified, the caller expresses its willingness to establish a session with any agent residing at any place. By specifying PlaceId and / or Peer- Qualifier the calling agent may limit the group of potential peers. For example, a group may Figure 5.3 Session Methods void PassiveSetUp(PeerQualifier , PlaceId) SessionObject ActiveSetUp(PeerQualifier, PlaceId, Timeout) boolean SetUp(SessionObject) void SessionObject.Terminate() 5 WP 2.4: Developed Concepts and Implementation 70 be limited to all agents wearing the badge ?Stuttgart University? and / or that are located at the caller's place. As pointed out above, before a session is established both participants must agree explicitly. An agreement for session setup is achieved if both agents issue matching setup requests. Two setup requests, say RA and RB of agents A respectively B, match if - Place_Id in RA and RB identifies the current place of B and A, respectively, and - PeerQualifier in RA and RB qualifies B and A, respectively. If a setup request issued by an agent matches more than one setup request, one request is chosen randomly and a session is established with the corresponding agent. A combination of PassiveSetUp() and ActiveSetup() allows a client / server style of communication (see Figure 5.4). The agent playing the server role once issues PassiveSetUp() when it is ready to receive requests. When an agent playing the client role invokes ActiveSetup(), this causes the SetUp() method of the server side to be invoked implicitly. SetUp() implicitly establishes a session with the caller and assigns a thread for handling this session. Therefore, once the server agent has called PassiveSetup(), any number of sessions can be established in parallel, where session establishment is purely client driven. If both agents issue (matching) ActiveSetUp() requests this corresponds to a rendezvous. Both requesters are blocked until the session is established or timeout occurs (see Figure 5.4b). This type of session establishment is suited for agents that want to establish peer-to-peer communication relationships with other agents. Communication between agents is peer-to-peer if both have their own ?agenda? in terms of communication, i.e., both decide - depending on their individual goals - when they want to interact with whom in which way. As pointed out above, Remote Method Invocation (RMI), the object-oriented equivalent to RPC, seems to be the most appropriate communication paradigm for a client / server style of interaction, while message passing is required to support peer-to-peer communication patterns. The available communication mechanisms are realized by so-called Com objects. Currently, there are two types of Com objects, RMI objects and messaging objects. Com objects are associated with sessions. Each session may have an RMI object, a messaging object, or both. Each session object offers a method for creating Com objects associated with this session. With the RMI object the methods exported by the session peer can be invoked. It can be compared with a proxy object known from distributed object-oriented systems. With the messaging object, Figure 5.4 Different Types of Interactions PassiveSetUp ActiveSetUp SetUp() ActiveSetUp ActiveSetUp a) Client/Server-Interaction b) Peer to Peer Interaction 5 WP 2.4: Developed Concepts and Implementation 71 messages can be conveyed asynchronously between the participants of a session. Messages are sent by calling the send() method. For receiving messages the receive() and subscribe()-methods are provided. The receive()-method blocks until a message is received or timeout occurs, whatever happens first. If the subscribe()-method is invoked instead, the incoming messages are handed over by calling the message()-method of the recipient and passing the message as method parameter. The advantage of having the concept of Com objects is twofold. Firstly, only those communication mechanisms have to be initiated that are actually needed during a session. Secondly, other mechanisms, such as streams, can be added to the system transparently. The latter advantage enhances the extensibility of the system. At any time, a session can be terminated unilaterally by each of the both session participants, either explicitly or implicitly. A session is terminated explicitly by calling SessionObject.Terminate() (see Figure 5.3), and implicitly when a session participant moves to another place. When a session is terminated, this is indicated by calling the SessionTerminated()-method exported by agents. Moreover, all resources associated with the terminated session are released. We want to mention, that for easier programming, we still allow the programmer to use ?traditional? RMIs or messages without the need of a session overhead, giving them the opportunity to issue single communication acts. 5.4 Agent Infrastructure 5.4.1 Resource Manager Resource management is necessary for two purposes. One is accounting, the other is resource control. Accounting is a prerequisite for commercial applications with agents, and resource control is necessary to prevent e.g. denial-of-service attacks. In Mole the following resources are managed: - CPU time - local network communication - communication with remote networks - number of created children - total time at the local place The CPU time used is calculated by counting the time slices given to threads of an agent. Mole has a central scheduler, the MCP (Master Control Process), that schedules all threads in the Mole system. We decided to implement our own scheduler, when problems with Java 1.0 led to the conclusion that the Java scheduler of the Solaris implementation had problems with concurrent execution of more than four threads of the same priority. This results from the Java specification being imprecise in this respect. The network communication is an important cost factor. Thus it is important for both accounting and resource control. Because all agent communication has to use the mechanisms provided by the agent system, control is done here. When an agent arrives at one place the arrival time is logged. This way the total time at the local place can be computed without problems. One other resource is not managed, the memory consumption of 5 WP 2.4: Developed Concepts and Implementation 72 an agent. While in principle extremely important, it can not be implemented with acceptable costs without modifying the Java virtual machine. 5.4.2 Directory Service A directory service is an electronic database that contains information on entities. An example for a full-fledged directory service is X.500 (see e.g. CHADWICK (1994)). In our Mole system we simply provide a local directory service. It supplies information on agents providing a service denoted by a string. This local directory service exists on every place. An agent can register itself locally if it provides a service by submitting a string identifying the service to the directory service. Another agent wanting to use this service first asks the directory service. The directory service returns a list containing all agents providing the service. This list is either empty, or contains one or more agent identifiers. The agent now chooses one of the agent identifiers and contacts the agent. 5.4.3 Security Model In Mole, the Sandbox security model is enforced by implementing a simple concept. In Section 2 we presented our agent model, and with it mobile and service agents. Mobile agents are the normal user agents, programmed and employed by the user. They have absolutely no access to the underlying system. Service agents are agents with access to system resources, providing controlled, secure abstractions of these resources inside the agent system. Furthermore, service agents may offer access to legacy software, using the native code interface offered by Java. This does not cause any security problems, because the service agents are immobile and may be started only by the administrator of the place. User agents may only communicate with other agents and have no direct access to system resources. Additionally it can be decided on a per-place basis which types of agents to allow on a place. Only agents that are derived from a specific type can migrate to a place. This mechanism can be used to implement access restrictions. Take e.g. a place that allows only agents of a very specific type. These can only be created at one other, open place. Then every agent wanting to access a service on the first, closed place has to migrate to the open place and request a service. This service then creates one of the specific agents that migrate to the closed place. 5.5 Graphical Agent Monitor In Mole the graphical agent monitor Moleview is used for the examining places, agents, and messages sent between them (see Figure 5.5). The snapshot has been made while running part of the test suite at installation time.Every Moleview window contains the information for one of the places that are inspected. In this example the place location1.mole.de has been examined. We can see 7 agents on the place, their ids and their descriptions. In the lower left corner of the window the services provided by agents are printed (in the example none). In the lower right corner of the window the communication in which the agents on this place participate is logged. This might be local or remote communication. RPCs as well as messages are logged with their message id, sender and receiver. Additional information about the agents, the provided services, 5 WP 2.4: Developed Concepts and Implementation 73 the messages, and the RPCs can be acquired by using an inspector window. In this specific example a non-existing recipient is used for some of the tests for the communication subsystem. 5.6 Implementational Issues 5.6.1 Agent Identifiers and Name Resolution In Mole, an agent is seen as a unique entity. This view is supported by using a globally unique name for every agent. This name is immutable, i.e. it does not change when the agent migrates. The uniqueness can only be guaranteed if the system creates the names used. If the system creates the agent identifiers, then these identifiers should be devisable without global knowledge. Additionally it is of advantage to be able to derive the site where the agent has been created from the agent identifier. Why do we place such constraints on the agent identifier? First, to be able to identify an agent (this is needed for communication, termination etc.), its name must be locally unique. Second, to be able to do the same after an agent has migrated, the name has to be immutable. From this follows that the agent identifier has to be globally unique. This can only be guaranteed if the system itself provides a service to create agent identifiers conforming to these requirements. If global knowledge is needed to create this agent name, then either an expensive mechanism has to be implemented to obtain the global knowledge, or a single point of failure is introduced if e.g. an identifier server is included into the system (see e.g. the Amoeba Figure 5.5 Moleview - The graphical Agent Monitor 5 WP 2.4: Developed Concepts and Implementation 74 sequencer poposed by TANENBAUM (1995)). The ability to derive the site where the agent has been created is of advantage e.g for localization algorithms utilizing the home location registry approach. This approach is used in GSM (Global System for Mobile Communications, see e.g. (MOULY and PAUTET, 1992)), where the identifier of the user (his telephone number) leads to a designated place (the home location registry) that contains the information how this user can be reached. The agent identifier in Mole is created from information that can be obtained locally. Table 5.1 contains the components of the agent identifier. The internet protocol Version 6 address of the underlying system together with the port number of the engine allows to identify the engine on which the agent has been created. The uniqueness of the name is guaranteed by using a combination of a normal counter that is set to 0 at the start of the engine, and a so-called crash counter, that is incremented every time the engine is started. If more than 232 agents are started the crash counter is incremented also. Two more bytes are reserved for future use, giving a total of 24 bytes. 5.6.2 Thread Management Unit One of the disadvantages of the Java language is that in some respects it is underspecified. One of the underspecified areas is the thread management. For instance if two threads of the same priority run in a virtual machine, nothing is said about the kind of scheduling used. It might even happen that one of the threads is executed solely and only when it has terminated is the other thread performed. This led to many problems in the platform-independent development of Mole, and we decided to implement our own thread scheduler. One of the few guarantees that Java gives is that a thread with higher priority is executed in preference to threads with lower priority. Our scheduler was implemented using this property of Java as follows: One designated thread called MCP is started when the Mole engine is booted, and controls all Mole threads (including all threads running in agents) at runtime. This thread gets the highest possible priority (apart from system threads). All other threads are on the lowest possible priority. The MCP- thread has a list of these threads and allows them their time slices by of changing their priorities. In regular intervals (time slices) the MCP-thread wakes up, lowers the priority of the running Mole thread, takes the next thread from its list and changes its priority to its own. Now it sleeps Table 5.1 Components of the Agent Id # Bytes Meaning 4 Dynamic Counter, incremented for every new agent identifier 4 Crash counter, incremented every time the system is started. Also incremented if dynamic counter overflows. 12 IP Version 6 address of the system on which the engine runs 2 The port number of the engine 2 Reserved for future use (set to 0) 5 WP 2.4: Developed Concepts and Implementation 75 for the length of the time slice, after which again the next thread is scheduled. This way the Mole system guarantees that threads are executed ?concurrently? in time slices. A feature of the MCP is that the computing resources can be managed as well. 5.7 Using Java-Enabled Web Browsers to Run Mobile Agents One of the main problems of the mobile agent technology is that before users get interested in using this technology, it has to offer advantages over the existing technology. One of the scenarios where mobile agents would offer tremendous advantages is their usage as an integrated service platform that can unify the existing information and service provider on the internet on the one hand and the need for automatic access to these things on the other side. In order to get such a platform, the technology, even a single product had to be wide-spreaded since no one would use such a platform without a large set of offered services, and no one would offer a service without a large set of users the service can reach. This need for a critical mass is one of the problems to solve in order to have this technology accepted. Another problem of current agent systems is that, as they rarely have been developed as a commercial product, the usage of such an agent system is rather complicated, at least for the end user. In order to allow more users to run mobile agent based application, the threshold to install an agent system has to be lowered. For this purpose, an agent system based on Mole was developed that runs as a Java applet inside any Java-enabled browser. Although a Mole engine, the runtime environment of the Mole agent system, is a regular Java application, it cannot simply be executed by a browser. The reason for that lies in the area of security, namely the protection of computer from malicious applets. Since applets are currently not allowed in most browsers to do anything that can be used to attack the computer the applet is executed, typical restrictions for applets are: - applets cannot access, start or manipulate other programs - applets cannot open sockets except to the server they were loaded from - applets cannot access system resources like file systems Another problem is the different ?operating mode? of web browsers and servers such as regular Mole engines: while the first were started and stopped by users at their will, servers normally are started once and run until either the computer crashes or a new version of a server is installed, both often in a controlled, non ad-hoc manner. A model that uses browsers to run an engine has to cope with the aspect that the user stops the browser at any time, regardless of the state of the agents. In this sense, browsers have similar characteristics like mobile devices. The architecture of the modified engine has to reflect these problems, but it should also try to make these aspect transparent to the agents, so that mobile agents can run in both a browser and a server environment. To handle the restrictions of applets, a new component, called Relay was introduced (see Figure 5.6). The purpose of the Relay component is to act as a representative of the browser engine. Therefore, Relays are located on a regular Server Engine. At startup time, a Browser Engine opens a connection to a certain Relay. This connection, a socket, remains open until the Server Engine is stopped. Every time the Browser Engine wants to communicate with another engine or vice versea, the relay acts as a proxy object of the Browser Engine and relays the communicated data to or from the Browser Engine. 5 WP 2.4: Developed Concepts and Implementation 76 To do this in a transparent manner, Browser Engines use special location names of the form browserLocation_:, e.g. browserLocation12_3:stuttgart.mole.informatik.uni-stuttgart.de. These names are created dynamically by the Relay at Browser Engine startup time.Since the applet can open a connection to the computer from where the applet was loaded, a weeb server is also needed on the computer where the Realy resides. Fortunately, Mole engines already can be used as a web server for Mole classes since HTTP is used as the code transfer protocol between engines. The Browser Engine is configured by a list of parameters that are provided in the tag of the enclosing HTML page. The (not optimized) Browser Engine applet is as big as any regular Mole engine and needs some 40 seconds to load in a browser. Further optimization can reduce both the length and the loading time of the applet. To demonstrate the Mole Browser Engine technology, an existing game (Mister X) was configured using mainly the engine on the one browser an users uses to start the game. 5.8 Installation Mole is a research development designed to experiment with new concepts arising in the context of mobile agents. It is available for free on the World Wide Web. The Java source code, documentation and sample agents can be downloaded from: http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole.html. This Web site currently allows downloading of Mole 2.1.2. Mole 3.0 is currently in the beta test phase and will be updated for the Middleware'98 conference in September 1998. For a prerelease of the beta version of Mole 3.0 please please contact Mr. Baumann, at University of Stuttgart, email: Joachim.Baumann@informatik.uni-stuttgart.de. 5.8.1 System Requirements Mole is undemanding regarding hardware and software requirements, especially compared to Telescript (WHITE, 1997b). This is due to the design focusing on the use of an unmodified Java Virtual Machine and existing hardware. While this didn't allow e.g. strong migration or the control of the memory consumption of agents, it allows Mole to run on every hardware platform that runs a Java Virtual Machine version 1.1 or higher. We have tested the system on various Browser Engine Agent A Agent B Server Engine Agent C Server Engine Agent D Relay link Engine 1 Engine 2 Engine 3 Figure 5.6 Architecture of the modified engine 5 WP 2.4: Developed Concepts and Implementation 77 computer types and operating systems. Normal PC's with a Intel Pentium (we have not tested systems with Intel 486) or compatible CPU with main memory ranging from 16 to 128 MBytes under Windows 95, Windows NT V3.51 or V4.0, OS/2 or Linux runs the system as well as Sun Sparc with Solaris, IBM RS/6000 with AIX, or HP workstations with HPUX. 5.8.2 Configuration Files Two main configuration files exist for Mole, the global mole resource file globalmolerc, and the user-specific Mole resource file ?~/.molerc? (under Windows 95/NT ?C:\.molerc?). These contain global as well as user specific definitions for variables that can be read by system agents at runtime. This way a simple method of configuring arbitrarily complex system agents is provided. Furthermore, many of the settings of the Mole system itself are defined in the Mole resource files. One example is the debug level, another one is the location of the global Mole installation. The following three variables are predefined: - $(HOME) is the absolute path to the home directory of the user (under Windows 95/NT this is ?C:\?) - $(USER) is the login name of the user - $(CWD) is the working directory A very handy feature for all of the Mole configuration files is that resources that have been defined already can be referenced in a simple way. Let us assume we have defined a resource ?Hello = howdy? and another resource ?World = world?, then the following definition of the variable HiWorld would yield a contents of ?howdy world?: ?HiWorld = $(Hello) $(World)?. This is very advantageous for the definition of complex values. Moreover, a user can define user-specific values depending on global values set by the system administrator in the global resource file. 5.8.3 Starting a Sample Agent In Appendix A a very simple agent that wanders between two places is listed (the static part is left out). The names of the places can be found in the two variables home and target. For the dynamic instantiation of agents a constructor without parameters is needed. As has been discussed in Section 3.1, the method init() is called first in the lifecycle of the agent. In this case we simply initialize the variables containing the two places and return the boolean value true, indicating that the initialization was successful. The system now calls the prepare()-method that sets the boolean variable atHome depending on the place of the agent (if at home, it is true). Now the agent is ready to start at the new place and its method start() is called. Here the agent simply decides where to migrate next. If the migration was not successful it prints an error message and dies. This agent can be brought into the system by issuing the following command at the command line of the Mole system (we assume two places, ?place1? and ?place2?): new mole.apps.SimpleWanderer(Home place1, Target place2) at place1; The agent is started as described above and begins to wander forth and back between the two places. 5 WP 2.4: Developed Concepts and Implementation 78 5.9 Conclusions and Future Work We have presented a platform for Java-based mobile agents, a research development that implements many new concepts of mobile agent systems. Besides a basic infrastructure like weak migration of agents and local/global communication between agents and agent groups, the emphasis of the Mole system lies on providing a comfortable infrastructure for agents. In particular, Mole implements a resource manager for accounting and resource control, a directory service, a thread management unit and a global naming scheme for agents. The usage of Mole is supported by simple means for installation and configuration and by a graphical agent monitor. Earlier Versions of Mole have been used e.g as the prototypical infrastructure for an electronic document system by KONSTANTAS, MORIN and VITEK (1996), as a simulation environment for distributed network management by Siemens in the project Swarms and as an execution environment for Tandem server classes by STRAßER et al. (1997). In order to make mobile agents usable in a commercial setting, our current research investigates further extensions to the system infrastructure. In particular, methods for orphan detection and agent termination have been investigated by BAUMANN (1997), coordination of agent groups has been investigated by BAU- MANN and RADOUNIKLIS (1997), an approach for securing agents against malicious hosts has been investigated by HOHL (1997) and a protocol for preserving the exactly-once-property of mobile agents by ROTHERMEL and STRAßER (1997). A Java-Code of the Wanderer Agent package mole.apps; import mole.*; public class SimpleWanderer extends UserAgent implements MobileAgent { private LocationName home = null; private LocationName target = null; private boolean atHome = null; public SimpleWanderer() { } public boolean init(Hashtable parameters) { String des = (String)parameters.get(?Description?); if(des != null) super.init(des);// Set description of agent if given // in the real world all parameters would have to checked as above home = new LocationName((String)parameters.get(?Home?)); target = new LocationName((String)parameters.get(?Target?)); return true; } 5 WP 2.4: Developed Concepts and Implementation 79 public void prepare() { atHome = (getCurrentLocation().locationName()).equals(home); } public void start() { if (atHome == true) { migrateTo(target); Engine.out(?Wasn't able to migrate to ? + target); die(); } else { migrateTo(home); Engine.out(?Wasn't able to migrate to ? + home); die(); } } } References BAUMANN, J. F. HOHL, K. ROTHERMEL, M. SCHWEHM, M.STRAßER 1998: Mole 3.0: A Middleware for Java-Based Mobile Software Agents, in Proc Int. Conf. Middleware `98 (to be published) BAUMANN, J. 1997: A Protocol for Orphan Detection and Termination in Mobile Agent Systems. Bericht Nr. 1997/09 der Fakultät Informatik, University of Stuttgart, Germany. BAUMANN, J. & RADOUNIKLIS, N. 1997: Agent Groups in Mobile Agent Systems. In Proceedings of the DAIS'97, Chapman & Hall, London, UK. BRADSHAW, J. M. (Ed.) 1997: Software Agents. AAAI Press/MIT Press, Menlo Park, CA. CHADWICK, D. (1994): Understanding the X.500 Directory. Chapman & Hall, London, UK. COCKAYNE, W. R. & ZYDA, M. 1997: Mobile Agents, Manning Publ. Co., Greenwich. FALCONE, J. R. 1987: A Programmable Interface Language for Heterogeneous Distributed Systems. ACM Trans. Computer Systems, 5(4):330-351. GHEZZI, C. and VIGNA, G., Mobile Code Paradigms and Technologies: A Case Study. In: (RO- THERMEL and POPESCU-ZELETIN, 1997), pp. 39-49. GRAY, R. S.1995: Agent Tcl: Alpha Release 1.1. Technical Report, Department of Computer Science, Dartmouth College. GRAY, R. S. 1996: Agent Tcl: A flexible and secure mobile agent system. Proc. 4th Annual Tcl/ Tk Workshop, pp. 9-23. HOHL, F. 1997: An approach to solve the problem of malicious hosts. Bericht Nr. 1997/03 der Fakultät Informatik, University of Stuttgart, Germany. HOHL, F.; KLAR, P. & BAUMANN, J. 1997: Efficient Code Migration for Modular Mobile Agents. In: Proc. 3rd ECOOP Workshop on Mobile Object Systems, dpunkt-Verlag, Heidelberg, Germany. (to appear) 5 WP 2.4: Developed Concepts and Implementation 80 IBM Aglets Workbench Team 1997:Aglets Workbench. In:(Cockayne & Zyda, 1997), pp.165- 183. JOHANSEN, D.; VANRENESSE, R. & SCHNEIDER, F. B. 1994: Operating System support for mobile agents. In: Proc. 5th Workshop on Hot Topics in Operating Systems, IEEE Comp. Soc. Press, pp. 42-45. JOHANSEN, D.; VAN RENESSE, R. & SCHNEIDER, F. B.1995: An Introduction to the TACOMA Distributed System. Technical Report 95-23, Department of Computer Science, University of Tromso, Finland. KONSTANTAS, D.;MORIN , J. H. & VITEK, J. 1996: MEDIA: A Platform for the Commercialization of Electronic Documents. In: Tschiritzis, D. (ed.) 1996: Object Applications. University of Geneva, pp. 7-18. MILOJICIC, D.; BREUGST, M.; BUSSE, I.; CAMPBELL, J.; COVACI, S.; FRIEDMAN, B.; KOSAKA, K.; LANGE, D.; ONO, K.; OSHIMA, M.; THAM, C.; VIRDHAGRISWARAN, S. & WHITE, J. (1998): MASIF: The OMG Mobile Agent System Interoperability Facility, (submitted to MA`98). MOULY, M. & PAUTET, M. 1992: The GSM System for mobile Communication. Europe Media Publications S. A., ETSI, Palaiseau, France. MÜHLHÄUSER, M. (ed.) 1996: Special Issues in Object-Oriented Programming, dpunkt-Verlag, Heidelberg, Germany. PEINE, H. & STOLPMANN, T. 1997: The Architecture of the Ara Platform for Mobile Agents. In: (ROTHERMEL & POPESCU-ZELETIN, 1997) pp. 50-61. PERRET, S. & DUDA, A. 1996: Implementation of MAP: A system for mobile assistant programming. In: Proc. IEEE Int. Conf. of Parallel and Distributed Systems. ROTHERMEL, K. & POPESCU-ZELETIN, R. (eds.) 1997: First Int. Workshop on Mobile Agents, MA'97, Lecture Notes in Computer Science 1219, Springer-Verlag, Berlin. ROTHERMEL, K. & STRAßER, M. (1997): A Protocol for Preserving the Exactly-Once Property of Mobile Agents. Bericht Nr. 1997/18, Fakultät Informatik, University of Stuttgart, Germany. RULIFSON, J. 1969: DEL. In: Internet Engineering Task Force, Network Working Group, Request for Comments 5, ftp://ds.internic.net/rdc/rfc5.txt STRAßER, M.; BAUMANN, J. & HOHL, F. 1996: Mole: A Java Based Mobile Agent System. In: MÜHLHÄUSER, 1996), pp. 327 - 334. STRAßER, M.; BAUMANN, J.; HOHL, F.; RADOUNIKLIS, N.; ROTHERMEL, K. & SCHWEHM, M. 1997: ATOMAS: A Transaction-oriented Open Multi Agent-System. Annual Report. Bericht Nr. 1997/14 der Fakultät Informatik, University of Stuttgart., Germany TANENBAUM, A. 1995: Distributed Operating Systems. Prentice Hall, Englewood Cliffs, NJ, USA. TSCHUDIN, C. F. 1993: On the Structuring of Computer Communications. Ph.D. Thesis, University of Geneva, Suisse. WHITE, J. E. 1997a: Mobile Agents. In: (BRADSHAW, 1997) pp. 437-472. WHITE, J. E. 1997b: Telescript. In: (COCKAYNE & ZYDA, 1997) pp. 37-57. ZANDER, J. 1981: SOFTNET - Packet Radio in Sweden. In: ARRL Amateur Radio Computer Networking Conferences 1-4, The American Radio Relay League, Newington, CT, reprinted 1985, pp. 1.7 - 1.10. 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 81 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents Orphan detection in distributed systems is a well researched field for which many solutions exist. These solutions exploit well defined parent-child relationships given in distributed systems. But they are not applicable in mobile agent systems, since no similar natural relationship between agents exist. Thus new protocols have to be developed. In this paper one such protocol for controlling mobile mobile agents and for orphan detection is presented. The `shadow' approach presented in this paper uses the idea of a placeholder (shadow) which is assigned by the agent system to each new agent. This defines an artificial relationship between agents and shadow. The shadow records the location of all dependent agents. Removing the root shadow implies that all dependent agents are declared orphan and eventually be terminated. We introduce agent proxies that create a path from shadow to every agent. In an extension of the basic protocol we additionally allow the shadow to be mobile. The shadow approach can be used for termination of groups of agents even if the exact location of each single agent is not known. 6.1 Introduction A mobile agent is regarded as a piece of software roaming the network on behalf of a user, e.g. searching for information in different databases, buying a flight ticket and renting a car, or trying to find the cheapest flower shop. Mobile agents seem to be the solution to many of the problems in the area of distributed systems. But while the idea of mobile agents is quite appealing, and while many researchers are working in this area, some very important problems have not been solved. Most of the research concentrates on providing the basic system support for migration, communication, the security of the platform underlying the agent system and for the asynchronous operation of agents. Some solutions for these problems already exist and have been implemented in different agent systems (e.g. [12], [4], [8], [14], [7], [6]). But until now no protocols exist for orphan detection in mobile agent systems. Orphan detection in an agent system is very important both from the user's and from the system side, because a running agent uses resources which are valuable to both user and system. The user has to pay for resources (at least in principle), and the system has only a limited amount of them. So if the user does not need the results of a distributed computation in progress anymore, he wants to be able to terminate the computation to minimize the resulting cost. With an orphan detection mechanism the user simply declares the agents to be terminated as orphans. Orphan detection guarantees that the now useless agents can be determined by the system and ended, thus freeing the resources they have bound. In this paper we will present a new protocol, the shadow protocol, that allows both control of mobile agents and orphan detection. The paper is organized as follows: Section 6.2 presents our agent model. In Section 6.3 the shadow protocol is presented with different extensions and optimizations. Section 6.4 presents related work, and in Section 6.5 the conclusion and outlook is given. 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 82 6.2 The Agent Model In this section we will give you a short overview of our agent model, that has been described in more detail in [12], [1] and [4]. Our model of an agent-based system - as many other models - is mainly based on the concepts of agents and places. Places provide the environment for safely executing local as well as visiting agents. An agent system consists of a number of (abstract) places, being the home of various services. Agents are active entities, which may move from place to place to meet other agents and access the places' services. Each agent is identified by a globally unique agent identifier. An agent's identifier is generated by the system at agent creation time. The creating place can be derived from this name. It is independent of the agent's current place, i.e. it does not change when the agent moves to a new place. In other words, the applied identifier scheme provides location transparency. A place is entirely located on a single node of the underlying network, but multiple places may be situated on a given node. For example, a node may provide a number of places, each one assigned to a certain agent community, allowing access to a certain set of services or implementing a certain prizing policy. Places are divided into two types, depending on the connectivity of the underlying system. If a system is connected to the network all the time (barring network failures and system crashes), a place on this system is called connected. If a system is only part-time connected to the network, e.g. a user's PDA (Personal Digital Assistant), the place is called associated. 6.3 The Shadow Protocol In this section we discuss the basic Shadow Protocol with its agent proxies, the extension that allows the shadows to be mobile, and discuss possible optimizations. 6.3.1 The Idea In the shadow concept each application creates one or more shadows, a data structure on a connected place. The place where the shadow is created does not necessarily have to run on the same host on which the creating application runs. Each agent created by the application depends on such a shadow (Figure 6.2). The agent is dependent of the shadow instead of the application. As long as the shadow exists in the system, no contact of agents to the application itself or to the computer system on which the application runs is necessary. In regular intervals (called time to live or ttl) the system checks for each agent if the associated shadow still exists. If the shadow does no longer exist, the agent is declared to be an orphan and is removed. Hi migration place B place C application service agent mobile agent place A Figure 6.1 The Agent Model comm. Application creates Place Agent Shadow Figure 6.2 The Creation of a Shadow 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 83 If an agent creates a new agent, the system assigns the to this new agent the shadow of the creating agent, and the same remaining ttl until the next check (Figure 6.3). This assignment cannot be changed by the agents. Limiting the time span to the remaining ttl of the creating agent (and not to the original time interval) is necessary to prevent malicious agents from living infinitely. Otherwise the mechanism could be circumvented simply by creating a new agent with again the whole ttl just before the life span of the old agent ends. If a place on which a shadow resides cannot be reached, the system tries to contact the place several times. If still the place cannot be reached, the shadow is presumed no longer existent and its associated agents are killed. The disadvantage of this approach is that regardless of what an agent does, it has to connect to its shadow's place in regular intervals. The advantage on the other hand is that we have a worst-case time bound for the termination of agents through removing the shadows. This upper bound is exactly the sum of ttl of the agents and the timeout for contacting. Until now the protocol only allows passive termination. By removing a shadow all dependent agents are declared orphans, and after the ttl it is guaranteed that all agents have been removed by the orphan detection. By adding the path concept to this protocol, we also allow active termination, i.e. termination of an agent while its ttl is greater 0. Agent proxies are structures at each place that keep track of the movement of all agents dependent of a specific shadow, thus creating a path leading to the agent. By storing the place at which the agent got checked the last time we can find the beginning of a path for every agent. Even if the path gets lost, the agent will contact the shadow after the ttl. If an agent arrives at a place where not yet an agent proxy for this shadow exists, one is created (Figure 6.4). As soon as the agent migrates to another place, the destination (being part of the path leading to the agent) is stored in the proxy together with the ttl . When the end of the ttl is reached, the agent's shadow gets a request for extending the agent's life, and thus the new place of the agent is made known to the shadow (Figure 6.5:). The path entries stored in the different agent proxies along the agent's way is now superfluous and can be removed using the knowledge about the ttl stored in the proxy. An entry can also be removed if the agent migrates back to this place (this simply optimizes the now circular path by removing the loop). An agent proxy contains, for a specific place, all path segments of agents belonging to the same shadow. It exists exactly as long as there is a path entry in it. As soon as the agent proxy contains no more entries, it can be removed as well. This is especially helpful if the agents are actively terminated, i.e. the system actively sends messages to terminate the agents as fast as possible. In that case, all entries are removed from the agent proxy, allowing the system to delete the proxy as well. creates Place Agent Shadow Place depend Agent Figure 6.3 Creating a New Agent 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 84 6.3.2 The Protocol We will discuss the different parts of the protocol separately. The protocol is presented in an objectoriented pseudo-code notation. The place on which the agent resides, decrements in regular intervals the ttl of the agent. As soon as the ttl of the agent is 0, a message is sent back to the home place of the shadow, containing the id of agent and shadow. At the same time a timer is started with a timeout, and the agent enters the check phase (Figure 6.6). To allow greater flexibility each shadow (and thus the group of associated agents) can have a timeout of its own. This allows for a loophole by setting a very long timeout. But this can be corrected by introducing a per-place timeout. The timeout finally chosen is the minimum of agent timeout and place timeout. Figure 6.4 Proxie Paths Figure 6.5: Regular Update of Proxies Shadow Place Agent Shadow Place Agent Agent Proxy Agent Proxy Shadow Place Agent Agent Proxy Shadow Place Agent Regular Intervals: for each agent agent.timeToLive - -; if (agent.timeToLive == 0) sendCheck(agent.shadowHome, current- Place, agent.shadowId, agent.id); startTimer(min(place.TimeOut,agent.time- Out), agent.proxy, agent); onArrival(agent) agentproxy = proxyList.find(agent.shadowId); if(agentproxy == null) agentproxy = new Proxy(agent.id, agent.timeTo- Live, agent.shadowHome, current- Place); proxyList.add(agentproxy); else agentproxy.add(agent.agentId, agent.timeTo- Live); agent.proxy = agentproxy; agentList.add(agent); agent.start(); onLeaving(agent, target) if (agent.timeToLive > 0) agentList.remove(agent); agent.proxy.setTarget(agent.id, target)); startTimer( agent.timeToLive + agent.timeOut, agent.proxy, agent.id); SendAgent(target, agent); Figure 6.6 System Methods 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 85 The check message is received by the home place of the shadow. First a timer is stopped that has been started the last time the ttl has been sent back to the agent. This allows to detect agents that have been terminated (see below). The ttl is requested from the responsible shadow, and if greater 0 is sent back by the system to the requesting agent. As soon as the message is received, the timer for the timeout is stopped, and the agent's ttl is set (see Figure 6.7). This ends the agent's check phase and allows it to migrate again. When an agent arrives at a place, the list of agent proxies is searched for a proxy of that agent. If none exists, a new one is created, and the agent gets a reference on it. As soon as an agent wants to leave, its ttl is checked. This is done to prevent an agent who is in the check phase to migrate. If it is not in the check phase, the information in the agent proxy is updated to point to the target place. At the same time a timer is started that removes the path after the sum of remaining ttl and timeout (see Figure 6.6). The shadow can decide on a case-by-case basis if an agent's life time is to be extended, and by which interval. In Figure 6.8 we present an example policy, that for all of the agents returns the same ttl. This method checks first if an agent entry already exists for this agent (in case a newly created agent contacts the shadow), updates the information about the location of the agent, and returns the ttl. The shadow is also called if the system has detected (via the timeout), that an agent has been terminated. The simplest policy is to remove the related entry from the list. We now discuss the reaction to the different timeouts (see Figure 6.9). One possible reaction to the timeout of the check message has been sketched out above. Here we present a simple alternative; the agent is removed at once. The next timeout affects the paths. As soon as an agent migrates, the path segment pointing to its new location is created, and a timer started. As soon as this timer ends, we know that the path information in the shadow itself has been updated, and this part of the path can safely be removed. The last method is called if an agent has not tried to contact the shadow for the sum of ttl and timeout. In this case the agent has terminated. The shadow method (see Figure 6.9) is called to react to it. receiveCheck(from, shadowId, agentId) stopTimer(agentId); shadow = shadowList.find(shadowId); timeToLive = shadow.timeToLive(from, agentId); if (timeToLive > 0) startTimer(timeToLive + shadow.getTimeOut(agentId), shadow, agentId); sendAllowance(from, agentId, timeToLive); receiveAllowance(agentId, timeToLive) stopTimer(agentId); agent = agentList.findAgent(agentId); agent.timeToLive = timeToLive; proxyList.setTime(agentId, timeToLive); Figure 6.7 The Check Phase timeToLive(from, agentId, shadowId) [here an example policy is presented] shadowproxy = listOfProxies.find(shadowId); agententry = shadowproxy.get(agentId) if( agententry != null) agententry.target = from; else agententry = new AgentEntry(from, agentId, timeToLive); shadowproxy.add (agententry); return agententry.timeToLive; remove(agentId) [implement policy] agentproxy = listOfProxies.find(agentId); agentproxy.remove(agentId); Figure 6.8 Methods in the Shadow Object onTimer(proxy, agent) // check timeout [here an example policy is presented] agentList.remove(agent); agentproxy.remove(agentId); if(agentproxy.entries() == 0) proxyList.remove(agentproxy); onTimer(agentproxy, agentId)// path redundant [implement policy] agentproxy.remove(agentId); if(agentproxy.entries() == 0) proxyList.remove(agentproxy); onTimer(shadow, agentId) // ag. terminated shadow.remove(agentId); Figure 6.9 System: Reaction to Timeouts 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 86 Finding Agents If we want to actively terminate a specific agent, we have to find it first. This can be done with the help of the information stored in the agent proxies. If the agent is in the local list of active agents, it is already found. If not, the related agent proxy is searched. If it is not found, an error is returned. If it is discovered, a find request is sent to the target found in the proxy. At the target place the list of active agents is again examined. If the agent is found, a success message is sent back. If not, the related agent proxy is searched again. If no proxy exists, an error is sent back. Otherwise, the message is sent on. This is repeated until the agent is found or the path ends (see Figure 6.10). 6.3.3 Mobile Shadows In cases where many of the agents depending on a shadow move somewhere far away (i.e. communication costs are high), every one of the agents has to contact the shadow independently, resulting in unnecessarily high communication costs. If the migration behaviour is known in advance, the shadow can be placed in a way that reduces the communication cost. But in many cases the behaviour is not known in advance, or the group moves as a whole from area to area (e.g. from one organization to another). In these cases it would be much better if the shadow moved with the agents. Possible policies where to place the shadow could be: - at a place where the communication cost to all dependent agents would be lowest. - where one agent important for the computation is situated. If the place becomes unavailable (e.g. crashes), both shadow and agent would not be reachable, and the other dependent agents would be terminated. While in the first case the shadow would have to be persistent, in the second case it would have to be transient to implement the policy. To move a shadow two problems have to be dealt with. The first is that the agents depending on the shadow have somehow to be notified about the new location of the shadow. The second is that the application still has to be able to reach the shadow, e.g. in case it wants to terminate the agents. Both problems can be solved similar to the approach used with the agent proxies. When a shadow moves, a shadow proxy stays behind. Thus over time a shadow path is built. By contacting the copy at the home place in regular intervals this path can be cut short. As alternative to intervals at which to cut the path short, a maximum path length would be suitable. But using a maximum path length adds communication along the path, because as soon as the maximum path length has been reached the shadow proxies along the path have to be notified that they are no longer needed. A combination of these policies seems the most flexible. Now, when an agent requests a new ttl, the shadow might already have moved somewhere else. In this case, the request is sent to the new place of the shadow. If the shadow already has moved find(agentId) if (agentList.find(agentId) != null) return(this); agentproxy = shadowList.find(agentId); if(agentproxy != null) sendFind(agentproxy.target(agentId), this, agentId); else return(notFoundError); receiveFind(searcher, agentId) if (agentList.find(agentId) != null) sendFound(searcher, this, agentId); if((agtproxy = proxyList.find(agentId)) != null) sendFind(agtproxy.target(agentId), searcher, agentId); else sendError(searcher, notFoundError, agentId); receiveFound(from, agentId) return(from); receiveError(error, agentId) if (error == notFoundError) return(error); Figure 6.10 Finding Agents 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 87 again, the request is forwarded along the path of shadow proxies until the shadow itself is reached. The shadow sends a new grant back to the agent together with its new place. The next time the agent sends its request directly to the new place. The shadow proxies can be removed as soon as the path is no longer needed and no agent still has the reference to a shadow proxy. Thus the maximum of agent and shadow ttl is the maximum time the proxy has to be hold. One exception has to be made though. The first proxy, that stays at home, cannot be removed as long as the shadow is elsewhere. The Protocol We first examine the shadow part of the protocol. Moving the shadow to another place creates a path to the target and starts a timer. After the timeout of this timer the path has to be deleted. The path is created by leaving a shadow proxy behind. Removing the shadow is done by sending a message along the path (see Figure 6.11). Each shadow gets a ttl, after which it must contact its home place. This time is not necessarily the same as for the agents. In regular intervals this ttl is decremented. As soon as the shadow's ttl is 0, the shadow enters the check phase. A message containing the shadow id and its current place is sent to the home place and a timer is started (see Figure 6.12). The check message for the shadow contains the new place of the shadow. If the shadow proxy at home still exists, it is updated and the ttl is sent back. If the answer is not received until the timeout, the shadow is removed (more complex reactions with retries can be chosen instead). As soon as it is received, the timer is stopped and the ttl is set (see Figure 6.13). The shadow proxies creating the path between home place and shadow get a similar timeout after the sum of ttl of the shadow, of the agent (see below) and the communication timeout. At that point the path is redundant and can be removed (see below). This way the path created by the shadow is cut short in regular intervals. If the shadow comes back to its home place, the shadow proxy is replaced by the original. move(target) if (timeToLive != 0) sendShadow(target, this); if(currentPlace != null) // part of path pathTimeOut = timeToLive + timeOut; startTimer(pathTimeOut, shadow); currentPlace = target; terminateShadow() if (currentPlace != null) // shadow moved sendTerminate(currentPlace, id); delete(this); Figure 6.11 Additional Shadow Methods Regular Intervals: [agent related part stays the same] for each shadow if (shadow.homePlace != place.name()) shadow.timeToLive--; if (shadow.timeToLive == 0) sendCheck(shadow.homePlace, shadow.id); startTimer(shadow.timeOut, shadow); Figure 6.12 Extended System Methods: Regular Intervals onTimer(shadow) // this path seg. is redundant shadowList.remove(shadow); receiveAllowance(shadowId, timeToLive) shadow = shadowList.find(shadowId); stopTimer(shadow); shadow.timeToLive = timeToLive; receiveCheck(from, shadowId) shadow = shadowList.find(shadowId); if(shadow != null) shadow.currentPlace = place; sendAllowance(from, shadowId, shadow.timeToLive); Figure 6.13 Additional System Methods: Checking the shadow 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 88 In the basic protocol the agent check message is sent to the shadow's home place. Now it is sent to the place from which the last ttl message has been received. This is done by storing it in an additional attribute. If the shadow moves between two such messages, the check message is sent to a shadow proxy (somewhere on the path) instead of the original. The shadow proxy now forwards this agent check message along the path. The original, upon receiving the message, sends back the ttl and its own place. The path is superfluous as soon as the shadow's place is known at the home place and no agent still references a part of it (see Figure 6.14). Together with sending back the ttl to the agent the shadow starts a timer. If after this timeout the agent did not send a check message, the shadow knows that the agent has terminated. But since the timeout is detected at a place and not inside the shadow, the information might only reach a proxy and not the shadow itself. In this case the shadow has to be informed. Thus a message is sent along the path containing the information that the agent has terminated. Every proxy sends the information onward until it reaches the shadow. Now the agent entry is removed (see Figure 6.15). 6.3.4 Optimizing the Communication As soon as more than one agent belongs to a shadow, optimizations of the communication are possible. Three optimizations exist: - If two agents belonging to the same shadow come to the same place, the ttl of the one with the lower remaining time interval is set to the ttl of the other one. This works with an arbitrarily large number of agents on a place and happens conveniently at the arrival of a new agent. - If an agent's shadow has been checked, then this information also gets transferred to all other agents belonging to the same shadow on the same place as the agent. - The combination of shadow and agent proxies creates a spanning tree that follows the agents' movements with the shadow as the root. The tree can be optimized by simply using common paths for the parts of the paths that are the same for different agents. This effec- receiveCheck(from, shadowId, agentId) stopTimer(agentId); if(currentPlace != place.name()) sendCheck(currentPlace, from, shadowId, agentId); else shadow = shadowList.find(shadowId); timeToLive = shadow.timeToLive(from, agentId); if (timeToLive > 0) startTimer(timeToLive + shadow.getTimeOut(agentId), shadow, agentId); sendAllowance(from, place.name(), agentId, timeToLive); receiveAllowance(shadowPlace, agentId, timeToLive) stopTimer(agentId); agent = agentList.findAgent(agentId); agent.timeToLive = timeToLive; agent.shadowHome = shadowPlace; proxyList.setTime(agentId, timeToLive); Figure 6.14 Changed System Methods: Extending the agent's life onTimer(shadow, agentId) // agent terminated shadow.remove(agentId); if (shadow.currentPlace != place.name() ) sendRemoved( currentPlace, shadowId, agentId); receiveRemoved(shadowId, agentId) shadow = shadowList.find(shadowId); if(shadow != null) if(shadow.currentPlace != place.name()) sendRemoved( currentPlace, shadowId, agentId); else shadow = shadowList.find(shadowId); shadow.remove(agentId); Figure 6.15 Changed System Methods: Detecting terminated agents 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 89 tively reduces the number of messages that flow without changing the functionality. Furthermore, the agents on nodes along the tree can be updated simulataneously. The proxies allow to find an agent, e.g. to terminate it actively. But with all of the mentioned optimizations the path to a specific agent can be lost. This can happen if an agent gets additional ttl from another agent, and the path assuming the original ttl is removed. The optimizations make it impossible to terminate a specific agent. The interesting point though is that this doesn't matter for the termination of the whole group of agents. If the termination message is sent to all known proxies, then these proxies forward the termination message along all of the paths they are part of. Ultimately this termination message reaches all of the agents, even those no longer directly known to the shadow. The path segment for an agent exists exactly for the current ttl of the agent. So if it got additional time, then at that place the agent proxy holds the path from that place for that remaining time. Every time an agent gets additional time from another agent, there exists a valid path to that other agent. So, by first following the path to the other agent, and then the still valid path to our agent, every agent gets the termination message. This way, all of the mentioned optimizations can be used without compromising functionality for the group as a whole. 6.3.5 Fault Tolerance Our fault model contains two types of failures, node failures (fail-stop) and network partitions. It is important to note that from the viewpoint of a node these failures are not distinguishable. By introducing a path of proxies the fault sensitivity of the protocol is increased. If only one of the nodes containing a proxy is not reachable, either through node failure or network partitioning, the path is broken. Different mechanisms have to be used for the two different kinds of paths. While in the case of a broken agent proxy path only one agent is no longer reachable until its ttl is 0, in the case of a broken shadow proxy path the agents trying to extend their life are threatened. The mechanism employed for the agent proxy paths has already been presented in Section 6.3 , and is only discussed briefly. The mechanism used for shadow proxy paths has not yet been discussed in the protocol section and is examined in the following in detail. Agent Proxy Path By introducing the ttl, after which the agent has to contact the shadow's place, it is guaranteed that even if the path is broken, the new location of the agent can be identified after the ttl (as a worst-case bound), as long as either the network partition is short-term, or agent place and shadow place are in the same partition. If after the ttl (plus the timeout) the agent has not contacted the shadow, the shadow knows that the agent does not exist any longer (either because it has terminated or has been declared orphan and removed by the system). Shadow Proxy Path Two strategies are possible for dealing with a broken shadow proxy path. The first strategy does not change the characteristics of the protocol, but manages only short-term failures. It lets the 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 90 last shadow proxy of the still-existing path try to contact the next shadow proxy again. The problem though is that the new ttl has to be sent to the agent before the system decides to terminate it. The second strategy allows for longer failures but changes the worst-case bound for passive termination of the agents (the worst-case bound is 2ttl in this variant). If the last shadow proxy detects the break, it sends a new ttl back to the agent, but with the home place of the shadow as the new location. The new ttl is the minimum of the remaining shadow ttl and the agent ttl. If the shadow would have been removed, then the shadow proxy would know about it (and would have been removed as well). Thus the shadow still exists and it is correct to send the allowance. The home place of the shadow is sent instead of the location of the next shadow proxy in the path, to guarantee that the agent has a valid place to send the request for the next ttl. If the ttl of the agent is shorter than the remaining time of the shadow proxy path, then the next request will be sent along the same path (that hopefully is connected again). If the ttl of the path is shorter, then the agent will contact the home place of the shadow when the shadow itself has requested a new ttl. This means that the home place holds the new location of the shadow and forwards the request correctly. 6.4 Related Work In the area of mobile agent systems the current research concentrates on the basic system support. But now that many different agent systems existing support the functionality needed to realize applications, mechanisms providing the functionality presented in this paper are essential. Thus the problem areas of orphan detection and termination of agents are beginning to evoke the interest of the research community. But apart from the mechanisms developed at the University of Stuttgart (see [5] describing a group concept or [2] discussing an energy concept and a path concept) no publications present similar functionality for mobile agent systems. However, in the area of distributed systems many algorithms exist that solve similar problems. The area of distributed algorithms, and especially distributed termination detection (in [9] and in [13] a discussion of many algorithms can be found) and distributed garbage collection (one example is the work on Stub Scion Pair Chains [11]), has to be seen as related work. But two differences prevent the use of these algorithms for mobile agent systems. First of all, the fault model is different. The possibility of network partitions or node crashes does not exist in the fault model used for most distributed algorithms. Mobile agent systems explicitly include these faults in their fault model. Furthermore, the fault model supports the asynchrony of agents. The second difference is the autonomy of the ?objects? in question that very much influences the processing model. A process (or object) in the distributed system area is not normally seen as autonomous. Here a process is seen as a cooperating part of a larger application. For a mobile agent the autonomy is one of the important prerequisites. This autonomy leads to the problem that a malicious agent might try to remove itself from the control by the system. These differences make it impossible to use the existing distributed algorithms in the area of mobile agent systems. It might be possible to use one such algorithm as the basis for a new design tailored to the needs of mobile agent systems. But the changes in the fault model and in the processing model effect so many changes in the algorithm itself that a correct transformation would be problematic at best. Nevertheless we believe that in principle it is possible to transform these algorithms correctly into algorithms that take the peculiarities of mobile agent systems into ac- 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 91 count. The key to this is an automatic transformation that, used on e.g. an algorithm for distributed garbage collection, turns it into a orphan detection and / or termination algorithm for mobile agent systems. An analogon to such an algorithm exists for the automatic transformation of termination detection algorithms into distributed garbage collection algorithms [10]. 6.5 Conclusion and Future Work In this paper we presented the shadow protocol. The shadow protocol has still some disadvantages: it introduces additional communication into the system and resources (memory) are bound to store the different path information. But the advantages outweigh the disadvantages by far: the mechanism is robust against malicious or faulty agents, the path information is updated without additional communication costs (no outdated path information exists), and the time until all agents are terminated in the worst case can be determined exactly. The presented protocol has been implemented in our agent system Mole (for a description of Mole see [12], [1], and [4]). We will examine the area of fault tolerance in detail. The presented mechanism is robust against short time network partitioning and system faults, but does not cope well with lasting faults. We will investigate in which way the shadow concept can be made fault resilient by replication of the control structures. Comment: This paper does not contain the full protocol as an appendix due to space restrictions. For the complete description please refer to [3]. Acknowledgements: Parts of the protocol have been implemented by M. Zepf. The comments of F. Hohl, M. Schwehm and M. Straßer improved the quality of the paper. References 1. J. Baumann, F. Hohl, N. Radouniklis, K. Rothermel, M. Straßer. ?Communication Concepts for Mobile Agent Systems?, in Mobile Agents `97, LNCS 1219, Springer-Verlag, pp. 123 - 135, 1997. 2. J. Baumann. ?A Protocol for Orphan Detection and Termination in Mobile Agent Systems?, Tech. Report 1997/09, Fac. of Computer Science, U. of Stuttgart, 1997. 3. J. Baumann, K. Rothermel. ?The Shadow Approach: An Orphan Detection Protocol for Mobile Agents?, Tech. Report 1998/08, Fac. of Computer Science, U. of Stuttgart, 1998. 4. J. Baumann, F. Hohl, K. Rothermel, M. Straßer. ?Mole - Concepts of a Mobile Agent System?, in WWW Journal, Special Issue on Software Agents, to appear. 5. J. Baumann, N. Radouniklis. ?Agent Groups for Mobile Agent Systems?, in Distributed Applications and Interoperable Systems, H. König et al., Eds., Chapman & Hall, pp. 74 - 85, 1997. 6. J. Baumann, C. Tschudin, J. Vitek. ?Mobile Object Systems: Workshop Summary?, Workshop Proceedings for the 2nd Workshop on Mobile Object Systems, in Workshop Reader ECOOP '96, d-punkt.verlag, pp. 301 - 308, 1996. 7. General Magic, ?Odyssey Web Site?. URL: http://www.genmagic.com/agents/ 8. IBM. ?The Aglets Workbench?. URL: http://www.trl.ibm.co.jp/aglets/ 9. F. Mattern. ?Verteilte Algorithmen?, Springer-Verlag, 1989. 6 WP 2.5: An Orphan Detection Protocol for Mobile Agents 92 10. G. Tel, F. Mattern. ?The Derivation of Distributed Termination Detection Algorithms from Garbage Collection Schemes.?, ACM TOPLAS 15:1, pp. 1-35, 1993. 11. M. Shapiro, P. Dickman, D. Plainfoss?. ?SSP Chains: Robust, Distributed References supporting acyclic Garbage Collection?, Tech. Report No. 1799, INRIA, Rocquencourt, Frankreich, 1992. 12. M. Straßer, J. Baumann, F. Hohl. ?Mole - A Java Based Mobile Agent System?, in Workshop Reader ECOOP '96, d-punkt, pp. 327 - 334, 1996. 13. G. Tel. ?Distributed Algorithms?, Cambridge University Press, 1994. 14. J. E. White. ?Telescript Technology: The Foundation of the Electronic Marketplace?, General Magic, 1994.