Using SCTP Multihoming for Fault Tolerance & Load Balancing

Armando L. Caro Jr., Janardhan R. Iyengar, Paul D. Amer, Gerard J. Heinz
Protocol Engineering Lab
Computer and Information Sciences
University of Delaware
{acaro, iyengar, amer, heinz}@cis.udel.edu

Randall R. Stewart
Cisco Systems Inc.
rrs@cisco.com

OVERVIEW

Mission-critical systems rely on redundancy at multiple levels to provide uninterrupted service during resource failures. When connected to IP networks, such systems often provide network redundancy by multihoming their hosts. A host is multihomed if it can be addressed by multiple IP addresses. An endpoint's IP address can become inaccessible because of an interface failure, severe congestion, or BGP's slow route convergence around path outages. Redundancy at the network layer allows a host to remain accessible even if one of its IP addresses becomes unreachable; packets can be rerouted to one of its alternate IP addresses.

TCP does not support multihoming. Whenever either endpoint's IP address becomes unreachable, the TCP connection will time out and abort, forcing the upper layer to recover. The recovery delay can be unacceptable for mission-critical applications such as IP telephony, IP storage, and military battlefield communications.

To address TCP's shortcoming, the Stream Control Transmission Protocol (SCTP) has been designed with fault tolerance in mind. SCTP supports multihoming at the transport layer, allowing SCTP associations to remain alive even when an endpoint's IP address becomes unreachable.

ADAPTIVE FAILOVER MECHANISM

SCTP has a built-in failure detection and recovery system, known as failover, which allows associations to dynamically send traffic to an alternate peer IP address when needed. SCTP's failover mechanism is static and does not adapt to application requirements or network conditions.
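
To make the static behaviour concrete, the following minimal sketch models the per-destination failure detection described in the SCTP specification (RFC 2960, later RFC 4960): each retransmission timeout increments an error counter and doubles the retransmission timeout (RTO), and the destination is marked inactive once the counter exceeds Path.Max.Retrans. The class and function names are illustrative and do not come from any SCTP implementation. The sketch assumes that the RTO has already converged to RTO.Min, that every timeout is consecutive (no acknowledgement arrives in between), and it ignores heartbeats and Association.Max.Retrans.

```python
from dataclasses import dataclass

# Protocol parameter defaults recommended by the SCTP specification.
RTO_MIN = 1.0          # seconds
RTO_MAX = 60.0         # seconds
PATH_MAX_RETRANS = 5   # per-destination failure threshold


@dataclass
class Destination:
    """Hypothetical per-destination state kept by a sender."""
    address: str
    rto: float = RTO_MIN   # assumption: RTO has converged to RTO.Min
    error_count: int = 0
    active: bool = True

    def on_timeout(self, path_max_retrans: int = PATH_MAX_RETRANS) -> None:
        """Count one retransmission timeout and back off the RTO."""
        self.error_count += 1
        self.rto = min(self.rto * 2, RTO_MAX)
        # The destination is considered failed only once the counter
        # exceeds the (fixed) threshold.
        if self.error_count > path_max_retrans:
            self.active = False

    def on_ack(self) -> None:
        """An acknowledgement from this destination clears the counter."""
        self.error_count = 0


def failure_detection_time(path_max_retrans: int = PATH_MAX_RETRANS,
                           rto: float = RTO_MIN) -> float:
    """Seconds until a silent destination is marked inactive."""
    dest = Destination("primary", rto=rto)
    elapsed = 0.0
    while dest.active:
        elapsed += dest.rto            # wait for the current timer to expire
        dest.on_timeout(path_max_retrans)
    return elapsed


if __name__ == "__main__":
    for pmr in (0, 2, 5):
        print(f"Path.Max.Retrans={pmr}: {failure_detection_time(pmr):.0f} s")
```

Under these assumptions, the default threshold of 5 yields a detection time of about a minute, while smaller thresholds detect failures sooner at the risk of failing over spuriously. The threshold is a fixed parameter in this model, which is the sense in which the mechanism is static.
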
Network dynamics, however, vary greatly among networks, and hence the heuristics used to decide when to fail over should be adjusted accordingly. Applications also have different requirements that the failover mechanism should accommodate. For example, SS7 signalling applications require that failovers take no longer than 800 ms. On the other hand, a file transfer may be more concerned with the total transfer time. Therefore, we argue that network fault tolerance should cope with dynamic network conditions and varying application needs.

We have developed a two-level (alpha-beta) threshold mechanism for SCTP that provides added control over failover actions. We have formally specified and modeled our failover mechanism, and are currently investigating the relationships between failover thresholds and network parameters such as round-trip times and packet loss rates. From these relationships, we will develop an adaptive failover mechanism for SCTP.

END-TO-END LOAD BALANCING

SCTP provides for application-initiated changeovers so that the sending application can change the sender's primary destination address, thus moving the outgoing traffic to a potentially different path. Although the changeover mechanism was motivated by other needs, it is not difficult to envision application developers using this feature for load balancing at the application layer.

With the multihoming feature in SCTP, we feel that end-to-end load balancing can be implemented at the transport layer. Being better informed about the end-to-end paths, the transport layer can perform fine-grain load balancing. We foresee issues during load balancing in areas such as congestion control and loss detection and recovery. These issues also suggest that the transport layer should be involved.

We are currently looking at issues caused by changeover. We have uncovered a problem that results in congestion window (cwnd) overgrowth during changeover.
Analysis shows that this problem may not be a corner case and may occur under a variety of network and changeover conditions. We propose the Rhein algorithm and the Changeover Aware Congestion Control (CACC) algorithms as two solutions to the problem. We are further investigating changeover problems and solutions during load balancing. In the future, we will investigate shared bottleneck detection techniques for congestion control, as well as traffic scheduling issues during load balancing.

For further information:
http://www.cis.udel.edu/~acaro/research
http://www.cis.udel.edu/~iyengar/publications
http://pel.cis.udel.edu