Network Availability: the quest for the five nines

Any organization that relies on data networks for its core operations needs to ensure the continued availability of its network infrastructure, which includes LAN devices (switches, routers, firewalls, etc.), WAN links, Internet and cloud connections, and the support facilities (power, air conditioning, etc.). The network operator may achieve the desired level of LAN availability using several approaches. Availability of services obtained from service providers or carriers is often defined by an SLA (service level agreement) that states, among other things, the percentage of time during which the service is expected to be up and running (uptime). It is common for the uptime to range from 99.95% to 99.9999%, depending on the type of service. Five-nines availability, that is, 99.999% uptime or a little over 5 minutes of downtime a year, is considered the norm for telecommunication carriers [1], but as networks become vital to business continuity for many organizations, the need for five-nines availability will proliferate.


The percentage of uptime alone does not provide sufficient information about the availability of the network. For instance, 99.99% availability translates to about 52 minutes of downtime per year. This downtime can occur on one occasion, in periods of about four minutes every month, or in periods of about one minute every week. Therefore, availability is better measured and controlled using the metrics MTBF (mean time between failures) and MTTR (mean time to repair). The two metrics are related to the percentage of availability by the equation shown below, but they provide a better estimate of how long the network is expected to be operational and how fast it can recover from an unexpected failure.

(1) $\begin{equation*} Availability=\frac{MTBF}{MTBF+MTTR} \end{equation*}$
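As a rough illustration of Equation (1), the following Python sketch converts an availability figure into minutes of downtime per year and shows that two very different failure patterns can yield the same percentage. The function names and the MTBF/MTTR values are assumptions chosen for the example, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Equation (1): fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Downtime budget for a few common availability targets.
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} uptime -> {downtime_minutes_per_year(target):.2f} min/year")

# Hypothetical failure patterns: rare but long outages vs. frequent short ones.
rare_long = availability(mtbf_hours=9999.0, mttr_hours=1.0)
frequent_short = availability(mtbf_hours=999.9, mttr_hours=0.1)
print(rare_long, frequent_short)  # both are approximately 0.9999
```

Both hypothetical patterns give roughly 99.99% availability, which is why the MTBF and MTTR values are more informative than the percentage alone.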

Reliability

Reliability of a given system or component refers to the probability (likelihood) that it is operational at any given time. A network component that remains operational, on average, for 364 days a year is said to have a reliability of 99.73%. The component can also be described as having a frequency of failure of 1 day/year. Components that have high reliability are expected to have a long MTBF and a low frequency of failure.

Networks consist of many interdependent systems and components such as internetworking devices (routers and switches), facilities (power, air conditioning, rack space, etc.), physical and data security, management and configuration controls, and others. If the relationship among these systems is viewed as a connected chain of functions, the network can be operational only if all these systems are functioning properly. The reliability, R, of the network is then the product of the reliabilities of all individual systems.

(2) $\begin{equation*} R_{Network}=R_{System_1} R_{System_2}\dots R_{System_N} \end{equation*}$
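A short Python sketch of Equation (2) may help show how quickly reliability erodes along a chain. It assumes the systems fail independently; the function names and sample values are illustrative only.

```python
import math


def series_reliability(reliabilities):
    """Equation (2): a chain works only if every system in it works."""
    return math.prod(reliabilities)


def required_component_reliability(target: float, n: int) -> float:
    """Reliability each of n identical chained systems needs to meet a target."""
    return target ** (1.0 / n)


# Five chained systems, each at 99.9%, fall short of 99.9% overall.
print(f"{series_reliability([0.999] * 5):.4%}")

# Per-system reliability needed for a five-nines chain of 10 systems.
print(f"{required_component_reliability(0.99999, 10):.6%}")
```

The second figure is always higher than the network target, and it climbs further as the number of chained systems grows.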

According to Equation (2), to end up with a network of 99.999% reliability, individual components and systems must have higher reliability than that. As the number of components in the network increases, the individual reliability must go higher as well. The reliability of hardware components and of services acquired from providers (e.g. WAN connections) is often beyond the network operator’s control. Instead, the effort is concentrated on eliminating single points of failure (SPOF) in the network to reduce the chance that the failure of a single component takes down the entire network [2].

Redundancy

Redundancies in the network infrastructure eliminate SPOFs by adding components and other resources (e.g. memory, bandwidth or power) beyond those needed for the normal operation of the network. The goal is to make these resources available in the event of a loss of the main resources due to a failure. Complete duplication of components, known as 2N redundancy, is quite expensive considering the excess resources that remain unused. Alternatively, N+1 or N+M redundancy may provide more cost-effective redundancy by relaxing the requirement to duplicate every component and providing one or a few standby components instead.

Let’s consider a simple example to demonstrate the effect of redundancy. A router serves as an Internet gateway for an organization’s LAN. If the router fails unexpectedly, purchasing a replacement router and waiting for its arrival may extend the time to recover from the failure to days or even weeks. If the LAN availability is measured over one year (to comply with an SLA, for example), then it is estimated to be around 95% based on Equation (1). A vendor’s service contract may guarantee replacing the failed router within a predefined time (2 to 24 hours) for an annual fee. In this case, the repair time includes the delivery time plus the time to bring the router online, and the availability approaches 99.95%. To reduce the repair time further to an hour or less and increase availability to 99.99%, a spare router can be kept in storage.
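The arithmetic behind the router example can be sketched in Python as follows. The sketch assumes exactly one failure during the year, and the three repair times (18 days, 4 hours, 1 hour) are hypothetical values picked to fall within the ranges mentioned above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def availability_single_failure(repair_hours: float) -> float:
    """Equation (1) with one failure in the year: uptime / (uptime + repair)."""
    uptime_hours = HOURS_PER_YEAR - repair_hours
    return uptime_hours / (uptime_hours + repair_hours)


# Hypothetical repair times, chosen only to illustrate the three scenarios.
scenarios = {
    "order a replacement (18 days)": 18 * 24,
    "vendor service contract (4 hours)": 4,
    "spare router in storage (1 hour)": 1,
}

for name, repair_hours in scenarios.items():
    print(f"{name}: {availability_single_failure(repair_hours):.4%}")
```

Under these assumptions the three scenarios come out at roughly 95.07%, 99.95% and 99.99%.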

The example is an oversimplification because a single instance of failure is not sufficient to measure the availability of a network; the MTTR is the statistical average of repair times from multiple failures occurring over a long period of time. The example shows, however, that redundancy is an attractive approach to improving network reliability. As the following equation shows, a highly available system of 99.9999% reliability can be constructed from two redundant components of 99.9% reliability or three redundant components of 99% reliability.

(3) $\begin{equation*} (1-R_{System})=(1-R_{Component_1} )(1-R_{Component_2} )\dots(1-R_{Component_N}) \end{equation*}$
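Equation (3) can be checked numerically with a few lines of Python. The sketch assumes that the redundant components fail independently and that failover between them is instantaneous and never fails; the function name is illustrative.

```python
def parallel_reliability(reliabilities):
    """Equation (3): the system fails only if every redundant component fails."""
    unavailability = 1.0
    for r in reliabilities:
        unavailability *= (1.0 - r)
    return 1.0 - unavailability


# Two components at 99.9% and three components at 99% reach the same level.
print(f"{parallel_reliability([0.999, 0.999]):.4%}")
print(f"{parallel_reliability([0.99, 0.99, 0.99]):.4%}")
```

Both configurations evaluate to about 99.9999%, matching the figures quoted above.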

Failover switching mechanisms

Redundancy requires a mechanism to detect the failure and initiate the failover process. Successful failover to standby components requires timely detection of the fault and successful transfer of functions to the standby components. It is also required that any activated component and other backup resources are capable of performing the same functions and carrying the same workload as the failed component.

Network management systems (NMS) can detect faults and alert the network operators, who can replace the faulty components and restore configurations manually as in the previous example. However, as MTTR requirements approach a few minutes, a timely response by a human operator becomes impossible, and automatic detection and switching mechanisms are needed.

For our router example, this means that the additional router must be connected to the network, powered on, and ready to take over as soon as the main router fails. Protocols such as VRRP (Virtual Router Redundancy Protocol) or HSRP (Cisco’s counterpart) provide the necessary detection and activation mechanisms, but the network operator must ensure that the configuration of both routers stays synchronized, either manually or using automated tools. Considering the various delays associated with operating VRRP [3] and the convergence of other protocols such as BGP after the fault [4], the network may return to full operation within a few minutes, which is sufficient to push the availability to 99.999%.
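To get a feel for the detection delay, the sketch below computes the time a VRRP backup waits before declaring the master down, assuming the VRRPv2 timer definitions of RFC 3768 (master down interval = 3 × advertisement interval + skew time, with skew time = (256 − priority) / 256 seconds). The failure count and the 120-second convergence time are hypothetical inputs, not values taken from [3] or [4].

```python
def vrrp_master_down_interval(priority: int, advert_interval_s: float = 1.0) -> float:
    """Seconds a VRRPv2 backup waits without advertisements before taking over.

    Assumes the RFC 3768 timers: 3 * advertisement interval + skew time.
    """
    skew_time_s = (256 - priority) / 256.0
    return 3.0 * advert_interval_s + skew_time_s


def annual_downtime_minutes(failures_per_year: float,
                            detection_s: float,
                            convergence_s: float) -> float:
    """Yearly downtime if every failure costs detection plus convergence time."""
    return failures_per_year * (detection_s + convergence_s) / 60.0


# Backup router with priority 100 and a 1-second advertisement interval.
detection = vrrp_master_down_interval(priority=100)
print(f"detection delay: {detection:.3f} s")

# Hypothetical: two failures a year, 120 s for routing protocols to converge.
print(f"downtime: {annual_downtime_minutes(2, detection, 120.0):.2f} min/year")
```

With these assumed inputs the yearly downtime stays below the five-nines budget of about 5.26 minutes, but only because the convergence time, not the VRRP detection delay, is kept to a couple of minutes.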

It is easy to overlook the fact that failover switching mechanisms can also be subject to failure. The NMS may fail to detect the fault or to send the alert message to the human operator. VRRP may not function properly because of misconfiguration or because of another fault in the network. These failures may go unnoticed during normal network operation because they do not affect network performance, but they have severe consequences when faults occur.

Stateful recovery

Redundancy may reduce the network recovery time, but it does not recover the data in transit. If failover is to go unnoticed by protocols and applications such as VoIP, then stateful recovery is required. In stateless failover, all active connections going through our example router can time out, and sessions are dropped. The backup router needs to establish all routing adjacencies and rebuild the routing, NAT, and ARP tables. Applications also need to re-establish connections when the backup router takes over.

If stateful failover is supported, the active router must continuously pass information to the backup router such as device status, TCP connection states, NAT and ARP tables, etc. When the failover occurs, the same information will be available to the backup router to use immediately and the applications running on the network do not need to re-establish communication sessions. In addition to its advantage to certain applications (e.g. no dropped VoIP conversations), stateful recovery reduces the recovery time to seconds or a sub-second interval and improves the availability to the range of 99.9999%.

Complexity

Equation (3) justifies the use of redundancy as a means to build a highly available network infrastructure from less reliable, inexpensive components. However, Equation (2) suggests that the same result can be achieved by simplifying the architecture (fewer stages) and/or using highly available components in each stage. Estimating the reliability of the network by statistical means is not a trivial task because of the complexity that results from adding redundancies and inter-dependencies.

The multitude of protocols and levels of redundancy may cause multiple protocols to react to a failure and attempt recovery simultaneously. For instance, when a link fails, the network will initiate recovery by re-configuring the spanning tree and activating another link. A backup router may also attempt to take over the routing function when it fails to receive notifications from the main router as a result of the link loss. Such conditions are avoided by introducing artificial delays in reacting to these events, at the expense of a longer repair time.

Redundancy may provide a false sense of reliability and scalability. A network of many redundant components can experience multiple failures before it suffers performance degradation or an all-out outage. Once an outage occurs, the network operator is faced with the task of repairing multiple failures. Redundant components may suffer cascading failures if they are subjected to the same external events that caused the original failure, such as capacity overload or the exploitation of a software bug. Performance degradation may also occur when one component within a group of load-balancing components fails, if the remaining components do not have enough capacity to handle the entire workload.
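The load-balancing concern can be expressed as a simple capacity check. The following Python sketch uses a hypothetical helper and arbitrary capacity units; it conservatively assumes that the largest members of the group are the ones that fail.

```python
def survives_failures(capacities, peak_load: float, failures: int = 1) -> bool:
    """True if the group still carries peak_load after losing its largest members.

    Worst case is assumed: the `failures` highest-capacity components fail.
    """
    survivors = sorted(capacities)[: max(len(capacities) - failures, 0)]
    return sum(survivors) >= peak_load


# Three load-balanced devices of 10 units each (hypothetical values).
group = [10, 10, 10]
print(survives_failures(group, peak_load=18))  # two survivors carry 20 units
print(survives_failures(group, peak_load=25))  # two survivors cannot carry 25
```

A group that passes this check at today's peak load may fail it as traffic grows, which is one way redundancy can mask a scalability problem until a component fails.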

Conclusions

Network availability can be improved considerably by implementing multiple types of redundancy, but achieving the coveted five nines requires paying attention to issues beyond simple redundancy. Decisions about levels of complexity and scalability can be made during the design stage by choosing between a few highly available components and a larger number of less reliable components. Failover mechanisms can fail, but redundancy in these systems is not always available. The role of network management systems and practices is significant not only in detecting faults and critical conditions, such as exceeding safe capacity levels, but also in ensuring fast recovery. Standard network protocols have inherent limitations with respect to reacting to faults and recovering from them. These limitations have to be understood, and proprietary solutions can be sought, if available, to achieve the desired availability.