Network Design Principles: Redundancy

Redundancy is a primary tool for creating resilience in communications networks. It eliminates single points of failure by relying on the low probability that two network components fail at the same time. Redundancy can be classified into four kinds [Lidwell]:
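To make the "two components rarely fail at the same time" argument concrete, here is a minimal sketch in Python. It assumes that component failures are statistically independent, which is an idealisation, and the 99% availability figure is only an illustrative assumption. The function name parallel_availability is hypothetical.

```python
def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of a group of n redundant components.

    Assumes failures are independent and that the group works as long
    as at least one component works.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if n < 1:
        raise ValueError("need at least one component")
    # The group is down only when all n components are down together.
    return 1.0 - (1.0 - component_availability) ** n


if __name__ == "__main__":
    # Illustrative assumption: each component is available 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under these assumptions, a second component moves the group from 99% to 99.99% availability. Real networks rarely have fully independent failures, so treat the result as an upper bound.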

Diverse redundancy: uses multiple components of different types. Diverse redundancy is resistant to a single cause of failure, but it is complex to implement and maintain. For example, a network may achieve diverse redundancy by using one fibre link and one wireless link to connect two buildings. This reduces the likelihood that a single problem will cause both links to fail. Another example is to use two fibre links over two diverse routes, so that a cut on one route does not affect the other. The former example represents technological diversity while the latter represents spatial diversity.

Homogenous redundancy: uses multiple components of a single type (e.g., use of multiple firewalls). Homogenous redundancy is relatively simple to implement and maintain but it is susceptible to single causes of failure. In other words, the type of cause that results in failure in one component can result in failure of other components of the same type. For example, a Denial of Service (DoS) attack may cause both the main and backup firewall to fail.
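The effect of a shared cause of failure can be sketched with a simplified beta-factor model from reliability engineering, where a fraction beta of each component's failures is assumed to hit both members of a pair at once. The function name and all the numbers below are illustrative assumptions and do not come from measurements.

```python
def pair_unavailability(q: float, beta: float) -> float:
    """Approximate unavailability of a redundant pair.

    q    : unavailability of a single component (0..1)
    beta : assumed fraction of failures that are common-cause (0..1)

    Simplified beta-factor approximation: the pair is down when both
    members fail independently, or when a common cause takes out both.
    """
    if not (0.0 <= q <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("q and beta must be between 0 and 1")
    independent_part = ((1.0 - beta) * q) ** 2
    common_cause_part = beta * q
    return independent_part + common_cause_part


if __name__ == "__main__":
    q = 0.01  # illustrative: each device is down 1% of the time
    # Illustrative assumption: identical devices share more failure causes
    # (same software bug, same attack) than devices of different types.
    print("homogenous pair:", pair_unavailability(q, beta=0.10))
    print("diverse pair:   ", pair_unavailability(q, beta=0.01))
```

With these assumed values the common-cause term dominates the homogenous pair, which is the point made above: a second firewall of the same type helps little against a cause that affects both.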

Active redundancy: requires that redundant components are functioning at all times (e.g., two active firewalls). In networks, active redundancy can be further classified into two types:

Active/Active Redundancy: In this type, the devices functioning in a redundancy group are both active at the same time and may share the task load (i.e. perform load balancing). In the event of a failure in one member, the other member(s) of the redundancy group take over the function of the failed device. The remaining active member(s) must have enough capacity to handle all load to avoid any performance degradation. Active redundancy also allows for component failure, repair, and substitution with minimal disruption of network performance. However, there are usually technology and implementation constraints that make it difficult to always have active/active redundancy everywhere on the network.
Active/Standby (aka Hot Standby) redundancy: In this type, one or more devices are dedicated as backups. Both the backup and primary devices are running simultaneously, but the backup performs no actual function other than monitoring the primary device. Once a fault is detected, the backup takes over automatically. This usually requires little or no human intervention and results in a short mean time to repair (MTTR).
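The capacity requirement for active/active groups can be checked with a few lines of Python. This is a minimal sketch: the function name survives_failures is hypothetical, and the capacities and loads are illustrative figures in whatever unit you use (for example Gbps).

```python
from typing import Sequence


def survives_failures(capacities: Sequence[float],
                      peak_load: float,
                      failures: int = 1) -> bool:
    """Return True if the group can still carry peak_load after losing
    its `failures` largest members (the worst case)."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    if failures >= len(capacities):
        return False  # nothing left to carry the load
    # Drop the largest members: losing them hurts the most.
    remaining = sorted(capacities)[: len(capacities) - failures]
    return sum(remaining) >= peak_load


if __name__ == "__main__":
    firewalls = [10.0, 10.0]  # illustrative: two 10 Gbps firewalls
    print(survives_failures(firewalls, peak_load=8.0))   # one unit can carry 8
    print(survives_failures(firewalls, peak_load=12.0))  # one unit cannot carry 12
```

In the second case the pair handles the load while both members are healthy, but a single failure leaves the group overloaded, so the redundancy exists only on paper.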

Passive redundancy: activates redundant components only when an active component fails (e.g., using a spare switch in the event that an active switch fails). The backup device may be mounted, configured and powered off. When the primary device fails, the backup is turned on. This type of redundancy (sometimes called cold standby) usually involves human intervention and results in a longer MTTR.
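The impact of MTTR (mean time to repair) on availability can be estimated with the usual steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The sketch below compares a hot standby and a cold standby. All figures are illustrative assumptions, and MTTR is taken here as the time until service is restored by the backup.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per year implied by the availability figure."""
    return HOURS_PER_YEAR * (1.0 - availability(mtbf_hours, mttr_hours))


if __name__ == "__main__":
    mtbf = 10_000.0        # illustrative assumption
    hot_mttr = 1.0 / 60.0  # assumed: automatic failover within a minute
    cold_mttr = 4.0        # assumed: someone has to power on the spare
    for name, mttr in (("hot standby", hot_mttr), ("cold standby", cold_mttr)):
        print(f"{name}: availability={availability(mtbf, mttr):.6f}, "
              f"downtime/year={downtime_hours_per_year(mtbf, mttr):.2f} h")
```

The same failure rate yields very different downtime depending on how quickly the backup takes over, which is why passive redundancy fits best where interruptions are tolerable.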

The four kinds of redundancy can be used in combination to achieve highly reliable systems, following these recommendations:

Diverse redundancy is suitable for critical systems when the probable causes of failure cannot be anticipated.
Homogenous redundancy is suitable when the probable causes of failure can be anticipated.
Active redundancy is suitable for critical systems that must maintain stable performance in the event of component failure or extreme changes in system load.
Passive redundancy is suitable for non-critical components within networks, or networks in which performance interruptions are tolerable.

Keep in mind, though, that redundancy also increases complexity, so beyond a certain point adding more redundancy no longer improves availability.

Read more about redundancy in my previous post.

Read about other Network Design Principles.

References:

Lidwell, William, et al. Universal Principles of Design. 2nd ed., Rockport, 2010.