Key Concepts

1. Redundancy

Definition: Duplication of critical components.

Example: Running multiple servers so that another can take over if one fails.
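The idea can be sketched in a few lines: duplicate every write across several replicas, so a read still succeeds when one replica is down. This is a minimal in-memory sketch; the `Server` class and names are illustrative, not a real storage API.

```python
# Redundancy sketch: duplicate writes across replicas so reads survive
# the failure of any single component. All names are illustrative.
class Server:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value):
        if self.up:
            self.data[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

def replicated_write(servers, key, value):
    """Duplicate every write across all replicas."""
    for s in servers:
        s.write(key, value)

def redundant_read(servers, key):
    """Any surviving replica can serve the read."""
    for s in servers:
        try:
            return s.read(key)
        except ConnectionError:
            continue  # this replica failed; try the next one
    raise RuntimeError("all replicas are down")

replicas = [Server("a"), Server("b"), Server("c")]
replicated_write(replicas, "user:1", "alice")
replicas[0].up = False                      # one component fails...
print(redundant_read(replicas, "user:1"))   # ...but the data is still served
```

Note that redundancy alone only removes the single point of failure; the mechanism that switches between replicas is the crossover discussed under failover.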

2. Failover

Definition: Switching to a standby component upon failure.

Example: Backup server activation when the primary fails.
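A minimal failover sketch, assuming a simple active/standby pair: requests go to the active node, and when it stops responding the standby is promoted and the request is retried. The `Node` and `FailoverPair` classes are illustrative.

```python
# Failover sketch: route to the primary, switch to the standby on failure.
class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        return f"{self.name} handled {request}"

class FailoverPair:
    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Crossover: promote the standby and retry the request.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)

pair = FailoverPair(Node("primary"), Node("backup"))
pair.active.healthy = False        # primary fails
print(pair.handle("GET /"))        # backup handled GET /
```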

3. Load Balancing

Definition: Distributing workloads to prevent overload.

Example: Distributing traffic across servers using a load balancer.
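The simplest distribution strategy is round-robin: each request goes to the next server in a fixed rotation. A minimal sketch (server names are illustrative; real balancers also weight servers and skip unhealthy ones):

```python
import itertools

# Round-robin load balancing: hand out servers in rotation so no
# single server absorbs all the traffic.
class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["web1", "web2", "web3"])
print([lb.pick() for _ in range(6)])
# ['web1', 'web2', 'web3', 'web1', 'web2', 'web3']
```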

4. Clustering

Definition: Multiple servers working together as a single system.

Example: Database cluster handling storage and processing.
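One way a cluster presents several nodes as a single system is deterministic key routing: each key hashes to one node, and callers never see which node stored it. A sketch under that assumption (node names are illustrative; real clusters also replicate and rebalance data):

```python
import hashlib

# Clustering sketch: several nodes act as one key-value store.
# Each key is routed to a node by hashing its name.
class Cluster:
    def __init__(self, nodes):
        self.nodes = {name: {} for name in nodes}

    def _node_for(self, key):
        digest = hashlib.sha256(key.encode()).hexdigest()
        names = sorted(self.nodes)
        return names[int(digest, 16) % len(names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)][key]

cluster = Cluster(["db1", "db2", "db3"])
cluster.put("order:42", "shipped")
print(cluster.get("order:42"))  # shipped
```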

5. Geographic Redundancy

Definition: Duplicate systems in different locations.

Example: Data centers in different regions for disaster recovery.

6. Monitoring and Alerting

Definition: Observing system health and performance, and raising alerts when problems are detected.

Example: Using tools like Nagios to track server health.
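The core loop of any monitoring tool is simple: probe each service, record its status, and fire an alert on failure. A minimal sketch, not Nagios itself; the probe callables and alert handler are illustrative stand-ins:

```python
# Monitoring-and-alerting sketch: poll each service and alert on failure.
def check_services(checks, alert):
    """checks: {name: callable returning True if healthy};
    alert: called with a message for each unhealthy service."""
    statuses = {}
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        statuses[name] = healthy
        if not healthy:
            alert(f"ALERT: {name} is unhealthy")
    return statuses

alerts = []
statuses = check_services(
    {"web": lambda: True, "db": lambda: False},
    alert=alerts.append,
)
print(statuses)  # {'web': True, 'db': False}
print(alerts)    # ['ALERT: db is unhealthy']
```

In production the probes would be real HTTP or TCP checks and the alert handler would page an operator; the structure stays the same.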

Principles

There are three principles of systems design in reliability engineering that help achieve high availability.

Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system.

Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.

Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure, but the maintenance activity must detect it.

Scheduled and unscheduled downtime

Scheduled downtime: Caused by maintenance tasks like software patches or configuration changes requiring reboots. It is usually management-initiated and unavoidable with the current system design.

Unscheduled downtime: Arises from unexpected physical events such as hardware/software failures, power outages, network issues, or security breaches.

The distinction is helpful when users are warned of scheduled downtime in advance. For true high availability, however, any downtime, scheduled or unscheduled, is disruptive.
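Because both kinds of downtime count against availability, the headline figure is simple arithmetic: the fraction of the period the system was up. A sketch with illustrative numbers (8.76 hours of downtime per year corresponds to "three nines"):

```python
# Availability arithmetic: scheduled and unscheduled downtime both
# count against the availability percentage. Figures are illustrative.
def availability(total_hours, downtime_hours):
    """Percentage of the period the system was up."""
    return 100 * (total_hours - downtime_hours) / total_hours

year = 365 * 24  # 8760 hours in a non-leap year
print(round(availability(year, 8.76), 2))   # 99.9  ("three nines")
print(round(availability(year, 87.6), 1))   # 99.0  ("two nines")
```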