Failure handling with High Availability Controller (HAC)

This section describes possible failure scenarios and the typical recovery procedures for them when the High Availability Controller (HAC) is in use. HAC handles various failure scenarios automatically. Other failure or initialization scenarios (administrative scenarios, for short) can be handled by a human administrator or by a watchdog-type software program.

HAC is a watchdog-type program that monitors the Primary and Secondary servers and issues commands to change their states when necessary. For example, HAC can determine whether the Primary or Secondary server itself has failed or whether only the communication link between the servers is down.
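The distinction between a server failure and a link failure can be illustrated with a small decision function. This is only a sketch of the general watchdog idea, not HAC's actual logic: the `Probe` fields, the `Diagnosis` values, and the `diagnose` function are hypothetical names introduced for illustration, and the sketch assumes the watchdog can reach each server independently of the link between them.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    ALL_HEALTHY = "all healthy"
    PRIMARY_FAILED = "primary server failed"
    SECONDARY_FAILED = "secondary server failed"
    LINK_DOWN = "link between servers down"
    BOTH_FAILED = "both servers failed"


@dataclass
class Probe:
    """Hypothetical result of one monitoring round."""
    primary_up: bool    # watchdog can reach the Primary
    secondary_up: bool  # watchdog can reach the Secondary
    link_up: bool       # servers report a working connection to each other


def diagnose(p: Probe) -> Diagnosis:
    # Server failures are checked first: if a server is unreachable,
    # the state of the link between the servers cannot be trusted.
    if not p.primary_up and not p.secondary_up:
        return Diagnosis.BOTH_FAILED
    if not p.primary_up:
        return Diagnosis.PRIMARY_FAILED
    if not p.secondary_up:
        return Diagnosis.SECONDARY_FAILED
    # Both servers respond, so a broken connection is a link failure.
    if not p.link_up:
        return Diagnosis.LINK_DOWN
    return Diagnosis.ALL_HEALTHY
```

In this sketch, a link failure is reported only when both servers answer the watchdog directly, which is what allows it to be told apart from a failed server.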

The purpose of recovery is to bring the failed component back into operation. Occasionally, further failures happen during recovery. They usually leave the system in a state of limited availability (only one server is up), awaiting human intervention. Typical recovery-time failures that are not handled automatically include:

▪ The failed database is so badly corrupted that it cannot be restarted.

▪ There is not enough free disk space to perform a catchup.
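The second failure above, insufficient disk space for a catchup, can in principle be detected before recovery is attempted. The following is a minimal sketch and not part of HAC: the function name `has_space_for_catchup` and the `required_bytes` threshold are hypothetical, and the amount of space a catchup actually needs depends on the installation.

```python
import shutil


def has_space_for_catchup(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space (hypothetical pre-recovery check)."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes
```

An administrator's script could call such a check on the database directory and raise an alert instead of starting a recovery that is likely to fail.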