Failure handling with High Availability Controller (HAC)


High Availability Controller (HAC) monitors the health and status of the HotStandby (HSB) servers. In failure situations, such as a database process failure or a computer hardware failure, HAC performs failovers and other necessary state transitions to maintain the best possible availability of the database service.

All failures considered here are assumed to occur while the system is in its normal, fully operational state, in which one HSB server is in the PRIMARY ACTIVE state and the other is in the SECONDARY ACTIVE state. Generally, HAC handles single failures only: it assumes that a new failure does not occur before the system has recovered from the previous one. There are, however, certain predefined multiple-failure scenarios that HAC can handle.

For single failures, HAC keeps the database service almost uninterrupted. If multiple failures occur, HAC attempts to avoid an erroneous system state, such as both servers acting as primary (dual primary servers).
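
The difference between the normal state and an erroneous one can be made concrete with a small sketch. The Python below is illustrative only and is not part of HAC: the function name classify_pair and the labels it returns are assumptions made for this example. The state strings PRIMARY ACTIVE and SECONDARY ACTIVE are the ones named above, and the sketch treats any pair in which both servers report a PRIMARY role as the dual-primary condition.

```python
def classify_pair(state_a: str, state_b: str) -> str:
    """Classify the combined state of two HSB servers.

    The labels "normal", "dual-primary" and "degraded" are hypothetical
    names chosen for this sketch; they are not HAC terminology.
    """
    a = state_a.strip().upper()
    b = state_b.strip().upper()

    # Normal operation: one PRIMARY ACTIVE and one SECONDARY ACTIVE server.
    if {a, b} == {"PRIMARY ACTIVE", "SECONDARY ACTIVE"}:
        return "normal"

    # The role is the first word of the state string (empty if unknown).
    role_a = a.split()[0] if a else ""
    role_b = b.split()[0] if b else ""

    # Both servers claim the primary role: the erroneous state to avoid.
    if role_a == "PRIMARY" and role_b == "PRIMARY":
        return "dual-primary"

    # Anything else: only one server is usable, or a state is unknown.
    return "degraded"
```

For example, classify_pair("PRIMARY ACTIVE", "SECONDARY ACTIVE") yields "normal", while two servers that both report a PRIMARY state are flagged as "dual-primary".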

HAC can handle the following failures:

▪ Single failures:

– the primary (ACTIVE) database server process fails,

– the secondary (ACTIVE) database server process fails,

– the computer that hosts the primary server fails,

– the computer that hosts the secondary server fails,

– a server is unresponsive to external clients.

If an External Reference Entity (ERE) is used, HAC can also handle a HotStandby link failure, that is, a lost connection between the two HotStandby database processes. For more information about the ERE, see External Reference Entity (ERE).

▪ Double failures:

– while recovering from a previous failure, a synchronization error occurs between the primary and the secondary database,

– while recovering from a previous failure, a server process fails while servers are re-establishing the HSB link.
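
The list above can be summarized as a small lookup, shown here as a hedged Python sketch. It only restates which failure types HAC handles and the one case that depends on an ERE; the names Failure, NEEDS_ERE and hac_can_handle are hypothetical and do not correspond to any HAC interface or configuration parameter.

```python
from enum import Enum


class Failure(Enum):
    """Failure types from the list above (names are illustrative)."""
    # Single failures
    PRIMARY_PROCESS = "primary database server process fails"
    SECONDARY_PROCESS = "secondary database server process fails"
    PRIMARY_HOST = "computer hosting the primary server fails"
    SECONDARY_HOST = "computer hosting the secondary server fails"
    UNRESPONSIVE = "server is unresponsive to external clients"
    HSB_LINK = "HotStandby link fails"
    # Double failures (occurring while recovering from a previous failure)
    SYNC_ERROR_IN_RECOVERY = "synchronization error between the databases"
    PROCESS_FAILS_IN_RECONNECT = "server process fails while the HSB link is re-established"


# Only the link failure requires an External Reference Entity.
NEEDS_ERE = {Failure.HSB_LINK}


def hac_can_handle(failure: Failure, ere_configured: bool) -> bool:
    """Return True if the failure is in HAC's scope for this setup."""
    return failure not in NEEDS_ERE or ere_configured
```

With this model, hac_can_handle(Failure.HSB_LINK, ere_configured=False) is False, and every other listed failure is in scope regardless of the ERE setting.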

The purpose of recovery is to bring the failed component back to operation. Occasionally, further failures happen during recovery. They usually lead to a situation where the system remains in a state of limited availability (only one server is available), awaiting human intervention. The following typical recovery-time failures are not automatically resolved:

▪ the failed database is so badly corrupted that it cannot be restarted,

▪ there is not enough free disk space to perform a catchup.
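
Because a shortage of disk space is not resolved automatically, an operator may want to check for it before or during recovery. The following Python sketch shows one way to do that outside HAC. It is an assumption-laden example: the function name enough_space_for_catchup is hypothetical, and the number of bytes a catchup needs is not stated in this topic, so the caller must supply an estimate.

```python
import shutil


def enough_space_for_catchup(db_dir: str, required_bytes: int) -> bool:
    """Check whether the file system holding db_dir has enough free space.

    required_bytes is an operator-supplied estimate (an assumption of this
    sketch); HAC itself does not expose such a value here.
    """
    free_bytes = shutil.disk_usage(db_dir).free
    return free_bytes >= required_bytes
```

A monitoring script could call enough_space_for_catchup with the database directory and an estimated size, and alert an administrator when it returns False.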

See

Primary server (or computer) fails

Secondary server (or node) fails

HotStandby link fails

Server is unresponsive to external clients

Summary of failure scenarios and HAC actions
