Failure handling in the grid


Note: Check the solidDB Release Notes for any limitations associated with using a grid in the current release.

The failures that might occur in a solidDB grid include node failures, network failures, and replication failures.

▪ A node failure occurs when a solidDB process unexpectedly terminates or hangs.

▪ A network failure occurs when two or more nodes become disconnected from each other.

▪ A replication failure usually occurs as a result of node or network failures.

After a node or network failure, the following mechanisms attempt to keep the grid functioning and able to accept and process transactions:

▪ Node controller: a separate program that monitors the health of node processes within the same host and performs corrective actions to keep processes alive. For more information, see Node controller.

▪ Grid Availability Manager (GAM): a built-in automatic failure-handling mechanism that is active in the grid leader and monitors the availability and replication status of all grid nodes. The GAM performs corrective actions to maintain both read and write access to the grid and the replication factor of partitions. For more information, see Grid Availability Manager.

▪ Consensus algorithm: an automated mechanism that ensures that there is always a grid leader (and therefore an active GAM) available in the grid whenever the majority of nodes are available. For more information, see solidDB consensus algorithm.
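
The majority condition in the consensus algorithm description can be illustrated with a minimal sketch. This is not solidDB code or a solidDB API: the function name `has_majority` and its parameters are hypothetical, and the sketch assumes that "majority" means strictly more than half of the configured grid nodes.

```python
def has_majority(available_nodes: int, total_nodes: int) -> bool:
    """Return True when strictly more than half of the grid nodes are available.

    Hypothetical helper for illustration only; it does not reflect how
    solidDB implements its consensus algorithm internally.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= available_nodes <= total_nodes:
        raise ValueError("available_nodes must be between 0 and total_nodes")
    # Strict majority: for 5 nodes at least 3 are needed, for 4 nodes at least 3.
    return available_nodes > total_nodes // 2


if __name__ == "__main__":
    # A 5-node grid that loses 2 nodes still has a majority.
    print(has_majority(3, 5))
    # A 4-node grid split evenly in two has no majority on either side.
    print(has_majority(2, 4))
```

Under this assumption, an even split caused by a network failure leaves neither side with a majority, which is why the guarantee of an available grid leader is stated only for the case where the majority of nodes are available.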

See Failure conditions and automatic corrective actions.