Failure conditions and automatic corrective actions
Note: Check the solidDB Release Notes for any limitations that are associated with using a grid in the current release.
The following sections summarize typical node, network, and replication failure conditions, the automatic corrective actions, and the status of the grid before and after the corrective actions.
For detailed information about the corrective actions that each mechanism provides, see the following topics:
Node controller
Grid Availability Manager
solidDB consensus algorithm
Node failures
The following table summarizes node failure conditions and the automatic corrective actions that are taken.
The automatic corrective actions that are detailed in the following table assume that a node controller has been configured. However, any mechanism that monitors and restarts nodes can be used in place of the node controller.
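The node controller's restart behavior can be pictured as a simple supervision loop with a bounded retry budget. The sketch below is illustrative only, not solidDB's implementation; the function names, the retry limit, and the probe/restart callbacks are all assumptions.

```python
import time

def supervise(is_alive, restart, max_attempts=3, backoff_s=0.0):
    """Try to bring a failed node back; return True if it recovers.

    is_alive: callable -> bool, probes the monitored node process.
    restart:  callable, attempts one restart of the node.
    A real node controller runs continuously; this models a single
    failure-handling episode. If the node never comes back, GAM (not
    the node controller) marks it MEMBER_FAILED and reconfigures
    the grid.
    """
    for _ in range(max_attempts):
        if is_alive():
            return True           # node recovered; no escalation needed
        restart()                 # attempt another restart
        time.sleep(backoff_s)     # give the process time to come up
    return is_alive()             # final probe after the last attempt
```

Any external process supervisor that provides equivalent probe-and-restart behavior can play the same role, as noted above.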
 
Failure condition: Single node fails (not the leader)
Status after failure, before corrective actions: If the primary replication unit for a replication group is on the failed node, clients cannot update data in the replication group but can query data from secondary replication units on other nodes.
Automatic corrective actions: The node controller attempts to restart the failed node. If the node does not recover, GAM sets the status of the failed node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal.

Failure condition: Leader node fails
Status after failure, before corrective actions: Grid reconfiguration and parameter changes cannot be implemented, and DDL statements cannot run. If the primary replication unit for a replication group is on the failed node, clients cannot update data in the replication group but can query data from secondary replication units on other nodes.
Automatic corrective actions: The remaining grid nodes elect a new leader by using the consensus algorithm, and the node controller attempts to restart the failed node. GAM starts on the new leader. If the failed node does not recover, GAM sets its status to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal. DDL statements and grid configuration statements can run again.

Failure condition: Multiple nodes (fewer than half) fail
Status after failure, before corrective actions: If the primary replication unit for a replication group is on a failed node, clients cannot update data in the replication group. If all replication units in a replication group are on failed nodes, clients cannot query data in the replication group.
Automatic corrective actions: The node controller for each failed node attempts to restart that node. For any node that does not recover, GAM sets the status of the node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal, with the following exception: if all replication units in a replication group were on failed nodes that did not recover, data in that replication group cannot be updated or queried.

Failure condition: Majority of nodes fail
Status after failure, before corrective actions: Clients cannot update or query any data in the grid. Because a majority of the nodes are not active, GAM cannot update the status of any failed node.
Automatic corrective actions: The node controller for each failed node attempts to restart that node. When a majority of the nodes in the grid are active again, a new leader is elected by using the consensus algorithm. For any node that does not recover, GAM (on the new leader) sets the status of the node to MEMBER_FAILED.
Status after successful corrective actions: Clients can update and query all data as normal, with the following exceptions: if all replication units in a replication group were on failed nodes that did not recover, data in that replication group cannot be updated or queried. If the grid still does not have a majority of nodes in an active state, the grid leader remains but no synchronously replicated transactions (DDL statements, configuration changes, or other GAM actions) can be committed.
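Several of the conditions above hinge on whether a strict majority (quorum) of grid members is active: a leader can be elected, and GAM can act, only when more than half of all nodes are reachable. A minimal sketch of that quorum rule, with an illustrative tie-break (real consensus algorithms use votes and terms; the names here are assumptions):

```python
def has_quorum(reachable: int, grid_size: int) -> bool:
    """True when a strict majority of all grid members is reachable."""
    return reachable > grid_size // 2

def elect_leader(active_nodes, grid_size):
    """Return a leader among the active nodes, or None.

    Election succeeds only when the active nodes form a majority;
    otherwise no leader exists and no synchronously replicated
    transaction can commit. The lowest node id winning is purely an
    illustrative tie-break, not solidDB's election rule.
    """
    if not has_quorum(len(active_nodes), grid_size):
        return None
    return min(active_nodes)
```

For example, in a five-node grid, three surviving nodes can elect a leader, but two cannot; in a four-node grid, two nodes are exactly half and therefore not a majority.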
Network failures
The following table summarizes network failure conditions and the automatic corrective actions that are taken.
 
Failure condition: Network failure between two nodes (not including the leader)
Status after failure, before corrective actions: Clients can update and query all data, but replication groups that have replication units on both nodes cannot synchronize the data.
Automatic corrective actions: GAM attempts to restart the replication between the nodes until either the connection is re-established or a node is removed from the grid.
Status after successful corrective actions: Clients can update and query all data as normal.

Failure condition: Network failure between the leader and another node
Status after failure, before corrective actions: If the primary replication unit for a replication group is on a node that the leader cannot reach, clients cannot update data in the replication group but can query data from secondary replication units on other nodes.
Automatic corrective actions: GAM sets the status of the inaccessible node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal.

Failure condition: Network failure splits the grid (the leader is still connected to a majority of nodes)
Status after failure, before corrective actions: If the primary replication unit for a replication group is not on a node that is connected to the leader, clients cannot update data in the replication group. If no replication unit in the replication group is on a node that is connected to the leader, clients cannot query data in the replication group.
Automatic corrective actions: GAM sets the status of the inaccessible nodes to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal, with the following exception: if all the replication units in a replication group are on unconnected nodes, data in that replication group cannot be updated or queried.

Failure condition: Network failure splits the grid (the leader is connected to a minority of nodes)
Status after failure, before corrective actions: The original leader remains in the group with the minority of nodes, but no synchronously replicated transactions (DDL statements, configuration changes, or other GAM actions) can be committed.
Automatic corrective actions: If any set of connected nodes can form a majority, they elect a new leader by using the consensus algorithm. GAM starts on the new leader, treats the unconnected nodes as failed, sets their status to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid.
Status after successful corrective actions: Clients can update and query all data as normal, with the following exception: if all replication units in a replication group are on unconnected nodes, data in that replication group cannot be updated or queried.
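When a network failure splits the grid, each side independently applies the same majority rule: only the partition that holds a strict majority keeps (or elects) a leader and continues to commit synchronously replicated transactions. The sketch below models that classification; the return structure, node ids, and minimum-id tie-break are illustrative assumptions, not solidDB behavior.

```python
def partition_outcome(partition, grid_size, old_leader):
    """Classify one side of a network split.

    The majority side keeps the old leader if it is present (or
    elects a new one) and marks the unreachable nodes MEMBER_FAILED.
    A minority side keeps any old leader it happens to contain, but
    that leader cannot commit DDL or configuration changes.
    """
    if len(partition) > grid_size // 2:          # strict majority
        leader = old_leader if old_leader in partition else min(partition)
        return {"leader": leader, "can_commit": True,
                "marks_others_failed": True}
    # Minority side: no new election is possible.
    leader = old_leader if old_leader in partition else None
    return {"leader": leader, "can_commit": False,
            "marks_others_failed": False}
```

For example, if a five-node grid splits into {1, 2, 3} and {4, 5} with node 5 as the old leader, the three-node side elects a new leader and keeps committing, while the side containing node 5 retains it as a leader that can no longer commit synchronous transactions.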
Replication failures
The following table summarizes replication failure conditions and the automatic corrective actions that are taken.
When running in grid mode, each node creates multiple replication connections and subscriptions to all other nodes in the grid. These subscriptions might fail for various reasons, for example because of a node failure or a loss of network connectivity.
Note: The connections and subscriptions are internal to the grid and should not be started or stopped by the database administrator.
 
Failure condition: Replication between one or more nodes fails
Status after failure, before corrective actions: Replication groups in which one of the affected nodes holds the primary replication unit and another holds a secondary replication unit cannot maintain the configured replication factor; that is, updates made after the replication fails are stored on fewer nodes than expected. Replication of metadata is affected if one of the nodes is the grid leader. Each node reports replication failures in solmsg.out.
Automatic corrective actions: If the node or network failures can be resolved, the grid nodes restart the replication subscriptions as required. GAM does not remove nodes to re-establish the replication factor.
Status after successful corrective actions: Clients can update and query all data as normal.
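The degraded state described above can be made concrete: with a replication factor of N, each committed update should reach N replication units, and a broken subscription leaves recent updates on fewer nodes than configured. A minimal illustrative model (function names and data shapes are assumptions, not solidDB internals):

```python
def effective_replicas(replica_nodes, reachable_nodes):
    """Count the copies of a replication group's data that can
    currently receive updates over working replication links."""
    return len(set(replica_nodes) & set(reachable_nodes))

def replication_health(replica_nodes, reachable_nodes, factor):
    """Report whether the configured replication factor is met.

    As noted above, GAM does not evict nodes to restore the factor;
    the grid waits for the failed subscriptions to be restarted once
    the node or network failure is resolved.
    """
    live = effective_replicas(replica_nodes, reachable_nodes)
    return {"live_copies": live, "factor_met": live >= factor}
```

For example, a group replicated to three nodes of which only two are reachable reports two live copies and an unmet factor of three, matching the "fewer nodes than expected" condition in the table.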
See also: Actions of clients and nodes as a result of failures