| Failure condition | Status after failure but before corrective actions | Automatic corrective actions | Status after successful corrective actions |
|---|---|---|---|
| Single node fails (not the leader) | If the primary replication unit for a replication group is on the failed node, clients cannot update data in the replication group but can query data from secondary replication units on other nodes. | The node controller attempts to restart the failed node.<br>If the node does not recover, GAM sets the status of the failed node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal. |
| Leader node fails | Grid reconfiguration and parameter changes cannot be implemented. DDL statements cannot run.<br>If the primary replication unit for a replication group is on the failed node, clients cannot update data in the replication group but can query data from secondary replication units on other nodes. | The remaining grid nodes elect a new leader by using the consensus algorithm.<br>The node controller attempts to restart the failed node.<br>GAM starts on the new leader. If the failed node does not recover, GAM sets the status of the node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal. DDL statements and grid configuration statements can run again. |
| Multiple nodes (fewer than half) fail | If the primary replication unit for a replication group is on a failed node, clients cannot update data in the replication group.<br>If all replication units in a replication group are on failed nodes, clients cannot query data in the replication group. | The node controller for each node attempts to restart the node.<br>For any node that does not recover, GAM sets the status of the node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal, with the following exception:<br>If all replication units in a replication group were on failed nodes that did not recover, data in the replication group cannot be updated or queried. |
| Majority of nodes fail | Clients cannot update or query any data in the grid. | Because a majority of nodes are not active, GAM is unable to update the status of any failed nodes.<br>The node controller for each node attempts to restart the node.<br>When a majority of nodes in the grid are active again, a new leader is elected by using the consensus algorithm. For any node that does not recover, GAM (on the new leader) sets the status of the node to MEMBER_FAILED. | Clients can update and query all data as normal, with the following exceptions:<br>▪ If all replication units in a replication group were on failed nodes that did not recover, data in the replication group cannot be updated or queried.<br>▪ If the grid still does not have a majority of nodes in an active state, the grid leader remains but no synchronously replicated transactions (DDL statements, configuration changes, or other GAM actions) can be committed. |

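
The "before corrective actions" column of the node-failure table can be sketched as a small model. This is only an illustration under stated assumptions, not the product's API: `ReplicationGroup`, `has_majority`, and `status_before_recovery` are hypothetical names, and treating a grid with exactly half of its nodes active as having lost its majority is an assumption the table does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationGroup:
    """Hypothetical model of one replication group."""
    name: str
    primary: str        # node holding the primary replication unit
    secondaries: tuple  # nodes holding secondary replication units


def has_majority(total_nodes: int, active_nodes: int) -> bool:
    """Strict majority: more than half of all grid nodes are active.

    Assumption: exactly half active does not count as a majority.
    """
    return 2 * active_nodes > total_nodes


def status_before_recovery(nodes, failed, group):
    """Return (can_update, can_query) for one group before corrective actions."""
    active = set(nodes) - set(failed)
    if not has_majority(len(nodes), len(active)):
        # "Majority of nodes fail": no data in the grid can be updated or queried.
        return (False, False)
    # Updates need the primary replication unit on an active node.
    can_update = group.primary in active
    # Queries can be served by any replication unit on an active node.
    can_query = can_update or any(n in active for n in group.secondaries)
    return (can_update, can_query)


nodes = ["n1", "n2", "n3", "n4", "n5"]
rg1 = ReplicationGroup("rg1", primary="n1", secondaries=("n2",))

print(status_before_recovery(nodes, {"n3"}, rg1))              # unrelated node fails
print(status_before_recovery(nodes, {"n1"}, rg1))              # primary's node fails
print(status_before_recovery(nodes, {"n1", "n2"}, rg1))        # all units on failed nodes
print(status_before_recovery(nodes, {"n3", "n4", "n5"}, rg1))  # majority fails
```

The model deliberately ignores the corrective actions themselves (node restart, new primary selection); it only classifies what clients can do while the failure is still unresolved.
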
| Failure condition | Status after failure but before corrective actions | Automatic corrective actions | Status after successful corrective actions |
|---|---|---|---|
| Network failure between two nodes (not including the leader) | Clients can update and query all data, but replication groups that have replication units on both nodes are unable to synchronize the data. | GAM attempts to restart the replication between the nodes until either the connection is re-established or the node is removed from the grid. | Clients can update and query all data as normal. |
| Network failure between leader and another node | If the primary replication unit for a replication group is on a node that the leader cannot reach, clients cannot update data in the replication group but can query data from secondary replication units on other nodes. | GAM sets the status of the inaccessible node to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal. |
| Network failure splits grid (leader is still connected to majority of nodes) | If the primary replication unit for a replication group is not on a node that is connected to the leader, clients cannot update data in the replication group.<br>If no replication unit in the replication group is on a node connected to the leader, clients cannot query data in the replication group. | GAM sets the status of the inaccessible nodes to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal, with the following exception:<br>If all replication units in a replication group are on unconnected nodes, data in the replication group cannot be updated or queried. |
| Network failure splits grid (leader is connected to a minority of nodes) | The original leader remains in the group with the minority of nodes, but no synchronously replicated transactions (DDL statements, configuration changes, or other GAM actions) can be committed. | If any set of connected nodes can form a majority, they elect a new leader by using the consensus algorithm. GAM starts on the new leader. GAM treats unconnected nodes as failed, sets the status of those nodes to MEMBER_FAILED, selects a new primary replication unit for every replication group from the remaining grid members, and reconfigures the grid. | Clients can update and query all data as normal, with the following exception:<br>If all replication units in a replication group are on unconnected nodes, data in the replication group cannot be updated or queried. |

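
The two grid-split rows reduce to one question: which side of the split, if any, holds a strict majority of the nodes. The following is a minimal sketch of that decision, assuming a simple set-based model; `resolve_partition` and the keys of its result are hypothetical and are not part of any product interface.

```python
def resolve_partition(all_nodes, partitions, leader):
    """Classify the outcome of a network split.

    all_nodes:  set of every node in the grid
    partitions: list of disjoint sets of mutually connected nodes
    leader:     the node that was leader before the split
    """
    total = len(all_nodes)
    # At most one partition can contain more than half of the nodes.
    majority = next((p for p in partitions if 2 * len(p) > total), None)

    if majority is None:
        # No side has a majority: the leader remains, but synchronously
        # replicated transactions cannot be committed and no node status
        # can be changed.
        return {"can_commit": False, "needs_election": False,
                "member_failed": set()}

    return {
        "can_commit": True,
        # A new leader is elected only if the old one is cut off from the majority.
        "needs_election": leader not in majority,
        # Nodes outside the majority are treated as failed.
        "member_failed": set(all_nodes) - set(majority),
    }


grid = {"n1", "n2", "n3", "n4", "n5"}

# Leader n1 stays with the majority side.
print(resolve_partition(grid, [{"n1", "n2", "n3"}, {"n4", "n5"}], leader="n1"))

# Leader n1 ends up on the minority side, so the other side elects a new leader.
print(resolve_partition(grid, [{"n1", "n2"}, {"n3", "n4", "n5"}], leader="n1"))
```
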
| Failure condition | Status after failure but before corrective actions | Automatic corrective actions | Status after successful corrective actions |
|---|---|---|---|
| Replication between one or more nodes fails | Replication groups in which one of the nodes has the primary replication unit and another has a secondary replication unit cannot maintain the configured replication factor. That is, updates made after the replication fails are stored on fewer nodes than expected. Replication of metadata is affected if one of the nodes is the grid leader. Each node reports replication failures in solmsg.out. | If the node or network failures can be resolved, grid nodes restart replication subscriptions as required.<br>GAM does not remove nodes to re-establish the replication factor. | Clients can update and query all data as normal. |

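
The effect of a replication failure on the replication factor can be illustrated with a short calculation. This is a sketch under an explicit assumption: each secondary replication unit receives updates directly from the node holding the primary, which the table does not state. The function names and the link representation are hypothetical.

```python
def copies_receiving_updates(primary, secondaries, broken_links):
    """Count the nodes that still receive new updates for one replication group.

    broken_links: set of frozenset({a, b}) node pairs between which
    replication has failed (hypothetical representation).
    """
    reachable = [s for s in secondaries
                 if frozenset((primary, s)) not in broken_links]
    return 1 + len(reachable)  # the primary itself plus reachable secondaries


def below_replication_factor(primary, secondaries, broken_links, replication_factor):
    """True if new updates are stored on fewer nodes than configured."""
    return copies_receiving_updates(primary, secondaries, broken_links) < replication_factor


broken = {frozenset(("n1", "n2"))}
print(copies_receiving_updates("n1", ["n2", "n3"], broken))
print(below_replication_factor("n1", ["n2", "n3"], broken, replication_factor=3))
```

Because GAM does not remove nodes to re-establish the replication factor, a group in this state stays under-replicated until the underlying node or network failure is resolved.
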