Grid Availability Manager
Note: Check the solidDB Release Notes for any limitations that are associated with using a grid in the current release.
In the grid leader, the Grid Availability Manager (GAM) monitors the availability and replication status of all grid nodes.
By default, the GAM is started in every node but is active only in the grid leader. In followers, the GAM is in passive mode (see solidDB consensus algorithm).
If the node is the grid leader, the GAM takes actions in the following situations:
a new grid node is added or an unavailable node becomes available again,
a node becomes the grid leader,
a grid node becomes unavailable or is dropped,
the network, or part of the network, fails.
If the node is a follower (not the grid leader), or a candidate (temporary transitional state between grid leader and follower), the GAM remains passive except when partition replication units are rearranged. In this case, the GAM in each grid node either loads new replication units or deletes excess replication units.
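The split between the active and passive roles can be pictured with a small sketch. It is only an illustration of the rules described above, not solidDB code: the role strings, the event names, and the function `gam_is_active` are hypothetical, and the sketch assumes that the leader's GAM also takes part when replication units are rearranged.

```python
# Hypothetical event names; they only mirror the situations listed above.
LEADER_EVENTS = {
    "node_added_or_reconnected",
    "became_leader",
    "node_unavailable_or_dropped",
    "network_failure",
}
REARRANGE_EVENT = "replication_units_rearranged"


def gam_is_active(role, event):
    """Return True if the GAM on a node with this role acts on the event.

    role is one of "leader", "follower", "candidate" (illustrative strings).
    """
    # Assumption: every node, whatever its role, loads or deletes
    # replication units when partitions are rearranged.
    if event == REARRANGE_EVENT:
        return True
    # All other situations are handled by the grid leader only.
    return role == "leader" and event in LEADER_EVENTS
```

For example, `gam_is_active("follower", "network_failure")` is false, while `gam_is_active("follower", "replication_units_rearranged")` is true.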
The following sections list the tasks that the GAM performs in specific situations.
New or reconnected node
When a new node is added to the grid or an existing node is reconnected, the GAM in the grid leader performs the following tasks:
allocates secondary replication units to balance the total number of replication units for each partition,
requests that the GAM in the new node load data to the new replication units,
when data load is complete, balances the distribution of primary replication units by switching some secondary replication units to primary replication units (and some primary replication units to secondary replication units),
removes any excess secondary replication units.
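The primary-balancing step can be sketched as a greedy role switch. This is a minimal illustration under stated assumptions, not the algorithm that solidDB uses: the function `balance_primaries`, its data layout, and the tie-breaking by node name are all hypothetical. It assumes that a primary can be switched only to a node that already holds a loaded replication unit of the same partition.

```python
from collections import Counter


def balance_primaries(replicas, primary):
    """Spread primary replication units more evenly across nodes.

    replicas: partition -> set of nodes holding a replication unit of it.
    primary:  partition -> node that currently holds the primary unit.
    Returns a new partition -> primary-node mapping (inputs are not modified).
    """
    primary = dict(primary)
    nodes = set().union(*replicas.values())
    while True:
        # Count primaries per node, including nodes that hold none.
        load = Counter({n: 0 for n in nodes})
        load.update(primary.values())
        moved = False
        # Visit nodes from most to least loaded (name breaks ties).
        for src in sorted(nodes, key=lambda n: (-load[n], n)):
            for part in sorted(p for p, n in primary.items() if n == src):
                # Only switch if the target stays strictly below the source.
                candidates = [n for n in replicas[part]
                              if load[n] + 1 < load[src]]
                if candidates:
                    dst = min(candidates, key=lambda n: (load[n], n))
                    primary[part] = dst  # secondary on dst becomes primary
                    moved = True
                    break
            if moved:
                break
        if not moved:
            return primary
```

With three partitions replicated on nodes A, B, and C and all primaries initially on A, the sketch ends with one primary on each node. Each switch strictly reduces the imbalance, so the loop terminates.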
Node becomes grid leader
When a node assumes leadership after being a follower, the GAM in the grid leader performs the following tasks:
checks whether all ranges have a primary replication unit and, if not, switches secondary replication units to primary replication units as required,
allocates new secondary replication units and removes excess replication units in order to satisfy the replication factor,
checks the number of primary and secondary replication units on each node and balances the grid by moving replication units as required.
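The first two takeover checks can be sketched as a function that produces a list of planned actions. This is a hedged illustration only: `takeover_plan`, the action names, and the data layout are hypothetical, and the sketch assumes that the replication factor equals the total number of replication units (primary plus secondaries) for each range.

```python
def takeover_plan(units, replication_factor):
    """Plan the actions a new leader's GAM might take for each range.

    units: range id -> {node: "primary" | "secondary"}.
    Returns a list of (action, range id, detail) tuples.
    """
    plan = []
    for rng in sorted(units):
        holders = units[rng]
        # Step 1: a range without a primary gets one of its secondaries promoted.
        if holders and "primary" not in holders.values():
            node = min(n for n, role in holders.items() if role == "secondary")
            plan.append(("promote", rng, node))
        # Step 2: compare the unit count with the replication factor.
        delta = replication_factor - len(holders)
        if delta > 0:
            plan.append(("allocate_secondary", rng, delta))
        elif delta < 0:
            plan.append(("remove_excess", rng, -delta))
    return plan
```

The third check, balancing units across nodes, is not covered here because it depends on placement rules that this section does not describe.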
Node is removed or fails
When a node is removed from the grid or fails, the GAM in the grid leader performs the following tasks:
disconnects the node from the grid by switching the node membership state from MEMBER_ONLINE to either MEMBER_OFFLINE (if the node is removed) or MEMBER_FAILED (if the node has failed),
informs the rest of the nodes that the node is unavailable,
designates new primary replication units on other active nodes to replace any primary replication units that were on the unavailable node,
maintains the required number of replication units for each partition (determined by the replication factor) by creating new replication units on functional nodes,
if the node failed, keeps checking whether the node has recovered and automatically reconnects the node when it becomes responsive.
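The membership state change can be sketched as a small state transition. The three state names come from the text above; the `Membership` enum and the functions `mark_unavailable` and `needs_recovery_check` are hypothetical names used only for illustration.

```python
from enum import Enum


class Membership(Enum):
    MEMBER_ONLINE = "MEMBER_ONLINE"
    MEMBER_OFFLINE = "MEMBER_OFFLINE"
    MEMBER_FAILED = "MEMBER_FAILED"


def mark_unavailable(state, removed):
    """Return the new membership state for a node that left the grid.

    removed=True means the node was removed deliberately;
    removed=False means it failed.
    """
    if state is not Membership.MEMBER_ONLINE:
        raise ValueError(f"node is not online: {state.name}")
    return Membership.MEMBER_OFFLINE if removed else Membership.MEMBER_FAILED


def needs_recovery_check(state):
    """Only failed nodes are polled for recovery and reconnected automatically."""
    return state is Membership.MEMBER_FAILED
```

Keeping removed and failed nodes in separate states lets the leader keep checking only the nodes that are expected to come back.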
Network failure
In the case of a network failure, the GAM in the grid leader detects one of the following conditions:
Loss of connection between itself and one or more nodes.
If the GAM (and therefore the grid leader) can no longer connect to the majority of the grid nodes, the GAM demotes the grid leader to a follower. The node, now a follower, then waits to be reconnected to the other grid nodes; a new leader is elected by using the consensus algorithm (see solidDB consensus algorithm).
If the GAM cannot connect to a minority of the nodes, the GAM treats those nodes as failed nodes.
Broken replication connections between nodes that are affected by the failure.
In this case, the GAM detects the network failure from the replication statuses that are reported by the affected nodes. If a node is still active, the node can restart the replication. If the node controller detects the node as failed, the node controller attempts to restart the node (see Node controller).
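The majority rule for lost connections can be sketched as follows. This is an assumption-laden illustration, not solidDB behavior: `leader_reaction` and its return values are hypothetical, the leader is assumed to count itself as reachable, and a grid split exactly in half is assumed to count as a lost majority.

```python
def leader_reaction(total_nodes, unreachable):
    """Decide how the leader's GAM reacts to lost connections.

    total_nodes: number of nodes in the grid, including the leader.
    unreachable: set of node names the leader cannot connect to.
    Returns (action, nodes to treat as failed).
    """
    if not unreachable:
        return ("no_action", [])
    # Assumption: the leader counts itself among the reachable nodes.
    reachable = total_nodes - len(unreachable)
    # Assumption: exactly half of the grid is not a majority.
    if 2 * reachable <= total_nodes:
        return ("step_down", [])
    return ("mark_failed", sorted(unreachable))
```

In a five-node grid, losing two nodes leaves the leader with a majority, so the two nodes are treated as failed. Losing three nodes causes the leader to step down to follower.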