How the Watchdog application works
The Watchdog sample application notifies you when the Primary server is down. In normal mode, the Watchdog checks the connection status of the servers by issuing the hotstandby status connect command on both the Primary and Secondary servers.
The Watchdog performs this check between servers at regular intervals. The interval time is set with the PingInterval parameter in the Watchdog's solid.ini configuration file.
The Watchdog concludes that there is a problem in the HotStandby system when it receives no response from the Primary node, the Secondary node, or both nodes after a given number of polling attempts. The number of attempts is set with the NumRetry parameter in the [Watchdog] section of the Watchdog's solid.ini configuration file.
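The polling rule described above can be sketched in Python as follows. This is an illustration only, not the Watchdog sample's actual code: the parameter values, the helper names load_watchdog_settings and server_considered_down, and the choice to count consecutive missed responses are all assumptions made for this example.

```python
import configparser

# Hypothetical excerpt of the Watchdog's solid.ini; the values are assumptions.
SAMPLE_INI = """
[Watchdog]
PingInterval = 1000
NumRetry = 3
"""


def load_watchdog_settings(ini_text):
    """Read PingInterval and NumRetry from the [Watchdog] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["Watchdog"]
    return section.getint("PingInterval"), section.getint("NumRetry")


def server_considered_down(poll_results, num_retry):
    """Return True once num_retry consecutive polls received no response.

    poll_results is a sequence of booleans, one per polling attempt
    (True = the server answered). Counting consecutive misses is an
    assumption about how the retry limit is applied.
    """
    consecutive_misses = 0
    for responded in poll_results:
        consecutive_misses = 0 if responded else consecutive_misses + 1
        if consecutive_misses >= num_retry:
            return True
    return False
```

With NumRetry set to 3, a server that answers once and then misses three polls in a row would be treated as down, while a server that misses two polls and then answers would not.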
The Watchdog also observes whether the Primary server and the Secondary server are connected to each other. If the Primary or Secondary server returns a successful connect status to the Watchdog, this means the Primary and Secondary are still connected. If it returns an error, on the other hand, then the Primary and Secondary are no longer connected.
If the AutoSwitch parameter in the Watchdog configuration file is set to YES, then the Watchdog is also responsible for automatically switching server states in the event of a Primary failure. For example, when the Primary server is down, the Watchdog switches the Secondary server to make it the new Primary and put it in PRIMARY ALONE state. If the AutoSwitch parameter is set to NO, the Watchdog does not change the server state itself, but instead writes a message to the Watchdog log to notify the user to switch server states.
To continue monitoring, the Watchdog then switches to failure mode, in which it keeps polling the failed servers for a working connection.
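The AutoSwitch behavior described above can be summarized in a small decision function. This is a sketch under stated assumptions: the function names, the action labels, and the log message wording are hypothetical and are not taken from the Watchdog sample.

```python
def is_yes(value):
    """Interpret a YES/NO parameter value (case-insensitive)."""
    return value.strip().upper() == "YES"


def on_primary_failure(auto_switch_value):
    """Return (action, log_message, next_mode) for a detected Primary failure.

    The action labels and messages are illustrative names only.
    """
    if is_yes(auto_switch_value):
        # AutoSwitch = YES: the Watchdog itself promotes the Secondary.
        action = "set_secondary_primary_alone"
        message = "Primary is down; Secondary switched to PRIMARY ALONE"
    else:
        # AutoSwitch = NO: the Watchdog only logs a notice for the user.
        action = "none"
        message = "Primary is down; switch the Secondary to PRIMARY ALONE manually"
    # In both cases the Watchdog continues monitoring in failure mode.
    return action, message, "failure"
```

Calling on_primary_failure("YES") yields the switch action, while on_primary_failure("NO") yields no state change and only a log message; both return "failure" as the next mode.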
Failure mode
When the Watchdog sample application knows that the HotStandby Primary and Secondary servers are connected, it stays in normal mode. If one of the servers fails, or if the communication link between the servers fails, the Watchdog takes action. If that action does not reconnect the servers, the Watchdog goes into failure mode.
After the Watchdog enters failure mode, the Watchdog waits for the system administrator to fix the problem with the Primary and Secondary servers. If, in the meantime, a second failure occurs, the Watchdog does not handle the failure. This limitation in the Watchdog is deliberate. There are situations where a series of failures and even seemingly appropriate responses can cause the error of having two Primary servers (either in PRIMARY ALONE or STANDALONE states). This is especially true if there are brief failures in the network, but no failures in the database servers themselves. An example that produces two Primary servers is provided in Coding the Watchdog for multiple failures.
During failure mode, the Watchdog polls both the Primary and Secondary servers. When it is able to connect to both servers, it sends the hotstandby state command to both servers to see whether it can communicate with them and to see which state each of them is in.
Once the Watchdog is able to communicate with both servers, it will decide what to do next based on the solid.ini parameter DualSecAutoSwitch. If DualSecAutoSwitch = Yes and both servers are secondary, then the Watchdog will automatically select one of the two secondaries to be a new primary and switch it to primary. If DualSecAutoSwitch = No then the system administrator must switch one server to be the primary. Note that DualSecAutoSwitch applies whether the Watchdog is in “normal” mode or “failure” mode.
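The DualSecAutoSwitch decision can be sketched as follows. The source does not say how the Watchdog chooses which secondary to promote, so picking the first server here is purely an assumption, as are the function name, the return labels, and the prefix match on the state string.

```python
def decide_after_reconnect(state_1, state_2, dual_sec_auto_switch):
    """Decide what to do once both servers have reported their state.

    state_1 and state_2 are the state strings reported by the two servers.
    Matching on the "SECONDARY" prefix is an assumption made so that any
    secondary state variant is treated the same way.
    """
    both_secondary = (
        state_1.startswith("SECONDARY") and state_2.startswith("SECONDARY")
    )
    if not both_secondary:
        return "no_dual_secondary"
    if dual_sec_auto_switch:
        # Assumption: promote the first server; the real selection rule
        # is not described in this section.
        return "promote_server_1"
    # DualSecAutoSwitch = No: the administrator must switch one server.
    return "wait_for_administrator"
```

For example, two servers that both report SECONDARY lead to an automatic promotion only when DualSecAutoSwitch is enabled; otherwise the function signals that the administrator has to act.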
Coding the Watchdog for multiple failures
There are two ways to handle multiple failures in the Watchdog:
1 After each failure (and the Watchdog's automatic response), require manual (human) intervention to check the situation. Manual intervention may involve actions such as restarting a server or fixing a network problem. This is the approach that the Watchdog sample uses because it reduces the risk of having two Primary servers.
2 Write a watchdog application that can handle multiple failures over time. This approach runs the risk of having two Primary servers, as shown in the following example.
Dual primaries
In this example, Server1 is initially the Primary and Server2 is initially the Secondary.
1 A network failure occurs and Server1 becomes inaccessible.
2 The Watchdog switches Server2 from SECONDARY to PRIMARY ALONE.
3 A second network failure occurs, and Server2 becomes inaccessible.
4 The first network failure is repaired, and Server1 becomes accessible again.
5 The Watchdog, seeing that Server1 is accessible and Server2 is not, switches Server1 to PRIMARY ALONE.
6 The second network failure is fixed and Server2 becomes accessible again.
7 At this point, both Server1 and Server2 are in the PRIMARY ALONE state.
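The sequence above can be replayed with a small simulation. The sketch models a hypothetical watchdog that reacts to every failure by promoting whichever single server it can still reach; the function names, the simplified state labels "PRIMARY" and "SECONDARY", and the reachability timeline are assumptions made for illustration.

```python
def naive_watchdog_react(states, reachable):
    """Promote the only reachable server to PRIMARY ALONE.

    This models a watchdog that handles every failure automatically,
    with no manual check between failures.
    """
    up = [name for name in states if reachable[name]]
    if len(up) == 1:
        states[up[0]] = "PRIMARY ALONE"
    return states


def replay_dual_primary_scenario():
    """Replay the failure sequence from the example and return final states."""
    states = {"Server1": "PRIMARY", "Server2": "SECONDARY"}
    timeline = [
        {"Server1": False, "Server2": True},   # steps 1-2: Server1 unreachable
        {"Server1": False, "Server2": False},  # step 3: Server2 unreachable too
        {"Server1": True, "Server2": False},   # steps 4-5: Server1 is back
        {"Server1": True, "Server2": True},    # step 6: Server2 is back
    ]
    for reachable in timeline:
        naive_watchdog_react(states, reachable)
    return states
```

Running replay_dual_primary_scenario() leaves both Server1 and Server2 in the PRIMARY ALONE state, which is the outcome that requiring manual intervention after the first failure is meant to prevent.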
See also
HotStandby configuration using Watchdog