Advanced Replication Guide : solidDB® data synchronization architecture : Sync Pull Notify : Introduction to Sync Pull Notify
  
Introduction to Sync Pull Notify
Before we discuss Sync Pull Notify in detail, we should contrast it with other methods of replicating data.
"Sync Push" vs. "Sync Pull" vs. "Sync Pull Notify"
When data is replicated, the servers can use
a "pull" approach, or
a "push" approach, or
a hybrid, such as "sync pull notify".
solidDB® advanced replication supports the "pull" and "pull notify" approaches.
The "pull" approach is that the replica requests the data. This corresponds to the "refresh" operation in solidDB® advanced replication. When the replica refreshes, it requests that the master server send all of the data in a particular publication (or the data that has changed since the last refresh operation).
The "push" approach is that the master sends the data to the replica at a time chosen by the master. (This is usually, but not necessarily, immediately after the data was updated on the master). When the master pushes the data, the replica must accept the data. solidDB® servers do not use a true "push" method.
solidDB® uses a hybrid approach called "sync pull notify". In this approach, the replica is notified that new data is available. The replica may then "pull" the new data by executing a REFRESH command. The REFRESH is optional; after being notified that there is new data, the replica can choose to refresh immediately, or to refresh after delaying, or simply to ignore the notification and not refresh at all. See the next section for details.
Implementing Sync Pull Notify
The solidDB® server implements Sync Pull Notify as a two-step process. First, the master server notifies the replica server; second, the replica server does a REFRESH.
Generally, the master server notifies the replica by calling a remote stored procedure. The replica must already have created this stored procedure. Normally, this procedure will itself contain the appropriate REFRESH command(s). (It is possible to use more indirect methods, but this approach is the shortest and simplest.)
What induces the master to call the remote stored procedure? One way to do this is to create a trigger. For example, suppose that you know that replica_node_01 would like to refresh data in the table named "employees" every time the table is changed on the master. You could create INSERT, UPDATE, and DELETE triggers on the master's copy of the employees table. Each trigger would call the remote stored procedure on replica_node_01 that requests a refresh of the data.
trigger
 |
 |
 v
stored procedure
 |
 |
 v
REFRESH statement
Although the process of notifying a replica is straightforward, there is no automated way to decide which replica(s) should be notified. The person who creates or updates the triggers on the master must know the name of the stored procedure to call on each replica that wants to be notified.
Similarly, the replica must have REFRESH statements that specify the appropriate publications. There is no automated way of having the master server figure out which replicas subscribe to which publications and then automatically creating triggers to call procedures on those replicas. (This wouldn't be sufficient anyway because the master server would also have to compose the appropriate stored procedure and force the replica to create that stored procedure.)
Whether you use a trigger or a START AFTER COMMIT or some other method to notify replicas that data in the master has been updated, it is important to remember that only the specified nodes are notified.
Note If you decide to use a trigger to initiate Sync Pull Notify or use some other mechanism to run synchronization very frequently, always make sure that you are aware of the possible performance consequences. Because synchronization messaging is based on a store & forward architecture that writes data to disk before sending it over the network, there is always some overhead related to synchronization messaging. The relative overhead of the messaging is the bigger, the smaller amount of data is to be synchronized. For example, a solidDB® server is capable of performing hundreds or thousands of update operations per second but it may be able to handle only some dozens of synchronization messages per second. If you write an update trigger into a frequently updated table to initiate synchronization for each occurred update, the update performance is limited by the synchronization messaging performance. Instead of achieving performance of 1000 updates per second, you may be able to run only 10 updates per second because every update causes a synchronization message that contains only one row of changed data. The remedy to this issue is to make the synchronization occur only, for example, once every few seconds. This way, each synchronization message contains more than one row in it and the overhead of the synchronization message is only a fraction of the worst case scenario.
DEFAULT keyword in remote stored procedures
To understand the DEFAULT keyword in remote stored procedures, you must understand both remote stored procedures and the START AFTER COMMIT command.
A START AFTER COMMIT command may contain the optional clause:
FOR EACH REPLICA [WHERE ... ];
For example
START AFTER COMMIT FOR EACH REPLICA
WHERE region = ’west’
UNIQUE CALL my_proc;
or, if you'd like the command to apply to all replicas:
START AFTER COMMIT FOR EACH REPLICA
UNIQUE CALL my_proc;
The server that is executing this statement effectively creates a list of all the replicas that match the WHERE clause and then calls the specified procedure (my_proc) for each replica on the list. While that stored procedure is running, the DEFAULT clause identifies the replica that is currently being processed. For example, suppose that we have three replicas that match the WHERE clause, i.e. three replicas for which region = 'west':
California
Oregon
Washington
The my_proc procedure is then called three times. During the first call DEFAULT is equal to "California"; during the second call DEFAULT is equal to "Oregon"; and during the third call the DEFAULT is equal to "Washington". If the stored procedure named my_proc wants to call a remote stored procedure on each of the three replicas, it can do so by using the syntax:
CALL remote_proc_name AT DEFAULT;
Each time that my_proc is called, DEFAULT will be set to the name of one of the three replicas, so the effect will be to call each of the following:
CALL remote_proc_name AT California;
CALL remote_proc_name AT Oregon;
CALL remote_proc_name AT Washington;
If the "remote_proc_name" contains a command to refresh from the master, then the effect will be that all three replicas are notified that there is updated data and they should refresh. Thus by using a combination of START AFTER COMMIT and remote procedure calls we have been able to notify every replica that needs to refresh the data.
One way to make such maintenance easier is to use the "properties" feature described earlier. For example, suppose that you have three replica servers named California, Oregon, and Washington, all of which refresh from a particular publication. Now suppose that you want to add a new replica named Arizona and it should also refresh from the same publication and thus should be notified under the same circumstances as the servers California, Oregon, and Washington. You can simply set the property for that new server to match the property that the other three servers have (and which is used in the WHERE clause of the START AFTER COMMIT). For example,
SET SYNC PROPERTY region = ’west’; -- on the replica
SET SYNC PROPERTY region = ’west’ FOR REPLICA arizona; -- on the master
When to use Sync Pull Notify
The Sync Pull Notify feature decreases the delay between updating information about the master and updating information on the replica(s). However, in some situations it may increase network traffic.
Sync Pull Notify may increase the load on your network. If your current synchronization approach is to synchronize each replica once per hour, and if you typically have multiple updates per hour, then you only use 1 set of network messages per replica per hour to synchronize. If you switch to using the Sync Pull Notify feature, however, then you will have as many sets of network messages per replica per hour as you have updates.
The converse is also true. If you currently refresh frequently but only have updates infrequently, then Sync Pull Notify will actually reduce your network traffic because you will only refresh when you actually need data; you will not need to frequently "poll" to see if any data has changed. If you work in a situation where updates are infrequent, but it's necessary for you to know about those updates immediately when they occur, then Sync Pull Notify is a very good solution.
Note that "pull" and "push" (Sync Pull Notify) approaches are not mutually exclusive. You may use a combination of these. As an example, you might choose to design your system so that every day at a specific time the master will notify the replicas that it is time to refresh. However, a repair person going out into the field and taking her replica database with her could also issue a "REFRESH" command just before leaving the office, thus making sure that she has the most up-to-date data.
When designing your system for Sync Pull Notify, you may find it helpful to know that there are only three ways that data in the master can be changed, and thus there are only three situations in which you might need to "push" updated data to the affected replicas:
1 Data may be changed directly on the master, that is, a client may insert, update, or delete a record in a table on the master.
2 The master may receive data from a replica.
3 If the master server is both a master and a replica (for example, it is "in the middle" in a hierarchy that has three or more levels), then the server may request a refresh from its master and get data from that.
See also
Sync Pull Notify