Advanced Replication Guide : Administering advanced replication systems : Monitoring solidDB® advanced replication : Managing synchronization errors
  
Managing synchronization errors
A synchronization messaging error occurs when a message delivery or receipt fails. This section describes the procedures to manage synchronization errors. Depending on where the synchronization error occurred, the way to recover from errors can vary.
Error in forwarding a message to the master
When sending a message from the replica to the master fails, the message remains in the replica database and it can be resent to the master. The value of the STATE column of row in SYS_SYNC_REPLICA_MSGINFO table is in this case 22 -R_SAVED. You can query names of those messages that have not been successfully sent to the master with the following SQL statement:
SELECT MSG_NAME
  FROM SYS_SYNC_REPLICA_MSGINFO
  WHERE STATE = 22;
Failed messages can be resent to the master database with the following command in the replica database:
MESSAGE message_name FORWARD;
The possible values of the STATE column of the SYS_SYNC_REPLICA_MSGINFO table are documented in the Appendix: System Tables in thesolidDB® SQL Guide.
Error in execution of a synchronization message in the master
A message execution can fail in the master database, if:
an SQL statement of a transaction fails
refreshing data from a publication fails
sending a reply message back to the replica fails
The method used to handle each of these reasons for a failed message execution is covered in this section.
Error Handling in solidDB® Intelligent Transaction
If an intelligent transaction fails because of a fatal error, then the execution of the message is stopped in the master database and the transaction is rolled back. The error code of the failed operation is returned to the replica as the error code of the synchronization messaging command that was supposed to return the reply message to the replica database.
The error code is returned to the replica as a return code from either of these statements:
MESSAGE message_name FORWARD TIMEOUT timeout_in_seconds
or
MESSAGE message_name GET REPLY TIMEOUT timeout_in_seconds
In the master database, the system table SYS_SYNC_MASTER_MSGINFO contains information about all messages that currently reside in the master database. If the execution of a message has failed because of an error, that message will have value 1 in the STATE column. The ERROR_CODE and ERROR_TEXT columns contain information about the error that caused the message execution to halt. You can query these hanging messages, their originating replica database as well as the statement that caused the message to halt from the master database, by querying the view named SYNC_FAILED_MESSAGES. This view exists on the master and shows the replica name, message name, statement string, error information, and other information.
The proper way to recover from a hanging message in the master is to fix the error in the master database. For example, the reason for the error could be a unique constraint violation. To fix this error, the existing data of the master database must be modified to allow the new row to be inserted. (It is not possible to modify the contents of the transaction that resides in the transaction queue.) Alternatively, there could be a programming error in a stored procedure that needs to be corrected by re-creating the stored procedure in the master database.
Once the error is corrected, restart the message in the master database with the following command:
MESSAGE message_name FROM REPLICA replica_name EXECUTE
After the message is successfully executed in the master database, the reply message can be requested to the replica database with the following command:
MESSAGE message_name GET REPLY TIMEOUT timeout_in_seconds
Alternatively, the entire halted message can be deleted from the master database with the following command:
MESSAGE message_name [FROM REPLICA replica_name] DELETE
or, just the current transaction in the message can be deleted from the master database with the following command:
MESSAGE message_name FROM REPLICA replica_name DELETE CURRENT TRANSACTION
However, using this alternative will cause loss of data and should be used only as a last resort when there is no other means for resolving the error.
Note that MESSAGE DELETE CURRENT TRANSACTION is a transactional operation and must be committed before message execution may continue. To restart the message (where it left off) after the deletion is committed, use the following statement:
MESSAGE msgname FROM REPLICA replicaname EXECUTE
In general, transactions should be written to avoid concurrency conflicts and deadlocks. Because conflicts can still occur when transactions update or delete rows, we recommend you specify the SYS_TRAN_MAXRETRY bulletin board parameter in the master database using the SET SYNC PARAMETER command. The SYS_TRAN_MAXRETRY parameter retries a transaction that has failed due to a concurrency conflict or a deadlock based on a user configurable maximum number of attempts. For details, see SYS_TRAN_MAXRETRY.
Error Handling During Refresh Operations
Unlike transactions, an error while executing a REFRESH request on the master does not cause the entire message to stop. Instead, the error is reported back to the replica database in the result set of the messaging command. The error code is returned in the ERRCODE column of the result set. Similarly, the error text can be found from the ERRSTR column of the result set.
The result set of the messaging commands should always be fetched and the ERRCODE checked. All non-zero values mean that an error has occurred during the message execution in the master database.
One possible source of error is that the version of the publication has changed in the master database. If the publication was dropped and then re-created, the subscription(s) to the old version of the publication must be dropped in the replica database prior to subscribing to the new version. Dropping a subscription is done with the following command:
DROP SUBSCRIPTION publication_name [{(parameter_list) | ALL}] [COMMITBLOCK number_of_rows] [OPTIMISTIC | PESSIMISTIC];
When a subscription is dropped, all data for that subscription is deleted from the replica database. Subscribing to the new version and then requesting a refresh always brings the full publication to the replica.
In many cases, you can avoid this problem by using the "OR REPLACE" option of the CREATE PUBLICATION command when you need to update a publication. If you update a publication by using CREATE OR REPLACE PUBLICATION, rather than by dropping and re-creating the publication, then you will not necessarily need to drop and re-create the subscriptions. If you do not need to re-create the subscriptions, then in some cases the master can send the replicas an incremental refresh rather than a full refresh. Incremental refreshes reduce network traffic.
Error in receiving a reply message to a replica
A transfer of a reply message from master to replica can fail because of a networking error. In this case, the message remains in the master database.
You can use the following SQL statement to list messages whose replies have not been successfully received by the replica:
SELECT MSG_NAME
FROM SYS_SYNC_REPLICA_MSGINFO
WHERE STATE = 23
You can request the message again from the master database using the following command:
MESSAGE message_name GET REPLY TIMEOUT timeout_in_seconds
The possible values of the STATE column of the SYS_SYNC_REPLICA_MSGINFO table are documented in the Appendix: System Tables of solidDB® SQL Guide.
Error in executing a reply message in a replica
An execution of a reply message in a replica database can fail because of a concurrency conflict. Records applicable to a transaction may be locked so that an operation can't be performed in the replica. In this case, the message remains in the replica database. For example, when other transactions are updating a table, a concurrent refresh operation may fail. In this case, the REFRESH command remains in the replica database and must be re-executed.
You can allow the message to be re-executed from the replica database using the following command:
MESSAGE message_name EXECUTE
For example:
MESSAGE MyMsg0002 EXECUTE;
Sometimes it may be necessary to execute the message using pessimistic locking. This way, you avoid concurrency conflict handling, especially during REFRESH operations. For more information, read Handling concurrency conflict in synchronized tables in replica.
Deleting a message for error recovery
You can also explicitly delete a message from a replica database to recover from an error. When you delete a message, you can specify that the entire contents or only the current transaction that is propagated to the master database in the message be permanently deleted. The command to delete the entire message is:
MESSAGE message_name [FROM REPLICA replica_name] DELETE
The command to delete the current transaction is:
MESSAGE message_name FROM REPLICA replica_name DELETE CURRENT TRANSACTION
Note that the above statement can be used only in the master database.
When deleting the message from the master database, be sure to specify the replica name in the clause FROM REPLICA replica_name.
For example:
Message MyMsg0001 FROM REPLICA bills_laptop DELETE;
See also
Monitoring solidDB® advanced replication