Best Practices for Replication

This page contains recommendations for achieving the best results when deploying TimesTen replication as part of a High Availability solution. It should be read in conjunction with the TimesTen Replication Guide. TimesTen supports both explicit Active/Standby pair replication and the older 'classic' replication. Active/Standby pair is the recommended replication mode for all new applications. While many of the best practices below apply to both modes, this page focuses on Active/Standby pair.

Setup and configuration

OS Username for Instance Administrator

The TimesTen instances which will participate in the same replicated environment must all have the same O/S username for the instance administrator user. The numeric userid does not have to be the same but the actual username must be identical. If this is not the case you will be unable to use the 'duplicate' function which is essential for deploying a replicated environment and for recovery after a failure.

LAN and WAN connectivity

The machines hosting the active and standby master databases must be connected via a LAN or a network with LAN characteristics (throughput, latency, reliability). Read-only subscribers, if used, can be on the same LAN as the masters, on a different LAN, or remotely located and accessed via a WAN. If the network connecting the master databases is inadequate you will experience poor replication performance and possibly reliability issues.

System clock synchronization

The system clocks on the master nodes must be kept synchronized to within 250 milliseconds of each other. Typically you will need to use NTP or similar for this. If the clocks differ by more than 250 ms then you will not be able to establish replication between the databases. In this case you will see 'clock skew' errors reported in one or both of the TimesTen error and support logs.
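
As a rough illustration of the 250 ms limit, the following Python sketch compares two clock readings. It assumes you have some way (not shown, and not part of TimesTen) to read both system clocks at nearly the same instant; the function names are hypothetical, and any delay in collecting the readings adds to the apparent skew.

```python
from datetime import datetime, timedelta

# Limit taken from the text above: masters must agree to within 250 ms.
MAX_SKEW = timedelta(milliseconds=250)


def clock_skew(active_now: datetime, standby_now: datetime) -> timedelta:
    """Return the absolute difference between two clock readings."""
    return abs(active_now - standby_now)


def skew_ok(active_now: datetime, standby_now: datetime) -> bool:
    """True when the two readings differ by less than MAX_SKEW."""
    return clock_skew(active_now, standby_now) < MAX_SKEW
```

A check like this is only a pre-flight aid; NTP (or similar) is still what keeps the clocks aligned.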

TCP port numbers for replication

Use fixed port numbers rather than dynamically assigned port numbers (see the STORE clause). Using fixed port numbers greatly simplifies rolling upgrades between major releases. If you have firewalls in place then you may have to use fixed port numbers anyway.

Hostnames versus IP addresses in replication schemes

Whenever possible, use hostnames not IP addresses when defining store names in replication schemes (dbname ON hostname). It is okay to use IP addresses, if desired, in ROUTE clauses. If you use IP addresses then you will have to re-deploy your entire replicated environment any time the IP address of a participating host changes.

Name resolution

Ensure that hostname -> address resolution (DNS or /etc/hosts) is working properly and reliably. The local host's official name should not include any domain component, i.e. the O/S 'hostname' command should return 'myhost' not 'myhost.mydomain'. If hostnames and name resolution are not set up properly then you may be unable to establish replication and you may also have issues with duplicating.
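
The two checks described above can be sketched in Python using only the standard library. This is a minimal pre-deployment check, not a TimesTen tool; the function names are hypothetical, and it tests only whether the local name resolves, not whether it resolves to the right address.

```python
import socket


def has_domain_component(hostname: str) -> bool:
    """True if the name contains a dot, e.g. 'myhost.mydomain'."""
    return "." in hostname


def check_local_hostname() -> list:
    """Return a list of problems found with the local host name (empty if none)."""
    problems = []
    name = socket.gethostname()
    if has_domain_component(name):
        problems.append(f"hostname '{name}' includes a domain component")
    try:
        # Raises socket.gaierror if the name cannot be resolved.
        socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        problems.append(f"hostname '{name}' does not resolve: {exc}")
    return problems
```

Run the check on every host that will participate in the replication scheme, since a problem on any one of them can prevent replication from starting.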

Network bandwidth

Use Gigabit Ethernet as a minimum unless the replication workload is very low. 100 Mbit or slower Ethernet may easily become saturated under a heavy replication workload.
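
To see why a slow link saturates quickly, the arithmetic can be sketched as follows. The replicated data rate is an input you would measure yourself; the function name is hypothetical and the calculation ignores protocol overhead, so real utilization will be somewhat higher.

```python
def link_utilization(replicated_mb_per_sec: float, link_mbit_per_sec: float) -> float:
    """Fraction of raw link capacity consumed (1.0 means saturated)."""
    link_mb_per_sec = link_mbit_per_sec / 8.0  # 8 bits per byte
    return replicated_mb_per_sec / link_mb_per_sec
```

For example, a replicated stream of 10 MB/s uses about 80% of a 100 Mbit link (which carries at most 12.5 MB/s) but only about 8% of a Gigabit link.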

Prevent excessive transaction log file accumulation

Use the FAILTHRESHOLD option to guard against unbounded transaction log accumulation in the event of a prolonged network or machine outage. If you do not use FAILTHRESHOLD then a prolonged standby or network outage will cause a large number of transaction log files to accumulate at the active, which may then result in an 'out of disk space' condition that is likely to be highly problematic.
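
FAILTHRESHOLD is expressed as a number of transaction log files, so one way to reason about a value is to work back from the disk space you can afford to lose to accumulated logs. The sketch below is only that arithmetic; the function names are hypothetical, the inputs (disk budget, log file size, log generation rate) are values you must measure or look up for your own system, and whether the result is a suitable setting is for you to judge against the TimesTen documentation.

```python
def max_log_files(disk_budget_mb: int, log_file_size_mb: int) -> int:
    """Number of whole log files that fit in the given disk budget."""
    return disk_budget_mb // log_file_size_mb


def minutes_until_threshold(threshold_files: int,
                            log_file_size_mb: int,
                            log_mb_per_sec: float) -> float:
    """Approximate outage duration before `threshold_files` log files accumulate."""
    return threshold_files * log_file_size_mb / log_mb_per_sec / 60.0
```

For example, a 20480 MB budget with 64 MB log files allows 320 files; at a log generation rate of 5 MB/s that many files would accumulate in a little over an hour.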

Async versus Sync replication

Choose the replication mode (asynchronous, RETURN RECEIPT, RETURN TWOSAFE etc.) appropriately to balance performance versus data safety. If you are considering using RETURN RECEIPT replication then you should be aware that, in Active/Standby pair replication, RETURN TWOSAFE offers both higher protection and higher performance and is therefore usually the recommended choice.

Use parallel replication in preference to ReceiverThreads

The ReceiverThreads attribute is an older mechanism for boosting the performance of replication. It has been superseded by the more effective and more controllable parallel replication mechanism. When using parallel replication, do not also use ReceiverThreads as this may lead to excessive CPU usage.

Synchronous replication options

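When using synchronous replication, there are many configuration options that can be used to control how the return service behaves, including what happens when the standby does not respond in time.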

While there is no 'one size fits all' recipe, the following options, specified in the STORE clause for both master databases, are recommended as the best starting point.

STORE storename ON hostname
RETURN SERVICES OFF WHEN STOPPED
RETURN WAIT TIME 5
DISABLE RETURN ALL 2
DURABLE COMMIT ON
RESUME RETURN 250
LOCAL COMMIT ACTION COMMIT

Some of the numeric values above will need to be fine tuned for your specific setup and workload. The optimal values depend on your configuration and application requirements and so it is hard to make specific recommendations. Consult the TimesTen documentation for the meaning of these options and the associated values.

Implementation and Testing

Keep transactions small

In a replicated environment it is imperative to keep transactions small. No transaction should modify more than a few thousand rows. Executing very large transactions (10s or 100s of thousands of modifications in a single transaction) will consume a lot of system resources and will have a negative impact on replication operation. In extreme cases, due to the timeouts involved, a very large transaction may effectively prevent replication functioning and necessitate recovery via a duplicate operation.
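
One common way to keep transactions small is to split a bulk operation into batches and commit after each batch. The Python sketch below shows the pattern against a generic DB-API connection; the table t, its columns and the batch size of 1000 are hypothetical, and sqlite3 is used in the usage note purely as a stand-in so that the example can be run without a TimesTen installation.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows, committing after each batch; return the number of commits."""
    commits = 0
    cur = conn.cursor()
    for batch in chunked(rows, batch_size):
        # Hypothetical table and columns, for illustration only.
        cur.executemany("INSERT INTO t (id, val) VALUES (?, ?)", batch)
        conn.commit()  # one small transaction per batch
        commits += 1
    return commits
```

To try it, create a connection with sqlite3.connect(":memory:"), create a table t (id INTEGER PRIMARY KEY, val TEXT) and pass in a few thousand (id, val) tuples. Note that batching changes the atomicity of the overall operation, so the application must be able to cope with a failure part way through.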

Properly handle return service timeout warnings

If you choose to use RETURN RECEIPT or RETURN TWOSAFE then your application logic must have some awareness of replication. Specifically, under some circumstances a commit operation may return the TimesTen warning TT8170: Receipt or commit acknowledgement not returned in the specified timeout interval for transaction_ID. This warning means that the standby did not acknowledge the replicated transaction within the defined timeout period. This typically means that either the standby, or the network connecting the active and standby, is down. It may also occur when the replication agent is stopped at either the active or the standby. The application has to react correctly to this warning depending on how replication is configured [RETURN SERVICES ON|OFF WHEN STOPPED, LOCAL COMMIT ACTION COMMIT|NO ACTION].
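
The shape of this decision logic can be sketched as follows. This is an illustration only: the outcome descriptions reflect one reading of how LOCAL COMMIT ACTION affects a RETURN TWOSAFE transaction after a timeout and should be verified against the TimesTen documentation for your release. The function and enum names are hypothetical, and how your database driver exposes the warning code is not shown.

```python
from enum import Enum

RETURN_TIMEOUT_WARNING = 8170  # TT8170, as quoted in the text above


class LocalCommitAction(Enum):
    COMMIT = "COMMIT"
    NO_ACTION = "NO ACTION"


def after_twosafe_timeout(action: LocalCommitAction) -> str:
    """Describe the assumed transaction state after a TWOSAFE timeout."""
    if action is LocalCommitAction.COMMIT:
        # Assumption: the transaction was committed on the active only.
        return "committed locally; standby unconfirmed"
    # Assumption: the transaction is still open and the commit can be reissued.
    return "still open; application must decide, e.g. reissue the commit"


def handle_commit_warning(native_code: int, action: LocalCommitAction) -> str:
    """Route a commit-time warning code to the appropriate handling."""
    if native_code != RETURN_TIMEOUT_WARNING:
        return "not a return service timeout"
    return after_twosafe_timeout(action)
```

Whatever the exact policy, the key point is that the application must check for this warning on every commit and must not treat it as either a plain success or a plain failure.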

Optimize replication throughput

Test a 'sustained maximum plus some headroom' workload to ensure the hardware is adequate and replication can keep up. As part of workload testing, tune the log buffer size (LogBufMB), logging parallelism (LogBufParallelism) and the replication parallelism (ReplicationParallelism) to optimize replication performance. The size of the log buffer is a key performance tunable in TimesTen and it is especially important when replication is being used (you want to eliminate any SYS.MONITOR.LOG_FS_READS).
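
During a workload test you can sample the SYS.MONITOR counters before and after the run and look at the difference. The sketch below assumes you have already fetched each sample into a dictionary keyed by column name (how you run the query is not shown); the function names are hypothetical.

```python
def counter_deltas(before: dict, after: dict) -> dict:
    """Per-counter increase between two samples of the same counters."""
    return {name: after[name] - before[name] for name in before}


def log_reads_from_disk(before: dict, after: dict) -> bool:
    """True if LOG_FS_READS increased between the two samples."""
    return counter_deltas(before, after)["LOG_FS_READS"] > 0
```

If LOG_FS_READS increases during the test, log records were read from the file system rather than from the log buffer, which suggests the log buffer tuning needs another look.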

Determine recovery strategy

Assess the ability of replication to 'catch up' from a backlog after a temporary outage when under typical and maximum workloads. This will affect your chosen recovery strategy: do you allow 'catchup' or do you choose to duplicate?
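
A first estimate of catch-up time follows from simple arithmetic: the backlog drains at the rate replication can apply changes minus the rate at which the workload generates new ones. The sketch below assumes both rates are constant and have been measured in your own tests; the function name is hypothetical.

```python
import math


def catch_up_seconds(backlog_mb: float,
                     replication_mb_per_sec: float,
                     workload_mb_per_sec: float) -> float:
    """Estimated time to drain a backlog; infinite if replication cannot gain ground."""
    drain_rate = replication_mb_per_sec - workload_mb_per_sec
    if drain_rate <= 0:
        return math.inf
    return backlog_mb / drain_rate
```

For example, a 6000 MB backlog with replication sustaining 30 MB/s against a 10 MB/s workload drains in about 300 seconds. If the estimate is long compared with the time a duplicate takes, duplicating may be the better recovery strategy.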

Monitor, monitor, monitor

Implement comprehensive monitoring of replication operation, throughput and backlog to get early warning of any problems (see later).

A/S Pair with Oracle Clusterware

Use latest supported Clusterware

Use the latest version of Oracle Clusterware (Grid Infrastructure) that is certified with the TimesTen version that you are using. Using an older version of Clusterware may result in less reliable operation.

Install TimesTen on local storage on each machine

Do not install TimesTen on shared storage; each node should have its own private installation. TimesTen does not support the concept of a 'shared installation'.

Avoid NFS storage

The use of NFS shared storage is not recommended for TimesTen database files (checkpoint and log files). These should be located on high performance local storage (best) or high performance non-NFS SAN storage (second best).

Use ttCWAdmin

Use the TimesTen Clusterware utility, ttCWAdmin, to perform all TimesTen/Clusterware operations. Do not use Clusterware tools or other TimesTen utilities to perform functions that should be performed via ttCWAdmin. Manipulating the state of TimesTen components by means other than ttCWAdmin may result in unreliable operation.

Monitor, monitor, monitor

Even when using Clusterware you should still have your own health monitoring in place, especially for replication throughput/backlog.

A/S pair without Oracle Clusterware

Monitor, monitor, monitor

Ensure a rigorous and robust monitoring mechanism is in place covering all aspects of TimesTen health, not just replication operation (see later).

Test, test, test

Ensure you have pre-scripted and thoroughly tested recovery procedures for every possible failure/recovery scenario that applies to your environment.

Handle all failure/recovery scenarios

If you plan to automate failover and/or recovery, be sure your automation mechanism can detect and deal with all possible failure scenarios and can detect/handle split brain situations, network partition etc.
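
As one small piece of such automation, the sketch below classifies the pair from the replication state reported by each master (for example, as returned by the ttRepStateGet built-in procedure). How the states are collected is not shown, None stands for a master that could not be contacted, and the function name and category labels are hypothetical.

```python
from typing import Optional


def classify_pair(state_a: Optional[str], state_b: Optional[str]) -> str:
    """Classify an active/standby pair from the two reported states."""
    states = {state_a, state_b}
    if state_a == "ACTIVE" and state_b == "ACTIVE":
        return "split-brain"  # both masters believe they are the active
    if states == {"ACTIVE", "STANDBY"}:
        return "healthy"
    if None in states:
        return "unreachable"  # node failure or network partition
    return "degraded"  # any other combination needs investigation
```

Note that from a single vantage point an unreachable master cannot be distinguished from a network partition, which is why automated failover normally needs an additional, independent way of deciding which side should survive.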

Monitoring the health of TimesTen replication

For any production deployment, implementing a rigorous and robust monitoring framework is an absolute must. Operational problems may arise from inadequate or insufficiently robust monitoring regimes.

The topic of monitoring is a large one; for more detailed information on monitoring TimesTen, and replication in particular, please consult the TimesTen documentation, as well as the Best Practices for Monitoring section in this Quick Start guide.