Integrated Replicat was first introduced in Oracle GoldenGate version 12.1.2, and at the time was the preferred method for applying replicated data into Oracle Database targets.
Integrated Replicat is well suited for OLTP use cases because it automatically calculates transactional data dependencies and auto-tunes parallelism based on the workload. However, its drawback is an inability to split up large transactions and apply them in parallel. Therefore, as of Oracle GoldenGate version 19c, the best practice for data apply into Oracle databases, especially for mixed workloads consisting of OLTP and batch transactions, is to use Parallel Replicat (for Oracle GoldenGate 19c best practices refer to: https://www.oracle.com/a/tech/docs/maa-goldengate-hub.pdf).
Even though Parallel Replicat is the preferred, best-practice apply mechanism for Oracle GoldenGate 19c and above, many organizations have yet to migrate from Integrated Replicat. In this article we present troubleshooting information for when Integrated Replicat exhibits performance issues.
The image below provides a simplified view of Integrated Replicat architecture.
In this view, Integrated Replicat consists of several distinct processes: the Replicat process running at the OGG Hub, and, at the target database, the Inbound Server Apply Network Receiver, the Inbound Server Reader, the Inbound Server Coordinator, and the Inbound Server Applier processes.
Below are several well-known and documented factors that adversely affect Oracle GoldenGate Apply performance. Refer to the Oracle GoldenGate documentation for additional information.
Other factors specific to Integrated Replicat that affect its performance
Source database transactions are comprised of one or more database operations. For Integrated Replicat, a large transaction is any transaction containing more than 15,100 source operations.
When Integrated Replicat is processing a transaction, the Logical Change Records (LCR) are queued in memory until the commit record is read from the Oracle GoldenGate Trail file. Once the commit record is processed, all of the queued LCRs are flushed from memory and applied to the database. If the transaction is large, memory pressure will be evident in the AWR and Integrated Healthcheck reports.
When a transaction exceeds 15,100 LCRs, and all transactions prior to the large transaction have been committed, eager apply processing (controlled by the EAGER_SIZE setting) is enabled and one Integrated Replicat Applier begins applying the uncommitted LCRs to the database. The transaction now becomes a barrier transaction that is applied serially.
When this happens all transactions after this large transaction are placed on hold until the commit record for the barrier transaction is processed. This is because the database is in an inconsistent state due to the apply of the large transaction, so parallel apply must be disabled.
EAGER_SIZE processing of large transactions severely impacts Integrated Replicat performance, and in many cases the Integrated Replicat will appear to be “hung” as it may not immediately respond to commands.
If this happens, we strongly advise that you do not kill the Integrated Replicat operating system pid. Killing the pid causes an immediate termination of the Integrated Replicat processes. Several bad things will now happen: the database must roll back the partially applied barrier transaction, which can take as long as the apply itself, and when the Integrated Replicat is restarted it must reprocess the entire large transaction from the beginning, compounding the latency.
Now that we have identified some of the common things that can impact performance, let’s walk through some troubleshooting steps should you encounter a poorly performing Integrated Replicat.
Latency in a replication stream is the total time required to capture a source transaction and apply it to the target database. Latency is only a problem if this time exceeds your company’s published service level agreement (SLA) for data replication.
The Oracle GoldenGate Automatic Heartbeat is required for troubleshooting latency issues, and it is best practice to have this feature enabled. The Oracle GoldenGate Heartbeat places data in the target database that allows support personnel to quickly isolate the point in the replication stream causing the latency. With this information the time to troubleshoot and correct the latency issue is drastically reduced.
While the performance issue is present, the first things you should always do are: (1) generate the Integrated Healthcheck report, (2) generate an AWR report covering the problem period, and (3) run the Streams Performance Advisor to produce the Streams Advisory report.
While waiting on the above to complete, review the Replicat report file in the [deployment home]/var/lib/report directory, or the Replicat log file in the [deployment home]/var/log directory. Check for messages similar to the following:
For all of the above messages except the large transaction message, minor adjustments to Integrated Replicat and the database may improve performance.
An adjustment that could be made for the large transaction performance degradation is to increase the Integrated Replicat EAGER_SIZE setting and, if there is sufficient unallocated SGA in the database, increase the amount of STREAMS_POOL_SIZE allocated for Integrated Replicat. However, this should be a short-term solution only, as the proper, best-practice resolution is to switch to Parallel Replicat.
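As a sketch only (the values shown are illustrative, not tuning recommendations), the Replicat side of the change is made with DBOPTIONS INTEGRATEDPARAMS in the parameter file, and the streams pool is grown at the database:

-- Replicat parameter file: raise the eager apply threshold and the memory ceiling
DBOPTIONS INTEGRATEDPARAMS(EAGER_SIZE 32000, MAX_SGA_SIZE 2048)

-- At the database: grow the streams pool if unallocated SGA permits
alter system set streams_pool_size=4G scope=both;

The parameter file change takes effect on the next Replicat restart.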
If none of the above messages are seen in the report file or OGG logs, check the network latency from the OGG Hub to the database server. This may be done using ping, traceroute, or iperf3 operating system utilities.
When available, review the Healthcheck, AWR, and Streams Advisory output.
If in doubt about how to interpret the information, engage Oracle Support and they will be happy to assist.
Note: If you engage Oracle Support, or A-Team, they will require the following information: (1) the Integrated Healthcheck output, (2) the AWR report, (3) the Streams Advisory Report, (4) copies of the Integrated Replicat parameter and report files, (5) network latency information and (6) database error log data covering the period of the performance issue.
In most cases, the Healthcheck, AWR, and Streams Advisory reports will identify the root cause of the performance degradation.
If Healthcheck, AWR, and Streams Advisory Report do not uncover the cause of the performance degradation, additional database queries and operating system utilities may be required to identify the latency cause. In most cases where this is required, A-Team has found that the issue is related to database resource limitations or workload issues that negatively affect the Integrated Replicat Appliers, so we typically start at the database and work backwards towards the OGG Replicat (or right to left in the simplified architecture view we reviewed earlier).
For the following discussion there are three Oracle VirtualBox machines: a source database server, an Oracle GoldenGate Hub, and a target database server.
Integrated Replicat is configured as:
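A minimal illustrative parameter file for such a configuration might look like the sketch below; the Replicat name, credential alias, and mappings are hypothetical, while the 1 GB MAX_SGA_SIZE matches the limit referenced later in this article:

REPLICAT repa
USERIDALIAS oggadmin
DBOPTIONS INTEGRATEDPARAMS(MAX_SGA_SIZE 1024, PARALLELISM 4)
MAP pdbeast.tpc.*, TARGET tpc.*;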
Query the database to get the sga_max_size and sga_target settings.
show parameter sga
This database has 8GB of SGA memory configured and allocated.
Query the database to determine how much streams pool is allocated and in use.
set wrap off
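One way to get this, as a sketch using the standard v$sga_dynamic_components and v$sgastat views (not necessarily the exact query used here):

-- total streams pool currently allocated
select component, current_size/1024/1024 as current_mb
from v$sga_dynamic_components
where component = 'streams pool';

-- free (unused) memory within the streams pool
select name, bytes/1024/1024 as free_mb
from v$sgastat
where pool = 'streams pool'
and name = 'free memory';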
The database has 2.6GB of streams_pool configured and allocated.
Full Table Scans
Check for replicated tables incurring full table scans.
set wrap off
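A sketch of such a check against v$sql_plan follows; the TPC schema filter reflects this example environment and should be replaced with your replicated schema:

-- cached SQL statements performing full table scans against replicated tables
select p.object_owner, p.object_name, count(*) as full_scans
from v$sql_plan p
where p.operation = 'TABLE ACCESS'
and p.options = 'FULL'
and p.object_owner = 'TPC'
group by p.object_owner, p.object_name
order by full_scans desc;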
The example response output shows that the table TPC.MYNOPKEY incurs full table scans when applied to the target database.
Inbound Server Applier
To get the Applier state and activity.
set wrap off
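A sketch of such a query against the v$gg_apply_server view (column names per the Oracle Database Reference):

select server_id, state, total_assigned, total_messages_applied,
       sid, serial#,
       xidusn || '.' || xidslt || '.' || xidsqn as current_xid
from v$gg_apply_server
order by server_id;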
The query response shows each Applier process state, its sid and serial#, the number of LCRs currently assigned for apply, the total number of LCRs applied to the database, and the current database transaction.
From the above information we can get each Applier’s OS pid by converting the database sid to an operating system pid.
set underline off
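A sketch that joins v$gg_apply_server to v$session and v$process to do the sid-to-pid conversion in one step:

select a.server_id, s.sid, p.spid as os_pid
from v$gg_apply_server a, v$session s, v$process p
where s.sid = a.sid
and s.serial# = a.serial#
and p.addr = s.paddr
order by a.server_id;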
Query response example:
With the operating system pid, we can use the top utility at the database server to see operating system info for the Applier processes.
$ top -p 10428,10434,10430,10432
In the example output, we can see that the Inbound Server Applier processes are neither CPU nor memory constrained.
Let’s take a look at this same information when a large transaction is being applied. We did a restart of the Integrated Replicat, and the only data being generated at the source is the OGG Heartbeat. Our Inbound Server Applier query returns:
The top utility on the database server shows:
After generating a large transaction of 15000 operations against one table in the source database, the Inbound Server Applier query shows that the data for all 15000 operations was applied by a single Inbound Server Applier.
The top utility on the database server shows an increase in memory consumption for the Applier OS process.
Let’s also take a look at what happens when a barrier transaction is encountered. For this example, we’ll run an OLTP operation simulator and then execute 25000 operations as a single transaction against a single source table while the OLTP simulator is running.
Starting the OLTP simulator, our Inbound Server Applier query shows the records being applied in parallel into the target database.
The top utility on the database server shows the Applier OS processes consuming both CPU and memory.
The Applier processes’ Total Assigned and Total Messages Applied counters in the database query continue incrementing until the large transaction is encountered and the operation count within the transaction exceeds 15,100. When this happens, the Applier processes all go into the “IDLE” state and the barrier transaction records are applied to the database via a single Applier process until the commit record is read from the Oracle GoldenGate Trail and the transaction completes in the database.
Once the commit record is processed and the transaction completes in the database, all Appliers resume processing OLTP operations. It’s important to note that the Applier Total Messages Applied counter will not increment until after the commit record in the Oracle GoldenGate Trail is applied.
You will also see an entry in the Integrated Replicat report file and OGG logs stating that a large transaction was processed.
Since the large transaction is applied as a barrier transaction, and the Applier Total Messages Applied counter does not increment, how can you tell it is really working on a large transaction? For this, we use the Oracle GoldenGate Logdump utility to get information from the Oracle GoldenGate Trail.
In order to use Logdump, we need to know what Trail the Integrated Replicat is processing. In the adminclient, use the info command:
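For example, with a hypothetical Replicat named REPA, the Trail and its location appear in the Log Read Checkpoint line of the output:

info replicat repa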
In the above output, the Integrated Replicat is reading the Trail “my000000020” at the disk location “/u01/GoldenGate/Deployments/ORA193/Insecure/ogghubinsec/var/lib/data”.
The Logdump utility is located in the $OGG_HOME/bin directory, and we start it from a Linux shell.
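From the OGG Hub shell:

$ $OGG_HOME/bin/logdump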
Issue the open command to access the Trail.
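Using the Trail location identified above:

Logdump 1 >open /u01/GoldenGate/Deployments/ORA193/Insecure/ogghubinsec/var/lib/data/my000000020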
Issue the command detail data to enable viewing of information about each record in the Trail, and then the count command to get the count of all records in the Trail per table (the output is too big to capture in a single screen snapshot).
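The commands are:

Logdump 2 >detail data
Logdump 3 >count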
In the output above, the Trail contains records for the OGG Heartbeat and the table “pdbeast.tpc.dt_char”. As the OGG Heartbeat is an OLTP operation, we can exclude those records from analysis. We can view the details of the records for “pdbeast.tpc.dt_char” by issuing the filter include filename command followed by the count command.
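For example:

Logdump 4 >filter include filename pdbeast.tpc.dt_char
Logdump 5 >count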
In the above output, the Trail contains 1 metadata record, 500 inserts, and 29999 updates for the table. There are 3 transactions in the Trail for the table, which averages to 10166 records per transaction. However, the single metadata record counts as 1 transaction, so we want to exclude it from the average; which gives us 2 transactions with an average of 15,249 operations per transaction. These two large transactions are considered barrier transactions and will be applied serially by a single Inbound Server Applier process.
Suppose you see the following when troubleshooting Integrated Replicat performance:
This means operations were encountered that the Inbound Server Coordinator determined could only be applied serially, which is another form of barrier transaction. All Appliers, except for the one performing the database apply, go into the “WAIT DEPENDENCY” state until the transaction commits. Workloads with dependencies should not be executed as large transactions, as they cannot be efficiently applied to the target database.
Inbound Server Coordinator
To get the Inbound Server Coordinator state and activity.
set wrap off
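A sketch of such a query against the v$gg_apply_coordinator view (column names as documented for the GoldenGate apply coordinator view):

select sid, serial#, state,
       total_received, total_assigned, total_applied
from v$gg_apply_coordinator;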
To get the OS pid of the database process.
set underline off
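As with the Appliers, the standard v$session/v$process join does the conversion, this time keyed on the Coordinator’s sid:

select p.spid as os_pid
from v$gg_apply_coordinator c, v$session s, v$process p
where s.sid = c.sid
and s.serial# = c.serial#
and p.addr = s.paddr;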
With the OS pid, we can use the top utility to view system resource consumption.
Normally, the Inbound Server Coordinator will not be a factor when troubleshooting Integrated Replicat performance unless it consistently reports “FLOW CONTROL” as its state. Flow control means the Inbound Server Coordinator either (1) cannot keep pace with the number of messages being received from the Inbound Server Reader, or (2) is being blocked from sending messages to the Inbound Server Applier processes. If we see flow control being activated within the Coordinator process, we need to look at the Inbound Server Applier processes again.
Inbound Server Reader
Integrated Replicat SGA
Get the Integrated Replicat sga_allocated and sga_used values. The name “SGA” in this context refers to streams pool memory.
set wrap off
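A sketch of such a query, assuming the sga_allocated and sga_used columns of the v$gg_apply_reader view:

select apply_name,
       sga_allocated/1024/1024 as sga_allocated_mb,
       sga_used/1024/1024 as sga_used_mb
from v$gg_apply_reader;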
Query response example:
The Integrated Replicat configuration specifies it may use up to 1 GB of SGA (streams pool). This shows the Inbound Server Reader process has been allocated 4 MB of streams pool memory and is using 4 MB.
As activity increases, these values will increase up to the limit specified by the Integrated Replicat max_sga_size setting.
Inbound Server Reader Process
We can see details about the Inbound Server Reader process by executing this database query:
set wrap off
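A sketch selecting a few of the v$gg_apply_reader counters; the “Bytes Received” counter discussed below is assumed to come from an additional column of the view in your release:

select sid, serial#, state,
       total_messages_dequeued,
       total_in_memory_lcrs
from v$gg_apply_reader;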
When the Reader is processing incoming data, the “Bytes Received” counter will increment with each incoming data batch from the Replicat process (on the OGG Hub). “Total In Memory LCRS” reports the number of LCRs enqueued for the Inbound Server Coordinator; you will see this counter increment and decrement as LCRs are queued and removed.
To get the OS pid, run the database query:
set underline off
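The same v$session/v$process join, keyed on the Reader’s sid:

select p.spid as os_pid
from v$gg_apply_reader r, v$session s, v$process p
where s.sid = r.sid
and s.serial# = r.serial#
and p.addr = s.paddr;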
Then we can monitor the process resource consumption with top.
Inbound Server Reader performance may be impacted by (1) the amount of streams_pool allocated to Integrated Replicat in the database SGA, (2) the network between the Replicat and the Inbound Server Reader, including the Oracle Client on the OGG Hub, and (3) poorly functioning Inbound Server Coordinator or Inbound Server Applier processes.
To check network connectivity, we can use ping to get the average round trip time between the OGG Hub and the database server. As we pointed out previously, the network latency should not exceed 2 milliseconds.
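For example, assuming the database server resolves as dbserver:

$ ping -c 4 dbserver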
A more accurate test of the network may be performed using the Linux iperf3 utility.
On our database server, we start iperf3 in server mode.
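$ iperf3 -s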
On the OGG Hub, we start iperf3 in client mode, specifying the hostname, or IP address, of the target database server. A series of data packets is transferred between the two systems and iperf3 computes and reports the actual network bandwidth.
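Again assuming the hostname dbserver:

$ iperf3 -c dbserver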
Earlier we mentioned that Inbound Server Reader performance may be affected by streams_pool. To illustrate this point, let’s change our Integrated Replicat to limit the amount of streams_pool SGA memory it can use, by altering the MAX_SGA_SIZE setting in the parameter file.
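The altered setting in the parameter file, limiting the Reader to 50 MB:

DBOPTIONS INTEGRATEDPARAMS(MAX_SGA_SIZE 50)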
Now, the Inbound Server Reader will only be allocated, and allowed to use, a maximum of 50 MB of the available streams pool. With the Integrated Replicat running, and our OLTP simulator generating load in the source database, our database query shows:
At the OGG Hub, we now have high latency reported:
Inbound Server Apply Network Receiver
To see details about the Inbound Server Apply Network Receiver, execute this database query:
set wrap off
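As a sketch, assuming the v$gg_apply_receiver view’s state and message counter columns follow the naming of the other apply views (verify against the Oracle Database Reference for your release):

select sid, serial#, state, total_messages_received
from v$gg_apply_receiver;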
In the output example above, our Integrated Replicat was idle, in that the only records being processed were generated by the Oracle GoldenGate Heartbeat. Let’s generate an OLTP workload and look at the output.
With our OLTP workload generator running, we see the number of total messages received incrementing while the Apply Network Receiver state changes from waiting to enqueuing.
But what if the query response looks something like this:
The total messages received count is incrementing, but the Apply Network Receiver state says it is waiting on memory. The top utility shows there is no memory pressure in the server:
If we’re not memory constrained at the server level, we need to take a look at the SGA Streams Pool allocated and in use. We run the query presented previously to check the amount of SGA allocated and in use by the Inbound Server Reader and see the process is memory constrained.
In this case, the problem resolution is to alter the Integrated Replicat parameter file and increase the amount of SGA allocated via the setting DBOPTIONS INTEGRATEDPARAMS MAX_SGA_SIZE.
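For example (the 1024 MB value is illustrative; also make sure the database streams_pool_size can accommodate the increase):

DBOPTIONS INTEGRATEDPARAMS(MAX_SGA_SIZE 1024)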
If the Inbound Server Apply Network Receiver shows a state of “SQL*Net more data from client”, this is typically a symptom of a slow network; check the network latency and bandwidth to ensure they meet the requirements for remote apply.
We’ve now covered all of the “integrated” components of Integrated Replicat; however, there is still one more piece to check, the Replicat process running at the OGG Hub.
The Linux shell command ps -C replicat will return the Replicat process pid.
If you have more than one Replicat running, you can use ps and grep to find the pid of the specific Replicat.
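For example, for a hypothetical Replicat named REPA:

$ ps -ef | grep replicat | grep -i repa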
We can use top to view system resource consumption of the Replicat process.
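As before:

$ top -p [pid]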
Along with network latency, i/o latency can affect performance of the Replicat process. We can use the Linux dd command to check the i/o performance of the disk housing the Oracle GoldenGate Trails.
First, use dd to perform a write test and create a 512 MB file on the disk where the Trails are located.
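A sketch of the write test; the output file name is illustrative, and the directory is the Trail location identified earlier:

$ dd if=/dev/zero of=/u01/GoldenGate/Deployments/ORA193/Insecure/ogghubinsec/var/lib/data/ddtest.out bs=1M count=512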
The write test took 3.5 seconds to complete. Then use dd to read the file we created.
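And the read test:

$ dd if=/u01/GoldenGate/Deployments/ORA193/Insecure/ogghubinsec/var/lib/data/ddtest.out of=/dev/null bs=1M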
The read test completed in 0.06 seconds. (Note: We did not flush the cache before doing the read, as that impacts everything on the server. We could have specified iflag=direct in the dd command to bypass the buffer cache and read directly from disk.)
Trace Replicat Activity
If Integrated Replicat appears to be “hung”, you can use the strace utility to see the activity of the Replicat process on the OGG Hub.
As shown previously, get the Replicat pid, then run the strace utility specifying the Replicat pid.
$ strace -rp [pid]
Ideally you want to see Replicat reading and writing, anything else may require further investigation by an Oracle GoldenGate subject matter expert.
In this article we discussed factors that could adversely affect Integrated Replicat performance and presented basic troubleshooting information that should help you identify the problem root cause. We also pointed out that as of Oracle GoldenGate 19c it is best practice to use Parallel Replicat for data apply.