X

Best Practices from Oracle Development's A‑Team

Oracle GoldenGate Integrated Replicat Troubleshooting

 

Disclaimer: The information presented in this article is for educational purposes only, applies to Oracle GoldenGate versions 19c and above, and does not constitute any contractual agreement; either implied or explicit. Any scripts or database query examples are presented where-is, as-is, and are unsupported by Oracle Support and Development. Always refer to the official Oracle documentation when planning changes to your Oracle GoldenGate installations.

Integrated Replicat was first introduced in Oracle GoldenGate version 12.1.2, and at the time was the preferred method for applying replicated data into Oracle Database targets.

Integrated Replicat is well suited for OLTP use cases as it provides the ability to automatically calculate transactional data dependencies and auto tune parallelism based on the workload. However, the drawback of Integrated Replicat is its inability to split-up and apply large transactions in parallel. Therefore, as of Oracle GoldenGate version 19c, the best practice for data apply into Oracle databases; especially for mixed workloads consisting of OLTP and batch transactions, is to use Parallel Replicat (For Oracle GoldenGate 19c best practices refer to: https://www.oracle.com/a/tech/docs/maa-goldengate-hub.pdf).

Even though the preferred, and best practice, apply mechanism for Oracle GoldenGate 19c and above is Parallel Replicat, many organizations have yet to migrate from Integrated Replicat. In this article we shall present troubleshooting information for when Integrated Replicat presents performance issues.

NOTE

Parallel Replicat features two modes of operation; Parallel Non-integrated and Parallel Integrated.

Parallel Integrated Replicat should be considered for use cases requiring Procedural Replication and Auto CDR. All other use cases should use Parallel Non-integrated Replicat.

For more information on Parallel Replicat features, refer to the Oracle GoldenGate documentation https://docs.oracle.com/en/middleware/goldengate/core/index.html).

Integrated Replicat and Parallel Integrated Replicat are not supported for data apply to Oracle Autonomous Data Warehouse https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/using-ogg-autonomous-databases.html#GUID-808668AB-68B0-4D4A-96C4-A18FC06D1299).

 

Integrated Replicat Architecture

 

The image below provides a simplified view of Integrated Replicat architecture.

In this view, Integrated Replicat consists of several distinct processes:

 
The Replicat process:
 
Reads the Oracle GoldenGate trail.
 
 
Performs data filtering, mapping, and conversion.
 
 
Constructs logical change records (LCR) that represent source database
DML transactions (in committed order).
 
DDL is applied directly, or in “direct mode”, by the Inbound Server Apply
Network Receiver.
For direct mode apply, the operations are queued by the Apply Network Receiver until the Inbound Server Appliers complete the current transactions in their queue. Then the Apply Network Receiver applies the transaction to the database in direct apply mode through Oracle Call Interface (OCI). While the direct apply is in progress, Inbound Server Appliers cannot perform any data apply until the direct apply operation commits.
Direct mode apply degrades the performance of
Integrated Replicat.
 
Attaches to a background process in the target database known as the database inbound server by means of a lightweight streaming interface.
 
Transmits the LCRs to the inbound server, which applies the data to the
target database.
 
The Inbound Server is a set of processes that run as part of the Oracle Database.
 
Apply Network Receiver
 
Receives the LCRs from the Replicat process.
 
Determines if the LCRs must be applied in direct mode, and if not sends them to the Reader.
 
Reader
 
Receives the LCRs from the Apply Network Receiver process.
 
Computes the dependencies among the transactions in the workload based on the constraints defined at the target database.
 
Coordinator
 
Coordinates multiple transactions and maintains order among the apply servers.
 
Applier
 
Performs the changes for the transactions received from Coordinator in parallel while preserving the original transaction atomicity.

 
 
The default configuration for Integrated Replicat is to have 4 Appliers always available for data apply, with the ability to start a total of 50 on-demand to handle heavy workloads.
 
This is controlled via two options to the DBOPTIONS INTEGRATEDPARAMS parameter options
 
PARALLELISM: Sets a minimum number of apply servers that can be used under normal conditions. The default setting is 4.
 
 
MAX_PARALLELISM: Limits the number of apply servers that can be used when the workload is heavy. This number is reduced again when the workload subsides. The default is 50.
 
As the workload increases, Appliers are spawned within the database to try and improve performance based upon a workload mean throughput calculation that occurs every 25 seconds.
 
If the workload decreases, these additional Appliers will be terminated as they become inactive.
 
 
This automatic tuning of the number of Appliers is enabled only when PARALLELISM is greater than 1 and MAX_PARALLELISM is greater than PARALLELISM.

 

Factors That Affect Performance

 

Below are several well-known and documented factors that adversely affect Oracle GoldenGate Apply performance. Refer to the Oracle GoldenGate documentation for additional information.

 
Tables without a primary key (PK) or not nullable unique index (UI).
 
All table columns are captured from the source and used for applying update and delete operations into the target. The database performs a full table scan for these operations, which negatively impacts performance.
 
Tables with nullable, non-validated, or invalid keys or unique indexes.
 
A key can become invalid as the result of an object reorganization or a number of other actions. By default, Oracle GoldenGate will not utilize invalid keys when applying update or delete operations, and instead uses all table columns for these operations.
 
 
There is an Oracle GoldenGate configuration option to override this functionality; however, by doing so you accept the risk that the target data may not be maintained accurately through replication.
 
Oracle GoldenGate will not use keys or indexes consisting of nullable columns as they are not viable candidates for uniquely identifying a data row. By default, Oracle GoldenGate will not utilize these keys/indexes when applying update or delete operations, and instead uses all table columns for these operations.
 
There is an Oracle GoldenGate configuration option to override this functionality; however, allowing Oracle GoldenGate to use a nullable key can cause data corruption, as Oracle treats each row with a NULL value as a key column and as a separate unique value.
 
 
This configuration option is not supported when using Integrated Replicat.
 
Tables with deduplicated LOB columns
 
Update operations for tables with deduplicated LOB columns are serialized due to the dependency calculation requirements of this data type.
 
Oracle Net services
 
For remote apply, the default Oracle Database 19c SDU setting is too small. It is recommended that the Oracle Database 19c Listener and Oracle Database Client 19c SDU settings be set to 2097152 bytes.
 
 
NOTE: The SDU setting of the Oracle Client used by OGG and the Oracle Database Listener must be the same as connections will automatically be made using the lower of the two settings.
 
Network latency
 
The Oracle GoldenGate hub must be in close network proximity to the target database with the expectation that network latency should not exceed 2-3ms
 
BATCHSQL
 
Enabling BATCHSQL is a best practice; however, if there are data reliability issues with the target database; i.e., you have REPERROR enabled, there will be a performance degradation anytime a data apply batch is retried.
 
 
For Integrated Replicat, the performance gain achieved by enabling BATCHSQL may be minimal given that the batch apply cannot be submitted to the database in parallel.
 
Missing patches
 
It’s important to be up to date with both database and Oracle GoldenGate patches as many issues affecting integrated performance have already been identified and addressed
 
 
Review MOS Note, Oracle Support Document 2193391.1 (Latest GoldenGate/Database (OGG/RDBMS) Patch recommendations) at https://mosemp.us.oracle.com/epmos/faces/DocumentDisplay?id=2193391.1 to ensure your systems are current.

Other factors specific to Integrated Replicat that affect its performance

 
Operations that can only be applied in direct mode:
 
DDL operations.
 
Sequence operations.
 
UDT operations.
 
If the extract uses USENATIVEOBJSUPPORT to capture the UDT, then Integrated Replicat will apply it with the inbound server, otherwise it will be handled by Replicat in direct mode. USENATIVEOBJSUPPORT is the default for Integrated Extract; however, there are several limitations. Refer to the Oracle GoldenGate documentation if your use case requires replication of UDT operations.
 
Specifying the SQLEXEC parameter within a MAP parameter.
 
EVENTACTIONS processing.
 
Operations with data dependencies.
 
These operations can only be applied serially.
 
Database resources
 
STREAMS_POOL_SIZE allocation and SGA shared memory.
 
By default, the Integrated Replicat inbound server is configured to use infinite shared memory.
 
If there are multiple Oracle GoldenGate integrated components (Extracts and/or Replicats) configured, then this default setting should be reduced; typically to 1 GB.
 
The DBOPTIONS INTEGRATEDPARAMS MAX_SGA_SIZE parameter controls the amount of shared memory used by the inbound server.
 
If the inbound server MAX_SGA_SIZE setting is altered from the default, then the database STREAMS_POOL_SIZE must be configured. The formula for determining how much STREAMS_POOL to allocate for Integrated Replicat is:
(MAX_SGA_SIZE setting * # of Integrated Replicats) + 25% head room
 
Insufficient STREAMS_POOL_SIZE or available SGA memory will result in poor Integrated Replicat performance.
 
CPU
 
Integrated Replicat functions as database processes. There must be sufficient CPU resource available.
 
Large source transactions.
 
Often referred to as “long running transactions”, these are transactions consisting of more than 15,100 operations.
 
Because Integrated Replicat does not have functionality allowing it to process large transactions in parallel it is important to reduce the number of operations per transaction to an amount where Integrated Replicat performance meets your operational requirements.
 
In general, for source database batch workloads, it is recommended to perform a commit every 15,000 operations.

 

Large Transactions

 

Source database transactions are comprised of one or more database operations. For Integrated Replicat, a large transaction is any transaction containing more than 15,100 source operations.

When Integrated Replicat is processing a transaction, the Logical Change Records (LCR) are queued in memory until the commit record is read from the Oracle GoldenGate Trail file. Once the commit record is processed, all of the queued LCRs are flushed from memory and applied to the database. If the transaction is large, memory pressure will be evident in the AWR and Integrated Healthcheck reports.

When a transaction exceeds 15,100 LCRs, and all transactions prior to the large transaction have been committed, a function called eager_size processing is enabled and one Integrated Replicat Applier begins applying the uncommitted LCRs to the database. This now becomes a barrier transaction that is applied serially.

When this happens all transactions after this large transaction are placed on hold until the commit record for the barrier transaction is processed. This is because the database is in an inconsistent state due to the apply of the large transaction, so parallel apply must be disabled.

EAGER_SIZE processing of large transactions severely impacts Integrated Replicat performance, and in many cases the Integrated Replicat will appear to be “hung” as it may not immediately respond to commands.

If this happens, we strongly advise that you do not kill the Integrated Replicat operating system pid. Killing the pid causes an immediate termination of the Integrated Replicat processes. Several bad things will now happen:

  1. The database must rollback all of the uncommitted operations Integrated Replicat applied.
  2. The Integrated Replicat did not checkpoint, so it is in an inconsistent state and must perform recovery calculations upon restart. This may take a significant amount of time depending upon the size of the transaction being processed at the time the pid was killed.
  3. After recovery, Integrated Replicat will once again queue and then begin applying the large transaction. Once 15,100 LCRs are read into memory; eager_size processing begins again and the large transaction becomes a barrier transaction.

 

Identifying Performance Issues

 

Now that we have identified some of the common things that can impact performance, let’s walk through some troubleshooting steps should you encounter a poorly performing Integrated Replicat.

 

Identifying Latency

 

Latency in a replication stream is the total time required to capture a source transaction and apply it to the target database. Latency is only a problem if this time exceeds your company’s published service level agreement (SLA) for data replication.

The Oracle GoldenGate Automatic Heartbeat is required for troubleshooting latency issues, and it is best practice to have this feature enabled. The Oracle GoldenGate Heartbeat places data in the target database that allows support personnel to quickly isolate the point in the replication stream causing the latency. With this information the time to troubleshoot and correct the latency issue is drastically reduced.

Basic Troubleshooting

While the performance issue is present, the first things you should always do are:

 
Run the Oracle GoldenGate Integrated Healthcheck utility.
 
Healthcheck is packaged with the product and is located in the [OGG HOME]/ lib/sql/healthcheck directory.
 
Run a 15 minute AWR snapshot report.
 
This ensures we have an activity report for a small window during which the incident is occurring.
 
Use the UTL_RPADV package (Oracle Database 19c and above) to create a streams advisory report

While waiting on the above to complete, review the Replicat report file in the [deployment home]/var/lib/report directory, or the Replicat log file in the [deployment home]/var/log directory. Check for messages similar to the following:

 
INFO    OGG-25222  Oracle GoldenGate Delivery for Oracle, MYREP.prm:  Large transaction completed. Total records: 4,412,764.
 
There are large transactions in the workload.
 
INFO    OGG-25221  Oracle GoldenGate Delivery for Oracle, MYREP.prm:  Processing transaction (XID: 9.18.93647, Seqno: 8, RBA: 367556950) larger than eager size (15,100).
 
There are large transactions in the workload and eager_size processing was invoked.
 
INFO    OGG-00482  DDL found, operation
 
There are DDL operations in the workload. DDL operations can only be applied in direct mode.
 
WARNING OGG-06439  No unique key is defined for table MYNOPKY. All viable columns will be used to represent the key, but may not guarantee uniqueness.
 
There are tables without a PK or UI in the workload. Operations for these tables can only be applied serially, which impacts Integrated Replicat performance.

For all of the above messages, except the large transaction, minor adjustments to Integrated Replicat and the database could be made that may improve performance.

An adjustment that could be made for the large transaction performance degradation would be to increase the Integrated Replicat EAGER_SIZE setting and increase the amount of database STREAMS_POOL_SIZE allocated for Integrated Replicat if there is sufficient, unallocated SGA in the database. However, this should be a short-term solution only as the proper, best practice resolution is to switch to Parallel Replicat.

If none of the above messages are seen in the report file or OGG logs, check the network latency from the OGG Hub to the database server. This may be done using ping, traceroute, or iperf3 operating system utilities.

When available, review the Healthcheck, AWR, and Streams Advisory output.

  1. Check the AWR for obvious signs of database contention.
  2. If the AWR does not show any issues within the database, check the Oracle GoldenGate Hub for resource contention.
  3. Use the top utility to review process resource consumption and load average.
  4. If using block storage, use the iostat utility to check i/o performance.
  5. Check for network saturation between the OGG Hub and the database server.

If in doubt about how to interpret the information, engage Oracle Support and they will be happy to assist.

Note: If you engage Oracle Support, or A-Team, they will require the following information: (1) the Integrated Healthcheck output, (2) the AWR report, (3) the Streams Advisory Report, (4) copies of the Integrated Replicat parameter and report files, (5) network latency information and (6) database error log data covering the period of the performance issue.

In most cases, the Healthcheck, AWR, and Streams Advisory reports will identify the root cause of the performance degradation.

 

Advanced Troubleshooting

 

If Healthcheck, AWR, and Streams Advisory Report do not uncover the cause of the performance degradation, additional database queries and operating system utilities may be required to identify the latency cause. In most cases where this is required A-Team has found that the issue is related to database resource limitations, or workload issues that negatively affect the Integrated Replicat Appliers; so we typically start at the database and work backwards towards the OGG Replicat (or right to left in the simplified architecture view we reviewed earlier).

For the following discussion there are three Oracle VirtualBox machines:

 
One contains the source Oracle Database 19c
 
2 CPUs, 8GB RAM, Oracle Enterprise Linux 7
 
One contains the target Oracle Database 19c
 
4 CPUs, 16GB RAM, Oracle Enterprise Linux 7
 
One contains the Oracle GoldenGate 19c Microservices for Oracle deployment
 
2 CPUs, 8GB RAM, Oracle Enterprise Linux 7

Integrated Replicat is configured as:

replicat myrep
useridalias pdb1_oeldbora2 domain OracleGoldenGate
dboptions integratedparams (max_sga_size 1024, max_parallelism 4)
dboptions enable_instantiation_filtering
ddl include mapped
ddloptions report
reportcount every 60 seconds, rate
map pdbeast.tpc.*, target pdb1.tpc.*;

 

Available Memory

Query the database to get the sga_max_size and sga_target settings.

show parameter sga

Response:

This database has 8GB of SGA memory configured and allocated.

Streams Pool

Query the database to determine how much streams pool is allocated and in use.

set wrap off
set trimout on
set trimspool on
set underline on
set linesize 100

column COMPONENT format a16 heading 'SGA|COMPONENT'
column CURRENT_SIZE heading 'CURRENT SIZE|(bytes)'
column MAX_SIZE heading 'MAX SIZE|(bytes)'

select component, current_size, max_size
  from v$sga_dynamic_components
  where component = 'streams pool';

Response:

The database has 2.6GB of streams_pool configured and allocated.

Full Table Scans

Check for replicated tables incurring full table scans.

set wrap off
set trimout on
set trimspool on
set underline on
set linesize 100

TTITLE LEFT 'Replicat Apply – Full Table Scans' SKIP 2
column OBJECT_OWNER format a20 heading 'SCHEMA'
column OBJECT_NAME format a20 heading 'TABLE|NAME'

select distinct object_owner, object_name
from dba_hist_sql_plan
where
  object_name is not null
  and operation='table access'
  and object_type='table'
  and object_owner<>'sys'
  and timestamp>sysdate-5
  and sql_id in ( select distinct sql_id
    from dba_hist_active_sess_history
    where action like 'OGG%'
    and sql_plan_operation='TABLE ACCESS'
    and sql_plan_options='FULL'
    and sample_time>sysdate-5)
group by object_owner, object_name order by 1,2;

The example response output shows that the table TPC.MYNOPKEY incurs full table scans when applied to the target database.

Inbound Server Applier

To get the Applier state and activity.

set wrap off
set trimout ON
set trimspool on
set underline on
set linesize 4000

column APPLY_NAME format a14 heading 'APPLY|NAME'
column PROCESS_NAME format a6 heading 'PROCESS|STATE'
column STATE format a20
column CURRENT_TXN format a10
column MESSAGE_SEQUENCE heading 'MESSAGE|SEQUENCE'
column TOTAL_MESSAGES_APPLIED heading 'TOTAL|MESSAGES|APPLIED'
column TOTAL_ASSIGNED heading 'TOTAL|ASSIGNED'
column PROCESS_NAME heading 'PROCESS|NAME'
column CURRENT_TXN heading 'CURRENT|TXN'

select a.apply_name,
      substr(s.program,instr(s.program,'(')+1,4) process_name,
      a.state,
      a.sid,
      a.serial#,
      a.total_assigned,
      a.total_messages_applied,
      a.xidusn||'.'||a.xidslt||'.'||a.xidsqn current_txn
      from gv$gg_apply_server a, gv$session s
      where a.sid = s.sid and
            a.serial# = s.serial# and
            a.inst_id = s.inst_id and
            a.con_id = s.con_id
           order by a.apply_name, a.server_id;

The query response shows each Applier process state, its sid and serial#, the number of LCRs currently assigned for apply, the total number of LCRs applied to the database, and the current database transaction.

From the above information we can get each Applier’s OS pid by converting the database sid to an operating system pid.

set underline OFF

column sid format 999999 heading 'DATABASE_SID'
column spid heading 'OS_PID'

select a.sid, b.spid
from v$session a, v$process b
where a.paddr = b.addr
and a.sid in (select sid from gv$gg_apply_server)
order by a.sid;

Query response example:

With the operating system pid, we can use the top utility at the database server to see operating system info for the Applier processes.

$ top –p10428,10434,10430,10432

In the example output, we can see that the Inbound Server Applier processes are neither CPU nor memory constrained.

Let’s take a look at this same information when a large transaction is being applied. We did a restart of the Integrated Replicat, and the only data being generated at the source is the OGG Heartbeat. Our Inbound Server Applier query returns:

The top utility on the database server shows:

After generating a large transaction of 15000 operations against one table in the source database, the Inbound Server Applier query shows that the data for all 15000 operations was applied by a single Inbound Server Applier.

The top utility on the database server shows an increase in memory consumption for the Applier OS process.

Let’s also take a look at what happens when a barrier transaction is encountered. For this example, we’ll run an OLTP operation simulator and then execute 25000 operations as a single transaction against a single source table while the OLTP simulator is running.

Starting the OLTP simulator, our Inbound Server Applier query shows the records being applied in parallel into the target database.

The top utility on the database server shows the Applier OS processes consuming both CPU and memory.

The Applier process’ Total Assigned and Total Messages Applied counters in the database query continue incrementing until the large transaction is encountered and the operation count within the transaction exceeds 15,100. When this happens, the Applier processes all go into the “IDLE” state and the barrier transaction records are applied to the database via a single Applier processes until the commit record is read from the Oracle GoldenGate Trail and the transaction completes in the database.

Once the commit record is processed and the transaction completes in the database, all Appliers resume processing OLTP operations. It’s important to note that the Applier Total Messages Applied counter will not increment until after the commit record in the Oracle GoldenGate Trail is applied.

You will also see an entry in the Integrated Replicat report file, and OGG logs stating a large transaction was processed.

Since the large transaction is applied as a barrier transaction, and the Applier Total Messages Applied counter does not increment, how can you tell it is really working on a large transaction. For this, we use the Oracle GoldenGate Logdump utility to get information from the Oracle GoldenGate Trail.

In order to use Logdump, we need to know what Trail the Integrated Replicat is processing. In the adminclient, use the info command:

In the above output, the Integrated Replicat is reading the Trail “my000000020” at the disk location “/u01/GoldenGate/Deployments/ORA193/Insecure/ogghubinsec/var/lib/data”.

The Logdump utilty is located in the $OGG_HOME/bin directory, and we start it from a Linux shell.

Issue the open command to access the Trail.

Issue the command detail data to enable viewing of information about each record in the Trail and then the count command to get the count of all records in the Trail per table (the output is too big to capture in a single screen snapshot below).

In the output above, the Trail contains records for the OGG Heartbeat and the table “pdbeast.tpc.dt_char”. As OGG Heartbeat is an OLTP operation, we can exclude those records from analysis. We can view the details of the records for the “pdbeast.tpc.dt_char” by issuing the filter include file command followed by the count command.

In the above output, the Trail contains 1 metadata record, 500 inserts, and 29999 updates for the table. There are 3 transactions in the Trail for the table, which averages to 10166 records per transaction. However, the single metadata record counts as 1 transaction, so we want to exclude it from the average; which gives us 2 transactions with an average of 15,249 operations per transaction. These two large transactions are considered barrier transactions and will be applied serially by a single Inbound Server Applier process.

Suppose you see the following when troubleshooting Integrated Replicat performance:

This means operations were encountered that Inbound Server Coordinator determined could only be applied serially; which is another form of a barrier transaction. All Appliers, except for the one performing the database apply, go into the “WAIT DEPENDENCY” state until the transaction commits. Workloads with dependencies should not be executed as large transactions as they cannot be efficiently applied to the target database.

Inbound Server Coordinator

To get the Inbound Server Coordinator state and activity.

set wrap off
set trimout on
set trimspool on
set underline on
set linesize 132

column apply_name format a14
column state format a20
column sid format 999999
column serial# format 99999999

select apply_name,sid,serial#,state,total_applied,total_wait_deps
from gv$gg_apply_coordinator;

Response example:

To get the OS pid of the database process.

set underline off

column sid format 999999 heading 'DATABASE_SID'
column spid heading 'OS_PID'

select a.sid, b.spid
from v$session a, v$process b
where a.paddr = b.addr
and a.sid in (select sid from gv$gg_apply_coordinator)
order by a.sid;

Response example:

With the OS pid, we can use the top utility to view system resource consumption.

Normally, the Inbound Server Coordinator will not be a factor when troubleshooting Integrated Replicat performance unless it consistently reports “FLOW CONTROL” as its state. Flow control means the Inbound Server Coordinator either (1) cannot keep pace with the number of messages being received from the Inbound Server Reader, or (2) it is being blocked from sending messages to the Inbound Coordinator Applier processes. If we see flow control being activated within the Coordinator process, we need to look at the Inbound Server Applier processes again.

Inbound Server Reader

Integrated Replicat SGA

Get the Integrated Replicat sga_allocated, and sga_used values. The name “SGA” in this context is referring to streams pool memory.

set wrap off
set trimout on
set trimspool on
set underline on
set linesize 100

column APPLY_NAME format a14 heading 'APPLY|NAME'
column SGA_ALLOCATED heading 'SGA|ALLOCATED'
column SGA_USED heading 'SGA|USED'

select apply_name, sga_allocated, sga_used from gv$gg_apply_reader;

Query response example:

The Integrated Replicat configuration specifies it may use up to 1GB of SGA (streams pool). This shows the Inbound Server Reader process has been allocated 4Mb of streams pool memory and is using 4Mb.

As activity increases, these values will increase up to the limit specified by the Integrated Replicat max_sga_size setting.

Inbound Server Reader Process

We can see details about the Inbound Server Reader process by executing this database query:

set wrap off
set trimout on
set trimspool on
set underline off
set linesize 400

column apply_name format a12 heading 'APPLY|NAME'
column state format a15
column sid format 999999
column serial# format 99999999
column sga_allocated heading 'SGA|ALLOCATED'
column sga_used heading 'SGA|USED'
column bytes_received format 999999999 heading 'BYTES|RECEIVED'
column total_in_memory_lcrs format 99999999 heading 'TOTAL|IN|MEMORY|LCRS'

select apply_name, state, sid, serial#, sga_allocated, sga_used, bytes_received, total_in_memory_lcrs
from gv$gg_apply_reader;

Output example:

When the Reader is processing incoming data, the “Bytes Received” counter will increment with each incoming data batch from the Replicat process (on the OGG Hub). “Total In Memory LCRS” reports the number of LCRs enqueued for the Inbound Server Coordinator; you will see this counter increment and decrement and LCRs are queued and removed.

To get the OS pid, run the database query:

set underline off

column sid format 999999 heading 'DATABASE_SID'
column spid heading 'OS_PID'

select a.sid, b.spid
from v$session a, v$process b
where a.paddr = b.addr<
and a.sid in (select sid from gv$gg_apply_reader)
order by a.sid;

Output example:

Then we can monitor the process resource consumption with top.

Inbound Server Reader performance may be impacted by (1) the amount of streams_pool allocated to Integrated Replicat in the database SGA, (2) the network between the Replicat and the Inbound Server Reader, including the Oracle Client on the OGG Hub, and (3) poorly functioning Inbound Server Coordinator or Inbound Server Applier processes.

To check network connectivity, we can use ping to get the average round trip time between the OGG Hub and the database server. As we pointed out previously, the network latency should not exceed 2 milliseconds.

A more accurate test of the network may be performed using the Linux iperf utility.

On our database server, we start iperf in server mode.

On the OGG Hub, we start iperf in client mode, specifying the hostname, or ipadress, of the target database server. A series of data packets are transferred between the two systems and iperf computes and reports the actual network bandwidth.

Earlier we mentioned that Inbound Server Reader performance may be affected by streams_pool. To illustrate this point, let’s change our Integrated Replicat to limit the amount of streams_pool SGA memory it can use, by altering the MAX_SGA_SIZE setting in the parameter file.

replicat myrep
useridalias pdb1_oeldbora2 domain OracleGoldenGate
dboptions integratedparams (MAX_SGA_SIZE 50, MAX_PARALLELISM 4)
DBOPTIONS ENABLE_INSTANTIATION_FILTERING
ddl include mapped
ddloptions report
reportcount every 60 seconds, rate
MAP pdbeast.tpc.*, TARGET pdb1.tpc.*;

Now, the Inbound Server Reader will only be allocated, and allowed to use a maximum of 50 Mb of the available streams_pool. With the Integrated Running and our OLTP simulator running in the source database, our database query shows:

At the OGG Hub, we now have high latency reported:

Inbound Server Apply Network Receiver

To see details about the Inbound Server Apply Network Receiver, execute this database query:

set wrap off
set trimout on
set trimspool on
set underline off
set linesize 120

column apply_name format a10 heading 'APPLY|NAME'
column os_process_id format 999999 heading 'OS|PID'
column total_messages_received format 999999999 heading ‘TOTAL|MESSAGES|RECEIVED’
column state format a35
 
select apply_name, os_process_id, total_messages_received, state
from gv$gg_apply_receiver;

Output example:

In the output example above, our Integrated Replicat was idle, in that the only records being processed were generated by the Oracle GoldenGate Heartbeat. Let’s generate an OLTP workload and look at the output.

With our OLTP workload generator running, we see the number of total messages received incrementing while the Apply Network Receiver states changes from waiting to enqueuing states.

But what if the query response looks something like this:

The total messages received count is incrementing, but the Apply Network Receiver state says it is waiting on memory. The top utility shows there is no memory pressure in the server:

If we’re not memory constrained at the server level, we need to take a look at the SGA Streams Pool allocated and in use. We run the query presented previously to check the amount of SGA allocated and in use by the Inbound Server Reader and see the process is memory constrained.

In this case, the problem resolution is to alter the Integrated Replicat parameter file and increase the amount of SGA allocated via the setting DBOPTIONS INTEGRATEDPARAMS MAX_SGA_SIZE.

If the Inbound Server Apply Network Receiver shows a state of “SQL*met more data from client”, this is a symptom of network resource issues. This is typically a symptom of a slow network, check the network latency and bandwidth to ensure they meet the requirements for remote apply.

Replicat Process

We’ve now covered all of the “integrated” components of Integrated Replicat; however, there is still one more piece to check, the Replicat process running at the OGG Hub.

The Linux shell command ps –C replicat will return the Replicat process pid.

If you have more than one Replicat running, you can use ps and grep to find the pid of the specific Replicat.

We can use top to view system resource consumption of the Replicat process.

Along with network latency, i/o latency can affect performance of the Replicat process. We can use the Linux dd command to check the i/o performance of the disk housing the Oracle GoldenGate Trails.

First, use dd to perform a write test and create a 512 Mb file on disk where the Trails are located.

The write test took 3.5 seconds to complete. Then use dd to read the file we created.

The read test completed in .06 seconds (Note: We did not flush cache before doing the read, as it impacts everything on the server. We could have specified iflag=direct in the dd command to bypass the buffer cache and read direct from disk).

Trace Replicat Activity

If Integrated Replicat appears to be “hung”, you can use the strace utility to see the activity of the Replicat process on the OGG Hub.

As shown previously, get the Replicat pid, then run the strace utility specifying the Replicat pid.

$ strace -rp [pid]

Sample output:

Ideally you want to see Replicat reading and writing, anything else may require further investigation by an Oracle GoldenGate subject matter expert.

Conclusion

In this article we discussed factors that could adversely affect Integrated Replicat performance and presented basic troubleshooting information that should help you identify the problem root cause. We also pointed out that as of Oracle GoldenGate 19c it is best practice to use Parallel Replicat for data apply.

Join the discussion

Comments ( 1 )
  • Victor Zou Friday, July 30, 2021
    Very informative and helpful! Read it thru A-Z and followed all the steps to trouble-shoot my seemingly hung IR with big transactions of millions of records/transaction, many deadlocks (error 60), duplicate records (error 1) and 250 hours of trailfiles accumulated for initial load. The issues are mainly related to a few huge IOT tables that account for over 90% of the total db of 5TB. Thanks Loren!
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha