X

Best Practices from Oracle Development's A‑Team

Oracle GoldenGate: Monitoring "Things" That Affect Replication Stability

Introduction

There have been a lot of escalations to A-Team where Oracle GoldenGate outages could have been avoided if System and Database Administrators had been properly monitoring their environments. In this article we're going to discuss various areas that should be monitored regularly to avoid degradations in performance and unscheduled outages.

The concepts, scripts, and information presented in this article are for educational purposes only. They are not supported by Oracle Development or Support, and come with no guarantee or warrant for functionality in any environment other than the test system used to prepare this article. Before applying any changes presented in this article to your environment, you should thoroughly test to assess functionality and performance implications.

Main Article

The following discussion assumes the reader has a good understanding of system and database administration, Oracle GoldenGate Classic release 12.3 architecture, and the bash scripting language.

Monitoring Functionality Within Oracle GoldenGate

Oracle GoldenGate may be configured to monitor its environment and report error and performance degradation conditions.

Log Errors to SYSLOG

In Oracle GoldenGate 12.3, event logging to the system logs (syslog) is turned off by default. Logging functionality is controlled via the settings within the file $OGG_HOME/diretc/ogg-ggserr.xml (NOTE: In Oracle GoldenGate Microservices Architecure, this file is located in the $ORACLE_HOME/lib/utl/logging directory). To enable error logging to the syslog, edit the file and uncomment the section ggserr.Syslog as show below:

  <!--    !- The ggserr.Syslog adapter will write all messages to the local    !- system log.    !- -->   <appender name="ggserr.Syslog"   class="SyslogAppender">     <layout class="PatternLayout">       <param name="Pattern" value="%-19.19d{%Y-%m-%d %H:%M:%S}  %m%n"/>     </layout>   </appender>   <!-- ! -->

Then add the following before the line containing the text </configuration> to activate logging of warning and error messages to SYSLOG:

<logger name="ggserr.log" additivity="false">
<appender-ref  name="ggserr.Syslog"/>
<level value="warn"/>
</logger>

Stop and start the Oracle GoldenGate Manager to activate error logging to the syslog:

[oracle@ora12nomt log]$ sudo tail -f messages
Oct 23 11:58:52 ora12nomt extract[20452]: 2018-10-23 11:58:52  ERROR   OGG-01044  Oracle GoldenGate Capture for Oracle, etpc.prm:  The trail './dirdat/ett' is not assigned to extract 'ETPC'. Assign the trail to the extract with the command "ADD EXTTRAIL/RMTTRAIL ./dirdat/ett, EXTRACT ETPC".
Oct 23 11:58:52 ora12nomt extract[20452]: 2018-10-23 11:58:52  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, etpc.prm:  PROCESS ABENDING.

Report Lag

If you monitor for lag, you must use the Oracle GoldenGate Integrated Heartbeat. There is an A-Team blog on this here.

Along with Integrated Heartbeat, the Oracle GoldenGate Manager may be configured to monitor and report lag conditions via three optional parameters LAGREPORT, LAGINFO, and LAGCRITICAL. Before setting these options, it is imperative that a realistic enterprise policy for replication lag be created that specifically states (1) acceptable replication lag and (2) the point at which replication lag becomes an error condition.

Notice we used the term "realistic" to describe the replication lag policy. This is the key as everyone wants their data replicated as quickly as possible; however, is it "realistic" to set a policy that lag exceeding 2 seconds for a replication stream between Tokyo and Manhattan constitutes a critical error? Not really.

Below is an example of enabling lag reporting within the Oracle GoldenGate Manager:

--
-- Listener on this port
port 19000
--
-- Cleanup consumed OGG Trails
purgeoldextracts ./dirdat/*, usecheckpoints
--
-- Allow connections to the OGG SERVER from the denoted
-- source system only
accessrule, prog server, ipaddr 192.168.120.44, allow, pri 1
--
-- Check for lag every 30 minutes
LAGREPORTMINUTES 30
--
-- The acceptable lag point
LAGINFOMINUTES 30
--
-- The point at which lag becomes an error condition
LAGCRITICALMINUTES 60

With the above settings, an error message will be written to SYSLOG for any Oracle GoldenGate process with 60, or more, minutes of backlog (lag):

Oct 23 12:53:19 ora12nomt mgr[24356]: 2018-10-23 12:53:19  WARNING OGG-00947  Oracle GoldenGate Manager for Oracle, mgr.prm:  Lag for EXTRACT ETPC is 00:62:09 (checkpoint updated 00:00:03 ago).

Check for Down Groups

Set the Oracle GoldenGate Manager to check and report on Groups in the ABEND state that have not been restarted by setting the parameter DOWNREPORT:

--
-- Every 60 minutes, check for Groups that are not in the RUNNING state
DOWNREPORTMINUTES 60

Every 60 minutes Manager will check the Oracle GoldenGate instance and log an error to SYSLOG for any in the ABEND state (Groups in the STOPPED state will not be reported to SYSLOG):

Oct 23 13:18:33 ora12nomt mgr[25485]: 2018-10-23 13:18:33  WARNING OGG-00946  Oracle GoldenGate Manager for Oracle, mgr.prm:  EXTRACT P12KVM abended.

Server Resources

Server resources can adversely affect Oracle GoldenGate; so we should be monitoring disk space, free memory, and network latency between the source and target Oracle GoldenGate servers. The following bash script, which I named OGGEnvCheck.sh, will write a customized error message to SYSLOG whenever any disk exceeds 75% capacity, free memory is less than 1GB, and average round trip time for ping exceeds 10ms. (Note: ping will not work between servers behind a firewall.)

#!/bin/bash ## Configurable by user # Target OGG hosts, ipaddress or DNS name to be used by ping command. readonly TARGET_HOSTS=("ora12kvm" "ora11kvm" "c3p0") ## DO NOT CHANGE THESE VALUES readonly SCRIPT_NAME=$(basename $0) readonly MAX_DISK_USED="75"   ## % disk used reported from df -h command readonly MAX_RTT="10"         ## average rtt time from ping readonly MIN_FREE_MEM="1000"  ## available memory from free -m command function err() {   logger -p user.error -t $SCRIPT_NAME "$@" } function check_disk_space {    df -h | tail -n+2 | while read fs size used avail usedpct mountp ; do       upct=${usedpct::-1}       if [[ "$upct" -gt "$MAX_DISK_USED" ]] && [[ "$mountp" != "/boot" ]]; then         err "OGG-99999: Mount point \"$mountp\" is $usedpct full."       fi    done } function check_free_mem {    free -m | tail -n-2 | while read mem total used mfree shared cache avail ; do       mem=${mem::-1}       if [[ "$mem" =~ "Mem" ]]; then         if [[ "$avail" -lt "$MIN_FREE_MEM" ]]; then           err "OGG-99999: Low available memory: total mem: $total Mb, available memory: $avail Mb."         fi       fi       if [[ "$mem" =~ "Swap" ]]; then         if [[ "$mfree" -lt "$MIN_FREE_MEM" ]]; then           err "OGG-99999: Low available swap: total swap: $total Mb, used: $used Mb, free: $mfree Mb."         fi       fi    done } function ping_targets {    for i in "${TARGET_HOSTS[@]}"; do       ping $i -c 10 | tail -n-2 | while read line ; do         if [[ "$line" =~ "100% packet loss" ]]; then           err "OGG-99999: OGG target destination, $i, is unreachable."         fi         if [[ "$line" =~ "rtt min/avg/max/mdev" ]]; then           # strip eveyrthing before and including the "=" character           l=${line#*=}           # strip the last 3 characters from the line           l=${l::-3}           # split the line into an array of values           IFS='/'           read -ra RTT <<< "$l"           # array value "1" is the average rtt.           # cannot use floating point numbers in bash math, so strip           # everything after the decimal point.           v=${RTT[1]}           v="${v%%.*}"           if [[ "$v" -gt "$MAX_RTT" ]]; then             err "OGG-99999: Average network latency to OGG target destination, $i, is ${RTT[1]} ms."           fi         fi       done    done } ## MAIN check_disk_space check_free_mem ping_targets

An example of the output written by this script to SYSLOG is below:

Oct 23 13:52:42 ora12nomt OGGEnvCheck.sh: OGG-99999: Mount point "/" is 78% full.
Oct 23 13:53:19 ora12nomt OGGEnvCheck.sh: OGG-99999: OGG target destination, c3p0, is unreachable.

Database Resources

For the Oracle Database, we should regularly check archive log space consumption as fully consumed archive space will not only have a negative impact upon the replication stream; it will bring the database to a grinding halt. The Oracle Database query below, which I named OGGEnvCheck.sql, will log an error when consumed archive log space exceeds 75% of capacity.

-- Create a location for the output (this directory must exist) CREATE OR REPLACE DIRECTORY MYERRLOC AS '/tmp'; DECLARE   v_ArchiveUsed     NUMBER(3);   v_ReadSCN1        NUMBER(10);   v_ReadSCN2        NUMBER(10);   v_GapSeq          VARCHAR(15);   v_OutputString    VARCHAR(150);   v_OutLoc          VARCHAR2(100) := 'MYERRLOC';   fHandle  UTL_FILE.FILE_TYPE; BEGIN -- Open the "log file" fHandle := UTL_FILE.FOPEN(v_OutLoc, 'OGGEnvCheckOut', 'w'); -- Check archivelog space. Report error if used space is more than 75%.. SELECT percent_space_used into v_ArchiveUsed FROM V$RECOVERY_AREA_USAGE where file_type = 'ARCHIVED LOG'; IF v_ArchiveUsed > 75 THEN   v_OutputString := 'OGG-99999: Archive Log space used is ';   v_OutputString := CONCAT (v_OutputString, v_ArchiveUsed);   v_OutputString := CONCAT (v_OutputString, '%.');   UTL_FILE.PUT(fHandle, v_OutputString); END IF; -- Close the "log file" and exit UTL_FILE.FCLOSE(fHandle); END; /

The file created by this script, /tmp/OGGEnvCheckOut, will be processed by a bash script we'll discuss in a bit.

If Downstream Capture is used in the replication stream, we also need to check the database to ensure Redo Transport is working properly and for gaps in the archive logs at the downstream database. The queries below, from the file OGGDBTransportCheck.sql, query the gv$logmnr_session Database View to ensure Integrated Extract has Redo data to process and obtain Archive Log information to determine if gaps exist (the Downstream Mining Extract will hang if gaps exist in the archive logs).

-- Check Redo Transport and archive log gaps -- -- Redo transport -- Spool the output to files we'll parse later SPOOL /tmp/readscn_a.out REPLACE SELECT session_name||' '||read_scn FROM gv$logmnr_session ORDER BY session_name ASC; SPOOL OFF -- Sleep 5 seconds host sleep 5 -- SPOOL /tmp/readscn_b.out REPLACE SELECT session_name||' '||read_scn FROM gv$logmnr_session ORDER BY session_name ASC; SPOOL OFF -- -- Check for archive log gap SPOOL /tmp/gapchk.out REPLACE SELECT consumer_name||' '||thread#||' '||name PREVIOUS_ARCHIVELOG  FROM (SELECT consumer_name,thread#,name,sequence# ids,         LEAD (sequence#, 1, NULL)         OVER (ORDER BY consumer_name,thread#,sequence# ASC) lead_no         FROM DBA_REGISTERED_ARCHIVED_LOG where purgeable='NO'); SPOOL OFF -- -- END of Downstream Capture checks

When this script is run, two files will be created in the /tmp directory that contain the read_scn value of all Mining Integrated Extracts:

[oracle@r2d2 tmp]$ cat readscn_a.out
SESSION_NAME||''||READ_SCN
-------------------------------------------------------------------------
OGG$CAP_ETPC 2769449
OGG$CAP_ETST 2769449

[oracle@r2d2 tmp]$ cat readscn_b.out
SESSION_NAME||''||READ_SCN
-------------------------------------------------------------------------
OGG$CAP_ETPC 2769459
OGG$CAP_ETST 2769459

This script also creates a file in the /tmp directory that contains information about the Archive Logs required for each Mining Integrated Extract.

[oracle@r2d2 tmp]$ cat gapchk.out
PREVIOUS_ARCHIVELOG
--------------------------------------------------------------------------------
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_71_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_72_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_73_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_74_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_71_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_72_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_73_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_74_911991262.dbf

We then use a bash script, OGGEnvReportDB.sh, to process the data retrieved from the Oracle Database and log any error conditions to SYSLOG:

#!/bin/bash ## Report any issues found during the database checks readonly ERRFILE="/tmp/OGGEnvCheckOut" readonly ARCFILE="/tmp/gapchk.out" readonly SCNFILEA="/tmp/readscn_a.out" readonly SCNFILEB="/tmp/readscn_b.out" readonly SCRIPT_NAME=$(basename $0) function err() {   logger -p user.error -t $SCRIPT_NAME "$@" } function readErrFile() {   ## Check the error file exists   if [[ -f $ERRFILE ]]; then       ## Read the file       while read LINE; do         if [[ "$LINE" =~ "OGG" ]]; then           err $LINE         fi       done < $ERRFILE       ## Clear out the file       echo " " > $ERRFILE   fi } function chkReadSCN() {   mynamea=''   mynameb=''   myscna=''   myscnb=''   ## Check the output files exist   if [[ -f $SCNFILEA ]] && [[ -f $SCNFILEB ]]; then       ## Read the file data into arrays       while read LINEA; do         if [[ "$LINEA" =~ "OGG" ]]; then           l=${LINEA% *}           mynamea+=("$l")           l=${LINEA#* }           myscna+=("$l")         fi       done < $SCNFILEA       while read LINEB; do         if [[ "$LINEB" =~ "OGG" ]]; then           l=${LINEB% *}           mynameb+=("$l")           l=${LINEB#* }           myscnb+=("$l")         fi       done < $SCNFILEB       #Process the array data and check if the read_scn is incrementing       for ((i=1; i< ${#mynamea[@]};i++)); do         namea=${mynamea[i]}         scna=${myscna[i]}         nameb=${mynameb[i]}         scnb=${myscnb[i]}         if [[ "$namea" = "$nameb" ]]; then           if [[ "$scna" -ge "$scnb" ]]; then             err "OGG-99999: Extract $namea is waiting for Redo. Check the upstream Redo Transport."           fi         fi       done       ## Clear out the files       echo " " > $SCNFILEA       echo " " > $SCNFILEB   fi } function chkArcGap() {   lastproc=''   lastthread=''   lastarch='0'   curproc=''   curthread=''   curarch=''   LINEARRAY=''   ## Check the archive gap results file exists   if [[ -f $ARCFILE ]]; then       ## Read the file       while read LINE; do         if [[ "$LINE" =~ "OGG" ]]; then          ## Split the line into its fields          curproc=$(echo $LINE | cut -d' ' -f1)          curthread=$(echo $LINE | cut -d' ' -f2)          curarch=$(echo $LINE | cut -d' ' -f3)          # Strip the last underscore from the archive          curarch=${curarch%_*}          # Strip the leading underscores from the archive          # This gives the archive sequence umber          curarch=${curarch#*_}   curarch=${curarch#*_}          # If the same reader and thread, check for an archive gap          if [[ "$lastproc" = "$curproc" ]] && [[ "$lastthread" = "$curthread" ]]; then            if [[ "$lastarch" -gt "0" ]]; then             archdiff=$(echo $(( curarch - lastarch )))             if [[ "$archdiff" -gt "1" ]]; then              err "OGG-99999: There is an archive log gap. Please check redo transport."             fi            fi            lastarch=$(echo $curarch)          else           lastproc=$(echo $curproc)           lastthread=$(echo $curthread)           lastarch=$(echo $curarch)          fi         fi       done < $ARCFILE       ## Clear out the file       echo " " > $ARCFILE   fi } ## MAIN chkReadSCN chkArcGap readErrFile

When this script is executed, error conditions will be written to SYSLOG.

Oct 24 09:20:03 r2d2 OGGEnvReportDB.sh: OGG-99999: There is an archive log gap. Please check redo transport.

Automating Script Execution

To automate the execution of these checks, I chose to use the Oracle Database DBMS_SCHEDULER.CREATE_JOB procedure to run the checks hourly. The pl/sql script SetupOGGEnvCheck.sql is used to create the jobs:

-- Must be run as SYSDBA -- sqlplus / as sysdba -- Set the necessary privileges GRANT CREATE JOB TO ggadmin; GRANT CREATE EXTERNAL JOB TO ggadmin; GRANT CREATE CREDENTIAL TO ggadmin; -- Create a credential so script is run as the -- correct OS user on the database server. CONN ggadmin/Oracle1 BEGIN   DBMS_CREDENTIAL.create_credential(     credential_name => 'oracle_ogg',     username        => 'oracle',     password        => 'Oracle1'   ); END; / -- Delete jobs if they already exist DECLARE   l_job_name VARCHAR2(30);   l_job_exists number; BEGIN   l_job_name := 'OGG_ENV_CHECK';   select count(*) into l_job_exists      from user_scheduler_jobs     where job_name = l_job_name           ;    if l_job_exists = 1 then       DBMS_SCHEDULER.drop_job(job_name => l_job_name);    end if;   l_job_name := 'OGG_DB_CHECK';   select count(*) into l_job_exists      from user_scheduler_jobs     where job_name = l_job_name           ;    if l_job_exists = 1 then       DBMS_SCHEDULER.drop_job(job_name => l_job_name);    end if;   l_job_name := 'OGG_DB_TRANSPORT_CHECK';   select count(*) into l_job_exists      from user_scheduler_jobs     where job_name = l_job_name           ;    if l_job_exists = 1 then       DBMS_SCHEDULER.drop_job(job_name => l_job_name);    end if;    l_job_name := 'OGG_DB_CHECK_REPORT';   select count(*) into l_job_exists      from user_scheduler_jobs     where job_name = l_job_name           ;    if l_job_exists = 1 then       DBMS_SCHEDULER.drop_job(job_name => l_job_name);    end if; END; / -- Create a job with in-line scripts. DECLARE   l_job_name VARCHAR2(30);   l_script   VARCHAR2(32767); BEGIN -- Check disk space and memory   l_job_name := 'ogg_env_check';   l_script := '/u01/oracle/OGGEnvCheck/OGGEnvCheck.sh';   DBMS_SCHEDULER.create_job(     job_name        => l_job_name,     job_type        => 'EXTERNAL_SCRIPT',     job_action      => l_script,     credential_name => 'oracle_ogg',     start_date      => systimestamp + interval '10' second,     repeat_interval => 'FREQ=HOURLY;INTERVAL=1',     enabled         => TRUE   ); END; / -- -- Check archive log space DECLARE   l_job_name VARCHAR2(30);   l_script   VARCHAR2(32767); BEGIN -- Check database features   l_job_name := 'ogg_db_check';   l_script := '/u01/oracle/OGGEnvCheck/OGGEnvCheck.sql';   DBMS_SCHEDULER.create_job(     job_name        => l_job_name,     job_type        => 'SQL_SCRIPT',     job_action      => l_script,     credential_name => 'oracle_ogg',     start_date      => systimestamp + interval '10' second,     repeat_interval => 'FREQ=HOURLY;INTERVAL=1',     enabled         => TRUE   ); END; / -- -- Check Redo Transport (Downstream Capture) -- -- Uncomment the following if using downstream capture --DECLARE --  l_job_name VARCHAR2(30); --  l_script   VARCHAR2(32767); --BEGIN -- Check database features --  l_job_name := 'ogg_db_check'; --  l_script := '/u01/oracle/OGGEnvCheck/OGGDBTransportCheck.sql'; --  DBMS_SCHEDULER.create_job( --    job_name        => l_job_name, --    job_type        => 'SQL_SCRIPT', --    job_action      => l_script, --    credential_name => 'oracle_ogg', --    start_date      => systimestamp + interval '30' second, --    repeat_interval => 'FREQ=HOURLY;INTERVAL=1', --    enabled         => TRUE --  ); --END; --/ DECLARE   l_job_name VARCHAR2(30);   l_script   VARCHAR2(32767); BEGIN -- Report any errors from database check   l_job_name := 'ogg_db_check_report';   l_script := '/u01/oracle/OGGEnvCheck/OGGEnvReportDB.sh';   DBMS_SCHEDULER.create_job(     job_name        => l_job_name,     job_type        => 'EXTERNAL_SCRIPT',     job_action      => l_script,     credential_name => 'oracle_ogg',     start_date      => systimestamp + interval '300' second,     repeat_interval => 'FREQ=HOURLY;INTERVAL=1',     enabled         => TRUE   ); END; /

With the hourly execution of the scripts and the GoldenGate Manager settings, we now have a good representation of our server and replication stream health. If the enterprise has processes in place that scrape SYSLOG, this information may be added for automated notifications.

Summary

In this article we presented information on how to setup Oracle GoldenGate release 12.3 and above to record warning and error conditions to SYSLOG, presented Oracle GoldenGate Manager settings to log excessive replication lag errors, presented Oracle GoldenGate Manager settings to report on Groups in the ABEND state, and provided scripting examples to log server and database resource issues that affect replication stability.

For more information on what other articles are available for Oracle GoldenGate please view our index page.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha

Recent Content