Oracle GoldenGate: Monitoring “Things” That Affect Replication Stability

Introduction

There have been a lot of escalations to A-Team where Oracle GoldenGate outages could have been avoided if System and Database Administrators had been properly monitoring their environments. In this article we’re going to discuss various areas that should be monitored regularly to avoid degradations in performance and unscheduled outages.

The concepts, scripts, and information presented in this article are for educational purposes only. They are not supported by Oracle Development or Support, and come with no guarantee or warrant for functionality in any environment other than the test system used to prepare this article. Before applying any changes presented in this article to your environment, you should thoroughly test to assess functionality and performance implications.

Main Article

The following discussion assumes the reader has a good understanding of system and database administration, Oracle GoldenGate Classic release 12.3 architecture, and the bash scripting language.

Monitoring Functionality Within Oracle GoldenGate

Oracle GoldenGate may be configured to monitor its environment and report error and performance degradation conditions.

Log Errors to SYSLOG

In Oracle GoldenGate 12.3, event logging to the system logs (syslog) is turned off by default. Logging functionality is controlled via the settings within the file $OGG_HOME/diretc/ogg-ggserr.xml (NOTE: In Oracle GoldenGate Microservices Architecure, this file is located in the $ORACLE_HOME/lib/utl/logging directory). To enable error logging to the syslog, edit the file and uncomment the section ggserr.Syslog as show below:

  <!--
   !- The ggserr.Syslog adapter will write all messages to the local
   !- system log.
   !- -->
  <appender name="ggserr.Syslog"   class="SyslogAppender">
    <layout class="PatternLayout">
      <param name="Pattern" value="%-19.19d{%Y-%m-%d %H:%M:%S}  %m%n"/>
    </layout>
  </appender>
  <!-- ! -->

Then add the following before the line containing the text </configuration> to activate logging of warning and error messages to SYSLOG:

<logger name=”ggserr.log” additivity=”false”>
<appender-ref  name=”ggserr.Syslog”/>
<level value=”warn”/>
</logger>

Stop and start the Oracle GoldenGate Manager to activate error logging to the syslog:

[oracle@ora12nomt log]$ sudo tail -f messages
Oct 23 11:58:52 ora12nomt extract[20452]: 2018-10-23 11:58:52  ERROR   OGG-01044  Oracle GoldenGate Capture for Oracle, etpc.prm:  The trail ‘./dirdat/ett’ is not assigned to extract ‘ETPC’. Assign the trail to the extract with the command “ADD EXTTRAIL/RMTTRAIL ./dirdat/ett, EXTRACT ETPC”.
Oct 23 11:58:52 ora12nomt extract[20452]: 2018-10-23 11:58:52  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, etpc.prm:  PROCESS ABENDING.

Report Lag

If you monitor for lag, you must use the Oracle GoldenGate Integrated Heartbeat. There is an A-Team blog on this here.

Along with Integrated Heartbeat, the Oracle GoldenGate Manager may be configured to monitor and report lag conditions via three optional parameters LAGREPORT, LAGINFO, and LAGCRITICAL. Before setting these options, it is imperative that a realistic enterprise policy for replication lag be created that specifically states (1) acceptable replication lag and (2) the point at which replication lag becomes an error condition.

Notice we used the term “realistic” to describe the replication lag policy. This is the key as everyone wants their data replicated as quickly as possible; however, is it “realistic” to set a policy that lag exceeding 2 seconds for a replication stream between Tokyo and Manhattan constitutes a critical error? Not really.

Below is an example of enabling lag reporting within the Oracle GoldenGate Manager:


— Listener on this port
port 19000

— Cleanup consumed OGG Trails
purgeoldextracts ./dirdat/*, usecheckpoints

— Allow connections to the OGG SERVER from the denoted
— source system only
accessrule, prog server, ipaddr 192.168.120.44, allow, pri 1

— Check for lag every 30 minutes
LAGREPORTMINUTES 30

— The acceptable lag point
LAGINFOMINUTES 30

— The point at which lag becomes an error condition
LAGCRITICALMINUTES 60

With the above settings, an error message will be written to SYSLOG for any Oracle GoldenGate process with 60, or more, minutes of backlog (lag):

Oct 23 12:53:19 ora12nomt mgr[24356]: 2018-10-23 12:53:19  WARNING OGG-00947  Oracle GoldenGate Manager for Oracle, mgr.prm:  Lag for EXTRACT ETPC is 00:62:09 (checkpoint updated 00:00:03 ago).

Check for Down Groups

Set the Oracle GoldenGate Manager to check and report on Groups in the ABEND state that have not been restarted by setting the parameter DOWNREPORT:


— Every 60 minutes, check for Groups that are not in the RUNNING state
DOWNREPORTMINUTES 60

Every 60 minutes Manager will check the Oracle GoldenGate instance and log an error to SYSLOG for any in the ABEND state (Groups in the STOPPED state will not be reported to SYSLOG):

Oct 23 13:18:33 ora12nomt mgr[25485]: 2018-10-23 13:18:33  WARNING OGG-00946  Oracle GoldenGate Manager for Oracle, mgr.prm:  EXTRACT P12KVM abended.

Server Resources

Server resources can adversely affect Oracle GoldenGate; so we should be monitoring disk space, free memory, and network latency between the source and target Oracle GoldenGate servers. The following bash script, which I named OGGEnvCheck.sh, will write a customized error message to SYSLOG whenever any disk exceeds 75% capacity, free memory is less than 1GB, and average round trip time for ping exceeds 10ms. (Note: ping will not work between servers behind a firewall.)

#!/bin/bash

## Configurable by user
# Target OGG hosts, ipaddress or DNS name to be used by ping command.
readonly TARGET_HOSTS=("ora12kvm" "ora11kvm" "c3p0")

## DO NOT CHANGE THESE VALUES
readonly SCRIPT_NAME=$(basename $0)
readonly MAX_DISK_USED="75"  	## % disk used reported from df -h command
readonly MAX_RTT="10"        	## average rtt time from ping
readonly MIN_FREE_MEM="1000" 	## available memory from free -m command

function err() {
  logger -p user.error -t $SCRIPT_NAME "$@"
}


function check_disk_space {
   df -h | tail -n+2 | while read fs size used avail usedpct mountp ; do
      upct=${usedpct::-1}
      if  "$upct" -gt "$MAX_DISK_USED"  &&  "$mountp" != "/boot" ; then
        err "OGG-99999: Mount point \"$mountp\" is $usedpct full." 
      fi
   done
}

function check_free_mem {
   free -m | tail -n-2 | while read mem total used mfree shared cache avail ; do
      mem=${mem::-1}
      if  "$mem" =~ "Mem" ; then
        if  "$avail" -lt "$MIN_FREE_MEM" ; then
          err "OGG-99999: Low available memory: total mem: $total Mb, available memory: $avail Mb."
        fi
      fi
      if  "$mem" =~ "Swap" ; then
        if  "$mfree" -lt "$MIN_FREE_MEM" ; then
          err "OGG-99999: Low available swap: total swap: $total Mb, used: $used Mb, free: $mfree Mb." 
        fi
      fi
   done
}

function ping_targets {
   for i in "${TARGET_HOSTS[@]}"; do
      ping $i -c 10 | tail -n-2 | while read line ; do
        if  "$line" =~ "100% packet loss" ; then
          err "OGG-99999: OGG target destination, $i, is unreachable."
        fi
        if  "$line" =~ "rtt min/avg/max/mdev" ; then
          # strip eveyrthing before and including the "=" character
          l=${line#*=}
          # strip the last 3 characters from the line
          l=${l::-3}
          # split the line into an array of values
          IFS='/'
          read -ra RTT <<< "$l"
          # array value "1" is the average rtt.
          # cannot use floating point numbers in bash math, so strip
          # everything after the decimal point.
          v=${RTT[1]}
          v="${v%%.*}" 
          if  "$v" -gt "$MAX_RTT" ; then
            err "OGG-99999: Average network latency to OGG target destination, $i, is ${RTT[1]} ms."
          fi
        fi 
      done
   done
}

## MAIN
check_disk_space
check_free_mem
ping_targets

An example of the output written by this script to SYSLOG is below:

Oct 23 13:52:42 ora12nomt OGGEnvCheck.sh: OGG-99999: Mount point “/” is 78% full.
Oct 23 13:53:19 ora12nomt OGGEnvCheck.sh: OGG-99999: OGG target destination, c3p0, is unreachable.

Database Resources

For the Oracle Database, we should regularly check archive log space consumption as fully consumed archive space will not only have a negative impact upon the replication stream; it will bring the database to a grinding halt. The Oracle Database query below, which I named OGGEnvCheck.sql, will log an error when consumed archive log space exceeds 75% of capacity.

-- Create a location for the output (this directory must exist)
CREATE OR REPLACE DIRECTORY MYERRLOC AS '/tmp';
DECLARE
  v_ArchiveUsed     NUMBER(3);
  v_ReadSCN1        NUMBER(10);
  v_ReadSCN2        NUMBER(10);
  v_GapSeq          VARCHAR(15);
  v_OutputString    VARCHAR(150);
  v_OutLoc          VARCHAR2(100) := 'MYERRLOC';

  fHandle  UTL_FILE.FILE_TYPE;

BEGIN

-- Open the "log file"
fHandle := UTL_FILE.FOPEN(v_OutLoc, 'OGGEnvCheckOut', 'w');

-- Check archivelog space. Report error if used space is more than 75%.. 
SELECT percent_space_used into v_ArchiveUsed FROM V$RECOVERY_AREA_USAGE where file_type = 'ARCHIVED LOG';

IF v_ArchiveUsed > 75 THEN
  v_OutputString := 'OGG-99999: Archive Log space used is ';
  v_OutputString := CONCAT (v_OutputString, v_ArchiveUsed);
  v_OutputString := CONCAT (v_OutputString, '%.');
  UTL_FILE.PUT(fHandle, v_OutputString);
END IF;

-- Close the "log file" and exit
UTL_FILE.FCLOSE(fHandle);
END;
/

The file created by this script, /tmp/OGGEnvCheckOut, will be processed by a bash script we’ll discuss in a bit.

If Downstream Capture is used in the replication stream, we also need to check the database to ensure Redo Transport is working properly and for gaps in the archive logs at the downstream database. The queries below, from the file OGGDBTransportCheck.sql, query the gv$logmnr_session Database View to ensure Integrated Extract has Redo data to process and obtain Archive Log information to determine if gaps exist (the Downstream Mining Extract will hang if gaps exist in the archive logs).

-- Check Redo Transport and archive log gaps
--
-- Redo transport
-- Spool the output to files we'll parse later
SPOOL /tmp/readscn_a.out REPLACE
SELECT session_name||' '||read_scn FROM gv$logmnr_session ORDER BY session_name ASC;
SPOOL OFF

-- Sleep 5 seconds
host sleep 5
--
SPOOL /tmp/readscn_b.out REPLACE
SELECT session_name||' '||read_scn FROM gv$logmnr_session ORDER BY session_name ASC;
SPOOL OFF
--
-- Check for archive log gap
SPOOL /tmp/gapchk.out REPLACE
SELECT consumer_name||' '||thread#||' '||name PREVIOUS_ARCHIVELOG
 FROM (SELECT consumer_name,thread#,name,sequence# ids,
        LEAD (sequence#, 1, NULL)
        OVER (ORDER BY consumer_name,thread#,sequence# ASC) lead_no
        FROM DBA_REGISTERED_ARCHIVED_LOG where purgeable='NO');
SPOOL OFF
--
-- END of Downstream Capture checks

When this script is run, two files will be created in the /tmp directory that contain the read_scn value of all Mining Integrated Extracts:

[oracle@r2d2 tmp]$ cat readscn_a.out
SESSION_NAME||”||READ_SCN
————————————————————————-
OGG$CAP_ETPC 2769449
OGG$CAP_ETST 2769449

[oracle@r2d2 tmp]$ cat readscn_b.out
SESSION_NAME||”||READ_SCN
————————————————————————-
OGG$CAP_ETPC 2769459
OGG$CAP_ETST 2769459

This script also creates a file in the /tmp directory that contains information about the Archive Logs required for each Mining Integrated Extract.

[oracle@r2d2 tmp]$ cat gapchk.out
PREVIOUS_ARCHIVELOG
——————————————————————————–
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_71_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_72_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_73_911991262.dbf
OGG$CAP_ETST 1 /u01/oracle/app/c3p0_standby/1_74_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_71_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_72_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_73_911991262.dbf
OGG$CAP_ETPC 1 /u01/oracle/app/c3p0_standby/1_74_911991262.dbf

We then use a bash script, OGGEnvReportDB.sh, to process the data retrieved from the Oracle Database and log any error conditions to SYSLOG:

#!/bin/bash

## Report any issues found during the database checks

readonly ERRFILE="/tmp/OGGEnvCheckOut"
readonly ARCFILE="/tmp/gapchk.out"
readonly SCNFILEA="/tmp/readscn_a.out"
readonly SCNFILEB="/tmp/readscn_b.out"
readonly SCRIPT_NAME=$(basename $0)

function err() {
  logger -p user.error -t $SCRIPT_NAME "$@"
}

function readErrFile() {
  ## Check the error file exists
  if  -f $ERRFILE ; then
      ## Read the file
      while read LINE; do
        if  "$LINE" =~ "OGG" ; then
          err $LINE
        fi
      done < $ERRFILE
    
      ## Clear out the file
      echo " " > $ERRFILE
  fi
}

function chkReadSCN() {
  mynamea=''
  mynameb=''
  myscna=''
  myscnb=''
  ## Check the output files exist
  if  -f $SCNFILEA  &&  -f $SCNFILEB ; then
      ## Read the file data into arrays
      while read LINEA; do
        if  "$LINEA" =~ "OGG" ; then
          l=${LINEA% *}
          mynamea+=("$l")
          l=${LINEA#* }
          myscna+=("$l")
        fi
      done < $SCNFILEA

      while read LINEB; do
        if  "$LINEB" =~ "OGG" ; then
          l=${LINEB% *}
          mynameb+=("$l")
          l=${LINEB#* }
          myscnb+=("$l")
        fi
      done < $SCNFILEB

      #Process the array data and check if the read_scn is incrementing
      for ((i=1; i< ${#mynamea[@]};i++)); do
        namea=${mynamea[i]}
        scna=${myscna[i]}
        nameb=${mynameb[i]}
        scnb=${myscnb[i]}
        if  "$namea" = "$nameb" ; then
          if  "$scna" -ge "$scnb" ; then
            err "OGG-99999: Extract $namea is waiting for Redo. Check the upstream Redo Transport." 
          fi
        fi
      done
    
      ## Clear out the files
      echo " " > $SCNFILEA
      echo " " > $SCNFILEB
  fi
}

function chkArcGap() {
  lastproc=''
  lastthread=''
  lastarch='0'
  curproc=''
  curthread=''
  curarch=''
  LINEARRAY=''
  ## Check the archive gap results file exists
  if  -f $ARCFILE ; then
      ## Read the file
      while read LINE; do
        if  "$LINE" =~ "OGG" ; then
         ## Split the line into its fields
         curproc=$(echo $LINE | cut -d' ' -f1)
         curthread=$(echo $LINE | cut -d' ' -f2)
         curarch=$(echo $LINE | cut -d' ' -f3)

         # Strip the last underscore from the archive
         curarch=${curarch%_*}
         # Strip the leading underscores from the archive
         # This gives the archive sequence umber
         curarch=${curarch#*_}
	 curarch=${curarch#*_}

         # If the same reader and thread, check for an archive gap
         if  "$lastproc" = "$curproc"  &&  "$lastthread" = "$curthread" ; then
           if  "$lastarch" -gt "0" ; then
            archdiff=$(echo $(( curarch - lastarch )))
            if  "$archdiff" -gt "1" ; then
             err "OGG-99999: There is an archive log gap. Please check redo transport."
            fi
           fi
           lastarch=$(echo $curarch)
         else
          lastproc=$(echo $curproc)
          lastthread=$(echo $curthread)
          lastarch=$(echo $curarch)
         fi
        fi
      done < $ARCFILE

      ## Clear out the file
      echo " " > $ARCFILE
  fi
}

## MAIN
chkReadSCN
chkArcGap
readErrFile

When this script is executed, error conditions will be written to SYSLOG.

Oct 24 09:20:03 r2d2 OGGEnvReportDB.sh: OGG-99999: There is an archive log gap. Please check redo transport.

Automating Script Execution

To automate the execution of these checks, I chose to use the Oracle Database DBMS_SCHEDULER.CREATE_JOB procedure to run the checks hourly. The pl/sql script SetupOGGEnvCheck.sql is used to create the jobs:

-- Must be run as SYSDBA
-- sqlplus / as sysdba

-- Set the necessary privileges

GRANT CREATE JOB TO ggadmin;
GRANT CREATE EXTERNAL JOB TO ggadmin;
GRANT CREATE CREDENTIAL TO ggadmin;

-- Create a credential so script is run as the
-- correct OS user on the database server.

CONN ggadmin/Oracle1

BEGIN
  DBMS_CREDENTIAL.create_credential(
    credential_name => 'oracle_ogg',
    username        => 'oracle',
    password        => 'Oracle1'
  );
END;
/

-- Delete jobs if they already exist
DECLARE
  l_job_name VARCHAR2(30);
  l_job_exists number;
BEGIN

  l_job_name := 'OGG_ENV_CHECK';

  select count(*) into l_job_exists
     from user_scheduler_jobs
    where job_name = l_job_name
          ;

   if l_job_exists = 1 then
      DBMS_SCHEDULER.drop_job(job_name => l_job_name);
   end if;

  l_job_name := 'OGG_DB_CHECK';

  select count(*) into l_job_exists
     from user_scheduler_jobs
    where job_name = l_job_name
          ;

   if l_job_exists = 1 then
      DBMS_SCHEDULER.drop_job(job_name => l_job_name);
   end if;

  l_job_name := 'OGG_DB_TRANSPORT_CHECK';

  select count(*) into l_job_exists
     from user_scheduler_jobs
    where job_name = l_job_name
          ;

   if l_job_exists = 1 then
      DBMS_SCHEDULER.drop_job(job_name => l_job_name);
   end if;

   l_job_name := 'OGG_DB_CHECK_REPORT';

  select count(*) into l_job_exists
     from user_scheduler_jobs
    where job_name = l_job_name
          ;

   if l_job_exists = 1 then
      DBMS_SCHEDULER.drop_job(job_name => l_job_name);
   end if;

END;
/


-- Create a job with in-line scripts.

DECLARE
  l_job_name VARCHAR2(30);
  l_script   VARCHAR2(32767);

BEGIN

-- Check disk space and memory
  l_job_name := 'ogg_env_check';
  l_script := '/u01/oracle/OGGEnvCheck/OGGEnvCheck.sh';

  DBMS_SCHEDULER.create_job(
    job_name        => l_job_name,
    job_type        => 'EXTERNAL_SCRIPT',
    job_action      => l_script,
    credential_name => 'oracle_ogg',
    start_date      => systimestamp + interval '10' second,
    repeat_interval => 'FREQ=HOURLY;INTERVAL=1',
    enabled         => TRUE
  );

END;
/

--
-- Check archive log space
DECLARE
  l_job_name VARCHAR2(30);
  l_script   VARCHAR2(32767);

BEGIN

-- Check database features
  l_job_name := 'ogg_db_check';
  l_script := '/u01/oracle/OGGEnvCheck/OGGEnvCheck.sql';


  DBMS_SCHEDULER.create_job(
    job_name        => l_job_name,
    job_type        => 'SQL_SCRIPT',
    job_action      => l_script,
    credential_name => 'oracle_ogg',
    start_date      => systimestamp + interval '10' second,
    repeat_interval => 'FREQ=HOURLY;INTERVAL=1',
    enabled         => TRUE
  );
END;
/

-- 
-- Check Redo Transport (Downstream Capture)
--
-- Uncomment the following if using downstream capture
--DECLARE
--  l_job_name VARCHAR2(30);
--  l_script   VARCHAR2(32767);

--BEGIN

-- Check database features
--  l_job_name := 'ogg_db_check';
--  l_script := '/u01/oracle/OGGEnvCheck/OGGDBTransportCheck.sql';


--  DBMS_SCHEDULER.create_job(
--    job_name        => l_job_name,
--    job_type        => 'SQL_SCRIPT',
--    job_action      => l_script,
--    credential_name => 'oracle_ogg',
--    start_date      => systimestamp + interval '30' second,
--    repeat_interval => 'FREQ=HOURLY;INTERVAL=1',
--    enabled         => TRUE
--  );
--END;
--/

DECLARE
  l_job_name VARCHAR2(30);
  l_script   VARCHAR2(32767);

BEGIN

-- Report any errors from database check
  l_job_name := 'ogg_db_check_report';
  l_script := '/u01/oracle/OGGEnvCheck/OGGEnvReportDB.sh';

  DBMS_SCHEDULER.create_job(
    job_name        => l_job_name,
    job_type        => 'EXTERNAL_SCRIPT',
    job_action      => l_script,
    credential_name => 'oracle_ogg',
    start_date      => systimestamp + interval '300' second,
    repeat_interval => 'FREQ=HOURLY;INTERVAL=1',
    enabled         => TRUE
  );

END;
/

With the hourly execution of the scripts and the GoldenGate Manager settings, we now have a good representation of our server and replication stream health. If the enterprise has processes in place that scrape SYSLOG, this information may be added for automated notifications.

Summary

In this article we presented information on how to setup Oracle GoldenGate release 12.3 and above to record warning and error conditions to SYSLOG, presented Oracle GoldenGate Manager settings to log excessive replication lag errors, presented Oracle GoldenGate Manager settings to report on Groups in the ABEND state, and provided scripting examples to log server and database resource issues that affect replication stability.

For more information on what other articles are available for Oracle GoldenGate please view our index page.

Add Your Comment