Oracle GoldenGate Big Data: Apply to Apache HDFS

Introduction

Oracle GoldenGate for Big Data provides runtime, pluggable functionality in the Oracle GoldenGate Java Delivery framework to stream transactional data into big data systems, raising the quality and timeliness of business insights. In this article we shall set up the Oracle GoldenGate Apply mechanism (Replicat) to stream data captured from an Oracle 12c Multitenant Database into Apache HDFS.

This document covers functionality present in Oracle GoldenGate version 12.3, which may not be available in earlier product releases.

The concepts, scripts, and information presented in this article are for educational purposes only. They are not supported by Oracle Development or Support, and come with no guarantee or warranty of functionality in any environment other than the test system used to prepare this article. Before applying any changes presented in this article to your environment, you should test thoroughly to assess functionality and performance implications.

Main Article

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. (Source: Apache Hadoop HDFS Architecture documentation)

For this article, we shall be using an Apache Hadoop version 3.0.1 single-node cluster running on a Linux Virtual Machine as the Oracle GoldenGate target instance. Oracle GoldenGate for Big Data version 12.3.1.1 and Java version 8 complete the requisite software components. Environment variables set in the shell include:

# User specific environment and startup programs
#
export JAVA_HOME=/usr/java/jdk1.8.0_92
export JRE_HOME=/usr/java/jdk1.8.0_92/jre
export HADOOP_HOME=/u01/hadoop
export OGG_HOME=/u01/goldengate/ogg12301BD
export LD_LIBRARY_PATH=$JAVA_HOME/lib:$JRE_HOME/lib/amd64/server
#
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
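
As a quick sanity check (an optional step, not part of the original setup), the following commands confirm that the expected Java and Hadoop binaries resolve from this environment and that the single-node cluster is reachable:

[Oracle@hadp ~]$ java -version        # should report 1.8.0_92
[Oracle@hadp ~]$ hadoop version       # should report 3.0.1
[Oracle@hadp ~]$ hdfs dfsadmin -report | head -5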

GoldenGate Source Configuration

Installing and configuring Oracle GoldenGate data capture from an Oracle Database is covered in numerous other articles and documents; for this article, we are assuming that you already have a running Oracle GoldenGate Extract and Extract Data Pump in the source instance. Here are the configuration settings for my source Oracle GoldenGate instance:

Integrated Extract (epdborcl):

extract epdborcl
useridalias ggalias
exttrail ./dirdat/ea
reportcount every 10 minutes, rate
reportrollover at 00:01
report at 00:01
statoptions resetreportstats
warnlongtrans 15m, checkinterval 15m
table pdborcl.tpc.*;

Extract Data Pump (p_hadp):

extract p_hadp
rmthost 192.168.120.248, mgrport 15000
rmttrail ./dirdat/hd
reportcount every 10 minutes, rate
reportrollover at 00:01
report at 00:01
statoptions resetreportstats
table pdborcl.tpc.orders;
table pdborcl.tpc.orders_products;
table pdborcl.tpc.orders_status_history;
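
If these groups do not yet exist, a minimal GGSCI sequence along the following lines creates them; this is a sketch only, assuming the ggalias credential alias is already defined in the credential store and the database is prepared for integrated capture:

dblogin useridalias ggalias
register extract epdborcl database container (pdborcl)
add extract epdborcl, integrated tranlog, begin now
add exttrail ./dirdat/ea, extract epdborcl
add extract p_hadp, exttrailsource ./dirdat/ea
add rmttrail ./dirdat/hd, extract p_hadp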

Tables in the PDBORCL.TPC schema being captured are shown below; however, we are only replicating the three orders tables to HDFS as denoted in the Extract Data Pump.

SQL> select table_name from user_tables order by table_name;

TABLE_NAME
------------------------------
CATEGORIES
CATEGORIES_DESCRIPTION
CUSTOMERS
CUSTOMERS_INFO
NEXT_CUST
NEXT_ORDER
ORDERS
ORDERS_PRODUCTS
ORDERS_STATUS_HISTORY
ORDERS_TOTAL
PRODUCTS
PRODUCTS_DESCRIPTION
PRODUCTS_TO_CATEGORIES


Target GoldenGate Configuration

In the target Oracle GoldenGate instance, we set up a Replicat with the following configuration:


Replicat (rhadp):

replicat rhadp

-- Set the target "database" as Java and define the Java Adapter
-- property file.

targetdb libfile libggjava.so set property=dirprm/rhadp.properties

-- Group the specified number of source transactions into
-- a single target transaction. (1000 is the default.)

grouptransops 1000

-- Report the number of operations processed per second
-- every 10 minutes, and report total operations per
-- table daily.

reportcount every 10 minutes, rate
reportrollover at 00:01
report at 00:01
statoptions resetreportstats

map pdborcl.tpc.*, target pdborcl.tpc.*;
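
To register and start this Replicat against the trail delivered by the data pump, a GGSCI sequence such as the following can be used (illustrative; GoldenGate for Big Data maintains the Replicat checkpoint in a file, so no checkpoint table is required for this non-database target):

add replicat rhadp, exttrail ./dirdat/hd
start replicat rhadp
info replicat rhadp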


The properties file configuration is:

## rhadp.properties for HDFS/Hive

gg.handlerlist=hdfs

gg.handler.hdfs.type=hdfs
gg.handler.hdfs.rootFilePath=/user
gg.handler.hdfs.format=delimitedtext
gg.handler.hdfs.format.pkUpdateHandling=update
gg.handler.hdfs.mode=tx

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=WARN,ERROR

gg.report.time=30sec

gg.classpath=/u01/hadoop/etc/hadoop:/u01/hadoop/share/hadoop/common/lib/*:/u01/hadoop/share/hadoop/common/*:/u01/hadoop/share/hadoop/hdfs/lib/*:/u01/hadoop/share/hadoop/hdfs/*

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Where most new users run into problems is setting the properties file gg.classpath entry correctly:

gg.classpath=/u01/hadoop/etc/hadoop:
/u01/hadoop/share/hadoop/common/lib/*:
/u01/hadoop/share/hadoop/common/*:
/u01/hadoop/share/hadoop/hdfs/lib/*:
/u01/hadoop/share/hadoop/hdfs/*

The gg.classpath entry must be contained on a single text line (it is shown on multiple lines above for clarity). In order of precedence, the classpath items for the HDFS Handler are (1) the path to the Hadoop core-site.xml file, (2) the location of the Hadoop common library files, (3) the location of the hadoop-common.jar file, (4) the location of the HDFS common library files, and (5) the location of the hadoop-hdfs.jar file.
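
A quick way to validate the classpath (an illustrative check using the paths from this environment, not a step from the original setup) is to confirm that each entry resolves to real files:

[Oracle@hadp ~]$ ls /u01/hadoop/etc/hadoop/core-site.xml
[Oracle@hadp ~]$ ls /u01/hadoop/share/hadoop/common/hadoop-common-*.jar
[Oracle@hadp ~]$ ls /u01/hadoop/share/hadoop/hdfs/hadoop-hdfs-*.jar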

This properties file uses many of the defaults associated with the HDFS Handler, so the Oracle GoldenGate Replicat will create delimited text files under the HDFS /user directory, with a maximum file size of 1 gigabyte and the following naming format:

[Oracle@hadp ~]$ hdfs dfs -ls /user/*
-rw-r--r--   1 Oracle supergroup      19486 2018-03-27 10:10 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_10-10-45.705.txt
-rw-r--r--   1 Oracle supergroup  134334893 2018-03-27 10:10 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_10-10-47.054.txt
-rw-r--r--   1 Oracle supergroup       7480 2018-03-27 10:10 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_10-10-47.072.txt
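
To spot-check the records the handler wrote (an illustrative command; with the delimitedtext format, fields are separated by the formatter's default delimiter), view the first few lines of one of the files:

[Oracle@hadp ~]$ hdfs dfs -cat /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_10-10-45.705.txt | head -5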

By default, target files roll only when they reach the 1 gigabyte maximum file size or when the Replicat is restarted. If that behavior does not meet your usage requirements, there are properties file settings that control the size of target files and the frequency of file rolls in the HDFS file system. These include:

gg.handler.[name].maxFileSize
    Sets the maximum file size of created HDFS files. The value may
    be suffixed with k, m, or g to signify kilobytes, megabytes, or
    gigabytes respectively.

gg.handler.[name].fileRollInterval
    Sets a timer that starts when an HDFS file is created. If the
    file is still open when the interval elapses, the file will be
    closed. A new file will not be immediately opened; new HDFS
    files are created on a just-in-time basis.

gg.handler.[name].inactivityRollInterval
    Sets a timer that starts from the latest write to an HDFS file.
    New writes to the file restart the counter. If the file is still
    open when the counter elapses, the HDFS file will be closed. A
    new file will not be immediately opened; new HDFS files are
    created on a just-in-time basis.

Let’s run a test with the properties file set for a maximum HDFS file size of 100 MB, a file roll interval of 10 minutes, and an inactivity roll interval of 30 minutes. We add the following to the properties file, restart the Replicat, and generate source data:

gg.handler.hdfs.maxFileSize=100m
gg.handler.hdfs.fileRollInterval=10m
gg.handler.hdfs.inactivityRollInterval=30m

After a 10-minute test, the HDFS file system contains these files created by Replicat:

[Oracle@hadp ~]$ hdfs dfs -ls /user/*
-rw-r--r--   1 Oracle supergroup   32339315 2018-03-27 12:55 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_12-45-01.876.txt
-rw-r--r--   1 Oracle supergroup      15953 2018-03-27 12:55 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_12-55-03.864.txt
-rw-r--r--   1 Oracle supergroup  104857591 2018-03-27 12:45 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-45-03.934.txt
-rw-r--r--   1 Oracle supergroup  104857441 2018-03-27 12:50 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-45-31.356.txt
-rw-r--r--   1 Oracle supergroup  104857521 2018-03-27 12:55 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-50-40.829.txt
-rw-r--r--   1 Oracle supergroup      25467 2018-03-27 12:55 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-55-35.848.txt
-rw-r--r--   1 Oracle supergroup   12930036 2018-03-27 12:55 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_12-45-05.923.txt
-rw-r--r--   1 Oracle supergroup       6956 2018-03-27 12:55 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_12-55-07.165.txt

Whenever 30 minutes of inactivity occurs on any target file, the HDFS file will be closed and a new one created when new data is applied by Replicat:

[Oracle@hadp ~]$ hdfs dfs -ls /user/*
-rw-r--r--   1 Oracle supergroup   32339315 2018-03-27 12:55 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_12-45-01.876.txt
-rw-r--r--   1 Oracle supergroup    5399930 2018-03-27 13:05 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_12-55-03.864.txt
-rw-r--r--   1 Oracle supergroup      19661 2018-03-27 13:33 /user/pdborcl.tpc.orders/PDBORCL.TPC.ORDERS_RHADP_2018-03-27_13-33-47.433.txt
-rw-r--r--   1 Oracle supergroup  104857591 2018-03-27 12:45 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-45-03.934.txt
-rw-r--r--   1 Oracle supergroup  104857441 2018-03-27 12:50 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-45-31.356.txt
-rw-r--r--   1 Oracle supergroup  104857521 2018-03-27 12:55 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-50-40.829.txt
-rw-r--r--   1 Oracle supergroup   39420206 2018-03-27 13:05 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_12-55-35.848.txt
-rw-r--r--   1 Oracle supergroup     159777 2018-03-27 13:33 /user/pdborcl.tpc.orders_products/PDBORCL.TPC.ORDERS_PRODUCTS_RHADP_2018-03-27_13-33-47.466.txt
-rw-r--r--   1 Oracle supergroup   12930036 2018-03-27 12:55 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_12-45-05.923.txt
-rw-r--r--   1 Oracle supergroup    2099208 2018-03-27 13:05 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_12-55-07.165.txt
-rw-r--r--   1 Oracle supergroup       7708 2018-03-27 13:33 /user/pdborcl.tpc.orders_status_history/PDBORCL.TPC.ORDERS_STATUS_HISTORY_RHADP_2018-03-27_13-33-47.473.txt
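
To correlate these file rolls with the operations Replicat actually applied during the test, the GGSCI statistics command can be consulted (illustrative):

stats replicat rhadp, total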


Summary

In this article we set up Oracle GoldenGate for Big Data to apply transactional data captured from the Oracle RDBMS into the Apache Hadoop Distributed File System.

For more information on what other articles are available for Oracle GoldenGate, please view our index page.

