Oracle GoldenGate: Apply to Apache Flume File Roll Sink

Introduction

This is part 1 of a two part article demonstrating how to replicate data in near-real time from an on-premises database to Oracle Storage Cloud Service.

In this article we shall demonstrate Oracle GoldenGate functionality to capture transactional data in near real-time from an on-premises Oracle Database and apply the records as delimited text data to an Apache Flume File Roll Sink. The Flume Sink may be on-premises or located in the Cloud; however, we will be using an on-premises server for this demonstration. A subsequent article will demonstrate Oracle Storage Cloud Service functionality to store the Flume data files in the Cloud for later use and analysis by Oracle Big Data tools.

We used the Oracle Big Data Lite Virtual Machine as the test bed for this article. The VM image is available for download on the Oracle Technology Network website.

Main Article

Simply put, Apache Flume is a data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data from various sources to a centralized data store. The basic architecture of a Flume Agent is depicted below.

Flume ArchitectureData Generator

A data generator is any data feed; such as Twitter, Facebook, or in our case the Oracle GoldenGate Big Data Adapter, that creates data to be collected by the Flume Agent.

Flume Agent

The Flume Agent is a JVM daemon process that receives the events from data generator clients or other agents and forwards it to a destination sink or agent. The Flume Agent is comprised of:

Source

The source component receives data from the data generators and transfers it to one or more channels in the form of Flume events.

Channel

A channel is a transient store which receives Flume events from the source and buffers them till they are consumed by sinks.

Sink

A sink consumes Flume events from the channels and delivers it to the destination. The destination of the sink may be another agent or an external repository; such as, HDFS, HBase, or in our case text files in a Linux file system.

Oracle GoldenGate

Now that we have an understanding of Apache Flume, we can begin setting up Oracle GoldenGate for data capture and delivery. The first step is to install Oracle GoldenGate for my source database, Oracle 12c in this case, alter the database settings to enable data capture, and create a GoldenGate Change Data Capture Extract that will retrieve near real-time transactional data from Oracle Redo.

The Oracle GoldenGate architecture we’ll be configuring is depicted below.

OGG Architecture

Oracle GoldenGate Source

The Oracle database was altered to support Oracle GoldenGate data capture, and the Container Database user c##ggadmin was created as part of my setup procedures. These requirements are covered in detail in the various Oracle GoldenGate Installation Guides, so I am not covering the steps in detail. To setup the database:

1. Start pl/sql as sysdba via the command:
+++ a) sqlplus / as sysdba
2. Make sure you are in the root container:
+++ a) SELECT SYS_CONTEXT (‘USERENV’, ‘CON_NAME’) FROM DUAL;
3. Execute the following commands:
+++ a) create user c##ggadmin identified by Oracle1 container=all;
+++ b) grant connect, resource, dba to c##ggadmin;
+++ c) exec dbms_goldengate_auth.grant_admin_privilege(‘C##GGADMIN’,container=>’all’);
+++ d) grant dba to c##ggadmin container=all;
+++ e) alter database force logging;
+++ f) alter database add supplemental log data;
+++ g) shutdown immediate;
+++ h) alter database archivelog;
+++ i) alter database open;
+++ j) alter system set enable_goldengate_replication=true scope=both;

By default Oracle Database only logs the changed column information for update operations. Since we are applying data to Flume for further downstream analysis, I want to force Oracle to log all table data when an update occurs. This is done via the GGSCI add schematrandata command.

In my Big Data Lite virtual machine, GoldenGate for Oracle Database is install at /u01/ogg. I go to that directory, start the GGSCI command interpreter, establish a database connection, and execute the command add schematrandata allcols.

ggsci add schematrandata

Execute the command edit param emov to configure the CDC Extract. The configuration file should look like the one below (for more information on the listed parameter settings, refer to the Oracle GoldenGate Reference Guide at Oracle Technology Network).

Extract EMOV Configuration

Execute the ggsci status all command. If the EMOV Extract does not exist, create it and register it with the database as shown below (you will need to start the GoldenGate Manager first).

Create Extract EMOV

Next create the ./dirdat/tm Extract Trail. This series of disk files will contain data retrieved from Oracle Redo by the EMOV CDC Extract.

Add Exttrail

To create the Extract Data Pump, execute the GGSCI command edit param pflume. The configuration file should look like the one below.

pflume settings

The rmthost option defines the DNS name or IP address of a target server running Oracle GoldenGate and the port where the Oracle GoldenGate Manager is listening for incoming connections. In this case, I am defining a loopback for the Oracle GoldenGate Big Data Adapter instance installed on Big Data Lite machine, so I could have excluded the Extract Data Pump. However, Extract Data Pumps are required whenever data is to be transmitted over a network, so it is a good practice to always configure them in your test systems.

Save and close the file, then we can register the Extract Data Pump and its associated Trail file in our Oracle GoldenGate instance.

Register PFLUME

 

Now we can configure the Oracle GoldenGate target instance.

Oracle GoldenGate Target

In my Big Data Lite virtual machine, GoldenGate Generic with the Oracle GoldenGate Big Data Adapter is install at /u01/ogg-bd. I go to that directory, start the GGSCI command interpreter, and start the Oracle GoldenGate Manager.

OGG Target

Execute the command edit param rflume to configure the Flume Apply Replicat. The configuration file should look like the one below.

RFLUME Config

These parameter settings specify the runtime options for this Replicat process; however, you may be unfamiliar with these two:

TARGETDB LIBFILE libggjava.so SET property=dirprm/flume_frs.properties

This parameter serves as a trigger to initialize the Java module. The SET clause specifies the location and name of a Java properties file for the Java module. The Java properties file location may be specified as either an absolute path, or path relative to the Replicat executable location. In the configuration above, I used a relative path for the Oracle GoldenGate instance installed at /u01/ogg-bd.

GROUPTRANSOPS 100

This parameter controls the number of SQL operations that are contained in a Replicat transaction. The default setting is 1000, and is the best practice setting for a production environment. However, my test server is very small and I do not what to overwhelm my flume channel; so I elected to reduce the number of operations the Replicat will apply in a transaction.

Save and close the file. Register the Replicat with the Oracle GoldenGate instance being sure to specify that it will read from the Trail file ./dirdat/pf.

Register RFLUME

To create the flume_frs.properties file; exit GGSCI, change to the dirprm directory, and edit the file with vi or your favorite text editor. The file contents should look similar to the following:

 

flume_frs.properties

 

The flume_frs.properties file defines the Java Virtual Machine settings for the Oracle Big Data Flume Adapter:

gg.handlerlist=flumehandler

Defines the name of the handler configuration properties in this file.

gg.handler.flumehandler.type=flume

Defines the type of this handler.

gg.handler.flumehandler.RpcClientPropertiesFile=custom-flume-rpc.properties

Specifies the name of the Flume Agent configuration file. This file defines the Flume connection information for Replicat to perform remote procedure calls.

gg.handler.flumehandler.format=delimitedtext

Sets the format of the output. Supported formatters are: avro_row, avro_op, delimitedtext, xml, and json.

gg.handler.flumehandler.format.fieldDelimiter=|

The default delimiter is an unprintable character, I changed it to be a vertical bar via this setting.

gg.handler.flumehandler.mode=tx

Sets the operating mode of the Java Adapter. In tx mode, output is written in transactional groups defined by the Replicat GROUPTRANSOPS setting.

gg.handler.flumehandler.EventMapsTo=tx

Defines whether each flume event would represent an operation or a transaction, based upon the setting of gg.handler.flumehandler.mode.

gg.handler.flumehandler.PropagateSchema=true

Defines whether, or not, the Flume handler publishes schema events.

gg.handler.flumehandler.includeTokens=false

When set to true, includes token data from the source trail files in the output. When set to false excludes the token data from the source trail files in the output.

gg.classpath=dirprm/:/usr/lib/flume-ng/lib/*:

Specifies user-defines Java classes and packages used by the Java Virtual Machine to connect to Flume and run. The classpath setting must include (1) the directory location containing the Flume Agent configuration file and (2) a list of the Flume client jars required for the Big Data Adapter to work with Flume. In this example, the Flume client jars are installed at /usr/lib/flume-bg/lib and we use a wildcard, *, to load all of the jars.

The Flume client library versions must match the version of Flume to which the Flume Handler is connecting.

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Sets the JVM runtime memory allocation and the location of the Oracle GoldenGate Java Adapter dependencies (ggjava.jar) file.

Save and close the file.

To create the Flume Agent configuration file, custom-flume-rpc.properties, edit the file with vi or your favorite text editor. The file contents should look similar to the following:

custome-flume-rpc.properties

This file contains the settings Oracle GoldenGate will use to connect to the Flume Agent. In this example, the Oracle GoldenGate Big Data Adapter will attempt an Avro connection to a Flume Agent running on the local machine and listening for connections on port 41414. Data will be sent to Flume in batches of 100 events. Connection and requests to the Flume Agent will time-out and fail if there is no response within 2000 ms.

Flume Agent

As previously shown, the Oracle GoldenGate Big Data Adapter is configured to connect to a Flume Agent on the local machine, listening on port 41414. To configure the Flume Agent, I created the file /home/oracle/flume/flume_frs.conf; with the following settings:

Flume Agent Config

In this file, we create a single Flume Agent named oggfrs. The oggfrs Agent will consist of an Avro RPC source that listens on port 41414 of our local host, a channel that buffers event data in memory, and a sink that writes events to files in a directory of the local file system. The sink will close the existing file and create a new one every 120 seconds.

You will notice that I set the sink directory to include the Oracle source schema and table name. This is a personal preference as the File Roll Sink does not provide a mechanism for naming the output files. Currently, the File Roll Sink names the files based upon the current timestamp when the file is created. The Flume source code sets the file name as follows:

public class PathManager {

  private long seriesTimestamp;
  private File baseDirectory;
  private AtomicInteger fileIndex;

  private File currentFile;

  public PathManager() {
    seriesTimestamp = System.currentTimeMillis();
    fileIndex = new AtomicInteger();
  }

  public File nextFile() {
    currentFile = new File(baseDirectory, seriesTimestamp + "-"
        + fileIndex.incrementAndGet());

    return currentFile;
  }

  public File getCurrentFile() {
    if (currentFile == null) {
      return nextFile();
    }

    return currentFile;
  }

We could write a Java module to override the Flume PathManager class; however, that is beyond the scope of this article.

For more information on setting up Apache Flume Agents, refer to the Apache documentation.

Save and close the file.

We are now ready to start the Flume Agent. In my /home/oracle/flume directory, I created a shell script to start the agent, with the following commands:

Flume Agent shell

After creating the output directory /u01/ogg-db/flumeOut/movidemo/movie, execute the script to start the agent. Upon a successful agent startup, you will see output similar to the following on the terminal screen:

Agent Start

You will also see the files created by the File Roll Sink in the output directory.

File Roll Files

The File Roll Sink will create new files at the interval specified in the Flume Agent configuration file, even when there is no Oracle GoldenGate activity.

Start Oracle GoldenGate and Test

To start and test our configuration, go to the target Oracle GoldenGate environment, /u01/ogg-bd, and make sure the Oracle GoldenGate Manager is in the RUNNING state.

Target OGG Manager

Go to the source Oracle GoldenGate environment, /u01/ogg, start GGSCI and start the emov and pflume Extract Groups. Executing the status all command should show each is in the RUNNING state.

Source OGG Status

On the Oracle GoldenGate target, start the rflume Replicat. Executing the command status rflume should show it in the RUNNING state.

OGG RFLUME Start

 

The Flume Agent will show the Replicat connect to the Flume Source.

Agent Connect

Now let’s generate some data in the Oracle Database and verify it flows through to the Flume File Roll Sink. In SQL Developer, I connect to the Oracle Pluggable Database, select the moviedemo schema, and execute the following:

INSERT INTO "MOVIEDEMO"."MOVIE" (MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY) VALUES ('1173971','Jack Reacher: Never Go Back', '2016','96000000','0','Jack Reacher must uncover the truth behind a major government conspiracy in order to clear his name. On the run as a fugitive from the law, Reacher uncovers a potential secret from his past that could change his life forever.');
INSERT INTO "MOVIEDEMO"."MOVIE" (MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY) VALUES ('1173972','Boo! A Madea Holloween', '2016','0','0','Madea winds up in the middle of mayhem when she spends a haunted Halloween fending off killers, paranormal poltergeists, ghosts, ghouls and zombies while keeping a watchful eye on a group of misbehaving teens.');
INSERT INTO "MOVIEDEMO"."MOVIE" (MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY) VALUES ('1173973','Like Stars on Earth', '2007','0','1204660','An eight-year-old boy is thought to be lazy and a trouble-maker, until the new art teacher has the patience and compassion to discover the real problem behind his struggles in school.');
commit;

Executing the GGSCI command view report rflume on the Oracle GoldenGate target, shows that data was captured and sent to the Oracle GoldenGate Replicat.

RFLUME Report

We will also see a file containing data in out Flume File Roll Sink output directory:

FRS DirectoryThis is a delimited text file, so we can view the contents of the file using cat.

flume frs output

 

In the next part of this series we shall take the delimited text files that contain data, move them to Oracle Storage Cloud Service, and analyze the contents with some of the Oracle Big Data analysis tools.

Summary

In this article we demonstrated the functionality of Oracle GoldenGate to capture database transactions and apply this data to an Apache Flume Agent configured with a File Roll Sink.

Continue to the next part of this article: Uploading a file to Oracle storage cloud service using REST API

Add Your Comment