Configuring Oracle Data Integrator for Oracle Big Data Cloud: Topology Configuration

Introduction

 

This article discusses how to configure Oracle Data Integrator (ODI) for Oracle Big Data Cloud (BDC), specifically how to configure the ODI Topology.  ODI offers out-of-the-box integration with Big Data technologies such as Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and Apache Kafka, among others.  ODI supports both major distributions of Hadoop: Hortonworks Data Platform (HDP) and Cloudera Enterprise Data Hub (CDH).  Additionally, ODI can be used on other distributions of Hadoop such as Amazon Elastic MapReduce (EMR).

For additional information on how to use ODI with BDC, go to “Using Oracle Data Integrator with Oracle Big Data Cloud.”  A recording of a live demonstration that supports this discussion can be found in the following Oracle Data Integration webcast: “Mastering Oracle Data Integrator with Big Data Cloud.”

 

Configuring ODI Topology for Big Data Cloud

 

To use the Big Data technologies found on BDC, ODI users must configure the following ODI data servers in the ODI Topology: Hadoop, Hive, HDFS, Spark, and Pig.  Users must also configure the ODI agents in the ODI Topology.  For the ODI Standalone Configuration, only the ODI Standalone agent must be configured.  For the ODI High-Availability Configuration, both the ODI Standalone agent and the ODI J2EE agent must be configured.  Also, before configuring the ODI Topology for BDC, users must configure the BDC environment and install additional software tools such as Sqoop.

 

For information on how to install ODI Standalone on BDC, go to “Configuring Oracle Data Integrator for Big Data Cloud: Standard Configuration.”

For information on how to install ODI High-Availability on BDC, go to “Configuring Oracle Data Integrator for Big Data Cloud: High-Availability Configuration.”

For information on how to configure the BDC environment for ODI, go to “Configuring Oracle Data Integrator for Big Data Cloud: Environment Configuration.”

 

The following sections of this article show examples of how to configure both the ODI data servers and the ODI agents for BDC.  It is recommended to launch ODI Studio on BDC and perform this configuration from the BDC cluster.
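
For example, assuming ODI was installed under /u01/oracle/odi (an illustrative path – adjust it to the actual ODI installation directory), ODI Studio can be launched from a terminal on the BDC node as follows:

    # Launch ODI Studio from the node where ODI is installed
    # (the installation path below is an assumption)
    cd /u01/oracle/odi/odi/studio
    ./odi.sh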

 

Hadoop Data Server

 

In ODI, the Hadoop data server is a prerequisite for other Big Data data servers such as Hive, Spark, Pig, and HDFS; thus, users must configure this data server first.  Figure 1, below, shows an example of how to configure the Hadoop data server for BDC:

 

Figure 1 – ODI Topology for Big Data Cloud – Hadoop Data Server


 

The ODI HDFS Root chosen in Figure 1, above, is /user/oracle/odi, but users can select another directory.  By default, BDC already includes the directory /user/oracle.  Notice that /user/oracle/odi is an HDFS directory, not a Linux or OS directory.  This HDFS directory must exist before the Hadoop data server is initialized.
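
For example, the HDFS directory can be created from a terminal on the BDC cluster with the hdfs command-line client (a minimal sketch; adjust the path if a different ODI HDFS Root is used):

    # Create the ODI HDFS Root directory in HDFS (not on the local file system)
    hdfs dfs -mkdir -p /user/oracle/odi
    # Verify that the new HDFS directory exists
    hdfs dfs -ls /user/oracle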

The Additional Classpath entries, shown in Figure 1, above, have been added manually.  In this example, the Hadoop version of the BDC instance is 2.4.2.0-258.  Thus, when adding the additional class paths, verify the Hadoop version and adjust the class path names accordingly.  Alternatively, users can define the class paths with the Hadoop generic-version directory, current.  Ensure the class paths exist under this directory name (current).
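
As an illustration, on an HDP-based cluster the version-specific class paths and their generic-version equivalents typically look similar to the following (illustrative entries; verify them against the actual layout of the cluster):

    # Version-specific class path (Hadoop 2.4.2.0-258 in this example)
    /usr/hdp/2.4.2.0-258/hadoop/lib/*
    # Equivalent class path using the generic-version "current" symlink
    /usr/hdp/current/hadoop-client/lib/*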

 


 

Configure the physical schema for this physical data server by using the default options.  Then, proceed to create the logical schema.  For additional information on how to configure a Hadoop data server in the ODI Topology, go to “Creating and Initializing the Hadoop Data Server.”

Hive Data Server

 

Figure 2, below, shows an example of how to configure the Hive data server for BDC.  Users must configure this data server if they plan to execute Hive workloads.  Notice the name of the Hadoop data server – this is the Hadoop data server configured in the previous section of this article.

 

Figure 2 – ODI Topology for Big Data Cloud – Hive Data Server


 

A Hive user and password are required when configuring the Hive data server.  The Hive Metastore URI can be found by accessing the console of the BDC cluster.  The BDC console uses Apache Ambari to display this metadata.  Go to the Ambari Administration interface and get the Hive Metastore URI.  For more information on how to access the BDC console, go to “Accessing Big Data Cloud Service Using Ambari.”
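
The Hive Metastore URI is a Thrift URI.  As an illustration, the metastore URI and the Hive JDBC URL typically have the following forms (the hostnames are placeholders, and 9083 and 10000 are the common default ports – use the values reported by Ambari):

    # Typical Hive Metastore URI format
    thrift://<metastore-host>:9083
    # Typical HiveServer2 JDBC URL format for the ODI Hive data server
    jdbc:hive2://<hiveserver2-host>:10000/default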

 

Configure the physical schema for this physical data server by using the default options.  Then, proceed to create the logical schema.  For additional information on how to configure a Hive data server in the ODI Topology, go to “Hive Data Server Definition.”

 

Spark Data Server

 

Figure 3, below, shows an example of how to configure the Spark data server for BDC.  Users must configure this data server if they plan to execute Spark workloads.  Notice the name of the Hadoop data server – this is the Hadoop data server configured in section “Hadoop Data Server” of this article.  When configuring the Spark data server, use the default Master Cluster of yarn-client.

 

Figure 3 – ODI Topology for Big Data Cloud – Spark Data Server


 

On BDC, the spark-submit program, used by ODI to launch the Spark workloads, is located under the /usr folder.  Thus, the properties of the Spark data server must be adjusted to specify the correct location of the spark-submit program.  Figure 4, below, shows an example of how to modify the properties of the Spark data server accordingly.
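
Before modifying the properties, the location of the spark-submit program can be confirmed from a terminal on the BDC node (a minimal sketch; the HDP path shown is illustrative):

    # Locate the spark-submit program on the BDC node
    which spark-submit
    # Typical HDP location under /usr
    ls -l /usr/hdp/current/spark-client/bin/spark-submit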

 

Figure 4 – ODI Topology for Big Data Cloud – Spark Home Directory


 

Configure the physical schema for this physical data server by using the default options.  Then, proceed to create the logical schema.  For additional information on how to configure a Spark data server in the ODI Topology, go to “Setting Up Spark Data Server.”

 

Pig Data Server

 

Figure 5, below, shows an example of how to configure the Pig data server for BDC.  Users must configure this data server if they plan to execute Pig workloads.  Notice the name of the Hadoop data server – this is the Hadoop data server configured in section “Hadoop Data Server” of this article.

 

Figure 5 – ODI Topology for Big Data Cloud – Pig Data Server


 

The Additional Classpath entries, shown in Figure 5, above, have been added manually.  In this example, the Hadoop version of the BDC instance is 2.4.2.0-258.  Thus, when adding the additional class paths, verify the Hadoop version and adjust the class path names accordingly.  Alternatively, users can define the class paths with the Hadoop generic-version directory, current, as shown in the class path example for the Hadoop data server, above.  Ensure the class paths exist under this directory name (current).

 


Configure the physical schema for this physical data server by using the default options.  Then, proceed to create the logical schema.  For additional information on how to configure a Pig data server in the ODI Topology, go to “Pig Data Server Definition.”

 

HDFS Data Server

 

Figure 6, below, shows an example of how to configure the HDFS data server for BDC.  Users must configure this data server if they plan to manage HDFS files.  Notice the name of the Hadoop data server – this is the Hadoop data server configured in section “Hadoop Data Server” of this article.  For this data server, no additional class paths are required.

 

Figure 6 – ODI Topology for Big Data Cloud – HDFS Data Server


 

Configure the physical schema for the HDFS data server, and use the same HDFS directory name for both the “Directory (Schema)” and the “Directory (Work Schema)”.  For additional information on how to configure an HDFS data server in the ODI Topology, go to “HDFS Data Server Definition.”  For a list of supported HDFS file formats, go to “Working with Complex Datatypes and HDFS File Formats.”
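
For example, using the ODI HDFS Root from Figure 1 (an illustrative choice), both schema directories could be set to the same HDFS path:

    Directory (Schema):       /user/oracle/odi
    Directory (Work Schema):  /user/oracle/odi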

 

 

ODI Standalone Agent

 

Figure 7, below, shows an example of how to configure the ODI Standalone agent for BDC.  Users must configure this agent in order to execute Big Data workloads on BDC.  The Name and Port Number of the ODI agent must match the name and port number used during the installation and configuration of the ODI Standalone agent on the BDC cluster.  The Host is the hostname or IP address of the node where the ODI Standalone agent has been installed.
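
Once the agent definition is saved, a quick way to verify the host and port values is to ping the agent over HTTP (a hedged sketch; the hostname and port below are assumptions – use the values from the actual agent installation, where oraclediagent is the default web application context of an ODI Standalone agent):

    # Ping the ODI Standalone agent (hostname and port are assumptions)
    curl http://bdc-node1:20910/oraclediagent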

 

Figure 7 – ODI Topology for Big Data Cloud – ODI Standalone Agent


 

For additional information on how to install the ODI Standalone agent, go to “Configuring the Domain for an ODI Standalone Agent.”

 

 

ODI J2EE Agent

 

Figure 8, below, shows an example of how to configure the ODI J2EE agent for BDC.  Users must configure this agent if they plan to use the ODI High-Availability Configuration for BDC; otherwise, they should skip this section.  The Name of the ODI agent must match the name of the agent used during the installation and configuration of the ODI J2EE agent on Oracle Java Cloud Service (JCS).

 

Figure 8 – ODI Topology for Big Data Cloud – ODI J2EE Agent


 

Since the ODI High-Availability Configuration uses the load balancer on JCS to execute the ODI workloads, the hostname and port number of the J2EE agent are the hostname and port number of the load balancer.  In Figure 8, above, the Port is the port number of the load balancer HTTP listener.  By default, the port number of the load balancer HTTP listener is 8080.  Figure 9, below, shows the default port number of the load balancer HTTP listener:
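
As an illustration, the J2EE agent can then be pinged through the load balancer HTTP listener (a hedged sketch; lb-host is a placeholder for the load balancer hostname, and oraclediagent is assumed to be the web application context of the J2EE agent):

    # Ping the ODI J2EE agent through the load balancer HTTP listener
    curl http://lb-host:8080/oraclediagent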

 

Figure 9 – ODI Topology for Big Data Cloud – Load Balancer Listener


 

It is recommended to enable SSL/TLS and use the HTTPS listener instead.  For information on how to configure SSL/TLS between Oracle Traffic Director and clients, go to “Configuring SSL/TLS Between Oracle Traffic Director and Clients.”

 

For Big Data workloads, the J2EE agent must be used as an orchestrator, since ODI must execute the Big Data workloads in the cluster of the Big Data engine.  Thus, in order to use the J2EE agent as an orchestrator, the J2EE agent must be linked with the Standalone agent, as shown in Figure 10, below:

 

Figure 10 – ODI Topology for Big Data Cloud – ODI J2EE Agent Load Balancing


 

In this example, in Figure 10, above, the J2EE agent, OracleDIAgent, is linked with the Standalone agent, BigDataODIAgent1.  Thus, when a Big Data workload is sent to the load balancer, the load balancer sends the request to the J2EE agent.  The J2EE agent, in turn, takes the request and sends it to the Standalone agent on BDC.

 

For additional information on how to configure ODI High-Availability for BDC, go to “Configuring Oracle Data Integrator for Big Data Cloud: High-Availability Configuration.”

 

Conclusion

 

ODI offers out-of-the-box integration with Big Data technologies such as Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig, among others.  ODI supports both major distributions of Hadoop: Hortonworks Data Platform (HDP) and Cloudera Enterprise Data Hub (CDH).  Additionally, ODI can be used on other distributions of Hadoop such as Amazon Elastic MapReduce (EMR).  This article discussed how to configure the ODI Topology for BDC.

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for Oracle Data Integrator (ODI).

ODI Related Articles

Using Oracle Data Integrator with Oracle Big Data Cloud

Configuring Oracle Data Integrator for Big Data Cloud: Standard Configuration

Configuring Oracle Data Integrator for Big Data Cloud: High-Availability Configuration

Configuring Oracle Data Integrator for Big Data Cloud: Environment Configuration

Webcast: “Mastering Oracle Data Integrator with Big Data Cloud Service – Compute Edition.”
