Configuring Oracle Data Integrator (ODI) for Amazon Elastic MapReduce (EMR)

Introduction

This article demonstrates how to configure Oracle Data Integrator (ODI) for the Amazon Elastic MapReduce (EMR) cloud service.  Amazon EMR is a big data cloud service available on Amazon Web Services (AWS).

ODI is well documented for the Cloudera and Hortonworks distributions of Hadoop, and it can also run on the Hadoop distribution found on the Amazon EMR cloud service.  This is the third of four articles that show how to install, configure, and use ODI on the Amazon EMR cloud service:

 

For a demonstration of how to leverage ODI on Amazon EMR, go to “Webcast: Leveraging Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR).”  Additionally, an ODI 12.2.1 repository with examples of how to leverage ODI with Amazon EMR can be found at “ODI Repository Sample for Amazon Elastic MapReduce (EMR).”

 

Configuring Oracle Data Integrator (ODI) for Amazon Elastic MapReduce (EMR)

 

Prior to configuring ODI for the Amazon EMR cloud service, users must install ODI on the EMR cluster.  To install ODI on the Amazon EMR cloud service, go to “Installing Oracle Data Integrator (ODI) on Amazon Elastic MapReduce (EMR).”

Once ODI has been installed on the Amazon EMR cloud service, the ODI Topology must be configured.  This section illustrates how to configure the ODI Topology with three of the big data technologies found on Amazon EMR:  Hadoop, Hive, and Spark.  Additional technologies such as Pig and Oozie can be configured as well.

The IP address of the master node of the EMR cluster is needed in order to configure ODI with two of these technologies:  Hadoop and Hive.  Log in to the master node of the EMR cluster and identify its IP address, as shown in Figure 1 below.
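As a quick check while logged in to the master node, the private IP address can also be resolved from the node's hostname.  Below is a minimal Python sketch of that lookup, assuming it runs on the master node itself; the address shown in Figure 1 remains the authoritative value.

```python
import socket

# Resolve this machine's IP address from its hostname; on the EMR master
# node this yields the private IP used in the ODI Topology fields below.
try:
    master_ip = socket.gethostbyname(socket.gethostname())
except OSError:
    master_ip = "127.0.0.1"  # fallback when the hostname does not resolve
print(master_ip)
```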

 

Figure 1 – Identifying the IP Address of the EMR Master Node


Make a note of the IP address of the master node.  Proceed to configure the following three technologies in this order:  Hadoop, Hive, and Spark.

To see a complete list of big data technologies supported by ODI, go to “Fusion Middleware Integrating Big Data with Oracle Data Integrator.”  For information on how to configure ODI technologies, go to “Setting Up the ODI Topology.” 

Hadoop Configuration

Using the Physical Architecture of the ODI Topology, select the Hadoop technology and create a new data server as shown in Figure 2 below.  Enter a name for the new Hadoop data server.  Under the Connection section, enter the name of the hadoop user.  The hadoop user does not have a password, so a password value is not required for this configuration.

Specify the HDFS Name Node URI and the Resource Manager, using the IP address of the EMR master node for both metadata elements.  For the ODI HDFS Root, specify the home directory of the hadoop user plus a directory name where ODI can initialize the Hadoop technology.  This directory is created when the user chooses to initialize the data server.
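The field values described above can be sketched as follows, assuming the default EMR ports (8020 for the HDFS NameNode, 8032 for the YARN Resource Manager); the IP address and directory name are placeholders, not values from an actual cluster.

```python
# A minimal sketch of the values entered in the ODI Hadoop data server
# fields, assuming default EMR ports. All values are hypothetical.
master_ip = "10.0.0.100"                    # hypothetical EMR master node IP

hdfs_name_node_uri = f"hdfs://{master_ip}:8020"
resource_manager = f"{master_ip}:8032"
odi_hdfs_root = "/user/hadoop/odi_home"     # hypothetical initialization directory

print(hdfs_name_node_uri)   # hdfs://10.0.0.100:8020
print(resource_manager)     # 10.0.0.100:8032
```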

Save your new Hadoop data server, and select the Initialize option to start the initialization process.  Once initialization is complete, proceed to test the new data server.

 

Figure 2 – Configuring the ODI Hadoop Technology


Create a physical schema for the new Hadoop technology, and use the default values.  Then, go to the Logical Architecture of the ODI Topology, and create a new Hadoop logical schema.  Configure the ODI Context with the new Hadoop logical and physical schemas.

 

Hive Configuration

Using the Physical Architecture of the ODI Topology, select the Hive technology and create a new data server as shown in Figure 3 below.  Enter a name for the new Hive data server.  Specify the User and Password of the Hue account.  For details on how to create the Hue account, go to “Preparing Amazon Elastic MapReduce (EMR) for Oracle Data Integrator (ODI).”

For the Metastore URI, specify the IP address of the EMR master node.  Under Hadoop Configuration, select the Hadoop data server created in the previous section of this article.
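The Metastore URI value can be sketched as follows, assuming the default Hive metastore Thrift port (9083); the IP address is a placeholder.

```python
# A minimal sketch of the Hive Metastore URI field, assuming the default
# Hive metastore Thrift port (9083). The IP address is hypothetical.
master_ip = "10.0.0.100"                      # hypothetical EMR master node IP
metastore_uri = f"thrift://{master_ip}:9083"
print(metastore_uri)                          # thrift://10.0.0.100:9083
```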

 

Figure 3 – Configuring the Hive Physical Data Server


Go to the JDBC tab of the new Hive data server, and select the JDBC driver and the JDBC URL as shown in Figure 4 below.  For the JDBC URL, specify the IP address of the EMR master node.  Save your new Hive data server, and proceed to test the connection.
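The JDBC parameters can be sketched as follows, assuming the HiveServer2 JDBC driver and its default port (10000); the IP address and database name are placeholders.

```python
# A minimal sketch of the JDBC parameters for the Hive data server,
# assuming the HiveServer2 driver and its default port (10000).
# The IP address and the "default" database name are placeholders.
jdbc_driver = "org.apache.hive.jdbc.HiveDriver"
master_ip = "10.0.0.100"                      # hypothetical EMR master node IP
jdbc_url = f"jdbc:hive2://{master_ip}:10000/default"
print(jdbc_url)                               # jdbc:hive2://10.0.0.100:10000/default
```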

 

Figure 4 – Specifying the JDBC Parameters for the Hive Physical Data Server


 

Create a physical schema for the new Hive technology, and select the default schemas for Hive as shown in Figure 5 below.

 

Figure 5 – Configuring the Hive Physical Schema


Go to the Logical Architecture of the ODI Topology, and create a new Hive logical schema.  Configure the ODI Context with the new Hive logical and physical schemas.

Spark Configuration

Using the Physical Architecture of the ODI Topology, select the Spark technology and create a new data server as shown in Figure 6 below.  Enter a name for the new Spark data server.  For the Master Cluster (Data Server), enter yarn-client.  Save your new Spark data server.
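The handful of Spark data server fields can be sketched as a simple mapping; the data server name below is a placeholder, and yarn-client is the master value described above.

```python
# A minimal sketch of the Spark data server fields described above.
# The data server name is hypothetical; "yarn-client" runs the Spark
# driver on the client side while executors run on the YARN cluster.
spark_data_server = {
    "name": "SPARK_EMR",              # hypothetical data server name
    "master_cluster": "yarn-client",  # value entered in Master Cluster (Data Server)
}
print(spark_data_server["master_cluster"])   # yarn-client
```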

 

Figure 6 – Configuring the Spark Physical Data Server


Create a physical schema for the new Spark technology, and use the default values.  Then, go to the Logical Architecture of the ODI Topology, and create a new Spark logical schema.  Configure the ODI Context with the new Spark logical and physical schemas.

  

Conclusion

ODI is well documented for the Cloudera and Hortonworks distributions of Hadoop, and it can also run on the Hadoop distribution found on the Amazon EMR cloud service.  This article demonstrated how to configure ODI for the Amazon Elastic MapReduce (EMR) cloud service.

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for Oracle Data Integrator (ODI).

ODI Related Cloud Articles

Preparing Amazon Elastic MapReduce (EMR) for Oracle Data Integrator (ODI)

Installing Oracle Data Integrator (ODI) on Amazon Elastic MapReduce (EMR)

Using Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)

Webcast: Leveraging Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)

ODI Repository Sample for Amazon Elastic MapReduce (EMR)

Integrating Oracle Data Integrator (ODI) On-Premise with Oracle Cloud Services
