Preparing Amazon Elastic MapReduce (EMR) for Oracle Data Integrator (ODI)

June 9, 2016 | 13 minute read
 

Introduction

This article demonstrates how to prepare the Amazon Elastic MapReduce (EMR) cloud service for the Oracle Data Integrator (ODI) installation. Amazon EMR is a big data cloud service, available on the Amazon Web Services (AWS) cloud computing platform.

ODI is well documented to run on both the Cloudera and Hortonworks distributions of Hadoop. ODI can also run on the distributions of Hadoop found on the Amazon EMR cloud service. This is the first of four articles that show how to install, configure, and use ODI on the Amazon EMR cloud service; the other articles are listed under ODI Related Cloud Articles at the end of this article.

 

For a demonstration of how to leverage ODI on Amazon EMR, go to “Webcast: Leveraging Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR).”  Additionally, an ODI 12.2.1 repository with examples of how to leverage ODI with Amazon EMR can be found at “ODI Repository Sample for Amazon Elastic MapReduce (EMR).”

 

Preparing Amazon Elastic MapReduce (EMR) for Oracle Data Integrator (ODI)

 

In order to leverage ODI with the Amazon EMR service, three AWS services are required:  Amazon RDS, Amazon EMR, and Amazon S3.  Amazon RDS is the database service, which includes database engines such as Oracle, MySQL, and PostgreSQL.  Amazon S3 is the cloud storage, which allows users to access data stored in the AWS cloud from anywhere on the web.  Amazon EMR is a cluster of Amazon EC2 compute resources, based on the Hadoop distributed processing architecture, MapReduce.

Figure 1 below illustrates the AWS cloud services required to host ODI on Amazon EMR:

 

Figure 1 - ODI on Amazon Elastic MapReduce

The Amazon RDS database instance, shown in Figure 1 above, is required in order to host the ODI repository. The ODI repository can be hosted on one of the following two Amazon RDS database engines: Oracle or MySQL. This article uses the Oracle database engine to illustrate how to install an ODI repository on an Amazon RDS instance.

By default, the Amazon EMR cluster is provisioned with one master node and two slave nodes. This article discusses how to create an Amazon EMR cluster with additional storage in order to host the ODI binaries. Two main ODI components are installed and configured on the master node of the Amazon EMR cluster: the ODI agent and ODI Studio. The ODI standalone agent is the recommended agent type, since it does not require an application server and it can run as a standalone Java application in the Amazon EMR cluster.

For big data integration tasks, the ODI standalone agent must reside on the master node of the EMR cluster, so ODI can invoke the big data tools – such as Spark, Pig, Sqoop, and Hive – found on the EMR cluster.  For non-big data integration tasks, users can install additional ODI agents on other Amazon EC2 instances as shown on Figure 1 above.  ODI Studio can also be installed on other Amazon EC2 instances.

This article discusses how to install and configure X Window (X11) software on the master node of the EMR cluster. X11 software allows graphical applications such as the ODI installer to run on a server that has no display of its own; thus, the ODI agent and ODI Studio can be installed and configured on the master node of the Amazon EMR cluster. Additionally, X11 software allows the forwarding of application screens; thus, ODI users can launch ODI binaries on the master node of the Amazon EMR cluster, while the ODI screens are displayed on client computers as shown in Figure 1 above.

The Amazon S3 storage instance, on Figure 1 above, can be used to store source-data files that need to be ingested into the Amazon EMR cluster.  The Amazon S3 storage instance can also be used as a landing area to store data that has been transformed on the Amazon EMR cluster.  Data stored on an Amazon S3 storage instance can be copied or moved into other cloud data services or on-premise data warehouses.  This article shows how to create a hadoop user that has access to both file systems:  the Hadoop file system and the Amazon S3 file system; thus, ODI users can create integration tasks in the EMR cluster to transform data on both file systems.

 

Creating the Amazon AWS Account

In order to configure ODI with Amazon EMR, an Amazon AWS account is required. Once the Amazon AWS account has been created, go to the Amazon AWS Management Console, and identify the following four Amazon AWS services: Amazon EC2, Amazon RDS, Amazon EMR, and Amazon S3. The following sections of this article show how to configure these four Amazon AWS services to successfully install and leverage ODI on the Amazon EMR cloud service.

 

Creating the Amazon Key Pair

Amazon EMR is a cluster of Amazon EC2 compute resources. Amazon EC2 uses public-key cryptography to authenticate users who log in to the Amazon EMR instances. On Amazon EMR, users access the EMR cluster by logging in to the master node of the cluster. With public-key cryptography, the public key is stored on the master node, and the user keeps the matching private key. When a user attempts to log in to the master node, the master node uses the public key to verify that the user holds the private key, and determines whether the user can be logged in to the cluster. In Amazon AWS, these public and private keys are known as Amazon key pairs. In order to host ODI on an Amazon EMR instance, an ODI agent must be configured on the master node of the EMR cluster. Thus, an Amazon key pair is required in order to access the master node of an EMR cluster.

Return to the Amazon AWS Management Console, and locate the AWS services.  Select EC2 to access the EC2 dashboard.  On the EC2 dashboard, locate the Network & Security section.  This section contains various network and security options, including the Key Pairs option.  Select the Key Pairs option, and create a new EC2 key pair.  Figure 2 below illustrates the location of the Key Pairs option in the EC2 dashboard:

 

Figure 2 - Creating an ODI Key Pair in Amazon AWS

When the new EC2 key pair is created, a file with extension .pem, which contains the private key, is also created. Save the .pem file on disk. This file will be used in a later section of this article in order to configure the SSH connection, and access the master node of the EMR cluster.
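OpenSSH-based clients generally refuse to use a private key file that other users on the machine can read, so it can help to restrict the saved .pem file to its owner. The following is a minimal sketch of that step, assuming a Linux or macOS client; the function name is hypothetical, and the step does not apply to PuTTY on Windows, which uses its own key format.

```python
import os
import stat
from pathlib import Path


def restrict_key_permissions(pem_path: str) -> str:
    """Make the private key file readable by its owner only (mode 0400)."""
    path = Path(pem_path)
    if not path.is_file():
        raise FileNotFoundError(f"key file not found: {pem_path}")
    # Same effect as running `chmod 400` on the file.
    os.chmod(path, stat.S_IRUSR)
    # Return the resulting permission bits as an octal string for inspection.
    return oct(stat.S_IMODE(path.stat().st_mode))
```

Calling restrict_key_permissions with the path of the saved .pem file returns the new permission bits, so the result can be checked before the first connection attempt.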

 

Creating the Amazon EMR Cluster

Once the EC2 key pair has been created and a .pem file has been saved on disk, proceed to create an instance of the Amazon EMR cloud service.  Using the Amazon AWS Management Console, locate the AWS services and select EMR.

In the Elastic MapReduce service window, select the Create Cluster option, and proceed to create a new EMR cluster.  Select the Advanced Options to customize the configuration of the new EMR cluster.  Select the software vendor and the release version as shown on Figure 3 below.  In this example, the selected vendor is Amazon, and the release version is emr-4.5.0.  In this release, the following application tools have been selected:  Hadoop, Sqoop, Spark, Hive, HCatalog, Oozie, Hue, and Pig.  Select additional application tools if your environment requires them.

 

Figure 3 - Creating the Amazon EMR Cluster

Once the vendor and the release version have been selected, proceed to configure the Hardware options.  Ensure that the selected hardware options meet the requirements of your desired environment.

It is recommended to configure additional storage for the ODI installation and ODI binaries.  In the Hardware Configuration options, locate the Master Node instance, select the Add EBS volumes option, and add a new storage volume as shown on Figure 4 below:

Figure 4 - Adding EBS Storage in the Amazon EMR Cluster

Figure 4 above shows the default size of 100 GiB for the new EBS volume.  This size is sufficient for the ODI installation and the additional binaries that will be installed on the master node of the Amazon EMR cluster.  However, additional storage or EBS volumes may be required if the user chooses to store data files or install additional software.

Select the General Cluster Settings options and enter the name of the EMR cluster.  Select the Security options and enter the EC2 key pair that was created in a previous section of this article.  Figure 5 below shows the selection of the EC2 key pair:

 

Figure 5 - Specifying the ODI EC2 Key Pair

Note 1 - Specifying the ODI EC2 Key Pair

 

Once the EC2 key pair has been selected, proceed to create the EMR cluster.  The EMR cluster will be ready for use when its status is Waiting as shown on Figure 6 below.  Once the EMR cluster is ready for use, locate the Connections section and proceed to enable the web connections.   Follow the Amazon instructions on how to enable the Amazon EMR web applications such as Hue, Spark, and the Resource Manager.

 

Figure 6 - Enabling Web Connections

Once the Web connections have been enabled, select the Hue application, and create a Hue account as shown on Figure 7 below:

Figure 7 – Creating the Hue Account

The new Hue account, as shown on Figure 7 above, will be used by ODI to access Hive, and other Hadoop resources.  This Hue user will be configured in the ODI Topology.

 

Note 2 - Creating the Hue Account

 

Once the Hue user has been created, log in to Hue and browse the file system of the EMR cluster. On Amazon EMR, users can access both the Amazon Simple Storage Service (S3) and the Amazon Hadoop file system as shown in Figure 8 below. Furthermore, Hive tables can be defined on both Amazon S3 and the Amazon Hadoop file system; thus, ODI can be used to read data from, and write data into, Hive tables that have been defined on either file system. For additional information on how to integrate Hive with Amazon S3, go to “Additional Features of Hive in Amazon EMR.”

 

Figure 8 – Browsing Amazon S3 and Amazon Hadoop

 

Configuring SSH to Access the Amazon EMR Cluster

In order to access the master node of the EMR cluster, the user must perform two configurations: create an inbound SSH rule, and configure an SSH client tool to connect to the master node. The inbound SSH rule must be defined in the security group of the master node. Return to the Amazon AWS Management Console, and locate the AWS services. Select EC2 to access the EC2 dashboard. On the EC2 dashboard, locate the Network & Security section, and select the Security Groups option. A list of security groups should be displayed on screen. Identify and select the security group of the master node, and proceed to edit its inbound rules as shown in Figure 9 below:

 

Figure 9 – Editing the EMR Master Security Group

In the Edit Inbound Rules page, add a new inbound rule of type SSH as shown on Figure 10 below.  Specify the IP address of the machine that will be authorized to connect to the Amazon EMR cluster.  If multiple computers must access the Amazon EMR cluster, specify a range of IP addresses.  Save your changes.

 

Figure 10 – Adding an Inbound Rule for SSH Connection

Return to the Amazon AWS Management Console, and select EMR.  In the cluster list, locate and select the new EMR cluster.  Locate the Master Public DNS section of the new cluster, and select the SSH option as shown on Figure 11 below.  Follow the Amazon instructions on how to install PuTTY on a client computer and use the EC2 key pair .pem file to connect and access the master node of the EMR cluster.

 

Figure 11 – Configuring the SSH Connection

Using PuTTY on a client computer, connect to the master node of the EMR cluster, as shown on Figure 12 below.

 

Figure 12 – Accessing the Amazon EMR Master Cluster

Note 3 - Accessing the Amazon EMR Master Cluster
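For readers who connect with an OpenSSH client rather than PuTTY, the connection can be sketched as a command line built from the .pem file and the Master Public DNS value. The example below only assembles and prints the command; the helper name, the key file name, and the DNS value are hypothetical placeholders. The -X option requests X11 forwarding, which is relevant when the ODI screens are displayed on a client computer.

```python
import shlex
from typing import List


def build_ssh_command(key_file: str, master_dns: str, x11: bool = False) -> List[str]:
    """Return the argument list for an OpenSSH login to the EMR master node."""
    cmd = ["ssh", "-i", key_file]
    if x11:
        cmd.append("-X")  # request X11 forwarding
    # The login user on the EMR master node is hadoop.
    cmd.append(f"hadoop@{master_dns}")
    return cmd


# Hypothetical key file name and Master Public DNS value.
args = build_ssh_command(
    "odi-emr-key.pem",
    "ec2-203-0-113-25.compute-1.amazonaws.com",
    x11=True,
)
print(shlex.join(args))
```

Building the command as a list of arguments keeps each value separate, so a key path that contains spaces is passed to ssh intact.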

 

Adding the Hue User to the Hadoop Group

Before running any Hive queries, add the Hue user to the hadoop group in Linux. Log in to the master node as the hadoop user, and execute the following Linux command, which creates a Linux user (named oracle in this example) whose primary group is hadoop:

sudo adduser -g hadoop oracle
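After the command has run, the group membership can be verified on the master node. The following is a minimal sketch that uses the standard Python pwd and grp modules; the function name is hypothetical, and oracle is the user name from the command above.

```python
import grp
import pwd


def user_in_group(user: str, group: str) -> bool:
    """Return True if the Linux user has the group as its primary or a supplementary group."""
    try:
        user_entry = pwd.getpwnam(user)
        group_entry = grp.getgrnam(group)
    except KeyError:
        # The user or the group does not exist on this host.
        return False
    return user_entry.pw_gid == group_entry.gr_gid or user in group_entry.gr_mem


# Prints True on a host where the oracle user belongs to the hadoop group.
print(user_in_group("oracle", "hadoop"))
```

The check covers both the primary group, which is what the -g option sets, and any supplementary group memberships.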

 

Creating the Amazon RDS Instance

When configuring ODI with Amazon EMR, the ODI repository should be installed on a database instance of Amazon RDS. The ODI repository installation supports various database engines. On Amazon RDS, the ODI repository can be installed on one of two database engines: MySQL or Oracle.

Return to the Amazon AWS Management Console, and locate the AWS services.  Select RDS to create a new Amazon RDS service instance.  Choose between Oracle and MySQL database engines.  Select the database options that meet your requirements, and proceed to create the database instance.  Figure 13 below shows the Oracle database engines available on Amazon RDS.

 

Figure 13 – Configuring a Database Instance in Amazon RDS

Once the database instance has been created, identify the Endpoint of the new database as shown on Figure 14 below.  Make a note of the Endpoint URL and the port number.  This information is required in order to create an ODI repository on this database instance.
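The Endpoint URL and port number are typically combined into a JDBC URL when connecting to the Oracle database instance. The following is a minimal sketch of the two common Oracle thin-driver URL forms; the function name, the endpoint host name, and the database name ORCL are assumptions for illustration, and the form required depends on the tool that consumes the URL.

```python
def oracle_jdbc_url(endpoint: str, port: int, database: str, service_name: bool = False) -> str:
    """Build an Oracle thin-driver JDBC URL from an RDS endpoint and port."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    if service_name:
        # Service-name form: jdbc:oracle:thin:@//host:port/service
        return f"jdbc:oracle:thin:@//{endpoint}:{port}/{database}"
    # SID form: jdbc:oracle:thin:@host:port:SID
    return f"jdbc:oracle:thin:@{endpoint}:{port}:{database}"


# Hypothetical endpoint; replace with the Endpoint shown for your instance.
print(oracle_jdbc_url("odi-repo.example.us-east-1.rds.amazonaws.com", 1521, "ORCL"))
```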

 

Figure 14 – Database Instance Endpoint URL and Port Number

The new database instance will be accessed by the ODI standalone agent as well. Thus, an RDS inbound rule must be configured to allow the agent to access the RDS database instance.

Return to the Amazon AWS Management Console, and locate the AWS services.  Select EC2 to access the EC2 dashboard.  In the EC2 dashboard, locate the Network & Security section, and select the Security Groups option.  A list of security groups should be displayed on screen.  Identify and select the security group of the RDS instance, as shown on Figure 15 below.  Proceed to edit the inbound rules of the RDS instance security group.

Figure 15 - Adding a RDS Inbound Rule for the RDS Database Instance

Create a new inbound rule of type RDS that allows the master node of the EMR cluster to make inbound calls to the RDS database instance. In Figure 15 above, the Source IP Address has been set to 172.31.5.102/32. The /32 suffix restricts the rule to a single IP address, which is the IP address of the EMR master node. To authorize several machines, specify a wider CIDR range instead. Save your new inbound rule.
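The effect of the CIDR suffix on an inbound rule source can be checked with the standard Python ipaddress module. In the sketch below, 172.31.5.102/32 is the value from Figure 15, while the /16 block and the function name are hypothetical and included only for comparison.

```python
import ipaddress


def source_allows(source_cidr: str, address: str) -> bool:
    """Return True if the address falls inside the rule's source CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(source_cidr)


# Source value from Figure 15: a /32 block holds exactly one address.
rule_source = ipaddress.ip_network("172.31.5.102/32")
print(rule_source.num_addresses)

# Hypothetical wider block, for comparison only.
wider_source = ipaddress.ip_network("172.31.0.0/16")
print(wider_source.num_addresses)

print(source_allows("172.31.5.102/32", "172.31.5.102"))
print(source_allows("172.31.5.102/32", "172.31.5.103"))
```

The /32 block contains 1 address and the /16 block contains 65536, so a rule with the /32 source admits the master node and rejects its neighbor at 172.31.5.103.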

For additional information on how to access an Oracle database instance on the Amazon RDS service, go to “Connecting to a DB Instance Running the Oracle Database Engine.”

To install ODI on the master node of the Amazon EMR cluster, go to “Installing Oracle Data Integrator (ODI) on Amazon Elastic MapReduce (EMR).”

 

Conclusion

 

ODI is well documented to run on both the Cloudera and Hortonworks distributions of Hadoop.  ODI can also run on the distributions of Hadoop found on the Amazon EMR cloud service.  This article demonstrates how to prepare the Amazon Elastic MapReduce (EMR) cloud service for the Oracle Data Integrator (ODI) installation.

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit “Oracle A-Team Chronicles for Oracle Data Integrator (ODI).”

 

ODI Related Cloud Articles

Installing Oracle Data Integrator (ODI) on Amazon Elastic MapReduce (EMR)

Configuring Oracle Data Integrator (ODI) for Amazon Elastic MapReduce (EMR)

Using Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)

Webcast: Leveraging Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)

ODI Repository Sample for Amazon Elastic MapReduce (EMR)

Integrating Oracle Data Integrator (ODI) On-Premise with Oracle Cloud Services

 

Benjamin Perez-Goytia

