This article discusses how to use Oracle Data Integrator (ODI) with the Amazon Elastic MapReduce (EMR) cloud service. Amazon EMR is a big data cloud service and a member of the Amazon Web Services (AWS) cloud computing services, offered by Amazon.com.
In the big data space, ODI is certified on both the Cloudera and Hortonworks distributions of Hadoop. In addition to these certifications, ODI usability can be extended to other distributions of Hadoop, including the Amazon distributions of Hadoop found in the Amazon EMR cloud service. Furthermore, ODI can be used with other big data technologies such as Apache Spark, which is also found in the Amazon EMR cloud service. Users can use ODI to design big data tasks with Spark and execute them against the Spark cluster found in the Amazon EMR cloud service.
This is the fourth in a series of four articles that show how to install, configure, and use ODI on the Amazon EMR cloud service.
For a demonstration of how to leverage ODI on Amazon EMR, go to “Webcast: Leveraging Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR).” Additionally, an ODI 12.2.1 repository with examples of how to leverage ODI with Amazon EMR can be found at “ODI Repository Sample for Amazon Elastic MapReduce (EMR).”
To use ODI with the Amazon EMR cloud service, three AWS cloud services are required: Amazon RDS, Amazon S3, and Amazon EMR. Amazon RDS is the relational database service, which includes database technologies such as Oracle, MySQL, and PostgreSQL. Amazon S3 is the cloud storage service, which allows users to access data stored in the AWS cloud from anywhere on the web. Amazon EMR is a cluster of Amazon EC2 compute resources based on Hadoop and its distributed processing framework, MapReduce.
Figure 1 below illustrates the AWS cloud services required to host ODI on Amazon EMR:
Figure 1: Hosting ODI on Amazon Elastic MapReduce
The Amazon RDS database instance, shown in Figure 1 above, is required to host the ODI repository. The ODI repository can be hosted on one of two Amazon RDS database engines: Oracle or MySQL. Additional Amazon RDS database instances may be required if the customer wishes to store and maintain data marts on the Amazon RDS cloud service.
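As an illustration, the repository data server in the ODI topology would point at the RDS endpoint with a standard JDBC URL. The endpoints, ports, and schema names below are placeholders, not real instances:

```
# Example JDBC URLs for an ODI repository hosted on Amazon RDS
# (endpoints, ports, and service/schema names are placeholders):
Oracle: jdbc:oracle:thin:@//myrds.xxxxxxxx.us-east-1.rds.amazonaws.com:1521/ORCL
MySQL:  jdbc:mysql://myrds.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/odi_repo
```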
By default, the Amazon EMR cluster is provisioned with one master node and two slave nodes. An ODI agent can be installed on the master node of the Amazon EMR cluster, as shown in Figure 1 above; the agent can then access and invoke big data tools such as hdfs, spark-submit, hive, and sqoop, among others. The ODI standalone agent is the recommended agent type, since it does not require an application server and can run as a standalone Java application on the master node of the Amazon EMR cluster. Multiple ODI standalone agents can be installed on the master node to load-balance data workloads.
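As a sketch, starting the standalone agent on the master node could look like the following. The domain path, agent name, and port are hypothetical; the agent name and port must match the physical agent defined in the ODI topology:

```shell
# Hypothetical paths and names -- adjust to your installation.
export JAVA_HOME=/usr/java/latest          # Oracle-certified JDK
export AGENT_DOMAIN=/home/hadoop/odi/user_projects/domains/base_domain

# Start the standalone agent; -NAME must match the physical agent
# declared in the ODI topology, and -PORT its configured port.
$AGENT_DOMAIN/bin/agent.sh -NAME=OracleDIAgent1 -PORT=20910
```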
ODI Studio can also be installed on the master node of the Amazon EMR cluster, as shown in Figure 1 above. Running ODI Studio on the master node makes it easy to configure and test the Hadoop technologies in the ODI Topology, and ODI Studio performs best when installed and used within the Amazon EMR cluster. Alternatively, if additional Amazon EC2 instances are available, ODI Studio can be installed on other EC2 instances. ODI users can run ODI Studio on an Amazon EC2 instance and redirect the ODI Studio display to an on-premises computer.
Amazon EMR clusters are configured with both the Amazon Open Java Development Kit (Amazon JDK) and the Java Platform, Standard Edition Development Kit (Oracle JDK). To run both the ODI standalone agent and ODI Studio on the master node of the Amazon EMR cluster, an Oracle-certified JDK is required. In an Amazon EMR cluster, multiple versions of Java can be installed on the master node; thus, users can run the Amazon distributions of big data applications with the Amazon JDK and the ODI components with the Oracle JDK. Additional expertise may be required to correctly install multiple versions of Java on the master node of the Amazon EMR cluster. For a complete list of Oracle-certified JDKs, go to “Oracle Fusion Middleware 12c (12.2.1) Certification Matrix.”
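A minimal sketch of installing an Oracle JDK alongside the cluster's default JDK, assuming a downloaded JDK archive; the archive name and version placeholders are illustrative. The cluster-wide default is left untouched for the Amazon big data stack, and only the shell used by the ODI processes is pointed at the Oracle JDK:

```shell
# Unpack the Oracle JDK into a separate directory (archive name is a
# placeholder -- substitute the actual downloaded file).
sudo mkdir -p /usr/java
sudo tar xzf jdk-8uNNN-linux-x64.tar.gz -C /usr/java

# Point only the ODI agent/Studio environment at the Oracle JDK,
# leaving the default Amazon JDK in place for the Hadoop services.
export JAVA_HOME=/usr/java/jdk1.8.0_NNN
export PATH=$JAVA_HOME/bin:$PATH
```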
The Amazon S3 storage instance, shown in Figure 1 above, is required to store source data files that need to be ingested into the Amazon EMR cluster. The Amazon S3 storage instance can also be used as a landing area for data that has been transformed on the Amazon EMR cluster. Data stored on an Amazon S3 storage instance can be copied or moved into other cloud data services or on-premises data warehouses.
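For example, a source file can be staged in an S3 bucket and then pulled into HDFS on the cluster with the AWS CLI and the hdfs utility. The bucket, key, and path names are examples only:

```shell
# Stage a source file in an S3 bucket (bucket/key names are examples).
aws s3 cp ./orders.csv s3://my-odi-landing/input/orders.csv

# From the master node, copy the file down and load it into HDFS
# so the cluster can process it.
aws s3 cp s3://my-odi-landing/input/orders.csv /tmp/orders.csv
hdfs dfs -put /tmp/orders.csv /user/hadoop/input/
```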
Figure 2 below illustrates an Amazon EMR cluster with an ODI standalone agent installed on the master node of the cluster. This ODI standalone agent can natively invoke big data applications, such as Spark and Hive, installed on the Amazon EMR cluster. For instance, users can design ODI mappings that use Spark or Hive as the transformation engine in the Amazon EMR cluster. ODI mappings can take advantage of the compatibility between Spark SQL and Hive QL by defining Hive as a source or target datastore while using Spark as the transformation engine. These ODI mappings run successfully on the Spark cluster within Amazon EMR.
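Under the hood, ODI's Spark knowledge modules generate a PySpark script that the agent submits to the cluster. An equivalent manual submission from the master node might look like the following; the script name and options are illustrative, not the exact command ODI issues:

```shell
# Submit a (hypothetical) ODI-generated PySpark mapping script to the
# EMR cluster's YARN resource manager.
spark-submit --master yarn --deploy-mode client \
  /tmp/odi_generated_mapping.py
```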
Figure 2: Using ODI with Spark and Hive in the Amazon EMR Cluster
Amazon EMR provides additional features that integrate Hive with the Amazon S3 storage service. The content of Hive tables (files) can reside directly in Amazon S3 buckets (folders). Using Hive, users can read from and write to files located in Amazon S3 buckets. Hive SerDes support multiple file formats such as CSV, XML, and JSON, among others; thus, Amazon S3 buckets can contain Hive files in a variety of formats. ODI users can take advantage of these additional Hive features of Amazon EMR: users can design ODI mappings that read, transform, and populate Hive datastores located in Amazon S3. This ODI integration is illustrated in Figure 3 below.
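A sketch of such a Hive table, defined from the master node: an external table backed by CSV files in an S3 bucket, using the OpenCSV SerDe (which treats all columns as strings). The bucket name and columns are illustrative:

```shell
# Define a Hive external table whose data files live in an S3 bucket.
# OpenCSVSerde stores every column as STRING; cast in queries as needed.
hive -e "
CREATE EXTERNAL TABLE orders (
  order_id STRING,
  customer STRING,
  amount   STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-odi-landing/hive/orders/';
"
```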
Figure 3: Using ODI with Hive in the Amazon EMR Cluster
Users can install additional big data tools such as Sqoop on the master node of the Amazon EMR cluster. ODI can take advantage of these additional tools: users can design ODI mappings that use Sqoop to perform data upload operations between Amazon EMR and Amazon RDS. This ODI integration is illustrated in Figure 4 below: the ODI standalone agent invokes Sqoop to move data between Amazon RDS instances and the Amazon EMR cluster.
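The Sqoop invocation behind such a mapping could resemble the following import from an Amazon RDS MySQL instance into HDFS on the cluster. The endpoint, credentials, table, and directory names are placeholders:

```shell
# Import a table from an RDS MySQL database into HDFS on the EMR cluster
# (-P prompts for the password instead of placing it on the command line).
sqoop import \
  --connect jdbc:mysql://myrds.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/sales \
  --username odi_user -P \
  --table ORDERS \
  --target-dir /user/hadoop/odi/orders \
  --num-mappers 4
```

A corresponding `sqoop export` moves transformed data back from HDFS into an RDS table.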
Figure 4: Using ODI with Sqoop in the Amazon EMR Cluster
ODI provides great benefits when its use is extended to the Amazon distributions of Hadoop in the Amazon EMR cloud service. Users do not need to manually write Hive, Spark, or Sqoop scripts to execute tasks in the Amazon EMR cluster. Instead, the ODI framework can be used to design big data tasks using ODI mappings, ODI components, and ODI knowledge modules.
ODI can also be used with other big data technologies such as Spark, which is also found in the Amazon EMR cloud service. Users can use ODI to design big data tasks with Spark and execute them against the Spark cluster found in the Amazon EMR cloud service.
ODI can be hosted on the Amazon EMR cloud service. The ODI standalone agent is a lightweight Java application that can be hosted on the master node of the Amazon EMR cluster. ODI is an engine-less data integration tool that can take full advantage of the Amazon EMR cloud service.
For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit “Oracle A-team Chronicles for Oracle Data Integrator (ODI).”