Using Oracle Data Integrator with Big Data Cloud

Introduction

 

This article presents an overview of how to use Oracle Data Integrator (ODI) with Oracle Big Data Cloud (BDC). ODI offers out-of-the-box integration with Big Data technologies such as Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig, among others. ODI supports the two major Hadoop distributions, Hortonworks Data Platform (HDP) and Cloudera Enterprise Data Hub (CDH), and it can also be used with other Hadoop distributions such as Amazon Elastic MapReduce (EMR). This article discusses the benefits of using ODI with BDC, as well as two options for installing and configuring ODI for BDC.

A recording of a live demonstration that supports this discussion can be found in the following Oracle Data Integration webcast: “Mastering Oracle Data Integrator with Big Data Cloud.”

 

Overview of Oracle Big Data Cloud

 

Oracle Big Data Cloud (BDC), also known as Oracle Big Data Cloud Service – Compute Edition, is Oracle’s elastic Big Data cloud service, built on Oracle’s own Hadoop distribution, which closely resembles the Hortonworks Data Platform. BDC includes Apache Hadoop (including MapReduce), Apache Spark, Apache ZooKeeper, Apache Hive, Apache Zeppelin, and Apache Pig. BDC also offers a complete set of RESTful APIs that users can use to manage BDC instances, including APIs to manage Spark jobs. Oracle offers Apache Kafka as a separate cloud service called Event Hub Cloud Service (EHCS).

BDC allows users to create Apache Spark and Hadoop clusters, and these clusters can grow or shrink elastically based on users’ workload needs. BDC uses Oracle Storage Classic as its data lake, with an in-memory caching layer called Alluxio for fast data access – this allows customers to scale compute and storage up or down independently. For additional information on how Alluxio works with BDC, see “About the Big Data Cloud File System (BDFS).”

BDC is suitable for real-time data analysis, interactive data analysis, and batch data workloads. For real-time data analysis, BDC offers an optimized Spark and Spark Streaming platform, and it is tightly integrated with Event Hub Cloud Service to provide high-performance data exchange between Kafka and BDC. For interactive data analysis, BDC offers Apache Zeppelin notebooks, which allow users to create collaborative documents with Scala, Hive, Spark SQL, Python, and R, among others. For batch data workloads, BDC offers REST APIs and command-line utilities such as cURL to launch batch jobs and transform data in the BDC cluster.
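For illustration, the sketch below submits a Spark batch job to a BDC cluster through its REST API using Python. The endpoint path, payload fields, job name, and application location are placeholders rather than the documented resource names – consult the Big Data Cloud REST API documentation for the exact URIs and request bodies.

    # Hypothetical sketch: submit a Spark batch job to a BDC cluster via its REST API.
    # The endpoint path and payload fields are placeholders; see the Big Data Cloud
    # REST API documentation for the actual resource names and request format.
    import requests

    BDC_JOBS_ENDPOINT = "https://<rest-server>/api/v1.1/.../instances/<cluster-name>/jobs"  # placeholder
    payload = {
        "name": "daily_sales_aggregation",                             # illustrative job name
        "file": "swift://container.default/jobs/aggregate_sales.py",   # Spark application in the data lake
        "args": ["2018-01-31"],                                        # illustrative job argument
    }

    response = requests.post(
        BDC_JOBS_ENDPOINT,
        json=payload,
        auth=("<cloud-user>", "<password>"),  # basic authentication
    )
    response.raise_for_status()
    print("Job submitted:", response.json())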

The data integration platform that ODI offers to Big Data users can also be extended to BDC. Users can use ODI with BDC to design and execute complex Hadoop and Spark transformations. Users can also use ODI to design data integration tasks that combine datasets from on-premises data servers with cloud data services, and these combined datasets can then be transformed on BDC with ODI. Figure 1, below, illustrates some of the on-premises data servers and cloud data services that can be integrated with ODI, including BDC:

 

Figure 1 – Using Oracle Data Integrator with Big Data Cloud


 

Using ODI with BDC provides the following benefits:

 

  • Users do not need to write Spark, Hive, or Pig programs in order to analyze or transform data on BDC. Instead, users can use ODI to design their workloads, and ODI natively generates the necessary code and uses BDC to perform the data transformations (an illustrative sketch of this kind of generated Spark code follows this list).
  • Users can design a single workload with ODI and perform its transformations in one or more of the transformation engines available in BDC. For instance, users can design a single ODI mapping and perform its transformations in Spark, in Hive, or in Pig. ODI uses a Declarative Design method to separate the logical design of a mapping from its physical implementation. Thus, users can design multiple physical implementations of the same mapping, and each physical implementation can use a different transformation engine to transform data in BDC.
  • Users can design workloads with ODI that invoke the REST APIs in BDC to perform data upload operations, control Spark jobs, move data between Storage Cloud and BDC, or even scale BDC clusters in and out.
  • Users can design workloads with ODI to combine datasets from multiple data sources such as SQL databases, Kafka data feeds, SaaS applications, data lakes, and many others. ODI can integrate these datasets and transform them in BDC.
  • ODI decreases the time it takes to implement Big Data workloads in BDC by offering out-of-the-box knowledge modules, or code templates, which increase developer productivity, streamline the development process, and improve performance.
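To make the idea of generated code concrete, the sketch below shows roughly the kind of PySpark transformation that an ODI mapping with a Spark physical design resolves to – read sources, filter, join, aggregate, and write the result back to the data lake. The dataset paths, column names, and application name are invented for this illustration; the actual code ODI generates depends on the knowledge modules used and on the mapping design.

    # Illustrative only: the general shape of a Spark mapping (filter, join,
    # aggregate, load). Paths and column names are made up for this example;
    # ODI generates its own code from the mapping and knowledge modules.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("odi_style_mapping_example").getOrCreate()

    orders = spark.read.parquet("swift://container.default/data/orders")        # source 1
    customers = spark.read.parquet("swift://container.default/data/customers")  # source 2

    # Transformation steps a mapping would declare logically.
    result = (
        orders.filter(F.col("order_status") == "SHIPPED")
              .join(customers, "customer_id")
              .groupBy("customer_region")
              .agg(F.sum("order_total").alias("total_revenue"))
    )

    # Load step: write the transformed dataset back to the data lake (or to Hive).
    result.write.mode("overwrite").parquet("swift://container.default/data/revenue_by_region")
    spark.stop()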

Although it is not the focus of this article, it should be noted that Oracle also offers another Big Data cloud service called Oracle Big Data Cloud Service (BDCS). This Big Data cloud service is based on Cloudera Enterprise Data Hub (CDH), which includes Apache Hadoop, Apache Spark, and Apache Kafka. BDCS is a pre-configured cloud service that includes Oracle software such as the Oracle Big Data Connectors, Oracle R, Oracle Big Data Spatial and Graph, and Oracle Data Integrator (ODI) Enterprise Edition. This Big Data cloud service is also illustrated in Figure 1, above.

Configuring Oracle Data Integrator for Big Data Cloud

 

To use ODI with BDC, users can install and configure ODI in one of two ways: ODI Standalone or ODI High-Availability. The ODI Standalone configuration requires installing and configuring the ODI Standalone agent on an instance of BDC. The ODI High-Availability configuration is an extension of the ODI Standalone configuration, but it uses the ODI J2EE agent as an orchestrator for Big Data workloads, and it takes advantage of the failover and load-balancing capabilities available in a high-availability environment. The following two sections discuss the recommended components for each of these two configurations.

ODI Standalone Configuration for Big Data Cloud

Under this configuration, the ODI Standalone agent is installed and configured on an instance of BDC. The ODI Standalone agent is configured as a standalone lightweight Java application on BDC, and it is hosted in one of the nodes of the BDC cluster. ODI requires an ODI Standalone agent in the Big Data cluster in order to execute Big Data workloads. The ODI Standalone agent uses an ODI repository installed on an instance of the Database Cloud Service (DBCS).

This configuration does not include high-availability features, but users can implement their own monitoring tools to restart the ODI Standalone agent in case of failure (a minimal monitoring sketch follows Figure 2, below). Additional ODI Standalone agents can also be configured to distribute the execution of the Big Data workloads. Figure 2, below, illustrates the ODI Standalone configuration for Big Data Cloud:

 

Figure 2 – Configuring ODI Standalone for Big Data Cloud

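As a minimal example of the kind of monitoring mentioned above, the Python sketch below polls the Standalone agent over HTTP and restarts it when it stops responding. The agent host, port, default /oraclediagent context, domain path, and agent.sh start command are assumptions for a typical ODI 12c standalone agent domain; adjust them to your environment.

    # Minimal monitoring sketch (assumptions: default agent context /oraclediagent,
    # a standard ODI 12c standalone domain layout, and the agent.sh start script).
    import subprocess
    import time
    import urllib.request

    AGENT_URL = "http://bdc-node-1:20910/oraclediagent"   # assumed agent host, port, and context
    AGENT_START = ["/u01/odi/domains/odi_domain/bin/agent.sh", "-NAME=OracleDIAgent1"]  # assumed paths

    def agent_is_up(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    while True:
        if not agent_is_up(AGENT_URL):
            # Restart the Standalone agent in the background and keep monitoring.
            subprocess.Popen(AGENT_START)
        time.sleep(60)  # poll once per minute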

 


 

As shown in Figure 2, above, three cloud services are used for this configuration:

 

  • Database Cloud Service (DBCS) – This cloud service is required in order to host the ODI repository. Alternatively, users can install and host an ODI-certified SQL database on an instance of Oracle Compute Cloud Service. For a list of SQL databases that are certified with ODI, see the “Oracle Data Integrator Certification Matrix.”
  • Big Data Cloud (BDC) – This cloud service is required, and it is used as the transformation engine to run the Big Data workloads. At least one ODI Standalone agent must be installed on one of the nodes of the BDC cluster, but two ODI Standalone agents are recommended for failover and additional load-balancing capabilities. BDC also requires Storage Cloud Service to host its data lake, and Event Hub Cloud Service for those users with streaming-data requirements.
  • Compute Cloud Service – This cloud service is recommended to host ODI Studio, so that ODI developers have a dedicated compute resource to design their Big Data workloads – mappings, packages, procedures, and other ODI objects. Performing ODI development directly on BDC or DBCS is not recommended.


ODI High-Availability Configuration for Big Data Cloud

The ODI High-Availability configuration is an extension of the ODI Standalone configuration, as previously noted in this article. Under this configuration, both ODI agent types – the ODI Standalone agent and the ODI J2EE agent – are installed and configured in order to achieve high availability. The ODI Standalone agent is installed and configured on an instance of BDC. The ODI J2EE agent is installed and configured on an instance of the Oracle Java Cloud Service (JCS). JCS offers Oracle Traffic Director for load balancing, Oracle WebLogic Server for high availability, and Oracle Coherence for application scalability.

The ODI High-Availability configuration allows users to submit both Big Data and non-Big Data ODI workloads directly to the load balancer on JCS. The load balancer distributes the ODI workloads among the J2EE agents on JCS. Since Big Data workloads must be executed on a Big Data cluster, the J2EE agent sends the Big Data workloads to the Standalone agent, located on BDC, for execution. Under this configuration, the J2EE agent can be used to execute other types of workloads such as loading SQL data into Oracle Analytics Cloud (OAC), invoking RESTful web services to consume data from SaaS applications, or loading very large files from on-premises data servers into Oracle Storage Cloud Service.
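One of those non-Big Data workloads – loading large files into Oracle Storage Cloud Service – can be illustrated with a short sketch. The example below uploads a file using the Swift-compatible REST API exposed by Oracle Storage Cloud Service (Classic); the host name, identity domain, container name, and local file path are placeholders, and the authentication endpoint should be verified against the Storage Cloud documentation for your account.

    # Minimal sketch: upload a file to Oracle Storage Cloud Service (Classic) through its
    # Swift-compatible REST API. Host, identity domain, container, and file path are
    # placeholders; verify the authentication endpoint for your account.
    import requests

    AUTH_URL = "https://<identity-domain>.storage.oraclecloud.com/auth/v1.0"  # placeholder

    # Step 1: authenticate and obtain a token plus the account storage URL.
    auth = requests.get(AUTH_URL, headers={
        "X-Storage-User": "Storage-<identity-domain>:<user>",
        "X-Storage-Pass": "<password>",
    })
    auth.raise_for_status()
    token = auth.headers["X-Auth-Token"]
    storage_url = auth.headers["X-Storage-Url"]

    # Step 2: stream the local file into a container as an object.
    with open("/data/exports/orders_2018.csv", "rb") as f:        # illustrative local path
        put = requests.put(
            f"{storage_url}/staging/orders_2018.csv",             # container "staging" is illustrative
            headers={"X-Auth-Token": token},
            data=f,                                               # streamed upload for large files
        )
    put.raise_for_status()
    print("Uploaded, HTTP status:", put.status_code)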

The ODI High-Availability configuration for BDC requires users to bring their own license (BYOL) of Oracle Data Integrator for Big Data. Thus, users must download the ODI installer from the Oracle Middleware Data Integrator Download Site. The ODI installer that comes with Oracle Data Integrator Cloud Service (ODICS) on Java Cloud Service cannot be used for this configuration. Figure 3, below, illustrates the ODI High-Availability configuration for Big Data Cloud:

 

Figure 3 – Configuring ODI High-Availability for Big Data Cloud


 

As illustrated in Figure 3, above, four cloud services are used for this configuration:

 

  • Java Cloud Service (JCS) – This cloud service is required in order to host the application server (WebLogic), the load balancer (Traffic Director), and the ODI J2EE agents. JCS requires an instance of Database Cloud Service (DBCS) to store the JCS metadata; thus, users must subscribe to DBCS before provisioning an instance of JCS.
  • Database Cloud Service (DBCS) – This cloud service is required in order to host the ODI repository. Since JCS requires an instance of DBCS, users can install and configure the ODI repository on the same instance of DBCS where JCS metadata is stored.
  • Big Data Cloud (BDC) – This cloud service is required, and it is used as the transformation engine to run the ODI Big Data workloads. At least one ODI Standalone agent must be installed in the BDC cluster, but two ODI Standalone agents are recommended for failover and additional load-balancing capabilities.
  • Compute Cloud Service – This cloud service is recommended to host ODI Studio, so that ODI developers have a dedicated compute resource to design their Big Data workloads – mappings, packages, procedures, and other ODI objects. Performing ODI development directly on BDC or DBCS is not recommended.

For additional information on how to configure both ODI Standalone and ODI High-Availability with BDC, go to:

Configuring Oracle Data Integrator for Big Data Cloud: Standard Configuration

Configuring Oracle Data Integrator for Big Data Cloud: High-Availability Configuration

Configuring Oracle Data Integrator for Big Data Cloud: Environment Configuration

Configuring Oracle Data Integrator for Big Data Cloud: Topology Configuration

Conclusion

 

BDC is suitable for real-time data analysis, interactive data analysis, and large-volume data processing. For large-volume data processing, users can design workloads with ODI and execute them on BDC. ODI decreases the time it takes to implement Big Data projects on BDC by offering code templates, which increase developer productivity, streamline the development process, and improve performance. This article discussed the benefits of using ODI with BDC, and it presented two options for installing and configuring ODI for BDC.

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit the Oracle A-Team Chronicles for Oracle Data Integrator (ODI).


ODI Related Articles

 

Webcast: “Mastering Oracle Data Integrator with Big Data Cloud Service – Compute Edition.”

Integrating Oracle Data Integrator (ODI) On-Premises with Cloud Services

Using Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)

Big Data Cloud (BDC)

Big Data Cloud REST API

Big Data Cloud Service (BDCS)

Event Hub Cloud Service (EHCS)

Oracle Storage Classic

Database Cloud Service (DBCS)

Compute Cloud Service

Java Cloud Service (JCS)

 

 
