This article presents an overview of how to use Oracle Data Integrator (ODI) with Oracle Big Data Cloud (BDC). ODI offers out-of-the-box integration with Big Data technologies such as Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig, among others. ODI supports the following distributions of Hadoop: Hortonworks Data Platform (HDP) and Cloudera Enterprise Data Hub (CDH). ODI can also be used on other distributions of Hadoop such as Amazon Elastic MapReduce (EMR). This article discusses the benefits of using ODI with BDC, and it presents two options for installing and configuring ODI for BDC.
A pre-recorded live demonstration that supports this discussion can be found at the following Oracle Data Integration webcast: “Mastering Oracle Data Integrator with Big Data Cloud.”
Overview of Oracle Big Data Cloud
Oracle Big Data Cloud (BDC), also known as Oracle Big Data Cloud Service – Compute Edition, is Oracle’s elastic Big Data cloud service. It is built on Oracle’s own distribution of Hadoop, which resembles the Hortonworks Hadoop distribution. BDC includes Apache Hadoop (including MapReduce), Apache Spark, Apache ZooKeeper, Apache Hive, Apache Zeppelin, and Apache Pig. BDC also offers a complete set of RESTful APIs for managing BDC instances, including APIs to manage Spark jobs. Oracle offers Apache Kafka as a separate cloud service called Event Hub Cloud Service (EHCS).
BDC allows users to create Apache Spark and Hadoop clusters that grow or shrink elastically based on the user’s workload needs. BDC uses Oracle Storage Classic as its data lake, with an in-memory caching layer called Alluxio for fast data access. This design allows customers to scale compute and storage up or down independently. For additional information on how Alluxio works with BDC, go to “About the Big Data Cloud File System (BDFS).”
BDC is suitable for real-time data analysis, interactive data analysis, and batch data workloads. For instance, BDC offers an optimized Spark and Spark Streaming platform for real-time data analysis. Also, BDC is tightly integrated with Event Hub Cloud Service to ensure extreme performance between Kafka and BDC. For interactive data analysis, BDC offers Apache Zeppelin notebooks, which allow users to create collaborative documents with Scala, Hive, Spark SQL, Python, and R, among others. For batch data workloads, BDC offers REST APIs and command-line utilities such as cURL to launch batch jobs and transform data in the BDC cluster.
The data integration platform that ODI offers to Big Data users can also be extended to BDC. Users can use ODI with BDC to design and execute complex Hadoop and Spark transformations. Also, users can use ODI to design data integration tasks that combine datasets from on-premises data servers with cloud data services. These combined datasets can be transformed on BDC with ODI. Figure 1, below, illustrates an example of some of the on-premises data servers and cloud data services that can be integrated with ODI, including BDC:
Figure 1 – Using Oracle Data Integrator with Big Data Cloud
Using ODI with BDC provides the following benefits:
Although it is outside the scope of this article, it should be noted that Oracle also offers another Big Data cloud service called Oracle Big Data Cloud Service (BDCS). This Big Data cloud service is based on Cloudera Enterprise Data Hub (CDH), which includes Apache Hadoop, Apache Spark, and Apache Kafka. BDCS is a pre-configured cloud service that includes Oracle software such as the Oracle Big Data Connectors, Oracle R, Oracle Big Data Spatial and Graph, and Oracle Data Integrator (ODI) Enterprise Edition. This Big Data cloud service is also illustrated in Figure 1, above.
To use ODI with BDC, users can install and configure ODI in one of two ways: ODI Standalone or ODI with High-Availability. The ODI Standalone configuration requires the installation and configuration of the ODI Standalone agent in an instance of BDC. The ODI High-Availability configuration is an extension of the ODI Standalone configuration, but it uses the ODI J2EE agent as an orchestrator for Big Data workloads. The ODI High-Availability configuration also takes advantage of the failover and load-balancing capabilities that are available in a high-availability environment. The following two sections discuss the recommended components for each of these two ODI configurations.
Under this configuration, the ODI Standalone agent is installed and configured on an instance of BDC. The ODI Standalone agent is configured as a standalone lightweight Java application on BDC, and it is hosted in one of the nodes of the BDC cluster. ODI requires an ODI Standalone agent in the Big Data cluster in order to execute Big Data workloads. The ODI Standalone agent uses an ODI repository installed on an instance of the Database Cloud Service (DBCS).
This configuration does not include high-availability features, but users can implement their own monitoring tools to restart the ODI Standalone agent in case of failure. Also, additional ODI Standalone agents can be configured to distribute the execution of Big Data workloads. Figure 2, below, illustrates the ODI Standalone configuration for Big Data Cloud:
Figure 2 – Configuring ODI Standalone for Big Data Cloud
As suggested by Figure 2, above, three cloud services are required for this configuration:
The ODI High-Availability configuration is an extension of the ODI Standalone configuration, as previously noted in this article. Under this configuration, both ODI agents, the ODI Standalone agent and the ODI J2EE agent, are installed and configured in order to achieve high availability. The ODI Standalone agent is installed and configured on an instance of BDC. The ODI J2EE agent is installed and configured on an instance of the Oracle Java Cloud Service (JCS). JCS offers Oracle Traffic Director for load balancing, Oracle WebLogic for high availability, and Oracle Coherence for application scalability.
The ODI High-Availability configuration allows users to submit both Big Data and non-Big Data ODI workloads directly to the load balancer on JCS. The load balancer distributes the ODI workloads among the J2EE agents on JCS. Since Big Data workloads must be executed on a Big Data cluster, the J2EE agent sends the Big Data workloads to the Standalone agent, located on BDC, for execution. Under this configuration, the J2EE agent can be used to execute other types of workloads such as loading SQL data into Oracle Analytics Cloud (OAC), invoking RESTful web services to consume data from SaaS applications, or loading very large files from on-premises data servers into Oracle Storage Cloud Service.
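The routing behavior described above can be pictured with a small conceptual model in Python. This is a toy sketch, not ODI code: in ODI the executing agent is determined by the topology configuration, whereas this model simply keys on a technology name. The round-robin policy, the class names, the agent names, and the technology list are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import cycle

# Hypothetical set of technologies treated as "Big Data" in this model.
BIG_DATA_TECHNOLOGIES = {"hive", "spark", "pig", "hdfs"}


@dataclass(frozen=True)
class Workload:
    name: str
    technology: str


class Dispatcher:
    """Toy model of the load balancer in front of the J2EE agents."""

    def __init__(self, j2ee_agents, standalone_agent):
        # Round-robin is a stand-in for whatever policy the load balancer uses.
        self._j2ee_agents = cycle(j2ee_agents)
        self._standalone_agent = standalone_agent

    def route(self, workload):
        """Return (receiving J2EE agent, agent that executes the workload)."""
        receiver = next(self._j2ee_agents)
        if workload.technology.lower() in BIG_DATA_TECHNOLOGIES:
            # Big Data workloads are handed to the Standalone agent on BDC.
            return receiver, self._standalone_agent
        # Everything else runs on the J2EE agent that received it.
        return receiver, receiver


dispatcher = Dispatcher(["j2ee-1", "j2ee-2"], "standalone-bdc")
first = dispatcher.route(Workload("load_oac", "oracle"))
second = dispatcher.route(Workload("aggregate_sales", "spark"))
```

In this model, `first` is executed by the J2EE agent that received it, while `second` is received by a J2EE agent and executed by the Standalone agent on BDC.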
The ODI High-Availability configuration for BDC requires that you bring your own license (BYOL) of Oracle Data Integrator for Big Data. Thus, users must download the ODI installer from the Oracle Middleware Data Integrator Download Site. The ODI installer provided with Oracle Data Integrator Cloud Service (ODICS) on Java Cloud Service cannot be used for this configuration. Figure 3, below, illustrates the ODI High-Availability configuration for Big Data Cloud:
Figure 3 – Configuring ODI High-Availability for Big Data Cloud
As illustrated in Figure 3, above, four cloud services are required for this configuration:
For additional information on how to configure both ODI Standalone and ODI High-Availability with BDC, go to:
Configuring Oracle Data Integrator for Big Data Cloud: Standard Configuration
Configuring Oracle Data Integrator for Big Data Cloud: High-Availability Configuration
Configuring Oracle Data Integrator for Big Data Cloud: Environment Configuration
Configuring Oracle Data Integrator for Big Data Cloud: Topology Configuration
BDC is suitable for real-time data analysis, interactive data analysis, and large-volume data processing. For large-volume data processing, users can design workloads with ODI and execute them on BDC. ODI decreases the time it takes to implement Big Data projects on BDC by offering code templates, which increase developer productivity, streamline the development process, and improve performance. This article discussed the benefits of using ODI with BDC, and it presented two options for installing and configuring ODI for BDC.
For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit “Oracle A-Team Chronicles for Oracle Data Integrator (ODI).”
Webcast: “Mastering Oracle Data Integrator with Big Data Cloud Service - Compute Edition.”
Integrating Oracle Data Integrator (ODI) On-Premises with Cloud Services
Using Oracle Data Integrator (ODI) with Amazon Elastic MapReduce (EMR)
Event Hub Cloud Service (EHCS)