
Best Practices from Oracle Development's A‑Team

Publish an Oracle Cloud Infrastructure Data Integration task to Oracle Cloud Infrastructure Data Flow

Jerome Francoisse
Consulting Solution Architect

A recent release of Oracle Cloud Infrastructure (OCI) Data Integration introduces the ability to publish a task to OCI Data Flow for execution. In this blog, we discuss the benefits of publishing a task to OCI Data Flow and how to do it.

OCI Data Integration and OCI Data Flow

OCI Data Integration and OCI Data Flow services both run Spark jobs in a serverless fashion. They’re both OCI native and integrate with Oracle Identity and Access Management (IAM).

OCI Data Integration provides a nice graphical interface to design data flows in a WYSIWYG canvas. This no-code approach lets business users, ETL developers, and data engineers with a deep understanding of the data build their own data integration pipelines without writing any code. When running a task, OCI Data Integration automatically chooses OCI Compute shapes for the Spark driver and the executors to run the job. The main method of your application runs on the driver, which splits the work into tasks and sends them to several executors for fast parallel processing.

OCI Data Flow allows you to run your own Spark scripts, also in a serverless manner. It supports Java, Scala, and Python, so you can use the Spark framework in whichever language you're most familiar with. When running an application on OCI Data Flow, you can select the OCI Compute shapes used for the driver and the executors, as well as the number of executors.
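
As an illustration, a minimal PySpark script of the kind you could run on OCI Data Flow might look like the following sketch. The bucket, namespace, file names, and column names are placeholders, and the oci:// paths rely on the Object Storage support that OCI Data Flow configures for your application:

    # Minimal PySpark sketch for OCI Data Flow.
    # Bucket, namespace, file names, and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("traffic-count").getOrCreate()

    # Read a CSV file from Object Storage, aggregate, and write the result back.
    traffic = spark.read.csv("oci://my-bucket@my-namespace/traffic.csv",
                             header=True, inferSchema=True)
    totals = traffic.groupBy("road").count()
    totals.write.csv("oci://my-bucket@my-namespace/output/", header=True)

    spark.stop()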

OCI Data Integration and OCI Data Flow complement each other. You can accomplish most data integration jobs in OCI Data Integration: extract data from different sources, merge and transform it, and store the result in your favorite location, such as Oracle Object Storage or Autonomous Data Warehouse. You can use OCI Data Flow for advanced use cases not covered by OCI Data Integration, like running machine learning models using MLlib.

While OCI Data Flow gives more flexibility in the development and the execution of a job, it requires more technical knowledge to write the code and find the right shape to use through benchmarking.

So, it makes sense to integrate both services and publish OCI Data Integration tasks to OCI Data Flow for execution. This lets you select Compute shapes when you need to and centralize all your Spark runs in one place.

Implementation

Prerequisites

Publishing an OCI Data Integration task to OCI Data Flow lands the task code generated by OCI Data Integration into an Oracle Object Storage bucket. It then creates an OCI Data Flow application pointing to that file.

Therefore, you need policies that allow your Data Integration workspace to manage OCI Data Flow applications so that it can create one. You also need to grant your user privileges to read and write to Oracle Object Storage, read OCI Data Flow applications, and manage OCI Data Flow runs. For policy examples, see Data Integration Policies.
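
As a rough illustration only, such policies might look like the following sketch. The group name, compartment name, and workspace OCID are placeholders; check the Data Integration Policies documentation for the exact statements your setup requires:

    allow any-user to manage dataflow-application in compartment my-compartment
        where ALL {request.principal.type = 'disworkspace', request.principal.id = '<workspace-ocid>'}
    allow group my-di-group to manage object-family in compartment my-compartment
    allow group my-di-group to read dataflow-application in compartment my-compartment
    allow group my-di-group to manage dataflow-run in compartment my-compartment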

You also need an Oracle Object Storage bucket ready to receive the task code: this is where OCI Data Integration lands the jar file containing your Spark task when you publish, and it's this jar file that the OCI Data Flow application references.
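
If you don't have a bucket yet, you can create one from the console, the CLI, or an SDK. For example, here's a quick sketch with the OCI Python SDK; the compartment OCID and bucket name are placeholders:

    # Sketch: create an Object Storage bucket with the OCI Python SDK.
    # The compartment OCID and bucket name are placeholders.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config
    object_storage = oci.object_storage.ObjectStorageClient(config)
    namespace = object_storage.get_namespace().data

    details = oci.object_storage.models.CreateBucketDetails(
        name="di-dataflow-publish",
        compartment_id="ocid1.compartment.oc1..example",
    )
    object_storage.create_bucket(namespace, details)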

Publishing a task to OCI Data Flow

Both Integration tasks and Data Loader tasks can be published to OCI Data Flow. In this example, you first create a simple data flow joining two CSV files stored on Oracle Object Storage to analyze traffic over roads. You filter the traffic source based on the date, cast the vehicle counts from strings to numbers, and finally sum the car counts and load the total into a new CSV file stored on Oracle Object Storage.

Figure 1: An example data flow

You then configure an integration task for that data flow.

Figure 2: Integration Task configured with the example data flow

Then, you can navigate to the folder containing the integration task and use the task’s actions menu (three dots) to select Publish to OCI Data Flow.

Figure 3: Menu item to publish a task to OCI Data Flow

A new dialog pane opens, and you can add the details about your application. You can give it a name and description, choose the OCI Compute shapes for the driver and the executors, and select the number of executors. Finally, you can choose where the jar file containing the OCI Data Integration task is uploaded. This jar file is referenced in the OCI Data Flow application. 

Figure 4: Configuring the OCI Data Flow application: compartment, application name, Spark driver and executor shapes, and so on

Figure 5: Configuring the OCI Data Flow application: jar file storage

After clicking Publish, you see a notification with a link. Clicking that link opens a new pane displaying the publishing status. You can also access that pane by clicking the task’s actions menu (three dots) and selecting View OCI Data Flow Publish History.

Figure 6: Publish notification and History menu item

Figure 7: OCI Data Flow Publish history

If the status is successful, you can navigate to the OCI Data Flow Applications page to see the newly created application.

Figure 8: Navigating to OCI Data Flow

From this page, you can run the OCI Data Flow application by using the application’s actions menu (three dots) and selecting Run. This action opens a pane where you can change the default driver and executor shapes and the number of executors. After validating, you’re redirected to the OCI Data Flow Runs page, where you can monitor the execution.

Figure 9: Review and run the OCI Data Flow application

Figure 10: Run the OCI Data Flow application

Clicking the run name provides more details and a link to the Spark UI page, also available from the run’s actions menu (three dots).

Figure 11: Monitoring the execution with Spark UI

Now, hop to the Oracle Object Storage page to check that the target files were correctly created in the bucket. And there you are: you have successfully run your OCI Data Integration task in OCI Data Flow!

Figure 12: Target CSV files created by the OCI Data Flow run in an Oracle Object Storage bucket

Limitations

  • Currently, publishing an OCI Data Integration task to OCI Data Flow only supports Oracle Object Storage sources and targets.
  • OCI Data Integration task parameters can’t yet be used in OCI Data Flow.

The product team is aware of these limitations, and I hope these features will be available soon!

Automation

Both OCI Data Integration and OCI Data Flow expose all their actions through an API and various software development kits (SDKs). Therefore, you can automate the publishing of a task to OCI Data Flow, monitor the publishing status, and execute it right away.

With the REST API, the task is published with the CreateExternalPublication operation. The GetExternalPublication operation allows you to monitor the status: you can keep checking it until it changes from PUBLISHING to SUCCESSFUL or FAILED. You can then use CreateRun to run the application in OCI Data Flow, and GetRun and GetRunLog to monitor its execution.

With the Java SDK, the task is published using the createExternalPublication() method from the OCI Data Integration package. You can use getExternalPublication().getExternalPublication().getStatus() to get the status and wait for it to change from Publishing to Successful or Failed. You can then use the createRun() method from the OCI Data Flow package to create a new run.
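
The other SDKs follow the same pattern. For example, here is a rough sketch of the whole flow with the Python SDK. The OCIDs and task key are placeholders, and the detail-model field names shown are assumptions to be checked against the SDK reference:

    # Sketch: publish a task to OCI Data Flow and run it with the OCI Python SDK.
    # OCIDs, the task key, and the detail-model fields are placeholders/assumptions.
    import time
    import oci

    config = oci.config.from_file()
    di = oci.data_integration.DataIntegrationClient(config)
    df = oci.data_flow.DataFlowClient(config)

    workspace_id = "ocid1.disworkspace.oc1..example"
    task_key = "example-task-key"
    compartment_id = "ocid1.compartment.oc1..example"

    # Publish the task to OCI Data Flow (CreateExternalPublication).
    details = oci.data_integration.models.CreateExternalPublicationDetails(
        application_compartment_id=compartment_id,
        display_name="traffic-task",
    )
    publication = di.create_external_publication(workspace_id, task_key, details).data

    # Poll the publication until it leaves the PUBLISHING state.
    while publication.status == "PUBLISHING":
        time.sleep(10)
        publication = di.get_external_publication(
            workspace_id, task_key, publication.key).data

    # If publishing succeeded, run the new OCI Data Flow application.
    if publication.status == "SUCCESSFUL":
        run_details = oci.data_flow.models.CreateRunDetails(
            application_id=publication.application_id,
            compartment_id=compartment_id,
            display_name="traffic-task-run",
        )
        run = df.create_run(run_details).data
        print("Started OCI Data Flow run:", run.id)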



Conclusion

Today, we have seen how easy it is to publish an OCI Data Integration task to OCI Data Flow. This capability helps you gather all your Spark jobs in one place and gives you more control over your driver and executor shapes. You can also automate publishing your task and running it in OCI Data Flow as soon as it’s ready.

Stay tuned for more on OCI Data Integration, OCI Data Flow and data integration in general.

For more how-tos and interesting reads, check out Oracle Cloud Infrastructure Data Integration blogs.
