A recent release of Oracle Cloud Infrastructure (OCI) Data Integration introduces the ability to publish a task to OCI Data Flow for execution. In this blog, we discuss the benefits of publishing a task to OCI Data Flow and how to do it.
OCI Data Integration and OCI Data Flow services both run Spark jobs in a serverless fashion. They’re both OCI native and integrate with Oracle Identity and Access Management (IAM).
OCI Data Integration provides a nice graphical interface to design data flows in a WYSIWYG canvas. This no-code approach enables business users, ETL developers, and data engineers with a deep understanding of the data to build their own data integration pipelines without writing any code. When running a task, OCI Data Integration automatically chooses OCI Compute shapes for the Spark driver and the executors. The main method of your application runs on the driver, which splits the work into tasks and distributes them to several executors for fast parallel processing.
OCI Data Flow allows you to run your own Spark scripts, also in a serverless manner. It supports Java, Scala, and Python, so you can use the Spark framework in whichever language you're most familiar with. When running an application on OCI Data Flow, you can select the OCI Compute shapes used for the driver and the executors, as well as the number of executors.
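For example, a minimal PySpark script of the kind you could upload and run on OCI Data Flow might look like the following sketch, where the bucket, namespace, file, and column names are placeholders for illustration:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session that OCI Data Flow provides to the driver
spark = SparkSession.builder.appName("traffic-sample").getOrCreate()

# Read a CSV file directly from Oracle Object Storage using the oci:// scheme
traffic = spark.read.csv(
    "oci://<bucket-name>@<namespace>/traffic.csv", header=True, inferSchema=True
)

# A simple aggregation that Spark distributes across the executors
traffic.groupBy("road").count().show()

spark.stop()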
OCI Data Integration and OCI Data Flow complement each other. You can handle most of your data integration jobs in OCI Data Integration: extracting data from different sources, merging and transforming it, and storing it in your favorite location, such as Oracle Object Storage or Autonomous Data Warehouse. You can use OCI Data Flow for more advanced use cases not covered by OCI Data Integration, such as running machine learning models with MLlib.
While OCI Data Flow gives you more flexibility in developing and running a job, it requires more technical knowledge to write the code and to find the right shapes through benchmarking.
So, it makes sense to integrate both services and use the ability to publish and run OCI Data Integration tasks in OCI Data Flow. This capability lets you select Compute shapes when you need to and centralize all your Spark runs in the same place.
Publishing an OCI Data Integration task to OCI Data Flow uploads the generated task code to an Oracle Object Storage bucket and then creates an OCI Data Flow application pointing to that file.
Therefore, you need to ensure that your Data Integration workspace has the right policies to manage OCI Data Flow applications so that it can create one. You also need to grant your user privileges to read from and write to Oracle Object Storage, read OCI Data Flow applications, and manage OCI Data Flow runs. For some policy examples, see Data Integration Policies.
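For instance, the statements generally follow this pattern. This is an illustrative sketch, not a complete policy; substitute your own group name, compartment name, bucket name, and workspace OCID, and adjust the verbs to your security requirements.

allow any-user to manage dataflow-application in compartment <compartment-name> where ALL {request.principal.type = 'disworkspace', request.principal.id = '<workspace-ocid>'}
allow group <group-name> to manage objects in compartment <compartment-name> where target.bucket.name = '<bucket-name>'
allow group <group-name> to read dataflow-application in compartment <compartment-name>
allow group <group-name> to manage dataflow-run in compartment <compartment-name>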
You also need an Oracle Object Storage bucket ready to receive the task code. When you publish a task to OCI Data Flow, OCI Data Integration uploads a jar file containing your Spark task to this bucket, and the OCI Data Flow application references that jar file.
Both Integration tasks and Data Loader tasks can be published to OCI Data Flow. In this example, you first create a simple data flow that joins two CSV files stored on Oracle Object Storage to analyze traffic over roads. You filter the traffic source based on the date, cast the vehicle counts from strings to numbers, and finally sum the car counts and load the total into a new CSV file stored on Oracle Object Storage.
Figure 1: An example data flow
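For readers who think in code, the logic of this data flow is roughly equivalent to the PySpark below. This is purely an illustrative sketch with hypothetical file, column, and bucket names; it is not the code that OCI Data Integration generates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("traffic-analysis").getOrCreate()

# Two hypothetical CSV sources in Oracle Object Storage
traffic = spark.read.csv("oci://<bucket>@<namespace>/traffic.csv", header=True)
roads = spark.read.csv("oci://<bucket>@<namespace>/roads.csv", header=True)

result = (
    traffic
    .filter(F.col("record_date") >= "2021-01-01")               # filter the traffic source on the date
    .withColumn("car_count", F.col("car_count").cast("int"))    # cast the string count to a number
    .join(roads, on="road_id")                                   # join the two sources
    .groupBy("road_name")
    .agg(F.sum("car_count").alias("total_cars"))                 # sum the car counts
)

# Load the total into a new CSV file on Oracle Object Storage
result.write.mode("overwrite").csv("oci://<bucket>@<namespace>/traffic_totals")

spark.stop()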
You then configure an integration task for that data flow.
Figure 2: Integration Task configured with the example data flow
Then, you can navigate to the folder containing the integration task and use the task’s actions menu (three dots) to select Publish to OCI Data Flow.
Figure 3: Menu item to publish a task to OCI Data Flow
A new dialog pane opens, and you can add the details about your application. You can give it a name and description, choose the OCI Compute shapes for the driver and the executors, and select the number of executors. Finally, you can choose where the jar file containing the OCI Data Integration task is uploaded. This jar file is referenced in the OCI Data Flow application.
Figure 4: Configuring the OCI Data Flow application: compartment, application name, Spark driver and executor shapes, and so on.
Figure 5: Configuring the OCI Data Flow application: jar file storage
After clicking Publish, you see a notification with a link. Clicking that link opens a new pane displaying the publishing status. You can also access that pane from the task's actions menu (three dots) by selecting View OCI Data Flow Publish History.
Figure 6: Publish notification and History menu item
Figure 7: OCI Data Flow Publish history
If the status is successful, you can navigate to the OCI Data Flow Applications page to see the newly created application.
Figure 8: Navigating to OCI Data Flow
From this page, you can run the OCI Data Flow application by using the application's actions menu (three dots) and selecting Run. This action opens a pane where you can change the default driver shape, executor shape, and number of executors. After you confirm, you're redirected to the OCI Data Flow Runs page, where you can monitor the execution.
Figure 9: Review and run the OCI Data Flow application
Figure 10: Run the OCI Data Flow application
Clicking the run name provides more details and a link to the Spark UI page, also available from the run’s actions menu (three dots).
Figure 11: Monitoring the execution with Spark UI
Now, hop to the Oracle Object Storage page to check that the target files were correctly created in the bucket. And there you are: you have successfully run your OCI Data Integration task in OCI Data Flow!
Figure 12: Target CSV files created by the OCI Data Flow run in an Oracle Object Storage bucket
The product team is aware of these limitations, and I hope these features will be available soon!
Both OCI Data Integration and OCI Data Flow expose all their actions through a REST API and various software development kits (SDKs). Therefore, you can automate publishing a task to OCI Data Flow, monitor the publishing status, and run the resulting application right away.
With the REST API, you publish the task with the CreateExternalPublication operation. The GetExternalPublication operation lets you monitor the status: keep polling until it changes from PUBLISHING to SUCCESSFUL or FAILED. You can then use CreateRun to run the application in OCI Data Flow, and GetRun and GetRunLog to monitor its execution.
With the Java SDK, you publish the task with the createExternalPublication() method from the OCI Data Integration package. You can call getExternalPublication().getExternalPublication().getStatus() to check the status and wait for it to change from PUBLISHING to SUCCESSFUL or FAILED. You can then use the createRun() method from the OCI Data Flow package to create a new run.
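Whichever interface you choose, the pattern is the same. The sketch below shows it with the OCI SDK for Python, assuming placeholder OCIDs and keys that you replace with your own; the details objects show only a few of the available fields.

import time

import oci

# Clients authenticated with the default profile in ~/.oci/config
config = oci.config.from_file()
di_client = oci.data_integration.DataIntegrationClient(config)
df_client = oci.data_flow.DataFlowClient(config)

# Placeholders: replace with your own OCIDs and keys
WORKSPACE_ID = "<workspace-ocid>"
TASK_KEY = "<task-key>"
COMPARTMENT_ID = "<compartment-ocid>"

# Publish the task to OCI Data Flow (CreateExternalPublication)
publication = di_client.create_external_publication(
    WORKSPACE_ID,
    TASK_KEY,
    oci.data_integration.models.CreateExternalPublicationDetails(
        application_compartment_id=COMPARTMENT_ID,  # where the Data Flow application is created
        display_name="traffic-task-on-data-flow",
    ),
).data

# Poll GetExternalPublication until publishing finishes
while publication.status == "PUBLISHING":
    time.sleep(15)
    publication = di_client.get_external_publication(
        WORKSPACE_ID, TASK_KEY, publication.key
    ).data

if publication.status != "SUCCESSFUL":
    raise RuntimeError(f"Publishing failed with status {publication.status}")

# Run the generated OCI Data Flow application (CreateRun) ...
run = df_client.create_run(
    oci.data_flow.models.CreateRunDetails(
        application_id=publication.application_id,  # OCID of the generated application
        compartment_id=COMPARTMENT_ID,
        display_name="traffic-task-run",
    )
).data

# ... and monitor it with GetRun until it reaches a terminal state
while run.lifecycle_state in ("ACCEPTED", "IN_PROGRESS"):
    time.sleep(30)
    run = df_client.get_run(run.id).data

print("Run finished with state:", run.lifecycle_state)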
Today, we have seen how easily you can publish an OCI Data Integration task to OCI Data Flow. This capability helps you gather all your Spark jobs in one place and gives you finer control over your driver and executor shapes. You can also automate publishing and running your task in OCI Data Flow as soon as it's ready.
Stay tuned for more on OCI Data Integration, OCI Data Flow, and data integration in general.
For more how-tos and interesting reads, check out Oracle Cloud Infrastructure Data Integration blogs.