Running Spark Workloads In Oracle Cloud

October 31, 2022 | 5 minute read
Jeffrey Thomas
Big Data Architect, A-Team

As customers move more and more data into the cloud from a variety of sources, the need for tooling that can process data at very large scale has become a common requirement.

Historically, one of the main tools used in software architectures to perform this type of work at high scale has been Hadoop. There are many tools within the Hadoop ecosystem that are up to the task, but for today's conversation we will focus on running Spark workloads.

Growing from open-source roots, Spark has become the industry standard for almost any data processing operation at scale, including streaming. Its focus is on leveraging in-memory processing distributed across a cluster of compute nodes. In-memory processing allows Spark to run much faster when the right amount of memory is allocated to a job, and is why Spark workloads are a great fit for autoscaled environments.
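
To make that concrete, here is a minimal PySpark sketch of the in-memory pattern: a dataset is cached across the cluster's memory once, and later actions reuse that copy instead of re-reading from storage. The input path and column name are hypothetical.

    from pyspark.sql import SparkSession

    # On Data Flow or BDS, the cluster resources behind this session are
    # provided by the service.
    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    df = spark.read.parquet("events/")  # hypothetical input path
    df.cache()                          # pin the dataset in cluster memory

    # Both actions below reuse the cached in-memory copy instead of
    # re-reading from storage.
    print(df.count())
    df.groupBy("event_type").count().show()  # "event_type" is illustrative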

Oracle Cloud Infrastructure (OCI) has embraced this technology, leveraging Spark under the covers in a number of native services as well as offering serverless, elastic Spark as its own service, called Data Flow.  Customers can also run Spark workloads in the OCI Big Data Service (BDS), a fully managed Hadoop environment.

In customer discussions, questions keep coming up about the differences between running Spark workloads in Data Flow vs. BDS.  Today we will focus on a handful of the most important categories and discuss some of the features each service supports that complement them.

At the end of the day, running Spark workloads in either of these services is extremely cost effective, easy to manage and maintain, and fully integrated with native OCI, which makes Oracle Cloud a great option for customers looking to move large workloads from other clouds or on-premises environments, or starting new projects with data in Object Storage.
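
For example, a job that reads from and writes back to Object Storage can address it directly with an oci:// URI via the OCI HDFS connector that both services provide. In the sketch below, the bucket, namespace, paths, and column name are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("object-storage-job").getOrCreate()

    # oci://<bucket>@<namespace>/<path> addresses OCI Object Storage;
    # the bucket, namespace, and paths here are placeholders.
    src = "oci://my-bucket@my-namespace/raw/sales/"
    dst = "oci://my-bucket@my-namespace/curated/sales/"

    df = spark.read.option("header", "true").csv(src)
    # "amount" is an illustrative column name.
    df.filter(df.amount > 0).write.mode("overwrite").parquet(dst)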


Scalability / Performance

Data Flow:
  • Parameterized autoscaling (see the configuration sketch after this list)
  • Scales linearly up to the node cap per customer
  • Spin-up time for a job is under 5 minutes, and new parameters allow quick retries if the cluster is not big enough

Big Data Service (BDS):
  • Autoscales either vertically or horizontally based on CPU and RAM saturation (customizable policies)
  • Always on, so there is no risk of capacity issues for allocated nodes
  • Spin-up time of ~40 minutes to build new clusters
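
As a rough illustration of the parameterized-autoscaling point, Spark's dynamic allocation settings let a job grow and shrink its executor count between bounds. This is a generic Spark-level sketch with illustrative values; on Data Flow these properties are typically supplied as application configuration rather than hard-coded.

    from pyspark.sql import SparkSession

    # Illustrative executor bounds; with dynamic allocation enabled the
    # job scales between them based on demand. On a generic cluster,
    # additional shuffle-tracking settings may also be required.
    spark = (
        SparkSession.builder
        .appName("autoscaled-job")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )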

Security

Data Flow:
  • Takes advantage of OCI’s fine-grained data access controls
  • Single-user environment with easy user tracing (no need for Kerberos because of this)
  • OOTB integration with Oracle Identity Service and built-in AD integration

Big Data Service (BDS):
  • Takes advantage of OCI’s fine-grained data access controls
  • OOTB support for Kerberos and Apache Ranger for further data access controls
  • Multi-user environment; best practices must be followed to trace user interactions (e.g., don’t use the root user)
  • OOTB integration with Oracle Identity Service and built-in AD integration

Cost

Data Flow:
  • Pay for underlying IaaS only; no service fee
  • Only pay when jobs are running
  • Significantly cheaper than BDS in general

Big Data Service (BDS):
  • SKU associated with the product on top of the IaaS used
  • Pay for underlying IaaS 24/7 for all active nodes in the cluster
  • Autoscaling can help mitigate costs, as can a pause/resume feature coming in the next few months

 

Ease of Migration

Data Flow:
  • Once Spark code is refactored to a supported version, it can be run directly by the service
  • REST or Java API integration with an enterprise scheduler is required (see the sketch after this list)

Big Data Service (BDS):
  • Once Spark code is refactored to a supported version, it can be run directly by the service
  • Integration with an enterprise scheduler would have to be implemented; Oozie, REST, Java APIs, etc. are all options
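
As a rough sketch of the scheduler-integration point, an enterprise scheduler could trigger a Data Flow run through the OCI Python SDK along these lines. The OCIDs and run name are placeholders, and the exact model fields should be checked against the SDK reference.

    import oci

    # Load credentials from the standard ~/.oci/config file.
    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    # Placeholders: point these at a real Data Flow application and
    # compartment in your tenancy.
    details = oci.data_flow.models.CreateRunDetails(
        application_id="ocid1.dataflowapplication.oc1..example",
        compartment_id="ocid1.compartment.oc1..example",
        display_name="nightly-etl",
    )

    run = client.create_run(details).data
    print(run.id, run.lifecycle_state)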

Maintenance / Administration

Data Flow:
  • No installs, patching, or upgrades
  • Built-in monitoring and alarms
  • Customer is not responsible for the nodes in the clusters

Big Data Service (BDS):
  • No installs, patching, or upgrades
  • Customer is responsible for the health of the cluster and owns any custom changes
  • Better monitoring tools in the OCI console compared to traditional on-premises Hadoop or legacy Oracle cloud services
  • Same access to Ambari for cluster administration

 

Flexibility

Data Flow:
  • Only supports Spark workloads
  • Can import existing Python or Java/Scala libraries at run time (see the sketch after this list)
  • Supports running jobs at scale created in OCI Data Integration, a graphical no-code ETL tool

Big Data Service (BDS):
  • Supports a variety of Hadoop components out of the box, like Spark, Hive, HDFS, HBase, Kafka, Flink, and many more
  • Bootstrap scripts enable easier configuration and automation. Users can run a bootstrap script on all cluster nodes after a cluster is created, when the shape of a cluster changes, or when they add or remove nodes
  • Customer has full access to customize the cluster in any way, which allows for installation of 3rd-party tools, other Hadoop components, etc.
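
To illustrate the run-time library point for Data Flow, the sketch below uses two generic Spark mechanisms: Maven coordinates for a Java/Scala package and addPyFile for a Python module shipped with the job. The coordinates and file name are illustrative, and Data Flow also offers its own dependency-packaging tooling.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("runtime-deps")
        # Pull a Java/Scala library by Maven coordinates (illustrative).
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
        .getOrCreate()
    )

    # Ship a Python module to the driver and executors (hypothetical file).
    spark.sparkContext.addPyFile("helpers.py")
    import helpers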

 
