OCI Big Data Service High Availability Clusters

July 25, 2023 | 5 minute read
Jeffrey Thomas
Big Data Architect, A-Team


HA basic configuration
This is a standard configuration for a BDS highly available cluster. These smaller VM sizes would be a good fit for a development or QA test environment.



The two most challenging aspects of managing a Hadoop cluster have always been security and high availability (HA). HDFS itself has many built-in features to mitigate hardware failures and data loss, but configuring the cluster so that the processes themselves can fail over has never been easy. On top of that, Hadoop has no native solution for securing a distributed computing environment, forcing administrators to lean on technologies like Kerberos to do so. The OCI Big Data Service (BDS) provides an automated solution to both of these complex problems right out of the box, enabled by a simple check box at cluster creation time.

High Availability (HA) Node Configuration

Leveraging the built-in functionality of BDS, the minimum size of an HA cluster is 7 nodes. This configuration (as seen in the image above) consists of 2 name nodes that fail over to one another if anything should happen, 2 utility nodes that run other HA and security functions such as the Kerberos KDC, and 3 worker nodes (the minimum for any BDS cluster). VM shapes and sizes are chosen at cluster creation time but can be changed at any point in the lifetime of the cluster. Worker nodes can also be scaled both horizontally and vertically with autoscaling policies to provide more CPU/RAM, or more storage if the cluster is using local HDFS.
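To make the node-count rules above concrete, here is a minimal sketch that checks a proposed cluster layout against the HA minimum (2 name/master nodes, 2 utility nodes, 3 worker nodes). The role names and the function itself are illustrative helpers for this post, not part of any OCI API.

```python
# Minimum node counts for a BDS HA cluster, per the configuration above.
MIN_NODES = {"master": 2, "utility": 2, "worker": 3}

def validate_ha_layout(layout: dict) -> list:
    """Return a list of problems; an empty list means the layout meets the HA minimum."""
    problems = []
    for role, minimum in MIN_NODES.items():
        have = layout.get(role, 0)
        if have < minimum:
            problems.append(f"{role}: need at least {minimum}, got {have}")
    return problems

# A valid 7-node minimum HA cluster:
print(validate_ha_layout({"master": 2, "utility": 2, "worker": 3}))  # []
# Too few workers:
print(validate_ha_layout({"master": 2, "utility": 2, "worker": 2}))
```

A real layout would of course also carry shape and storage details per node; this sketch only captures the count constraints discussed here.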

Kerberos is a network authentication protocol that authenticates requests between trusted hosts across an untrusted network, like the internet or, in our case, a Hadoop cluster. It uses secret-key cryptography and a trusted third-party service to authenticate client-server applications as well as users.

Kerberos users and services depend directly on the Key Distribution Center (KDC), which provides two main functions: authentication and ticket-granting. KDC "tickets" authenticate all parties, allowing nodes to verify each other's identity securely.
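The two KDC roles described above can be illustrated with a toy simulation. Real Kerberos encrypts tickets with symmetric keys and exchanges session keys; this sketch merely signs ticket claims with HMAC over shared secrets to show the trust flow, and every principal name and key in it is made up for illustration.

```python
import hashlib
import hmac
import json

def seal(secret: bytes, payload: dict) -> tuple:
    """Stand-in for ticket encryption: serialize the claims and sign them."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return blob, hmac.new(secret, blob, hashlib.sha256).digest()

def unseal(secret: bytes, blob: bytes, tag: bytes) -> dict:
    """Verify a ticket's signature with the shared secret before trusting it."""
    if not hmac.compare_digest(tag, hmac.new(secret, blob, hashlib.sha256).digest()):
        raise ValueError("ticket failed verification")
    return json.loads(blob)

# Long-term secrets the KDC shares with each principal (never sent on the wire).
KDC_KEYS = {"alice@EXAMPLE.COM": b"alice-key", "hdfs/namenode": b"hdfs-key"}

# 1. Authentication service: the client proves its identity and receives a
#    ticket-granting ticket (TGT), readable only by the KDC's own TGS key.
tgt = seal(b"kdc-tgs-key", {"client": "alice@EXAMPLE.COM"})

# 2. Ticket-granting service: validates the TGT, then issues a service ticket
#    sealed with the target service's own key.
tgt_claims = unseal(b"kdc-tgs-key", *tgt)
svc_ticket = seal(KDC_KEYS["hdfs/namenode"],
                  {"client": tgt_claims["client"], "service": "hdfs/namenode"})

# 3. The HDFS service verifies the ticket with its own key -- it never has to
#    contact the KDC or take the client's word for its identity.
claims = unseal(KDC_KEYS["hdfs/namenode"], *svc_ticket)
print(claims["client"])  # alice@EXAMPLE.COM
```

The point of the sketch is step 3: because both sides trust the KDC, the service can verify the client purely from the ticket, which is what makes the KDC the linchpin of a Kerberized cluster.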

While the services provided by Kerberos are very valuable, administering a Kerberized Hadoop cluster can be one of the most challenging tasks in the Big Data admin space. The BDS HA configuration automatically lays down Kerberos on all of the nodes, allows easy user creation via Identity Management, and provides patching and upgrades for both the Kerberized nodes and the KDC running on the utility nodes, with no extra effort from the customer. This is a huge benefit of running the BDS HA configuration.

Putting this into Practice

Now that we are familiar with what the BDS HA cluster configuration brings to the table, how do we use it? First of all, in this age of cybersecurity and auditing, a Kerberized cluster is a minimum requirement for any production Hadoop cluster. The same really goes for having multiple name nodes in production workloads: nobody wants to see a cluster rebooted in the middle of a long-running job (some can last for days or weeks) because the name node had a log folder fill up. So a best practice we see very often is to use a configuration like this in production environments.

However, when it comes to QA, dev, and other environments, the conversation can be trickier. Customers I have talked to who are trying to cut down on costs have often pointed out that this configuration requires 4 nodes more than the minimum cluster configuration. That extra cost can add up quickly.

In my experience, matching QA to production is a no-brainer, so that is not as big of an issue. However, many customers run multiple small dev environments for different projects, code branches, and so on, and these environments can become costly.

To set up development environments, we have three choices:

  1. Forget about HA/Kerberos and build the minimum 3-node cluster. This option is the cheapest but brings some issues. Any application integration being developed to run in the Kerberized QA/Prod clusters will not work without Kerberos, so none of it will be testable until QA. The development environments will also be less secure and open to potential audit concerns, depending on the customer's business vertical.
  2. Build a minimum 3-node cluster and install and manage Kerberos yourself. This option keeps the smaller cluster size while still making Kerberos available for development and testing. The issue will be the time required for initial setup and, most importantly, the time required for patching and upgrading down the road. The QA and production environments will be patched by the BDS service in an automated way, and when those patch cycles happen, these dev environments will have to be kept in sync manually.
  3. Just use the BDS HA configuration for development. To me, and to all of the customers I have worked with on this so far, this has been the best option. Keeping the dev, QA, and prod environments identical from a basic configuration standpoint (the number of worker nodes will vary greatly depending on workloads) is very important. It is also the best way to streamline administration and take advantage of as much automation from the BDS service as possible.

Assuming we go with option 3, there are still ways to mitigate cost in these development environments. If workloads are small, the minimum size of the name nodes and utility nodes can be as low as the E4 Flex shape with 4 OCPUs and 32 GB of RAM. This is a very cost-efficient shape that can always be scaled up during times of heavier load. Another very important aspect of the BDS service that plays a role here is the ability to stop a cluster. During periods of little use, such as weekends and holidays, clusters can be stopped completely. Stopping a cluster is like pausing it: you won't be billed for compute resources, but you will be billed for storage. On the storage side, if the configuration uses Object Storage for data rather than local HDFS, the storage footprint on the cluster is extremely small. These clusters can be stopped and started in as little as 5 to 15 minutes, depending on the size of the cluster.
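A quick back-of-the-envelope calculation shows how much the stop/start lever matters. The per-OCPU hourly rate below is a hypothetical placeholder, not real OCI pricing, and the function models compute cost only (stopped clusters still pay for storage, as noted above).

```python
HOURS_PER_MONTH = 730  # common approximation for a month of hours

# Hypothetical placeholder rate -- check current OCI pricing for real numbers.
RATE_PER_OCPU_HOUR = 0.05

def monthly_compute_cost(ocpus_per_node: int, node_count: int,
                         running_fraction: float = 1.0) -> float:
    """Compute-only monthly cost; storage billing continues while stopped."""
    return ocpus_per_node * node_count * HOURS_PER_MONTH * running_fraction * RATE_PER_OCPU_HOUR

# A 7-node dev cluster on 4-OCPU E4 Flex shapes, running around the clock:
always_on = monthly_compute_cost(4, 7)

# The same cluster stopped on weekends, so running roughly 5/7 of the time:
weekdays_only = monthly_compute_cost(4, 7, running_fraction=5 / 7)

print(f"always on: ${always_on:.2f}/month, weekdays only: ${weekdays_only:.2f}/month")
```

Whatever the real rates are, stopping a dev cluster on weekends trims compute spend by roughly two sevenths, which compounds quickly across multiple dev environments.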

Closing Thoughts

Hadoop clusters bring many valuable technologies: processing data at large scale, real-time data processing and movement, fast queries over large data sets, and much more. The ability to secure and run these clusters reliably, without the overhead of solving those problems yourself, is a huge benefit that OCI's Big Data Service brings to our customers.

