GlusterFS – Automated Deployment for High Availability and Performance on OCI

January 16, 2020 | 9 minute read

GlusterFS is a well-known, scalable, high-performance shared filesystem. It is easy to install, either on premises or in the cloud, and guides for this can be found in many places, including here and here. However, after you have built your second or third Gluster environment, you start thinking about automation, especially in the cloud. Automation not only provides a quick way to create a new environment when you need it; it also delivers reproducible, and thus reliable, environments, which becomes very valuable as configurations grow more complex.

On OCI, the tool of choice for this kind of automation is Terraform. This blog entry provides example code and discusses several challenges that were solved by adding scripts to the Terraform code.

Target Architecture

GlusterFS provides high availability through replication of its underlying filesystems, or bricks. The most robust architecture is a three-way replica, which provides full resiliency for both read and write access should any one of the three replicas fail. This concept maps ideally to the multi-availability-domain regions of OCI, where a single region provides three independent availability domains (ADs). Although located in separate datacenters and thus fully independent, the ADs are still close enough to provide reasonably low network latency, allowing replication in an active-active-active configuration. In such a scenario, we put one GlusterFS server (with its attached block volumes) in each AD and then create a three-way replicated and distributed Gluster volume.

The OCI block volume service provides high-performance block storage to compute instances via iSCSI. The default configuration uses the primary network interface of each compute instance for this iSCSI traffic. To provide maximum bandwidth to the Gluster clients, the second NIC of the various “bare metal” shapes available in OCI can be used for the Gluster traffic. Separated in this way, we can make full use of the bandwidth of two physical NICs on the Gluster servers. This provides the full 25 Gbit/s of one NIC to the Gluster clients, while allowing a maximum of 32 block volumes of 32 TB each on the storage backend. This concept is described in more detail here.

Combining both concepts leads to an architecture as shown in the following diagram:

Gluster Architecture

This architecture can scale in two ways:

  • To scale capacity, we can add additional block volumes to each of the Gluster servers. This should always be done symmetrically, adding the same capacity and number of bricks to all servers (see the example after this list).
  • To scale performance, we can add additional servers. Again, this should be done symmetrically, so we should increase the server count by a multiple of three.
  • Of course, by adding servers and storage at the same time, both capacity and performance can be increased.
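For illustration only: on the Gluster side, growing capacity comes down to adding one new (already formatted and mounted) brick per server to the existing volume. With hypothetical server names gluster1 to gluster3 and a new brick directory data3 on each, the command would look roughly like this:

sudo gluster volume add-brick data replica 3 \
     gluster1:/bricks/data3/data \
     gluster2:/bricks/data3/data \
     gluster3:/bricks/data3/data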

Performance Estimates

Given enough Gluster clients, the limiting factor for this architecture is network bandwidth. In the above configuration, each Gluster server is equipped with two NICs of 25 Gbit/s each, one for block volume access and one for Gluster traffic. This gives approximately 3 GByte/s of throughput from each server. When all clients read from their filesystems, this aggregates across servers, as each client reads from the nearest replica. So we can estimate up to 9 GByte/s of throughput for a read workload. When writing, we need to remember that this is a three-way replicated setup, so all data is written to all three servers. Hence we have to divide the total network bandwidth by the replication level, giving an estimated 3 GByte/s of write throughput. With 6 servers instead of 3, both of these numbers could potentially double. I will go into the details of the achievable performance in a second post.
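As a quick back-of-the-envelope calculation:

25 Gbit/s per NIC                      ≈ 3 GByte/s per server
read  (3 servers, aggregated):   3 x 3 GByte/s ≈ 9 GByte/s
write (3-way replication):       9 GByte/s / 3 ≈ 3 GByte/s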

Building the Automation

To build this configuration, we need several things:

  • Two regional subnets:
    • One for the primary interface, which will also connect to block storage. This will be the “primary” subnet for the Gluster servers.
      (The automation assumes you already have this subnet in place.)
    • One for the storage traffic. This will be used for Gluster traffic only.
  • To ease the configuration of security policies, it is best to configure a Network Security Group (NSG) with the security rules (ports) needed for Gluster traffic; see the sketch after this list. The second vNIC of each Gluster server and the storage vNIC of the Gluster clients will then be added to this NSG.
  • A bastion host on a public subnet. This will be needed to access the Gluster servers to configure the storage services. (This assumes that the whole Gluster environment will be deployed in private subnets and the automation will run outside of OCI.)
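A minimal sketch of such a Network Security Group in Terraform might look like the following. All names and variables are placeholders, and the port ranges are the common GlusterFS defaults (TCP 24007-24008 for the Gluster daemons, 49152 and up for the brick processes, matching the brick ports visible in the volume status output at the end of this post); adjust them to your Gluster version and brick count.

resource "oci_core_network_security_group" "gluster" {
  compartment_id = var.compartment_ocid
  vcn_id         = var.vcn_ocid
  display_name   = "gluster-nsg"
}

# allow the gluster management daemons to talk to each other within the NSG
resource "oci_core_network_security_group_security_rule" "glusterd" {
  network_security_group_id = oci_core_network_security_group.gluster.id
  direction                 = "INGRESS"
  protocol                  = "6" # TCP
  source_type               = "NETWORK_SECURITY_GROUP"
  source                    = oci_core_network_security_group.gluster.id
  tcp_options {
    destination_port_range {
      min = 24007
      max = 24008
    }
  }
}

# allow brick traffic (one port per brick, starting at 49152)
resource "oci_core_network_security_group_security_rule" "bricks" {
  network_security_group_id = oci_core_network_security_group.gluster.id
  direction                 = "INGRESS"
  protocol                  = "6" # TCP
  source_type               = "NETWORK_SECURITY_GROUP"
  source                    = oci_core_network_security_group.gluster.id
  tcp_options {
    destination_port_range {
      min = 49152
      max = 49251
    }
  }
}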

Most of this is rather straightforward in Terraform. You create your subnets and servers. You attach a second vNIC to each server. Then you create a set of block volumes for each server and attach those as well. However, there are certain things that would be nice to automate but are beyond the capabilities of the standard Terraform provider, such as:

  • Install Gluster packages.
  • Partition the block volumes, create filesystems and mount them.
  • Create a Gluster server pool.
  • Create a Gluster volume and start it.

In the rest of this blog, we will go over sample code that achieves the above.
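For orientation, the standard pieces mentioned above, i.e. the second vNIC and the per-server block volumes, might look roughly like this sketch. Resource names and variables are placeholders, and the real code carries more attributes.

# second physical NIC of the bare metal shape, placed on the storage subnet
resource "oci_core_vnic_attachment" "storage" {
  instance_id = oci_core_instance.gluster.id
  nic_index   = 1
  create_vnic_details {
    subnet_id        = var.storage_subnet_id
    assign_public_ip = false
    nsg_ids          = [oci_core_network_security_group.gluster.id]
  }
}

# one block volume per brick, attached via iSCSI
resource "oci_core_volume" "brick" {
  count               = var.brick_count
  availability_domain = var.availability_domain
  compartment_id      = var.compartment_ocid
  display_name        = "${var.name}${count.index}"
  size_in_gbs         = var.brick_size
}

resource "oci_core_volume_attachment" "brick" {
  count           = var.brick_count
  attachment_type = "iscsi"
  instance_id     = oci_core_instance.gluster.id
  volume_id       = oci_core_volume.brick[count.index].id
  # the iqn, ipv4 and port attributes exported by this resource are
  # used by the remote-exec provisioner shown further below
}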

Beyond the general Terraform (which I will not go into in detail), three additional steps are needed. First, the server instances need to be configured. This can be achieved with a normal first boot script, run through cloud-init at the end of instance deployment.
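One way to wire this up, sketched here with placeholder names, is to pass the script as user_data in the instance metadata; cloud-init then runs it on first boot:

resource "oci_core_instance" "gluster" {
  # availability_domain, shape, source_details, create_vnic_details etc. omitted
  metadata = {
    ssh_authorized_keys = file(var.ssh_public_key)
    user_data           = base64encode(file("${path.module}/firstboot.sh"))
  }
}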

Note: For the following script to work, the instance needs access to Oracle's yum repositories. This means the (private) subnet should have a NAT gateway and a corresponding route rule configured.
#!/bin/bash

logfile=/var/tmp/firstboot.log
exec > $logfile 2>&1

# make mountpoint for bricks
mkdir -p /bricks

# install gluster packages
yum install -y oracle-gluster-release-el7
yum-config-manager --enable ol7_gluster41 ol7_addons ol7_latest ol7_optional_latest ol7_UEKR5
yum install -y glusterfs-server

# enable and start gluster services
systemctl enable glusterd >> $logfile
systemctl start glusterd >> $logfile

# put this in the background so the main script can terminate and continue with the deployment
( while ! systemctl restart glusterd
do
   # give the infrastructure another 10 seconds to come up
   echo waiting for gluster to be ready >> $logfile
   sleep 10
done ) &

# open firewall for gluster
echo Adding GlusterFS to hostbased firewall >> $logfile
/bin/firewall-offline-cmd --add-service=glusterfs
systemctl restart firewalld

# Configure second vNIC
scriptsource="https://raw.githubusercontent.com/oracle/terraform-examples/master/examples/oci/connect_vcns_using_multiple_vnics/scripts/secondary_vnic_all_configure.sh"
vnicscript=/root/secondary_vnic_all_configure.sh
curl -s $scriptsource > $vnicscript
chmod +x $vnicscript
cat > /etc/systemd/system/secondnic.service << EOF
[Unit]
Description=Script to configure a secondary vNIC

[Service]
Type=oneshot
ExecStart=$vnicscript -c
ExecStop=$vnicscript -d
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

EOF

systemctl enable secondnic.service  >> $logfile
systemctl start secondnic.service   >> $logfile

# put this in the background so the main script can terminate and continue with the deployment
( while ! systemctl restart secondnic.service
do
   # give the infrastructure another 10 seconds to provide the metadata for the second vnic
   echo waiting for second NIC to come online >> $logfile
   sleep 10
done ) &

echo "This is a GlusterFS Server" >> /etc/motd
echo "firstboot done" >> $logfile

This script does the following:

  • Redirect stdout and stderr to a log file.
  • Create the main mountpoint for the brick filesystems.
  • Install GlusterFS packages and enable and start the server service.
    • During testing, this sometimes didn’t work on the first attempt, so the script spawns a little loop to repeat this until successful.
  • Configure the host based firewall for Gluster.
  • Configure the second vNIC.
    • For this, the sample configuration script is downloaded from the Terraform examples on GitHub.
    • A service “secondnic.service” is created, enabled and started.
      Again, this is repeated in a loop until successful.
  • A success message is posted to /etc/motd and to the log file.

The installation of packages and the download of the vNIC script require that the VCN and primary subnet are configured with a route to the public internet through a NAT gateway. (The automation assumes the Gluster servers will be deployed on a private subnet and does not attempt to configure a public IP address.)
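If that gateway is not already in place, a minimal sketch (again with placeholder names) would be:

resource "oci_core_nat_gateway" "nat" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.gluster.id
  display_name   = "gluster-nat"
}

resource "oci_core_route_table" "private" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.gluster.id
  route_rules {
    destination       = "0.0.0.0/0"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_nat_gateway.nat.id
  }
  # attach this route table to the private subnet(s)
}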

At this point, the servers themselves are ready. However, they still need the brick volumes to be attached before we can create a Gluster volume.

Attaching the bricks is done with Terraform. For each block volume, the Terraform remote-exec provisioner is used to run the iSCSI attach commands on the respective server, partition the disk, create a filesystem, and mount it. For this to work from outside the VCN, we need to tunnel the SSH connection through the bastion. This connection is described in the relevant Terraform code in the volume module:

connection {
    type                = "ssh"
    bastion_host        = data.oci_core_instance.bastion.public_ip
    bastion_user        = var.bastion_key["user"]
    bastion_private_key = file(var.bastion_key["private"])
    host                = data.oci_core_instance.host.private_ip
    user                = var.bastion_key["user"]
    private_key         = file(var.bastion_key["private"])
    agent               = false
  }

(This code would also allow using two separate users: one for accessing the bastion host and one for actually accessing the Gluster servers. This can be useful if the privileged opc user on the bastion is configured with MFA for better security. In that case, a different user with fewer privileges would be used just for the SSH tunnel.)
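Such a variant could look like this, assuming hypothetical variable maps var.tunnel_key for the bastion user and var.host_key for the user on the Gluster servers:

connection {
    type                = "ssh"
    bastion_host        = data.oci_core_instance.bastion.public_ip
    bastion_user        = var.tunnel_key["user"]           # unprivileged user, SSH tunnel only
    bastion_private_key = file(var.tunnel_key["private"])
    host                = data.oci_core_instance.host.private_ip
    user                = var.host_key["user"]             # e.g. opc on the Gluster server
    private_key         = file(var.host_key["private"])
    agent               = false
  }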

The actual code to deal with the newly attached block device is here:

  provisioner "remote-exec" {
    inline = [
      "sudo iscsiadm -m node -o new -T ${self.iqn} -p ${self.ipv4}:${self.port}",
      "sudo iscsiadm -m node -o update -T ${self.iqn} -n node.startup -v automatic",
      "sudo iscsiadm -m node -T ${self.iqn} -p ${self.ipv4}:${self.port} -l",
      "sleep 1",
      "export DEVICE_ID='/dev/disk/by-path/'$(ls /dev/disk/by-path|grep ${self.iqn}|grep ${self.ipv4}|grep -v part)",
      "set -x",
      "export HAS_PARTITION=$(sudo partprobe -d -s $${DEVICE_ID} | wc -l)",
      "if [ $HAS_PARTITION -eq 0 ] ; then",
      "  sudo parted -s  $${DEVICE_ID} mklabel gpt ",
      "  sudo parted -s -a optimal $${DEVICE_ID} mkpart primary xfs 0% 100%",
      "  sleep 3",
      "  sudo mkfs.xfs -i size=512 -L ${var.name}${count.index} $${DEVICE_ID}-part1",
      "fi",
      "sudo mkdir -p /bricks/${var.name}${count.index}",
      "sudo sh -c 'printf \"LABEL=\"%s\"\t/bricks/%s\txfs\tdefaults,noatime,_netdev\t0 3\n\" ${var.name}${count.index} ${var.name}${count.index} >> /etc/fstab' ",
      "sudo mount /bricks/${var.name}${count.index}",
      "sudo mkdir -p /bricks/${var.name}${count.index}/data",
    ]
  }

Thanks to Stephen Cross for the general idea for this!
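For illustration, for a volume named data and the first brick (count.index 0), the provisioner appends an /etc/fstab entry like the following; the label matches the one set by mkfs.xfs, so the subsequent mount command can resolve it:

LABEL=data0	/bricks/data0	xfs	defaults,noatime,_netdev	0 3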

Finally, once all Gluster servers have been provisioned and all volumes mounted, the Gluster server pool and the volume have to be created. This can only be done once all three servers are complete and all volumes are attached and mounted. This is a dependency Terraform does not recognize; it has to be called out explicitly. Since the volumes already depend on the servers, all we need to wait for is the last volume. Volumes are grouped per server using the “volumes” module, so the dependency is on the three modules for the bricks:

depends_on = [module.server1, module.server2, module.server3]

The Terraform resource for generic script operations is the “null_resource”. It uses a file provisioner to copy the mkvolume.sh script to the first Gluster server and a remote-exec provisioner to execute it there.

  provisioner "file" {
    source      = "mkvolume.sh"
    destination = "/tmp/mkvolume.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "echo ${data.oci_core_vnic.gserv1.private_ip_address} >  /tmp/hostlist ",
      "echo ${data.oci_core_vnic.gserv2.private_ip_address}  >> /tmp/hostlist ",
      "echo ${data.oci_core_vnic.gserv3.private_ip_address}  >> /tmp/hostlist ",
      "chmod +x /tmp/mkvolume.sh",
      "/tmp/mkvolume.sh",
    ]
  }
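For completeness, the enclosing resource is roughly shaped like this. This is a sketch with placeholder names; the connection block follows the same bastion pattern shown earlier, and the two provisioners above go inside it.

resource "null_resource" "create_gluster_volume" {
  # wait until all three servers and all of their bricks are done
  depends_on = [module.server1, module.server2, module.server3]

  connection {
    # ssh to the first Gluster server, tunneled through the bastion
    type                = "ssh"
    bastion_host        = data.oci_core_instance.bastion.public_ip
    bastion_user        = var.bastion_key["user"]
    bastion_private_key = file(var.bastion_key["private"])
    host                = data.oci_core_instance.gserver1.private_ip
    user                = var.bastion_key["user"]
    private_key         = file(var.bastion_key["private"])
    agent               = false
  }

  # the "file" and "remote-exec" provisioners shown above go here
}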

The mkvolume.sh script itself is here:

#!/bin/bash
v=$(
for b in `ls /bricks` 
do
   for h in `cat /tmp/hostlist`
   do
      echo -n "${h}:/bricks/${b}/data "
   done
done
)

for h in `cat /tmp/hostlist`
do
   sudo /usr/sbin/gluster peer probe $h
done

sudo /usr/sbin/gluster volume create data replica 3 $v
sudo /usr/sbin/gluster volume start data
sudo /usr/sbin/gluster volume status

It builds the volume definition from the list of Gluster server addresses in /tmp/hostlist and the brick directories it finds under /bricks. It then adds all three servers to the Gluster server pool and finally creates and starts the volume.
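With the three servers and three bricks per server from the example run below, the generated command would look roughly like this (line breaks added for readability); each group of three consecutive bricks forms one replica set, resulting in a 3 x 3 distributed-replicated volume:

sudo /usr/sbin/gluster volume create data replica 3 \
    10.45.2.2:/bricks/data0/data 10.45.2.3:/bricks/data0/data 10.45.2.4:/bricks/data0/data \
    10.45.2.2:/bricks/data1/data 10.45.2.3:/bricks/data1/data 10.45.2.4:/bricks/data1/data \
    10.45.2.2:/bricks/data2/data 10.45.2.3:/bricks/data2/data 10.45.2.4:/bricks/data2/data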

If all goes well, this Terraform example should build your Gluster server environment in less than 10 minutes, mostly depending on how long it takes to deploy the instances and volumes. A successful run will end with the following lines:

null_resource.gluster (remote-exec): peer probe: success. Probe on localhost not needed
null_resource.gluster (remote-exec): peer probe: success.
null_resource.gluster (remote-exec): peer probe: success.
null_resource.gluster (remote-exec): volume create: data: success: please start the volume to access data
null_resource.gluster (remote-exec): volume start: data: success
null_resource.gluster (remote-exec): Status of volume: data
null_resource.gluster (remote-exec): Gluster process                             TCP Port  RDMA Port  Online  Pid
null_resource.gluster (remote-exec): ------------------------------------------------------------------------------
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data0/data     49152     0          Y       13306
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data0/data     49152     0          Y       13183
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data0/data     49152     0          Y       13158
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data1/data     49153     0          Y       13328
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data1/data     49153     0          Y       13205
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data1/data     49153     0          Y       13180
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data2/data     49154     0          Y       13350
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data2/data     49154     0          Y       13227
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data2/data     49154     0          Y       13202
null_resource.gluster (remote-exec): Self-heal Daemon on localhost          N/A       N/A        Y       13373
null_resource.gluster (remote-exec): Self-heal Daemon on 10.45.2.3          N/A       N/A        Y       13225
null_resource.gluster (remote-exec): Self-heal Daemon on 10.45.2.4          N/A       N/A        Y       13250
null_resource.gluster (remote-exec): Task Status of Volume data
null_resource.gluster (remote-exec): ------------------------------------------------------------------------------
null_resource.gluster (remote-exec): There are no active volume tasks
null_resource.gluster: Creation complete after 5s [id=9189342820819064035]
Apply complete! Resources: 29 added, 0 changed, 0 destroyed.

The above should get you started provisioning your own Gluster environment. While the code should be easily portable to your environment, it is still only an example implementation, so I strongly suggest you take a close look at what it does before running "terraform apply".

I will publish the full Terraform code once it has passed the necessary technical and legal review. Please check back here from time to time.

Stefan Hinker

After around 20 years working on SPARC and Solaris, I am now a member of A-Team, focusing on infrastructure on Oracle Cloud.

