GlusterFS is a well-known, scalable, high-performance shared filesystem. It is easy to install, either on premises or in the cloud, and guides for this can be found in many places, including here and here. However, after you have built your second or third Gluster environment, you start thinking about automation, especially when you are in the cloud. Automation not only provides a quick way of creating a new environment when you need it; it also delivers reproducible and thus reliable environments, something that becomes very valuable as configurations grow more complex.
On OCI, the tool of choice for this kind of automation is Terraform. This blog entry provides example code and discusses several challenges that had to be overcome by adding scripts to the Terraform code.
GlusterFS provides high availability through replication of its underlying filesystems, or bricks. The most robust architecture is a three-way replica, which provides full resiliency for both read and write access should any one of the three replicas fail. This concept maps ideally to OCI's multi-availability-domain regions, where a single region provides three independent availability domains. Although located in separate datacenters and thus fully independent, they are still close enough to offer reasonably low network latency between them, allowing replication in an active-active-active configuration. In such a scenario, we put one GlusterFS server (with its attached block volumes) in each AD and then create a three-way replicated and distributed Gluster volume.
The OCI block volume service provides high-performance block storage to compute instances via iSCSI. The default configuration uses the primary network interface of each compute instance for this iSCSI traffic. To provide maximum bandwidth to the Gluster clients, the second NIC of the various "bare metal" shapes available in OCI can be used for the Gluster traffic. Separated in this way, we can make full use of the bandwidth of two physical NICs on the Gluster servers. This provides the full 25 GBit/s of one NIC to the Gluster clients, while allowing a maximum of 32 block volumes of 32 TB each on the storage backend. This concept is described in more detail here.
Combining both concepts leads to an architecture as shown in the following diagram:
This architecture can scale in two ways: adding more (or larger) block volumes to each server increases capacity, while adding more servers (in multiples of the replica count) increases both capacity and aggregate bandwidth.
Given enough Gluster clients, the limiting factor for this architecture is network bandwidth. In the above configuration, each Gluster server is equipped with two NICs of 25 GBit/sec each, one for block volume access and one for Gluster traffic. This gives approximately 3 GByte/sec of throughput from each server. When all clients read from their filesystems, this aggregates across servers, as clients will read from the nearest replica. So we can expect up to 9 GByte/sec of throughput for a reading workload. When writing, we need to remember that this is a three-way replicated setup, so all data is written to all three servers. Hence we have to divide the total network bandwidth by the replication level, giving an estimated 3 GByte/sec of write throughput. With 6 servers instead of 3, both of these numbers could potentially double. I will go into the details of the achievable performance in a second post.
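The estimate above can be written down as simple arithmetic. This is only a sketch; the "divide by 8" step is my rounding of the 25 GBit/s line rate down to the ~3 GByte/s usable figure from the text:

```shell
# back-of-the-envelope throughput estimate using the figures from the text
nic_gbit=25                                    # Gluster-facing NIC line rate, GBit/s
servers=3                                      # one server per availability domain
replica=3                                      # three-way replication
per_server=$((nic_gbit / 8))                   # ~3 GByte/s usable per server (rounded)
read_est=$((servers * per_server))             # reads are spread across all servers
write_est=$((servers * per_server / replica))  # every write lands on all replicas
echo "read ~${read_est} GByte/s, write ~${write_est} GByte/s"
# prints: read ~9 GByte/s, write ~3 GByte/s
```

Doubling the server count doubles both estimates, which is the scaling behavior described above.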
To build this configuration, we need several things: one Gluster server instance per availability domain, a second vNIC on each server for the Gluster traffic, a set of block volumes per server for the bricks, and the surrounding network setup (VCN, subnets, and a bastion host for administrative access).
Most of this is rather straightforward in Terraform. You create your subnets and servers. You attach a second vNIC to each server. Then you create a set of block volumes for each server and attach those as well. However, there are certain things that would be nice to automate that are beyond the capabilities of the standard Terraform provider. Those are things like configuring the second vNIC inside the operating system, attaching, partitioning, formatting and mounting the block volumes via iSCSI, and finally creating the Gluster server pool and volume.
In the rest of this blog, we will go over sample code that achieves the above.
Beyond the general Terraform (which I will not go into), there are three steps needed. First, the server instances need to be configured. This can be achieved using a normal first boot script, spawned through cloud-init at the end of instance deployment.
```bash
#!/bin/bash
logfile=/var/tmp/firstboot.log
exec > $logfile 2>&1

# make mountpoint for bricks
mkdir -p /bricks

# install gluster packages
yum install -y oracle-gluster-release-el7
yum-config-manager --enable ol7_gluster41 ol7_addons ol7_latest ol7_optional_latest ol7_UEKR5
yum install -y glusterfs-server

# enable and start gluster services
systemctl enable glusterd >> $logfile
systemctl start glusterd >> $logfile
# put this in the background so the main script can terminate and continue with the deployment
( while ! systemctl restart glusterd
do
   # give the infrastructure another 10 seconds to come up
   echo waiting for gluster to be ready >> $logfile
   sleep 10
done ) &

# open firewall for gluster
echo Adding GlusterFS to hostbased firewall >> $logfile
/bin/firewall-offline-cmd --add-service=glusterfs
systemctl restart firewalld

# Configure second vNIC
scriptsource="https://raw.githubusercontent.com/oracle/terraform-examples/master/examples/oci/connect_vcns_using_multiple_vnics/scripts/secondary_vnic_all_configure.sh"
vnicscript=/root/secondary_vnic_all_configure.sh
curl -s $scriptsource > $vnicscript
chmod +x $vnicscript
cat > /etc/systemd/system/secondnic.service << EOF
[Unit]
Description=Script to configure a secondary vNIC

[Service]
Type=oneshot
ExecStart=$vnicscript -c
ExecStop=$vnicscript -d
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl enable secondnic.service >> $logfile
systemctl start secondnic.service >> $logfile
# put this in the background so the main script can terminate and continue with the deployment
( while ! systemctl restart secondnic.service
do
   # give the infrastructure another 10 seconds to provide the metadata for the second vnic
   echo waiting for second NIC to come online >> $logfile
   sleep 10
done ) &

echo "This is a GlusterFS Server" >> /etc/motd
echo "firstboot done" >> $logfile
```
This script does the following: it creates the mountpoint for the bricks, installs the GlusterFS packages from the Oracle Linux repositories, enables and starts the glusterd service, opens the host-based firewall for Gluster traffic, and installs a small systemd service that runs Oracle's secondary-vNIC configuration script at every boot. The two background loops keep retrying the service starts, giving the infrastructure time to become available during first boot.
Installing the packages and downloading the script require that the VCN and primary subnet are configured with a route to the public internet through a NAT gateway. (The automation assumes the Gluster servers will be deployed on a private subnet and does not attempt to configure a public IP address.)
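For reference, a first-boot script like this is typically handed to the instance through cloud-init as base64-encoded user_data in the instance metadata. A minimal sketch; the resource and variable names here are assumptions, not the actual module code:

```hcl
resource "oci_core_instance" "gluster_server" {
  # ... availability_domain, shape, subnet_id etc. omitted ...

  metadata = {
    ssh_authorized_keys = file(var.ssh_public_key)   # assumed variable name
    user_data           = base64encode(file("${path.module}/firstboot.sh"))
  }
}
```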
At this point, the servers themselves are ready. However, they still need the brick volumes to be attached before we can create a Gluster volume.
Attaching the bricks is done with Terraform. For each block volume, the Terraform remote-exec provisioner is used to run the iSCSI attach commands on the respective server, partition the disk, create a filesystem, and mount it. For this to work from outside the VCN, we need to tunnel an ssh connection through the bastion. This connection is described in the relevant Terraform code in the volume module:
```hcl
connection {
  type                = "ssh"
  bastion_host        = data.oci_core_instance.bastion.public_ip
  bastion_user        = var.bastion_key["user"]
  bastion_private_key = file(var.bastion_key["private"])
  host                = data.oci_core_instance.host.private_ip
  user                = var.bastion_key["user"]
  private_key         = file(var.bastion_key["private"])
  agent               = false
}
```
(This code would also allow using two separate users: one for accessing the bastion host, and one for actually accessing the Gluster servers. This can be useful if the privileged opc user on the bastion is configured with MFA for better security. In that case, a different user with fewer privileges would be used just for the ssh tunnel.)
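For interactive logins along the same two-hop path, the equivalent can be expressed in ~/.ssh/config with OpenSSH's ProxyJump option. A sketch, where the host alias, the address, and the low-privilege tunneluser account are assumptions:

```
# hypothetical entries; tunneluser exists only for the hop through the bastion
Host gluster1
    HostName 10.45.2.2
    User opc
    ProxyJump tunneluser@<bastion-public-ip>
```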
The actual code to deal with the newly attached block device is here:
```hcl
provisioner "remote-exec" {
  inline = [
    "sudo iscsiadm -m node -o new -T ${self.iqn} -p ${self.ipv4}:${self.port}",
    "sudo iscsiadm -m node -o update -T ${self.iqn} -n node.startup -v automatic",
    "sudo iscsiadm -m node -T ${self.iqn} -p ${self.ipv4}:${self.port} -l",
    "sleep 1",
    "export DEVICE_ID='/dev/disk/by-path/'$(ls /dev/disk/by-path|grep ${self.iqn}|grep ${self.ipv4}|grep -v part)",
    "set -x",
    "export HAS_PARTITION=$(sudo partprobe -d -s $${DEVICE_ID} | wc -l)",
    "if [ $HAS_PARTITION -eq 0 ] ; then",
    "  sudo parted -s $${DEVICE_ID} mklabel gpt",
    "  sudo parted -s -a optimal $${DEVICE_ID} mkpart primary xfs 0% 100%",
    "  sleep 3",
    "  sudo mkfs.xfs -i size=512 -L ${var.name}${count.index} $${DEVICE_ID}-part1",
    "fi",
    "sudo mkdir -p /bricks/${var.name}${count.index}",
    "sudo sh -c 'printf \"LABEL=\"%s\"\t/bricks/%s\txfs\tdefaults,noatime,_netdev\t0 3\n\" ${var.name}${count.index} ${var.name}${count.index} >> /etc/fstab' ",
    "sudo mount /bricks/${var.name}${count.index}",
    "sudo mkdir -p /bricks/${var.name}${count.index}/data",
  ]
}
```
Thanks to Stephen Cross for the general idea for this!
Finally, once all Gluster servers have been provisioned and all volumes mounted, the Gluster server pool and the volume have to be created. This can only be done once all three servers are complete and all volumes are attached and mounted. This is a dependency Terraform does not recognize on its own, so it has to be called out explicitly. Since the volumes already depend on the servers, all we need to wait for is the last volume. Volumes are grouped by server using the "volumes" module, so the dependency is on the three modules for the bricks:
depends_on = [module.server1, module.server2, module.server3]
The Terraform resource for generic script operations is the “null_resource”. It again uses a remote-exec provisioner to execute a script on the first Gluster server.
```hcl
provisioner "file" {
  source      = "mkvolume.sh"
  destination = "/tmp/mkvolume.sh"
}

provisioner "remote-exec" {
  inline = [
    "echo ${data.oci_core_vnic.gserv1.private_ip_address} > /tmp/hostlist",
    "echo ${data.oci_core_vnic.gserv2.private_ip_address} >> /tmp/hostlist",
    "echo ${data.oci_core_vnic.gserv3.private_ip_address} >> /tmp/hostlist",
    "chmod +x /tmp/mkvolume.sh",
    "/tmp/mkvolume.sh",
  ]
}
```
The script itself is here:
```bash
#!/bin/bash
# build the brick list: every brick directory on every host
v=$(
  for b in `ls /bricks`
  do
    for h in `cat /tmp/hostlist`
    do
      echo -n "${h}:/bricks/${b}/data "
    done
  done
)
# add all servers to the gluster server pool
for h in `cat /tmp/hostlist`
do
  sudo /usr/sbin/gluster peer probe $h
done
# create and start the replicated, distributed volume
sudo /usr/sbin/gluster volume create data replica 3 $v
sudo /usr/sbin/gluster volume start data
sudo /usr/sbin/gluster volume status
```
It builds a volume definition from the list of Gluster hostnames and the number of data bricks it finds. It then adds all three servers to the Gluster server pool and finally creates and starts the volume.
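Note that `gluster volume create ... replica 3` treats every group of three consecutive bricks as one replica set, so the nesting of the two loops matters: the host loop must be the inner one. A dry run of the same loop structure, with hypothetical addresses and two bricks per server, illustrates this:

```shell
# same nesting as mkvolume.sh, with hypothetical hosts and brick names
hosts="10.0.0.2 10.0.0.3 10.0.0.4"
bricks="data0 data1"
v=""
for b in $bricks; do
  for h in $hosts; do
    v="${v}${h}:/bricks/${b}/data "
  done
done
# each consecutive triple holds the same brick on all three servers,
# i.e. every replica set spans all three availability domains
echo "$v"
```

If the loops were nested the other way around, each replica set would consist of three bricks on the same server, and losing that server would take the corresponding data offline.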
If all goes well, this Terraform example should build your Gluster server environment in less than 10 minutes, mostly depending on how long it takes to deploy the instances and volumes. A successful run will end with the following lines:
```
null_resource.gluster (remote-exec): peer probe: success. Probe on localhost not needed
null_resource.gluster (remote-exec): peer probe: success.
null_resource.gluster (remote-exec): peer probe: success.
null_resource.gluster (remote-exec): volume create: data: success: please start the volume to access data
null_resource.gluster (remote-exec): volume start: data: success
null_resource.gluster (remote-exec): Status of volume: data
null_resource.gluster (remote-exec): Gluster process                            TCP Port  RDMA Port  Online  Pid
null_resource.gluster (remote-exec): ------------------------------------------------------------------------------
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data0/data         49152     0          Y       13306
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data0/data         49152     0          Y       13183
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data0/data         49152     0          Y       13158
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data1/data         49153     0          Y       13328
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data1/data         49153     0          Y       13205
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data1/data         49153     0          Y       13180
null_resource.gluster (remote-exec): Brick 10.45.2.2:/bricks/data2/data         49154     0          Y       13350
null_resource.gluster (remote-exec): Brick 10.45.2.3:/bricks/data2/data         49154     0          Y       13227
null_resource.gluster (remote-exec): Brick 10.45.2.4:/bricks/data2/data         49154     0          Y       13202
null_resource.gluster (remote-exec): Self-heal Daemon on localhost              N/A       N/A        Y       13373
null_resource.gluster (remote-exec): Self-heal Daemon on 10.45.2.3              N/A       N/A        Y       13225
null_resource.gluster (remote-exec): Self-heal Daemon on 10.45.2.4              N/A       N/A        Y       13250
null_resource.gluster (remote-exec):
null_resource.gluster (remote-exec): Task Status of Volume data
null_resource.gluster (remote-exec): ------------------------------------------------------------------------------
null_resource.gluster (remote-exec): There are no active volume tasks
null_resource.gluster: Creation complete after 5s [id=9189342820819064035]
```
Apply complete! Resources: 29 added, 0 changed, 0 destroyed.
The above should get you started provisioning your own Gluster environment. While the code itself should be easily portable to your own environment, it is still only an example implementation, so I strongly suggest you take a close look at what it does before running "terraform apply".
I will publish the full Terraform code once it has passed the necessary technical and legal review. Please check back here from time to time.
After around 20 years working on SPARC and Solaris, I am now a member of A-Team, focusing on infrastructure on Oracle Cloud.