OCI Block Volume Performance with GlusterFS

January 16, 2020 | 5 minute read

In my previous post, I described how to set up a highly available GlusterFS server environment using Terraform. I briefly mentioned the provisions I had taken to ensure high performance as well, but didn't go into the details of performance validation. This is what I'll do here.

To recap, the expected maximum performance from the three Gluster servers was:

  • 3 GB/sec write throughput
  • 9 GB/sec read throughput
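
Both figures follow from a simple bandwidth model. The sketch below is only an illustration: it assumes roughly 3 GB/sec of usable client-facing bandwidth per server (a value inferred from the 9 GB/sec total above, not a measurement) and a 3-way replicated volume, where every byte written has to reach all three replicas.

```python
# Simple bandwidth model for a replicated GlusterFS volume.
SERVERS = 3            # Gluster servers in this setup
REPLICAS = 3           # 3-way replication
PER_SERVER_GB_S = 3.0  # ASSUMPTION: usable client-facing bandwidth per server


def expected_read_gb_s(servers: int, per_server_gb_s: float) -> float:
    """Reads are served by a single replica, so server bandwidth adds up."""
    return servers * per_server_gb_s


def expected_write_gb_s(servers: int, per_server_gb_s: float, replicas: int) -> float:
    """Each byte written goes to every replica, which divides the usable rate."""
    return servers * per_server_gb_s / replicas


if __name__ == "__main__":
    print("read :", expected_read_gb_s(SERVERS, PER_SERVER_GB_S), "GB/sec")
    print("write:", expected_write_gb_s(SERVERS, PER_SERVER_GB_S, REPLICAS), "GB/sec")
```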

To measure this performance, we need a sufficient number of Gluster clients and a load generator that is capable of orchestrating a distributed filesystem load in a reliable and reproducible manner. I've been using Vdbench for this purpose for quite some time and it proved to be a reliable tool here once more. To have sufficient power, I used 9 VM.Standard2.24 systems, 3 in each AD. This gave me sufficient CPU and network bandwidth to be sure that the clients would not be the limiting factor in the performance testing. The general architecture is shown in the diagram from my first post.

Vdbench Configuration

Vdbench is a very flexible and powerful load generator for file or disk IO. You can find the user guide on its download page. (It's usually appropriate to use the latest available version.) I will briefly discuss the configuration file I used for this setup:

hd=default,jvms=24
hd=one,system=10.43.9.5,user=opc,shell=ssh
hd=two,system=10.43.9.6,user=opc,shell=ssh
hd=three,system=10.43.9.7,user=opc,shell=ssh
hd=four,system=10.43.9.8,user=opc,shell=ssh
hd=five,system=10.43.9.9,user=opc,shell=ssh
hd=six,system=10.43.9.19,user=opc,shell=ssh
hd=seven,system=10.43.9.18,user=opc,shell=ssh
hd=eight,system=10.43.9.17,user=opc,shell=ssh
hd=nine,system=10.43.9.16,user=opc,shell=ssh

fsd=fsd1,anchor=/gluster/fast/work,depth=1,width=4,files=50000,size=20M,shared=yes

fwd=default,xfersize=4k,fileio=random,fileselect=random,threads=350,stopafter=5,rdpct=100
fwd=fwd1,fsd=fsd1,host=one
fwd=fwd2,fsd=fsd1,host=two
fwd=fwd3,fsd=fsd1,host=three
fwd=fwd4,fsd=fsd1,host=four
fwd=fwd5,fsd=fsd1,host=five
fwd=fwd6,fsd=fsd1,host=six
fwd=fwd7,fsd=fsd1,host=seven
fwd=fwd8,fsd=fsd1,host=eight
fwd=fwd9,fsd=fsd1,host=nine

*rd=rd1,fwd=fwd*,fwdrate=max,format=(clean,only),interval=1

rd=rd3,fwd=fwd1,fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd4,fwd=(fwd1-fwd2),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd5,fwd=(fwd1-fwd3),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd6,fwd=(fwd1-fwd4),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd7,fwd=(fwd1-fwd5),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd8,fwd=(fwd1-fwd6),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd9,fwd=(fwd1-fwd7),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd10,fwd=(fwd1-fwd8),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd11,fwd=(fwd1-fwd9),fwdrate=max,format=restart,elapsed=120,interval=1

A few remarks to explain this configuration:

  • The first section is the host definition (hd).
    • By default, all hosts will run 24 instances of the tool - one for each core.
    • Each of the hosts is given a name (one, two,...) and connection details.
  • Next comes the definition of the filesystem for the test. This will be populated with 4 directories, each containing 50k files of 20MB each.
  • In the filesystem workload definition (fwd), I set the defaults for each run: the blocksize, IO type, number of threads and the share of the workload that should be reads (rdpct=100). For a write-only workload, this parameter would be 0.
  • Next are the run definitions, where the individual test runs are defined.
    • The first one is commented out. It is only used to create the content of the filesystem. While this can be a performance test in itself, it takes significantly longer than the 120 seconds defined for the real runs.
    • The others (rd3 through rd11) will each run the workload defined above, with a varying number of hosts participating.
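
To get a feeling for why creating the filesystem content takes so much longer than a 120 second run, here is a rough calculation of the fileset size. It is only a sketch: it assumes that size=20M means 20 x 1024 x 1024 bytes, and that the format could run at about 2,900 MB/sec, which is roughly the peak write rate measured later in this post. Both values are assumptions, and the real format run may well be slower.

```python
# Fileset dimensions taken from the fsd line of the parameter file.
WIDTH = 4
DEPTH = 1
FILES_PER_DIR = 50_000
FILE_SIZE_MIB = 20            # ASSUMPTION: "20M" is 20 * 1024 * 1024 bytes
ASSUMED_WRITE_MIB_S = 2900    # ASSUMPTION: best-case rate for the format run


def fileset(width: int, depth: int, files_per_dir: int, file_size_mib: int):
    """Return (directories, total files, total size in MiB)."""
    directories = width ** depth
    total_files = directories * files_per_dir
    total_mib = total_files * file_size_mib
    return directories, total_files, total_mib


if __name__ == "__main__":
    dirs, files, mib = fileset(WIDTH, DEPTH, FILES_PER_DIR, FILE_SIZE_MIB)
    print(f"{dirs} directories, {files} files, {mib / 1024**2:.2f} TiB")
    print(f"best-case format time: {mib / ASSUMED_WRITE_MIB_S / 60:.0f} minutes")
```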

The read:write ratio for a full run is set by changing the value for "rdpct" in the defaults for fwd. With that, a full run will scale the number of clients from one to nine without modifying any other parameter.
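
If you want to run several variants, it can be convenient to generate the parameter file instead of editing it by hand. The following is a minimal sketch that reproduces the file shown above with rdpct and xfersize as parameters. The function name build_config is hypothetical, and the value "4m" used for the large blocksize in the usage line is my assumption of how 4MB would be written. I did not use such a script for the tests described here.

```python
# Host names and addresses as used in the parameter file above.
HOSTS = {
    "one": "10.43.9.5", "two": "10.43.9.6", "three": "10.43.9.7",
    "four": "10.43.9.8", "five": "10.43.9.9", "six": "10.43.9.19",
    "seven": "10.43.9.18", "eight": "10.43.9.17", "nine": "10.43.9.16",
}


def build_config(rdpct: int, xfersize: str = "4k", elapsed: int = 120) -> str:
    """Return a Vdbench parameter file scaling from one to nine clients."""
    if not 0 <= rdpct <= 100:
        raise ValueError("rdpct must be between 0 and 100")
    lines = ["hd=default,jvms=24"]
    for name, address in HOSTS.items():
        lines.append(f"hd={name},system={address},user=opc,shell=ssh")
    lines.append("")
    lines.append("fsd=fsd1,anchor=/gluster/fast/work,depth=1,width=4,"
                 "files=50000,size=20M,shared=yes")
    lines.append("")
    lines.append(f"fwd=default,xfersize={xfersize},fileio=random,"
                 f"fileselect=random,threads=350,stopafter=5,rdpct={rdpct}")
    for index, name in enumerate(HOSTS, start=1):
        lines.append(f"fwd=fwd{index},fsd=fsd1,host={name}")
    lines.append("")
    # One run per client count; names rd3..rd11 match the original file.
    for clients in range(1, len(HOSTS) + 1):
        fwds = "fwd1" if clients == 1 else f"(fwd1-fwd{clients})"
        lines.append(f"rd=rd{clients + 2},fwd={fwds},fwdrate=max,"
                     f"format=restart,elapsed={elapsed},interval=1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: the 50% read/write variant with the large blocksize.
    print(build_config(rdpct=50, xfersize="4m"))
```

The fileset creation run (the commented-out rd1 above) is deliberately left out, since it only needs to be executed once.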

Testing Environment

These tests were run on this equipment:

  • Servers: 3x BM.Standard2.52, using both NICs for full network bandwidth as described in the previous post.
    Each server was configured with 8 block volumes of 4TB size each, formatted with xfs and tied together in a 3-way replicated distributed GlusterFS volume. The servers were located in three different availability domains in the eu-frankfurt-1 region of OCI.
  • Clients: 9x VM.Standard2.24, using one vNIC, mounting the GlusterFS filesystem using the native gluster client. The filesystem mount was configured with directIO to avoid client caching. Three clients were located in each of the availability domains. One of the clients was also used as the vdbench controller.
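
The storage layout translates into raw and net capacity as follows. This is a small sketch that assumes each block volume is used as exactly one Gluster brick, which the description above implies but does not state explicitly.

```python
SERVERS = 3
VOLUMES_PER_SERVER = 8
VOLUME_TB = 4
REPLICAS = 3  # 3-way replication


def capacity(servers: int, volumes_per_server: int, volume_tb: int, replicas: int):
    """Return (bricks, raw TB, replica sets, net TB) for the volume."""
    bricks = servers * volumes_per_server   # ASSUMPTION: one brick per block volume
    raw_tb = bricks * volume_tb
    replica_sets = bricks // replicas       # data is distributed across these sets
    net_tb = raw_tb / replicas              # every file is stored once per replica
    return bricks, raw_tb, replica_sets, net_tb


if __name__ == "__main__":
    bricks, raw_tb, sets, net_tb = capacity(SERVERS, VOLUMES_PER_SERVER, VOLUME_TB, REPLICAS)
    print(f"{bricks} bricks, {raw_tb} TB raw, {sets} replica sets, {net_tb:.0f} TB net")
```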

Results

Tests were run with several different workload combinations to show the most common performance characteristics:

  • 100% read
  • 100% write
  • 50% read/write
  • small blocksize of 4kB
  • large blocksize of 4MB

The following graphs show the results:

Throughput with large blocksize

As we can see, write performance reaches approx. 3GB/sec, close to the theoretical maximum estimated earlier. By increasing the read percentage in the workload, the overall throughput increases, as read traffic doesn't need to be replicated three times. The best result (with eight or nine clients) maxes out at just over 8GB/sec. While this doesn't reach the theoretical limit of around 9GB/sec, it is still a respectable result.
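
To put numbers on "close to the theoretical maximum", the peak values from the result tables can be compared with the estimates. The sketch below assumes 1 GB = 1000 MB for the estimates, which is my simplification.

```python
# Estimates from the recap, converted with 1 GB = 1000 MB (ASSUMPTION).
EXPECTED_MB_S = {"write": 3000, "read": 9000}
# Best measured values for 4MB IO from the result tables.
PEAK_MB_S = {"write": 2925, "read": 8119}


def fraction_of_expected(measured_mb_s: float, expected_mb_s: float) -> float:
    """Measured throughput as a fraction of the estimated maximum."""
    return measured_mb_s / expected_mb_s


if __name__ == "__main__":
    for workload, expected in EXPECTED_MB_S.items():
        share = fraction_of_expected(PEAK_MB_S[workload], expected)
        print(f"{workload}: {share:.1%} of the estimate")
```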

IOPS for various workloads

Network bandwidth is not the limiting factor for the second test with a small blocksize of just 4kB. In this case, we measure how many network packets can be transferred and how many filesystem operations can be serviced by this infrastructure. The impact of replication for the write workloads is still visible, but not as significant as with large transfers. The very linear increase per client flattens out slightly when going from eight to nine clients and reaches a maximum of around 70k IOPS for the read-only workload. Remember that this is testing on a distributed filesystem, so IOPS measured here cannot be directly mapped (or compared) to the IOPS possible with a raw block volume.
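
The flattening between eight and nine clients is easy to see when the read-only numbers from the result table are expressed as scaling efficiency, i.e. relative to perfect linear scaling from the single-client result. This is just one way to look at the data, sketched here:

```python
# 4k read-only file operations per second, by number of clients (from the table).
READ_FOPS_4K = {1: 9036, 2: 16629, 3: 25658, 4: 34410, 5: 42264,
                6: 51034, 7: 60702, 8: 68342, 9: 69646}


def scaling_efficiency(fops_by_clients: dict) -> dict:
    """Measured rate divided by n times the single-client rate."""
    base = fops_by_clients[1]
    return {n: fops / (n * base) for n, fops in fops_by_clients.items()}


def increments(fops_by_clients: dict) -> dict:
    """Additional operations per second gained by adding the n-th client."""
    counts = sorted(fops_by_clients)
    return {n: fops_by_clients[n] - fops_by_clients[prev]
            for prev, n in zip(counts, counts[1:])}


if __name__ == "__main__":
    gains = increments(READ_FOPS_4K)
    for n, eff in scaling_efficiency(READ_FOPS_4K).items():
        print(f"{n} clients: efficiency {eff:.2f}, gain {gains.get(n, READ_FOPS_4K[n])}")
```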

For reference, here are the full results in tabular form.

IO Size 4MB

                 write            read/write       read
No. of Clients    MB/s   FOPS      MB/s   FOPS      MB/s   FOPS
1                  968    242      1595    398      1027    256
2                 1908    477      2953    738      2125    531
3                 2850    712      4085   1021      3169    792
4                 2925    731      5011   1252      4181   1045
5                 2923    730      5790   1447      5248   1312
6                 2914    728      5873   1468      6212   1553
7                 2864    716      5789   1447      7279   1819
8                 2791    697      5710   1427      8113   2028
9                 2769    692      5593   1398      8119   2029

IO Size 4k

                 write            read/write       read
No. of Clients    MB/s   FOPS      MB/s   FOPS      MB/s   FOPS
1                   27   6958        28   7153        35   9036
2                   53  13577        54  13909        65  16629
3                   76  19609        82  21071       100  25658
4                  105  26979       107  27590       134  34410
5                  129  33197       133  34181       165  42264
6                  152  39060       157  40273       199  51034
7                  175  44862       182  46745       237  60702
8                  198  50737       205  52699       266  68342
9                  218  55821       229  58743       272  69646
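
The MB/s and FOPS columns are two views of the same measurement: throughput is the operation rate multiplied by the transfer size. The sketch below checks this for a few rows. It assumes binary units (4k = 4 KiB, MB = MiB), which is my interpretation; the 4k rows are consistent with it. The tolerance allows for rounding in the reported values.

```python
def mib_per_s(fops: float, xfer_kib: int) -> float:
    """Throughput in MiB/s for a given operation rate and transfer size."""
    return fops * xfer_kib / 1024


# (transfer size in KiB, FOPS, MB/s as reported), sampled from the tables.
SAMPLES = [
    (4, 6958, 27), (4, 55821, 218), (4, 69646, 272),
    (4096, 242, 968), (4096, 1398, 5593), (4096, 2029, 8119),
]


def consistent(xfer_kib: int, fops: int, reported_mb_s: int) -> bool:
    """True if the reported value matches within 1 MB/s or 1 percent."""
    calculated = mib_per_s(fops, xfer_kib)
    return abs(calculated - reported_mb_s) <= max(1.0, 0.01 * reported_mb_s)


if __name__ == "__main__":
    for xfer_kib, fops, reported in SAMPLES:
        print(xfer_kib, fops, reported, round(mib_per_s(fops, xfer_kib), 1),
              consistent(xfer_kib, fops, reported))
```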

Summary

With these tests, I could confirm the performance estimates that were based on the physical capabilities of the server equipment used. They show several things:

  • OCI's non-blocking network architecture lives up to the expectations set by the published SLAs.
  • GlusterFS is a low overhead distributed filesystem that delivers high performance, even when operated in a HA configuration.
  • Both throughput and IOPS scale with the number of clients.
  • Throughput is mainly limited by the available physical network bandwidth.
  • There were no obvious blockers to scalability, which suggests that performance would continue to scale with more servers.
    (I might update this blog if or when I get the opportunity to test with more servers.)

Finally, a word about price-performance. Based on the OCI cost estimator (prices as of 2019-12-09), the monthly cost for the tested server setup is:

  • 3x BM.Standard2.52: approx. $7,404
  • 3x 32TB Block Volume: approx. $2,508
    • 10 VPUs per GB for balanced volume performance: approx. $1,672
  • Total: approx. $11,586

This means you pay approx. $11,500 per month for a full HA configuration with 32TB of net storage capable of 8GB/sec read throughput.

A similar configuration on AWS, consisting of three "m5.metal" servers and 3x 8x4TB volumes of EBS storage, is estimated at around $21,642 by the AWS "Simple Monthly Calculator". As for the expected performance, that would be up to your own testing.
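
For those who like to see the arithmetic, here is a short sketch that derives the cost per net TB and the ratio between the two estimates from the figures quoted above. The line items are rounded, so their sum differs from the stated total by a few dollars. All prices are the 2019 estimates from this post, not current list prices.

```python
# Monthly OCI estimate, line items as quoted above (USD, rounded).
OCI_ITEMS_USD = {
    "3x BM.Standard2.52": 7404,
    "3x 32TB Block Volume": 2508,
    "10 VPUs per GB": 1672,
}
OCI_TOTAL_USD = 11586   # total as stated above
AWS_TOTAL_USD = 21642   # estimate for the comparable AWS setup
NET_TB = 32             # usable capacity after 3-way replication


def cost_per_net_tb(total_usd: float, net_tb: float) -> float:
    """Monthly cost per usable TB."""
    return total_usd / net_tb


if __name__ == "__main__":
    print("sum of line items:", sum(OCI_ITEMS_USD.values()))
    print(f"OCI per net TB: ${cost_per_net_tb(OCI_TOTAL_USD, NET_TB):.0f}")
    print(f"AWS per net TB: ${cost_per_net_tb(AWS_TOTAL_USD, NET_TB):.0f}")
    print(f"AWS / OCI: {AWS_TOTAL_USD / OCI_TOTAL_USD:.2f}x")
```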

Stefan Hinker

After around 20 years working on SPARC and Solaris, I am now a member of A-Team, focusing on infrastructure on Oracle Cloud.


