OCI Block Volume Performance with GlusterFS

January 16, 2020

Stefan Hinker

In my previous post, I described how to set up a highly available GlusterFS server environment using Terraform. I briefly mentioned the provisions I had taken to ensure high performance as well, but didn't go into the details of performance validation. This is what I'll do here.

To recap, the expected maximum performance from the three Gluster servers was:

3 GB/sec write throughput
9 GB/sec read throughput

To measure this performance, we need a sufficient number of Gluster clients and a load generator that is capable of orchestrating a distributed filesystem load in a reliable and reproducible manner. I've been using Vdbench for this purpose for quite some time, and it proved to be a reliable tool here once more. To have sufficient power, I used nine VM.Standard2.24 systems, three in each availability domain (AD). This gave me enough CPU and network bandwidth to be sure that the clients would not be the limiting factor in the performance testing. The general architecture is shown in the diagram from my first post.

Vdbench Configuration

Vdbench is a very flexible and powerful load generator for file or disk IO. You can find the user guide on its download page. (It's usually appropriate to use the latest available version.) I will briefly discuss the configuration file I used for this setup:

hd=default,jvms=24
hd=one,system=10.43.9.5,user=opc,shell=ssh
hd=two,system=10.43.9.6,user=opc,shell=ssh
hd=three,system=10.43.9.7,user=opc,shell=ssh
hd=four,system=10.43.9.8,user=opc,shell=ssh
hd=five,system=10.43.9.9,user=opc,shell=ssh
hd=six,system=10.43.9.19,user=opc,shell=ssh
hd=seven,system=10.43.9.18,user=opc,shell=ssh
hd=eight,system=10.43.9.17,user=opc,shell=ssh
hd=nine,system=10.43.9.16,user=opc,shell=ssh

fsd=fsd1,anchor=/gluster/fast/work,depth=1,width=4,files=50000,size=20M,shared=yes

fwd=default,xfersize=4k,fileio=random,fileselect=random,threads=350,stopafter=5,rdpct=100
fwd=fwd1,fsd=fsd1,host=one
fwd=fwd2,fsd=fsd1,host=two
fwd=fwd3,fsd=fsd1,host=three
fwd=fwd4,fsd=fsd1,host=four
fwd=fwd5,fsd=fsd1,host=five
fwd=fwd6,fsd=fsd1,host=six
fwd=fwd7,fsd=fsd1,host=seven
fwd=fwd8,fsd=fsd1,host=eight
fwd=fwd9,fsd=fsd1,host=nine

*rd=rd1,fwd=fwd*,fwdrate=max,format=(clean,only),interval=1

rd=rd3,fwd=fwd1,fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd4,fwd=(fwd1-fwd2),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd5,fwd=(fwd1-fwd3),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd6,fwd=(fwd1-fwd4),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd7,fwd=(fwd1-fwd5),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd8,fwd=(fwd1-fwd6),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd9,fwd=(fwd1-fwd7),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd10,fwd=(fwd1-fwd8),fwdrate=max,format=restart,elapsed=120,interval=1
rd=rd11,fwd=(fwd1-fwd9),fwdrate=max,format=restart,elapsed=120,interval=1

A few remarks to explain this configuration:

The first section is the host definition (hd).
- By default, all hosts will run 24 instances of the tool - one for each core.
- Each of the hosts is given a name (one, two,...) and connection details.
Next comes the definition of the filesystem for the test. This will be populated with 4 directories, each containing 50k files of 20MB each.
In the "filesystem workload definition" (fwd), I set the defaults for each run: the blocksize, IO type, number of threads and how much of the workload should be reading (rdpct=100). For a write-only workload, this parameter would be 0.
Next are the run definitions, where the individual test runs are defined.
- The first one is commented out. It is only used to create the content of the filesystem. While this can be a performance test in itself, it takes significantly longer than the 120 seconds defined for the real runs.
- The others (rd3 through rd11) will each run the workload defined above, with a varying number of hosts participating.
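
As a rough cross-check of the filesystem definition, the following Python sketch estimates how much data the format run has to create. It assumes that files= counts files per lowest-level directory and that depth and width describe a uniform directory tree, which matches the description above (4 directories with 50k files each). The function name fileset_size is made up for this example, and 20M is treated as 20 MB in round decimal units.

```python
# Sketch: estimate the size of the file set described by the fsd line.
# Assumption: files= is the number of files per lowest-level directory,
# and depth/width describe a uniform tree with width**depth leaf directories.

def fileset_size(depth: int, width: int, files: int, size_mb: int) -> dict:
    """Return directory count, file count and approximate size in TB."""
    leaf_dirs = width ** depth
    total_files = leaf_dirs * files
    total_mb = total_files * size_mb
    return {
        "leaf_dirs": leaf_dirs,
        "total_files": total_files,
        "total_tb": total_mb / 1_000_000,  # decimal units, for a rough figure
    }

# Values taken from: depth=1,width=4,files=50000,size=20M
summary = fileset_size(depth=1, width=4, files=50000, size_mb=20)
print(summary)
```

Under these assumptions the file set is about 4 TB, which helps explain why the format run takes much longer than the 120-second test runs.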

The read:write ratio for a full run is set by changing the value of "rdpct" in the defaults for fwd. With that, a full run will scale the number of clients from one to nine without modifying any other parameter.
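
Since only rdpct and the blocksize change between full runs, the parameter file can also be generated instead of edited by hand. The sketch below is one possible way to do that in Python; it is not part of Vdbench. The helper name build_config is hypothetical, while the IP addresses and all parameter values are copied from the configuration shown above.

```python
# Sketch: generate the host, workload and run definitions shown above.
# build_config is a made-up helper for illustration, not a Vdbench feature.

NAMES = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

CLIENT_IPS = [
    "10.43.9.5", "10.43.9.6", "10.43.9.7", "10.43.9.8", "10.43.9.9",
    "10.43.9.19", "10.43.9.18", "10.43.9.17", "10.43.9.16",
]

def build_config(ips, rdpct=100, xfersize="4k", elapsed=120):
    """Return a Vdbench parameter file as text for the given client IPs."""
    if len(ips) > len(NAMES):
        raise ValueError("this sketch only knows names for up to nine hosts")
    names = NAMES[:len(ips)]
    lines = ["hd=default,jvms=24"]
    lines += [f"hd={n},system={ip},user=opc,shell=ssh" for n, ip in zip(names, ips)]
    lines.append("fsd=fsd1,anchor=/gluster/fast/work,depth=1,width=4,"
                 "files=50000,size=20M,shared=yes")
    lines.append(f"fwd=default,xfersize={xfersize},fileio=random,"
                 f"fileselect=random,threads=350,stopafter=5,rdpct={rdpct}")
    lines += [f"fwd=fwd{i},fsd=fsd1,host={n}" for i, n in enumerate(names, start=1)]
    # One run definition per client count; numbering starts at rd3 as in the original.
    for count in range(1, len(names) + 1):
        fwds = "fwd1" if count == 1 else f"(fwd1-fwd{count})"
        lines.append(f"rd=rd{count + 2},fwd={fwds},fwdrate=max,"
                     f"format=restart,elapsed={elapsed},interval=1")
    return "\n".join(lines)

# Example: the write-only variant of the 4k test.
print(build_config(CLIENT_IPS, rdpct=0))
```

Calling build_config(CLIENT_IPS, rdpct=50, xfersize="4m") would, under the same assumptions, produce the mixed workload with the large blocksize.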

Testing Environment

These tests were run on this equipment:

Servers: 3x BM.Standard2.52, using both NICs for full network bandwidth as described in the previous post.
Each server was configured with 8 block volumes of 4TB size each, formatted with xfs and tied together in a 3-way replicated distributed GlusterFS volume. The servers were located in three different availability domains in the eu-frankfurt-1 region of OCI.
Clients: 9x VM.Standard2.24, using one vNIC, mounting the GlusterFS filesystem using the native gluster client. The filesystem mount was configured with directIO to avoid client caching. Three clients each were located in each of the availability domains. One of the clients was also used as the vdbench controller.

Results

Tests were run with several different workload combinations to show the most common performance characteristics:

100% read
100% write
50% read/write
small blocksize of 4kB
large blocksize of 4MB

The following graphs show the results:

Throughput with large blocksize

As we can see, write performance reaches approx. 3GB/sec, close to the theoretical maximum estimated earlier. By increasing the read percentage in the workload, the overall throughput increases, as read traffic doesn't need to be replicated three times. The best result (with eight or nine clients) maxes out at just over 8GB/sec. While this doesn't reach the theoretical limit of around 9GB/sec, it is still a respectable result.

IOPS for various workloads

Network bandwidth is not the limiting factor for the second test with a small blocksize of just 4kB. In this case, we measure how many network packets can be transferred and how many filesystem operations can be serviced by this infrastructure. The impact of replication for the write workloads is still visible, but not as significant as with large transfers. The very linear increase per client flattens out slightly when going from eight to nine clients and reaches a maximum of around 70k IOPS for the read-only workload. Remember that this is testing on a distributed filesystem, so IOPS measured here cannot be directly mapped (or compared) to the raw disk IOPS possible with a raw block volume.

For reference, here are the full results in tabular form.

IO Size 4MB
No. of Clients	write MB/s	write FOPS	read/write MB/s	read/write FOPS	read MB/s	read FOPS
1	968	242	1595	398	1027	256
2	1908	477	2953	738	2125	531
3	2850	712	4085	1021	3169	792
4	2925	731	5011	1252	4181	1045
5	2923	730	5790	1447	5248	1312
6	2914	728	5873	1468	6212	1553
7	2864	716	5789	1447	7279	1819
8	2791	697	5710	1427	8113	2028
9	2769	692	5593	1398	8119	2029

IO Size 4k
No. of Clients	write MB/s	write FOPS	read/write MB/s	read/write FOPS	read MB/s	read FOPS
1	27	6958	28	7153	35	9036
2	53	13577	54	13909	65	16629
3	76	19609	82	21071	100	25658
4	105	26979	107	27590	134	34410
5	129	33197	133	34181	165	42264
6	152	39060	157	40273	199	51034
7	175	44862	182	46745	237	60702
8	198	50737	205	52699	266	68342
9	218	55821	229	58743	272	69646
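
The MB/s and FOPS columns are two views of the same measurement: throughput is the operation rate multiplied by the transfer size. The short Python sketch below checks this for a few rows of the tables. It assumes that FOPS means file operations per second and that one MB is 1,048,576 bytes, which is what the reported numbers are consistent with; the function name mb_per_sec is made up for this example.

```python
# Sketch: cross-check reported MB/s against FOPS x transfer size.
# Assumption: 1 MB = 1024 * 1024 bytes, which matches the reported values.

KIB = 1024
MIB = 1024 * 1024

def mb_per_sec(fops: float, xfer_bytes: int) -> float:
    """Throughput in MB/s for a given operation rate and transfer size."""
    return fops * xfer_bytes / MIB

# (description, transfer size in bytes, FOPS, reported MB/s) from the tables
SAMPLES = [
    ("4MB write, 1 client", 4 * MIB, 242, 968),
    ("4MB read, 9 clients", 4 * MIB, 2029, 8119),
    ("4k write, 1 client", 4 * KIB, 6958, 27),
    ("4k read, 9 clients", 4 * KIB, 69646, 272),
]

for label, xfer, fops, reported in SAMPLES:
    calculated = mb_per_sec(fops, xfer)
    print(f"{label}: calculated {calculated:.1f} MB/s, reported {reported} MB/s")
```

Small differences between the calculated and reported values are expected, because both columns are rounded in the tables.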

Summary

With these tests, I could confirm the performance estimates that were based on the physical capabilities of the server equipment used. They show several things:

OCI's non-blocking network architecture lives up to the expectations set by the published SLAs.
GlusterFS is a low overhead distributed filesystem that delivers high performance, even when operated in a HA configuration.
Both throughput and IOPS scale with the number of clients.
Throughput is mainly limited by the available physical network bandwidth.
There were no obvious blockers to scalability, which suggests that performance would continue to scale with more servers.
(I might update this blog if or when I get the opportunity to test with more servers.)

Finally, a word about price-performance. Based on the OCI cost estimator (prices as of 2019-12-09), the monthly cost for the tested server setup is:

3x BM.Standard2.52: approx. $7,404
3x 32TB Block Volume: approx. $2,508
- 10 VPUs per GB for balanced volume performance: approx. $1,672
Total: approx. $11,586

This means you pay approx. $11,500 per month for a full HA configuration with 32TB of net storage capable of 8 GB/sec read throughput.

A similar configuration on AWS, consisting of 3 "m5.metal" servers and 3x 8x4TB volumes of EBS storage, is estimated at around $21,642 per month by the AWS "Simple Monthly Calculator". As for the expected performance, this would be up to your own testing.
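
To make the price-performance figures easier to compare, the following Python sketch derives a few ratios from the numbers quoted above. It is only arithmetic on the stated estimates: the monthly totals, the 32TB of net storage and the peak read throughput of 8119 MB/s from the results table. The variable names are made up for this example, and since no performance was measured on AWS, only the cost ratio is computed for that side.

```python
# Sketch: simple price-performance ratios from the figures quoted in this post.
# All inputs are the approximate estimates stated above, not exact prices.

oci_monthly_usd = 11586      # total for the tested OCI setup
aws_monthly_usd = 21642      # estimate for the similar AWS configuration
net_storage_tb = 32          # net capacity after 3-way replication
peak_read_mb_s = 8119        # best read result from the 4MB table

cost_per_net_tb = oci_monthly_usd / net_storage_tb
cost_per_mb_s = oci_monthly_usd / peak_read_mb_s
aws_to_oci_ratio = aws_monthly_usd / oci_monthly_usd

print(f"OCI cost per net TB and month: ${cost_per_net_tb:.2f}")
print(f"OCI cost per MB/s of read throughput and month: ${cost_per_mb_s:.2f}")
print(f"AWS estimate relative to OCI: {aws_to_oci_ratio:.2f}x")
```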

Stefan Hinker

After around 20 years working on SPARC and Solaris, I am now a member of A-Team, focusing on infrastructure on Oracle Cloud.
