Automated Backup of Instances with Multiple Block Volumes

July 7, 2020 | 5 minute read

Sometimes all you want to do is create regular backups of your instances. Policy-Based Backups to the rescue: all you need to do is choose one of the existing policies or define your own, and then apply it to the instance or volume you want to protect. This works like a charm and opens the door to cross-region copies and very simple restore operations. However, what do you do if you have multiple data volumes attached to your instance, in addition to the boot volume (which, as a best practice, should only contain your OS)? The first thought will be for Volume Groups, as they are designed to group multiple volumes into one administrative unit and allow for simultaneous backups of all the volumes in the group. However, you cannot control these backups with a policy (yet). Fortunately, it is rather straightforward to bridge this gap with some scripting, for example using the Python SDK for OCI.

In this blog post, I will describe one possible approach for a solution, using a simple use case as an example.

The Use Case

Let's assume we have a system that provides shared file access for Windows clients by running Samba to serve out multiple filesystems. These filesystems are hosted on a set of block volumes attached to this server VM. We want to be able to do several things:

  • Create a crash-consistent backup of all volumes attached to this VM, boot volume and data volumes, all at the same time.
  • Delete the oldest backup set, but always keep a certain number of backups.
  • View the status of all backups for this VM.

The general idea is simple:

  • Create a Volume Group for this instance and add all the data volumes and the boot volume to this group.
  • Create a Volume Group Backup using this group.

To simplify the usage of such a script, the backup group should be linked to the instance it belongs to. We can do this by querying the instance for its boot volume and then looking up the backup group this boot volume belongs to. Since the API only allows 1:1 relationships between volumes and groups, this is easy.

Building Blocks

Taking the instance OCID as the only input, the first step is to do some basic discovery:

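import sys
from datetime import datetime

import oci

# ociconfig is assumed to hold a loaded OCI configuration,
# for example the result of oci.config.from_file()
compute = oci.core.ComputeClient(ociconfig)
blockvolume = oci.core.BlockstorageClient(ociconfig)
instance = compute.get_instance(instanceOCID)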

def fetch_metadata(instance):
    
    comp     = instance.data.compartment_id
    ad       = instance.data.availability_domain
    
    # look up the boot volume and check whether it already belongs to a volume group
    bva = compute.list_boot_volume_attachments(ad,comp,instance_id=instance.data.id)
    bootvolume = blockvolume.get_boot_volume(bva.data[0].boot_volume_id)
    setid = bootvolume.data.volume_group_id
    if setid is not None:
        try:
            volumegroup=blockvolume.get_volume_group(setid)
        except oci.exceptions.ServiceError as e:
            print(f'Error looking up Volume Group: {e.message}\nExiting')
            sys.exit(1)            
        print(f'Found Volume Group {volumegroup.data.display_name}')
    else:
        # no backup set defined, we need to create the volume group
        # first, get the list of all attached block volumes and the boot volume
        volumeIDs = []
        blockvolumes = compute.list_volume_attachments(comp,availability_domain=ad,instance_id=instance.data.id).data
        for b in blockvolumes:
            volumeIDs.append(b.volume_id)
        # the boot volume attachment was already looked up above
        volumeIDs.append(bva.data[0].boot_volume_id)
        # then configure and create the volume group
        backupgroup=oci.core.models.CreateVolumeGroupDetails(
            availability_domain=ad,
            compartment_id=comp,
            display_name=instance.data.display_name + "_backup_group",
            source_details=oci.core.models.VolumeGroupSourceFromVolumesDetails(volume_ids=volumeIDs))
        volumegroup=blockvolume.create_volume_group(backupgroup)
        print(f'Created Volume Group {volumegroup.data.display_name}')
        setid=volumegroup.data.id
    return setid

Here, if we find a valid volume group, we return its OCID to the caller. Otherwise, we collect a list of all volumes attached to the instance and create a new volume group, using the instance name as a prefix for the group name. We then return the setid. After returning from this function, we know that a valid volume group exists for the instance; if it was just created, it contains all volumes currently attached to the instance.
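
A group that already existed is returned as is, so a volume attached to the instance after the group was created would not be part of it. A small check along the following lines could flag that situation. This is only a sketch: the function name `missing_from_group` is made up for illustration, the IDs are placeholders, and it assumes you pass in the attached volume IDs (collected as in fetch_metadata) and the group's `volume_ids` list.

```python
def missing_from_group(attached_volume_ids, group_volume_ids):
    """Return the attached volume IDs that are not members of the volume group."""
    in_group = set(group_volume_ids)
    # keep the original order of the attached volumes in the result
    return [v for v in attached_volume_ids if v not in in_group]


# placeholder IDs, not real OCIDs
attached = ["boot-1", "data-1", "data-2"]
group = ["boot-1", "data-1"]
print(missing_from_group(attached, group))  # prints ['data-2']
```

If the list is not empty, the script could print a warning before starting the backup, so that an uncovered volume does not go unnoticed.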

Next, let's create a first backup:

def do_backup(setid):
    
    now=datetime.now()
    datestring=now.strftime("%Y-%m-%d_%H:%M")
    backupjob=oci.core.models.CreateVolumeGroupBackupDetails(
        type='INCREMENTAL',
        display_name=datestring,
        volume_group_id=setid)
    try:
        backup=blockvolume.create_volume_group_backup(backupjob)
        print(f'Backup started: {datestring}')
    except oci.exceptions.ServiceError as e:
        print (f'Backup failed: {e.message}')

As you can see, there's not very much to do here. All we need is a name for the backup - using a sensibly formatted date string is a first approach. Then we just use the OCID of the volume group we got from fetch_metadata and ask the API to create a backup.
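
One nice property of this date string: because the fields run from year down to minute, sorting the names alphabetically also sorts them chronologically. The sketch below illustrates this with the same format string as do_backup. The helper names `backup_name` and `parse_backup_name` are hypothetical and not part of the script.

```python
from datetime import datetime

NAME_FORMAT = "%Y-%m-%d_%H:%M"  # same format string as in do_backup


def backup_name(when):
    """Format a datetime as a backup display name."""
    return when.strftime(NAME_FORMAT)


def parse_backup_name(name):
    """Turn a backup display name back into a datetime."""
    return datetime.strptime(name, NAME_FORMAT)


names = [backup_name(datetime(2020, 7, 6, 12, 17)),
         backup_name(datetime(2020, 7, 6, 11, 46))]
print(sorted(names))  # prints ['2020-07-06_11:46', '2020-07-06_12:17']
```

Note that the format only has minute resolution, so two backups started within the same minute would end up with the same display name.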

Before we go into deleting old backups, let's see if we can get some information about existing backups:

def show_backups(comp,setid):
    # note: "keep" (the number of backups to retain) is expected to be set at module level
    backups=blockvolume.list_volume_group_backups(comp,volume_group_id=setid,sort_by="TIMECREATED",sort_order="ASC")
    goodbackups=[]
    for b in backups.data:
        if (b.lifecycle_state != 'TERMINATED'):
            goodbackups.append(b)
    print('Available Backups:')
    count=len(goodbackups)
    for b in goodbackups:
        if (count > keep):
            print (f' - {b.display_name}')
        else:
            print (f' * {b.display_name}')
        count -= 1

We might also want a little function to ask for the latest available backup:

def latest_backup(comp,setid):
    backups=blockvolume.list_volume_group_backups(comp,volume_group_id=setid,sort_by="TIMECREATED",sort_order="ASC")
    latest=None
    for b in backups.data:
        if (b.lifecycle_state != 'TERMINATED'):
            latest=b
    return latest            
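
The functions above treat every backup that is not TERMINATED as usable, which also includes backups that are still being created. A stricter variant could count only backups in the AVAILABLE state. The following is a sketch under that assumption: `pick_latest` is a hypothetical helper, and the stand-in objects only carry the three fields the function needs (`display_name`, `lifecycle_state`, `time_created`).

```python
from datetime import datetime
from types import SimpleNamespace


def pick_latest(backups, usable_states=("AVAILABLE",)):
    """Return the newest backup in a usable state, or None if there is none."""
    usable = [b for b in backups if b.lifecycle_state in usable_states]
    if not usable:
        return None
    return max(usable, key=lambda b: b.time_created)


# stand-ins for the objects returned by list_volume_group_backups
sample = [
    SimpleNamespace(display_name="2020-07-06_11:46", lifecycle_state="AVAILABLE",
                    time_created=datetime(2020, 7, 6, 11, 46)),
    SimpleNamespace(display_name="2020-07-06_12:17", lifecycle_state="AVAILABLE",
                    time_created=datetime(2020, 7, 6, 12, 17)),
    SimpleNamespace(display_name="2020-07-06_12:19", lifecycle_state="CREATING",
                    time_created=datetime(2020, 7, 6, 12, 19)),
]
print(pick_latest(sample).display_name)  # prints 2020-07-06_12:17
```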

A First Run

Running this for the first time on our example fileserver:

    show_backups(instance.data.compartment_id,backupsetID)
    latest=latest_backup(instance.data.compartment_id,backupsetID)
    if (latest is None):
        print ('No backups available')
    else:
        print (f'Latest Backup: {latest.display_name}')

The output will be:

$ ./instance.backup.py -s -i ocid1.instance.oc1.eu-frankfurt-1.xyz
Created Volume Group smbserver_backup_group
Available Backups:
No backups available

Running the first backup:

$ ./instance.backup.py -b -i ocid1.instance.oc1.eu-frankfurt-1.xyz
Found Volume Group smbserver_backup_group
Backup started: 2020-07-06_11:46

And checking the status again:

$ ./instance.backup.py -s -i ocid1.instance.oc1.eu-frankfurt-1.xyz
Found Volume Group smbserver_backup_group
Available Backups:
 * 2020-07-06_11:46
Latest Backup: 2020-07-06_11:46
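
The command lines above use the short options -s, -b, -d, -k and -i. One way to wire these up is with argparse, as in the sketch below. The long option names, the help texts and the default of 4 for -k are assumptions made for this example.

```python
import argparse


def build_parser():
    """Command line parser for the options used in the sample runs."""
    parser = argparse.ArgumentParser(
        description="Volume group backups for one instance (sketch)")
    parser.add_argument("-i", "--instance", required=True,
                        help="OCID of the instance to protect")
    parser.add_argument("-k", "--keep", type=int, default=4,
                        help="number of backups to keep (default is an assumption)")
    # exactly one of the three modes must be selected
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-s", "--status", action="store_true",
                      help="show existing backups")
    mode.add_argument("-b", "--backup", action="store_true",
                      help="create a new backup")
    mode.add_argument("-d", "--delete", action="store_true",
                      help="delete all but the newest backups")
    return parser


args = build_parser().parse_args(
    ["-d", "-k", "4", "-i", "ocid1.instance.oc1.eu-frankfurt-1.xyz"])
print(args.delete, args.keep, args.instance)
# prints: True 4 ocid1.instance.oc1.eu-frankfurt-1.xyz
```

Putting the three modes into a mutually exclusive group means that a call combining, say, -b and -d is rejected before any API request is made.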

Managing Backups

Finally, after having created a few backups, we will want a way to delete old backups we no longer need. For automated backups using policies, you can choose from various options for your retention policy. Until policies also support Volume Groups, we'll have to implement retention ourselves. For the purpose of this example, a simple "erase all but the newest n backups" should be fancy enough. You're welcome to enhance this on your own ;-)

First, a word of warning: Before deleting backups, make sure that the backups you keep are actually valid! Especially if you run a script like this in short intervals, say once per hour, you will quickly have many backups, leading you to believe that it's safe to delete all but the last 10, for example. However, if you have some systematic error on your server - say it won't boot anymore - that state will also be reflected in your backups. This means you may well end up with nothing but backups of a server that won't boot, because the error was introduced before the oldest backup you kept was made. So always check that a newer backup preserves good state before deleting the older ones!

If we want to keep N backups, we could use a function like this:

def do_deleteBackups(comp,setid,keep):
    backups=blockvolume.list_volume_group_backups(comp,volume_group_id=setid,sort_by="TIMECREATED",sort_order="ASC")
    goodbackups=[]
    for b in backups.data:
        if (b.lifecycle_state != 'TERMINATED'):
            goodbackups.append(b)
    count=len(goodbackups)
    for b in goodbackups:
        if (count > keep):
            print (f'Deleting backup: {b.display_name} ...',end='')
            result=blockvolume.delete_volume_group_backup(b.id)
            print ('OK')
        else:
            print (f'Keeping backup:  {b.display_name}')
        count -= 1

This will first collect all your "good" (non-terminated) backups, then go through that list and terminate all but the last "keep" ones.

Here's some sample output after we've made a few backups first:

$ ./instance.backup.py -s -i ocid1.instance.oc1.eu-frankfurt-1.xyz
Found Volume Group smbserver_backup_group
Available Backups:
 - 2020-07-06_11:46
 * 2020-07-06_12:17
 * 2020-07-06_12:19
 * 2020-07-06_12:22
 * 2020-07-06_12:28
Latest Backup: 2020-07-06_12:28

$ ./instance.backup.py -d -k 4 -i ocid1.instance.oc1.eu-frankfurt-1.xyz
Found Volume Group smbserver_backup_group
Deleting all but the latest 4 backups...
Deleting backup: 2020-07-06_11:46 ...OK
Keeping backup:  2020-07-06_12:17
Keeping backup:  2020-07-06_12:19
Keeping backup:  2020-07-06_12:22
Keeping backup:  2020-07-06_12:28
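
The retention rule itself can also be written as a small pure function, which makes it easy to try out without touching any real backups. The sketch below uses the backup names from the sample run above. The function name `split_for_retention` is hypothetical, and the refusal to accept a keep value below 1 is an added safety assumption, not something the original function does.

```python
def split_for_retention(names_oldest_first, keep):
    """Split a list sorted oldest-first into (to_delete, to_keep)."""
    if keep < 1:
        # assumption: never allow a call that would delete every backup
        raise ValueError("keep must be at least 1")
    cut = max(len(names_oldest_first) - keep, 0)
    return names_oldest_first[:cut], names_oldest_first[cut:]


names = ["2020-07-06_11:46", "2020-07-06_12:17", "2020-07-06_12:19",
         "2020-07-06_12:22", "2020-07-06_12:28"]
to_delete, to_keep = split_for_retention(names, keep=4)
print(to_delete)  # prints ['2020-07-06_11:46']
print(len(to_keep))  # prints 4
```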

Restoring

Restoring one of your backups is very simple. In the Console, just go to the Volume Group Backup which you want to use for your restore and select "Create Volume Group". A new Volume Group will be created for you, containing restored copies of all the volumes in the backup. This usually only takes a few seconds. To finalize the recovery, just spin up a new instance from the restored boot volume and attach the data volumes. If critical, you can even "reuse" the IP addresses of the original VM, although for that, you'll need to terminate it first.
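
The same restore should also be possible from a script. To my understanding, the Python SDK offers a VolumeGroupSourceFromVolumeGroupBackupDetails model that can be used as source_details when creating a volume group; please verify this against the SDK reference before relying on it. In the sketch below, `build_restore_request` is a hypothetical helper, the OCIDs and the availability domain are placeholders, and the models module is passed in as a parameter so the function can be tried without the SDK installed.

```python
from types import SimpleNamespace


def build_restore_request(models, backup_id, availability_domain,
                          compartment_id, display_name):
    """Build the request for a volume group restored from a group backup.

    models is expected to be oci.core.models when used with the real SDK.
    """
    source = models.VolumeGroupSourceFromVolumeGroupBackupDetails(
        volume_group_backup_id=backup_id)
    return models.CreateVolumeGroupDetails(
        availability_domain=availability_domain,
        compartment_id=compartment_id,
        display_name=display_name,
        source_details=source)


# stand-in for oci.core.models, only for trying the function offline
fake_models = SimpleNamespace(
    VolumeGroupSourceFromVolumeGroupBackupDetails=SimpleNamespace,
    CreateVolumeGroupDetails=SimpleNamespace)

request = build_restore_request(
    fake_models,
    backup_id="ocid1.volumegroupbackup.oc1..example",   # placeholder
    availability_domain="AD-1",                         # placeholder
    compartment_id="ocid1.compartment.oc1..example",    # placeholder
    display_name="smbserver_restored")
print(request.source_details.volume_group_backup_id)
# prints: ocid1.volumegroupbackup.oc1..example
```

With the real SDK, the idea would be to pass oci.core.models as the first argument and hand the result to blockvolume.create_volume_group, just like in fetch_metadata.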

Stefan Hinker

After around 20 years working on SPARC and Solaris, I am now a member of A-Team, focusing on infrastructure on Oracle Cloud.

