Completely cleaning up the contents of an OCI compartment

July 10, 2023 | 11 minute read
Christopher Johnson
Director, Cloud Engineering

There are a bunch of ways to use Compartments in OCI to organize your work. For example, most of my compatriots and I create a compartment per project and put everything for that project in there. When the project is over we nuke the entire thing from orbit. After all, it's the only way to be sure.

Ripley is my hero

But you can't delete a compartment until it's empty...

You must remove all resources from a compartment before you can delete it.

To delete a compartment, it must be empty of all resources. Before you initiate deleting a compartment, be sure that all its resources have been moved, deleted, or terminated, including any policies attached to the compartment.

Source: https://docs.oracle.com/en-us/iaas/Content/Identity/compartments/To_delete_a_compartment.htm

So it's not as easy as just hitting a button and moving on.

So we should delete the compartment when we're done. But instead we move it into a compartment named _ToBeDeleted, which is our magic coffee table, and overnight all of the compute instances and load balancers and subnets and VCNs and pizza boxes just go away like magic!

Well, not everything. But more on that in a second.

And not actually by magic. We just have a CronJob in a little single-node OKE cluster we keep for those sorts of things, and it runs Richard's Super Delete script.

That script hits on most of the big blocks but it does leave behind things he hasn't yet added. And so every once in a while, usually when we're in a meeting that could have been an email, someone goes through and pokey clicks around to delete whatever is left before deleting the compartment.

Richard's script is great. Full stop. I mean that truly. We've been using it for about as long as it has existed. And it does clean up the vast majority of the things we normally use. And while I could have added code to handle those services and objects he doesn't yet support (and probably should have just done that), I have had this idea for a different approach knocking around in the back of my head. And I really wanted to try it out.

My idea boiled down to: What if the cleanup logic existed separately from the objects that need to be found and removed? And what if the cleanup code just needed a declared list of the services and object types, and then it did all the rest?

And so I started with that idea at its core and built exactly that. Or at least mostly that.
 

This is from my README (though this will probably be out of date by the time you read this post)...


extirpate : ex·tir·pate : to completely erase or eradicate

What it is and does

The OCI Extirpater is a command line tool that deletes everything within a compartment in every OCI region.

How it does it

extirpater uses the OCI SDK to:

  1. find every subscribed region
  2. find every compartment underneath the specified root
  3. find every object within that compartment
  4. delete the object

The code for each object type is actually quite small (see ociclients/template.py).
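
As a rough illustration of those four steps (and not the actual extirpater code), the control flow can be sketched with plain callables standing in for the SDK. Every name here (`extirpate`, `list_regions`, `list_compartments`, `list_objects`, `delete_object`) is hypothetical, and the fake tenancy data exists only so the sketch runs without credentials.

```python
from typing import Callable, Iterable, List, Tuple


def extirpate(
    root: str,
    list_regions: Callable[[], Iterable[str]],
    list_compartments: Callable[[str], Iterable[str]],
    list_objects: Callable[[str, str], Iterable[str]],
    delete_object: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Walk regions -> compartments -> objects and request a delete for each.

    All four callables are hypothetical stand-ins for SDK calls.
    Returns the (region, object_id) pairs a delete was requested for.
    """
    requested = []
    # Materialize once so the same compartment list is reused for every region.
    compartments = [root, *list_compartments(root)]
    for region in list_regions():                          # step 1
        for compartment in compartments:                   # step 2
            for obj in list_objects(region, compartment):  # step 3
                delete_object(region, obj)                 # step 4
                requested.append((region, obj))
    return requested


# Fake tenancy data, purely for demonstration.
FAKE_OBJECTS = {
    ("us-ashburn-1", "root"): ["vcn-1"],
    ("us-ashburn-1", "child"): ["instance-1", "instance-2"],
    ("eu-frankfurt-1", "child"): ["bucket-1"],
}
deleted = []

result = extirpate(
    root="root",
    list_regions=lambda: ["us-ashburn-1", "eu-frankfurt-1"],
    list_compartments=lambda root: ["child"],
    list_objects=lambda region, comp: FAKE_OBJECTS.get((region, comp), []),
    delete_object=lambda region, obj: deleted.append(obj),
)
print(result)
```

Passing the SDK-facing pieces in as callables keeps the loop itself trivial to test; a real version would swap the lambdas for calls made through region-specific SDK clients.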


 

OCI is API first. So every service has an API to manage the related objects and there is an SDK (in a bunch of languages) to make interacting easier than writing raw REST requests. In Python that means for a service XYZ there's generally a class along the lines of oci.xyz.XYZManagementClient (for Functions it's oci.functions.FunctionsManagementClient) that has functions / methods to create, list, update, and delete the objects. The list function usually takes the OCID of the compartment and the delete one takes the OCID of the object. There are all sorts of exceptions to the above for various reasons, but in general this is mostly how the APIs and SDK work.

So for objects that work like that I made it easy - I create a class, declare a human readable name for the service, provide the class in the SDK and then declare an array of objects related to that class. In that array each object has the friendly names (singular and plural) for human readable logging, and the names of the list and delete functions.

 

The simplest working example class is probably the one for OCI Functions, and it looks like this:

import oci
from ociextirpater.OCIClient import OCIClient

class functions( OCIClient ):
    service_name = "Functions"
    clientClass = oci.functions.FunctionsManagementClient

    objects = [
        {
            "name_singular"    : "Functions Application",
            "name_plural"      : "Functions Applications", 
            "function_list"    : "list_applications",
            "function_delete"  : "delete_application",
        }
    ]

Super simple, right?!

The cleanup work for Functions Applications is really straightforward in part because the service is actually really nice about things. If you delete a Functions Application with multiple Functions in it, the service goes ahead and deletes them all for you. No need to delete each one first.

All of the heavy lifting is up in OCIClient.py. It knows how to interrogate these classes, find the service name, the right class in the OCI SDK, and the objects it's going to delete. And then it goes and creates all the instances and whatnot of the right Python classes from the SDK and does all the work of iterating and listing and deleting and etc etc etc.
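
To make that idea concrete, here is a minimal sketch of how a base class could drive such declarations by looking the declared method names up with `getattr`. It is an assumption about the general shape, not the contents of OCIClient.py: `SketchClient`, `FakeFunctionsClient`, and the `clean` method are hypothetical, and the fake client returns plain lists where the real SDK returns response objects.

```python
from types import SimpleNamespace


class FakeFunctionsClient:
    """Hypothetical stand-in for an SDK client; keeps applications in memory."""

    def __init__(self):
        self.apps = {"ocid1.app.a": "demo-a", "ocid1.app.b": "demo-b"}

    def list_applications(self, compartment_id):
        return [SimpleNamespace(id=i, display_name=n) for i, n in self.apps.items()]

    def delete_application(self, application_id):
        del self.apps[application_id]


class SketchClient:
    """Generic cleanup logic; subclasses only declare what to clean."""

    service_name = None
    clientClass = None
    objects = []

    def __init__(self):
        self.client = self.clientClass()

    def clean(self, compartment_id):
        deleted = []
        for spec in self.objects:
            # Resolve the declared method names on the client instance.
            list_fn = getattr(self.client, spec["function_list"])
            delete_fn = getattr(self.client, spec["function_delete"])
            for item in list_fn(compartment_id):
                print("Deleting {} {}".format(spec["name_singular"], item.id))
                delete_fn(item.id)
                deleted.append(item.id)
        return deleted


class functions(SketchClient):
    service_name = "Functions"
    clientClass = FakeFunctionsClient
    objects = [
        {
            "name_singular": "Functions Application",
            "name_plural": "Functions Applications",
            "function_list": "list_applications",
            "function_delete": "delete_application",
        }
    ]


cleaner = functions()
removed = cleaner.clean("ocid1.compartment.example")
```

Because the subclass carries only data, supporting another object type in this sketch means adding another dict rather than another loop.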

In OCI all (or almost all) of the actions (e.g. create, delete, update) are asynchronous; you fire a request, get back a result, and then either wait a bit or poll to see if the request is complete. Generally CRUD operations finish up quickly enough but some may take a while - provisioning a full Exadata Cloud Service database, for example, might take minutes to complete. But for delete I keep simple things simple and just fire off the requests to delete and then move on; I don't wait until the operation completes or bother checking whether it was successful. If I cared about making absolutely sure everything got taken care of and doing it in exactly the right order we'd be using Terraform! But for a script intended to run headlessly and just sort of do its best to clean things up what I described is probably mostly fine.

But there are all sorts of exceptions to the above. For example unlike with Functions, for the OCI Logging service you have to delete all of the Logs in a Log Group before you can delete the Log Group itself. Careful ordering of the objects in my declarations got me pretty close. And I could just let the script clean what it could now and then let the run the next night get the rest. But if it only runs once a day that could mean that certain objects would wind up hanging around for multiple days. And that just wouldn't do.

So in those cases I did want to wait for a delete operation to complete before moving on to the next object. Thankfully the OCI SDK includes a handy set of wrappers for that too - the so-called "composite" class from the SDK hides the complexity of requesting an operation and then waiting for a specific state to be reached, all behind a nice simple interface. And so OCIClient looks to see if you declare that composite class and, if you do, it instantiates and initializes an instance of that class and calls the c_function_delete method instead.

See Streams for example:

class stream( OCIClient ):
    service_name = "Stream Pool"
    clientClass = oci.streaming.StreamAdminClient
    compositeClientClass = oci.streaming.StreamAdminClientCompositeOperations

    objects = [
        {
            "name_singular"      : "Stream",
            "name_plural"        : "Streams",

            "function_list"      : "list_streams",
            # Each item listed here is a Stream (not a Stream Pool), so the log line says so.
            "formatter"          : lambda stream: "Stream with OCID {} / name '{}' is in state {}".format( stream.id, stream.name, stream.lifecycle_state ),

            "c_function_delete"  : "delete_stream_and_wait_for_state",
            "kwargs_delete"      : {
                                    "wait_for_states": ["DELETED"]
                                   }
        },

        {
            "name_singular"      : "Stream Pool",
            "name_plural"        : "Stream Pools",

            "function_list"      : "list_stream_pools",
            "formatter"          : lambda pool: "Stream pool with OCID {} / name '{}' is in state {}".format( pool.id,
                                                                                                              pool.name,
                                                                                                              pool.lifecycle_state ),
            "function_delete"    : "delete_stream_pool",
        },
    ]
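
The choice between a plain delete and a composite "delete and wait" could be sketched roughly as follows. This is an assumption about how such dispatch might work, not the actual OCIClient logic: `pick_delete` and both fake clients are hypothetical, and only the method names and the `wait_for_states` keyword are taken from the declarations above.

```python
calls = []  # shared log so the order of operations is visible


class FakeClient:
    """Hypothetical stand-in for the plain SDK client."""

    def delete_stream_pool(self, ocid):
        calls.append(("plain", ocid))


class FakeComposite:
    """Hypothetical stand-in for the composite operations class."""

    def delete_stream_and_wait_for_state(self, ocid, wait_for_states=()):
        # The real composite call would block here until the state is reached.
        calls.append(("wait", ocid, tuple(wait_for_states)))


def pick_delete(spec, client, composite_client=None):
    """Return a one-argument callable that deletes an object by OCID."""
    kwargs = spec.get("kwargs_delete", {})
    if "c_function_delete" in spec:
        if composite_client is None:
            raise ValueError("c_function_delete declared without a composite client")
        fn = getattr(composite_client, spec["c_function_delete"])
    else:
        fn = getattr(client, spec["function_delete"])
    return lambda ocid: fn(ocid, **kwargs)


stream_spec = {
    "c_function_delete": "delete_stream_and_wait_for_state",
    "kwargs_delete": {"wait_for_states": ["DELETED"]},
}
pool_spec = {"function_delete": "delete_stream_pool"}

client, composite = FakeClient(), FakeComposite()
pick_delete(stream_spec, client, composite)("ocid1.stream.x")    # waits
pick_delete(pool_spec, client, composite)("ocid1.streampool.x")  # fire and forget
print(calls)
```

Declaring the Stream entry first and making it wait is what lets the Stream Pool delete that follows it find an empty pool.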

There are some other interesting things I found in the SDK while I was working on this. For one, while nearly every object in OCI has an OCID, it's only usually true that they have a name and a lifecycle state (e.g. creating, updating, deleting, failed, running, etc). And the name is most often in a field called "display_name" and state in "lifecycle_state", but there are exceptions to every rule. So I added a property in the objects dict for a function to format a friendly "one liner". And you can, of course, provide a lambda function for that.

You can see it above (or here below):

            "formatter"         : lambda pool: "Stream pool with OCID {} / name '{}' is in state {}".format( pool.id, pool.name, pool.lifecycle_state ),

In this case a Stream Pool has "name" instead of display_name. So I return that instead.
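
A default-plus-override arrangement like that might look something like the sketch below. The `default_formatter` and `describe` helpers are hypothetical names, and the objects are `SimpleNamespace` stand-ins rather than real SDK models.

```python
from types import SimpleNamespace


def default_formatter(obj):
    """Fallback one-liner for objects that follow the usual field names."""
    name = getattr(obj, "display_name", "<no name>")
    state = getattr(obj, "lifecycle_state", "<unknown>")
    return "OCID {} / name '{}' is in state {}".format(obj.id, name, state)


def describe(spec, obj):
    """Use the declared formatter if there is one, else the default."""
    return spec.get("formatter", default_formatter)(obj)


# An object with the common field names needs no formatter at all.
vcn = SimpleNamespace(id="ocid1.vcn.x", display_name="demo-vcn", lifecycle_state="AVAILABLE")
vcn_line = describe({}, vcn)

# A Stream Pool exposes "name" rather than "display_name", so it declares one.
pool = SimpleNamespace(id="ocid1.streampool.x", name="demo-pool", lifecycle_state="ACTIVE")
pool_spec = {
    "formatter": lambda pool: "Stream pool with OCID {} / name '{}' is in state {}".format(
        pool.id, pool.name, pool.lifecycle_state
    ),
}
pool_line = describe(pool_spec, pool)

print(vcn_line)
print(pool_line)
```

Using `getattr` with a default keeps the fallback from raising on objects that lack a name or a lifecycle state.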


Iterating over every region, then every service, and then every object type is actually somewhat slow - the number of list calls is the product of all three, so call it O(n^3). And you may be wondering why I didn't use the OCI Search service to get that down to O(n). In fact I can guarantee one of my coworkers is yelling that right now at me (hi Jake).

The obvious (and wrong) answer is that Search doesn't support every object, so I was going to need to write some of this code anyway. And I'm a huge fan of commitment devices because I know myself well enough to know that if I don't do the hard part now I probably never will. So I decided to write the harder code for services that don't support search now, and then later on I'd use Search instead. But once I got the necessary code down to just those 4 lines for each object it was super simple to add any other object that turned up in _ToBeDeleted. I just kept telling myself "just one more test case!"

I probably should have stopped a while back and implemented search, but it was just so, so satisfying to see the script tearing through that I think I lost my way.

And I guess it does mean that I now have what is probably a locally optimal solution to compare the actual optimal solution to when I get around to writing it.

 

I checked the whole thing into GitHub so you can see all my super awesome, incredibly well written code and the dumb mistakes I made along the way.

If you're feeling generous feel free to add other objects or fix any bugs you find and throw me a Pull Request.

And as always, I'd love to hear your feedback!

Christopher Johnson

Director, Cloud Engineering

Former child, Admiral of the bathtub navy, noted author and mixed medium artist (best book report, Ms Russel's 4th grade class, and macaroni & finger paint respectively), Time Person of the Year (2006), Olympic hopeful (and I keep hoping), Grammy Award winner (grandma always said I was the best), and dog owner.

