
Introduction

Customers can use Oracle Cloud Infrastructure (OCI) Alarms to notify their IT staff about events in their environment that need attention. Some customers have a central service management system (also known as a ticketing system) where they track and manage all such events via alerts and tickets. I recently helped a customer with a solution for managing OCI alarms in ServiceNow, one of the popular service management systems. Other customers may find the use case valuable as well, so in this blog post I outline the design of the solution and provide configuration and implementation guidelines to make it easy for others to build similar solutions. Here is a high-level architecture diagram of the solution.

The simple example use case that I am going to use for the blog is this: we have an alarm set up for memory utilization going beyond a certain threshold on a server. This alarm notification should end up creating a ticket in ServiceNow. In the screen captures below you will find that the thresholds were set to a low value. That was done so that we could trigger the alarm and test the solution without having to run heavy workloads on the system. The threshold value is just a number and can easily be set to an appropriate value in the alarm definition.

Design Overview

Let’s start with the major system components that come together and interact to make this solution possible.

  • OCI Monitoring service allows customers to create alarms for keeping an eye on resource behaviors that are interesting and significant for them. Setting these alarms allows administrators to focus their time and energy on higher value tasks rather than regularly monitoring the resources via metrics and charts. Once alarms are set, OCI will notify you of any events that you want to be informed about – as defined in alarm definitions. Managing alarms is described here in OCI documentation and some best practices related to alarms are here.
  • Alarms use OCI Notifications service to notify system administrators about these events. Notifications service uses topics and subscriptions to send out notifications. Messages are published to topics and then sent out to all the subscription channels defined for that topic. OCI Notifications service supports a rich set of subscriptions – Email, PagerDuty, Slack, HTTPS URLs and Oracle Functions. More information about topics and subscriptions is here.
  • Oracle Functions allows development, deployment and execution of applications that implement customers’ business logic without worrying about the infrastructure where it will execute. This enables very quick release cycles as you are only concerned with the development of your business logic and the infrastructure is already taken care of.

With that brief introduction to the principal components of the solution, we are now ready to talk about the design. We used Oracle Functions for this integration between OCI alarms and ServiceNow. The design of the solution is quite simple. OCI triggers alarms either when thresholds defined in alarm definitions are violated or in case of absence of metrics which indicates a resource that is down or unreachable. Here I am going to use alarms based on thresholds as an example. Alarm notifications are sent to a topic. A notification topic has various defined subscriptions. In this case, we used a Function subscription. The Function is invoked by OCI Notifications service. We need to build logic as part of the Function code to transform the OCI notification data to the form as expected by ServiceNow and then invoke ServiceNow APIs to create/manage tickets there.

Implementation Details

In the remaining part of the post I am going to describe function implementation in detail. Rather than repeating the product documentation here, I will provide references to the relevant parts and focus mostly on the design elements of the solution.

Documentation Links:

An alarm message belongs to one of these four types: OK_TO_FIRING, FIRING_TO_OK, REPEAT and RESET. These message types are described here. Alarm message data depends on the message type. Message type data formats are described here. The documentation has an example of an alarm message. I have included some more examples below.

Following are some salient points about the alarm messages:

  • Each alarm message has a “dedupeKey” that can be used to distinguish between messages from different alarms
  • An alarm goes from OK to FIRING state when the metric threshold as defined in alarm definition is violated for any of the resources included by metric description. The corresponding message type is OK_TO_FIRING. This message type will include dimension data for all the resources for which the alarm is firing. Please note that the dimension data may be different depending on the resource type and metric available. If you are bringing your own custom metrics to OCI monitoring service, you could create alarms on those metrics as well. The following picture shows an example of this message type: I have formatted the message data for easy viewing and pointed out salient parts.

Please make a note of the “resourceDisplayName” (pointed by a red arrow). We will see this value mapped to “Node” field in ServiceNow Event. Also, alarm body is mapped to “Description” field in the event.

  • The alarm goes back to OK state when the metric is under the defined threshold for all the resources. The corresponding message type is FIRING_TO_OK and doesn’t have any dimension data. The following picture shows an example of this message type:

Please note that the “dedupeKey” is identical for both these notifications.

  • Repeat messages are like OK_TO_FIRING but have a message type of REPEAT. Repeat messages may have dimension data for different resources than the original OK_TO_FIRING message because the state of the system might have changed. For example, let’s say that we have a cluster of three servers (“server A”, “server B” and “server C”) for a web application and we have defined an alarm for monitoring memory utilization on these servers. Things are running smoothly and the alarm is in OK state. Now server A’s memory utilization breaks the threshold limit. The alarm switches from OK to FIRING state and an OK_TO_FIRING notification is triggered, which will include server A’s details in the dimension data. Things calm down on server A; however, before the alarm goes to OK, server B’s memory utilization breaks the threshold limit. If configured, repeat notifications will trigger. However, instead of server A, the dimension data will have details of server B because the system state has changed now. Please note that there can be multiple resources in the dimension data.
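To make the three-server example concrete, here is a minimal sketch of how the resources named in a message could be pulled out. It assumes the dimension data sits under "alarmMetaData" and "dimensions", as in the example message in the OCI documentation. The trimmed-down messages and the "example-key" value are hypothetical.

```python
def firing_resources(alarm_message: dict) -> list:
    """Return the resourceDisplayName of each resource in the dimension data."""
    names = []
    # Assumed layout: alarmMetaData is a list, each entry has a "dimensions" list.
    for meta in alarm_message.get("alarmMetaData", []):
        for dimension in meta.get("dimensions", []):
            name = dimension.get("resourceDisplayName")
            if name and name not in names:
                names.append(name)
    return names


# Hypothetical, trimmed-down messages for the three-server scenario.
ok_to_firing = {
    "dedupeKey": "example-key",
    "type": "OK_TO_FIRING",
    "alarmMetaData": [{"dimensions": [{"resourceDisplayName": "server A"}]}],
}
repeat = {
    "dedupeKey": "example-key",
    "type": "REPEAT",
    "alarmMetaData": [{"dimensions": [{"resourceDisplayName": "server B"}]}],
}
# FIRING_TO_OK carries no dimension data.
firing_to_ok = {"dedupeKey": "example-key", "type": "FIRING_TO_OK"}

for message in (ok_to_firing, repeat, firing_to_ok):
    print(message["type"], firing_resources(message))
```

All three messages share the same dedupeKey, so the key identifies the alarm, while the dimension data identifies the resources affected at that moment.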

Function Implementation

Having covered the basics of alarms and message types we are now ready to talk about the Function that could be used for the integration.

Notification data is identical for all subscription types. For alarm notifications, the data is a serialized JSON object like what is shown above.

If you are new to Oracle Functions, here is a good quick-start tutorial to get you up to speed. The service documentation is here and more details on how they work are here.

Oracle Functions is based on the open source Fn project. You can use multiple programming languages to develop your function code, and you have complete control over how you build the integration. You can also invoke OCI APIs from inside your function code for various use cases. For example, instead of storing sensitive information like passwords, tokens and other secrets in plaintext in either Function configuration or environment variables, you could manage secrets in the OCI Vault service and read them using APIs, which is a much more secure approach. There is a nice blog post by my colleague Kiran Thakkar on the subject, complete with code examples.

I am going to use Python as the language of choice for this example. However, the concepts remain the same in other languages as well. In Python, the entry point of your Function is the handler method which has the following signature:

def handler(ctx, data: io.BytesIO = None):

As mentioned above, in case of alarm notifications this input data is a serialized JSON object, which could be deserialized into a Python dictionary using something like:

funDataStr = data.read().decode('utf-8')
funDataJSON = json.loads(funDataStr)

Once you have the JSON object you could use the usual JSON manipulation techniques to extract and analyze the contained information and take appropriate action.
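Putting the signature and the parsing step together, a handler could look roughly like the sketch below. The helper name parse_alarm and the returned summary are my own choices for illustration. A deployed function would typically wrap its result in the FDK's response object, which I have left out so that the sketch has no dependencies beyond the standard library.

```python
import io
import json


def parse_alarm(data: io.BytesIO) -> dict:
    """Decode the notification payload; return {} if it is missing or not JSON."""
    raw = data.read().decode("utf-8") if data is not None else ""
    try:
        return json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}


def handler(ctx, data: io.BytesIO = None):
    alarm = parse_alarm(data)
    # Pick out the fields that drive the rest of the integration.
    return {
        "type": alarm.get("type"),
        "dedupeKey": alarm.get("dedupeKey"),
        "severity": alarm.get("severity"),
    }
```

Returning an empty dictionary on bad input keeps the function from failing on a malformed payload. Whether to fail loudly instead is a design decision for your own integration.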

Functions also accept configuration parameters that are passed in the “ctx” parameter. For example, we can pass ServiceNow URL, User Id and Password as function’s configuration parameters so that we don’t need to hard-wire any of these inside function code. As mentioned above, we should store the User Id and Password in a vault and pass the OCIDs of these secrets into the function configuration. The function can then use OCI APIs to get the secrets from the vault. You will need to grant the function privileges for reading secrets from a particular vault using OCI IAM Policies. Here is a screen capture of the function’s configuration:

This data is available inside the function in a dictionary like object:

ctxConfig = ctx.Config()
snowURL = ctxConfig['SNOW_URL']
snowUsrIDSec = ctxConfig['SNOW_USER_ID_SEC']
snowUsrPwdSec = ctxConfig['SNOW_USER_PWD_SEC']

Please note that “snowUsrIDSec” and “snowUsrPwdSec” are OCIDs of the corresponding secrets stored in an OCI vault. OCI provides APIs to read secrets from vault provided proper authorizations are in place. The Python SDK APIs are here. Kiran’s blog also has samples in Python.
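As a rough sketch of the vault lookup, the code below reads the two secrets with the OCI Python SDK. It assumes the function authenticates as a resource principal and that the secret content is base64-encoded. The helper names (read_secret, make_secrets_client, load_servicenow_settings) are hypothetical. The client is passed in as a parameter so that the decoding logic can be exercised without an OCI connection.

```python
import base64


def read_secret(secrets_client, secret_ocid: str) -> str:
    """Fetch a secret bundle by OCID and decode its base64 content."""
    bundle = secrets_client.get_secret_bundle(secret_ocid).data
    encoded = bundle.secret_bundle_content.content
    return base64.b64decode(encoded).decode("utf-8")


def make_secrets_client():
    """Build a SecretsClient using the function's resource principal (assumed)."""
    import oci  # imported here so the rest of the sketch runs without the SDK

    signer = oci.auth.signers.get_resource_principals_signer()
    return oci.secrets.SecretsClient(config={}, signer=signer)


def load_servicenow_settings(config: dict, secrets_client) -> dict:
    """Resolve the ServiceNow URL and credentials from the function configuration."""
    return {
        "url": config["SNOW_URL"],
        "user": read_secret(secrets_client, config["SNOW_USER_ID_SEC"]),
        "password": read_secret(secrets_client, config["SNOW_USER_PWD_SEC"]),
    }
```

Inside the handler this could be called as load_servicenow_settings(ctx.Config(), make_secrets_client()), provided the IAM policy mentioned above allows the function to read the secrets.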

ServiceNow Integration

For integration with ServiceNow, the information from alarm notification data could be used to create a ServiceNow event. The attributes of the event could simply be derived by extracting and mapping (/transforming) the information contained in the alarm data. You have the full freedom to create and enrich the ServiceNow event as per your use-cases by using OCI APIs and the power that the programming environment provides. Here are a couple of examples of such mapping:

  • The severity levels for OCI alarms are – CRITICAL, WARNING, ERROR and INFO whereas ServiceNow expects numeric severity levels. Here is a sample mapping:

Sample OCI Alarm to ServiceNow Severity Mapping

OCI Severity Level    ServiceNow Severity Level
CRITICAL              1
WARNING               2
ERROR                 3
INFO                  4

 

  • Similarly, “resourceDisplayName” from dimension data could be used for “Node” attribute of the ServiceNow event
  • The “Description” field in the event could be set to the alarm body
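The mappings above could be sketched as follows. The severity numbers are the sample mapping from this post, and the "node" and "description" fields follow the text. The other event fields (source, type, message_key), the default source name and the fallback severity are assumptions added for illustration, so adjust them to what your ServiceNow instance expects.

```python
# Sample severity mapping from this post (not a prescribed one).
SEVERITY_MAP = {"CRITICAL": 1, "WARNING": 2, "ERROR": 3, "INFO": 4}


def build_events(alarm: dict, source: str = "OCI Monitoring") -> list:
    """Build one ServiceNow event dictionary per resource in the dimension data."""
    # Assumption: unknown severities fall back to the lowest level, 4.
    severity = SEVERITY_MAP.get(alarm.get("severity"), 4)
    events = []
    for meta in alarm.get("alarmMetaData", []):
        for dimension in meta.get("dimensions", []):
            node = dimension.get("resourceDisplayName", "")
            events.append({
                "source": source,                      # assumed field
                "node": node,                          # from resourceDisplayName
                "type": alarm.get("title", ""),        # assumed field
                "severity": str(severity),
                "description": alarm.get("body", ""),  # alarm body
                # Assumed: one key per alarm and resource pair.
                "message_key": f"{alarm.get('dedupeKey', '')}:{node}",
            })
    return events
```

Creating one event per resource is only one option. You could equally send a single event per alarm message and carry the resource list in the description.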

Once you have constructed the event, it could be sent via an HTTP POST call to the “https://<Instance Name>.service-now.com/api/now/table/em_event” end-point exposed by ServiceNow. Please note that you need to provide appropriate authentication information along with the API call.
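A minimal sketch of that POST call, using only the Python standard library, is shown below. It assumes HTTP Basic authentication, which is only one of the options ServiceNow supports. The function names and the "example" instance name are hypothetical.

```python
import base64
import json
import urllib.request


def build_event_request(instance_name: str, user: str, password: str,
                        event: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates an em_event record."""
    url = f"https://{instance_name}.service-now.com/api/now/table/em_event"
    # Assumption: Basic authentication with the credentials read from the vault.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_event(request: urllib.request.Request, timeout: int = 10) -> int:
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_event_request("example", "svc_user", "s3cret", {"node": "server A"})
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes the mapping easy to test without a live ServiceNow instance.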

Please keep in mind that the purpose of this post is to demonstrate the rich integration capabilities that OCI Notifications and Functions provide for managing and servicing your alarms in a service management system like ServiceNow rather than prescribing any API on the ServiceNow side or any particular mapping/transformation for creating the ServiceNow event.

For advanced integration scenarios, you could also keep track of (in a cache) what individual resources the alarm has fired for so that when the alarm switches back to OK, appropriate CLEAR events could be created in ServiceNow and information correlated.
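The tracking idea could be sketched like this. The FiringTracker class is hypothetical and keeps its state in memory purely for illustration. State held inside a function is not guaranteed to survive between invocations, so a real integration would keep it in an external cache or datastore.

```python
class FiringTracker:
    """Remember which resources each alarm (by dedupeKey) has fired for."""

    def __init__(self):
        self._firing = {}  # dedupeKey -> set of resource display names

    def update(self, alarm: dict) -> list:
        """Record firing resources; on FIRING_TO_OK return the resources to clear."""
        key = alarm.get("dedupeKey")
        names = {
            dimension.get("resourceDisplayName")
            for meta in alarm.get("alarmMetaData", [])
            for dimension in meta.get("dimensions", [])
            if dimension.get("resourceDisplayName")
        }
        message_type = alarm.get("type")
        if message_type in ("OK_TO_FIRING", "REPEAT"):
            self._firing.setdefault(key, set()).update(names)
            return []
        if message_type == "FIRING_TO_OK":
            # FIRING_TO_OK has no dimension data, so rely on what was recorded.
            return sorted(self._firing.pop(key, set()))
        return []  # other message types are ignored in this sketch
```

For each resource returned on FIRING_TO_OK, the function could then send a CLEAR event to ServiceNow, reusing whatever correlation key was sent with the original event.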

Solution Configuration

Let’s look at how the whole integration is wired. Since pictures are worth a thousand words, I will let them speak:

  • Alarm Definition: Alarm Definition has Notification Topic configured as destination (pointed by red arrow).

  • Notification Topic Configuration: A Notification Topic has one or more subscriptions. Please make note of the Function subscription (highlighted by a red oval)

I have already provided a screen capture of the Function definition above.

When the alarm fires, the notification is sent to the topic, which in turn triggers the Function. If the Function is set up properly, it will create an event in ServiceNow. Following are the screen captures from ServiceNow:

  • ServiceNow Event: Please note that the fields in the event are mapped from the alarm notification data.

  • ServiceNow Alert:

Conclusion

That completes our design and configuration/implementation of the use-case.

Hopefully this will enable you to manage your OCI alarms in a centralized way in your service management systems, if your use-cases ask for it.

Pulkit Sharma

