OCI Native Monitoring and Alerting for Autonomous Database and DB Systems

November 15, 2019 | 9 minute read
Rishi Mahajan
Consulting Solutions Architect

Introduction

For mission-critical applications running in the cloud, it is important to have a robust mechanism to monitor your cloud infrastructure (including applications and databases) and to be alerted to any failures or issues.

This blog post looks specifically at the Oracle Cloud Infrastructure (OCI) Monitoring, Events and Notifications services, and how to use them to monitor and capture service events for your autonomous databases and database systems. An autonomous database is a fully managed database service, while database systems include single-node and two-node Virtual Machine, Bare Metal and Exadata database systems. The post also shows how these services can be integrated with third-party services such as PagerDuty and Slack for operations management.

OCI Monitoring Service

The OCI Monitoring service enables customers to perform active and passive monitoring of cloud resources using its metrics and alarms features. Services such as Compute and Autonomous Database continuously emit metrics to the Monitoring service for active monitoring, and alarms use these metrics for passive monitoring. Each alarm has a trigger rule and a notification destination: when a metric breaches the alarm threshold, the alarm fires and a notification is sent.
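The threshold behaviour described above can be illustrated with a small simulation. This is only a sketch of the idea, not how the Monitoring service is implemented: the `Alarm` class, the `evaluate` function and the sample values are all hypothetical, and the alarm is assumed to fire once the metric has stayed above the threshold for a given number of consecutive one-minute datapoints.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alarm:
    """Hypothetical, simplified alarm definition (not an OCI SDK class)."""
    name: str
    threshold: float        # metric value that counts as a breach
    pending_minutes: int    # consecutive breaching datapoints before firing


def evaluate(alarm: Alarm, datapoints: Iterable[float]) -> List[str]:
    """Return the alarm state ("OK" or "FIRING") after each datapoint."""
    states = []
    consecutive_breaches = 0
    for value in datapoints:
        if value > alarm.threshold:
            consecutive_breaches += 1
        else:
            # Any datapoint at or below the threshold resets the counter.
            consecutive_breaches = 0
        firing = consecutive_breaches >= alarm.pending_minutes
        states.append("FIRING" if firing else "OK")
    return states


# Made-up CPU utilization samples, one per minute.
cpu_alarm = Alarm(name="cpu_example", threshold=80.0, pending_minutes=5)
cpu_samples = [60, 85, 90, 70, 82, 88, 91, 95, 99, 50]
states = evaluate(cpu_alarm, cpu_samples)
print(states)
```

In this sample the short spike in minutes 2 and 3 does not fire the alarm; only the five-minute run that ends at the ninth datapoint does, and the alarm returns to OK as soon as the metric drops back below the threshold.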

OCI Events Service

Oracle Cloud Infrastructure Events inform you about state changes of cloud resources. Examples of events include the creation or termination of an autonomous database or DB system, the beginning or end of a backup of an autonomous database or DB system, and a Data Guard switchover or failover of a database system. This state change information can be sent as a notification to different channels, such as email or third-party integrations, using the Notifications service.

Events also enable you to create automation based on the state change of resources. For example, you can trigger a data load into an autonomous database when an event reports that a file has been uploaded to Object Storage. Such events are fed to the Oracle Functions service to invoke the event-based automation. The OCI Events service can also send information to the OCI Streaming service.
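To make the idea of matching events to actions concrete, here is a minimal sketch in Python. It assumes that a rule condition is a JSON document listing the event types to match, and that each rule maps to one action. The event type strings, rule names and action labels are assumptions for illustration only; check the Events documentation for the exact event type names emitted by each service.

```python
import json
from typing import Dict, List

# Assumed event type names, used here only for illustration.
ADB_CREATE_BEGIN = "com.oraclecloud.databaseservice.autonomous.database.instance.create.begin"
ADB_CREATE_END = "com.oraclecloud.databaseservice.autonomous.database.instance.create.end"
OBJECT_CREATED = "com.oraclecloud.objectstorage.createobject"

# Hypothetical rules: a JSON condition plus a label for the action to take.
RULES: List[Dict[str, str]] = [
    {
        "name": "notify_on_adb_create",
        "condition": json.dumps({"eventType": [ADB_CREATE_BEGIN, ADB_CREATE_END]}),
        "action": "notify",
    },
    {
        "name": "load_on_upload",
        "condition": json.dumps({"eventType": [OBJECT_CREATED]}),
        "action": "invoke_function",
    },
]


def matching_actions(event: Dict[str, str]) -> List[str]:
    """Return the actions of every rule whose condition lists the event's type."""
    actions = []
    for rule in RULES:
        condition = json.loads(rule["condition"])
        if event.get("eventType") in condition.get("eventType", []):
            actions.append(rule["action"])
    return actions


print(matching_actions({"eventType": OBJECT_CREATED}))
print(matching_actions({"eventType": ADB_CREATE_BEGIN}))
```

A file upload event is routed to the function action, while the database creation events are routed to the notification action. An event type that no rule lists produces no action at all.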

OCI Notifications Service

The Oracle Notifications service broadcasts messages. It can be integrated with the OCI Monitoring and Events services to send notifications to notification channels. The channel can be email, an HTTPS endpoint, Slack or PagerDuty. PagerDuty is a SaaS-based incident response platform for IT departments. Slack is cloud-based team collaboration software.

Configuration

With that, let's see an end-to-end flow for monitoring databases using these services.

In our use case, we'll use these services and third-party integrations to configure monitoring for our cloud resources. We will create alarms to monitor autonomous database metrics and trigger notifications. We'll also use Events for the autonomous database and database system to send notifications. The notification channels will be Email, PagerDuty and Slack. It is not necessary to use all of these channels, but for this post I have configured all of them, and the same notifications will be sent to each one.

The first step is to configure Notifications.

1) Under OCI Notifications service, create a topic with name “oci_monitoring”.

2) Under the “oci_monitoring” topic, create 3 subscriptions: Email, PagerDuty and Slack.

A single topic can have multiple channels (or protocols, in OCI terminology) configured for sending notifications.

We may not need all 3 subscriptions. They have been created here just to test the Notifications service with all channels: Email, PagerDuty and Slack.

  • Create the first subscription with Protocol set to “Email” and Email set to the address where notifications should be received.

  • Create the second subscription with Protocol set to PagerDuty. This requires a PagerDuty integration key.

For the PagerDuty integration to work, a service must be created on the PagerDuty portal with the integration type "Oracle Cloud Infrastructure Monitoring". The integration key generated for the service must then be added to the PagerDuty subscription on the OCI side. Here the service has been created with the name "oci_infrastructure monitoring", and it can receive notifications from OCI Monitoring alarms via the Notifications service.

A second PagerDuty service must be created to receive notifications from OCI Events. This service uses the integration type "Custom Event Transformer". Its integration key is likewise added to a subscription on the OCI side. With two PagerDuty services, there will be 2 PagerDuty subscriptions under the Notifications topic on the OCI side.

  • Create the third subscription with Protocol set to Slack. This requires a Slack webhook.

For the Slack integration, we need a Slack app (create a new one or use an existing one) in the workspace. Then create an incoming webhook, and finally add the webhook to the workspace to choose the channel for notifications. The webhook URL must be added to the Slack subscription on the OCI side.

The state of all three subscriptions will now be “Pending”.

3) Acknowledge subscriptions in respective channels.

Creating these subscriptions sends a confirmation notification to each configured protocol: Email, PagerDuty and Slack. Click on the links to confirm each subscription.

The subscription on the PagerDuty portal will look something like this.

After the subscriptions are confirmed from their respective channels, their state changes from Pending to Available.
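The relationship between one topic, its subscriptions and the confirmation step can be sketched as follows. This is a toy model rather than the Notifications API: the `Topic` and `Subscription` classes, the `confirmed` flag and the endpoint values are hypothetical placeholders, and the model simply assumes that a message is delivered only to subscriptions that have been confirmed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subscription:
    """Hypothetical model of a subscription (not an OCI SDK class)."""
    protocol: str            # "EMAIL", "PAGERDUTY" or "SLACK"
    endpoint: str            # placeholder value, not a real endpoint
    confirmed: bool = False  # False until the confirmation link is clicked


@dataclass
class Topic:
    """Hypothetical model of a topic that fans out to its subscriptions."""
    name: str
    subscriptions: List[Subscription] = field(default_factory=list)

    def publish(self, message: str) -> Dict[str, str]:
        """Deliver the message to every confirmed subscription."""
        return {s.protocol: message for s in self.subscriptions if s.confirmed}


topic = Topic("oci_monitoring", [
    Subscription("EMAIL", "ops@example.com"),
    Subscription("PAGERDUTY", "<integration-key>"),
    Subscription("SLACK", "<webhook-url>"),
])

# Nothing is delivered while the subscriptions are still unconfirmed.
before = topic.publish("test message")

# Simulate confirming each subscription from its own channel.
for sub in topic.subscriptions:
    sub.confirmed = True
after = topic.publish("test message")

print(before)
print(sorted(after))
```

The point of the sketch is the fan-out: alarms and event rules only need to know the topic, and every confirmed channel under it receives the same message.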

4) After configuring notifications, the next step is to create rules to capture events under the OCI Events service.

I have created a rule "oci_events_monitoring" with these events and the action set to the "oci_monitoring" topic.

5) The next step is to create alarms under the OCI Monitoring service.

I have created an alarm "oci_atp_cpu_critical" for the autonomous database. The alarm monitors the CpuUtilization metric and triggers a critical notification when CpuUtilization > 80% for 5 continuous minutes.

Another alarm, oci_atp_storage_util_warning, sends a warning alert when autonomous database storage utilization reaches 80%.

We can create more alarms for other metrics available for database services.
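As a rough illustration, the two alarms from step 5 could be written down as data like this. The helper `build_alarm` is hypothetical, and several details are assumptions not stated in this post: the `oci_autonomous_database` metric namespace, the `StorageUtilization` metric name, the one-minute `mean()` query form, the `PT5M` trigger delay applied to both alarms, and the placeholder topic OCID. Treat it as a sketch of what the console form captures, not as an exact API payload.

```python
from typing import Dict, List, Union

# Placeholder only; the real value is the OCID of the oci_monitoring topic.
TOPIC_OCID = "<ocid-of-oci_monitoring-topic>"


def build_alarm(display_name: str, metric: str, operator: str,
                threshold: int, severity: str,
                pending_duration: str = "PT5M") -> Dict[str, Union[str, List[str]]]:
    """Build a hypothetical alarm definition as a plain dictionary."""
    return {
        "displayName": display_name,
        "namespace": "oci_autonomous_database",      # assumed namespace
        # Assumed query form: mean of the metric over 1-minute intervals.
        "query": f"{metric}[1m].mean() {operator} {threshold}",
        "severity": severity,
        "pendingDuration": pending_duration,          # ISO 8601 duration
        "destinations": [TOPIC_OCID],
    }


cpu_alarm = build_alarm("oci_atp_cpu_critical", "CpuUtilization", ">", 80, "CRITICAL")
storage_alarm = build_alarm("oci_atp_storage_util_warning", "StorageUtilization",
                            ">=", 80, "WARNING")

print(cpu_alarm["query"])
print(storage_alarm["query"])
```

Both alarms point at the same topic, so each one reaches Email, PagerDuty and Slack without any channel-specific configuration on the alarm itself.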

Testing

Our events rule includes the begin and end events for autonomous database creation. So we'll create an autonomous database and check whether this triggers notifications on all three channels.

[opc@ebsbasash11 atp]$ oci db autonomous-database create --db-name atpdb --compartment-id $COMPARTMENT_ID --cpu-core-count 1 --data-storage-size-in-tbs 1 --admin-password xxxxxxxx --db-workload OLTP --display-name atpdb --license-model LICENSE_INCLUDED

[opc@ebsbasash11 atp]$ oci db autonomous-database list --compartment-id $COMPARTMENT_ID --output table --query "data [*].{dbname:\"display-name\",state:\"lifecycle-state\"}"

+----------+------------+
| dbname   | state      |
+----------+------------+
| atpdb    | AVAILABLE  |
+----------+------------+

This generated 2 notifications (begin and end of service creation) on each channel.

Similarly, a startup or shutdown, a wallet download, or a backup of an autonomous database or DB system will generate a begin and an end event.

[opc@atp-devops atp]$ oci db autonomous-database stop --autonomous-database-id ocid1.autonomousdatabase.oc1.iad.xxxxxxxxxxxxxxxxxxxxxxx 

 

For database system events, I have added the Data Guard association event to the existing rule. The rule will capture the Data Guard switchover event and notify when a switchover is invoked for a Data Guard association.

This is the begin notification generated for a Data Guard switchover.

 

Next, let's see an alert generated by the Monitoring service.

To test the oci_atp_cpu_critical alarm, I stressed the CPUs on the autonomous database to check whether the condition is captured and a notification is generated.

The Service Metrics page under the Monitoring service shows the peak CPU consumption for the autonomous database.

Once the threshold period of 5 minutes is over, the alarm status changes to "Firing" and a notification is generated.

Like events, the alarm also sent notifications to all the notification channels.

So we can use these services to configure metrics, alarms and event capturing for operations management.

Conclusion

With the Oracle Cloud Infrastructure Monitoring, Events and Notifications services, we can monitor cloud infrastructure resources. They can also be integrated easily with third-party operations management solutions for day-to-day operations.

For more information on Oracle Cloud Infrastructure services, see the documentation.

To test these features and many more, try Oracle Cloud Free Tier.
