The End-to-End monitoring use case aims to deliver the service levels and key performance objectives of an organization's business transactions running on Oracle Cloud Infrastructure (OCI). As organizations move workloads to OCI, they need to determine whether service levels are consistent with, or improved from, their on-premises environment. In addition, if those OCI workloads are integrated with other deployments (on-premises or in other clouds), they need visibility across the hybrid and multi-cloud estate. In all cases, when performance issues occur, they need visibility into their business transactions to understand where the root cause originates, so they can quickly remediate the issue and ensure a positive end-user experience.
This is the first of the series of blogs on this topic co-authored by Gustavo Saurez, Johannes Murmann, Pulkit Sharma, Richard Jacobs, Uday Sambhara.
Getting a unified view of how applications are performing, and spotting potential issues before they impact customers, is crucial to the business. Monitoring is not limited to performance metrics or host error logs. It can span a variety of data sources, such as audit logs from OCI, application and stack component telemetry, user behavior, individual applications, and usage data from OCI detailing which resources are used and in what quantity.
For workloads on OCI, this spans all the monitoring we can provide, from the client’s browser all the way to the individual components of OCI. This offers a unified view of the performance and health of a solution at all tiers. When a problem is discovered, purpose-built machine learning applied to the large volume of metrics and logs allows the user to quickly drill into the details and get a better understanding of the root cause and remediation steps.
Besides monitoring, we also provide functionality that proactively helps customers identify and avoid potential issues. This is achieved by applying anomaly detection algorithms to metric or log data, forecasting resource usage, and scheduling health checks that probe critical parts of the solution. Alerts can be triggered on a wide range of conditions and thresholds, and can be used to notify people or trigger automated remediation tasks.
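To make the idea of anomaly detection on metric data more concrete, here is a minimal sketch in Python. It is not the algorithm used by any OCI service. It assumes a simple z-score test, and the function name `find_anomalies`, the sample values and the threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Return the indices of points whose z-score exceeds `threshold`.

    Illustrative only: a z-score test is one of the simplest possible
    anomaly detectors and is an assumption of this sketch.
    """
    if len(values) < 2:
        return []  # not enough data to estimate a spread
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all points identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical CPU utilization samples (percent) with one obvious spike.
cpu_samples = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
anomalous = find_anomalies(cpu_samples)
for index in anomalous:
    print(f"sample {index} looks anomalous: {cpu_samples[index]}%")
```

In practice the flagged indices would feed an alert or a remediation task rather than a print statement.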
Today’s applications and solutions can be made up of many components using different technologies, or residing in different public clouds, cloud-at-customer or on-premises environments.
OCI offers a mix of out-of-the-box (OOTB) and add-on services to choose from, depending on the need for customization, to provide customers with the end-to-end monitoring they need to ensure their workloads are running optimally. At the time of writing, some of the out-of-the-box services available to clients are the following:
OCI Event Service emits events, which are structured messages that indicate a state change in OCI resources. Launching an instance, terminating an instance, and creating, updating or deleting an object are examples of events. Events can be routed through the Notification Service to the appropriate channels, or fed into Functions to take action, such as notifying a specific team when an instance is launched. A list of OCI services that emit events can be found here.
Metrics from OCI Monitoring are available out of the box in Metrics Explorer, which provides a comprehensive view of metrics in the OCI console. The Monitoring service allows thresholds to be defined on resource metrics to generate alarms, and alarms can in turn feed into the Notification Service. OCI metrics are also accessible for integration with third-party tools that are cloud-vendor agnostic, such as Grafana, an open source platform for monitoring and analytics. OCI provides a Grafana plugin that enables OCI as a data source, so that metrics of OCI resources, or of OCI resources used by an OKE cluster, can be viewed in a single Grafana dashboard. The list below shows the native OCI services that emit metrics at the time of writing this blog post.
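The flow described above (an event is emitted, matched against rules, and handed to a notification channel or a function) can be sketched in plain Python. This is a toy model, not the OCI SDK: the `Event` and `Rule` classes, the `route` function and the event type strings are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Event:
    """A structured message describing a state change (hypothetical shape)."""
    event_type: str
    source: str
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    """Matches a set of event types and names the action to run."""
    event_types: Set[str]
    action: Callable[[Event], str]


def notify_team(event: Event) -> str:
    # Stand-in for publishing to a notification topic.
    return f"notify ops-team: {event.event_type} on {event.data.get('name', '?')}"


def run_function(event: Event) -> str:
    # Stand-in for invoking a serverless function.
    return f"function invoked for {event.event_type}"


def route(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose event types include this event."""
    return [rule.action(event) for rule in rules if event.event_type in rule.event_types]


# Event type strings below are placeholders, not real OCI event types.
rules = [
    Rule({"example.instance.launch"}, notify_team),
    Rule({"example.instance.launch", "example.object.delete"}, run_function),
]
launch = Event("example.instance.launch", "compute", {"name": "web-01"})
results = route(launch, rules)
for line in results:
    print(line)
```

Keeping the matching rule separate from the action mirrors the description above: the same event can fan out to several destinations without the emitter knowing about any of them.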
WAF (Web Application Firewall)
More information about the metrics service can be found here, including an up-to-date list of OCI services that emit metrics.
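As a rough illustration of how a threshold on a metric turns into an alarm, consider the sketch below. It does not use the OCI Monitoring API or its query language. The function `alarm_state`, the state names and the rule "fire only after several consecutive breaches" are assumptions made for this example.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float, min_breaches: int = 3) -> str:
    """Return "FIRING" if the last `min_breaches` datapoints all exceed
    `threshold`, otherwise "OK".

    Requiring several consecutive breaches is an assumption of this
    sketch, used to avoid alarming on a single noisy sample.
    """
    if len(datapoints) < min_breaches:
        return "OK"  # not enough history to decide
    recent = datapoints[-min_breaches:]
    return "FIRING" if all(value > threshold for value in recent) else "OK"


# Hypothetical CPU utilization datapoints (percent), one per minute.
cpu = [50, 85, 90, 95]
state = alarm_state(cpu, threshold=80)
print(f"CPU alarm is {state}")
```

A firing state is the point at which the alarm would hand off to a notification channel.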
Health Checks provide users with external monitoring capabilities to determine the availability and performance of any publicly facing services, including hosted websites, API endpoints, or externally facing load balancers.
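The kind of external probe described here can be approximated with a few lines of Python. This sketch is not the OCI Health Checks service. It assumes a plain HTTP GET using the standard library, and the names `http_fetch` and `run_health_check`, the latency limit and the status labels are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

FetchResult = Tuple[Optional[int], float]  # (HTTP status or None, seconds elapsed)


def http_fetch(url: str, timeout: float = 5.0) -> FetchResult:
    """Issue a GET request and report the status code and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status: Optional[int] = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered with a 4xx or 5xx status
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, timeout, ...
    return status, time.monotonic() - start


def run_health_check(
    url: str,
    fetch: Callable[[str], FetchResult] = http_fetch,
    max_latency: float = 2.0,
) -> str:
    """Classify an endpoint as "UP", "DEGRADED" (slow) or "DOWN"."""
    status, elapsed = fetch(url)
    if status is None or not 200 <= status < 400:
        return "DOWN"
    return "UP" if elapsed <= max_latency else "DEGRADED"


# The fetch function is injectable, so the check can be exercised offline.
print(run_health_check("https://example.com/health", fetch=lambda url: (200, 0.12)))
```

Passing the fetch function as a parameter keeps the classification logic separate from the network call, which makes the check easy to schedule and to test.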
Audit Log events of API operations on OCI resources are available in the OCI Audit console and can be exported through the command line interface (CLI) or the REST API for consumption by third-party tools such as Splunk, giving a single-pane-of-glass view.
While the out-of-the-box services are available at no cost, clients can choose add-on services to consume the data and provide analytics and further insights, as discussed later in this document.
The add-on services are the Log Analytics service, the Infrastructure Monitoring service, the Application Performance Monitoring service (APM) and the IT Analytics Service.
We will examine the different types of metrics and logs that can be collected and ingested by these services, and how APM can feed into the IT Analytics service. Log and metric data can be provided in three ways: by native OCI resources emitting data automatically, by prepackaged agents that customers can deploy, or by using industry-standard interfaces such as REST APIs.
In the next blog, we will look at a simple two-tier architecture for the End-to-End monitoring use case. Stay tuned.