In Part 3 of this series, we take a detailed look at the add-on services available in OCI for end-to-end monitoring. Part 1 and Part 2 covered the features of the out-of-the-box OCI monitoring services, a simple 2-tier architecture, and a complex EBS use case. At the time of writing, the available add-on services are Log Analytics, Application Performance Monitoring, Infrastructure Monitoring, and IT Analytics. This post is co-authored by Gustavo Saurez, Johannes Murmann, Pulkit Shama, Richard Jacobs and Uday Sambhara.
Log Analytics Service
Log Analytics Service handles log event ingestion, analysis, field enrichment, and indexing. Log data can be securely ingested from multiple sources. Because the service understands log structure and content through predefined entity types and out-of-the-box log sources, logs collected from infrastructure and applications help correlate activities, troubleshoot issues, and offer actionable insights. Log Analytics currently supports three methods for ingesting logs:
On-Demand Upload Client
REST API (requires a Service Request in order to be enabled)
A Cloud Agent (the same one used to monitor server-side metrics) can continuously collect log files of registered entities. The Log source needs to be associated with an Entity Type in order for the cloud agent to know what it should collect. Typically, the agent is installed on the same server that also hosts the logs.
For ad-hoc or on-demand scenarios, the On-Demand Upload (ODU) Client can be used. The ODU Client is a command-line tool that can ingest multiple files in one operation. REST APIs are another option that developers and DevOps teams can use to orchestrate log ingestion. Once the data is loaded and enriched, you can analyze and explore (search) the logs, correlate and obtain key values, and gain operational insight from the data.
Interactive data visualization is key. It provides various options for comparing and contrasting data on one or more parameters using graphs such as pie charts, bar charts, and histograms. This helps in summarizing the data and drilling further into linked datasets where needed. For advanced analysis, the Log Analytics service uses machine learning to identify patterns from different datasets and group them into clusters. Clustering the patterns from different datasets surfaces outliers, that is, patterns that stand out. By contrast, if a use case or advanced analysis requires picking specific log records from different datasets and grouping them for further analysis, the ‘Link’ capability provides that functionality. Linking on clusters is a hybrid use case: the Log Analytics service clusters on a pattern, and Link is then used to group log records for additional patterns or analysis.
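To make the idea of clustering log records by pattern more concrete, here is a minimal sketch in Python. It is not the algorithm the Log Analytics service uses (the service relies on machine learning); it only illustrates the general idea of collapsing records that share a shape into one cluster and treating very small clusters as candidates for outliers. The masking rules and the function names signature, cluster, and rare_patterns are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical masking rules: variable tokens are replaced with placeholders
# so that records with the same shape collapse into one signature.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(line):
    """Return the pattern of a log line with variable parts masked out."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def cluster(lines):
    """Group log lines by their masked signature."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[signature(line)].append(line)
    return dict(clusters)


def rare_patterns(clusters, max_size=1):
    """Return signatures whose cluster is small enough to stand out."""
    return [sig for sig, members in clusters.items() if len(members) <= max_size]


if __name__ == "__main__":
    sample = [
        "Connection from 10.0.0.1 port 22",
        "Connection from 10.0.0.2 port 2222",
        "Disk failure on device 3",
    ]
    groups = cluster(sample)
    for sig, members in groups.items():
        print(len(members), sig)
    print("stands out:", rare_patterns(groups))
```

In this toy version the two connection records fall into one cluster, while the disk failure record forms a cluster of its own and is reported as a pattern that stands out.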
Oracle Cloud Infrastructure Monitoring Service provides proactive monitoring for the entire IT infrastructure. Customers can monitor status and health across tiers, be alerted about issues, and troubleshoot and resolve them before they affect users. This section provides an overview of the Oracle Infrastructure Monitoring service and its features.
Oracle Infrastructure Monitoring simplifies monitoring by offering a common set of metrics that allow you to compare performance across various vendor technologies. This service automatically generates alerts when managed entities are down and allows you to create alert rules that specify the metrics thresholds and notifications options.
Some common concepts and terminologies associated with Oracle Infrastructure Monitoring are:
Entity: A monitored resource, such as a database, a host server, a compute resource, or an application server.
Metrics: A set of parameters and values measured and collected periodically for a particular system for tracking performance and availability.
Alerts: Information generated in response to an availability issue or when a metric crosses its defined thresholds. Conditions for generating alerts are defined in Alert Rules. Alerts sent to administrators through channels such as email and SMS are known as notifications.
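The relationship between metrics, thresholds, and alerts can be sketched in a few lines of Python. This is only an illustration of the concept, not the service's rule format: the AlertRule class, its fields, and the severity names "warning" and "critical" are assumptions made for this example.

```python
import operator
from dataclasses import dataclass

# Comparison operators a hypothetical rule may use.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical alert rule with two thresholds for one metric."""
    metric: str
    op: str
    warning: float
    critical: float

    def evaluate(self, value):
        """Return 'critical', 'warning', or None for a collected value."""
        compare = _OPS[self.op]
        # Check the more severe threshold first.
        if compare(value, self.critical):
            return "critical"
        if compare(value, self.warning):
            return "warning"
        return None


if __name__ == "__main__":
    rule = AlertRule(metric="cpu_utilization", op=">", warning=80, critical=95)
    for sample in (50, 85, 97):
        print(sample, rule.evaluate(sample))
```

A notification step would then deliver any non-empty result to administrators over the configured channel.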
Oracle Infrastructure Monitoring Service supports a large variety of entities out-of-the-box, including common operating systems, virtual servers, cloud services and SaaS applications, relational databases, NoSQL databases, storage systems, network appliances, firewalls and physical switches, and web and application servers, among others. The full list of supported entities can be found here.
You can extend Oracle Infrastructure Monitoring by using open source metric collector agents like collectd and Telegraf (or packaged monitoring tools like Microsoft SCOM or VMware vCenter) to collect additional types of metric data. These metric collectors can be configured to send metrics data to a cloud agent. You can also use custom metrics, which allow you to create full-fledged metrics on any entity type that is monitored by a cloud agent. Custom metrics let you extend Oracle Infrastructure Monitoring Service to monitor conditions specific to your IT environment, giving you a comprehensive view of that environment.
Oracle Infrastructure Monitoring Service also enables you to monitor the state and resource utilization of processes running on a host. This feature is known as Process Monitoring. It is useful in scenarios where you may want to keep a proactive eye on the state of the critical processes that make up an application. You can also monitor the CPU and Memory Utilization of these processes as well as get alerts if they cross boundary threshold conditions.
Oracle Infrastructure Monitoring Service collects performance and availability metrics for all entities set up for monitoring. The alerts sub-system informs you of availability or performance problems. Alert Rules enable you to define how alerts are triggered (if they are not automatic) and how you get notified.
With Oracle Infrastructure Monitoring Service, availability status is monitored automatically. If an entity is down, a "Down" alert of fatal severity is automatically generated. If it is a host or agent entity, a "not heard from" alert (also fatal severity) is generated. Customers can create alert rules to be notified of these alerts. Once an entity is detected to be up, the alert clears automatically.
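The availability alert lifecycle described above (raise a fatal alert when an entity goes down, clear it when the entity comes back) can be sketched as a small state tracker. This is a simplified model for illustration only; the AvailabilityTracker class and its method names are hypothetical and do not correspond to any service API.

```python
class AvailabilityTracker:
    """Tracks open availability alerts per entity (illustrative model)."""

    def __init__(self):
        # Maps entity name to its currently open alert, if any.
        self.open_alerts = {}

    def observe(self, entity, entity_type, is_up):
        """Process one status observation.

        Returns 'raised' when a new alert opens, 'cleared' when an open
        alert closes, and None when nothing changes.
        """
        if is_up:
            if entity in self.open_alerts:
                del self.open_alerts[entity]
                return "cleared"
            return None
        if entity not in self.open_alerts:
            # Hosts and agents get a "not heard from" alert; others get "down".
            kind = "not heard from" if entity_type in ("host", "agent") else "down"
            self.open_alerts[entity] = {"kind": kind, "severity": "fatal"}
            return "raised"
        return None


if __name__ == "__main__":
    tracker = AvailabilityTracker()
    print(tracker.observe("db01", "database", is_up=False))  # alert opens
    print(tracker.open_alerts["db01"])
    print(tracker.observe("db01", "database", is_up=True))   # alert clears
```

Repeated "down" observations for the same entity do not open a second alert, which mirrors the idea that one alert stays open until the entity is detected to be up again.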
The Oracle Infrastructure Monitoring Service Enterprise Summary dashboard provides tier regions that indicate the current status and performance of all entities in each tier. The tiered status bar charts show the breakdown of status for each entity type monitored in your enterprise within that tier. For example, in the following diagram, of all the WebLogic servers, 11 are up (running as expected), 9 are down, 2 have errors, and 3 are in pending status. Similarly, of the total number of Tomcat servers, 3 are up (running as expected) and 1 is down.
Users can review the home page for each entity to look for alerts and key performance metrics. Performance charts also show outliers (points on the charts that look different and are isolated compared to the others). For example, in the following performance chart showing CPU vs memory utilization, there are some outlier points that correspond to high CPU and memory consumption. These need to be investigated.
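As a rough illustration of what "outlier points" on a CPU vs memory chart means, the sketch below flags points that are unusually high on both axes using a robust score based on the median and the median absolute deviation. This is one common statistical technique chosen for the example; it is not a description of how the service itself detects outliers, and the function names and the 3.5 threshold are assumptions.

```python
from statistics import median


def robust_scores(values):
    """Return modified z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def find_outliers(points, threshold=3.5):
    """Return (cpu, memory) points that are unusually high on both axes."""
    cpu_scores = robust_scores([p[0] for p in points])
    mem_scores = robust_scores([p[1] for p in points])
    return [
        p
        for p, c, m in zip(points, cpu_scores, mem_scores)
        if c > threshold and m > threshold
    ]


if __name__ == "__main__":
    # Hypothetical (cpu %, memory %) samples for one entity.
    samples = [(20, 30), (22, 31), (25, 35), (21, 33), (95, 92)]
    print(find_outliers(samples))
```

With these made-up samples, only the point with both high CPU and high memory utilization is returned for investigation.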
Users can also select metrics to be displayed for predefined ranges on performance charts, such as: Last 2 weeks, Last Day, or Last Hour.
Oracle Infrastructure Monitoring Service also provides REST APIs with which new entity types and entities can be defined and metrics can be uploaded from systems where Cloud Agents cannot be installed, such as Oracle Cloud Infrastructure (OCI) Load Balancers and DNS Zone Management. REST APIs are also the mechanism for collecting third-party cloud provider metrics, such as those published through AWS CloudWatch or Azure Ops. The REST API is documented here, and the documentation also provides sample end-to-end use cases. Customers can build solutions on these REST APIs to upload metrics into Oracle Infrastructure Monitoring Service. In addition to using Cloud Agents where available, there are several options for designing and implementing such integration solutions, for example OCI Functions, OCI Streaming, and standalone solutions. The design outline of such a solution based on Oracle Functions is described in this blog.
Oracle Application Performance Monitoring is a cloud service that provides deep visibility into the performance of customers’ web applications. With Oracle Application Performance Monitoring, you can:
Rapidly isolate application performance issues
Drill down to related logs in the context of a problem and find its root cause
Gain end-to-end visibility into the performance of your application across all tiers
Monitor end-user experience
Monitoring of web applications is enabled by the APM Java Agent, a lightweight agent that runs in the Java Virtual Machine (JVM) of a web application. It collects performance monitoring data for Java web applications running in your data center or in the cloud.
Using Oracle Application Performance Monitoring, you can monitor the performance of your application by following transactions across servers to identify the exact tier causing an application issue, see whether the issue is specific to a geography, and see application logs automatically in the context of the application performance. Determining whether the problem lies with the application or with a regional networking issue becomes easier when you have data from globally distributed test agents as well as data from within the application.
Synthetic Monitoring simulates a path through the application that a user would normally take and ensures that the user can transition smoothly through the different web pages in that path. This helps in recognizing application performance issues before end users experience them. Synthetic Monitoring allows the following types of tests to be performed:
HTTP Ping: Testing the connectivity to and performance of your application
Page Load: Testing the performance of a single URL, being loaded by a browser
Scripted Actions: Testing the performance of a complete workflow recorded using Selenium scripting
REST Web Service: Testing the performance of a complete workflow that uses the REST web service
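To give a feel for the simplest of these test types, the following Python sketch performs an HTTP ping style check: it requests a URL, records the status code, and measures the elapsed time. It uses only the Python standard library and is not how the service's test agents are implemented. The function name http_ping, the returned fields, and the injectable fetch parameter are assumptions made for this illustration; the URL in the usage example is a placeholder.

```python
import time
import urllib.error
import urllib.request


def http_ping(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Request a URL once and report status and response time.

    The fetch parameter defaults to urllib.request.urlopen and exists
    only so the check can be exercised without network access.
    """
    start = time.perf_counter()
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # No usable response: DNS failure, refused connection, timeout.
        status = None
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "url": url,
        "status": status,
        "ok": status is not None and 200 <= status < 400,
        "elapsed_ms": elapsed_ms,
    }


if __name__ == "__main__":
    # Placeholder URL; replace it with an application endpoint you own.
    print(http_ping("https://www.example.com/"))
```

Running such a check on a schedule from several locations is the basic idea behind telling an application problem apart from a regional network problem.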
APM enables monitoring of application performance by monitoring end-user experience, server requests, application servers, and other entities. End-user experience monitoring is done by closely watching various aspects of the application including page performance, monitoring of AJAX requests and monitoring of application request performance.
IT Analytics consists of multiple areas that together give customers 360-degree insight into the resource utilization and capacity trends of the workloads they are running.
The purpose of the Analytics service is to provide Administrators, DevOps and IT Executives with information based on historical and long term data that will enable them to make critical decisions. IT Analytics will support them by providing information about usage trends, capacity planning, and anomaly detection.
Resource Analytics looks across fleets of hosts, databases, Exadata, and application servers, with a focus on optimizing utilization, reducing cost, and forecasting future consumption so that workloads run uninterrupted and capacity planning happens efficiently.
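The idea of forecasting future consumption from historical data can be illustrated with a simple linear trend. The sketch below fits a least-squares line to equally spaced utilization samples and estimates how many periods remain until a capacity limit is reached. It is a deliberately basic model chosen for the example and says nothing about the forecasting methods IT Analytics actually applies; the function names and sample values are assumptions.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until(ys, capacity):
    """Estimate periods after the last sample until the trend hits capacity.

    Returns None when usage is flat or shrinking, because the trend line
    then never reaches the limit.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


if __name__ == "__main__":
    # Hypothetical monthly storage utilization in percent.
    history = [50, 55, 60, 65, 70]
    print(periods_until(history, capacity=100))
```

For the made-up history above, usage grows by 5 points per month, so the trend reaches 100 percent six months after the last sample.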
Performance Analytics provides deep insight into the performance of databases and application servers, with the ability to drill into SQL execution performance and details on application page load times, among other things. Availability Analytics is about ensuring continuous visibility into the health of the IT landscape and drawing attention to systems that may be at risk of breaching an SLA before it happens. Historical data on incidents and events can be analyzed to reveal systems or versions that cause more incidents than others.
It is important to view IT Analytics and its components from a broader, long-term perspective: they act on large amounts of historical data to provide important insights, but they are not meant to be real-time monitoring solutions. Real-time monitoring is handled by the Infrastructure Monitoring Service or the APM Service. For the database in particular, there are multiple options, such as the Performance Hub in the OCI Console or Enterprise Manager, among others.
An enterprise-scale setup consists of a mix of native OCI services and customer-managed components, and it needs to detect, monitor, analyze, and forecast at enterprise scale, as discussed in the end-to-end use-case scenario of this blog series. Those functions are provided by the OCI Log Analytics Service, IT Analytics Service, Infrastructure Monitoring Service, and Application Performance Monitoring Service. OCI gives an enterprise the capabilities to obtain fine-grained metrics and monitor resources in order to understand the current health and performance of workloads running on OCI. You can choose any or all of the OCI monitoring services and use the insight they offer to optimize or forecast resource utilization, respond to anomalies, and generate business or technical reports on current workloads on OCI.