Performance issues are some of the most difficult and expensive issues to diagnose and fix. For JAVA applications there is a great tool called the Java Flight Recorder (JFR) that can be used to both proactively to find potential performance issues during testing before they become apparent through external metrics and reactively to troubleshoot issues that have appeared during performance testing or in production.
The Java Flight Recorder captures and dumps, with very little overhead (typically less than 2%), information about the Java runtime and the Java application running in the Java runtime to a file. The data is recorded at time stamped data points called events. Typical events can be threads waiting for locks, garbage collections, periodic CPU usage data, File/Socket Input/output, JDBC request processing, Servlet Invocation etc. By default every event that took more than 10 ms is recorded. When creating a JFR recording, you can create recording templates which are sets of events that should be recorded. However, the default template is good enough to capture most of the details required to monitor the health of the system.
JFR is by default enabled for JRockit.
You have to specify following Java argument to enable JFR for Hotspot.
Start JMC from $JAVA_HOME/bin with “./jmc” command. For Jrockit JVM, the command is, “./jrmc”. You can see a list of Java Virtual Machine (JVM) processes running under JVM Browser.
You can select a JVM process and start/stop or dump JFR if it was already started from JVM browser.
You can set up Java Mission Control to automatically start or dump a flight recording when a threshold is reached for some system resource or some condition is met for MBean. This is done from the JMX console. To start the JMX console, find your application in the JVM Browser, right-click it, and select “Start JMX Console”.
Select “Triggers” tab at the bottom of the screen to configure new trigger.
You can choose to create a trigger on any MBean in the application. There are several default triggers set up for common conditions such as high CPU usage, deadlocked threads, or too large of a live set. Select “Add” to choose any MBean in the application, including your own application-specific ones. When you select your trigger, you can also select the conditions that must be met.
Click the check box next to the trigger to enable it. Select condition that should be met for the trigger then, click the Action tab. From Action tab, select what to do when the condition is met.
As shown below, you can check average CPU usage and CPU usage when performance test was run. Performance test was run twice when this JFR was captured. As shown in the screen shot below, both the times CPU shot up to 100%. This of course is not a good sign and needs to be corrected. Okay, so how do you know the number of CPUs assigned to the machine? You can check that from “System” tab.
As you can see in the above screen shot, two CPUs were assigned to the machine. When I changed it to 4 CPUs, it did help and the CPU was no more 100% utilized, however it was still used more than 90% as shown in the screen shot below for the same performance test. This shows a use case of JFR. You can measure performance improvement of the system when you make some configuration change.
The heap size is one of the very important resources that impacts performance. Along with heap size, garbage collection (GC) pattern also says a lot about performance. How many times GC was run, how much time did it take to finish GC and how many of those GCs were PermGen and full GCs. The GC cycle also helps detect memory leaks if any exist. If memory used post GC operation is growing every time GC runs is the first indication of potential memory leaks. More information about memory tab and detecting memory leak can be found here.
Contention is one another easy to detect signs of performance issues. Too much contention can be a result of low CPU or low memory available or bad coding. Looking at resources usage and stack trace of the thread that caused contention can help you find out the root cause of contention. As shown in the below example, same test with 2 CPU had a much more contention than with 4 CPU.
Note: Ignore Muxer thread contention. They are special threads created by weblogic to read/write data from/to sockets.
System configuration as shown in one of the screen shot above shows details about CPU, memory and hardware. It also shows a list of environment variables and processes running on the system. Processes running on the system can help you check and make sure no unwanted process was running when the performance test was done.
One of the interesting parts of JFR is integration with weblogic. JFR can capture all the events produced by the Weblogic Diagnostics Framework (WLDF) throttled per ECID (Execution Context Identifier). The Weblogic tab pack is not installed by default. You can follow the steps to install the Weblogic tab pack. JFR can capture WLDF
From Weblogic tab, you can check the time taken to process every request identified by ECID or by HTTP requests or by servlets. You can also check the time taken by JDBC queries to diagnose list of JDBC queries that need tuning.
Every tab under Java Mission control has a range navigator at the top. It is time range. You can select a period of time when let’s say the test was run or when you observed failures during the performance test. This helps you do more focused troubleshooting and also get zoomed view of the event graph. Once you select a range, you can ‘synchronize selection’ so every tab provides you data within that time frame.
There is also an interesting feature that is operative set. When you figure out, let’s say a troubled thread, you can add that thread to operative set and view data about just that thread or operation in every tab. This helps you pinpoint the issue and recommend very specific tuning. Below is an example of using operative set.
First of all, from Weblogic tab, I get request instance based on ECID that took the maximum time to process (more than 2 seconds) and add that to operative set. Select ‘Add Selection’ option from the list of menu items as shown in the screen below.
Then from Events tab I click on ‘Show Only Operative Set’ and that will show only those events that happened during processing of the request that I selected as operative set. If you see that event, you would notice that it spent more than 1 second waiting for monitor so you know why it took more than 2 seconds to process the request.
More information about JFR can be found here.