Recently, I worked with a customer where we would notice severe performance degradation over time within JCS (Java Cloud Service).
The product used was Oracle Service Bus (OSB) where it would take around 2 seconds to process a request and send a response. Over time, we would see the same request/response take over 1 minute! If we restarted the JVM and tested again, the response time was back to 2 seconds but would again degrade over time.
In an effort to troubleshoot this, I started out by taking JFRs (Java Flight Recordings) to compare when things were somewhat fast versus when they were horrendously slow. We noticed that the XML transformations become very slow. What we did not know was whether it was only the transformations that became slow or whether everything became slow.
In an effort to determine whether it was only OSB transformations, I created a simple WebApp that would stress the CPU. When invoked, it calculates Fibonacci to the 40th digit.
After server start this would typically finish in about 5 seconds. After the server gets into the bad state, the browser would timeout and we would start to see STUCK threads on WebLogic where the stack would like like:
"[STUCK] ExecuteThread: '60' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007f3980066800 nid=0x1abc runnable [0x00007f39c9ce5000] java.lang.Thread.State: RUNNABLE at jsp_servlet.__index.fib(__index.java:61) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) at jsp_servlet.__index.fib(__index.java:69) ...
This would consume a full CPU until it completed. Eventually, it did complete, but when things were really bad it would take over 1 hour (you read that correctly) to complete.
It almost seemed like the CPU was being throttled severely, especially since the thread dump only showed the expected recursion in my Fibonacci JSP code.
At this point, I wanted to see whether a native stack trace showed me anything different. Here is what a pstack showed for one of the threads stuck in the Fibonacci recursion:
Thread 22 (Thread 0x7f39cacf8700 (LWP 6828)): #0 0x00007f3a320d6473 in CodeHeap::largest_free_block() const () from /u01/jdk/jre/lib/amd64/server/libjvm.so #1 0x00007f3a31f678af in CodeCache::largest_free_block() () from /u01/jdk/jre/lib/amd64/server/libjvm.so #2 0x00007f3a31f928b5 in CompileBroker::compile_method(methodHandle, int, int, methodHandle, int, char const*, Thread*) () from /u01/jdk/jre/lib/amd64/server/libjvm.so #3 0x00007f3a31ddedbd in AdvancedThresholdPolicy::submit_compile(methodHandle, int, CompLevel, JavaThread*) () from /u01/jdk/jre/lib/amd64/server/libjvm.so #4 0x00007f3a32418180 in SimpleThresholdPolicy::event(methodHandle, methodHandle, int, int, CompLevel, nmethod*, JavaThread*) () from /u01/jdk/jre/lib/amd64/server/libjvm.so #5 0x00007f3a3213b637 in InterpreterRuntime::frequency_counter_overflow_inner(JavaThread*, unsigned char*) () from /u01/jdk/jre/lib/amd64/server/libjvm.so #6 0x00007f3a3213f5c6 in InterpreterRuntime::frequency_counter_overflow(JavaThread*, unsigned char*) () from /u01/jdk/jre/lib/amd64/server/libjvm.so #7 0x00007f3a260108fb in ?? () #8 0x0000000000000007 in ?? () #9 0x00007f3a260108c7 in ?? () #10 0x00007f39cacf55e0 in ?? ()
What we can see here is that the method is being compiled over and over. With help from a colleague, it was determined that this was a known bug with a specific JVM flag: -XX:+TieredCompilation.
This flag is set by JCS itself when provisioning the instances. You can see this in the Server Start properties for the managed server:
While the journey was long (I didn't cover many of the troubleshooting/diagnostics that were done in between), the fix was simple.
Remove the -XX:+TieredCompilation flag.
So, what were the results after making this change? With my Fibonacci program, it would now finish at 500 milliseconds! That is even a huge improvement over our baseline of 5 seconds after server start! And that result remained consistent over time too!
And what about with OSB? Instead of our baseline of 2 seconds that we obtained right after server start, we were now at 700 milliseconds! That also remained consistent over time! So, not only did removing this flag help with the severe performance degradation over time, it significantly improved on our fastest times!
This bug has since been fixed in 220.127.116.11 JVM but it will not be fixed in 1.7 JVM and thus the flag should never be used with version earlier than 1.8.x.
SPECIAL NOTE: In terms of JCS itself, this flag has now been removed for the upcoming release (15.2.3 version). After this release, any NEW JCS instances that are provisioned will NOT have this flag. For any existing JCS instances that you have, please remove this flag.
Thanks for reading!