The importance of clean log files

Introduction

This post will refer to Oracle ATG Commerce stack specific examples, but the theme applies to any product/configuration.

Reading and understanding log files is an important skill. Having clean log files is critical!

 

Why are clean logs so important?

On many occasions we see questions along the lines of “why isn’t XYZ working correctly?” Browse product forums and public support communities and you will find this same type of question asked repeatedly.

Sometimes the root cause of the problem is an obscure scenario, or a product bug not easily discernible from debugging or logs. But on many occasions, the answer is right there in the log files.

 

Many modern software solutions are complex, often consisting of the integration of multiple individual products into a single stack meant to provide a larger solution. As complexity increases, understanding the nuances of every integration point and code interaction starts to become an impossible task. A seemingly innocuous error in a log can have a major impact somewhere in that complex integration of half a dozen products and millions of lines of code.

How do you know if that error is really “an error that matters”? I have been asked that very question on numerous occasions, and my response is always the same. ALL errors matter. Unless you understand every line of code, and know exactly what features a particular error impacts, you are doing nothing more than gambling by not addressing errors.

Whether you are running performance tests, regression tests, or are a developer working on a new feature – the errors matter.

 

To help illustrate the potential impact of a single error, the following are real examples I have personally encountered.

  • A performance test was being run and a single startup error was in the WebLogic logs. The test was comparing the performance of a new release of code against a previous version, and the new version was performing significantly worse. The single startup error was a message about a server not being reachable. The server was in fact an old server, no longer being used since hardware was upgraded. The tester assumed it didn’t matter since they still had a fully functional cluster running. Under the hood, the code was repeatedly trying to communicate with this nonexistent server, causing threads to be blocked until the socket timed out. Threads began backing up, causing response times across the test to skyrocket.
  • A developer was working on a new feature and could not get the code to behave correctly, even though they were following examples provided in the product manuals. The feature in question had to interact with the Profile Repository. In the startup logs for Commerce was an error about the Profile Repository XML being malformed. The end result was that none of the data in this repository was accessible, causing the code to not work.
  • A group of developers debugging a problem could not figure out why code was not working. A single log entry during startup stated a required ATG library, protocol.jar, was missing. The code in question made use of the internal ATG messaging system. The lack of protocol.jar was causing XML files to not be parsed correctly, which in turn caused the messaging system to not start correctly.

 

If you want to be 100% certain an error message is not impacting your site – address the error.

Aside from causing things to not function correctly, errors can cause performance degradation and out of memory errors. As errors occur, they can leave objects behind on the heap, causing it to fill until the application ultimately crashes with an out of memory error. As the heap fills, garbage collection occurs more frequently, which often impacts site performance.

 

What can you do?

Step one is clean startup logs. Clean startup means not even a single error.

If your Commerce instances are starting up with any errors, address them immediately. Startup errors are often the easiest to fix, and fixing them can prevent more complicated runtime errors from occurring later.

If you have a cluster running, it is important to check all Commerce instances, not just one. An error in a single instance can impact the entire cluster.
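As a rough illustration of checking every instance for a clean startup, here is a minimal Python sketch. The severity pattern is an assumption on my part, not something taken from a product manual, so adjust it to whatever your logs actually emit. The function names are hypothetical as well.

```python
import re

# Assumed severity markers; change this to match your own log format.
ERROR_PATTERN = re.compile(r"\b(error|severe|critical)\b", re.IGNORECASE)


def find_error_lines(log_path):
    """Return (line_number, text) for every line that looks like an error."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if ERROR_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


def check_startup_logs(log_paths):
    """Map each log path to its error lines.

    An empty dict means every instance started cleanly.
    """
    report = {}
    for path in log_paths:
        hits = find_error_lines(path)
        if hits:
            report[str(path)] = hits
    return report
```

Pass it the startup log of every instance in the cluster, not just one, and treat anything other than an empty result as something to fix before moving on.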

 

Step two is to monitor your running environment. Sometimes errors only occur when the site is used a certain way. Monitor all log files and watch for errors. When they are found, attempt to address them as soon as possible.
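If you do not have a monitoring tool in place yet, even a small polling script is better than nothing. The sketch below is one possible approach, not a feature of any product: a hypothetical helper that returns only the lines appended to a log since the last check, so a scheduled job can scan just the new entries.

```python
import os


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for content appended since `offset`.

    `offset` is the byte position returned by the previous call.
    """
    if os.path.getsize(path) < offset:
        offset = 0  # file was rotated or truncated, so start over
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Keep only complete lines; a partial last line is picked up next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Store the returned offset between runs and feed the new lines to whatever error check you use.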

 

How can I monitor all these logs?

The first step to monitor the logs is to know where all the logs are. This is not always an easy task.

Using the Commerce stack as an example, you have WebLogic logs for each instance, Endeca logs for each piece of the Endeca stack – plus logs for your deployed EAC application, logs for your database, possibly logs for a webserver for static content, and operating system logs.

 

You need to know where the logs exist for all the products you are using.

Check the product manuals first. Most products provide information on the location of the major log files used.

Contact support for your product(s) if you aren’t sure.

A trick I have used in the past on *nix systems is simply to go to / and run find . -name \*.log, then repeat the search with \*.out.
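The same search can be scripted, which is handy if you want to feed the results into another check. This is a minimal Python sketch of the idea. The function name and the default patterns are my own choices for illustration.

```python
import fnmatch
import os


def find_log_files(root, patterns=("*.log", "*.out")):
    """Walk `root` and return every file whose name matches a pattern."""
    matches = []
    # os.walk skips directories it cannot read instead of raising.
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(directory, name))
    return sorted(matches)
```

Calling find_log_files("/") mirrors the command line version. Starting from a narrower directory, such as your install root, will be much faster.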

Another option that will likely yield more results, but not always the results you want, is to search for all files modified in the last X amount of time – say in the last 30 minutes. This can help track down log files.
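Here is a hedged sketch of that second approach in Python. The 30 minute window comes from the example above; the function name and the optional `now` parameter are hypothetical conveniences I added.

```python
import os
import time


def recently_modified(root, minutes=30, now=None):
    """Return files under `root` modified within the last `minutes`."""
    current = time.time() if now is None else now
    cutoff = current - minutes * 60
    recent = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    recent.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(recent)
```

Expect noise in the results: temp files, caches, and data files change constantly too, so you will still need to pick the actual logs out of the list.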

 

For a developer working in something like a local virtual machine with the entire stack running in that one VM, finding and monitoring logs is often an easier task.

For a large cluster deployment, manually monitoring log files can be a very tedious and time-consuming task.

Many tools exist that will aid in monitoring log files, and automatically alert you if errors occur. A few examples of these types of tools are Splunk, logwatch, Graylog, and Nagios. There are many others out there, some free, some not.

Choosing a log monitoring tool that meets your needs should be an important part of any large Commerce, or other product, deployment. These tools can be very flexible, and configured to alert you to entries in logs using things like pattern/regex matches. Some tools allow you to categorize errors, and alert different people or groups based on the specific error.
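To make the idea of pattern matching and categorized alerts concrete, here is a small Python sketch. It does not reflect the configuration format of any of the tools mentioned above. The rules, category names, and group names are all hypothetical, loosely based on the errors described earlier in this post.

```python
import re

# Hypothetical rules: (category, pattern, group to notify).
RULES = [
    ("connectivity",
     re.compile(r"not reachable|connection refused|timed out", re.IGNORECASE),
     "infrastructure-team"),
    ("repository",
     re.compile(r"repository.*malformed", re.IGNORECASE),
     "commerce-dev"),
    ("memory",
     re.compile(r"OutOfMemoryError"),
     "performance-team"),
]

# Fallback for error lines that match no specific rule.
GENERIC_ERROR = re.compile(r"\berror\b", re.IGNORECASE)


def route_log_line(line):
    """Return (category, group) for an error line, or None for other lines."""
    for category, pattern, group in RULES:
        if pattern.search(line):
            return category, group
    if GENERIC_ERROR.search(line):
        return "uncategorized", "on-call"
    return None
```

Rules are checked in order, so put the most specific patterns first. A real tool gives you the same building blocks plus delivery, throttling, and history.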

Preventing errors from ever getting to a production site is often the most efficient and cost-effective answer, but bugs happen. Keeping an eye on the logs for all your applications will help ensure your end users have a better experience, and your hardware and applications are performing their best.

 
