Dynamically Filter Apache Server Access Logs with Python

November 3, 2023 | 6 minute read
What's Included in Server Access Logs?

Access logs are simply text files that store information about requests sent to your Apache or WebLogic server. They include values such as the requested resource, the time of the request, the response code, and the response time, among others.

Access logs are valuable for gaining insight into what your HTTP server is doing. There are many tools out there that analyze access logs, but we'll focus on how you can do your own ad-hoc log filtering, a technique I've found useful in the initial stages of troubleshooting.

Log Formats

There are three access log formats that you can use:

  • Common Log Format
  • Combined Log Format
  • Custom Log Format

The Common Log Format is the most widely used format among HTTP servers. It is typically defined as:

LogFormat "%h %l %u %t \"%r\" %>s %b" common

Here’s what this log would contain:

127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001
  • %h, the IP address of the remote host that made the request.
  • %l, the remote log name supplied by identd, if available (usually "-").
  • %u, the user identifier from HTTP authentication.
  • %t, the date and time of the request, with the time zone.
  • \"%r\", the first line of the request, in double quotes.
  • %>s, the final status code returned to the client.
  • %b, the size of the response body sent to the client, in bytes.
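
To make these fields concrete, here's a minimal sketch of pulling them out of a Common Log Format line in plain Python with a regular expression. The pattern and the field names are my own, purely illustrative choices, not part of any library:

```python
import re

# Regex for the Common Log Format; group names are illustrative
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

line = '127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001'
fields = CLF_PATTERN.match(line).groupdict()
print(fields['host'])    # 127.0.0.1
print(fields['request']) # GET /example HTTP/1.1
```

This is essentially what a log-parsing library does under the hood, with much more care around edge cases.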

The Combined Log Format is very similar to the Common Log Format but includes two additional headers – the referrer and the user agent. Its definition looks as follows:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

An example of the log line:

127.0.0.1 - - [03/Nov/2023:12:00:00 -0000] "GET /example HTTP/1.1" 200 3001 "http://www.example.com/" "Mozilla/5.0"

A Custom Log Format lets you define your own combination of format directives. For example, the following definition adds the response handler and the response time to the Common Log Format:

LogFormat "%R:%h %l %u %t \"%r\" %>s %b %D \"%U\"" custom

An example of the custom log line:

access_log.205188100:127.0.0.1 - - [03/Nov/2023:00:00:00 +0000] "GET /example HTTP/1.1" 200 1517 6170 "-"

This custom format has a few additions to the Common Log Format:

  • %R, the handler generating the response.
  • %D, the time taken to serve the request, in microseconds.
  • %U, the URL path requested, not including any query string.
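
Given this format, the %R value can be peeled off a line before any further parsing, since it's separated from the rest by the first colon. A small sketch, using the sample line above:

```python
# A custom-format line as shown above; the handler (%R) and a colon come first
line = ('access_log.205188100:127.0.0.1 - - [03/Nov/2023:00:00:00 +0000] '
        '"GET /example HTTP/1.1" 200 1517 6170 "-"')

# partition() splits on the first colon only, so the colons
# inside the timestamp are left untouched
handler, _, rest = line.partition(':')
print(handler)  # access_log.205188100
```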

Parsing Access Logs with Python

To run my own ad-hoc analysis of server access logs, I typically turn to Python for its ease of use and rich library of public frameworks and modules. One framework that is particularly useful for log parsing is lars, a small framework for working with httpd logs.

The primary purpose of lars is to provide a set of conversions to ease loading such logs into relational databases, but numerous other possibilities can be realized with a little imagination. Which is to say: lars is not a web log analyzer. However, it is a toolkit that makes it quite easy to construct your own analyzer.


A simple lars script opens a log source and uses the source and target wrappers provided by lars to convert the log entries into another format while potentially filtering the entries. Here's an example:

import io
from lars import apache, csv

with io.open('sample_log_file.log', 'r') as infile, io.open('output.csv', 'wb') as outfile:
    with apache.ApacheSource(infile) as source, csv.CSVTarget(outfile) as target:
        for row in source:
            target.write(row)

The above example opens 'sample_log_file.log' as an Apache source using the Common Log Format (the default), then iterates through the rows and writes each one to 'output.csv'.

Dynamic Filtering

To dynamically filter the log rows, we'll use the Python eval() function. The expression argument passed to eval() is parsed and evaluated as a Python expression. Since we omit both the globals and locals arguments, the expression is evaluated with the globals and locals of the environment where eval() is called. Keep in mind that eval() executes arbitrary code, so only pass it expressions you trust.

We can place our eval where the expression has access to each lars row object, which has attributes named after the columns of the source (with characters that cannot appear in Python identifiers replaced with underscores). These row attributes match the descriptions for the Common, Combined, or Custom Log Formats.
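
The mechanism is easy to see without lars at all. In this sketch, a namedtuple stands in for a lars row object; its fields are illustrative choices, not lars's actual column names:

```python
from collections import namedtuple

# A simplified stand-in for a lars row object (illustrative only)
Row = namedtuple('Row', ['remote_host', 'status', 'size'])

rows = [
    Row('127.0.0.1', 200, 3001),
    Row('10.0.0.5', 404, 512),
    Row('127.0.0.1', 500, 128),
]

# The expression references the loop variable 'row', which is in
# scope at the point where eval() is called
expr = 'row.status == 200'

matches = []
for row in rows:
    if eval(expr):
        matches.append(row)

print(len(matches))  # 1
```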

Here's an example of how we can filter the access log using the eval() function to return all POST requests.

import io
from lars import apache, datatypes, csv

# Custom Apache log format
CUSTOM_LOG_FORMAT = '%R:%h %l %u %t "%r" %>s %b %D "%U"'

input_file = 'sample_log_file.log'
expr = 'row.request.method == "POST"'

# Row format for output
NewRow = datatypes.row('host', 'path', 'time', 'size', 'time_taken_ms')
rows_found = 0

with io.open(input_file, 'r') as f:
    with apache.ApacheSource(f, CUSTOM_LOG_FORMAT) as source:
        for row in source:
            if eval(expr):
                rows_found += 1
                # Create a new output row
                new_row = NewRow(row.remote_host, row.request.url.path_str, row.time, row.size, row.time_taken_ms)
                print(new_row)

print(f'{rows_found} rows found for expression {expr}')

Handling Datetimes

That's interesting, but not compelling, right? Let's add the ability to filter log entries based on datetime. To do so, we'll add the function below.

from datetime import datetime

def is_time_between(begin_time, end_time, check_time=None):
    # If check_time is not given, default to the current UTC time
    check_time = check_time or datetime.utcnow().time()
    if begin_time < end_time:
        return begin_time <= check_time <= end_time
    else:  # the window crosses midnight
        return check_time >= begin_time or check_time <= end_time
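
To sanity-check the helper, here's a standalone snippet. It repeats the function so it runs on its own, and passes full datetime objects for all three arguments, as the filter expressions below do:

```python
from datetime import datetime

def is_time_between(begin_time, end_time, check_time=None):
    # If check_time is not given, default to the current UTC time
    check_time = check_time or datetime.utcnow().time()
    if begin_time < end_time:
        return begin_time <= check_time <= end_time
    else:  # the window crosses midnight
        return check_time >= begin_time or check_time <= end_time

fmt = '%m/%d/%Y %H:%M:%S'
begin = datetime.strptime('09/15/2023 00:00:01', fmt)
end = datetime.strptime('09/15/2023 00:00:03', fmt)

print(is_time_between(begin, end, datetime.strptime('09/15/2023 00:00:02', fmt)))  # True
print(is_time_between(begin, end, datetime.strptime('09/15/2023 00:00:05', fmt)))  # False
```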

With the is_time_between function, we can filter log entries with code like this:

  • Filter by exact request time:
    • expr = 'row.time == datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S")'
  • Filter entries between two times:
    • expr = 'is_time_between(datetime.strptime("09/15/2023 00:00:01", "%m/%d/%Y %H:%M:%S"), datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S"), row.time)'

Putting It All Together

To get everything working, create a function that wraps the above logic.

def filter_access_log(input_file, output_file, expr):

    # Row format for output
    NewRow = datatypes.row('host', 'path', 'time', 'size', 'time_taken_ms')
    rows_found = 0
    total_rows = 0

    with io.open(input_file, 'r') as f, io.open(output_file, 'wb') as o:
        with apache.ApacheSource(f, CUSTOM_LOG_FORMAT) as source, csv.CSVTarget(o, header=True) as target:
            for row in source:
                total_rows += 1

                if eval(expr):
                    rows_found += 1
                    # Create a new output row
                    new_row = NewRow(row.remote_host, row.request.url.path_str, row.time, row.size, row.time_taken_ms)
                    target.write(new_row)

    print(f'{rows_found} rows of {total_rows} found for expression {expr}')

Example Filter Expressions

Here are some example filter expressions you can use.

To find rows that match...

  • HTTP method POST: 'row.request.method == "POST"'
  • HTTP POST and 200 status: 'row.request.method == "POST" and row.status == 200'
  • Exact request time: 'row.time == datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S")'
  • Between two times: 'is_time_between(datetime.strptime("09/15/2023 00:00:01", "%m/%d/%Y %H:%M:%S"), datetime.strptime("09/15/2023 00:00:03", "%m/%d/%Y %H:%M:%S"), row.time)'
  • Response time (microseconds) at least a value: 'int(row.time_taken_ms) >= 2793769'
  • Response size (bytes) at least a value: 'int(row.size) >= 1000'
  • Request path contains a value: '"/example/path" in row.request.url.path_str'
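
One small refinement: on a large log file, eval() re-parses the expression string on every row. Compiling the expression once with Python's built-in compile() avoids that. A sketch, again with a namedtuple standing in for a lars row:

```python
from collections import namedtuple

# Stand-in row type for illustration
Row = namedtuple('Row', ['status', 'path'])

rows = [Row(200, '/example/a'), Row(404, '/example/b'), Row(200, '/other')]

# Parse the expression once, then evaluate the compiled code per row
expr = 'row.status == 200 and "/example" in row.path'
code = compile(expr, '<filter>', 'eval')

matched = [row for row in rows if eval(code, {'row': row})]
print(len(matched))  # 1
```

Passing an explicit globals mapping here also limits what the expression can see, which is a little safer than evaluating in the enclosing scope.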

Summary

I use this script for an initial pass at analyzing access logs - to do things like filter the set of requests between two times, or find all POST requests with response times over a certain threshold. It's a great way to quickly determine what to do with the logs next.

Happy hunting!

~sn

Shea Nolan

