Visualizing the Coherence Datagram Test

 

[Figure: packet loss and throughput over time (10000-csv2)]

The graph above was generated from the output of the Coherence Datagram Test utility, a tool that sends and receives UDP packets between two or more machines to evaluate the health and performance of the network between them. The test above ran for 100 seconds on two server-class machines, each with a 1 Gb Ethernet connection to the same switch. The graph makes it pretty clear that there is significant packet loss between the two machines. Here’s what the graph looks like on a healthy network:

[Figure: packet loss and throughput over time on a healthy network (10000-csv3)]

The first step is to actually run the Datagram Test to generate report data:

server1$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local 192.168.1.100 -log 192.168.1.100.log -txDurationMs 100000 -polite 192.168.1.101
server2$ java -server -cp coherence.jar com.tangosol.net.DatagramTest -local 192.168.1.101 -log 192.168.1.101.log -txDurationMs 100000 192.168.1.100

The above pair of commands will run a bi-directional test for 100 seconds, generating a tab-delimited report in the file specified by -log. As of Coherence 3.6, the tab-delimited report emits aggregated lifetime metrics (accumulated since the test began) every 100,000 received packets by default. For analyzing packet loss, it makes more sense to look at the metrics accumulated between reporting intervals, since lifetime metrics can mask spikes that occur later in the test. Luckily, the per-interval metrics we need can be derived from the lifetime metrics. The following awk script calculates the additional columns of interest (and also fixes a bug in the test where the data columns don’t align with the header columns due to two missing delimiters):

#!/usr/bin/awk -f
BEGIN {
    FS = "[\t\r\n]";
}

# Header line
/^publisher/ {
    if (FILENAME == "") {
        FILENAME = "stdin";
    }
    else {
        print("Processing " FILENAME);
    }
    gsub(/[\r\n]/, "", $0);
    header = sprintf("%s\tinterval duration secs\tinterval missing packets\tinterval drop rate\tinterval success rate\tinterval throughput mb/sec", $0);
    for (outfile in aOutfile) {
        close(aOutfile[outfile]);
    }
    delete aPrevSent;
    delete aPrevReceived;
    delete aPrevMissing;
    delete aPrevDurationMillis;
    delete aDurationOffset;
    delete aOutfile;
    next;
}

# Initialize prev values
aPrevSent[$1] == ""  {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
    aPrevDurationMillis[$1] = 0;
    aDurationOffset[$1] = 0;
    aOutfile[$1] = FILENAME "." substr($1, 2, length($1))  ".csv";
    if (aOutfile[$1] ~ /^stdin/) {
        print(header);
    }
    else {
        print(header) > aOutfile[$1];
    }
}

# Account for packet sequence restart
$2 < aPrevDurationMillis[$1] {
    aPrevSent[$1] = 0;
    aPrevReceived[$1] = 0;
    aPrevMissing[$1] = 0;
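    aDurationOffset[$1] += aPrevDurationMillis[$1];
    # Restart the interval clock so the next interval duration is not negative
    aPrevDurationMillis[$1] = 0;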
}

# Skip duplicate lines
$2 == aPrevDurationMillis[$1] {
    next;
}

{
    split($11, aOoo, /^[0-9]/);
    sOoo = sprintf("%s\t%s", substr($11, 1, 1), aOoo[2]);

    split($13, aGapMillis, /^[0-9]/);
    sGapMillis = sprintf("%s\t%s", substr($13, 1, 1), aGapMillis[2]);

    cIntervalDurationMillis = $2 - aPrevDurationMillis[$1];
    cIntervalSent = $6 - aPrevSent[$1];
    cIntervalReceived = $7 - aPrevReceived[$1];
    cIntervalMissing = $8 - aPrevMissing[$1];
    # Guard against division by zero when no packets were sent in the interval
    dflIntervalDropRate = (cIntervalSent > 0) ? cIntervalMissing / cIntervalSent : 0;
    dflIntervalSuccessRate = 1 - dflIntervalDropRate;
    dflIntervalThroughput = (($3 * cIntervalReceived) / (cIntervalDurationMillis / 1000)) / (1024 * 1024);

    aPrevDurationMillis[$1] = $2;
    aPrevSent[$1] = $6;
    aPrevReceived[$1] = $7;
    aPrevMissing[$1] = $8;

    if (aOutfile[$1] ~ /^stdin/) {
        printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%.3f\t%d\t%f\t%f\t%d\n",
                $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10, sOoo, $12, sGapMillis,
                cIntervalDurationMillis / 1000, cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate, dflIntervalThroughput);
    }
    else {
        printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%.3f\t%d\t%f\t%f\t%d\n",
                $1, $2 + aDurationOffset[$1], $3, $4, $5, $6, $7, $8, $9, $10, sOoo, $12, sGapMillis,
                cIntervalDurationMillis / 1000, cIntervalMissing, dflIntervalDropRate, dflIntervalSuccessRate, dflIntervalThroughput) > aOutfile[$1];
    }
}
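
To make the per-interval derivation easier to follow, here is a minimal Python sketch of the same arithmetic the awk script performs: subtract the previous lifetime counters from the current ones, then compute drop rate and throughput from the differences. The class names, field names, and sample numbers (including the 1468-byte packet size) are made up for illustration and are not taken from a real Datagram Test report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LifetimeRow:
    """One report line of lifetime (cumulative) counters; field names are hypothetical."""
    duration_ms: int
    packet_bytes: int
    sent: int
    received: int
    missing: int


@dataclass
class IntervalRow:
    """Metrics covering only the time since the previous report line."""
    duration_secs: float
    missing: int
    drop_rate: float
    throughput_mb_sec: float


def interval_metrics(rows: List[LifetimeRow]) -> List[IntervalRow]:
    """Difference consecutive lifetime rows to recover per-interval metrics."""
    prev = LifetimeRow(0, 0, 0, 0, 0)
    out = []
    for row in rows:
        secs = (row.duration_ms - prev.duration_ms) / 1000
        sent = row.sent - prev.sent
        received = row.received - prev.received
        missing = row.missing - prev.missing
        # Avoid dividing by zero on an empty or zero-length interval
        drop_rate = missing / sent if sent > 0 else 0.0
        throughput = (row.packet_bytes * received) / secs / (1024 * 1024) if secs > 0 else 0.0
        out.append(IntervalRow(secs, missing, drop_rate, throughput))
        prev = row
    return out


# Made-up sample: all of the loss happens in the second interval.
rows = [
    LifetimeRow(10_000, 1468, 100_000, 100_000, 0),
    LifetimeRow(20_000, 1468, 200_000, 190_000, 10_000),
]
intervals = interval_metrics(rows)
lifetime_drop = rows[-1].missing / rows[-1].sent
print(f"lifetime drop rate: {lifetime_drop:.2%}")
print(f"last interval drop rate: {intervals[-1].drop_rate:.2%}")
```

In this sample the lifetime drop rate at the second report is half of what the second interval actually experienced, which is the masking effect that motivates looking at intervals instead.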

The awk script takes the output of the -log option and produces a new file. Assuming you save the awk script as augment-datagram-test.awk and set the execute bit, you can use it as follows:

server1$ ./augment-datagram-test.awk 192.168.1.101.log

The above command will generate a new file called 192.168.1.101.log.192.168.1.100:10000.csv which contains the additional columns “interval duration secs”, “interval missing packets”, “interval drop rate”, “interval success rate” and “interval throughput mb/sec”. The script will produce one csv file for each publisher present in the tab-delimited report. The script will also accept multiple tab-delimited files as input, processing each one independently, and can also accept input piped through stdin (with output going to stdout).
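
The output file name comes from the awk expression FILENAME "." substr($1, 2, length($1)) ".csv", which drops the first character of the publisher column before appending it to the input file name. A rough Python equivalent follows; the function name is hypothetical, and the leading "/" on the publisher value is an assumption about what the report contains, not something confirmed here.

```python
def output_name(log_file: str, publisher: str) -> str:
    """Build the per-publisher CSV name the way the awk script does.

    The first character of the publisher field is dropped (assumed here
    to be a leading "/"), then the rest is appended to the log file name.
    """
    return f"{log_file}.{publisher[1:]}.csv"


# Assumed publisher value; the real report column may differ.
print(output_name("192.168.1.101.log", "/192.168.1.100:10000"))
```

Because the name is keyed on the publisher column, a report containing several publishers yields one csv file per publisher.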

To actually generate the graphs, I use R. I encountered R earlier this year working with a customer, but didn’t have the chance to play around with it myself. Before I decided to use R, I was taking the output from my awk script, importing it into a spreadsheet application and then generating graphs. This proved to be quite tedious and involved too many mouse clicks for my taste, so I turned to R to let me script the process and eliminate the need for a spreadsheet application altogether. R is also much more flexible when it comes to producing graphs, as you have complete control over the plot area. After a few days of playing around with R, I was able to come up with the following script to generate the graphs seen at the beginning of this post:

args <- commandArgs(TRUE)
for (file in args)
{
    outfile <- paste(file, ".png", sep = "")
    cat("Plotting ", file, " as ", outfile, "\n", sep = "")

    # Read and process input file
    dgt     <- read.table(file, header = TRUE, sep = "\t")
    x       <- dgt$duration.ms / 1000
    y       <- dgt$interval.drop.rate * 100
    x.range <- c(0, max(x))
    y.range <- c(0, max(y, 20))
    nonzero <- which(y > 0)
    loss.intervals   <- (length(nonzero) / length(y)) * 100
    throughput.range <- c(0, max(dgt$interval.throughput.mb.sec, 120))
    title <- sub("\\.log\\.", " <- ", file)
    title <- sub("\\.csv", "", title)

    # Create plot as PNG
    png(filename = outfile, height = 400, width = 600, bg = "white")

    # Set margins to make room for right-side axis labels
    par(mar = c(7,5,4,5) + 0.1)

    # Plot packet loss line
    plot(x, y, type = "l", main = title, xlab = "Time (secs)", ylab = "Loss (%)",
            col = "blue", xlim = x.range, ylim = y.range, lwd = 2)

    # Circle points where packet loss > 0
    points(x[nonzero], y[nonzero], cex=1.5)

    # Plot throughput line
    lines(x, dgt$interval.throughput.mb.sec * (y.range[2] / throughput.range[2]),
            col = "green", lwd = 2)

    # Create right-side axis labels and tick marks
    axis(4, at = y.range[2] * c(0:4) / 4,
            labels = (throughput.range[2] / 4) * c(0:4))
    mtext("Throughput (MB/s)", side = 4, line = 3)

    # Draw the background grid lines
    grid()

    # Report the number of intervals that experienced loss (as a %)
    mtext(sprintf("Intervals w/ Loss: %.2f%%", loss.intervals), side = 1,
            line = 3, adj = 1)

    # Create the legend at the bottom
    legend("bottom", inset = -0.4, c("loss", "throughput"),
            col = c("blue", "green"), lty = 1, lwd = 2, bty = "n", horiz = TRUE,
            xpd = TRUE)

    # Close the PNG
    dev.off()
}

Assuming you save the contents of the above script as plot-datagram.r, you can invoke the script as follows:

server1$ R -q --slave -f plot-datagram.r --args 192.168.1.101.log.192.168.1.100:10000.csv

The output from the above command will be a new file called 192.168.1.101.log.192.168.1.100:10000.csv.png which represents a graph of both packet loss and throughput over the duration of the test. The circles indicate intervals where packet loss occurred. This script can also accept multiple files as input, generating a graph for each in a separate file.
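
The R script draws throughput on the same plot as loss by rescaling it onto the loss axis and then labeling a right-side axis with the original throughput values. The Python sketch below reproduces just that arithmetic, along with the "Intervals w/ Loss" percentage. The function names are hypothetical, and the defaults of 20 and 120 mirror the minimum axis maxima used in the R script.

```python
from typing import List, Tuple


def scale_to_loss_axis(throughput: List[float],
                       loss_axis_max: float = 20.0,
                       throughput_axis_max: float = 120.0) -> List[float]:
    """Map throughput values onto the loss (%) axis so both share one plot."""
    return [t * loss_axis_max / throughput_axis_max for t in throughput]


def right_axis_ticks(loss_axis_max: float = 20.0,
                     throughput_axis_max: float = 120.0,
                     n: int = 4) -> Tuple[List[float], List[float]]:
    """Return tick positions (in loss-axis units) and their throughput labels."""
    positions = [loss_axis_max * i / n for i in range(n + 1)]
    labels = [throughput_axis_max * i / n for i in range(n + 1)]
    return positions, labels


def percent_intervals_with_loss(loss_pct: List[float]) -> float:
    """Percentage of reporting intervals in which any packet loss occurred."""
    return 100.0 * sum(1 for v in loss_pct if v > 0) / len(loss_pct)


# Made-up values for illustration
print(scale_to_loss_axis([0.0, 60.0, 120.0]))
print(right_axis_ticks())
print(percent_intervals_with_loss([0.0, 0.0, 2.5, 0.0]))
```

One consequence of this approach is that the green line's height is only meaningful when read against the right-side axis, since its plotted values are in loss-axis units.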

With both scripts in hand, generating graphs to visualize packet loss from the output of the Datagram Test can be done in a few seconds:

server1$ ./augment-datagram-test.awk *.log
server1$ R -q --slave -f plot-datagram.r --args *.csv
