AWK for the Most Minimal of Performance Testing Toolsets!
`awk` is a traditional Unix utility that I never had much use for. _So you say it processes data line-by-line? In its own unique domain-specific language? Even simple one-liners run from the shell tend to have quotes (`""`), square brackets (`[]`), and curly braces (`{}`)? And have you heard how local variables are declared in functions?!?_ It seemed quaint, complex, and distinctly old-school: the kind of command you might use before piping output to your 9-pin dot-matrix printer with `| pr | opr`. Oh, and the latest edition of O'Reilly's Effective awk Programming was published June 2, 2001. Ya.
All that was before I stumbled across Adam Drake's post Command-line tools can be 235x faster than your Hadoop cluster, read Bruce Barnett's easy-to-follow awk tutorial, and wrote my first real awk program: a quick script to get basic statistics from multiple runs of the `time` command. Together with the `bash` script below for timing multiple runs of a program, this gives us the start of a super-minimal performance testing toolset. Perfect for when you want to know, say, the average running time across 80 runs of a script.
Step 1: Timing Multiple Runs of a Script
The `time` command produces results in the format:
```
real    0m12.667s
user    0m8.240s
sys     0m3.411s
```
This gives us the total amount of time elapsed while running a command (‘real’), the amount of time spent in user-land code (‘user’), and the amount of time spent on system calls (‘sys’). 1
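For a quick illustration, a command that mostly sleeps accumulates 'real' time but almost no CPU time. A minimal sketch (capturing `time`'s output to a scratch file, since it goes to standard error; the file name here is arbitrary):

```shell
# 'sleep 1' burns ~1 second of wall-clock time but almost no CPU
{ time sleep 1 > /dev/null; } 2> sleep-timing.txt
cat sleep-timing.txt
# real is about 1 second; user and sys are near zero
```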
To get results over multiple runs, we have the super-simple `bash` script below. It takes the name of a command to run, runs it (serially) an arbitrary number of times, and appends the output to a specified text file:2
```bash
#!/usr/bin/env bash
# Usage: ./tester [number of times to run] [utility to run] [output file]
# Ex:    ./tester 20 ./test1 test1results
i=1
while [ "$i" -le "$1" ]; do
    echo "Run $i"
    echo "Run $i" >> "$3"
    { time $2 > /dev/null; } 2>> "$3"
    let i=i+1
    echo >> "$3"
done
```
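The `{ time …; } 2>>` grouping is what captures `time`'s output, since it writes to standard error (see footnote 2). Here's a quick sanity check of that pattern, using the no-op `true` (the output file name is just an example):

```shell
# sanity-check the { time ...; } 2>> redirection with the no-op `true`
out=smoke-results.txt
rm -f "$out"
for i in 1 2 3; do
    echo "Run $i" >> "$out"
    { time true > /dev/null; } 2>> "$out"
    echo >> "$out"
done
grep -c '^real' "$out"
# prints 3: one timing block per run
```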
Step 2: Getting Some Basic Statistics with AWK: Count, Total, Average, Minimum, Maximum, and Sample Standard Deviation
The fun thing about AWK is that it makes it very easy to implement the three steps of a common data processing pattern:
- First you do some initial setup at the beginning.
- Then you do something to every data ‘record’ (every row in the table, every line in the file, etc.).
- Finally, when you’re done processing all the data, you print your results.
Depending on what you're doing, steps 1 and 3 might be optional. And if you just want to run a program without processing data record-by-record, you could put it in step 1 (in AWK, the `BEGIN {}` section) or step 3 (`END {}`). Simple, right?
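As a tiny illustration of the three-step pattern (not part of the timing toolset), here's a one-liner that averages the numbers on standard input:

```shell
# the three-step pattern in miniature: average the numbers on stdin
printf '4\n8\n6\n' | awk '
BEGIN { print "averaging" }  # step 1: setup
      { total += $1 }        # step 2: runs once per record
END   { print total / NR }   # step 3: report results
'
# prints:
# averaging
# 6
```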
The AWK script below processes every record and then prints some basic statistics.3
```awk
#!/usr/bin/env awk -f
################################################################################
# Input: A file with output from multiple runs of the time command in a format such as:
#   real 0m29.352s
#   user 0m17.516s
#   sys  0m8.003s
# Output: Basic statistics (count, total, average, minimum, maximum, sample standard deviation) across all runs
#
# Written after reading http://www.grymoire.com/Unix/Awk.html
#
# Changelog
# 2015-11-27 v5 Changed approach for determining minimum and maximum values to further reduce memory usage and improve performance. Run time over a million records is currently 22 seconds real time with total memory usage of around 300 MB. Removed pre-declaration of other variables in BEGIN, as this no longer results in reduced memory usage.
# 2015-11-25 v4 Pre-declaring variables in BEGIN to reduce memory usage by between 43 and 48 MB over a million records with no other changes.
# 2015-11-21 v3 Changed calculation of SSD to use stored averages, improving run time over 1 million records by about 3.8 seconds (29.679 to 25.845 seconds real time).
# 2015-11-18 v2 Improved parsing of times to reduce memory usage by about 136MB over 1 million sets of records (~3 million records total, 395 MB vs 531 MB as reported by top MEM column on OS X)
# 2015-10-12 v1 Initial version
################################################################################
BEGIN {
    # Pre-declare minimum and maximum values. 2^53 is the largest integer we can accurately represent with AWK's IEEE 64-bit doubles.
    # Rather than pre-declaring fixed minimum and maximum values, a cleaner approach would simply add an inline test for whether min[$1] or max[$1] have already been set (e.g. min[$1] == ""). While I found that this produced accurate results, it seemed to introduce a memory leak.
    min["real"] = 2^53;
    min["user"] = 2^53;
    min["sys"]  = 2^53;
    max["real"] = 0;
    max["user"] = 0;
    max["sys"]  = 0;
}
# For lines with 2 fields where the first field contains real, user, or sys. (Note that the Number of Fields check is not strictly necessary.)
$1 ~ /(real|user|sys)/ && NF == 2 {
    # Extract times in seconds and add to total real/user/sys run time
    m_loc = index($2, "m");
    s_loc = index($2, "s");
    minutes_only = substr($2, 1, m_loc - 1);
    seconds_only = substr($2, m_loc + 1, s_loc - (m_loc + 1));
    seconds = (60 * minutes_only) + seconds_only;
    total_time[$1] += seconds;
    # Count the number of times we've encountered real/user/sys. If the numbers for each don't end up the same we may have a problem
    lines[$1]++;
    # Calculate max & min
    if (seconds < min[$1]) {
        min[$1] = seconds;
    }
    if (seconds > max[$1]) {
        max[$1] = seconds;
    }
    # Add number of seconds to a (simulated) multi-dimensional array so we can calculate standard deviation later
    all_times[$1 "," lines[$1]] = seconds;
}
END {
    print "-----Runs-----"
    for (i in lines) {
        print i, lines[i];
    }
    print "\n-----Total-----"
    for (i in total_time) {
        print i, total_time[i];
    }
    print "\n-----Average-----"
    for (i in lines) {
        average[i] = total_time[i] / lines[i];
        print i, average[i];
    }
    print "\n-----Min/Max-----"
    for (i in lines) {
        print i, "\t" min[i], "\t", max[i];
    }
    print "\n-----Sample Standard Deviation-----"
    for (i in lines) {
        for (j = 1; j <= lines[i]; j++) {
            # SSD: Sum of Squared Differences
            ssd[i] += (all_times[i "," j] - average[i]) ^ 2
        }
    }
    for (i in lines) {
        # Sample rather than population standard deviation: subtract 1 from the number of samples ("Bessel's correction")
        print i, sqrt(ssd[i] / (lines[i] - 1));
    }
}
```
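The minutes-and-seconds parsing in the main rule can be sanity-checked in isolation. A quick sketch, feeding the same `index`/`substr` approach a made-up value:

```shell
# parse a made-up time(1) value into seconds
echo "2m3.5s" | awk '{
    m = index($1, "m"); s = index($1, "s")
    print 60 * substr($1, 1, m - 1) + substr($1, m + 1, s - m - 1)
}'
# prints 123.5
```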
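The sample-standard-deviation math can likewise be spot-checked against a tiny data set where the answer is known: for 1, 2, 3 the mean is 2, the sum of squared differences is 2, and sqrt(2/2) = 1.

```shell
# spot-check the stddev math: for 1, 2, 3 the sample stddev is exactly 1
printf '1\n2\n3\n' | awk '
{ total += $1; vals[NR] = $1 }
END {
    avg = total / NR
    for (j = 1; j <= NR; j++) ssd += (vals[j] - avg) ^ 2
    print sqrt(ssd / (NR - 1))  # divide by NR-1 (Bessel correction)
}'
# prints 1
```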
So how long does it take to run?
On my MacBook, about 22 seconds (46,004 records/second) on battery power or 25 seconds (39,765 records/second) on wall power. Since we're storing data in memory to compute sample standard deviation, over a million records it uses around 300 MB. 4 If we don't calculate sample standard deviation, memory usage is constant at just 516 KB and the script runs in 14 seconds. While these numbers might not seem particularly fast, keep in mind that 1) they include time to load and parse raw textual data and 2) as written, the computation is bound to a single CPU core. Altogether…
…pretty awking awesome.
- Interestingly, 'real' is not necessarily the sum of 'user' and 'sys'. For example, if a command launches two processes (running in parallel on different cores of a multi-core system), the two processes together could do 20 seconds' worth of work in only 10 seconds of wall-clock time. To see this in practice, note the effect of launching multiple parallel processes with `xargs`. [return]
- A few notes: 1. More about `{}` and `()` in `bash`: Grouping Commands. 2. `time` sends its output to the standard error stream. This StackOverflow question describes how to capture its output. [return]
- A few notes: 1. On OS X it's possible to run the script with `./timestats` _input file_. On other systems, you may need to invoke it as `awk -f timestats` _input file_. 2. At one point I split the code to extract seconds from the input into a separate function. While this did improve the clarity and reusability of the code slightly, for purely stylistic reasons I ultimately decided to stick with inline code (to make this look more like a simple one-off script and less like a real program). [return]
- Using the "internal memory size" metric on the OS X version of `top`, monitored by another super-simple performance testing `bash` script ;-) [return]