AWK for the Most Minimal of Performance Testing Toolsets!
`awk` is a traditional Unix utility that I never had much use for. _So you say it processes data line-by-line? In its own unique domain-specific language? Even simple one-liners run from the shell tend to have quotes (`""`), square brackets (`[]`), and curly braces (`{}`)? And have you heard how local variables are declared in functions?!?_ It seemed quaint, complex, and distinctly old-school: the kind of command you might use before piping output to your 9-pin dot-matrix printer with `| pr | opr`. Oh, and the latest edition of O'Reilly's Effective awk Programming was published June 2, 2001. Ya.
All that was before I stumbled across Adam Drake's post Command-line tools can be 235x faster than your Hadoop cluster, read Bruce Barnett's easy-to-follow awk tutorial, and wrote my first real awk program: a quick script to get basic statistics from multiple runs of the `time` command. Together with the `bash` script below for timing multiple runs of a program, this gives us the start of a super-minimal performance testing toolset. Perfect for when you want to know, say, the average running time across 80 runs of a script.
Step 1: Timing Multiple Runs of a Script
The `time` command produces results in the format:
```
real    0m12.667s
user    0m8.240s
sys     0m3.411s
```
This gives us the total amount of time elapsed while running a command (‘real’), the amount of time spent in user-land code (‘user’), and the amount of time spent on system calls (‘sys’). 1
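For a quick illustration, a command that mostly sleeps accumulates 'real' time but almost no CPU time. A minimal sketch (capturing `time`'s output to a scratch file, since it goes to standard error; the file name here is arbitrary):

```shell
# 'sleep 1' burns ~1 second of wall-clock time but almost no CPU
{ time sleep 1 > /dev/null; } 2> sleep-timing.txt
cat sleep-timing.txt
# real is about 1 second; user and sys are near zero
```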
To get results over multiple runs, we have the super-simple `bash` script below. It takes the name of a command to run, runs it (serially) an arbitrary number of times, and appends the output to a specified text file:2
```bash
#!/usr/bin/env bash
# Usage: ./tester [number of times to run] [utility to run] [output file]
# Ex:    ./tester 20 ./test1 test1results
i=1
while [ "$i" -le "$1" ]; do
    echo "Run $i"
    echo "Run $i" >> "$3"
    { time $2 > /dev/null; } 2>> "$3"
    let i=i+1
    echo >> "$3"
done
```
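The `{ time …; } 2>>` grouping is what captures `time`'s output, since it writes to standard error (see footnote 2). Here's a quick sanity check of that pattern, using the no-op `true` (the output file name is just an example):

```shell
# sanity-check the { time ...; } 2>> redirection with the no-op `true`
out=smoke-results.txt
rm -f "$out"
for i in 1 2 3; do
    echo "Run $i" >> "$out"
    { time true > /dev/null; } 2>> "$out"
    echo >> "$out"
done
grep -c '^real' "$out"
# prints 3: one timing block per run
```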
Step 2: Getting Some Basic Statistics with AWK: Count, Total, Average, Minimum, Maximum, and Sample Standard Deviation
The fun thing about AWK is that it makes it very easy to implement the three steps of a common data processing pattern:
- First you do some initial setup at the beginning.
- Then you do something to every data ‘record’ (every row in the table, every line in the file, etc.).
- Finally, when you’re done processing all the data, you print your results.
Depending on what you're doing, steps 1 and 3 might be optional. And if you just want to run a program without processing data record-by-record, you could put it in step 1 (in AWK, the `BEGIN {}` section) or step 3 (`END {}`). Simple, right?
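As a tiny illustration of the three-step pattern (not part of the timing toolset), here's a one-liner that averages the numbers on standard input:

```shell
# the three-step pattern in miniature: average the numbers on stdin
printf '4\n8\n6\n' | awk '
BEGIN { print "averaging" }  # step 1: setup
      { total += $1 }        # step 2: runs once per record
END   { print total / NR }   # step 3: report results
'
# prints:
# averaging
# 6
```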
The AWK script below processes every record and then prints some basic statistics.3
```awk
#!/usr/bin/env awk -f
################################################################################
# Input: A file with output from multiple runs of the time command in a format such as:
#   real 0m29.352s
#   user 0m17.516s
#   sys  0m8.003s
# Output: Basic statistics (count, total, average, minimum, maximum, sample standard deviation) across all runs
#
# Written after reading http://www.grymoire.com/Unix/Awk.html
#
# Changelog
# 2015-11-27 v5 Changed approach for determining minimum and maximum values to further reduce memory usage and improve performance. Run time over a million records is currently 22 seconds real time with total memory usage of around 300 MB. Removed pre-declaration of other variables in BEGIN, as this no longer results in reduced memory usage.
# 2015-11-25 v4 Pre-declaring variables in BEGIN to reduce memory usage by between 43 and 48 MB over a million records with no other changes.
# 2015-11-21 v3 Changed calculation of SSD to use stored averages, improving run time over 1 million records by about 3.8 seconds (29.679 to 25.845 seconds real time).
# 2015-11-18 v2 Improved parsing of times to reduce memory usage by about 136MB over 1 million sets of records (~3 million records total, 395 MB vs 531 MB as reported by top MEM column on OS X)
# 2015-10-12 v1 Initial version
################################################################################
BEGIN {
    # Pre-declare minimum and maximum values. 2^53 is the largest integer we can accurately represent with AWK's IEEE 64-bit doubles.
    # Rather than pre-declaring fixed minimum and maximum values, a cleaner approach would simply add an inline test for whether min[$1] or max[$1] have already been set (e.g. min[$1] == ""). While I found that this produced accurate results, it seemed to introduce a memory leak.
    min["real"] = 2^53;
    min["user"] = 2^53;
    min["sys"]  = 2^53;
    max["real"] = 0;
    max["user"] = 0;
    max["sys"]  = 0;
}
# For lines with 2 fields where the first field contains real, user, or sys. (Note that the Number of Fields check is not strictly necessary.)
$1 ~ /(real|user|sys)/ && NF == 2 {
    # Extract times in seconds and add to total real/user/sys run time
    m_loc = index($2, "m");
    s_loc = index($2, "s");
    minutes_only = substr($2, 1, m_loc - 1);
    seconds_only = substr($2, m_loc + 1, s_loc - (m_loc + 1));
    seconds = (60 * minutes_only) + seconds_only;
    total_time[$1] += seconds;
    # Count the number of times we've encountered real/user/sys. If the numbers for each don't end up the same we may have a problem
    lines[$1]++;
    # Calculate max & min
    if (seconds < min[$1]) {
        min[$1] = seconds;
    }
    if (seconds > max[$1]) {
        max[$1] = seconds;
    }
    # Add number of seconds to a (simulated) multi-dimensional array so we can calculate standard deviation later
    all_times[$1 "," lines[$1]] = seconds;
}
END {
    print "-----Runs-----"
    for (i in lines) {
        print i, lines[i];
    }
    print "\n-----Total-----"
    for (i in total_time) {
        print i, total_time[i];
    }
    print "\n-----Average-----"
    for (i in lines) {
        average[i] = total_time[i] / lines[i];
        print i, average[i];
    }
    print "\n-----Min/Max-----"
    for (i in lines) {
        print i, "\t" min[i], "\t", max[i];
    }
    print "\n-----Sample Standard Deviation-----"
    for (i in lines) {
        for (j = 1; j <= lines[i]; j++) {
            # SSD: Sum of Squared Differences
            ssd[i] += (all_times[i "," j] - average[i]) ^ 2
        }
    }
    for (i in lines) {
        # Sample rather than population standard deviation: subtract 1 from the number of samples ("Bessel's correction")
        print i, sqrt(ssd[i] / (lines[i] - 1));
    }
}
```
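The minutes-and-seconds parsing in the main rule can be sanity-checked in isolation. A quick sketch, feeding the same `index`/`substr` approach a made-up value:

```shell
# parse a made-up time(1) value into seconds
echo "2m3.5s" | awk '{
    m = index($1, "m"); s = index($1, "s")
    print 60 * substr($1, 1, m - 1) + substr($1, m + 1, s - m - 1)
}'
# prints 123.5
```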
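The sample-standard-deviation math can likewise be spot-checked against a tiny data set where the answer is known: for 1, 2, 3 the mean is 2, the sum of squared differences is 2, and sqrt(2/2) = 1.

```shell
# spot-check the stddev math: for 1, 2, 3 the sample stddev is exactly 1
printf '1\n2\n3\n' | awk '
{ total += $1; vals[NR] = $1 }
END {
    avg = total / NR
    for (j = 1; j <= NR; j++) ssd += (vals[j] - avg) ^ 2
    print sqrt(ssd / (NR - 1))  # divide by NR-1 (Bessel correction)
}'
# prints 1
```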
So how long does it take to run?
On my MacBook, about 22 seconds (46,004 records/second) on battery power or 25 seconds (39,765 records/second) on wall power. Since we're storing data in memory to compute sample standard deviation, over a million records it uses around 300 MB. 4 If we don't calculate sample standard deviation, memory usage is constant at just 516 KB and the script runs in 14 seconds. While these numbers might not seem particularly fast, keep in mind that 1) they include time to load and parse raw textual data and 2) as written, the computation is bound to a single CPU core. Altogether…
…pretty awking awesome.
- Interestingly, 'real' is not necessarily the sum of 'user' and 'sys'. For example, if a command launches two processes (running in parallel on different cores of a multi-core system), the two processes together could do 20 seconds' worth of work in only 10 seconds of wall-clock time. To see this in practice, note the effect of launching multiple parallel processes with `xargs`. [return]
- A few notes: 1. More about `{}` and `()` in `bash`: Grouping Commands. 2. `time` sends its output to the standard error stream. This StackOverflow question describes how to capture its output. [return]
- A few notes: 1. On OS X it's possible to run the script with `./timestats` _input file_. On other systems, you may need to invoke it as `awk -f timestats` _input file_. 2. At one point I split the code to extract seconds from the input into a separate function. While this did improve the clarity and reusability of the code slightly, for purely stylistic reasons I ultimately decided to stick with inline code (to make this look more like a simple one-off script and less like a real program). [return]
- Using the "internal memory size" metric on the OS X version of `top`, monitored by another super-simple performance testing `bash` script ;-) [return]