Making It Faster with Xargs

xargs is a neat little tool that takes lines from standard input and turns them into arguments for another command. Even better, it can run those commands in parallel, as in "Command-line tools can be 235x faster than your Hadoop cluster". So what happens if we use xargs to parallelize the timestamp-adjusting script from my last post?
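If you haven't run into xargs before, here's a toy example of the idea (the file names and the echo command are placeholders, not part of the real script):

    # xargs turns each input line into an argument for the command that follows;
    # -n1 passes one argument per invocation, and -P4 allows up to 4 invocations
    # to run at the same time.
    printf '%s\n' a.JPG b.JPG c.JPG | xargs -n1 -P4 echo processing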

Low expectations

There is little reason to think that parallelization would dramatically improve the run time of this script on my Early 2015 MacBook—the script intuitively seems like it’s more likely to be IO-bound than CPU-bound, and the Intel Core M in the MacBook only has 2 CPU cores to work with anyway.
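For what it's worth, a quick way to check those numbers on a Mac is sysctl, which reports both physical and logical core counts:

    # Physical cores, then logical (hyper-threaded) cores
    sysctl -n hw.physicalcpu hw.logicalcpu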

And yet….

…over 80 runs,1 the parallelized script ran in half the time on average—11.65 seconds vs. 25.88 seconds:

Original Script (times in seconds)

        Average   Min       Max       Sample Std. Dev.
real    25.8788   24.279    30.322    1.12801
user    16.4054   15.406    19.596    0.713768
sys     6.8974    6.433     7.945     0.288322

Parallelized Script (-P4, times in seconds)

        Average   Min       Max       Sample Std. Dev.
real    11.6503   10.85     12.539    0.454745
user    24.0593   23.677    25.046    0.272982
sys     10.1135   9.891     10.498    0.151753
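For anyone who wants to collect numbers like these, a bare-bones approach is to wrap each run in bash's time builtin and log its output. This is just a sketch, not my actual harness, and original.sh and parallel.sh are placeholder names:

    # Run each script 20 times and append the real/user/sys lines to a log file
    for script in ./original.sh ./parallel.sh; do
        for i in $(seq 1 20); do
            { time "$script"; } 2>> "times-$(basename "$script").log"
        done
    done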

So how did I parallelize it?

  1. Create a version of my original script that operates on a single file:

    #!/usr/bin/env bash
    # Offset is approximately 17 hours 17 minutes
    NEWDATE=$(date -j -v+17H -v+17M -f "%m/%d/%Y %H:%M:%S" "$(GetFileInfo -m "$1")" +"%m/%d/%Y %H:%M:%S")
    echo "$1 - $NEWDATE"
    # Adjust modification time:
    SetFile -m "$NEWDATE" "$1"
    # Adjust creation date:
    SetFile -d "$NEWDATE" "$1"
    
  2. Dump the list of files to xargs and have it run the above script on every file. Use the -P4 option2 to run 4 instances in parallel:
    find . \( -type f -name '*.JPG' -or -name '*.MP4' \) -print0 | xargs -0 -n1 -P4 ./fix1
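For reference, here's the same pipeline again with each flag annotated (this is just a commented copy of the command above):

    # find: print matching paths NUL-separated (-print0) so names with spaces survive.
    #   (Because -and binds tighter than -or, -type f strictly only applies to the
    #   *.JPG branch; in practice that's unlikely to matter unless a directory name
    #   happens to end in .MP4.)
    # xargs: -0 reads NUL-separated input, -n1 passes one file per invocation,
    #   and -P4 keeps up to 4 copies of ./fix1 running at once.
    find . \( -type f -name '*.JPG' -or -name '*.MP4' \) -print0 | xargs -0 -n1 -P4 ./fix1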

Checking our script with dtrace

I have to admit—at one point while writing this post I created a script that was significantly faster…until I realized that I wasn't actually globbing all 337 files I wanted to work with properly. So how do we know that the faster, parallelized version of the script actually ran SetFile just as many times as the non-parallelized version? Well, with dtrace it's pretty easy to count:3

  1. sudo execsnoop -v > test1
  2. Run the test in another window
  3. grep SetFile test1 | wc (a grep -c variant is sketched below)
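Since fix1 calls SetFile twice per file (once for the modification time, once for the creation date), 337 files should show up as 674 matches. If you just want the count, grep -c prints it directly; the test2 name for a second capture file is only a placeholder:

    # Expecting 337 files x 2 SetFile calls = 674 matching lines per capture
    grep -c SetFile test1
    grep -c SetFile test2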

In this case, both scripts result in the same number of SetFile runs. :-)

So what did I learn?

  • As a tool for quickly parallelizing a shell script, xargs is pretty awesome.
  • Without testing, it’s sometimes not immediately obvious whether a script will benefit from parallelization.
  • Refactoring a script based on a for loop into something parallelizable with find and xargs was super simple. So simple, in fact, that I'll probably use the [find + xargs + simple script] pattern (sketched below) from the beginning next time.
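For reference, the generic shape of that pattern with placeholder names (do-one.sh and the *.ext glob are stand-ins, not anything from my actual scripts):

    #!/usr/bin/env bash
    # do-one.sh: do whatever work is needed for exactly one file, passed in as "$1"
    echo "processing $1"   # placeholder for the real per-file work

    # Then let find build the file list and xargs fan it out across 4 processes:
    find . -type f -name '*.ext' -print0 | xargs -0 -n1 -P4 ./do-one.sh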

  1. 20 runs of each script on both battery and wall power, although power source turned out not to have much of an impact on performance. [return]
  2. Given that the machine I'm running this on only has 2 CPU cores (4 logical cores with hyperthreading), one might wonder why I chose to run 4 processes in parallel. The answer is simple—in manual testing of a few different levels of parallelization (-P2, -P4, -P350), I found that -P4 seemed to result in the fastest total execution times. For a fascinating discussion of how many threads a program should launch to process a given number of units of work, I highly recommend John Siracusa's discussion of Grand Central Dispatch in his review of Mac OS X 10.6 Snow Leopard. [return]
  3. In recent versions of OS X you'll have to turn off System Integrity Protection to get this dtrace command to work. Do so at your own risk. [return]