# Making It Faster with Xargs

`xargs` is a neat little tool that takes lines from standard input and turns them into the arguments for another command. Even better, it can run those commands in parallel, as in "Command-line tools can be 235x faster than your Hadoop cluster." So what happens if we use `xargs` to parallelize the timestamp-adjusting script from my last post?
## Low expectations
There is little reason to think that parallelization would dramatically improve the run time of this script on my Early 2015 MacBook—the script intuitively seems like it’s more likely to be IO-bound than CPU-bound, and the Intel Core M in the MacBook only has 2 CPU cores to work with anyway.
And yet….
…over 80 runs,[^1] the parallelized script ran in half the time on average: 11.65 seconds vs. 25.88 seconds:
### Original Script

|      | Average (s) | Min (s) | Max (s) | Sample Standard Deviation (s) |
|------|-------------|---------|---------|-------------------------------|
| real | 25.8788     | 24.279  | 30.322  | 1.12801                       |
| user | 16.4054     | 15.406  | 19.596  | 0.713768                      |
| sys  | 6.8974      | 6.433   | 7.945   | 0.288322                      |
### Parallelized Script (`-P4`)

|      | Average (s) | Min (s) | Max (s) | Sample Standard Deviation (s) |
|------|-------------|---------|---------|-------------------------------|
| real | 11.6503     | 10.85   | 12.539  | 0.454745                      |
| user | 24.0593     | 23.677  | 25.046  | 0.272982                      |
| sys  | 10.1135     | 9.891   | 10.498  | 0.151753                      |
## So how did I parallelize it?

1. Create a version of my original script that operates on a single file:

    ```bash
    #!/usr/bin/env bash
    # Offset is approximately 17 hours 17 minutes
    NEWDATE=$(date -j -v+17H -v+17M -f "%m/%d/%Y %H:%M:%S" "$(GetFileInfo -m "$1")" +"%m/%d/%Y %H:%M:%S")
    echo "$1" - "$NEWDATE"
    # Adjust modification time:
    SetFile -m "$NEWDATE" "$1"
    # Adjust creation date:
    SetFile -d "$NEWDATE" "$1"
    ```
2. Dump the list of files to `xargs` and have it run the above script on every file. Use the `-P4` option to run 4 instances in parallel:[^2]

    ```bash
    find . \( -type f -name '*.JPG' -or -name '*.MP4' \) -print0 | xargs -0 -n1 -P4 ./fix1
    ```
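To see how those flags interact, here's a minimal, self-contained demonstration of the same pattern in a throwaway directory. The file names are made up, and `echo` stands in for the real per-file script. `-print0` and `-0` keep file names with spaces intact, `-n1` passes one file per invocation, and `-P4` keeps up to four processes running at once:

```bash
# Toy demonstration of the find | xargs pattern (hypothetical file names).
demo=$(mktemp -d)
cd "$demo"
touch "a.JPG" "b with space.JPG" "c.MP4" "notes.txt"

# -print0/-0 survive spaces in names; -n1 = one file per run; -P4 = 4 at a time.
find . \( -type f -name '*.JPG' -or -name '*.MP4' \) -print0 \
  | xargs -0 -n1 -P4 echo
# notes.txt doesn't match, so echo runs three times (in no particular order).
```

Because `-P4` runs invocations concurrently, the output order is not deterministic, which is fine here since each file is processed independently.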
## Checking our script with `dtrace`

I have to admit: at one point while writing this post I created a script that was significantly faster… until I realized that I wasn't actually globbing all 337 files I wanted to work with. So how do we know that the faster, parallelized version of the script actually ran `SetFile` just as many times as the non-parallelized version? Well, with `dtrace` it's pretty easy to count:[^3]

[^3]: In recent versions of OS X you'll have to turn off System Integrity Protection to get this `dtrace` command to work. Do so at your own risk.
1. Start tracing new processes: `sudo execsnoop -v > test1`
2. Run the test in another window.
3. Count the invocations: `grep SetFile test1 | wc -l`

In this case, both scripts report the same number of runs of `SetFile`. :-)
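If you'd rather not disable System Integrity Protection, a cruder cross-check is possible without `dtrace`: shadow `SetFile` with a counting wrapper earlier on `PATH`, so every invocation appends a line to a log file. This is a sketch of the idea, not what I actually used (the wrapper name and log path are just for illustration):

```bash
# Sketch: count SetFile invocations by shadowing it on PATH.
shim=$(mktemp -d)
cat > "$shim/SetFile" <<'EOF'
#!/bin/sh
# Log every invocation and its arguments; do nothing else.
echo "$@" >> /tmp/setfile.log
EOF
chmod +x "$shim/SetFile"

: > /tmp/setfile.log                 # truncate the log
PATH="$shim:$PATH" ./fix1 some.JPG   # run the script under test
wc -l < /tmp/setfile.log             # with fix1 above: two lines per file
```

The obvious downside is that the wrapper swallows the real `SetFile` calls, so this is only useful for counting, not for verifying the timestamps actually changed.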
## So what did I learn?

- As a tool for quickly parallelizing a shell script, `xargs` is pretty awesome.
- Without testing, it's sometimes not immediately obvious whether a script will benefit from parallelization.
- Refactoring a script based on a `for` loop into something parallelizable with `find` and `xargs` was super simple. So simple, in fact, that I'll probably use the [`find` + `xargs` + simple script] pattern from the beginning next time.
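For reference, the serial shape being replaced looks roughly like this. This is a sketch; the actual loop from the last post may differ in details:

```bash
#!/usr/bin/env bash
# Serial sketch: process each matching file one at a time, no parallelism.
# nullglob makes the loop body skip entirely when nothing matches.
shopt -s nullglob
for f in *.JPG *.MP4; do
  ./fix1 "$f"
done
```

The refactor to `find | xargs` is mechanical precisely because the loop body was already a self-contained command taking one file argument.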
[^1]: 20 runs of each script on both battery and wall power, although power source turned out not to have much of an impact on performance.
[^2]: Given that the machine I'm running this on only has 2 CPU cores (4 logical cores with Hyper-Threading), one might wonder why I chose to run 4 processes in parallel. The answer is simple: in manual testing of a few different levels of parallelization (`-P2`/`-P4`/`-P350`), I found that `-P4` seemed to result in the fastest total execution times. For a fascinating discussion of how many threads a program should launch to process a given number of units of work, I highly recommend John Siracusa's discussion of Grand Central Dispatch in his review of Mac OS X 10.6 Snow Leopard.