Sorting Files for Better Compression

In the Unix-like world, the common equivalent of a .ZIP file is a compressed tar archive: .tar.gz. As the extension implies, there is a clear separation between combining multiple files into a single file (the archiving) and the space-saving compression of that file (say, with gzip).

Over the past several decades the most notable benefit of this separation has been the smooth introduction of better compression algorithms, enabling a transition from .tar.Z to .tar.gz to .tar.bz2 to .tar.xz and .tar.lz today. While LHA, LZH, ZIP, RAR, and 7z all emerged as formats tied to specific software implementations, the basic command to extract a .tar.anything file hasn’t changed in decades:

anything -dc file.tar.anything | tar -xvf -
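
For instance, swapping in concrete tools gives the familiar forms (the filenames here are just placeholders):

gzip -dc file.tar.gz | tar -xvf -
xz -dc file.tar.xz | tar -xvf -
lzip -dc file.tar.lz | tar -xvf -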

A lesser-known advantage of separating archiving and compression is the ability to easily change the order in which files appear in the archive itself. While this might not sound terribly useful, given an archive with many files containing the same or very similar data it can actually lead to dramatically improved compression performance. For the use case I describe here, I was able to improve xz compression performance by 72 percentage points (from roughly 88% of the original size down to under 17%) beyond what was possible with just the -9 and --extreme options.

My Use Case

I edit the Harajuku Data Lake podcast with Audacity on my MacBook.

Editing an episode generally takes 2-3 hours for a one hour show, and as I edit I continually save copies of my work. For example, in the course of editing the most recent episode (Christine Kawano Usyak: Leadership Musings and Where you…from?), I saved 15 copies totaling 19 GB:1

$ du -h -d1
1.5G	./edit 1 (tracks split, compressor on me, noise reduction on both)_data
1.4G	./edit 1 (to 3-00)_data
1.4G	./edit 1 (to 11-00)_data
1.4G	./edit 1 (to 19-00)_data
1.4G	./edit 1 (to 32-00)_data
1.4G	./edit 1 (to 38-20)_data
1.4G	./edit 1 (to 39-20, audio re-synced)_data
1.3G	./edit 1 (to 45-10)_data
1.3G	./edit 1 (to 49-50)_data
1.3G	./edit 1 (to 58-50)_data
1.3G	./edit 1 (to 1-02-56)_data
1.3G	./edit 1 (to end)_data
1.4G	./edit 2 (noise filters)_data
633M	./edit 3 (truncate silences to 500ms)_data
635M	./edit 4 (add theme song)_data
19G	.
$

This might seem like a lot of data, but given that the data in each save should be nearly identical, it represents an ideal case for compression. Unfortunately, naive compression with xz didn’t yield such great results.

$ time tar cvf - Episode\ 16 | xz -z -9 -e -T 0 -v -f - > Episode-16-unsorted.tar.xz

Not only was it incredibly slow, but at 16.68 GB the resulting file was still 88.3% of its original size!

Pre-sorting Files

Each File→Save As in Audacity produces a ‘project file’ with a .aup extension and a corresponding directory filled with many, many .au files containing the actual audio data.

$ find edit\ 1\ \(to\ 3-00\)_data | wc
    1416    5664   59352
$ find edit\ 1\ \(to\ 3-00\)_data | head
edit 1 (to 3-00)_data
edit 1 (to 3-00)_data/e00
edit 1 (to 3-00)_data/e00/d12
edit 1 (to 3-00)_data/e00/d12/e00125a2.au
edit 1 (to 3-00)_data/e00/d12/e00128f2.au
edit 1 (to 3-00)_data/e00/d12/e0012581.au
edit 1 (to 3-00)_data/e00/d12/e0012894.au
edit 1 (to 3-00)_data/e00/d12/e0012ba6.au
edit 1 (to 3-00)_data/e00/d12/e0012797.au
edit 1 (to 3-00)_data/e00/d12/e00122f8.au
...
...
...

By default, if we add all the edit directories to a tar archive with tar cvf - Episode\ 16 > Episode\ 16.tar, each edit will be added to the archive sequentially. Audio files from the first edit will all appear together, and repeated chunks of data will be spaced far apart. This makes life hard for a dictionary coder–in the case of gzip repeated patterns must be less than 32KB apart, and even on its -9 setting xz requires repeated patterns to be less than 64MB apart.2
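
To see gzip’s 32KB limit in action (along the same lines as the xz test in the footnote), try repeating a random block at two different distances; the filenames and sizes below are purely illustrative:

dd if=/dev/urandom of=block16KB bs=16384 count=1
cat block16KB block16KB | gzip -9 | wc -c    # roughly 16 KB: the repeat falls inside the 32 KB window
dd if=/dev/urandom of=block64KB bs=65536 count=1
cat block64KB block64KB | gzip -9 | wc -c    # roughly 128 KB: the repeat is too far back to be found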

But what if we put all similar files next to each other in the tar file? Since each .au file is around 1MB in size, repeated data would still be too far apart for de-duplication by gzip, but xz’s larger dictionary size gives us plenty of leeway. So before adding files to the archive, let’s sort them by their hash:

time find Episode\ 16 -not -path '*/\.*' -not -type d -print0 |
xargs -0 -I file md5 file |
rev | sort | rev | cut -c 6- | rev | cut -c 37- | rev |
tr '\n' '\0' |
tar cvf - --null -T - |
xz -z -9 -e -T 0 -v -f - > Episode-16-sorted.tar.xz

A few annotations:

  • time How long did this take?
  • find Episode\ 16 -not -path '*/\.*' -not -type d -print0 Get a list of everything that isn’t a directory (excluding hidden dot-files), separated by null characters as a courtesy to xargs.
  • xargs -0 -I file md5 file Get MD5 hashes of each file.3
  • rev | sort | rev Our first column contains filenames (keys). The second column contains hashes (values). We want to put all keys (files) with the same value (hash) next to each other in our list. So we reverse all the characters in each line and sort that, then reverse that again. This is a surprisingly common pattern.
  • cut -c 6- | rev | cut -c 37- | rev Get just the filename by cutting off the extra characters on both ends. If you’re on Linux and using md5sum instead of md5, you can probably do with just a single cut -c 35- (see the sketch after this list).
  • tr '\n' '\0' Just the filenames, separated by null characters instead of hard returns.
  • > ../Episode-16-files-sorted We’re piping the output straight to tar in the next step, but if you want to inspect the sorted list first you can redirect it to a file like this instead.
  • tar cvf - --null -T - Create the actual archive, using the list of files piped in.
  • xz -z -9 -e -T 0 -v -f - > Episode-16-sorted.tar.xz Compress the archive with xz.
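
For reference, on a Linux box with GNU tools the same idea needs less cutting, since md5sum already prints the hash first ("hash  filename") and a single cut recovers the filename. A sketch along those lines (untested on my part):

find Episode\ 16 -not -path '*/\.*' -not -type d -print0 |  # all non-hidden, non-directory files
xargs -0 md5sum |                                           # "hash  filename", one line per file
sort |                                                      # identical hashes end up adjacent
cut -c 35- |                                                # drop the 32-character hash and two spaces
tr '\n' '\0' |
tar cvf - --null -T - |
xz -z -9 -e -T 0 -v > Episode-16-sorted.tar.xz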

Not only was this hash-sorted pipeline dramatically faster than compressing files out of order, but the results were far better: a 3.12 GB file instead of a 16.68 GB file. In other words, a simple sort of the files in the archive saves well over 13 GB of space.

Can We Do Better in the Cloud?

Although my MacBook is surprisingly fast and responsive, I wondered if it wouldn’t be faster to run xz on a high-end cloud server–something with a bunch of Xeon cores, a ton of RAM, and a far higher power budget.

The simplest way to do this is to wrap the compression step in an ssh command that pipes data to a remote server:

cat myfile.tar |
ssh [email protected] "xz -z -9 -e -T 0 -v -f -" > myfile.tar.xz

This is simple and effective, but unfortunately not terribly fast. First, even with gigabit fiber, actual transfer speeds to various cloud providers were surprisingly slow. Peering is important, and not all cloud services are created equal.4 Second, a base model 2015 MacBook with a 6 watt processor is surprisingly competitive even against servers with many cores.

To get a sense of the speed of available cloud hardware, I compared the speed of a few cloud providers and a few local systems at just the compression step. For each provider I scp’d a compressed version of the file up, decompressed it, and then ran the compression command on the remote server. For local systems, I simply sent the uncompressed file over an ssh pipe.
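
Concretely, the per-system procedure looked something like this (hostnames and filenames are placeholders, and the remote end is assumed to be running bash):

# cloud systems: upload the compressed archive, decompress it, then time the compression remotely
scp Episode-16-sorted.tar.xz user@cloudbox:
ssh user@cloudbox "unxz Episode-16-sorted.tar.xz"
ssh user@cloudbox "time xz -z -9 -e -T 0 -v Episode-16-sorted.tar"

# local systems: send the uncompressed tar over an ssh pipe and time the whole pipeline
time cat Episode-16-sorted.tar |
ssh user@localbox "xz -z -9 -e -T 0 -v -f -" > Episode-16-sorted.tar.xz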

System                                                            Real Time     Output Size
Raspberry Pi 2 (900 MHz ARM quad-core Cortex-A7, 1GB RAM)5        881m7.118s    3.05 GB
Early 2009 iMac 24in (2.66 GHz Intel Core 2 Duo, 8GB RAM)5        47m18.658s    3.12 GB
Early 2008 Mac Pro (2 x 2.8 GHz Quad-Core Intel Xeon, 16GB RAM)   15m16.471s    3.12 GB
2015 MacBook (1.1 GHz Intel Core M, 8GB RAM)                      27m53.715s    3.12 GB
Early 2015 MacBook Pro (3.1GHz Intel Core i7, 16GB RAM)5          19m8.616s     3.12 GB
Vultr “200GB” (6 cores, 16GB RAM)                                 18m52.289s    3.12 GB
Digital Ocean “$160/mo” (8 cores, 16GB RAM)                       18m34.284s    3.12 GB
Linode “80GB” (20 cores, 80GB RAM)                                8m42.441s     3.12 GB
Amazon EC2 m4.16xlarge (64 cores, 256GB RAM)                      1m36.869s     3.12 GB

First, due to memory constraints, xz on the Raspberry Pi 2 only used a single thread for compression. This resulted in slightly improved compression performance, but also took much longer than expected as three of the system’s four cores sat idle for the duration of the compression.
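
Incidentally, that single-threaded result should be reproducible on a faster machine by forcing one thread: with a single thread xz compresses the stream as one block instead of splitting it into independent per-thread blocks, which is where the small improvement in ratio comes from. A sketch (the sorted tar filename is a placeholder):

cat Episode-16-sorted.tar | xz -z -9 -e -T 1 -v > Episode-16-sorted-single.tar.xz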

Second, while Vultr and DigitalOcean offer more powerful instances, I ran the tests above on the most powerful instances I was able to purchase with just a credit card and no special permissions. These restrictions on purchasing more powerful instances might seem reasonable if it weren’t for the fact that both Linode and Amazon provide significantly more powerful systems without requiring special approval. (Incidentally, a Dell R940 configured with specs similar to the EC2 m4.16xlarge seems to run north of $20,000.)

While it’s certainly possible to compress files in the cloud, in practice even a fast home Internet connection is not that fast, and setting aside the monster servers available from Amazon, even the fastest options available from most cloud providers turned out to not be nearly as quick as I had hoped.

Locality Sensitive Hashing

In the above examples I’m using an MD5 hash to place files that are exactly the same next to each other in the tar archive. While this is excellent for my use case, it fails for files that are different by even a single byte–indeed, this is a key property of a cryptographic hash. But what if there was a way to place files together that are similar but not perfectly identical? It turns out there is, using a technique called locality-sensitive hashing. Whereas a cryptographic hash function attempts to minimize collisions, an LSH function attempts to do the opposite–maximize collisions among similar inputs.

On Linux, the readily available utilities for LSH seem to be limited to simhash, which is perhaps unfortunate given research suggesting that MinHash algorithms perform better for binary data sets. In any case, a naive implementation of file sorting based on the output of simhash produced an archive that was comparable in size to one sorted by MD5 hash:

$ cat hasher
#!/usr/bin/env bash
echo "$(simhash "$1" 2>&1 |tr -d '\n' | xxd -p | tr -d '\n')" "$1"
time find Episode\ 16 -not -type d -print0 |
parallel -0 ./hasher {} |
sort | cut -d' ' -f2- |
tar cf - -T - |
xz -z -9 -e -T 0 -v -f - > Episode-16-lsh-sorted.tar.xz

(Note that I remove newlines from the output of simhash and convert it to hex to ease sorting.)

Unfortunately, this approach doesn’t work in all cases. Recompressing a recent version of the Linux kernel (4.14) produced a file that was about 8% larger than the same set of files compressed in directory order with no additional sorting. I suspect this is because the many source files in the Linux kernel are already naturally grouped by similarity due to the project’s directory structure, and because a simple alphabetical sort of the filtered and cleaned output of simhash is not a particularly effective approach.
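
For the curious, the kernel comparison amounted to something along these lines, assuming the 4.14 source tree is already unpacked into linux-4.14/ and the hasher script above is in the current directory:

# baseline: files in directory order
tar cf - linux-4.14 | xz -z -9 -e -T 0 -v > linux-4.14-unsorted.tar.xz

# simhash-sorted order
find linux-4.14 -not -type d -print0 |
parallel -0 ./hasher {} |
sort | cut -d' ' -f2- |
tar cf - -T - |
xz -z -9 -e -T 0 -v > linux-4.14-lsh-sorted.tar.xz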

If you know of a simple LSH hashing utility or have a suggestion of a better way to sort the hashes please let me know on Twitter or via e-mail!

Xz vs Lzip

One final question is whether xz is even the proper file format, especially given the pointed criticism of xz for archiving tasks by the lzip authors. While compression ratios were similar,6 in my testing I found lzip to be notably slower for compression. Decompression was faster with lzip, as the utility supports multi-threaded decompression. On a Linode ‘24GB’ (8 cores, 24GB RAM) box, taking as input a sorted tar file from a pipe:

Command              Real Time     Output Size
plzip -9v            23m20.688s    3.27 GB
xz -z -9 -e -T 0 -v  14m45.941s    3.12 GB
lzip -d              1m9.230s      18.90 GB
unxz                 6m27.365s     18.90 GB
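
Each compression row in the table boils down to a command of roughly this shape, with only the compressor swapped out (the input filename is a placeholder):

cat Episode-16-sorted.tar | plzip -9v > Episode-16-sorted.tar.lz
cat Episode-16-sorted.tar | xz -z -9 -e -T 0 -v > Episode-16-sorted.tar.xz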

Compression Results Table

Sort Order  Compression             Size      Relative Size
N/A         None                    18.90 GB  100%
Directory   xz                      16.68 GB  88.3%
MD5 Hash    xz                      3.12 GB   16.5%
MD5 Hash    plzip                   3.27 GB   17.3%
MD5 Hash    xz (single-threaded)    3.05 GB   16.1%
MD5 Hash    lzip (single-threaded)  3.11 GB   16.5%
LSH Hash    xz                      3.11 GB   16.5%

Final Thoughts

The tar utility was first introduced in January 1979. Compress dates to 1984, and gzip to 1992. Xz and lzip were both released in the late 2000s and remain under active development. Thanks to the Unix philosophy of small single-purpose utilities, we’re still moving forward on a foundation built nearly four decades ago, and this Capsela-like modularity makes it easy to remix basic components in new ways to solve specific problems.

In this case, modularity made it easy to try a few different ways of building tar archives to improve compression performance, and it made it simple to compare compression speed across half a dozen local and cloud systems. Best of all, it let me save a ton of disk space while still using a storage format that will be readable with completely standard tools for decades to come.

Updates

  1. 2018/1/22 Added results for a 2008 Mac Pro that I recently purchased for 29,800 JPY (about 269 USD at today’s exchange rates) at a local used computer shop.

  1. Throwing out all but the final 43 MB MP3 file would be a perfectly reasonable option. That said, 19 GB isn’t all that much data, and it would certainly be helpful to have if I wanted to re-edit a portion someday. [return]
  2. In the case of gzip, the underlying DEFLATE compressor uses a 32 KB sliding window, so duplicate patterns are only identified when they are at most 32 KB apart.

    Internally, the xz file format encodes data as a series of up to four filters. The most important for our purposes is the LZMA2 filter, which is a small extension over LZMA. LZMA supports dictionary sizes up to 4GB, so in theory repeated patterns could be up to 4GB apart. In practice, however, the -9 option uses a dictionary size of 64MB, and even with custom filter options the dictionary cannot be larger than 1.5GB. In short, repeated segments need to be separated by less than 64MB. A quick test to verify this:

      $ dd if=/dev/random of=random64MB bs=1m count=64
      $ cat random64MB random64MB >> random64MBx2
      $ dd if=/dev/random of=random65MB bs=1m count=65
      $ cat random65MB random65MB >> random65MBx2
      $ xz -9ve random64MBx2
        random64MBx2 (1/1)
          100 %        64.0 MiB / 128.0 MiB = 0.500   1.0 MiB/s       2:05
      $ xz -9ve random65MBx2
        random65MBx2 (1/1)
          100 %       130.0 MiB / 130.0 MiB = 1.000   1.0 MiB/s       2:07
      $ ls -lh random64MBx2.xz random65MBx2.xz
      -rw-r--r--  1 me  staff    64M 11 24 19:21 random64MBx2.xz
      -rw-r--r--  1 me  staff   130M 11 24 19:24 random65MBx2.xz
    
    

    While 64MB of random data repeated twice compresses perfectly to 64MB, 65MB of random data repeated twice is entirely uncompressible, requiring 130MB.

    [return]
  3. I thought calculating MD5 hashes might be a bottleneck, but I was wrong:

      cat /dev/zero | md5 -p | dd of=/dev/null # 197 MB/s
      tar cf - Episode\ 16/ | dd of=/dev/null # 106 MB/s
    

    In other words, a base model 2015 MacBook has no trouble calculating MD5s at 197 MB/sec, but my 2-year-old SSD (converted from HFS+ to APFS and currently at 83% capacity) can only read this particular directory of files at 106 MB/sec.

    [return]
  4. Iperf upstream to cloud services in Tokyo was generally a reasonable 500+ Mbps. Performance to data centers in Singapore and San Francisco, however, was sometimes as slow as 8-16Mbps. In one case a transfer from an AWS S3 bucket in the Tokyo region to a server in Singapore struggled to cross 1 Mbps, which I found almost shocking. [return]
  5. This test was conducted via SSH over a local network. [return]
  6. This is to be expected, since xz is essentially a container around LZMA2, which is itself a container for LZMA, whereas lzip is a container for LZMA. [return]