Data compression metrics for FMRI?

Has anyone done, or come across, metrics for different compression methods on standard FMRI / MRI / PET data?

I was thinking of comparisons of Deflate / zip, Zlib / gz, Lzma / Lzma2 / xz [1], Blosc, Bzip2 / bz2, szip (via libaec), LZ4, ZFP, ZStandard.

It would also be very useful to get some metrics for read / write speed for these same methods.

I’m betting we could do better than .gz, but I’m wondering how much better. And how important it is that we allow flexibility in compression methods in a data access API.
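To make the ask concrete, here is a minimal sketch of the kind of harness I have in mind (the file name is a placeholder, the levels are defaults rather than anything tuned, and it assumes the `zstandard` and `lz4` packages are installed):

```python
# Minimal compression-metric sketch: ratio and compression throughput
# for a handful of codecs on one file. File name is a placeholder.
import bz2, gzip, lzma, time

import lz4.frame
import zstandard

with open('sub-01_task-rest_bold.nii', 'rb') as f:
    raw = f.read()

codecs = {
    'gzip':    lambda d: gzip.compress(d, compresslevel=6),
    'bzip2':   lambda d: bz2.compress(d),
    'lzma/xz': lambda d: lzma.compress(d),
    'zstd':    lambda d: zstandard.ZstdCompressor(level=3).compress(d),
    'lz4':     lambda d: lz4.frame.compress(d),
}

for name, compress in codecs.items():
    t0 = time.perf_counter()
    out = compress(raw)
    dt = time.perf_counter() - t0
    print(f'{name:8s} ratio={len(raw) / len(out):5.2f} '
          f'speed={len(raw) / dt / 1e6:7.1f} MB/s')
```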

[1] Xz format inadequate for long-term archiving

@matthew.brett - i haven’t seen any on neuroimaging data. there are some nice blog posts on compression comparison (e.g., on genotype compression); indeed, doing a domain/data-specific evaluation would be great. also, chunk size is another meta-parameter in such a benchmark, as some format+compressor combinations can handle indexing into arbitrary locations.

i did this recently for some microscopy data (in both zarr and hdf5) and found that the optimal settings from those posts did not apply to the data i had at hand. in my use case blosc+zstd did significantly better than blosc+lz4.
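a minimal sketch of the kind of zarr comparison i mean (zarr v2 + numcodecs API; the chunk shape and compression level are illustrative, and you’d substitute a real volume for the placeholder array, since synthetic data won’t compress like real images):

```python
# compare blosc+zstd vs blosc+lz4 on a chunked 4D array
import numpy as np
import zarr
from numcodecs import Blosc

# placeholder 4D "timeseries" -- swap in a real image volume
data = np.zeros((64, 64, 40, 200), dtype=np.uint16)

for cname in ('zstd', 'lz4'):
    comp = Blosc(cname=cname, clevel=5, shuffle=Blosc.SHUFFLE)
    z = zarr.array(data, chunks=(64, 64, 40, 10), compressor=comp)
    print(cname, 'ratio:', z.nbytes / z.nbytes_stored)
```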

i would worry a little about read/write speeds, as those would depend heavily on the storage backend (nfs, lustre, rdma over IB, s3, etc.). of course, given a backend like a local nvme disk, one could look at relative i/o rates for different types of compression. but those relative rates may themselves differ across backends.

at this point, the data in openneuro, with its many different repetition times and voxel spacings, would be an easy and good source for such an evaluation.

another thing to keep in mind is how much MATLAB support exists.

Yes, the storage backend would definitely have to go into the metrics - at the very least a comparison between local and cloud, but ideally local fast, local slow, NFS / SMB / cloud.

I agree that Matlab etc. support is an issue - but a somewhat separate one. For example, we might find one ideal format for storage during analysis, another for archiving, and another quite different format for sharing - where Matlab access would be essential for the last, but not necessarily for the first two.

But the compression etc metrics would feed into our decision about whether it is worth having these separate formats. For example, if we found an excellent compression tradeoff that would only work for a format that Matlab can’t easily read, then we might be more tempted to have a different analysis and sharing format, with tools to convert between the two, and where Matlab folks might have their own preferred idempotent format for analysis.

Another consideration is random access. We obviously have indexed_gzip in Python, and similar tools exist for bzip2 and zstd, but if a compression format facilitates efficient random access by default, that would help adoption of a format in other languages. Containers with a built-in concept of chunking can mitigate the penalty for poor random access, though.
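For example, a rough indexed_gzip sketch (file name and offsets are made up):

```python
# Seek into a .nii.gz without decompressing everything before the
# target; indexed_gzip builds a seek index as it goes.
import indexed_gzip as igzip

with igzip.IndexedGzipFile('sub-01_bold.nii.gz') as f:
    f.seek(352)           # e.g. skip past a NIfTI-1 header
    chunk = f.read(8192)  # read a small window of voxel data
```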

WRT random access… there may be middle grounds where you can access, say, a slice in a volume, or a volume in a timeseries, or the timeseries of one voxel at random, without necessarily being able to trivially access any single point.

Data ordering and chunking do matter in uncompressed data as well, of course, but it is a larger consideration for compressed data.
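As a sketch of that chunking point: with a chunked HDF5 container, one volume of a timeseries can be read without decompressing the rest (names, shapes and chunking here are illustrative):

```python
# One chunk per timepoint means a single-volume read touches one chunk.
import h5py

with h5py.File('bold.h5', 'w') as f:
    f.create_dataset('data', shape=(64, 64, 40, 200), dtype='uint16',
                     chunks=(64, 64, 40, 1), compression='gzip')

with h5py.File('bold.h5', 'r') as f:
    vol = f['data'][..., 42]  # decompresses only the chunk for volume 42
```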

For raw data, CT scans are often 12-bit Bits Stored (0028,0101), while for MRI 16-bit is becoming increasingly common. For these datatypes, neighboring voxels show much less variability in the most significant byte than in the least significant byte. With the exception of BLOSC, this redundancy is not exploited by the compression formats you describe. See my comments here. Therefore, swizzling the data can have a dramatic impact. While most scientists prefer scalar values, scanner manufacturers use RGB triplets for derived perfusion and diffusion metrics, and these would also benefit from Analyze-style planar storage (RRR…RGGG…GBBB…B) versus NIfTI triplets (RGBRGB…). While the original question was regarding MRI/PET, this issue is also seen for indexed triangle meshes (e.g. GIfTI).
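As a toy illustration of the swizzling idea (a quick sketch, not my actual benchmark code; the synthetic volume just mimics 16-bit data where the high byte is nearly constant): store all of the low bytes together and all of the high bytes together before compressing, and gzip does noticeably better:

```python
# Byte-shuffle demo: split a little-endian uint16 volume into its
# low-byte and high-byte planes before compressing.
import gzip
import numpy as np

# synthetic 16-bit volume (placeholder for a real image)
img = np.random.normal(2048, 30, size=(64, 64, 40)).astype(np.uint16)

raw = img.tobytes()
# view as bytes, pair them as (low, high), store each plane contiguously
shuffled = img.view(np.uint8).reshape(-1, 2).T.copy().tobytes()

print('plain   :', len(gzip.compress(raw)))
print('shuffled:', len(gzip.compress(shuffled)))  # smaller: high bytes are nearly constant
```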

You may want to look at my pigz bench Python scripts, which let you compare all sorts of compressors for both compression speed/ratio and decompression speed, generating graphs of the Pareto frontier. By default it uses the Silesia corpus, but you can specify any corpus you want; my earlier Perl script provides an MRI corpus.

indexed_gzip is very nice for 4D NIfTI data. And gzip is really ubiquitous. The classic zlib is not optimized for modern hardware, but both CloudFlare zlib and zlib-ng leverage modern instructions to double single-threaded performance. If you want to retain gzip but need really good performance, you should consider libdeflate for compression and either libdeflate or Intel’s igzip for decompression. Both demand a lot more RAM. The libdeflate API is simple but inflexible; the Intel API is flexible but alien. For Python users, mgzip provides parallel decompression (though only for gzip files it generates).

| Decompression method | Min | Mean | Max | MB/s |
| --- | ---: | ---: | ---: | ---: |
| igzip | 9040 | 9157 | 9233 | 492.32 |
| libdeflate | 9922 | 9974 | 10017 | 448.59 |
| zlibNGclang | 17295 | 17345 | 17390 | 257.34 |
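For the Python side, a hedged mgzip sketch (I believe the keyword argument is `thread`; remember that its parallel decompression only applies to files mgzip itself wrote):

```python
# mgzip writes a blocked gzip stream so it can decompress in parallel
import mgzip

with open('data.nii', 'rb') as f:  # placeholder input file
    raw = f.read()

# compress with 8 threads
with mgzip.open('data.nii.gz', 'wb', thread=8) as f:
    f.write(raw)

# parallel decompression of that same file
with mgzip.open('data.nii.gz', 'rb', thread=8) as f:
    restored = f.read()
```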


I had previously posted a comparison, based on a single NIfTI file, on this BIDS github thread:

basically,

  • lzma compression speed is relatively slow, but decompression is ok; it has the highest compression ratio;
  • zip/gzip offers a somewhat balanced trade-off between speed and compression ratio;
  • not included in these data, but lz4 is extremely fast in both compression and decompression, though its compression ratio is not as high as zip/gzip; lz4 decompression is typically limited by disk read speed;
  • lz4hc sits somewhere between lz4 and zip/gz

I wrote a C wrapper to run these tests from MATLAB

also, a more thorough comparison in the context of Linux kernel benchmarks can be found if you google "Boot speed improvements for Ubuntu 19.10 Eoan Ermine";
the original link is broken, but I found a copy on the Internet Archive:

https://web.archive.org/web/20191017194258/https://kernel.ubuntu.com/~cking/boot-speed-eoan-5.3/kernel-compression-method.txt

my zmat toolbox supports zlib/gzip/lzip/lzma/lz4/lz4hc for matlab/octave; I also maintain this package for Fedora and Debian/Ubuntu (as octave-zmat and a C library)