How important is the use of virtual memory for an image storage format?

Neuroimages are large, and may not fit into memory. We would therefore like a format that allows array-like access to a large data file, without loading the whole file into memory. I’m calling this virtual memory.
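As a minimal illustration of what I mean (filename, shape, and dtype are invented), numpy's memmap already gives this kind of access for a raw array on disk:

```python
import numpy as np

# Map a large raw array into virtual memory: the OS pages in only the
# bytes that are actually touched, so the file never has to fit in RAM.
# Filename, shape, and dtype are invented for illustration.
data = np.memmap("bold_raw.dat", dtype=np.float32, mode="r",
                 shape=(96, 96, 60, 500))

# Only the pages containing these samples are read from disk,
# not the whole ~1.1 GB array.
timeseries = data[48, 48, 30, :]
```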

How important is the ability to use virtual memory, for an image format?

I think we need to distinguish between:

  • A sharing format - optimized for human and machine readability
  • An analysis format - optimized for memory, CPU, and / or multiprocessing

For sharing formats - we probably don’t need virtual memory access, but I think virtual memory access is probably essential for an analysis format. What do y’all think?

@matthew.brett this is somewhat related to my recent work supporting the probabilistic Julich-Brain-Atlas in MRIcroGL and NiiVue. The raw data is in a simple NIfTI format, but requires 9.4 GB when decompressed, and access is not cache friendly.

A combination of a NIfTI offset table and a lookup table reduces this by a factor of 422. Since probabilistic atlases are rare, perhaps no one else is interested in this topic. However, if anyone has suggestions for the scheme (or is interested in adopting it), I would be interested to hear from them. The labels could be HAWG compliant, just not using the classic NIfTI storage.
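As a toy sketch of the underlying idea - this illustrates the sparse-encoding principle only, and is not the actual NIfTI offset table and lookup table layout - one could keep, per region, the offsets of the non-zero voxels plus an index into a shared table of probability values:

```python
import numpy as np

def compress_prob_atlas(atlas):
    """Toy sparse encoding of a 4D probabilistic atlas (X, Y, Z, n_regions).

    Keeps, per region, the linear offsets of non-zero voxels plus an index
    into a shared lookup table of distinct probability values, instead of
    the dense 4D array. Illustration only - not the NIfTI-based scheme.
    """
    lut, codes = np.unique(atlas, return_inverse=True)   # distinct values
    codes = codes.reshape(atlas.shape)
    regions = []
    for r in range(atlas.shape[-1]):
        offsets = np.flatnonzero(atlas[..., r]).astype(np.uint32)
        regions.append((offsets, codes[..., r].ravel()[offsets].astype(np.uint16)))
    return lut, regions

def region_volume(shape3d, lut, regions, r):
    """Rebuild one region of the atlas as a dense 3D volume."""
    vol = np.zeros(int(np.prod(shape3d)), dtype=lut.dtype)
    offsets, region_codes = regions[r]
    vol[offsets] = lut[region_codes]
    return vol.reshape(shape3d)
```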

Hi @matthew.brett - this looks like a great idea to me.

I would attempt some sort of unification with surface formats. In the end, both are composed of the same elements: data, the locations of those data points, and the relationships between them (faces for surfaces and a grid definition for images).
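Just to make that analogy concrete, a rough sketch of a shared data model (all names invented) might look like:

```python
from dataclasses import dataclass
from typing import Tuple, Union
import numpy as np

@dataclass
class Mesh:
    """Surface geometry: vertex locations plus the faces relating them."""
    vertices: np.ndarray   # (n_vertices, 3) coordinates
    faces: np.ndarray      # (n_faces, 3) vertex indices

@dataclass
class Grid:
    """Volume geometry: a regular grid related to world space by an affine."""
    shape: Tuple[int, int, int]
    affine: np.ndarray     # (4, 4) voxel-to-world transform

@dataclass
class Sampled:
    """Data values attached to either kind of geometry."""
    values: np.ndarray               # one value (or series) per point/voxel
    geometry: Union[Mesh, Grid]
```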

It would be valuable to have fast and lazy loading across all dimensions, not just the last. This would be really useful when extracting ROIs or when dividing a volume into sub-images for parallel processing.
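For example (a sketch assuming an uncompressed NIfTI and nibabel's proxy slicing; the filename and block size are arbitrary), carving a volume into sub-blocks for parallel processing could look like this:

```python
import nibabel as nib

def iter_blocks(path, block=32):
    """Yield (slices, data) for cubic sub-blocks of an image, reading each
    block from disk only when requested (nibabel proxy slicing)."""
    img = nib.load(path)
    nx, ny, nz = img.shape[:3]
    for x in range(0, nx, block):
        for y in range(0, ny, block):
            for z in range(0, nz, block):
                sl = (slice(x, x + block), slice(y, y + block), slice(z, z + block))
                yield sl, img.dataobj[sl]   # only this block is read

# e.g. hand each block to a worker process instead of loading the full image:
# for sl, chunk in iter_blocks("T1w.nii"):
#     process(sl, chunk)
```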

FWIW, in the context of fMRIPrep I’ve been thinking a lot about having some internal HDF5 specification and only using NIfTI/GIFTI/CIFTI when writing the outputs. In dMRIPrep things might be even closer to that model.
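Something like the following h5py sketch is roughly what I mean - the dataset names, chunk shape, and compression settings are placeholders, not an actual fMRIPrep specification:

```python
import h5py
import numpy as np

bold = np.zeros((64, 64, 40, 100), dtype=np.float32)   # stand-in 4D series

# Chunked, compressed storage: any chunk can be read back on its own,
# which gives lazy, partial access along every dimension.
with h5py.File("derivatives.h5", "w") as f:
    dset = f.create_dataset("bold/preproc", data=bold,
                            chunks=(32, 32, 32, 16), compression="gzip")
    dset.attrs["affine"] = np.eye(4)    # keep spatial metadata with the data

with h5py.File("derivatives.h5", "r") as f:
    roi = f["bold/preproc"][10:20, 10:20, 10:20, :]   # reads only the needed chunks
```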

I would be very happy if I could use something tailored to the problem, rather than HDF5.

@neurolabusc - I would say the issue is not that people are uninterested in probabilistic atlases - I’d say it goes the other way around (they are hard to share, integrate, and access → so people fall back on something simpler).

Great idea. Do you think it would be feasible to extend the discussion to the major modalities? For example:

  • Volume data
  • Surface data
  • Tractography data (streamlines, point clouds)

It sounds like @oesteban already suggested extending to Surface data.
@arokem, @rheault, @garyfallidis (not here yet), and I have been discussing a format for Tractography in DIPY.

Franco

@pesto

  • @frheault integrated comments from many tractography users to develop a streamline format that fits the usage in our community.

  • For volume data, it seems like NIfTI/BIDS fills this niche. I agree that archiving the raw data is useful: for audits, in case face removal fails, and in case we develop a better understanding of what attributes the manufacturer hid in the DICOM. While people like David Clunie believe that DICOM should be the single format, I do think that BIDS/NIfTI distills away the manufacturer-specific complications. Further, the DICOM format is so complicated that manufacturers have been known to make errors: from personal experience, a major vendor did not initialize the memory arrays for DICOM, which allowed private data, including the patient name, to be found in the ‘crevices’ between the DICOM fields. The raw storage does allow random access to individual volumes, and tools like FSLeyes show how indexed Gzip can help for compressed volume data. However, the requirement for data to be stored with space in the first three dimensions (e.g. for fMRI, time must be the 4th dimension: XYZT) does break cache coherency for statistical inference - Voxbo would store the post-processed data as TXYZ (see the layout sketch after this list). Further, Voxbo did not save background voxels.

  • For surface data, there are a lot of tradeoffs between flexibility, size and speed. Personally, I think GIfTI fills this niche for our community reasonably well, so perhaps you could chime in regarding what you think the deficiencies are. I do use a smaller, faster format for my sample datasets, though it lacks the flexibility to replace GIfTI (a bare-bones sketch of that tradeoff also follows this list).
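To illustrate the XYZT vs TXYZ point from the volume bullet above, here is a toy numpy sketch (not Voxbo’s actual on-disk format) comparing the two layouts:

```python
import numpy as np

# NIfTI stores the first axis fastest, so with XYZT each volume is kept
# together and a single voxel's time series is scattered across the file.
shape = (64, 64, 40, 200)
xyzt = np.zeros(shape, dtype=np.float32, order="F")       # NIfTI-like layout
print(xyzt[10, 10, 10, :].strides)   # (655360,) bytes between samples: one seek per time point

# A time-first (TXYZ) layout keeps each voxel's time series contiguous,
# which is the access pattern voxel-wise statistical inference actually uses.
txyz = np.zeros((200,) + shape[:3], dtype=np.float32, order="F")
print(txyz[:, 10, 10, 10].strides)   # (4,) bytes: sequential, cache-friendly reads
```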
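And the surface tradeoff mentioned above: a bare-bones binary dump of vertices and faces (purely illustrative - not the sample-dataset format referred to) is tiny and fast to read, but leaves no room for metadata, multiple data arrays, or per-vertex maps:

```python
import numpy as np

def save_mesh(path, vertices, faces):
    """Write a bare count header, float32 vertices, then int32 faces."""
    with open(path, "wb") as f:
        np.array([len(vertices), len(faces)], dtype=np.int32).tofile(f)
        np.asarray(vertices, dtype=np.float32).tofile(f)
        np.asarray(faces, dtype=np.int32).tofile(f)

def load_mesh(path):
    """Read the mesh back: fast, but no metadata or per-vertex maps."""
    with open(path, "rb") as f:
        n_vert, n_face = np.fromfile(f, dtype=np.int32, count=2)
        vertices = np.fromfile(f, dtype=np.float32, count=n_vert * 3).reshape(-1, 3)
        faces = np.fromfile(f, dtype=np.int32, count=n_face * 3).reshape(-1, 3)
    return vertices, faces
```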