Metadata Preservation

I wanted to kick off a conversation about the metadata preservation aspect of this project. I had floated the idea of using xarray, instead of the code in dcmstack, to provide a nice API (and serialization format) for the metadata that we parse from the DICOM files and then summarize with respect to the axes of the nD array. The main concern with this approach seems to revolve around how “heavy” a dependency xarray is (i.e. it pulls in Pandas), with a potential solution being some future “xarray-lite” package.

However, my understanding of the proposed xarray-lite package (see: here) is that it will not be able to support anything beyond labeled dimensions, which would be of quite limited utility. It can’t even keep track of per-volume parameter values, never mind more complicated cases like per-slice acquisition times. So it seems like we might be stuck having the full xarray package at least as an optional dependency, or taking some other approach.

Thoughts?

Thanks for starting this.

The first thing to say is that it does not seem that the xarray-lite package is likely to be in good shape in time for us to use for the grant work, at least - see this comment.

The second is to ask about your (Brendan’s) understanding of the xarray-lite idea. I see that the proposal was to have more minimal axis labels and metadata, but without the indexing in Pandas. But I guess we don’t need to be able to index our axes with the metadata - we only need to be able to get the metadata at a given standard integer index position. I mean, we’re unlikely to need the kind of thing you can do in Pandas with:

rows = df.loc['first':'fourth']

and I assumed that’s what they mean by indexing - selecting values by label rather than by integer position.
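To make the distinction concrete, here is a small sketch of the two kinds of indexing in xarray terms (the b-values are made up for illustration):

```python
import numpy as np
import xarray as xr

# Toy 4-volume series: the "bval" dimension is labeled with b-values
da = xr.DataArray(
    np.arange(12).reshape(4, 3),
    dims=("bval", "slice"),
    coords={"bval": [0, 500, 1000, 2000]},
)

pos = da.isel(bval=0)    # integer-position indexing: all we strictly need
lab = da.sel(bval=1000)  # label-based indexing: the pandas-backed machinery
```

`isel` is plain positional lookup; `sel` is the part that needs a pandas Index behind the dimension.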

What do you think?

In xarray terms, we need support for “Coordinates” to track metadata that varies over a dimension, and this appears to be inextricably linked to the Pandas “Index” class. To support more complicated cases (like per-slice acquisition times) I believe we would need “MultiIndex Coordinates”, which appear to have some sharp corners that the xarray developers are trying to improve (see here).
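As a sketch of what that looks like (all parameter values below are invented): per-volume metadata is an ordinary coordinate, while per-slice metadata lives on a coordinate over two dimensions, and flattening those dimensions is what pulls in a pandas MultiIndex:

```python
import numpy as np
import xarray as xr

# (volume, slice, row, col) array with two levels of varying metadata
acq_times = np.array([[0.0, 0.1, 0.2],
                      [1.0, 1.1, 1.2]])
img = xr.DataArray(
    np.zeros((2, 3, 4, 4)),
    dims=("volume", "slice", "row", "col"),
    coords={
        # per-volume parameter: one flip angle per volume
        "flip_angle": ("volume", [5.0, 20.0]),
        # per-slice parameter: acquisition time varies over (volume, slice)
        "acq_time": (("volume", "slice"), acq_times),
    },
)

# Flattening (volume, slice) into one "frame" dimension creates a pandas
# MultiIndex under the hood -- the part with the sharp corners
flat = img.stack(frame=("volume", "slice"))
```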

While I agree some of the Pandas fancy indexing isn’t really needed, there is some cool stuff being done, like automatically computing gradients / interpolation / integration in a way that respects the coordinate values.
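For instance (a toy linear signal over made-up, non-uniform inversion times, so the results are easy to check):

```python
import numpy as np
import xarray as xr

# Signal sampled at non-uniform inversion times
ti = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
sig = xr.DataArray(2.0 * ti, dims="ti", coords={"ti": ti})

# Gradient and integral taken with respect to the actual coordinate
# values, not the integer positions
dsig = sig.differentiate("ti")  # == 2.0 everywhere for a linear signal
area = sig.integrate("ti")      # trapezoidal rule over the real spacing
```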

I probably should have emphasized more that xarray also provides serialization for the spatially varying metadata so we don’t have to invent our own format. We can JSON encode the output of to_dict(data=False) (see here) to use with things like Nifti. We also get a foothold in the world of NetCDF and Zarr.
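Roughly like this (coordinate values invented; whether a NIfTI extension is the right home for the JSON is an open question):

```python
import json
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 4, 4)),
    dims=("volume", "row", "col"),
    coords={"bval": ("volume", [0, 1000])},
)

# Structure-only dict (no bulk voxel data), JSON-encoded -- small enough
# to stash in something like a NIfTI header extension
header = json.dumps(da.to_dict(data=False))
meta = json.loads(header)
```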

Thanks, and sorry to be slow to reply - I’ve been catching up on university business.

Am I right in thinking that the xarray-lite / Variable class does not capture the “coordinates”? So is it true that it would not help us make the metadata useful?

I wonder whether we should start here with a spike - doing the most stupid possible thing to get xarray integrated into nibabel as a test, and then experiment with what we can do with the meta-data?

Have you come across a good case where the DICOM data would be genuinely useful stored as coordinates on the axes? Maybe we could use that as a use case?

Sorry, my laptop bit the dust last week which has set me back a bit…

Yes, the xarray-lite / Variable class doesn’t capture “coordinates”.

I guess the SpatialImages API could grow an optional “to_xarray” method that requires the xarray package to be installed as well as the correct metadata support in the underlying file format. For Nifti images this would look for a JSON encoded xarray dict in the Nifti extensions. For multi-frame DICOM files we would need to translate the per-frame metadata into xarray coordinates. For non multi-frame DICOM images I guess it would be up to the MetaSummary class in my PR to produce the xarray coordinates (at least for the dimensions above what is captured in the individual DICOM files).
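A very rough sketch of the Nifti path, purely hypothetical: the “to_xarray” name, the `FakeNifti` stand-in, and the JSON extension layout here are all assumptions for illustration, not an agreed API:

```python
import json
import numpy as np
import xarray as xr

class FakeNifti:
    """Stand-in for a SpatialImage carrying a JSON xarray dict extension."""
    def __init__(self, data, json_ext):
        self._data = data
        self._json_ext = json_ext  # JSON string, as stored in an extension

    def get_fdata(self):
        return self._data

def to_xarray(img):
    # Hypothetical: rebuild coordinates from the JSON-encoded xarray dict
    meta = json.loads(img._json_ext)
    coords = {
        name: (tuple(c["dims"]), c["data"])
        for name, c in meta.get("coords", {}).items()
    }
    return xr.DataArray(img.get_fdata(), dims=meta["dims"], coords=coords)

ext = json.dumps({
    "dims": ["volume", "row", "col"],
    "coords": {"bval": {"dims": ["volume"], "data": [0, 1000]}},
})
img = FakeNifti(np.zeros((2, 2, 2)), ext)
da = to_xarray(img)
```

The real version would raise a helpful error when xarray isn’t installed, and would read the extension through the nibabel header API rather than a private attribute.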

I’m less sure on the API for saving xarray data into files. I guess the SpatialImage classes that support it can accept an xarray as the “data” property instead of just ndarrays. It would be strange then if future references to the data property didn’t return an xarray though…

My main use case for coordinates is using them as the independent parameter in various models. It is pretty common for me to work with data that has non-uniform sampling along the extra-spatial dimensions (b-values, flip angles, inversion time, echo time, etc.). Even being able to load data like this into a viewer like Napari and have the curves automatically plotted correctly would be quite nice.
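As a concrete toy version of that use case (T2 value and echo times invented), the coordinate serves directly as the independent variable of the model:

```python
import numpy as np
import xarray as xr

# Toy multi-echo signal: exp(-TE / T2) with T2 = 50 ms, sampled at
# non-uniform echo times
te = np.array([10.0, 20.0, 40.0, 80.0])
sig = xr.DataArray(np.exp(-te / 50.0), dims="te", coords={"te": te})

# A log-linear fit against the labeled axis recovers T2 directly --
# no need to carry the echo times around in a separate array
slope = np.polyfit(sig.te.values, np.log(sig.values), 1)[0]
t2 = -1.0 / slope
```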