Sparky File Format - CONNJUR/CONNJUR_widgets GitHub Wiki
From Sparky Manual:
Here is a sketchy description of the UCSF NMR data format as used by Sparky. Sparky needs to know the dimension, data matrix size, data values, nucleus types, and parameters to convert data indices to ppm and Hz. The ppm and Hz scales are determined by knowing along each axis the transmitter frequency, spectral width, and ppm value for index 0. Given this information you can write a UCSF NMR data file for input to Sparky. The file format is somewhat complicated. The data matrix is broken down into small arrays so that small regions of the data can be read with few disk accesses.
A UCSF NMR data file is a binary file with a 180 byte header, followed by 128 byte headers for each axis of the spectrum, followed by the spectrum data. All integer and float values in the file have big endian byte order independent of the native byte order for the machine that wrote the file. An int has big endian order if the most significant byte appears first in the file (or at the lowest memory address). Sun Sparc and SGI Rx000 processors are big endian and DEC Alpha and Intel x86 are little endian. If you write a conversion program to produce UCSF format the code should check the native byte order at run time and reverse floats and ints if necessary before writing them out. All float values are in IEEE format. All header bytes not described below should be set to zero.
The 180 byte header contains:
position | bytes | contents | required value |
0-9 | 10 | file type = | UCSF NMR (8 character null terminated string) |
10 | 1 | dimension of spectrum | |
11 | 1 | number of data components = | 1 (for real data) |
13 | 1 | format version number = | 2 (for current format) |
The first byte in the file is position 0. A complex spectrum has two components. Sparky only reads real data and I will only describe below the layout for real data so set the number of components to 1. Use format version number 2.
For each axis of the spectrum write a 128 byte header of the following form:
position | bytes | contents |
0 | 6 | nucleus name (1H, 13C, 15N, 31P, ...) null terminated ASCII |
8 | 4 | integer number of data points along this axis |
16 | 4 | integer tile size along this axis |
20 | 4 | float spectrometer frequency for this nucleus (MHz) |
24 | 4 | float spectral width (Hz) |
28 | 4 | float center of data (ppm) |
Next comes the spectrum data floating point values. The full data matrix is divided into tiles. Each tile is written to the file as an array with the highest axis varies fastest. If a tile size does not evenly divide the data matrix size then there will be a partial tiles at the edge. Partial tiles are written out as a full tiles with zero values for the region outside the data matrix. The tiles themselves form an array and are written out in sequence with highest axis varying fastest.
Here is a detailed example of how the floating point values are written. Suppose we have a 2D spectrum with matrix that is 1024 by 4096 data points. Each value is a 4 byte IEEE floating point number. So the data takes up 16 MBytes (1024 x 4096 x 4 bytes). Sparky wants to be able to show you a small region of the spectrum without any delay, so the data is stored in small rectangular blocks (tiles). Only the tiles needed to display the region you are viewing need to be read from disk. The size of tiles in UCSF format files is typically 32 Kbytes or 16 Kbytes, although a little bit bigger sizes like 128 Kbytes would be reasonable with current disk drive seek times (circa 1999). The current conversion programs that produce UCSF format divide each axis of the data matrix by 2 in steps until the resulting tile size is less than or equal to 32 Kbytes. For purposes of example I will instead take tiles that are 128 by 256 floating point values. So each one is 128 Kbytes ( = 128 * 256 * 4). Then an 8 = (1024 / 128) by 16 = (4096 / 256) grid of tiles covers the 1024 by 4096 data matrix. Denote a point in the data matrix like (312,1587) and a tile {2,5}. Call the two axes w1 and w2. Tile {0,0} includes a rectangle of data values from 0 to 127 along axis w1, and 0 to 255 along axis w2. Tile {a,b} is the data rectangle a*128 to (a+1)*128 - 1 along axis w1 and b*256 to (b+1)*256 - 1 along axis w2.
You write the tiles in the following order.
{0,0} {0,1} {0,2} . . {0,7}
{1,0} {1,1} {1,2} . . {1,7}
. . .
{15,0} {15,1} {15,2} . . {15,7}
This order is what was meant above by the phrase "highest axis varies fastest". In this case the w2 axis varies fastest. Each tile is output with the same rule. Tile {0,0} has its 128 by 256 floating point values output in the following order:
(0,0) (0,1) (0,2) . . (0,255)
(1,0) (1,1) (1,2) . . (1,255)
. . .
(127,0) (127,1) (127,2) . . (127,255)
For 3D data the tiles are 3D data cubes. The order of writing the grid of tiles out has the w3 axis varying fastest, w2 second fastest, and w1 slowest. Likewise each tile's cube of floating point values are written out in order having the w3 axis vary fastest, w2 second fastest, and w1 slowest.
This is a more detailed description of the binary header information. This description stems from an examination of the Sparky Source Code, in particular, the file ucsffile.cc. The 180 byte header is essentially the NMR_HEADER data structure, while the 128 byte headers are the NMR_AXIS data structure. Each structure has unused pointers. Unused is saved in the axis header, the other is not saved to the file header. However, the file header does include byte padding for alignment of the structure fields by a 32-bit compiler. (Key: Green refers to metadata which is typically set correctly by Sparky and useful to manage. Yellow are fields which are either not typically set or are not used by anything.)
The 180 byte header contains:
position | bytes | field name | description | purpose / options |
0-9 | 10 | ident | file type | "UCSF NMR" required for version 2 (null terminated ASCII) |
10 | 1 | naxis | number of dimensions | Must be 2, 3 or 4 |
11 | 1 | ncomponents | number of data components | Must be 1 (real data, not complex or hypercomplex) |
12 | 1 | encoding | unknown | Seems to be always set to false: 0 |
13 | 1 | version | format version number | Must be 2 (for current format) |
14-22 | 9 | owner | user name who created file | whoever executed pipe2ucsf (null terminated ASCII) |
23-48 | 26 | date | date user created file | when pipe2ucsf was executed (null terminated ASCII) |
49-128 | 80 | comment | unknown | Does not appear to be used (null terminated ASCII) |
129-131 | 3 | ---- | byte padding | structure alignment by compiler |
132-139 | 8 | seek_pos | type long | Does not appear to be used |
140-179 | 40 | scratch | unknown | Does not appear to be used (null terminated ASCII) |
For each axis of the spectrum there is a 128 byte header of the following form:
position | bytes | bit | data type | field name | contents |
0-5 | 6 | char[] | nucleus | nucleus name (1H, 13C, 15N, 31P, ...) null terminated ASCII | |
6-7 | 2 | short | spectral_shift | spectral shift: seems to be left or right circular shift? | |
8-11 | 4 | unsigned int | npoints | number of active data points along this axis | |
12-15 | 4 | unsigned int | size | total size of axis | |
16-19 | 4 | unsigned int | bsize | integer tile size (# of points per cache block) | |
20-23 | 4 | float | spectrometer_freq | spectrometer frequency for this nucleus (MHz) | |
24-27 | 4 | float | spectral_width | spectral width (Hz) | |
28-31 | 4 | float | xmtr_freq | center of data (ppm) | |
32-35 | 4 | float | zero_order | phase correction (unknown units) | |
35-39 | 4 | float | first_order | phase correction (unknown units) | |
40-43 | 4 | float | first_pt_scale | scaling for first point | |
44-47 | 4 | 1 | unsigned int | status.transformed | |
1 | unsigned int | status.base_corrected | |||
14 | unsigned int | status.reserved | |||
48-51 | 4 | 2 | unsigned int | flags.fid.convolution | |
1 | unsigned int | flags.fid.dc_offset | |||
1 | unsigned int | flags.fid.forward_extend | |||
1 | unsigned int | flags.fid.backward_extend | |||
2 | unsigned int | flags.fid.replace | |||
4 | unsigned int | flags.fid.apodization | |||
2 | unsigned int | flags.fid.ft_code | |||
4 | unsigned int | flags.fid.nfills | |||
1 | unsigned int | flags.fid.absolute_value | |||
1 | unsigned int | flags.fid.spectrum_reverse | |||
1 | unsigned int | flags.fid.baseline_offset | |||
2 | unsigned int | flags.fid.baseline_fit | |||
10 | unsigned int | flags.fid.reserved | |||
52-53 | 2 | short | conv.width | half width of window | |
54-55 | 2 | short | conv.extrapolation | # points at ends to extrapolate | |
56-59 | 4 | float | apo.p1.shift | union. Alt float p2.gaussian | |
60-63 | 4 | float | apo.p1.line_broad | union. Alt unused | |
64-67 | 4 | unsigned int | forward.start | ||
68-69 | 2 | unsigned short | forward.poly_order | ||
70-71 | 2 | unsigned short | forward.npredicted | ||
72-73 | 2 | unsigned short | forward.npoints | ||
74-75 | 4 | unsigned int | backwards.start | ||
76-79 | 2 | unsigned short | backwards.poly_order | ||
80-81 | 2 | unsigned short | backwards.npredicted | ||
82-83 | 2 | unsigned short | backwards.npoints | ||
84-87 | 4 | unsigned int | replace.before_start | ||
88-91 | 4 | unsigned int | replace.after_start | ||
92-95 | 4 | unsigned int | replace.first | ||
96-97 | 2 | unsigned short | replace.npredicted | ||
98-99 | 2 | unsigned short | replace.poly_order | ||
100-101 | 2 | unsigned short | replace.before_npoints | ||
102-103 | 2 | unsigned short | replace.after_npoints | ||
104-107 | 4 | unsigned int | base_offset.start | ||
108-111 | 4 | unsigned int | base_offset.end | ||
112-115 | 4 | unsigned int | base_fit.solvent_start | ||
116-117 | 2 | unsigned short | base_fit.poly_order | ||
118-119 | 2 | unsigned short | base_fit.solvent_width | ||
120-123 | 4 | void* pointer | unused | ||
124-127 | 4 | ---- | byte padding |
Finally, returning to the color coding. In the source code of Sparky, it is mentioned that the processing parameters for each axis are only included as legacy data from an older processing program called Striker. In Aug 2018, mgryk converted two nmrPipe files to Sparky format using pipe2ucsf. Even though the processing parameters were stored in the nmrPipe header, the Sparky header was filled with all 0's. Thus, it appears that this header data is actually 'superfluous junk'. In principle, we could have CONNJUR Spectrum Translator convert nmrPipe to Sparky and set this metadata. However, since Striker is obsolete, such metadata would not be useful for it's intended purpose. For now, mgryk is in favor of ignoring this metadata completely.