How to Package Data Using the GVRS Library - gwlucastrig/gridfour GitHub Wiki

Introduction

Gridfour's GvrsFile classes help Java applications manage raster (grid) data in situations where the size of the data exceeds what could reasonably be kept in memory. To do so, they provide an API that allows an application to seamlessly swap blocks of the grid between memory and data files. The GVRS library and file format also provide a way of storing raster data in files that can be shared between applications or used across multiple application runs. And, in cases where storage space or transmission bandwidth is limited, GVRS provides built-in data compression that can reduce storage requirements by at least a factor of four.

This article gives an introduction to the Java API for GVRS and offers example code showing how to use it in your own applications. As usual, code for the applications described in this article is available from the Gridfour Software Project. You may find the code in the Gridfour demo code tree in the class PackageData.java.

Sample Data

A wiki article that describes how to use an API to manipulate and store data would surely benefit from having a good source of data to use as an example. We are fortunate to have two: ETOPO1 and GEBCO_2019. These products give world-wide Earth surface elevations and ocean depth (bathymetry) values in a regular geographic coordinate grid.

In the ETOPO1 product, grid points are given in a regular spacing of 1 minute of arc (i.e. 60 rows or columns for each degree of latitude or longitude). Thus ETOPO1 includes (360x60)x(180x60) data values, or about 233 million samples. Storing that many numeric values in memory requires a lot of capacity, but is not out-of-reach for a modern computer. The GEBCO product, however, uses a finer resolution based on a grid spacing of 15 seconds of arc (i.e. 240 rows or columns for each degree of latitude or longitude). That spacing results in a data collection containing 3.7 billion samples. And while there are plenty of computers with sufficient memory to hold that much data, it wouldn't leave much room for anything else.
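The sample counts quoted above follow directly from the grid spacing. The short sketch below reproduces that arithmetic in plain Java. The class and method names are hypothetical, and the 4-bytes-per-sample memory figure is an assumption used only to give a sense of scale.

```java
// Sample counts for global grids at a given resolution.
// Names are illustrative only; nothing here is part of the GVRS API.
final class GridSizeEstimate {

    /** Number of samples in a global grid with the given cells per degree. */
    static long sampleCount(int cellsPerDegree) {
        long nCols = 360L * cellsPerDegree; // longitude direction
        long nRows = 180L * cellsPerDegree; // latitude direction
        return nRows * nCols;
    }

    public static void main(String[] args) {
        long etopo1 = sampleCount(60);  // 1 minute of arc
        long gebco = sampleCount(240);  // 15 seconds of arc
        // Assumed 4 bytes per sample when held in memory
        System.out.println("ETOPO1 samples: " + etopo1 + ", bytes: " + etopo1 * 4);
        System.out.println("GEBCO samples:  " + gebco + ", bytes: " + gebco * 4);
    }
}
```

Note that the arithmetic is carried out with long values, since the GEBCO sample count exceeds the range of a 32-bit int.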

A Tiling Scheme

Fortunately, when dealing with such large data sets, applications seldom need to access the entire collection all at once. So an obvious solution to the problem of memory use is to store the data in a file and load (or store) pieces of it on an as-needed basis. This approach is used in data formats such as the TIFF image specification which partition large grids into smaller, regularly sized sub-grids known as tiles (Adobe, 1992, pg. 68). The figure below illustrates the idea of a larger grid being subdivided into tiles.

[Figure: Tiling Partitions]

The figure below illustrates how a tiling scheme works. A collection of surface elevation and bathymetry data could be divided into regular tiles ten-degrees across. An application requiring access to information in Europe would load the relevant tiles without needing to access information from South America.

[Figure: Tiling Scheme]

In GVRS, the size of tiles is arbitrary (though all tiles must be of a uniform size). Applications are free to specify tile sizes according to their needs. In the case of the ETOPO1 data set, the PackageData demonstration application specifies a tile size of 90 rows by 120 columns. In terms of geographic coordinates, the 1-minute resolution used in ETOPO1 means that these tiles will cover an area with a span of 1.5 degrees of latitude and 2 degrees of longitude. This size was chosen after some experimentation because it gives good data compression results and provides a convenient size for access.

Tiling is a core feature of the GVRS API and deeply involved in all its operations. In fact, even a small grid is treated as being tiled, though it is completely reasonable to specify a tile size that matches that of the entire grid.
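To make the tiling arithmetic concrete, the following sketch shows how a grid cell could be mapped to the tile that contains it. It uses the ETOPO1 grid dimensions (10800 rows by 21600 columns) and the 90-by-120 tile size from the PackageData example. The class name and the row-major tile numbering are assumptions made for illustration; GVRS handles this bookkeeping internally and its actual numbering may differ.

```java
// Maps grid cells to tiles for a uniformly tiled grid.
// TileLocator is a hypothetical helper, not a GVRS class.
final class TileLocator {
    final int nRowsInTile;
    final int nColsInTile;
    final int nRowsOfTiles;
    final int nColsOfTiles;

    TileLocator(int nRowsInGrid, int nColsInGrid, int nRowsInTile, int nColsInTile) {
        this.nRowsInTile = nRowsInTile;
        this.nColsInTile = nColsInTile;
        // Round up so that partial tiles at the grid edges are counted
        this.nRowsOfTiles = (nRowsInGrid + nRowsInTile - 1) / nRowsInTile;
        this.nColsOfTiles = (nColsInGrid + nColsInTile - 1) / nColsInTile;
    }

    /** Row-major index of the tile containing the cell (assumed numbering). */
    int tileIndex(int gridRow, int gridCol) {
        int tileRow = gridRow / nRowsInTile;
        int tileCol = gridCol / nColsInTile;
        return tileRow * nColsOfTiles + tileCol;
    }

    public static void main(String[] args) {
        TileLocator etopo1 = new TileLocator(10800, 21600, 90, 120);
        // 120 rows of tiles by 180 columns of tiles
        System.out.println(etopo1.nRowsOfTiles + " x " + etopo1.nColsOfTiles);
        // Cell (95, 250) lies in tile row 1, tile column 2, so index 182
        System.out.println(etopo1.tileIndex(95, 250));
    }
}
```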

A few other details about GVRS tiling are worth noting:

Tiles can be written or read in any order.
Not all tiles in the specification need to be populated with data. The overhead for empty tiles is small.
The GVRS compression logic treats each tile as a separate block of data that is compressed individually, without reference to other tiles.
A key feature of the GVRS file API is that it maintains an in-memory cache to store tiles for rapid access.

Data Elements

A GVRS file may store one or more data values for each grid cell. In the simplest case, each grid cell is assigned a single value. Examples of the single-value case include terrestrial elevation, surface temperature, average rainfall, etc. Some phenomena, like winds or ocean currents, require specifications for speed and direction, and may require the storage of two values per grid cell. Some products involve a mix of elements with different data types. For example, the GEBCO ocean depth product mentioned above gives real-valued depth values. But it also features a supplemental grid of integer codes that indicate what kind of data-collection method was used for each sample point.

In GVRS, we view each grid cell as consisting of a set of one or more elements. Each element has an independent data type and range of values. GVRS also expects an application to assign distinct names to various elements. The GVRS API provides two families of classes for working with elements:

GvrsElementSpecification -- Provides the definitions for an element including name and data type. Provided to the API when a GvrsFile is constructed.
GvrsElement -- Provides access to the content of a GVRS file, used for read and write operations.

Creating a GVRS File

The GVRS library is a tool for both creating raster files and accessing them. Writing a GVRS file is a three-step process:

Create a specification object to describe the organization of the GVRS file:
- Specify the grid dimensions. Optionally, specify the tile size
- Specify data types
- Set other configuration options (data compression, checksums, etc.)
The grid specification and a file-path specification are used to create a new file for writing data. The initial file is treated as an empty collection of tiles and will typically be smaller than 1 kilobyte in size.
Values are added to the file one grid-point at a time. The internal bookkeeping and management of tiles is mostly transparent to the calling application.

The code snippet below gives an example of a simple case, in which a single value is stored in a 10-by-10 grid. The two-argument version of the GvrsFileSpecification constructor takes the specifications for the number of rows and number of columns in the grid. An alternate version allows an application to control the size of the tiles, but the automatic selection is usually adequate. In GVRS, row and column settings are always given in the order row first, then column.

// set up file specification, add one element named "z" ------
GvrsFileSpecification fileSpec    = new GvrsFileSpecification(10, 10);
GvrsElementSpecificationShort zElementSpec = new GvrsElementSpecificationShort("z");
fileSpec.addElementSpecification(zElementSpec);

// create a new file, access the element named "z" --------
File outputFileRef = new File("Example1.gvrs");
try(GvrsFile gvrsFile = new GvrsFile(outputFileRef, fileSpec)){
	GvrsElement zElement = gvrsFile.getElement("z");
	// the zElement object may now be used to read-and-write
	// data to the file:
	zElement.writeValue(0, 1, 2021); // write a value at grid row 0, col 1
} catch (IOException ioex) {
	// report the failure rather than silently ignoring it
	ioex.printStackTrace();
}

When creating a new GVRS file, GvrsElementSpecification objects are created and registered with a GvrsFileSpecification object. When the specifications are passed into the GvrsFile constructor, they are used to establish the internal organization of the file. Once a GVRS file is opened, applications can obtain instances of the GvrsElement objects that were built from the GVRS element specifications. Then the application uses GVRS elements to read or write data.

The PackageData example

The code fragment below is taken from the PackageData example application (with some simplifications applied for the sake of this discussion). PackageData extracts elevation and bathymetry values from the ETOPO1 and GEBCO products, and stores the information in a GVRS file with optional data compression. Both ETOPO1 and GEBCO are distributed in a file format called NetCDF. A wiki page describing how to read data from NetCDF files is provided at this site under the title How to Extract Data from a NetCDF File.

The parameters for the number of rows and columns in the grid are based on the dimensions of the source data. The number of rows and columns in the tile were chosen because they seemed well suited to the needs of an application that might use the GVRS file.

NetcdfFile ncfile = NetcdfFile.open(filePathToSourceETOPO1);
Variable z = ncfile.findVariable("z");  // the NetCDF variable for reading values

int nRowsInGrid = 10800;  // spans 90 south to 90 north
int nColsInGrid = 21600;  // spans 180 west to 180 east
int nRowsInTile = 90;
int nColsInTile = 120;

// Create a specification for the overall grid and tiling.
GvrsFileSpecification spec
        = new GvrsFileSpecification(nRowsInGrid, nColsInGrid, nRowsInTile, nColsInTile);
GvrsElementSpecificationShort elementSpec = new GvrsElementSpecificationShort("z");
spec.addElementSpecification(elementSpec);

// Create a GVRS-formatted file for output
File outputFile      = new File("ETOPO1.gvrs");
GvrsFile gvrs        = new GvrsFile(outputFile, spec); 
GvrsElement zElement = gvrs.getElement("z");

When a GVRS file is created, the metadata from the specification object is used to populate the file header. Some of this metadata is immutable and must be fully specified before the output file is created; other elements can be adjusted after the file is opened. We will look at some of these settings later on.

Storing Data

As shown in the code block below, storing data in a GVRS file is a relatively straightforward process. In fact, most of the complexity in the code example comes from accessing the NetCDF data rather than writing it through the GVRS API. The example reads data from the source file one row at a time. Because of the way it is organized, this approach is the most efficient pattern for accessing the ETOPO1 file.

Readers who are familiar with the NetCDF API may notice similarities between the NetCDF concept of a "variable" and the GVRS concept of an "element". Although the underlying structure of the two products is fundamentally different, the NetCDF Variable and the GvrsElement serve a similar role in that they provide access points for reading and writing data from a file.

// initialize access specifications for NetCDF.
Variable z = ncfile.findVariable("z");  // the NetCDF variable for reading values
int[] readOrigin = new int[2]; 
int[] readShape = new int[2];

GvrsElement zElement = gvrs.getElement("z");  // the GVRS element for reading/writing values

for (int iRow = 0; iRow < nRowsInGrid; iRow++) {
	// NetCDF uses the origin and shape arrays
	// to specify a section of the grid to be read.
	readOrigin[0] = iRow;
	readOrigin[1] = 0;
	readShape[0] = 1;
	readShape[1] = nColsInGrid;
	// Read one row of data from the NetCDF file.
	// The data will be stored in a NetCDF "Array" object.
	// Then loop on each column, obtain the elevation/bathymetry data
	// and store it in GVRS.   ETOPO1 stores data as integers.
	Array array = z.read(readOrigin, readShape);
	for (int iCol = 0; iCol < nColsInGrid; iCol++) {
		int sample = array.getInt(iCol);
		zElement.writeValue(iRow, iCol, sample);
	}
}
gvrs.flush();
gvrs.close();

The Tile Cache

The snippet above would work just fine, except that it might be a little slower than we would prefer. The reason for this is that each row stored to the data file requires swapping tiles in and out of memory. Because there are 21600 columns in the master grid and 120 columns in each tile, there are 21600/120 = 180 tiles in each row of the tiling scheme. So in order to keep an entire row of the master-grid data in memory, the GVRS API needs to keep 180 tiles in its cache. But, by default, the GVRS cache holds only 16 tiles, which is not enough to hold an entire row of data. Because the storage process scans one row at a time, the cache has to drop and load tiles 180 times per row. This approach leads to a lot of redundant reading and writing of tiles.
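The effect described above can be illustrated with a small simulation. The sketch below counts how many times a tile would have to be loaded during a row-by-row scan of the ETOPO1 grid, for caches holding 16 and 180 tiles. It assumes a least-recently-used (LRU) eviction policy, built here on java.util.LinkedHashMap; that policy, the class name, and the tile numbering are assumptions for illustration and are not taken from the GVRS implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Counts tile loads for a row-by-row scan under an assumed LRU tile cache.
final class TileCacheSimulation {

    static long countTileLoads(int nRowsInGrid, int nColsInGrid,
            int nRowsInTile, int nColsInTile, final int cacheCapacity) {
        int nColsOfTiles = nColsInGrid / nColsInTile;
        // An access-ordered LinkedHashMap evicts the least recently used tile
        Map<Integer, Boolean> cache
                = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > cacheCapacity;
            }
        };
        long loads = 0;
        for (int row = 0; row < nRowsInGrid; row++) {
            int tileRow = row / nRowsInTile;
            // Cells within one tile are adjacent in a grid row, so touching
            // each tile column once per grid row is enough to count loads.
            for (int tileCol = 0; tileCol < nColsOfTiles; tileCol++) {
                int tileIndex = tileRow * nColsOfTiles + tileCol;
                if (cache.get(tileIndex) == null) {
                    loads++; // tile not in cache, must be read from the file
                    cache.put(tileIndex, Boolean.TRUE);
                }
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        // prints 1944000: every access is a miss (180 loads per grid row)
        System.out.println(countTileLoads(10800, 21600, 90, 120, 16));
        // prints 21600: each of the 120 x 180 tiles is loaded exactly once
        System.out.println(countTileLoads(10800, 21600, 90, 120, 180));
    }
}
```

Under these assumptions, the small cache loads each tile 90 times (once for every grid row that crosses it), while a cache wide enough for a full row of tiles loads each tile only once.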

Fortunately, the tile cache size can be adjusted with the following call before storing the data.

gvrs.setTileCacheSize(GvrsCacheSize.Large);

The adjustment is applied after the GvrsFile is opened, but before the application starts to access data. The "Large" cache size adjustment tells GVRS to adjust the size of the cache so that it is large enough to store an entire row (or entire column) of tiles. Tiles use 4 bytes for each value stored in memory. Because this example code specifies a tile size of 90-by-120 grid values, an entire row of 180 tiles would require 180x90x120x4 = 7776000 bytes (about 7.4 megabytes). This setting is not onerous, and it will dramatically improve the speed of writing a file.
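The memory estimate above can be reproduced for other grid and tile sizes with a few lines of arithmetic. The sketch below is a plain-Java calculation, not a call into the GVRS API; the class and method names are hypothetical, and the 4-bytes-per-value figure is the one stated in the text.

```java
// Estimates the memory needed to cache one full row of tiles.
final class TileCacheMemory {
    static final int BYTES_PER_VALUE = 4; // in-memory size per value, as stated above

    static long bytesForOneRowOfTiles(int nColsInGrid, int nRowsInTile, int nColsInTile) {
        // Round up so that a partial tile at the grid edge is counted
        long tilesPerRow = (nColsInGrid + nColsInTile - 1) / nColsInTile;
        return tilesPerRow * nRowsInTile * nColsInTile * BYTES_PER_VALUE;
    }

    public static void main(String[] args) {
        long bytes = bytesForOneRowOfTiles(21600, 90, 120);
        System.out.println(bytes); // prints 7776000
        System.out.println(bytes / (1024.0 * 1024.0) + " megabytes");
    }
}
```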

How much difference does the larger cache make? With the default "Medium" cache size, the storage process required 237.2 seconds. With the "Large" size setting, it required 9.2 seconds.

Here it is worth emphasizing that the need for a larger cache size is due to the pattern-of-access applied in packaging the data. Had the packaging process populated a single tile at a time (rather than spanning an entire row of tiles), the increased tile cache size would not have been required. This consideration applies both when writing data to a GVRS file and when reading data from one.

Other Settings

In addition to the overall grid size and tiling scheme specifications, there are a number of other settings that can be stored in a GVRS file header when it is created. These settings are supplied through calls to the GvrsFileSpecification class' access methods.

Coordinate System

The main GvrsFile API allows an application to read and write raster data values by supplying the row and column indices for the values of interest. Many real-world applications depend on a horizontal coordinate system that relates these values not to grid coordinates, but to real-valued position information.

Gridfour allows an application to specify two broad categories of coordinate systems for file access: Cartesian Coordinates and Geographic Coordinates.

Although Gridfour is not limited to geographic coordinate systems, the two example products we used for this wiki-article are geographic in nature. So we will begin with the specification for a geographic coordinate system.

As mentioned above, the row and column spacing for ETOPO1 is a uniform 1 minute of arc (1/60th of a degree). There are a couple of ETOPO1 variations. The one chosen for this discussion runs from south to north and west to east. The latitudes start just above the South Pole and run to just below the North. The longitudes start just to the east of the International Date Line (-179.99166 longitude) and extend to just to its west (+179.99166 longitude). The following code snippet shows an example of how a specification for ETOPO1 geographic coordinates could be constructed:

GvrsFileSpecification spec = new GvrsFileSpecification(nRows, nCols, nRowsInTile, nColsInTile);
double h = 1.0/60.0;  // one minute spacing, 1/60th of a degree
spec.setGeographicCoordinates(
         -90+h/2,   // south, first row in grid 
        -180+h/2,   // west, first column in grid 
          90-h/2,   // north, last row in grid
         180-h/2);  // east, last column in grid

Note that the geographic coordinates are given in degrees and specified in the order latitude, longitude. West longitudes are given as negative values. East longitudes are given as positive values.

As an example of Cartesian coordinates, consider a case in which the (x, y) coordinates are normalized between 0 and 1. In that case, we might specify coordinates using the following:

GvrsFileSpecification spec = new GvrsFileSpecification(nRows, nCols, nRowsInTile, nColsInTile);
spec.setCartesianCoordinates(
        0,   // x coordinate of first column in grid 
        0,   // y coordinate of first row in grid
        1,   // x coordinate of last column in grid
        1);  // y coordinate of last row in grid

Note that the order of the Cartesian coordinates follows the standard practice of being given as (x, y). In the example above, we assumed that the coordinates were increasing as the row and column index increased. Consider the case where the x coordinate increases while the y coordinate decreases. Such a specification would look like:

spec.setCartesianCoordinates(
        0,   // x coordinate of first column in grid 
        1,   // y coordinate of first row in grid
        1,   // x coordinate of last column in grid
        0);  // y coordinate of last row in grid

The GvrsFile class implements methods for mapping Cartesian or Geographic coordinates to grid coordinates and vice versa. These are shown below:

public double []mapGridToCartesian(double row, double column)   // returns x,y
public double []mapCartesianToGrid(double   x, double      y)

public double []mapGridToGeographic(double row, double column)  // returns lat, lon
public double []mapGeographicToGrid(double latitude, double longitude)

Note that the mapping functions may return and/or accept fractional values for the row and column values. This feature is intended to support data value interpolation.
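As a rough illustration of what such a mapping involves, the sketch below implements a simple linear conversion between grid indices and geographic coordinates. It assumes that the first and last rows and columns sit exactly at the bounds passed to the specification, with uniform spacing in between, and it does not handle longitude wrapping. The class is hypothetical and is not the GVRS implementation, whose behavior may differ in detail.

```java
// A linear grid-to-geographic mapping under stated assumptions.
final class LinearGridMapping {
    final double lat0;
    final double lon0;
    final double latSpacing;
    final double lonSpacing;

    LinearGridMapping(int nRows, int nCols,
            double south, double west, double north, double east) {
        lat0 = south; // latitude of row 0
        lon0 = west;  // longitude of column 0
        latSpacing = (north - south) / (nRows - 1);
        lonSpacing = (east - west) / (nCols - 1);
    }

    /** Returns {latitude, longitude} for a (possibly fractional) row and column. */
    double[] gridToGeographic(double row, double col) {
        return new double[]{lat0 + row * latSpacing, lon0 + col * lonSpacing};
    }

    /** Returns {row, column}, possibly fractional, for a latitude and longitude. */
    double[] geographicToGrid(double lat, double lon) {
        return new double[]{(lat - lat0) / latSpacing, (lon - lon0) / lonSpacing};
    }

    public static void main(String[] args) {
        double h = 1.0 / 60.0; // ETOPO1 spacing, one minute of arc
        LinearGridMapping m = new LinearGridMapping(10800, 21600,
                -90 + h / 2, -180 + h / 2, 90 - h / 2, 180 - h / 2);
        double[] rc = m.geographicToGrid(0.0, 0.0);
        // The equator and prime meridian fall halfway between grid points
        System.out.println("row " + rc[0] + ", column " + rc[1]);
    }
}
```

For the ETOPO1 settings, the point at latitude 0, longitude 0 maps to approximately row 5399.5, column 10799.5. A fractional result of this kind is what an interpolation routine would use to weight the neighboring samples.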

Data definition: dimension and data type.

As noted above, a GVRS definition may specify one or more elements for each grid cell. These elements can be of the same or a mixed data type. GVRS currently supports four different data types:

Short -- data stored as two-byte signed integers (GvrsElementSpecificationShort)
Integer -- data stored as four-byte signed integers (GvrsElementSpecificationInt)
Float -- data stored as single-precision floating-point values (GvrsElementSpecificationFloat)
Integer-coded-float -- floating point data is scaled and stored as an integer value (GvrsElementSpecificationIntCodedFloat)

All of these elements allow the specification of range-of-value limits and "fill" values. These parameters may be specified using the alternate constructors for the GvrsElementSpecification classes. See the API Javadoc for further information.

A fill value is the value of any grid cell that has not otherwise been populated. Conceptually, you can think of a GVRS file as being initially populated with fill values when it is created. However, fill values are not actually stored in the file unless they are explicitly set by the calling application.

The integer-coded-float is a special case in that it requires a scale and offset value as part of its constructor:

	GvrsElementSpecificationIntCodedFloat(String name, float scale, float offset)

The scale and offset values specified as part of the integer-coded-float model are used in cases where floating point values are to be converted to integers for internal storage. While this approach can result in reduced precision for the input data, it has advantages when compressing the data. The GVRS implementations of integer compression attain better compression ratios than the implementation for floating-point values. If data compression is not required, the integer-coded option has little advantage. However, it is worth noting that this format is sometimes encountered in raster data formats used for publicly available data sources and the GVRS equivalent may be useful in handling such sources.

Scale and offset are treated as follows:

intValue = (floatValue-offset) * scale
floatValue = (intValue/scale) + offset
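A short worked example may help show how the scale factor affects precision. The sketch below applies the two formulas above in plain Java. The class and method names are hypothetical, and the use of round-to-nearest when converting to an integer is an assumption; the formulas as given do not say how the fractional part is handled, and the library may treat it differently.

```java
// Applies the scale-and-offset formulas to show the effect on precision.
final class IntCodedFloat {

    /** intValue = (floatValue - offset) * scale, rounded to nearest (assumed). */
    static int encode(float value, float scale, float offset) {
        return Math.round((value - offset) * scale);
    }

    /** floatValue = (intValue / scale) + offset */
    static float decode(int code, float scale, float offset) {
        return code / scale + offset;
    }

    public static void main(String[] args) {
        float depth = 12.3f;
        // With a scale of 1, values are stored to the nearest whole unit
        System.out.println(decode(encode(depth, 1f, 0f), 1f, 0f)); // prints 12.0
        // With a scale of 2, values are stored to the nearest half unit
        System.out.println(decode(encode(depth, 2f, 0f), 2f, 0f)); // prints 12.5
    }
}
```

Under the rounding assumption, the largest error introduced by the conversion is 1/(2 x scale), so a scale of 2 limits the error to a quarter of a unit.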

Data Compression

Data compression is enabled as part of the file specification:

spec.setDataCompressionEnabled(compressionEnabled);

Data compression is an interesting topic and will be discussed in more detail in a future wiki article. For now, we will simply note its impact on the storage for ETOPO1 and GEBCO_2019 data. In uncompressed form, ETOPO1 data is stored as a 4-byte integer. GEBCO_2019 is stored as a 4-byte float.

Product	Size (bits/sample)	Number of Samples	Time to Process (sec)
ETOPO1	4.460	233,280,000	68.3
GEBCO	2.909	3,732,480,000	1215.5
GEBCO x 2	3.59	3,732,480,000	1320.3
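The bits-per-sample figures in the table can be converted into approximate storage sizes and compression ratios. The sketch below does that arithmetic in plain Java, taking the 32-bit (4-byte) uncompressed size stated above as the baseline. The class and method names are hypothetical, and the estimate ignores file headers and other overhead, so it should be read as a rough guide rather than an exact file size.

```java
// Converts bits-per-sample figures into approximate sizes and ratios.
final class CompressedSizeEstimate {

    /** Approximate storage in bytes, ignoring headers and other overhead. */
    static double estimatedBytes(long nSamples, double bitsPerSample) {
        return nSamples * bitsPerSample / 8.0;
    }

    /** Ratio of uncompressed to compressed size for one sample. */
    static double compressionRatio(double uncompressedBits, double bitsPerSample) {
        return uncompressedBits / bitsPerSample;
    }

    public static void main(String[] args) {
        System.out.println("ETOPO1 bytes: " + estimatedBytes(233_280_000L, 4.460));
        System.out.println("ETOPO1 ratio: " + compressionRatio(32, 4.460));
        System.out.println("GEBCO bytes:  " + estimatedBytes(3_732_480_000L, 2.909));
        System.out.println("GEBCO ratio:  " + compressionRatio(32, 2.909));
    }
}
```

With the table's values, ETOPO1 comes to roughly 130 million bytes (a ratio of about 7 to 1) and GEBCO to roughly 1.36 billion bytes (about 11 to 1).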

The bits per sample value for GEBCO is lower than that for ETOPO1 because the sample points are closer together (15 seconds of arc versus 1 minute) and there tends to be less variation in its values from sample to sample. Unfortunately, in order to store floating point values using data compression, they need to be converted to integer values as described above. Since the GEBCO_2019 elevation and depth values are non-integral, truncating the decimal part of their values loses some information. For the second GEBCO value in the table above (GEBCO x 2), a scaling factor of two was specified for the data model:

GvrsElementSpecification zElemSpec = new GvrsElementSpecificationIntCodedFloat("z", 2.0f, 0.0f);

Again, we emphasize that the value truncation is required only for compressed data. When data is stored in its non-compressed form, the full precision of the 4-byte floating-point variables is maintained.

Adding Supplemental Content with GVRS Metadata Records

In order to maintain a simple design, the GVRS API presents a deliberately minimal feature set. Even so, we recognize that many users have application-specific requirements for the product. In some cases, users may wish to attach supplemental data to a GVRS file. This requirement can be met through the use of GVRS Metadata records.

GVRS Metadata records are blocks of text or binary data that can be stored as part of a GVRS file. The GvrsMetadata class serves as a container for receiving and delivering data in the form of standard primitive types including int, short, float, double, String, and byte. Each instance of the GvrsMetadata class is identified using an arbitrary name and optional integer ID. The following code snippet shows an example:

GvrsMetadata applicationSpecificData = 
	new GvrsMetadata("AppString", 0, GvrsMetadataType.String);
applicationSpecificData.setString("my application data");
gvrs.writeMetadata(applicationSpecificData);

GvrsMetadata result = gvrs.readMetadata("AppString", 0);
String resultString = result.getString();
System.out.println("result: "+resultString);

The example above creates an arbitrary metadata instance with the name "AppString", an integer identifier of zero, and a data-type specification of String. The integer ID serves as a way of ensuring that the particular metadata entry is unique. In cases where uniqueness is not required, an alternate constructor is supported:

GvrsMetadata note1 = new GvrsMetadata("Note", GvrsMetadataType.String);
GvrsMetadata note2 = new GvrsMetadata("Note", GvrsMetadataType.String);
note1.setString("Elevation in meters MSL");
note2.setString("All data accurate to 1 meter.");
gvrs.writeMetadata(note1);
gvrs.writeMetadata(note2);

The ETOPO1 and GEBCO products provide an example of how this feature may be used. Both data sets are based on geophysical information and, naturally, there are industry standards for specifying metadata related to their content. Because GVRS is intended to be a general-purpose utility, adding direct support for such standards is outside its scope. However, the PackageData demonstration application implements code for storing relevant metadata in a GVRS file.

One of the example metadata elements used in the PackageData demonstration code is based on the Well-Known Text (WKT) standard. Well-Known Text is used by many Geographic Information Systems (GIS) to supply information about coordinate systems (map projections), units of measure, specifications for Earth's size and shape ("datums"), and other information needed to accurately represent the data on a map. While neither ETOPO1 nor GEBCO supply WKT files in their distributions, both include relevant specifications on their product web sites. For the demo, we used the information to create a file called GlobalMSL.prj that contains metadata in WKT format. GlobalMSL.prj is bundled as a resource with the Gridfour demo module and is accessed by the PackageData application as shown below:

String wkt = readGlobalMSL();

GvrsMetadata metadataWKT = new GvrsMetadata("WKT", 0, GvrsMetadataType.String);
metadataWKT.setDescription("Well-Known Text, geographic metadata");
metadataWKT.setString(wkt);
gvrs.writeMetadata(metadataWKT);

One issue for future consideration is whether the Gridfour project can create a set of standardized names for storing metadata in a GVRS file. Some commonly used values including WKT, TIFF (for TIFF files), and Copyright are already defined in the GvrsMetadataConstants class. The development of a formally managed standard for specifying identification strings in GVRS would depend on much wider adoption of GVRS than is currently the case.

Conclusion

The information given in this wiki article should be enough to get you started using GVRS. The use of many of the functions described above is demonstrated in the PackageData.java application included in the Gridfour software distribution.

The GVRS API is still under development. While it has undergone quite a bit of testing, it has seen very little actual use. If you encounter issues using GVRS, please let us know. Also, if you identify new features or enhancements you would like added to the API, we welcome your suggestions.

References

Adobe Systems Inc., 1992. TIFF Revision 6.0, Final-June 3, 1992. Accessed December 2019 from https://www.itu.int/itudoc/itu-t/com16/tiff-fx/docs/tiff6.pdf

General Bathymetric Chart of the Oceans [GEBCO], 2019. GEBCO Gridded Bathymetry Data. Accessed December 2019 from https://www.gebco.net/data_and_products/gridded_bathymetry_data/

National Oceanic and Atmospheric Administration [NOAA], 2019. ETOPO1 Global Relief Model. Accessed December 2019 from https://www.ngdc.noaa.gov/mgg/global/

Sonalysts, Inc., 2019. wXstation. Accessed December 2019 from http://www.sonalysts.com/products/wxstation/

University Corporation for Atmospheric Research [UCAR], 2019. NetCDF-Java Library. Accessed December 2019 from https://www.unidata.ucar.edu/software/netcdf-java/current/