What is the Australian Geoscience Data Cube (AGDC)? - GeoscienceAustralia/agdc GitHub Wiki

What is the Australian Geoscience Data Cube (AGDC)?

The AGDC (data cube) is a common analytical framework composed of a series of data structures and tools which facilitate the organisation and analysis of large gridded data collections.

The data cube was initially developed for the analysis of temporally rich earth observation data. However, the flexibility of the platform allows other gridded data collections to be included and analysed as well, such as elevation models, geophysical grids, interpolated surfaces and model outputs. A key characteristic of the data cube is that every unique observation is kept, in contrast to many other methods used to handle large gridded data collections. The data cube provides a common analytical framework that allows multiple data sources to produce information for multiple uses.

Another key characteristic of the data cube is the inclusion of calibrated data. In the past, the calibration of earth observation data has been performed on a per-application basis for selected times and areas, leading to large overheads. Traditional approaches to big data are inefficient because they process only a few images at a time, so costly and difficult calibration steps must be completed by every user, and most of the data stay in the archive.

By calibrating the entire data stream to the same standard and making the data accessible in a High Performance Data (HPD) structure co-located with a High Performance Computing (HPC) facility, the data can be viewed as enabling infrastructure for data-intensive science.

By transforming raw data into a standardised data infrastructure, the requirement for difficult and time-consuming pre-processing of the data is eliminated from individual applications. This allows an increased capacity for development of information products by the earth observation community, and increased value for the public from earth observation information.

The first data stream to be transformed into the data cube structure has been the Australasian Landsat archive. Geoscience Australia has collected Landsat data using our ground station at Alice Springs since 1979. Until recently, however, only a few images at a time could be retrieved from the archive and processed through to information products. This was a costly and slow process that could not scale to deal with the growing data challenge.

Data volumes typically grow faster than our ability to extract useful information. We are on the cusp of a ‘data deluge’. New satellites from the USA, Europe, Japan and elsewhere are producing far more data than ever before (Figure 3). A new approach was needed. With earth observation data held in an HPD structure, the thousands of cores available in an HPC facility allow for rapid processing. Very large analyses are now possible within a matter of hours, rather than months or years, and at high resolution (e.g., nominally 25 metres for all of Australia).

How it works

The data cube works by organising the data into stacks of consistent, time-stamped geographic ‘tiles’ so that they can be rapidly manipulated in an HPC environment (Figure 4). A relational database is used to track all of the tiles in the data cube. Although the data cube contains some 10E14 individual observations, the database can be used to track every observation back to the point of collection. The data cube is not optimised for any one use, but provides HPD/HPC infrastructure for many possible uses.
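As an illustration of the tile-tracking idea, the following is a minimal sketch of a relational index over time-stamped tiles, using Python's built-in sqlite3 module. It is not the AGDC database schema: the table name `tiles`, its columns, the function names, the scene identifiers and the file paths are all hypothetical and chosen only for this example. The sketch shows how one query can return the time-ordered stack for a single tile location, with each record keeping a reference back to the scene it came from.

```python
import sqlite3


def create_index(conn):
    """Create a hypothetical tile index (not the real AGDC schema)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tiles (
            x_index      INTEGER NOT NULL,
            y_index      INTEGER NOT NULL,
            acquired     TEXT    NOT NULL,
            source_scene TEXT    NOT NULL,
            path         TEXT    NOT NULL,
            PRIMARY KEY (x_index, y_index, acquired, source_scene)
        )
        """
    )


def add_tile(conn, x_index, y_index, acquired, source_scene, path):
    """Register one tile; `acquired` is an ISO 8601 timestamp string."""
    conn.execute(
        "INSERT INTO tiles VALUES (?, ?, ?, ?, ?)",
        (x_index, y_index, acquired, source_scene, path),
    )


def time_stack(conn, x_index, y_index, start=None, end=None):
    """Return (acquired, source_scene, path) rows for one tile location,
    ordered by acquisition time and optionally limited to [start, end)."""
    query = (
        "SELECT acquired, source_scene, path FROM tiles "
        "WHERE x_index = ? AND y_index = ?"
    )
    params = [x_index, y_index]
    if start is not None:
        query += " AND acquired >= ?"
        params.append(start)
    if end is not None:
        query += " AND acquired < ?"
        params.append(end)
    # ISO 8601 strings sort chronologically, so a text sort is sufficient.
    query += " ORDER BY acquired"
    return conn.execute(query, params).fetchall()


def build_demo_index():
    """Build an in-memory index with a few made-up tile records."""
    conn = sqlite3.connect(":memory:")
    create_index(conn)
    demo_rows = [
        # Deliberately inserted out of time order.
        (149, -36, "2012-06-03T23:50:00", "scene_C", "tiles/149_-36/2012-06-03.tif"),
        (149, -36, "1998-01-05T23:40:00", "scene_A", "tiles/149_-36/1998-01-05.tif"),
        (150, -36, "1998-01-05T23:40:00", "scene_A", "tiles/150_-36/1998-01-05.tif"),
        (149, -36, "1998-01-21T23:41:00", "scene_B", "tiles/149_-36/1998-01-21.tif"),
    ]
    for row in demo_rows:
        add_tile(conn, *row)
    return conn


if __name__ == "__main__":
    index = build_demo_index()
    for acquired, scene, path in time_stack(index, 149, -36):
        print(acquired, scene, path)
```

Because each tile location is queried independently, stacks for different locations could in principle be handed to separate workers, which is one way to read the statement that the tiles can be rapidly manipulated in an HPC environment.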

Proving the concept

So far, we have used the data cube to complete two analyses that were previously impractical. The first study, conducted for the Murray Darling Basin Authority, processed earth observations from space for the entire Murray Darling Basin (roughly 1/7 of the total area of Australia) to map changes in greenness and vegetation health over the 15-year period from 1998 to 2012, at a nominal 25 metre resolution. The second study, Water Observations from Space (WOfS), was continental in scale: it mapped observations of surface water for all of Australia using data from 1998 to 2012, also at a nominal 25 metre resolution, as an information source for the National Flood Risk Information Portal. WOfS is accessible at http://www.ga.gov.au/flood-study-web/#/water-observations
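To give a feel for what a per-pixel analysis over a time stack can look like, here is a minimal sketch that summarises how often each pixel was observed as water. It is an assumption made for illustration, not the published WOfS algorithm: the function name `water_frequency`, the input layout (a list of small 2-D grids) and the use of `None` for "no clear observation" are all hypothetical.

```python
def water_frequency(stack):
    """Summarise a time stack of water classifications per pixel.

    `stack` is a list of 2-D grids (lists of rows) of equal shape. Each
    cell is True (water observed), False (clear observation, no water)
    or None (no clear observation, e.g. cloud or missing data).

    Returns a grid of the same shape holding the fraction of clear
    observations that were water, or None where a pixel was never
    clearly observed.
    """
    if not stack:
        return []
    n_rows = len(stack[0])
    n_cols = len(stack[0][0])
    wet = [[0] * n_cols for _ in range(n_rows)]
    clear = [[0] * n_cols for _ in range(n_rows)]

    for grid in stack:
        for r, row in enumerate(grid):
            for c, value in enumerate(row):
                if value is None:
                    continue  # skip pixels without a clear observation
                clear[r][c] += 1
                if value:
                    wet[r][c] += 1

    return [
        [wet[r][c] / clear[r][c] if clear[r][c] else None for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Keeping every unique observation in the stack, rather than a pre-made composite, is what makes this kind of count possible: the summary is computed from the individual observations at analysis time.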

What next?

Like most new and innovative technologies, the data cube continues to develop while it is in use. Future work will include:

  • Closer management of the data collections going into the data cube, and an increased number of datasets
  • Building of robust ‘tools’ to extract data from the data cube and to analyse it
  • Building of ‘work flows’ and ‘virtual laboratories’ to process data from the data cube to produce information
  • Developing standard ‘services’ that will ultimately support mobile applications

Getting involved

The data cube is a collaboration between Geoscience Australia, the National Computational Infrastructure and the CSIRO. For more information on working with this team on the data cube, contact [email protected]