Curated Reference|Challenge Data Sets - rmcgranaghan/data_science_tools_and_resources GitHub Wiki

This page holds information (including links and citations, if available) for curated and AI-ready space science data sets that can help reduce the barrier to entry for use of data science and machine learning tools for scientific discovery.

We begin with a definition of "AI-ready" that has been revised and produced by the community. Below are data sets that meet these criteria. If you have an AI-ready dataset, open an issue in this repository providing the requisite resources for your data set, and we will evaluate it and add a reference here if it meets the criteria.

What makes data "AI-ready"?

Curated "AI-ready" data sets (see note below) minimize duplication of effort between users and optimize the data for ingestion into an ML platform or tool. Without careful, conscientious preparation by an expert, many machine learning projects are doomed to generate spurious results and to sideline users, who must learn bit by bit how to deal with every minor artefact or calibration offset.

“AI readiness” is the process by which “raw” data or data products are transformed into a “feature set” for use in AI/ML analysis. It transitions the data from the subject matter domain to statistical applicability.

The key is the (a) availability and (b) usability of (c) a [large] amount of (d) similarly processed data that can be (e) readily fed into analysis algorithms or tools. Much of this is satisfied by standard science data archives and interfaces. However, data curation for ML can require additional steps: resampling, interpolating, and patching missing or spurious values so that the data are regularly sampled across temporal and spatial dimensions. Further still, metadata are used to co-align, co-register, and cross-calibrate the measurements (temporally, spatially, spectrally) to ensure that each point corresponds as closely as possible to the others in the data set. The curated data and metadata must then be made available in a way that is easy to read and understand.
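The resampling and gap-patching steps above can be sketched with pandas. This is a minimal illustration, not a prescription: the column name (`bz_nT`), the 1-minute cadence, the simulated dropout, and the interpolation gap limit are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, irregularly sampled measurements (names and values are illustrative)
rng = np.random.default_rng(0)
times = pd.to_datetime("2015-01-01") + pd.to_timedelta(
    np.sort(rng.uniform(0, 3600, 50)), unit="s"
)
df = pd.DataFrame({"bz_nT": rng.normal(0, 5, 50)}, index=times)
df.iloc[10:13] = np.nan  # simulate a short data dropout

# Resample to a regular 1-minute cadence, then interpolate only small gaps;
# long gaps stay NaN so they can be flagged rather than silently invented
regular = df.resample("1min").mean().interpolate(limit=3)
```

The `limit` argument is one way to encode the curation judgment that short dropouts may be patched but long ones should remain visibly missing.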

A dataset can be AI ready, or a user can make use of AI-ready preparation tools (e.g. AIDApy). However, AI readiness is relative - a dataset that is suitable for one topic or use may not be suitable for another topic. Determining appropriateness is an important step of data preparation.

Elements of an AI-ready dataset

AI-ready data preparation typically involves several steps:

  • Data are [calibrated, promoted in level, standardized, etc.] so that values correspond well to the physical system being studied
  • Spurious data and non-physical values are either corrected or identified
  • Data are interpolated, patched, etc. to provide even or consistent sampling, and may be re-sampled in space or time to synchronize with other data that could be used in the feature set.
  • Data labels (such as features and event indicators), if relevant, are compatible with the feature set, i.e., labels “Y” have a format compatible for learning with feature set “X” as an input.
  • Metadata are appended to be consistent with common formats, ensure complementarity between the features, and any other supplemental information is applied to make the data usable.
  • Data release standards are satisfied, with “all clear” from the preparers of the dataset for general distribution and use.
  • The data are then made available in a format that is easy to read into an AI algorithm (e.g. Keras, PyTorch)
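As a minimal sketch of the last two steps (label/feature compatibility and a format any ML library can ingest), the sliding-window function below turns a regularly sampled series into aligned `X`/`y` arrays; the window length and forecast horizon are illustrative choices, not part of any standard.

```python
import numpy as np

def make_windows(series, window=8, horizon=1):
    """Slide a window over `series`: each row of X holds `window` past
    values, and y holds the value `horizon` steps ahead, so labels stay
    aligned with their input features."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i : i + window])
        y.append(series[i + window + horizon - 1])
    return np.stack(X), np.array(y)

X, y = make_windows(np.arange(20.0))
# X.shape == (12, 8), y.shape == (12,); y[0] is the value at index 8
```

NumPy arrays shaped this way can be handed directly to Keras, PyTorch (via `torch.from_numpy`), or scikit-learn.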

Optional (but often desirable) steps:

  • Usable code with commonly performed steps (query/sampling, data ingest, etc.). The target is to provide the steps that almost every user would have to develop on their own.
  • Feature scaling optimized for learning (such as normalization options like logarithmic for multi-scale sensitivity)
  • A description of the types of physical processes that are well-sampled and therefore can be addressed or approached with each given data set
  • A structuring of the data such that it supports complex queries
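The feature-scaling bullet above can be illustrated with a logarithmic transform. The intensity values are made up for illustration (loosely echoing the ~10 DN coronal hole vs. ~1000 DN active region scales mentioned later on this page); the point is only that a log transform compresses multi-scale quantities before learning.

```python
import numpy as np

# Illustrative multi-scale intensities (DN); values are synthetic
intensity = np.array([10.0, 50.0, 200.0, 1000.0])

# log10 compresses the dynamic range so the largest values do not dominate
log_scaled = np.log10(intensity)

# Optionally rescale to [0, 1] for algorithms that expect bounded inputs
normalized = (log_scaled - log_scaled.min()) / (log_scaled.max() - log_scaled.min())
```

Documenting which transform was applied (and how to invert it) is itself part of making the data AI-ready.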

A good representation of the many properties of a "truly" AI-ready dataset is the Data Science Hierarchy of Needs:

Data Science Hierarchy of Needs diagram

Source: https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007

AI Readiness for Scientific Studies

A well-prepared data set will provide supplemental information about the physics underlying it so that the user is aware of what to expect in a typical sample. For example, to analyze heliophysics data it is necessary to understand that: “the solar wind is typically described as having four different conditions: slow solar wind, fast solar wind, compression regions, and ejecta. Nominal solar wind magnetic fields usually range between 0-10 nT and extreme events can be several times larger.” Or “The sun rotates with an average period of 27 days and solar features often will last for months or years. Coronal holes tend to have an average value of 10 DN in 335 angstroms, while active regions can be as bright as 1000 DN in this line. Coronal holes can last for many weeks, but flares are rapid, so this data set’s cadence of 5 minutes works well for coronal holes but is not ideal for flare studies.” This type of information helps bridge the gap between data scientists and heliophysicists.

Finally, a well-prepared dataset allows a potential user to assess whether the dataset is appropriate to their proposed task. A description of the types of problems that can be addressed or approached with a given data set is especially valuable here: it encapsulates commonly known physics and data science pitfalls that can easily be avoided. For example, coronal holes appear clearly in EUV images, but 94 angstrom images do not have enough signal to distinguish them reliably. Or, in the case of data science, irregular data dropouts in a time series may make Fourier analysis unreliable, leaving the data better suited to a Lomb-Scargle periodogram analysis. Additionally, some methods cannot succeed without statistical sampling across the parameter space, so a description of the sampling density is also good to provide.
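The Fourier-vs-Lomb-Scargle point can be sketched with SciPy: the Lomb-Scargle periodogram accepts arbitrary (irregular) sample times, where an FFT would require a uniform grid. The signal, sampling pattern, and frequency grid below are synthetic illustrations.

```python
import numpy as np
from scipy.signal import lombscargle

# Irregularly sampled synthetic signal with a true period of 5 time units
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 100, 200))   # irregular sample times (no FFT grid)
y = np.sin(2 * np.pi * 0.2 * t)

# Angular frequencies to test; lombscargle works directly on irregular t
omega = np.linspace(0.1, 5.0, 500)
power = lombscargle(t, y - y.mean(), omega)

best_period = 2 * np.pi / omega[np.argmax(power)]  # should be close to 5
```

Note that `scipy.signal.lombscargle` takes angular frequencies (rad per time unit), not ordinary frequencies, hence the `2 * np.pi` conversions.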

  • Note: Many make a distinction between "curated" and "AI-ready," where AI-ready involves the specific step of ensuring compatibility with common machine learning platforms/libraries. In this discussion, the two terms are used interchangeably, but in the future we may want to distinguish between the two.

Data Sets

(Unproven thus far) Data Sets

  • K. D. Leka's curated set of emerging active regions (need to add)

Resources

  • Criteria for 'analysis ready, cloud optimized (ARCO)' from Pangeo
  • The Python package AIDApy (Artificial Intelligence Data Analysis) centralizes and simplifies access to: spacecraft data from heliospheric missions; space physics simulations; and advanced statistical tools, machine learning, and deep learning algorithms and applications
  • Although not technically a curated database, the Virtual Space Physics Observatory keeps a list of online catalogs that are a useful resource for identifying key intervals and events. Go to https://vspo.gsfc.nasa.gov/ and, in the form under "Element Restriction", choose "Catalogs" as the Resource Type.
  • Blog post on 'analysis ready data' (not strictly AI-ready)