Data-driven analysis based on statistical properties can complement model-based interpretations of the observations - cpshooter/geoML GitHub Wiki

In sifting through large martian data sets to find observations of particular scientific interest, there is always a lingering question: "what are we missing?" Targeted searches for materials such as phyllosilicates or carbonates might successfully identify key observations of each, but given limited analysis time we might inadvertently pass over other, equally interesting compositions. Data sets with high dimensionality pose an additional challenge for efficient analysis and interpretation. Two prime examples are point spectra, such as those collected by LIBS (Laser-Induced Breakdown Spectroscopy) instruments like ChemCam on the Mars Science Laboratory, and hyperspectral data, such as that collected by CRISM on the Mars Reconnaissance Orbiter.

We have developed DEMUD (Discovery through Eigenbasis Modeling of Uninteresting Data), a method that quickly scans through large data sets to identify observations that stand out as unusual or anomalous. It focuses attention on new or unexpected observations, accelerating the discovery of new phenomena. Further, DEMUD learns from feedback: each item it selects can be designated 'interesting' or 'uninteresting', and DEMUD learns to ignore items deemed uninteresting even if they are otherwise anomalous (e.g., observations that contain noise or bad pixels, or novelties that are now well understood). DEMUD progressively peels back layers of anomalies in the data set without specializing in one particular science goal. It therefore remains open to discovering phenomena that the mission may not have anticipated before data collection. This data-driven analysis based on statistical properties can complement model-based interpretations of the observations.

In addition to highlighting items of interest, DEMUD also provides explanations for individual decisions, e.g., highlighting specific wavelengths where the chosen sample has unusually high, or low, intensity.
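
The select-and-learn loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the DEMUD implementation: the text here does not spell out the algorithm, so the sketch assumes that the "uninteresting" model is a low-rank eigenbasis (mean plus top-`k` SVD directions) of the items seen so far, that the next item selected is the one with the largest reconstruction error, and that only items flagged uninteresting are folded back into the model. The names `fit_basis`, `residuals`, `discover`, and `is_interesting` are hypothetical, and the basis is refit from scratch at each step for simplicity.

```python
import numpy as np

def fit_basis(seen, k):
    """Mean and up to k principal directions of the rows of `seen`."""
    mean = seen.mean(axis=0)
    _, s, vt = np.linalg.svd(seen - mean, full_matrices=False)
    rank = int((s > 1e-9).sum())      # drop directions with no variance
    return mean, vt[:min(k, rank)]

def residuals(data, mean, basis):
    """Per-feature difference between each row and its reconstruction."""
    centered = data - mean
    return centered - centered @ basis.T @ basis

def discover(data, k, n_select, is_interesting, seed_index=0):
    """Return [(row index, residual vector), ...] in selection order."""
    modeled = [seed_index]    # rows that define the "uninteresting" model
    excluded = {seed_index}   # rows that may not be selected again
    picks = []
    for _ in range(min(n_select, len(data) - 1)):
        mean, basis = fit_basis(data[modeled], k)
        res = residuals(data, mean, basis)
        scores = (res ** 2).sum(axis=1)       # reconstruction error per row
        scores[list(excluded)] = -np.inf
        best = int(np.argmax(scores))
        picks.append((best, res[best]))       # residual doubles as explanation
        excluded.add(best)
        if not is_interesting(best):          # assumed feedback rule
            modeled.append(best)
    return picks

# Toy data: 20 six-channel "spectra" on a line, one with a spike in channel 4.
data = np.outer(np.linspace(0.0, 1.0, 20), np.ones(6) / np.sqrt(6.0))
data[13, 4] += 10.0

picks = discover(data, k=2, n_select=3, is_interesting=lambda i: False)
first_index, first_residual = picks[0]
print(first_index, int(np.argmax(np.abs(first_residual))))
```

On this toy input the spiked row is selected first, and the largest entry of its residual falls on the spiked channel, which is the kind of per-wavelength explanation the paragraph above describes.
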
DEMUD can therefore greatly accelerate not just the discovery but also the interpretation of novel observations within large data sets.

We tested DEMUD on two kinds of data: data collected by CRISM, and LIBS observations of rock samples and of the same laboratory standards used to calibrate ChemCam. The CRISM analysis replicated the discovery of magnesite (a carbonate) in an image of Nili Fossae (FRT00003E12). In the LIBS experiments, DEMUD quickly identified a variety of carbonates as anomalous with respect to olivine, basalt, and andesite samples. Compared with ranking the samples by a standard Principal Components Analysis, DEMUD halved the number of queries needed to find all samples of interest. We expect that DEMUD could similarly accelerate analysis both of large data sets and in time-constrained mission operations settings. We also produced a ranked list of heterogeneous novelties within each data set as a demonstration of DEMUD's utility as an interpretive tool.

While the ChemCam instrument will not produce excessively large data volumes on a daily basis, as the mission goes on it will be important to interpret new data in the context of all that has been collected, as well as of new laboratory standards observed under martian conditions. DEMUD can assist in the full-collection analysis and interpretation each time new data is added to the archive.
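
The comparison against a PCA ranking can be made concrete by counting queries. The sketch below is one plausible reading of that baseline and is an assumption, not the evaluation code behind the reported result: it fits a single PCA to the whole data set, ranks samples by reconstruction error, and counts how many samples must be inspected in rank order before every sample of interest has been seen. `pca_ranking` and `queries_to_find_all` are hypothetical names.

```python
import numpy as np

def pca_ranking(data, k):
    """Rank rows by reconstruction error under one PCA fit to all rows."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    res = centered - centered @ basis.T @ basis
    scores = (res ** 2).sum(axis=1)
    return [int(i) for i in np.argsort(-scores)]   # most anomalous first

def queries_to_find_all(order, targets):
    """Number of items inspected, in order, until all targets are seen."""
    remaining = set(targets)
    if not remaining:
        return 0
    for n_queries, index in enumerate(order, start=1):
        remaining.discard(index)
        if not remaining:
            return n_queries
    return None   # some target never appears in `order`

# Toy data: 8 five-channel samples on a line, one with a spike in channel 2.
data = np.outer(np.linspace(0.0, 1.0, 8), np.ones(5))
data[5, 2] += 3.0

order = pca_ranking(data, k=1)
print(order, queries_to_find_all(order, {5}))
```

Applying the same `queries_to_find_all` count to the selection order produced by an iterative method gives a like-for-like figure for the two approaches.
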