when numbers are real - Josh-Joseph/data_science GitHub Wiki

When Numbers are Real

Usually, data science analyses deal with "math-land" [1] numbers. However, sometimes the numbers we are handling are tied to a real, physical quantity. Of course, we can always ignore reality and stay in math-land -- and in fact, for starters, this may always be the thing to do. But there are certain situations where the "reality" of our numbers can actually be extremely useful. In this post I'll tackle two main situations where thinking about the reality of a number has helped: 1) detecting bogus/bad data for which you do not have the provenance, and 2) differentiating the sources of data for which you do not have the provenance. You'll notice that both situations involve a lack of knowledge about the provenance of the data -- or a lack of belief in the observed/given provenance of a data set.

Provenance

Any time we want to analyze a data-set we must think about the "provenance" of the data. In this context "provenance" means all of the information needed to completely reproduce the data-set (if need be). This includes both the experimental conditions of the data collection and all transformations applied to the data thereafter.

In an ideal world, we would know the provenance of all the data we handle, and we would provide all provenance information when we transmit and share data with other parties. The real world is rarely ideal. Most of the time, we will have a reasonable guess as to the provenance of the data. Sometimes we have no clue at all, and we must infer consistent histories of the transformations applied to the data from our prior information and the data itself. We also have to be wary of being given incorrect provenance information.

So, in the case where we have no provenance, bad provenance, or incomplete provenance, we have to think about how to proceed with an analysis. We could ignore provenance altogether, in which case we must test every assumption, since we are not using any knowledge of the data's source, collection mechanism, experimental design, or post-processing. Ignoring provenance -- and still performing useful analysis -- is one of the "holy grails" of data science. It implies that we have somehow figured out how to create and run analyses without needing to think about the data whatsoever: a truly universal analysis. I have yet to pull that off.

If we do want to reason about the provenance, we can simply choose to treat what we have been told/"know" about the data as true and try to make the most of it. This boils down to a problem of trust: how much do you trust the source of the provenance information? In my experience, nefariously edited provenance information is rare (but it does crop up once in a while). However, data science projects are often a hodgepodge of techniques, people, and goals thrown together with too little direction, too little time, and too much bias; errors in provenance are almost unavoidable!

The Physical Sensor

Middle School: The Significant Digits

The Edge of Science and Engineering: Common Measurement Limits

Here are some common limits for spatial data (refs: gps-gov):

  • GPS (real-time no-post-processing non-differential vanilla): 3.5m
  • GPS (real-time augmented): 1cm
  • GPS (post-processing augmented): 1mm
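These limits give us something concrete to check a data-set against. As a rough sketch (the function names and the degrees-to-meters approximation are mine, not from a particular library), we can count the decimal places actually used in a column of latitude values and convert that into the finest spatial resolution the numbers imply. If a data-set claimed to come from vanilla real-time GPS (~3.5m) but every value carries seven decimal places of genuine variation (~1cm), either post-processing happened or the stated provenance is wrong:

```python
from decimal import Decimal

def decimal_places(x):
    """Count the digits after the decimal point in x's string form."""
    d = Decimal(str(x)).normalize()
    return max(0, -d.as_tuple().exponent)

# Rough conversion: 1 degree of latitude is ~111 km on the ground.
METERS_PER_DEGREE = 111_000

def implied_resolution_m(lat_values):
    """Finest spatial resolution (in meters) implied by the decimal
    precision present in a sample of latitude values."""
    places = max(decimal_places(v) for v in lat_values)
    return METERS_PER_DEGREE * 10 ** (-places)
```

For example, `implied_resolution_m([42.1, 42.12])` reports a resolution of 1110 m (two decimal places of a degree), comfortably coarser than any GPS mode; values like `42.1234567` would imply ~1cm and should make you question a "vanilla GPS" provenance claim.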

Time data (refs: time-man):

  • software clocks (most common OS library call): 0.001s
  • high-resolution timers (hardware based): 1us (microsecond resolution)
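The same trick works in time. One sketch (again, the function name is mine): estimate the underlying clock tick as the greatest common divisor of successive timestamp differences. A data-set whose provenance claims hardware high-resolution timers, but whose timestamps all land on 1000-microsecond ticks, was more plausibly produced by a software clock:

```python
from math import gcd

def clock_tick_us(timestamps_us):
    """Estimate the clock tick (in integer microseconds) as the GCD of
    successive timestamp differences. A result of 1000 suggests a
    millisecond software clock; 1 is consistent with a hardware timer."""
    diffs = [b - a for a, b in zip(timestamps_us, timestamps_us[1:]) if b != a]
    tick = 0
    for d in diffs:
        tick = gcd(tick, d)
    return tick
```

For example, `clock_tick_us([0, 1000, 3000, 7000])` returns 1000, i.e. millisecond granularity. This is only evidence, not proof -- a short sample can land on coarse ticks by coincidence -- but it is exactly the kind of "reality check" the limits above make possible.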

Detecting "Post-Processing"

Detecting a "Source"

The Particulars: Spatial Data

The Particulars: Temporal Data

Wrap-up

Footnotes

[1]: See our post "math land, turing land, and the real world".