Data formats - leondutoit/data-centric-programming GitHub Wiki

This section is not about data types - such as integers, floating point numbers, strings and dates. It is about common data storage and interchange formats.

Delimited files

The workhorses of data analysis are delimited files - tab and comma-separated files often referred to as TSV and CSV files. This should be familiar to almost anyone who has read this far, so I won't spend much time on it. Both python and R have good tools for reading and writing from such files.

Suppose we have a small CSV file:

$ touch myfile.csv
$ echo "day, sleep" >> myfile.csv
$ echo "1, 8" >> myfile.csv
$ echo "2, 2" >> myfile.csv
$ echo "3, 10" >> myfile.csv

Let's read it, transform it and write it out to a new file with Python pandas and R. In both cases, data is read into dataframes.

Python pandas file IO

IO is short for Input/Output. Let's look at reading and writing CSV files with pandas.

import pandas as pd

df = pd.read_csv("myfile.csv")
print df # have a look
# square the values for some nonsense computation
df = df.apply(lambda x: x*x, axis = 0)
df.to_csv("myfile_modified.csv")

This will create a new csv file with the squares of the input values. There are many options to control formatting when reading and writing files. Check them out.

R file IO

Let's use the standard library functions:

df <- read.csv("myfile.csv")
df <- df*df
write.csv(df, "myfile_modified_again.csv", row.names = FALSE)

JSON

JSON is short for JavaScript Object Notation. It is hard to overstate the importance of JSON in data analysis. It is important because of its wide-spread use in data interchange (sending data over the network) and its flexibility in storing multidimensional data in a "flat-file" like way. It is also incredibly useful since it is a native data storage format in Javascript and resembles familiar data structures in other programming languages: key-value stores. It is also a compound data type in many databases. The official specification is documented on json.org.

To make things a bit more concrete before discussing JSON more, let's look at what the CSV file above would look like in JSON format. Recall the CSV file looked like this:

"day","sleep"
1,8
2,3
3,6

One JSON representation of this data is this:

{
    "day": [1, 2, 3],
    "sleep": [8, 3, 6]
}

In this case the JSON is composed of two key-value pairs. Each one corresponds to a column in the CSV file. The key is the name of the column and the value is an array of values. Another option looks like this:

{
    [{
        "day": 1,
        "sleep": 8
    },
    {
        "day": 2,
        "sleep": 3
    },
    {
        "day": 3,
        "sleep": 6
    }]
}

In this case the JSON has an array with three objects, each one with two key-value pairs. The key names, as before, are the column names, and the values are the values in the columns that correspond to particular rows. It is inside an array because there is a specific order to the values that have to be preserved. In the example we could, therefore, produce the same CSV file from two different JSON data structures.

the JSON-stat format for tabular data.

tools, jq etc