Home - abrice/dz GitHub Wiki

#dz() API Reference

##Introduction

A good chart needs data. The aim of dz() is to make it easier to manipulate data in order to produce a final set that can be used to produce a chart. dz() is an API construct that simplifies and standardises data manipulation.

dz() is written in JavaScript and requires D3 to be loaded first.

There is always a lot of attention on the chart. That has meant a whole range of libraries have evolved to try to make it easier to use d3 to produce charts. That's very useful, but the first thing that is needed is data! And getting data into a web page, filtering it, recalculating and rearranging it, and then getting it into the right format for a table, a chart, a treemap, a sankey, or a dendrogram all gets a bit complicated and bespoke.

So, dz() tries to address this 'get the data ready' step of the process. Take in data in a standard form, manipulate it according to the passed-in instructions, and then return it in a variety of formats. Do it fast, do it consistently, and do it without too much fiddling. Let the human concentrate on what they are analysing and charting and let the computer worry about the data. That's the theory.

So, step by step...

##Step 1: Get the data in

Easy, use D3. It is brilliant at ingesting data. It can bring in large files (my largest was 1 million rows) with remarkable speed and it has a very simple interface. Two types of data are common:

- Comma Separated Values (csv), which d3.csv() can bring in with just a few lines of code. There is an abundance of data on the web in .csv format or in Excel files that are easily converted to .csv format.
- JavaScript Object Notation (json), which is a bit more specialised but just as easily ingested by d3.json(). JSON ingestion into dz() isn't available yet.

The starting point for dz() is a .csv file that is read into the webpage using d3.csv().

Step 1: Easily ingest data. Solution ==> d3! Easy, effective, and brilliant
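
For reference, d3.csv() hands back an array of row objects keyed by the header names, with every value still a string. The sketch below mimics that shape with a tiny stand-in parser so that it runs without D3 or a file. `parseSimpleCsv` is a hypothetical helper, not part of D3 or dz(); it ignores quoting and embedded commas, and the sample rows are made up for illustration.

```javascript
// Hypothetical stand-in for the row shape produced by d3.csv():
// one object per row, keyed by header name, every value a string.
// Assumes simple CSV with no quoted fields or embedded commas.
function parseSimpleCsv(text) {
    var lines = text.trim().split(/\r?\n/);
    var headers = lines[0].split(",");
    return lines.slice(1).map(function (line) {
        var cells = line.split(",");
        var row = {};
        headers.forEach(function (h, i) {
            row[h] = cells[i];
        });
        return row;
    });
}

// Made-up sample data, for illustration only.
var csvdata = parseSimpleCsv(
    "Location,Year,Apples,Apricots\n" +
    "London,2012,30,30\n" +
    "Singapore,2012,25,12\n"
);

// Values arrive as strings, so numeric work needs a coercion such as +d["Apples"].
console.log(typeof csvdata[0].Apples); // string
```

This is why the filter and calculator examples further down coerce with a unary plus before comparing numbers.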

##Step 2: Manipulate the data

Now the data is there, it's in the perfect format, it runs easily into d3, and a magical chart appears. Mostly not.

The trouble with data is that it needs poking and prodding. It needs to be sliced and diced and generally played with until the right data arrives. The need is to:

- Create new data elements based on whatever calculations seem appropriate
- Filter and group rows arbitrarily
- Structure multiple sets of arbitrary columns from an initial population of columns
- Easily format the data in those columns
- Extract multiple views, based on different row/column selections from the same base data
- Output the data in d3-ready form such as row/column array, key/value[] hierarchical, or name/children[] hierarchical

And do it all through some sort of structured API-like interface.

Excel is the solution, I hear you say. Sometimes it is, but often it's as much trouble as coded manipulation. A simple extract of a pivot table into .csv/.json takes time. And if you change your mind, you have to do it all again. And again. And if you actually need a hierarchical data structure, good luck...

So, dz() does this middle bit, creating an API model to simplify data manipulation.

And finally...

##Step 3: Draw a chart

Solution ==> d3! Easy, effective, and brilliant

##What does dz() actually do?

dz() adopts the Mike Bostock closure/chaining model to try and simplify the data preparation phase in a consistent and reusable manner. The emphasis is on simple. The conceptual flow is:

Data is read in using a separate process such as d3.csv().

Then four function calls are made to dz().

1. The .csv data is passed in ==> dz().basedata(csvdata). This basedata is not changed by dz() and can always be requested back as dz().basedata().
2. An array of columns is passed in ==> dz().datacolumns(columns), which describes all the columns that are available (the master set of columns). There are two types of columns: those that should be extracted from the basedata, and any new columns that need to be calculated. Column calculators are just like in d3 (function (d,i) {}) and all columns (basedata and earlier calculations) are available. This means an enormous range of manipulations are available in a simple and consistent way. If a new column is needed, pass in dz().datacolumns(columns) again with 'columns' updated.
3. Row specifications are passed in to define row groupings and row filters ==> dz().rows(rows). Groupings result in nested data while filters include or exclude rows from the final data set. Filters are also JavaScript functions.
4. An array of columns is passed in ==> dz().columns(columns). These are the columns that will appear in the final dataset, the output columns, and they are (obviously) a subset of dz().datacolumns. Using this subset-of-columns approach separates calculation from presentation at the data level, making it easy to reuse calculated data consistently.

dz() always returns the current output dataset in an array format that can then be used for presentation. Obviously it's the version after applying .columns and .rows.

If the data shape that is needed is key/value[] hierarchical or name/children[] hierarchical then that shape of data can simply be requested and it will be returned instead. Similarly, the data can be passed back in .csv-ready or .json-ready format.

Finally, the entire data specification used to create the returned dataset can be saved as a .json object in a .json file for easy storage, retrieval, or sharing.
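
The getter/setter chaining described above can be sketched in a few lines. This is not the dz() source, which is not yet published: `miniDz` and its internals are hypothetical, and only the method names basedata, datacolumns, rows and columns come from the text. The sketch only stores the specification; it does not compute an output dataset.

```javascript
// Hypothetical sketch of the closure/chaining pattern; not the dz() source.
function miniDz() {
    // State lives in the closure, so each miniDz() call gets its own copy.
    var state = { basedata: [], datacolumns: [], rows: {}, columns: {} };
    var api = {};

    // Build one getter/setter per setting: no argument reads the value,
    // one argument stores it and returns api so calls can be chained.
    Object.keys(state).forEach(function (name) {
        api[name] = function (value) {
            if (!arguments.length) {
                return state[name];
            }
            state[name] = value;
            return api;
        };
    });

    return api;
}

// Made-up sample row, for illustration only.
var csvdata = [{ Location: "London", Year: "2012", Apples: "30" }];

var spec = miniDz()
    .basedata(csvdata)
    .datacolumns([{ field: "Apples", type: "number" }])
    .rows({ groups: [{ field: "Year", sort: false }] })
    .columns({ values: [{ field: "Apples", aggregation: "sum" }] });

// The base data is handed back unchanged on request.
console.log(spec.basedata() === csvdata); // true
```

Keeping the state in a closure means two specifications built from separate calls never share settings, which is the main attraction of this pattern.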

Example json objects passed to dz():

The total population of all possible columns:

    datacolumns = [
        {field:"Location",shorttext:"Global Location",type:"string"},
        {field:"Year",shorttext:"Year",type:"string"},
        {field:"Apples",shorttext:"Apples",type:"number",format:"{0:n2}"},
        {field:"Apricots",shorttext:"Apricots",type:"number",format:"{0:n0}"},
        {field:"Apricots2",shorttext:"A clone of Apricots",type:"number",format:"{0:n0}",
            calculator: function (d) {
                return d["Apricots"];
            }},
        {field:"Mashed Apples",shorttext:"Mashed apples",type:"number",format:"{0:n0}",
            calculator: function (d) {
                return (d["Apples"] * 10000);
            }
        },
        {field:"Random",shorttext:"Random Numbers",type:"number",format:"{0:n0}",
            calculator: function (d) {
                return Math.floor(Math.random() * (10000 - 0 + 1)) + 0;
            }
        }
    ];
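
One way a master column list like this could be turned into rows is to walk the columns in order, taking plain fields from the base row and calling calculator(d, i) for the rest. The helper name `applyDataColumns`, the numeric coercion for type "number", and the choice to hand each calculator the partly built row are all assumptions for illustration; dz() may do this differently.

```javascript
// Hypothetical sketch: build one output row per base row from a column list.
function applyDataColumns(basedata, datacolumns) {
    return basedata.map(function (base, i) {
        // Start from a copy so the base data is never modified.
        var d = {};
        Object.keys(base).forEach(function (k) { d[k] = base[k]; });

        datacolumns.forEach(function (col) {
            if (typeof col.calculator === "function") {
                // Earlier calculated columns are already on d at this point.
                d[col.field] = col.calculator(d, i);
            } else if (col.type === "number") {
                // Assumption: numeric columns are coerced from CSV strings.
                d[col.field] = +base[col.field];
            }
        });
        return d;
    });
}

// Made-up sample row, for illustration only.
var basedata = [{ Location: "London", Year: "2012", Apples: "1.5", Apricots: "4" }];
var datacolumns = [
    { field: "Apples", type: "number" },
    { field: "Apricots", type: "number" },
    { field: "Apricots2", type: "number",
        calculator: function (d) { return d["Apricots"]; } },
    { field: "Mashed Apples", type: "number",
        calculator: function (d) { return d["Apples"] * 10000; } }
];

var calculated = applyDataColumns(basedata, datacolumns);
console.log(calculated[0]["Mashed Apples"]); // 15000
```

Because columns are processed in order, a calculator can only see calculated columns that appear before it in the list.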

The columns that appear in the output dataset:

    columns = {
        values: [
            {field:"Apples",aggregation:"sum"},
            {field:"Apricots2",aggregation:"mean",format:"{0:n6}"},
            {field:"Mashed Apples",aggregation:"calculate",
                calculator:function (d) {
                    var retvalue = 0;
                    for (var i = 0; i < d.length; i++) {
                        retvalue += +d[i]/2;
                    }
                    return retvalue;
                }
            }
        ]
    };
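
The aggregation setting can be read as: collect one field's values across a group of rows, then reduce them. The sketch below hand-rolls sum and mean so that it runs without D3 (dz() itself is described as using d3.sum() and d3.mean()), and passes the value array to the calculator for "calculate". `aggregateField` is a hypothetical name, and the assumption that the calculator receives an array of values is inferred from the d[i] loop in the example above.

```javascript
// Hypothetical sketch of per-group aggregation; stands in for d3.sum()/d3.mean().
function aggregateField(groupRows, column) {
    // Pull out the one field being aggregated, coerced to numbers.
    var values = groupRows.map(function (d) { return +d[column.field]; });
    var total = values.reduce(function (a, b) { return a + b; }, 0);

    switch (column.aggregation) {
    case "sum":
        return total;
    case "mean":
        return values.length ? total / values.length : undefined;
    case "calculate":
        // Assumed contract: the calculator receives the array of values.
        return column.calculator(values);
    default:
        throw new Error("Unknown aggregation: " + column.aggregation);
    }
}

// Made-up group of rows, for illustration only.
var group = [
    { Apples: 10, "Mashed Apples": 4 },
    { Apples: 20, "Mashed Apples": 6 }
];

var appleSum = aggregateField(group, { field: "Apples", aggregation: "sum" });
var appleMean = aggregateField(group, { field: "Apples", aggregation: "mean" });
var mashedHalf = aggregateField(group, {
    field: "Mashed Apples",
    aggregation: "calculate",
    calculator: function (d) {
        var retvalue = 0;
        for (var i = 0; i < d.length; i++) { retvalue += +d[i] / 2; }
        return retvalue;
    }
});

console.log(appleSum, appleMean, mashedHalf); // 30 15 5
```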

Filter out rows where the location is Singapore plus add a complex numeric filter:

        rows = {
            filters: [
                // Exclude every row whose location is Singapore.
                {filter: function (d) {
                        return d["Location"] !== "Singapore";
                    }
                },
                // Keep rows with more than 22 apples and equal apple and apricot counts.
                {filter: function (d) {
                        return +d["Apples"] > 22 && +d["Apples"] === +d["Apricots"];
                    }
                }
            ]
        };
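
A filter list like this can be applied with a single pass over the rows. The sketch assumes a row is kept only when every filter returns a truthy value (the text does not say whether filters are combined with AND or OR), and `applyFilters` is a hypothetical helper name rather than part of dz().

```javascript
// Hypothetical sketch: keep a row only if every filter accepts it
// (AND semantics are an assumption).
function applyFilters(data, filters) {
    return data.filter(function (d, i) {
        return filters.every(function (f) {
            return Boolean(f.filter(d, i));
        });
    });
}

var filters = [
    { filter: function (d) { return d["Location"] !== "Singapore"; } },
    { filter: function (d) {
        return +d["Apples"] > 22 && +d["Apples"] === +d["Apricots"];
    } }
];

// Made-up sample rows, for illustration only.
var sample = [
    { Location: "London", Apples: "30", Apricots: "30" },    // kept
    { Location: "Singapore", Apples: "30", Apricots: "30" }, // dropped by filter 1
    { Location: "Paris", Apples: "10", Apricots: "10" }      // dropped by filter 2
];

var kept = applyFilters(sample, filters);
console.log(kept.length); // 1
```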

Group rows into an arbitrary number of levels:

    rows.groups = [
        {field:"Year",sort:false},
        {field:"Location",sort:true}
    ];
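
Multi-level grouping of this kind produces nested key/values[] entries, the same shape that d3.nest().entries() returns. The sketch below builds that shape by hand so that it runs without D3. `groupRows` is a hypothetical helper, and treating sort:true as an ascending sort of the keys at that level (with first-seen order otherwise) is an assumption.

```javascript
// Hypothetical sketch: nest rows by each group field in turn, yielding
// [{key, values}] where values holds either nested groups or leaf rows.
function groupRows(data, groups) {
    if (!groups.length) {
        return data;
    }
    var level = groups[0];
    var buckets = Object.create(null);
    var order = [];   // remembers first-seen key order

    data.forEach(function (d) {
        var key = String(d[level.field]);
        if (!(key in buckets)) {
            buckets[key] = [];
            order.push(key);
        }
        buckets[key].push(d);
    });

    if (level.sort) {
        order.sort();   // assumption: sort:true means ascending key order
    }

    return order.map(function (key) {
        return { key: key, values: groupRows(buckets[key], groups.slice(1)) };
    });
}

// Made-up sample rows, for illustration only.
var sample = [
    { Year: "2013", Location: "Paris", Apples: 5 },
    { Year: "2012", Location: "Paris", Apples: 7 },
    { Year: "2013", Location: "London", Apples: 9 }
];

var nested = groupRows(sample, [
    { field: "Year", sort: false },
    { field: "Location", sort: true }
]);

console.log(JSON.stringify(nested[0].values.map(function (g) { return g.key; })));
// ["London","Paris"]
```

Dropping the second entry from the groups array gives a single level keyed by year, with the original rows as leaves.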

###Calling dz()

    var presentationdata = dz().basedata(csvdata).datacolumns(datacolumns).rows(rows).columns(columns);

To see the data by year only, change rows.groups to:

    rows.groups = [ {field:"Year",sort:false} ];

and then pass the rows in again:

    presentationdata = presentationdata.rows(rows);

##Under the covers

- Row grouping uses d3.nest()
- Aggregation calculations use d3.mean(), d3.sum(), etc.
- D3 is required; there are no other dependencies
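
Since d3.nest() yields key/values[] entries, the name/children[] shape mentioned earlier can be derived from them with a small recursive rename. This is a sketch under assumptions: `toNameChildren` is a hypothetical helper, and leaf rows are labelled from a caller-supplied field because the text does not say how dz() names leaves.

```javascript
// Hypothetical sketch: convert nested {key, values} entries into {name, children}.
function toNameChildren(entries, leafField) {
    return entries.map(function (entry) {
        var isGroup = Object.prototype.hasOwnProperty.call(entry, "key") &&
            Array.isArray(entry.values);
        if (isGroup) {
            return {
                name: entry.key,
                children: toNameChildren(entry.values, leafField)
            };
        }
        // Leaf row: assumption is that one field supplies the display name.
        return { name: String(entry[leafField]), data: entry };
    });
}

// Made-up nested entries in the key/values[] shape, for illustration only.
var nested = [
    { key: "2012", values: [
        { key: "London", values: [{ Location: "London", Year: "2012", Apples: 30 }] }
    ] }
];

var tree = { name: "root", children: toNameChildren(nested, "Location") };
console.log(tree.children[0].children[0].name); // London
```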

##Where is it?

Not yet available because it's undergoing testing. But so far, so good.