Documentation - leondutoit/data-centric-programming GitHub Wiki
Documenting your analysis/project is essential for many reasons. Even at its best, code is harder to read than plain language. The more help you provide for others (and yourself in the future) in understanding your code, the better. Code is documented at many levels: project, module, class, script, and function, for example.
In general
At the project level, if the project consists of many classes and modules working together, one typically describes what the application does and how it is structured. The Hadoop project, for example, is described at a high level on its web page.
A module (aka package, library), like Python's Flask micro web framework or ggplot2, R's open source implementation of the Grammar of Graphics, typically has both a high-level overview and more detailed API documentation.
In object oriented programming, classes are collections of methods that work together in a common namespace. It is standard practice to provide documentation of the class and the methods within the class.
When the program you have written is composed of different functions contained in one file or script, one documents the script itself and also the individual functions. See, for example, the documentation of the plot function in ggplot2.
Python
A 'docstring' is defined as "[...] a string literal that occurs as the first statement in a module, function, class, or method definition" by the official Python docstring guideline (PEP 257). A docstring, then, is the basic unit that describes code at different levels.
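As a minimal sketch of that definition, the snippet below attaches a docstring to a function and reads it back at runtime. The function `mean` is hypothetical and chosen only for illustration; the `__doc__` attribute and `inspect.getdoc` are part of standard Python.

```python
import inspect


def mean(values):
    """Return the arithmetic mean of a sequence of numbers.

    The sequence must not be empty.
    """
    return sum(values) / len(values)


# The docstring is stored on the function object itself.
raw_doc = mean.__doc__

# inspect.getdoc returns the same text with indentation normalised
# and surrounding blank lines removed.
clean_doc = inspect.getdoc(mean)

print(clean_doc)
```

This is also the text that `help(mean)` displays, which is why the first line of a docstring is usually a short, self-contained summary.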
As an example, consider a class with a method that transforms a SQL result set into serialised JSON data, a common task when transporting data from a relational database to a web browser for visualisation:

```python
"""Project - data transformers for common analytics tasks.
"""

import numpy as np
import pandas as pd


class DataTransformer(object):
    """A DataTransformer object has a method to transform a SQL
    result set to valid JSON. For convenience the most recent
    result set and JSON are stored on the object.
    """

    def __init__(self):
        self.last_resultset = None
        self.last_json = None

    def __repr__(self):
        return "DataTransformer()"

    def sql_to_json(self, result_set):
        """Transforms a SQL result set to valid JSON.

        Both the input and output of the most recent call are stored
        on the instance attributes for convenience.

        :param result_set: a SQL result set from a query
        """
        self.last_resultset = result_set
        data = pd.DataFrame(np.array(result_set, dtype=object))
        valid_json = data.to_json(orient="records")
        self.last_json = valid_json
        return valid_json
```
Notice how the code is documented at several levels: module, class, and method. In the sql_to_json method there is a reference to the parameter, :param result_set:. This syntax can be used by a documentation generation system like Sphinx. To see how this type of docstring, written in the reStructuredText docstring format, would be rendered in HTML, take a look at this example. The general point is that, in addition to writing code that completes the task at hand, it should be common practice to document the code itself. It is best to think of documentation as part of writing code, and to make it a habit.
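For reference, a fuller docstring in the same reStructuredText field-list style might look like the sketch below. The function `rows_to_json` and its `columns` parameter are hypothetical and are not part of the class above; the `:param:`, `:type:`, `:returns:`, `:rtype:` and `:raises:` fields are the ones Sphinx recognises. The sketch uses only the standard `json` module rather than pandas.

```python
import json


def rows_to_json(rows, columns):
    """Serialise rows from a SQL result set to a JSON string.

    :param rows: the rows returned by a query, each a sequence of values
    :type rows: list[tuple]
    :param columns: the column names, in the same order as the values
    :type columns: list[str]
    :returns: a JSON array with one object per row
    :rtype: str
    :raises ValueError: if a row's length differs from the number of columns
    """
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match number of columns")
        # Pair each value with its column name to form one JSON object.
        records.append(dict(zip(columns, row)))
    return json.dumps(records)


print(rows_to_json([(1, "a"), (2, "b")], ["id", "label"]))
```

Documenting the return type and the exceptions alongside the parameters lets a reader use the function without opening its body.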
R
Hadley Wickham's Advanced R Programming wiki has a great section describing documentation of R code at all levels of modules and functions.
It is quite common to produce the same plot for different combinations of data. In such cases it is helpful to have a high-level function, or functions, that wrap the plotting function you are using. While it may not be absolutely necessary in every case, it can be a good way to reduce code duplication. Let's look at how to document such a case.
```r
library(ggplot2)

#' Functions that wrap a density plot with ggplot2
#'
#' \code{plot_data} returns a ggplot object with one variable mapped to the
#' x aesthetic.
#'
#' \code{make_density} returns a layered ggplot object with a density layer.
#'
#' @param data a data frame
#' @param x name of the column to plot, given as a string
#' @param plot_expression a ggplot2 object with data bound to it
#'
#' @examples
#' # mydataframe and "myvariable" are placeholders for your own data
#' density_plot <- make_density(plot_data(mydataframe, "myvariable"))
#' print(density_plot)
plot_data <- function(data, x) {
  ggplot(data, aes_string(x = x))
}

#' @rdname plot_data
make_density <- function(plot_expression) {
  plot_expression + geom_density(alpha = 0.8) + scale_x_log10()
}
```
In this case the documentation conventions used are dictated by what is necessary to produce .Rd files with the R package roxygen2. Rd files are the standard format for help pages in R.