
autoplot

NOTE: This is still a discussion and nothing is final.

Summary

autoplot, first introduced in ggplot2 version 0.9.0, is an idiom for creating complete ggplot graphs that are appropriate for specific types of data. ggplot2 itself does not provide any useful methods; it only declares the S3 generic for other packages to extend.

Rationale

ggplot provides a framework for creating plots by starting with data, mapping the data to aesthetics, scaling the data and aesthetics, and controlling plot appearance through a theming system. At the same time, many packages that create specific data structures, typically S3 or S4 objects, provide plot methods to display them, usually built on base graphics. These default plots implement conventions appropriate for the type of data plotted. What is missing is a way to create such a standard plot with ggplot, so that it can be further adjusted by setting scales, themes, etc. autoplot aims to fill this niche. However, the design should not be so rigid as to make adaptation impossible; a strength of ggplot is the ability to rearrange the presentation of data in different ways for different needs.

Best practices

The plotting of specialized types of data structures can be split into two steps:

Convert the specialized data structure into a data.frame which exposes variables in a structure appropriate for ggplot. A mechanism/idiom for this already exists in fortify.
Define the plot by mapping these variables to the appropriate aesthetics using existing geoms/stats/scales. This is the step autoplot should do.

fortify

Any object which has an autoplot method should also have a fortify method, which the autoplot method uses to convert the specialized data structure into a data.frame. The purpose of this separation is to make the data restructuring reusable even when the specific autoplot method is not used. The documentation for the fortify method should enumerate the variables in the returned data.frame, so that they can be used elsewhere and their relation to the original data is clear.

autoplot

The autoplot method should use the appropriate fortify method to convert the data structure to a data.frame. It can then construct a ggplot object from this data.frame, creating layers (geoms or stats) with the appropriate aesthetic mappings. Documentation should state which layers (geoms or stats) are created and which aesthetic mappings are made. Open question: should the documentation simply include the actual code?
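The fortify/autoplot split described above can be sketched as follows. This is a minimal illustration, not an agreed convention: the class `stepcurve`, its constructor `new_stepcurve`, and the choice of `geom_step` are hypothetical and exist only for this example. Only the `fortify` and `autoplot` generics come from ggplot2.

```r
library(ggplot2)

# Hypothetical S3 class "stepcurve": a list holding two numeric vectors.
new_stepcurve <- function(time, value) {
  structure(list(time = time, value = value), class = "stepcurve")
}

# Step 1: fortify exposes the object's contents as a data.frame
# with the columns `time` and `value`.
fortify.stepcurve <- function(model, data, ...) {
  data.frame(time = model$time, value = model$value)
}

# Step 2: autoplot maps those columns to aesthetics and adds one layer.
autoplot.stepcurve <- function(object, ...) {
  df <- fortify(object)
  ggplot(df, aes(x = time, y = value)) + geom_step()
}

curve <- new_stepcurve(time = c(0, 1, 2, 3), value = c(1, 0.8, 0.5, 0.5))
p <- autoplot(curve)

# The result is an ordinary ggplot object, so it can be adjusted further.
p2 <- p + theme_bw() + scale_y_continuous(limits = c(0, 1))
```

Because the fortify method is separate, a user who wants a different presentation can call `fortify(curve)` and build a plot from the resulting data.frame without touching the autoplot method.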

autoplot should define default aesthetic mappings but allow the user to override them. One way to do it is to have a mapping argument to autoplot and then define the actual mapping as:

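defaults <- aes_string(x="nameOfX", y="nameOfY", colour="nameOfColour")
# User-supplied entries win; c() drops the aes class, so restore it.
map <- structure(c(mapping, defaults[setdiff(names(defaults), names(mapping))]), class=class(defaults))
ggplot() + geom_foo(mapping=map)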

for example.
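The override idea can be tried out on a plain data.frame. The sketch below is one possible way to do it, under stated assumptions: `merge_mapping` is a hypothetical helper, not part of ggplot2, and `aes()` is used instead of `aes_string()` because the latter is deprecated in current ggplot2 releases. The built-in `mtcars` data set stands in for fortified data.

```r
library(ggplot2)

# Hypothetical helper: combine a user mapping with defaults,
# letting the user's entries take precedence.
merge_mapping <- function(mapping, defaults) {
  keep <- setdiff(names(defaults), names(mapping))
  merged <- c(mapping, defaults[keep])
  class(merged) <- class(defaults)  # c() returns a plain list
  merged
}

defaults <- aes(x = wt, y = mpg, colour = factor(am))
user     <- aes(colour = factor(gear))
map      <- merge_mapping(user, defaults)

# x and y come from the defaults, colour from the user's mapping.
p <- ggplot(mtcars) + geom_point(mapping = map)
```

Inside an autoplot method, `user` would be the method's `mapping` argument, with a default of `aes()` so that calling the method without a mapping yields the standard plot.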

Naming conventions

If the package which defines these specific data structures also defines fortify and autoplot, then they are just two additional methods. Should that package list ggplot under Enhances or Depends?

If a separate package implements them, what should the package naming convention be? GGobject? ggobject? autoplotObject? originalpackageGG?

Future work / open questions

S4 classes?
What should the convention be if the data structure cannot be well represented by a single data.frame, but rather by a set of data.frames? Options:
  A list of data.frames. Pro: natural R idiom for collecting two or more things. Con: breaks the return value convention for fortify.
  A "block diagonal" data.frame, that is, a single data.frame whose columns are the union of the columns of all the individual data.frames, with only one set of columns filled in for any given row. Pros: is a data.frame, which fortify is supposed to return; easy to create given the separate data.frames (just plyr::rbind.fill them). Cons: inelegant as a data structure; wasteful of space.
  A data.frame with additional data.frames as attributes. Pro: is a data.frame, which fortify is supposed to return (and which won't choke ggplot/qplot). Con: sets one of the data.frames as dominant over the others; seems not all that natural.
What should package naming conventions be?
Enhances vs. Depends?
How many extra parameters should autoplot take to define variations on the standard plot? Example: triangle versus square lines in ggdendro. Should the fortify function pull out all possible data so that any version can be plotted?
If calling fortify is expensive, should each autoplot function check whether its input is either an object of that type (dispatched via S3 methods) or already-fortified data passed in directly (and thus a data.frame)? If so, then the autoplot function should be exported so it can be called directly.
Should the fortify and autoplot methods for specific data types be exported/public?
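To make the "block diagonal" option from the list above concrete, here is a small sketch. It is an illustration only: the two component data.frames and their column names are invented, and `rbind_fill` is a hypothetical base-R stand-in written for this example so that it runs without plyr; plyr::rbind.fill would be the ready-made equivalent.

```r
# Two hypothetical component data.frames, e.g. line segments and text labels.
segments <- data.frame(x = c(1, 2), y = c(0, 1), xend = c(2, 3), yend = c(1, 1))
labels   <- data.frame(lx = c(1, 3), label = c("a", "b"), stringsAsFactors = FALSE)

# Base-R stand-in for plyr::rbind.fill: pad each piece with NA columns
# so that all pieces share the union of column names, then rbind.
rbind_fill <- function(...) {
  pieces <- list(...)
  all_cols <- unique(unlist(lapply(pieces, names)))
  padded <- lapply(pieces, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]
  })
  do.call(rbind, padded)
}

combined <- rbind_fill(segments, labels)

# Each layer would then use only the rows where its own columns are filled in.
segment_rows <- combined[!is.na(combined$xend), ]
label_rows   <- combined[!is.na(combined$label), ]
```

The result is a single data.frame, as fortify is expected to return, at the cost of the NA padding that the list above calls wasteful.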

Case studies / example implementations

These are examples of packages or functions which create complete graphics of specific data types using ggplot, whether or not they use the autoplot mechanism.

Original discussion was based on http://stackoverflow.com/questions/7098830/bad-idea-ggplotting-an-s3-class-object which had discussion of linear regression model diagnostics and an example of trees.
ggdendro (CRAN page) (GitHub repo): does not implement plotting in this way (as of 0.0-7), but has many of the pieces and some of the separation. Could be expanded/adapted if conventions are settled on.
granovaGG (CRAN page): first release September 4, 2011.
Survival curves: I (BrianDiggs) have some code that creates Kaplan-Meier curves from survfit objects, but it needs work; partially, I was wondering about a framework such as this.