2 ggparty - martin-borkovec/ggparty GitHub Wiki
But first things first.
Let’s recreate a simple example already used in the partykit
vignette.
If you are not familiar with partykit, you should definitely check it
out before you work with this package.
library(ggparty)
data("WeatherPlay", package = "partykit")
sp_o <- partysplit(1L, index = 1:3)
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
pn <- partynode(1L, split = sp_o, kids = list(
  partynode(2L, split = sp_h, kids = list(
    partynode(3L, info = "yes"),
    partynode(4L, info = "no"))),
  partynode(5L, info = "yes"),
  partynode(6L, split = sp_w, kids = list(
    partynode(7L, info = "yes"),
    partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)
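Before plotting, it can help to confirm that the hand-built tree has the intended shape. The following is a minimal sketch, assuming partykit is installed; it rebuilds the same tree so that it runs on its own, and the variable names `all_ids` and `terminal_ids` are only illustrative.

```r
library(partykit)

data("WeatherPlay", package = "partykit")

# Same hand-built tree as above: outlook at the root,
# humidity and windy as the second-level splits.
sp_o <- partysplit(1L, index = 1:3)
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
pn <- partynode(1L, split = sp_o, kids = list(
  partynode(2L, split = sp_h, kids = list(
    partynode(3L, info = "yes"),
    partynode(4L, info = "no"))),
  partynode(5L, info = "yes"),
  partynode(6L, split = sp_w, kids = list(
    partynode(7L, info = "yes"),
    partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)

# Print the tree as text to check splits and node numbering
print(py)

# IDs of all nodes, and of the terminal nodes only
all_ids <- nodeids(py)
terminal_ids <- nodeids(py, terminal = TRUE)
```

The node IDs returned here are the same IDs that appear in the `id` column of the plot data.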
The ggparty() function takes a tree of class 'party' and allows us to
plot it with the help of the ggplot2 package. To make this possible,
the 'party' object first needs to be transformed into a 'data.frame'
and be passed to a ggplot() call. This is exactly what happens when we
run ggparty().
is.ggplot(ggparty(py))
[1] TRUE
pander::pandoc.table(ggparty(py)$data[,1:16])
id | x | y | parent | birth_order | breaks_label | info | info_list |
---|---|---|---|---|---|---|---|
1 | 0.5 | 1 | NA | 0 | NA | NA | NA |
2 | 0.2 | 0.75 | 1 | 1 | sunny | NA | NA |
3 | 0.1 | 0.5 | 2 | 1 | NA <= NA* 75 | yes | NA |
4 | 0.3 | 0.5 | 2 | 2 | NA > NA* 75 | no | NA |
5 | 0.5 | 0.5 | 1 | 2 | overcast | yes | NA |
6 | 0.8 | 0.75 | 1 | 3 | rainy | NA | NA |
7 | 0.7 | 0.5 | 6 | 1 | false | yes | NA |
8 | 0.9 | 0.5 | 6 | 2 | true | no | NA |
Table continues below
splitvar | level | kids | nodesize | p.value | horizontal | x_parent | y_parent |
---|---|---|---|---|---|---|---|
outlook | 0 | 3 | 14 | NA | FALSE | NA | NA |
humidity | 1 | 2 | 5 | NA | FALSE | 0.5 | 1 |
NA | 2 | 0 | 2 | NA | FALSE | 0.2 | 0.75 |
NA | 2 | 0 | 3 | NA | FALSE | 0.2 | 0.75 |
NA | 2 | 0 | 4 | NA | FALSE | 0.5 | 1 |
windy | 1 | 2 | 5 | NA | FALSE | 0.5 | 1 |
NA | 2 | 0 | 3 | NA | FALSE | 0.8 | 0.75 |
NA | 2 | 0 | 2 | NA | FALSE | 0.8 | 0.75 |
The first 16 columns of the 'data.frame' passed by ggparty() to
ggplot() contain these values:
- id… ID of the node
- x… X coordinate of the node
- y… Y coordinate of the node
- parent… ID of node’s parent
- birth_order… Position relative to parent. Goes from left to right.
- breaks_label… String containing the corresponding split break of the parent’s split variable.
- info… String containing the info of the node
- info_list… List containing the info of the node if it was a list
- splitvar… String containing the name of the variable to split on (inner nodes only)
- level… At which level to draw the node. (0 = root)
- kids… Number of node’s kids
- nodesize… Number of rows in node’s data.
- p.value… P value of model if present
- horizontal… Logical - specifies whether the tree is to be drawn horizontally or vertically. Identical for all nodes.
- x_parent… X coordinate of the node’s parent
- y_parent… Y coordinate of the node’s parent
The remaining columns contain lists of the node’s data and we will
need geom_node_plot() to work with them.
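Because the plot data is an ordinary 'data.frame', it can be inspected and subset with base R. The sketch below assumes ggparty is installed and rebuilds the example tree so that it runs on its own; `node_cols`, `terminal_nodes` and `inner_nodes` are illustrative names, not part of the package.

```r
library(ggparty)  # also attaches partykit and ggplot2

data("WeatherPlay", package = "partykit")

# Rebuild the example tree from the beginning of this page
sp_o <- partysplit(1L, index = 1:3)
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
pn <- partynode(1L, split = sp_o, kids = list(
  partynode(2L, split = sp_h, kids = list(
    partynode(3L, info = "yes"),
    partynode(4L, info = "no"))),
  partynode(5L, info = "yes"),
  partynode(6L, split = sp_w, kids = list(
    partynode(7L, info = "yes"),
    partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)

# Extract the data.frame that ggparty() hands to ggplot()
plot_data <- ggparty(py)$data

# A few of the structural columns described in the list above
node_cols <- c("id", "parent", "splitvar", "level", "kids", "nodesize")
plot_data[, node_cols]

# Terminal nodes have no kids; inner nodes have at least one
terminal_nodes <- plot_data[plot_data$kids == 0, node_cols]
inner_nodes <- plot_data[plot_data$kids > 0, node_cols]
```

Subsetting on `kids` like this mirrors the distinction the geoms make between inner and terminal nodes.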
Every ggparty plot starts with a call to the eponymous ggparty()
function, which requires an object of class 'party'. To draw a tree we
will need to add several of these components:
- geom_edge() draws the edges between the nodes
- geom_edge_label() labels the edges with the corresponding split breaks
- geom_node_label() labels the nodes with the split variable, node info or anything else. The shorthand versions of this geom, geom_node_splitvar() and geom_node_info(), have the correct defaults to write the split variables in the inner nodes and the info in the terminal nodes, respectively.
- geom_node_plot() creates a custom ggplot at the location of the node
In most cases we will probably want to draw at least edges, edge labels
and node labels, so we will have to call the respective functions. The
default mappings of geom_edge() and geom_edge_label() ensure that lines
between the related nodes are drawn and the corresponding split breaks
are plotted at their centers.
Since the text we want to print on the nodes differs depending on the kind of node, we will call geom_node_label() twice: once for the inner nodes, to plot the split variables, and once for the terminal nodes, to plot the info elements of the tree, which in this case contain the play decision.
ggparty(py) +
geom_edge() +
geom_edge_label() +
geom_node_label(aes(label = splitvar), ids = "inner") +
# identical to geom_node_splitvar() +
geom_node_label(aes(label = info), ids = "terminal")
# identical to geom_node_info()
Instead of adding geom_node_label() we can also add the convenience
versions geom_node_splitvar() and geom_node_info(), which contain the
correct defaults to plot the split variables in the inner nodes and the
info in the terminal nodes.
Thanks to the ggplot2 mechanics we can now map different aspects of our
plot to properties of the nodes. Whether that’s the best choice in this
case is a different question.
ggparty(py) +
geom_edge() +
geom_edge_label() +
# map color to level and size to nodesize for all nodes
geom_node_splitvar(aes(col = factor(level),
                       size = nodesize)) +
geom_node_info(aes(col = factor(level),
                   size = nodesize))
We can create a horizontal tree simply by setting horizontal in
ggparty() to TRUE.
ggparty(py, horizontal = TRUE) +
geom_edge() +
geom_edge_label() +
geom_node_splitvar() +
geom_node_info()