In ggplot2, a plot is constructed by adding layers to it. A layer consists of two important parts: the geometry (geoms), and statistical transformations (stats). The 'stat' part of a layer is important because it performs a computation on the data before it is displayed. Stats determine what is displayed, not how it is displayed.
For example, if you add stat_density() to a plot, a kernel density estimation is performed, which can be displayed with the 'geom' part of a layer. For many geom_*() functions, stat_identity() is used, which performs no extra computation on the data.
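To make the computation visible, the data a stat hands to its geom can be inspected with layer_data(). The following is a minimal sketch; the built-in faithful dataset and the variable names are assumptions chosen for illustration and are not part of the original text.

```r
library(ggplot2)

# Kernel density estimate of waiting times, drawn as a line
p <- ggplot(faithful, aes(waiting)) +
  stat_density(geom = "line")

# layer_data() returns the data frame after the stat has run
computed <- layer_data(p)

# The stat replaced the raw observations with an evaluation grid
# (x) and the estimated density at each grid point
head(computed[, c("x", "density")])
```

The raw data has one row per observation, while the computed data has one row per grid point, which is what the geom actually draws.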
Specifying stats
There are five ways in which the 'stat' part of a layer can be specified.
# 1. The stat can have a layer constructor
stat_density()
# 2. A geom can default to a particular stat
geom_density() # has `stat = "density"` as default
# 3. It can be given to a geom as a string
geom_line(stat = "density")
# 4. The ggproto object of a stat can be given
geom_area(stat = StatDensity)
# 5. It can be given to `layer()` directly:
layer(
  geom = "line",
  stat = "density",
  position = "identity"
)
Many of these ways are equivalent. Using stat_density(geom = "line") is identical to using geom_line(stat = "density"). Note that for layer(), you need to provide the "position" argument as well. To give stats as a string, take the function name and remove the stat_ prefix, such that stat_bin becomes "bin".
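One way to check this equivalence is to compare the values each specification computes. This is a sketch under assumptions: the faithful dataset and the variable names are illustrative, and only the computed density column is compared, because the default position adjustment can differ between constructors.

```r
library(ggplot2)

base <- ggplot(faithful, aes(waiting))

# The same stat, specified in four different ways
via_constructor <- layer_data(base + stat_density(geom = "line"))
via_string      <- layer_data(base + geom_line(stat = "density"))
via_ggproto     <- layer_data(base + geom_line(stat = StatDensity))
via_layer       <- layer_data(
  base + layer(geom = "line", stat = "density", position = "identity")
)

# Each specification yields the same density estimate
same_density <- c(
  isTRUE(all.equal(via_constructor$density, via_string$density)),
  isTRUE(all.equal(via_string$density, via_ggproto$density)),
  isTRUE(all.equal(via_ggproto$density, via_layer$density))
)
same_density
```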
Some of the more well known stats that can be used for the stat argument are: "density", "bin", "count", "function" and "smooth".
Paired geoms and stats
Some geoms have paired stats. In some cases, like geom_density(), the geom is just a variant of another geom, geom_area(), with slightly different defaults.
In other cases, the relationship is more complex. In the case of boxplots, for example, the stat and the geom have distinct roles. The role of the stat is to compute the five-number summary of the data. In addition to just displaying the box of the five-number summary, the geom also provides display options for the outliers and widths of boxplots. In such cases, you cannot freely exchange geoms and stats: using stat_boxplot(geom = "line") or geom_area(stat = "boxplot") gives an error.
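The failure can be observed without stopping a script by catching the error. This is a hedged sketch: the mpg dataset and the variable names are assumptions for illustration, and it assumes the error is raised when the plot is built, with a message that may vary between ggplot2 versions.

```r
library(ggplot2)

# A line geom paired with the boxplot stat, without telling the
# line which computed variable to use for y
p <- ggplot(mpg, aes(factor(cyl), displ)) +
  stat_boxplot(geom = "line")

# Constructing the plot object succeeds; the mismatch only shows
# up once the layer data is computed
failed <- tryCatch(
  {
    ggplot_build(p)
    FALSE
  },
  error = function(e) {
    message("Build failed: ", conditionMessage(e))
    TRUE
  }
)
failed
```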
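Some stats and geoms that are paired are: geom_boxplot() and stat_boxplot(), geom_density() and stat_density(), geom_histogram() and stat_bin(), geom_smooth() and stat_smooth().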
Using computed variables
As mentioned above, the role of stats is to perform computation on the data. As a result, stats have 'computed variables' that determine compatibility with geoms. These computed variables are documented in the Computed variables sections of the documentation, for example in ?stat_bin. While more thoroughly documented in after_stat(), it should briefly be mentioned that these computed variables can be accessed in aes().
For example, the ?stat_density documentation states that, in addition to a variable called density, the stat computes a variable named count. Instead of scaling such that the area integrates to 1, the count variable scales the computed density such that the values can be interpreted as counts. If stat_density(aes(y = after_stat(count))) is used, we can display these count-scaled densities instead of the regular densities.
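The relationship between the two computed variables can be checked directly. This is a minimal sketch; the faithful dataset and the variable names are assumptions for illustration, and it relies on the stat also reporting the number of observations in a computed variable named n.

```r
library(ggplot2)

# Map the count-scaled density, rather than the default density, to y
p <- ggplot(faithful, aes(waiting)) +
  stat_density(aes(y = after_stat(count)))

d <- layer_data(p)

# count is the density multiplied by the number of observations
scaled_by_n <- isTRUE(all.equal(d$count, d$density * d$n))
scaled_by_n
```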
The computed variables offer flexibility in that arbitrary geom-stat pairings can be made. While not necessarily recommended, geom_line() can be paired with stat = "boxplot" if the line is instructed on how to use the boxplot computed variables.
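One way to give the line such instructions is sketched here, using stage() to pass a variable to the stat and then pick a computed variable for display. The mpg dataset, the choice of the middle (median) computed variable, and the variable names are assumptions for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(factor(cyl))) +
  geom_line(
    aes(
      # Give 'displ' to the stat, then display the computed
      # 'middle' (the median) as the y-variable
      y = stage(displ, after_stat = middle),
      # Regroup after the stat so all medians form a single line
      group = after_stat(1)
    ),
    stat = "boxplot"
  )

# One row per box: the line connects the per-group medians
d <- layer_data(p)
d[, c("x", "y")]
```

The same pattern works for the other boxplot computed variables, such as lower or upper, by swapping them in for middle.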
Under the hood
Internally, stats are represented as ggproto classes that occupy a slot in a layer. All these classes inherit from the parent Stat ggproto object that orchestrates how stats work. Briefly, stats are given the opportunity to perform computation either on the layer as a whole, a facet panel, or on individual groups. For more information on extending stats, see the Creating a new stat section after running vignette("extending-ggplot2"). Additionally, see the New stats section of the online book.
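The inheritance and the layer slot described above can be inspected interactively. This is a small sketch that assumes the stat of a layer is reachable as the stat field of the object returned by a layer constructor; the variable names are illustrative.

```r
library(ggplot2)

# StatDensity is a ggproto object that inherits from Stat
class(StatDensity)
is_stat <- inherits(StatDensity, "Stat")

# A layer stores its stat; geom_density() defaults to StatDensity
density_layer <- geom_density()
layer_uses_density <- inherits(density_layer$stat, "StatDensity")

# Per-group computation is implemented as a method on the stat
has_group_method <- is.function(StatDensity$compute_group)

c(is_stat, layer_uses_density, has_group_method)
```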
See also
For an overview of all stat layers, see the online reference.
How computed aesthetics work.
Other layer documentation: layer(), layer_geoms, layer_positions