Replace labels in data based on 2D density

stat_dens2d_labels() sets values mapped to the label aesthetic to "" or to a user-provided character string based on the local density in regions of a plot panel. Its main use is together with repulsive geoms from package ggrepel. If there is no mapping to label in data, the mapping is set to rownames(data), with a message.

stat_dens2d_labels(
  mapping = NULL,
  data = NULL,
  geom = "text",
  position = "identity",
  ...,
  keep.fraction = 0.1,
  keep.number = Inf,
  keep.sparse = TRUE,
  keep.these = FALSE,
  exclude.these = FALSE,
  these.target = "label",
  pool.along = c("xy", "x", "y", "none"),
  xintercept = 0,
  yintercept = 0,
  invert.selection = FALSE,
  h = NULL,
  n = NULL,
  label.fill = "",
  return.density = FALSE,
  na.rm = TRUE,
  show.legend = FALSE,
  inherit.aes = TRUE
)

Arguments

mapping: The aesthetic mapping, usually constructed with aes or aes_. Only needs to be set at the layer level if you are overriding the plot defaults.
data: A layer specific dataset - only needed if you want to override the plot defaults.
geom: The geometric object to use to display the data.
position: The position adjustment to use for overlapping points on this layer.
...: other arguments passed on to layer. This can include aesthetics whose values you want to set, not map. See layer for more details.
keep.fraction: numeric [0..1]. The fraction of the observations (or rows) in data to be retained.
keep.number: integer Set the maximum number of observations to retain, effective only if obeying keep.fraction would result in a larger number.
keep.sparse: logical If TRUE, the default, observations from the more sparse regions are retained, if FALSE those from the densest regions.
keep.these, exclude.these: character vector, integer vector, logical vector or function that takes one or more variables in data selected by these.target. Negative integers behave as in R's extraction methods. The rows from data indicated by keep.these and exclude.these are kept or excluded irrespective of the local density.
these.target: character, numeric or logical selecting one or more column(s) of data. If TRUE the whole data object is passed.
pool.along: character, one of "xy", "x", "y" or "none", indicating if selection should be done pooling the observations along both the x and y aesthetics, along only one of them, or separately within the regions delimited by xintercept and yintercept.
xintercept, yintercept: numeric The split points for the data filtering.
invert.selection: logical If TRUE, the complement of the selected rows is returned.
h: vector of bandwidths for x and y directions. Defaults to normal reference bandwidth (see bandwidth.nrd). A scalar value will be taken to apply to both directions.
n: Number of grid points in each direction. Can be a scalar or a length-2 integer vector.
label.fill: character vector of length 1, a function or NULL.
return.density: logical vector of length 1. If TRUE, add columns "density" and "keep.obs" to the returned data frame.
na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds.
show.legend: logical. Should this layer be included in the legends? FALSE, the default for this statistic, never includes; NA includes if any aesthetics are mapped; TRUE always includes.
inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders.
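
The roles of pool.along, xintercept and yintercept can be pictured as assigning each observation to a pool within which selection is done. The sketch below is only an interpretation of the argument descriptions above, not the package's code: the helper name pool_groups() is hypothetical, and it assumes that pooling "along x" means ignoring xintercept and splitting only at yintercept (and vice versa for "y").

```r
# Hypothetical helper, for illustration only: return an integer pool id
# for each observation, given the pooling mode and the split points.
pool_groups <- function(x, y, pool.along = c("xy", "x", "y", "none"),
                        xintercept = 0, yintercept = 0) {
  pool.along <- match.arg(pool.along)
  right <- x >= xintercept   # TRUE for points at or right of xintercept
  upper <- y >= yintercept   # TRUE for points at or above yintercept
  switch(pool.along,
         xy   = rep(1L, length(x)),                 # a single pool
         x    = as.integer(upper) + 1L,             # split at yintercept only
         y    = as.integer(right) + 1L,             # split at xintercept only
         none = as.integer(right) + 2L * as.integer(upper) + 1L)  # quadrants
}

px <- c(-1, 1, -1, 1)
py <- c(-1, -1, 1, 1)
pool_groups(px, py, "xy")    # 1 1 1 1
pool_groups(px, py, "x")     # 1 1 2 2
pool_groups(px, py, "none")  # 1 2 3 4
```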

Value

A plot layer instance. Its output data are the input data after label substitution based on a 2D density filtering criterion.

Details

stat_dens2d_labels() is designed to work together with geometries from package 'ggrepel'. To avoid text labels being plotted over unlabelled points all the rows in data need to be retained but labels replaced with the empty character string, "". Function stat_dens2d_filter cannot be used with the repulsive geoms from 'ggrepel' because it drops observations.

stat_dens2d_labels() can also be useful in other situations, as the substitution character string can be set by the user by passing an argument to label.fill. If this argument is NULL, the unselected rows are dropped, exactly as by stat_dens2d_filter.
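
The different kinds of argument accepted by label.fill can be illustrated with a small base R sketch. The helper fill_labels() is hypothetical and is not the package's implementation; it assumes that a function passed to label.fill is applied to the labels of the unselected rows only.

```r
# Hypothetical helper, for illustration only.
fill_labels <- function(data, selected, label.fill = "") {
  if (is.null(label.fill)) {
    return(data[selected, , drop = FALSE])   # NULL: drop unselected rows
  }
  if (is.function(label.fill)) {
    # function: transform the labels of unselected rows
    data$label[!selected] <- label.fill(data$label[!selected])
  } else {
    # character string (or NA): overwrite the labels of unselected rows
    data$label[!selected] <- label.fill
  }
  data
}

toy <- data.frame(x = 1:4, label = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
sel <- c(TRUE, FALSE, FALSE, TRUE)

fill_labels(toy, sel)$label                        # "a" ""  ""  "d"
nrow(fill_labels(toy, sel, label.fill = NULL))     # 2
fill_labels(toy, sel, label.fill = toupper)$label  # "a" "B" "C" "d"
```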

The local density of observations in 2D (x and y) is computed with function kde2d and used to select observations. All rows in the data input are passed to the geom, but with the text of labels replaced in those "not kept". The default is to select observations in sparse regions of the plot, but the selection can be inverted so that only observations in the densest regions retain their labels. Specific observations can be protected from having their label replaced by passing a suitable argument to keep.these. Logical and integer vectors function as indexes to rows in data, while a character vector is compared to values in the variable mapped to the label aesthetic. A function passed as argument to keep.these receives as its first argument the values in the variable mapped to label and should return a character, logical or numeric vector as described above.
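
The density-based selection described above can be approximated outside a plot. This is a minimal sketch and not the package's implementation: it assumes package 'MASS' is installed, it approximates the density at each observation by the value at a neighbouring grid node of kde2d(), and the helper name replace_dense_labels() is hypothetical.

```r
library(MASS)  # provides kde2d()

# Hypothetical helper, for illustration only: keep the labels of the
# observations in the sparsest regions and replace all other labels.
replace_dense_labels <- function(x, y, label, keep.fraction = 0.1,
                                 fill = "", n = 50) {
  dens <- kde2d(x, y, n = n)             # density estimate on an n x n grid
  ix <- findInterval(x, dens$x, all.inside = TRUE)
  iy <- findInterval(y, dens$y, all.inside = TRUE)
  local.dens <- dens$z[cbind(ix, iy)]    # approximate density at each point
  num.keep <- floor(length(x) * keep.fraction)
  # rank 1 is the lowest density, i.e., the sparsest neighbourhood
  keep <- rank(local.dens, ties.method = "first") <= num.keep
  ifelse(keep, label, fill)
}

set.seed(1001)
x <- rnorm(100)
y <- rnorm(100)
labs <- replace_dense_labels(x, y, label = as.character(1:100))
length(labs)     # 100: every row is still present
sum(labs != "")  # 10: only these retain their label
```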

How many labels are retained intact in addition to those in keep.these is controlled with arguments passed to keep.number and keep.fraction. keep.number sets the maximum number of observations selected; whenever keep.fraction results in fewer observations being selected, keep.fraction is obeyed.
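
The interplay between the two arguments amounts to taking the smaller of two limits. A minimal sketch, in which the helper name num_retained() is hypothetical and rounding the fraction down to a whole number of rows is an assumption:

```r
# Hypothetical helper, for illustration only: number of observations
# selected by density, before keep.these and exclude.these are applied.
num_retained <- function(n.obs, keep.fraction, keep.number = Inf) {
  min(floor(n.obs * keep.fraction), keep.number)
}

num_retained(100, keep.fraction = 0.25)                  # 25
num_retained(100, keep.fraction = 0.25, keep.number = 5) # 5
num_retained(8, keep.fraction = 0.25, keep.number = 5)   # 2
```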

Computation of the density and of the default bandwidth requires at least two observations with different values. If data do not fulfill this condition, they are kept only if keep.fraction = 1. This is correct behavior for a single observation, but can be surprising in the case of multiple observations with identical values.
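
One way to see why distinct values are needed is to look at the normal reference bandwidth, which h defaults to. A small sketch, assuming package 'MASS' is installed:

```r
library(MASS)  # provides bandwidth.nrd()

bw.spread    <- bandwidth.nrd(c(1, 2, 4, 8))  # values differ: positive bandwidth
bw.identical <- bandwidth.nrd(c(3, 3, 3))     # identical values: bandwidth of 0

bw.spread > 0      # TRUE
bw.identical == 0  # TRUE
```

With a bandwidth of zero no meaningful density surface can be estimated, so there is no basis for ranking the observations.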

Parameters keep.these and exclude.these make it possible to force inclusion or exclusion of observations after the density is computed. In case of conflict, exclude.these overrides keep.these.
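
The precedence rule can be written as a single logical expression. The vectors below are made up for illustration, and the expression is a sketch of the documented rule rather than the package's code:

```r
# One element per observation; values invented for this illustration.
by.density    <- c(TRUE,  FALSE, FALSE, TRUE,  FALSE)  # selected by density
keep.these    <- c(FALSE, TRUE,  TRUE,  FALSE, FALSE)  # forced inclusion
exclude.these <- c(FALSE, FALSE, TRUE,  TRUE,  FALSE)  # forced exclusion

# Exclusion is applied last, so it wins over both other criteria.
selected <- (by.density | keep.these) & !exclude.these
selected
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

The third observation is listed in both keep.these and exclude.these and ends up excluded.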

Note

Which points are kept and which are not depends on how dense a grid is used and how flexible the density surface estimate is. This depends on the values passed as arguments to parameters n and h. It is also important to be aware that both geom_text() and geom_text_repel() can avoid overplotting by discarding labels at the plot rendering stage, i.e., what is plotted may differ from what is returned by this statistic.
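
The effect of the grid size and bandwidth can be explored directly with kde2d(). The values of n and h below are arbitrary choices for illustration, and package 'MASS' is assumed to be installed:

```r
library(MASS)  # provides kde2d()

set.seed(1001)
u <- rnorm(100)
v <- rnorm(100)

# Same data, two bandwidths: a small h follows individual points closely,
# a large h gives a smoother, flatter surface.
narrow <- kde2d(u, v, h = 0.3, n = 25)
wide   <- kde2d(u, v, h = 3, n = 25)

dim(narrow$z)    # 25 25: one density value per grid node
range(narrow$z)  # compare with the range below
range(wide$z)
```

Because the ranking of observations is based on such a surface, changing n or h can change which labels are retained.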

Examples


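# Assumption: stat_dens2d_labels() is provided by package 'ggpp'.
library(ggplot2)
library(ggpp)

# Generate a random lower-case string of length len.
random_string <-
  function(len = 6) {
    paste(sample(letters, len, replace = TRUE), collapse = "")
  }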

# Make random data.
set.seed(1001)
d <- tibble::tibble(
  x = rnorm(100),
  y = rnorm(100),
  group = rep(c("A", "B"), c(50, 50)),
  lab = replicate(100, { random_string() })
)

# using defaults
ggplot(data = d, aes(x, y, label = lab)) +
  geom_point() +
  stat_dens2d_labels()


ggplot(data = d, aes(x, y, label = lab)) +
  geom_point() +
  stat_dens2d_labels(keep.these = "zoujdg")


ggplot(data = d, aes(x, y, label = lab)) +
  geom_point() +
  stat_dens2d_labels(keep.these = function(x) {grepl("^z", x)})


ggplot(data = d, aes(x, y, label = lab)) +
  geom_point() +
  stat_dens2d_labels(geom = "text_s",
                     position = position_nudge_center(x = 0.1, y = 0.1,
                                                      center_x = mean,
                                                      center_y = mean),
                     vjust = "outward_mean", hjust = "outward_mean") +
  expand_limits(x = c(-4, 4.5))


ggrepel.installed <- requireNamespace("ggrepel", quietly = TRUE)
if (ggrepel.installed) {
  library(ggrepel)

  ggplot(data = d, aes(x, y, label = lab, colour = group)) +
    geom_point() +
    stat_dens2d_labels(geom = "text_repel")

  ggplot(data = d, aes(x, y, label = lab, colour = group)) +
    geom_point() +
    stat_dens2d_labels(geom = "text_repel", label.fill = NA)

  # We keep labels starting with "a" across the whole plot, and all labels in
  # sparse regions. To achieve this we pass as argument to label.fill a
  # function instead of a character string.
  label.fun <- function(x) {ifelse(grepl("^a", x), x, "")}
  ggplot(data = d, aes(x, y, label = lab, colour = group)) +
    geom_point() +
    stat_dens2d_labels(geom = "text_repel", label.fill = label.fun)
}

# Using geom_debug() we can see that all 100 rows in d are
# returned. But only those labelled in the previous example still contain
# the original labels.

gginnards.installed <- requireNamespace("gginnards", quietly = TRUE)
if (gginnards.installed) {
  library(gginnards)

  ggplot(data = d, aes(x, y, label = lab)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug")

  ggplot(data = d, aes(x, y, label = lab)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug", return.density = TRUE)

  ggplot(data = d, aes(x, y, label = lab)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug", label.fill = NULL)

  ggplot(data = d, aes(x, y, label = lab)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug", label.fill = FALSE, return.density = TRUE)

  ggplot(data = d, aes(x, y, label = lab)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug", label.fill = NULL, return.density = TRUE)

  ggplot(data = d, aes(x, y)) +
    geom_point() +
    stat_dens2d_labels(geom = "debug")
}

#> [1] "PANEL 1; group(s) -1; 'draw_function()' input 'data' (head):"
#>            x           y PANEL group label xintercept yintercept
#> 1  2.1886481  0.07862339     1    -1                0          0
#> 2 -0.1775473 -0.98708727     1    -1                0          0
#> 3 -0.1852753 -1.17523226     1    -1                0          0
#> 4 -2.5065362  1.68140888     1    -1     4          0          0
#> 5 -0.5573113  0.75623228     1    -1                0          0
#> 6 -0.1435595  0.30309733     1    -1                0          0
