Correlation Data Frame — correlate • corrr

An implementation of stats::cor() that returns a correlation data frame rather than a matrix. See Details below. Additional adjustments include the use of pairwise deletion by default.

Usage

correlate(
  x,
  y = NULL,
  use = "pairwise.complete.obs",
  method = "pearson",
  diagonal = NA,
  quiet = FALSE
)

Arguments

x: a numeric vector, matrix or data frame.
y: NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
use: an optional character string giving a method for computing correlations in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default).
method: a character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman"; can be abbreviated.
diagonal: value (typically numeric or NA) to set the diagonal to.
quiet: set to TRUE to suppress the message about the method and use parameters.
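
As a rough illustration of the method argument, the sketch below computes Spearman correlations with correlate() and checks one entry against a Pearson correlation of the ranks computed by hand. The choice of mtcars columns and the object names (vars, rho, by_hand) are only for illustration, and the example assumes the corrr package is installed.

```r
library(corrr)

# A small numeric subset of the built-in mtcars data (illustrative choice).
vars <- mtcars[, c("mpg", "hp", "wt")]

# Spearman correlations; quiet = TRUE suppresses the method/use message.
rho <- correlate(vars, method = "spearman", quiet = TRUE)

# Spearman's coefficient is Pearson's coefficient applied to the ranks,
# so the mpg/hp entry should match this hand computation.
by_hand <- cor(rank(vars$mpg), rank(vars$hp))

# Row 1 is the "mpg" row; the hp column holds its correlation with hp.
all.equal(rho$hp[1], by_hand)
```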

Value

A correlation data frame (cor_df).

Details

This function returns a correlation matrix as a correlation data frame in the following format:

A tibble (see tibble)
An additional class, "cor_df"
A "term" column
Standardized variances (the matrix diagonal) set to missing values by default (NA) so they can be ignored in calculations.
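
The properties listed above can be checked directly on a returned object. The following is a minimal sketch, assuming corrr 0.4.3 or later (so the first column is named "term"); the column subset and the names x and x1 are illustrative only.

```r
library(corrr)

# Correlate three numeric columns of mtcars without printing the message.
x <- correlate(mtcars[, c("mpg", "cyl", "disp")], quiet = TRUE)

class(x)      # tibble classes plus "cor_df"
names(x)[1]   # name of the first column
x$mpg[1]      # diagonal entry (mpg with itself), NA by default

# With the diagonal set to NA, column summaries can drop it via na.rm = TRUE.
mean(x$mpg, na.rm = TRUE)

# The diagonal argument substitutes another value, here 1.
x1 <- correlate(mtcars[, c("mpg", "cyl", "disp")], diagonal = 1, quiet = TRUE)
x1$mpg[1]
```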

The use argument and its possible values are inherited from stats::cor():

"everything": NAs will propagate conceptually, i.e. a resulting value will be NA whenever one of its contributing observations is NA
"all.obs": the presence of missing observations will produce an error
"complete.obs": correlations will be computed from complete observations, with an error being raised if there are no complete cases.
"na.or.complete": correlations will be computed from complete observations, returning an NA if there are no complete cases.
"pairwise.complete.obs": the correlation between each pair of variables is computed using all complete pairs of those particular variables.
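
To make the difference between these options concrete, here is a small sketch on a made-up data frame d with one missing value in each of two columns. The data and object names are hypothetical, and the example assumes that correlate() passes use through to stats::cor() unchanged, as the text above describes.

```r
library(corrr)

# Toy data: a is missing in row 6, b is missing in row 5, c is complete.
d <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 1, 4, 3, NA, 6),
  c = c(1, 3, 2, 5, 4, 6)
)

# Each pair uses every row where both variables are present
# (rows 1 to 5 for the a/c pair).
pairwise <- correlate(d, use = "pairwise.complete.obs", quiet = TRUE)

# Every pair uses only rows with no missing value anywhere (rows 1 to 4).
complete <- correlate(d, use = "complete.obs", quiet = TRUE)

# Missing values propagate, so pairs involving a or b come back as NA.
everything <- correlate(d, use = "everything", quiet = TRUE)

# "all.obs" raises an error when any value is missing; capture its message.
all_obs_msg <- tryCatch(
  correlate(d, use = "all.obs", quiet = TRUE),
  error = function(e) conditionMessage(e)
)

pairwise$c[1]    # a/c correlation from rows 1 to 5
complete$c[1]    # a/c correlation from rows 1 to 4
everything$c[1]  # NA
```

Pairwise deletion keeps more data per pair, but each coefficient may then be based on a different set of rows.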

As of version 0.4.3, the first column of a cor_df object is named "term". In previous versions this first column was named "rowname".

There is a ggplot2::autoplot() method for quickly visualizing the correlation matrix. For more information, see autoplot.cor_df().
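
A minimal sketch of the plotting method follows. It assumes that ggplot2 is installed and that the installed corrr version provides autoplot.cor_df(); only the default arguments are used, and the names x and p are illustrative.

```r
library(corrr)
library(ggplot2)

# Build the correlation data frame without the informational message.
x <- correlate(mtcars, quiet = TRUE)

# Dispatches to the cor_df method and returns a ggplot object.
p <- autoplot(x)

# Printing the object draws the plot on the active graphics device.
print(p)
```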

Examples

if (FALSE) {
correlate(iris)
}

correlate(iris[-5])
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 4 × 5
#>   term         Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   <chr>               <dbl>       <dbl>        <dbl>       <dbl>
#> 1 Sepal.Length       NA          -0.118        0.872       0.818
#> 2 Sepal.Width        -0.118      NA           -0.428      -0.366
#> 3 Petal.Length        0.872      -0.428       NA           0.963
#> 4 Petal.Width         0.818      -0.366        0.963      NA    

correlate(mtcars)
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 11 × 12
#>    term     mpg    cyl   disp     hp    drat     wt    qsec     vs      am
#>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
#>  1 mpg   NA     -0.852 -0.848 -0.776  0.681  -0.868  0.419   0.664  0.600 
#>  2 cyl   -0.852 NA      0.902  0.832 -0.700   0.782 -0.591  -0.811 -0.523 
#>  3 disp  -0.848  0.902 NA      0.791 -0.710   0.888 -0.434  -0.710 -0.591 
#>  4 hp    -0.776  0.832  0.791 NA     -0.449   0.659 -0.708  -0.723 -0.243 
#>  5 drat   0.681 -0.700 -0.710 -0.449 NA      -0.712  0.0912  0.440  0.713 
#>  6 wt    -0.868  0.782  0.888  0.659 -0.712  NA     -0.175  -0.555 -0.692 
#>  7 qsec   0.419 -0.591 -0.434 -0.708  0.0912 -0.175 NA       0.745 -0.230 
#>  8 vs     0.664 -0.811 -0.710 -0.723  0.440  -0.555  0.745  NA      0.168 
#>  9 am     0.600 -0.523 -0.591 -0.243  0.713  -0.692 -0.230   0.168 NA     
#> 10 gear   0.480 -0.493 -0.556 -0.126  0.700  -0.583 -0.213   0.206  0.794 
#> 11 carb  -0.551  0.527  0.395  0.750 -0.0908  0.428 -0.656  -0.570  0.0575
#> # … with 2 more variables: gear <dbl>, carb <dbl>
#> # ℹ Use `colnames()` to see all variable names

if (FALSE) {
# Also supports database backends and collects the results into memory
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)
mtcars_tbl %>%
  correlate(use = "pairwise.complete.obs", method = "spearman")
spark_disconnect(sc)
}