describe_data() returns a set of common descriptive statistics (e.g., N, mean, SD) for a numeric variable.

Usage

describe_data(data, column, na.rm = TRUE, short = FALSE)

Arguments

data

A data frame.

column

The unquoted name of a numeric column in the data frame.

na.rm

Logical. Should missing values (including NaN) be excluded when calculating the descriptives? The default is TRUE.

short

Logical. Should only a subset of descriptives be reported? If set to TRUE, only the N, M, and SD will be returned. The default is FALSE.

Details

The data can be grouped using dplyr::group_by so that descriptives will be calculated for each group level.

When na.rm is set to FALSE, a percentage column will be added to the output that contains the percentage of non-missing data.
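The effect of missing values on the descriptives can be illustrated with a small base R sketch. This is not the code describe_data() uses internally; the vector x and the variable names below are made up for illustration, and the percentage shown is simply the share of non-missing values among all observations.

```r
# Hypothetical example vector containing both NA and NaN
x <- c(4, 7, NA, 5, NaN, 6)

# is.na() returns TRUE for NA and for NaN, so both count as missing
n_missing <- sum(is.na(x))
n_valid   <- sum(!is.na(x))

# With na.rm = TRUE the missing values are dropped before calculating
m_valid  <- mean(x, na.rm = TRUE)
sd_valid <- sd(x, na.rm = TRUE)

# With na.rm = FALSE a single missing value makes the result missing
m_all <- mean(x, na.rm = FALSE)

# Share of non-missing values among all observations, as a percentage
pct_valid <- n_valid / length(x) * 100
```

Here the mean with na.rm = TRUE is based on the four non-missing values only, while the mean with na.rm = FALSE is itself missing.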

Skew and kurtosis are based on the skewness and kurtosis functions of the moments package (Komsta & Novomestky, 2015).
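For readers who want to see what these statistics measure, the following base R sketch computes moment-based skewness and kurtosis. It assumes the usual population-moment definitions (third and fourth central moments scaled by powers of the second), which is how the moments package documents its skewness and kurtosis functions; the helper names central_moment, skew_sketch, and kurtosis_sketch are hypothetical and are not part of tidystats or moments.

```r
# k-th central moment of a numeric vector (divides by n, not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment divided by the second to the power 3/2
skew_sketch <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
}

# Kurtosis (Pearson's measure, not excess kurtosis): fourth central
# moment divided by the squared second central moment
kurtosis_sketch <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  central_moment(x, 4) / central_moment(x, 2)^2
}

# Hypothetical data with one missing value
x <- c(1, 2, 3, 4, 10, NA)
skew_sketch(x)
kurtosis_sketch(x)
```

Under this definition a normal distribution has a kurtosis of about 3, so values are not centered on zero the way excess kurtosis would be.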

Percentages are calculated based on the total of non-missing observations. When na.rm is set to FALSE, percentages are based on the total of missing and non-missing observations.

Examples

# Load the dplyr package for access to the %>% operator and group_by()
library(dplyr)

# Inspect descriptives of the response column from the 'quote_source' data
# frame included in tidystats
describe_data(quote_source, response)
#> # A tibble: 1 × 13
#>   var     missing     N     M    SD     SE   min   max range median  mode   skew
#>   <chr>     <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
#> 1 respon…      18  6325  5.59  2.19 0.0275     1     9     8      5     5 -0.137
#> # … with 1 more variable: kurtosis <dbl>

# Repeat the above, now for each level of the source column
quote_source %>%
  group_by(source) %>%
  describe_data(response)
#> # A tibble: 2 × 14
#> # Groups:   source [2]
#>   var     source missing     N     M    SD     SE   min   max range median  mode
#>   <chr>   <chr>    <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#> 1 respon… Bin L…      18  3083  5.23  2.11 0.0380     1     9     8      5     5
#> 2 respon… Washi…       0  3242  5.93  2.21 0.0388     1     9     8      6     5
#> # … with 2 more variables: skew <dbl>, kurtosis <dbl>
  
# Only inspect the total N, mean, and standard deviation
quote_source %>%
  group_by(source) %>%
  describe_data(response, short = TRUE)
#> # A tibble: 2 × 5
#> # Groups:   source [2]
#>   var      source         N     M    SD
#>   <chr>    <chr>      <int> <dbl> <dbl>
#> 1 response Bin Laden   3083  5.23  2.11
#> 2 response Washington  3242  5.93  2.21