describe_data returns a set of common descriptive statistics (e.g., n, mean, sd) for numeric variables.

describe_data(data, column, na.rm = TRUE, short = FALSE)

Arguments

data

A data frame.

column

An unquoted (numerical) column name from the data frame.

na.rm

Logical. Should missing values (including NaN) be excluded when calculating the descriptives? The default is TRUE.

short

Logical. Should only a subset of descriptives be reported? If set to TRUE, only N, M, and SD are returned. The default is FALSE.

Details

The data can be grouped using dplyr::group_by so that descriptives will be calculated for each group level.

When na.rm is set to FALSE, a percentage column will be added to the output that contains the percentage of non-missing data.
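As a rough illustration of the quantities involved, the following base R sketch counts missing values and derives N, M, SD, SE, and a percentage of non-missing data for a single numeric vector. It is not the package's implementation: the helper name sketch_descriptives, the column name pct, and the 0 to 100 scaling of the percentage are assumptions made for this example.

```r
# Hypothetical helper, not part of tidystats: a base R sketch of a few
# descriptives for one numeric vector.
sketch_descriptives <- function(x) {
  missing <- sum(is.na(x))       # is.na() is also TRUE for NaN
  n <- length(x) - missing       # number of non-missing observations
  sd_x <- sd(x, na.rm = TRUE)
  data.frame(
    missing = missing,
    N = n,
    M = mean(x, na.rm = TRUE),
    SD = sd_x,
    SE = sd_x / sqrt(n),         # standard error of the mean
    pct = 100 * n / length(x)    # assumed scaling: 0 to 100
  )
}

# Two of the five values are missing (one NA, one NaN)
result <- sketch_descriptives(c(2, 4, NA, 6, NaN))
```

Here result holds one row with missing = 2 and N = 3, so the non-missing share is three out of five observations.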

Skew and kurtosis are based on the skewness and kurtosis functions of the moments package (Komsta & Novomestky, 2015).
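For reference, the moments package defines these statistics from central moments: skewness as m3 / m2^(3/2) and kurtosis as m4 / m2^2. This is Pearson's measure of kurtosis, so a normal distribution scores 3 rather than 0. A minimal base R sketch of those definitions follows; the helper names are hypothetical, and the sketch assumes missing values have already been removed.

```r
# Hypothetical helpers, not part of tidystats or moments.
# k-th central moment, dividing by n (not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

sketch_skewness <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
}

sketch_kurtosis <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2
}

x <- c(1, 2, 3, 4, 5)
skew_x <- sketch_skewness(x)  # symmetric data, so the third moment is 0
kurt_x <- sketch_kurtosis(x)
```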

Percentages are calculated based on the total of non-missing observations. When na.rm is set to FALSE, percentages are based on the total of missing and non-missing observations.

Examples

# Load tidystats for describe_data() and the 'quote_source' data frame
library(tidystats)

# Load the dplyr package for access to the %>% operator and group_by()
library(dplyr)

# Inspect descriptives of the response column from the 'quote_source' data
# frame included in tidystats
describe_data(quote_source, response)
#> # A tibble: 1 x 13
#>   variable missing     N     M    SD     SE   min   max range median  mode
#>   <chr>      <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#> 1 response      18  6325  5.59  2.19 0.0275     1     9     8      5     5
#> # … with 2 more variables: skew <dbl>, kurtosis <dbl>
# Repeat the former, now for each level of the source column
quote_source %>%
  group_by(source) %>%
  describe_data(response)
#> # A tibble: 2 x 14
#> # Groups:   source [2]
#>   variable missing     N source     M    SD     SE   min   max range median
#>   <chr>      <int> <int> <chr>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1 response      18  3083 Bin L…  5.23  2.11 0.0380     1     9     8      5
#> 2 response       0  3242 Washi…  5.93  2.21 0.0388     1     9     8      6
#> # … with 3 more variables: mode <dbl>, skew <dbl>, kurtosis <dbl>
# Only inspect the total N, mean, and standard deviation
quote_source %>%
  group_by(source) %>%
  describe_data(response, short = TRUE)
#> # A tibble: 2 x 5
#> # Groups:   source [2]
#>   variable source         N     M    SD
#>   <chr>    <chr>      <int> <dbl> <dbl>
#> 1 response Bin Laden   3083  5.23  2.11
#> 2 response Washington  3242  5.93  2.21