Create a function to get summary statistics of a data frame in R

Problem Description:

I have the data frame df3 below.


I need to create a function ‘ST’ that returns the mean and standard deviation when a city is given. For example, if I call ST(NY), I should get a table like the one below.


XX are the values rounded to 2 decimal places. I wrote some code, but I am struggling to combine the pieces into one function. My code is below.

df3 %>%
   group_by(City) %>% 
   summarise_at(vars("Income","Cost","Age"), median,2)

ST <- function(c) {
  if (df3$City == s)
    dataframe (
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), mean,2),
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), sd,2)
  else {

Solution – 1

  1. No need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Candidly, even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to put things like that at the beginning of the function, organizing the code. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.

  2. From ?summarize_at,

    Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.), …

    I’ll demo the use of across.

  3. summarize can be given multiple (named) functions at once, I’ll show that, too.

  4. Your if (df3$City == .) is wrong for a few reasons, notably because if requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure) but the test is returning a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.

  5. Your function is using objects that were neither passed to it nor defined within it; this is bad practice. Best practice is to pass the data and arguments in the function call.
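To make point 4 concrete, here is a minimal base-R sketch. The data frame toy and its values are made up for illustration (the real df3 is not reproduced here). It only shows that the comparison yields one logical value per row, which is why it belongs in a row filter rather than in if.

```r
# Hypothetical toy data; the real df3 is not shown in the question.
toy <- data.frame(City   = c("NY", "Boston", "NY"),
                  Income = c(100, 200, 300))

# The comparison is vectorized: one TRUE/FALSE per row, not a single value.
is_ny <- toy$City == "NY"
length(is_ny)          # 3, so it cannot serve as an `if` condition

# Use the logical vector to pick rows instead.
ny_rows <- toy[is_ny, ]
mean(ny_rows$Income)   # 200
```

dplyr::filter does the same row selection inside a pipeline, which is what the function below relies on.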

ST <- function(X, city, na.rm = TRUE) {
  library(dplyr) # filter, summarize, across
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
#   variable     mu  sigma
#   <chr>     <dbl>  <dbl>
# 1 Income   7140.  3550. 
# 2 Cost     6773.  2576. 
# 3 Age        47.8   17.7

Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:

  1. NA inclusion works. Note that NA == NA returns NA (which derails a lot of conditional processing if not handled correctly) whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).

  2. It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it’s good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I’ll rename the function argument from city to cities, suggesting it can take more than one.)
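Both benefits can be checked quickly in base R. This is only an illustrative sketch; the vector city_col is a made-up stand-in for a City column that contains a missing value.

```r
# NA handling: == propagates NA, %in% gives a definite answer.
NA == NA      # NA
NA %in% NA    # TRUE

# Hypothetical stand-in for a City column with one missing entry.
city_col <- c("NY", NA, "Boston", "NY")

eq_ny   <- city_col == "NY"                  # TRUE    NA FALSE  TRUE
in_ny   <- city_col %in% "NY"                # TRUE FALSE FALSE  TRUE
in_both <- city_col %in% c("NY", "Boston")   # TRUE FALSE  TRUE  TRUE
```

With %in%, the right-hand side can hold any number of cities and the result still has exactly one TRUE/FALSE per row.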

From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in

ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr) # filter, group_by, summarize, across, mutate
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
#   City   variable     mu  sigma
#   <chr>  <chr>     <dbl>  <dbl>
# 1 Boston Income   5639.  1847. 
# 2 Boston Cost     6284.  2299. 
# 3 Boston Age        42     15.7
# 4 NY     Income   7140.  3550. 
# 5 NY     Cost     6773.  2576. 
# 6 NY     Age        47.8   17.7

Edit: I added the rounding.
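A note on why the printed table above still shows values like 7140. after rounding: as far as I know, a tibble prints about three significant digits by default, so rounding the stored value does not force two decimals on screen. The base-R sketch below uses made-up numbers to separate rounding a value from formatting it for display.

```r
# Made-up values, for illustration only.
x <- c(7139.567, 47.8333, 42)

# round() changes the stored numbers.
rounded <- round(x, 2)

# sprintf() returns character strings that always show two decimals.
shown <- sprintf("%.2f", x)
shown   # "7139.57" "47.83" "42.00"
```

If a fixed two-decimal display is needed, formatting to character as a last step is one option, at the cost of the columns no longer being numeric.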

Solution – 2

ST <- function(city_name) {
  df3 %>%  
    filter(City == city_name) %>% 
    pivot_longer(cols = Income:Age, names_to = "variable") %>%  
    group_by(City, variable) %>%  
    summarise(mean = mean(value), 
              sd = sd(value), .groups = "drop")
}

ST("Boston")

# A tibble: 3 × 4
  City   variable  mean     sd
  <chr>  <chr>    <dbl>  <dbl>
1 Boston Age        42    15.7
2 Boston Cost     6284. 2299. 
3 Boston Income   5639. 1847. 