## Create a function to get summary statistics of a data frame in R

Problem Description:

I have the data frame `df3` below.

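| City | Income | Cost | Age |
|---|---|---|---|
| NY | 1237 | 2432 | 43 |
| NY | 6352 | 8632 | 32 |
| Boston | 6487 | 2846 | 54 |
| NJ | 6547 | 7353 | 42 |
| Boston | 7564 | 7252 | 21 |
| NY | 9363 | 7563 | 35 |
| Boston | 3262 | 7352 | 54 |
| NY | 9473 | 8667 | 76 |
| NJ | 6234 | 4857 | 31 |
| Boston | 5242 | 7684 | 39 |
| NJ | 7483 | 4748 | 47 |
| NY | 9273 | 6573 | 53 |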
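
For anyone who wants to run the code in this thread, the table can be entered as a data frame. This is a sketch that assumes Income and Cost are four-digit values and Age is two digits, a split that matches the means reported in the answers further down.

```r
# Assumed column split: 4-digit Income, 4-digit Cost, 2-digit Age
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53)
)

# Quick structural check: 12 rows, one character and three numeric columns
str(df3)
```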

I need to create a function `ST` that returns the mean and standard deviation for a given city. For example, if I call `ST(NY)`, I should get a table like the one below.

| variable | Mean | SD |
|---|---|---|
| Income | XX | XX |
| Cost | XX | XX |
| Age | XX | XX |

XX are the values to 2 decimal places. I wrote some code, but I am struggling to combine the pieces into one function. My code is below.

```r
library(dplyr)
df3 %>%
  group_by(City) %>%
  summarise_at(vars("Income","Cost","Age"), median,2)

ST <- function(c) {
  if (df3$City == s)
    dataframe (
      library(dplyr)
      df3 %>%
        group_by(City) %>%
        summarise_at(vars("Income","Cost","Age"), mean,2),
      library(dplyr)
      df3 %>%
        group_by(City) %>%
        summarise_at(vars("Income","Cost","Age"), sd,2)
  else {
    "NA"
  }
}
ST(NJ)
```

## Solution – 1

1. No need to call `library(dplyr)` multiple times, and doing so in the middle of a `data.frame(..)` expression is not right. Even if that were syntactically correct code (it could be with `{...}` bracing), it is generally considered better to load packages once at the beginning of the function, which keeps the code organized: `ST <- function(c) { library(dplyr); ... }`.

2. The dplyr documentation for `summarise_at` notes: "Scoped verbs (`_if`, `_at`, `_all`) have been superseded by the use of `across()` in an existing verb. See `vignette("colwise")` for details." I’ll demo the use of `across`.

3. `summarize` (via `across`) can be given multiple named functions at once; I’ll show that, too.

4. Your `if (df3$City == .)` is wrong for a few reasons, notably because `if` requires its condition to be exactly length 1 (anything else is an error, a warning, and/or a logical failure, depending on the R version), but the test returns a `logical` vector as long as the number of rows in `df3`. A better tactic is to use `dplyr::filter`.

5. Your function uses objects that were neither passed to it nor defined within it; this is bad practice. Best practice is to pass the data and arguments in the function call.
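
To illustrate point 4, here is a minimal base R sketch. The vector `City` is a hypothetical stand-in for `df3$City`, not the real data.

```r
# Hypothetical toy vector standing in for df3$City
City <- c("NY", "NY", "Boston", "NJ")

is_ny <- City == "NY"   # one TRUE/FALSE per element, not a single value
length(is_ny)
# → 4

# `if (is_ny) ...` is an error in R >= 4.2.0 (earlier versions warned and
# looked only at the first element). Subsetting with the logical vector
# keeps the matching elements, which is what dplyr::filter does for rows.
City[is_ny]
# → "NY" "NY"
```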

```r
ST <- function(X, city, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
#   variable     mu  sigma
#   <chr>     <dbl>  <dbl>
# 1 Income   7140.  3550.
# 2 Cost     6773.  2576.
# 3 Age        47.8   17.7
```

Notice that I used `City %in% city` instead of `==`; in most cases this is identical, but there are two benefits to this:

1. `NA` inclusion works. Note that `NA == NA` returns `NA` (which stifles a lot of conditional processing if not handled correctly), whereas `NA %in% NA` returns `TRUE`, which seems more intuitive (to me at least).

2. It allows for `city` (the function argument) to be length other than 1, such as `ST(df3, c("NY", "Boston"))`. While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it’s good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I’ll rename the function argument from `city` to `cities`, suggesting it can take more than one.)
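
Both points can be checked quickly in base R. The vector `City` here is a hypothetical example with a missing value, not the data from the question.

```r
na_eq <- NA == NA
na_eq
# → NA
na_in <- NA %in% NA
na_in
# → TRUE

# Hypothetical vector with one missing city
City <- c("NY", "Boston", NA, "NJ")

eq_one <- City == "NY"                   # NA propagates into the result
eq_one
# → TRUE FALSE NA FALSE

in_many <- City %in% c("NY", "Boston")   # several cities at once, no NA in the result
in_many
# → TRUE TRUE FALSE FALSE
```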

From this use of `%in%`, it might make sense to include the city name in the output; this can be done by adding a `group_by` after the `filter`. The version below also takes a `digits` argument for rounding the results:

```r
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
#   City   variable     mu  sigma
#   <chr>  <chr>     <dbl>  <dbl>
# 1 Boston Income   5639.  1847.
# 2 Boston Cost     6284.  2299.
# 3 Boston Age        42     15.7
# 4 NY     Income   7140.  3550.
# 5 NY     Cost     6773.  2576.
# 6 NY     Age        47.8   17.7
```
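
A note on the "2 decimal places" requirement: tibbles print about three significant digits by default, so a stored value such as 7139.6 is displayed as `7140.` even after rounding. If the table is meant for display, one option is to format the numbers as text. A minimal base R sketch, using the NY means computed from the sample data (the vector `mu` is only illustrative):

```r
# NY means computed from the sample data (Income, Cost, Age)
mu <- c(7139.6, 6773.4, 47.8)

# sprintf() returns character strings with exactly two decimals
formatted <- sprintf("%.2f", round(mu, 2))
formatted
# → "7139.60" "6773.40" "47.80"
```

The result is character, not numeric, so this step belongs at the very end, after all calculations are done.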

## Solution – 2

```r
library(dplyr)
library(tidyr) # pivot_longer

ST <- function(city_name) {
  # Uses the global data frame df3 from the question
  df3 %>%
    filter(City == city_name) %>%
    # Stack Income, Cost and Age into variable/value pairs
    pivot_longer(cols = Income:Age, names_to = "variable") %>%
    group_by(City, variable) %>%
    summarise(mean = mean(value),
              sd = sd(value), .groups = "drop")
}

ST("Boston")

# # A tibble: 3 × 4
#   City   variable  mean     sd
#   <chr>  <chr>    <dbl>  <dbl>
# 1 Boston Age        42    15.7
# 2 Boston Cost     6284. 2299.
# 3 Boston Income   5639. 1847.
```
