Creating categorical income variable

Creating categorical income variable

Problem Description:

I have a dataframe with the following format:

IDIncomeYear
11002000
22002000
33002000
45002001
510002001
615002001
7100002002
8150002002
9200002002

I’d like to add a column called income_cat with three possible levels; "low", "medium" and "high" depending on whether the income is in the lower 33th percentile, the middle 33th percentile or the top 33th percentile of that specific year.

IDIncomeYearincome_cat
11002000low
22002000medium
33002000high
45002001low
510002001medium
615002001high
7100002002low
8150002002medium
9200002002high

I struggle to find the proper way to do this and would be very thankful for any suggestions!

Solution – 1

library(tidyverse) 

df %>%  
  group_by(Year) %>% 
  mutate(income_cat = case_when(Income > quantile(Income, 0.66) ~ "High", 
                                Income < quantile(Income, 0.33) ~ "Low", 
                                between(Income, 
                                        quantile(Income, 0.33),
                                        quantile(Income, 0.66)) ~ "Medium"))

# A tibble: 9 x 4
# Groups:   Year [3]
     ID Income  Year income_cat
  <dbl>  <dbl> <dbl> <chr>     
1     1    100  2000 Low       
2     2    200  2000 Medium    
3     3    300  2000 High      
4     4    500  2001 Low       
5     5   1000  2001 Medium    
6     6   1500  2001 High      
7     7  10000  2002 Low       
8     8  15000  2002 Medium    
9     9  20000  2002 High   

Solution – 2

This is your first question, so I will make an exception, but stackoverflow community generally asks for something to work with. In other words, we need at least some code. Otherwise, you just ask other to do work for you.
This should work in your case.

df$income_cat=as.factor(ifelse(df$Income<quantile(df$Income,0.33), 'low', 
                        ifelse(df$Income<quantile(df$Income,0.66), 'medium', 'high')))

Please tell me if it does or doesn’t work.

Solution – 3

Here’s an answer using data.table. Starting with toy data:

df <- data.table(id = 1:9, income = 100+100*(1:9), year = rep(2000+1:3, each = 3))

   id income year
1:  1    200 2001
2:  2    300 2001
3:  3    400 2001
4:  4    500 2002
5:  5    600 2002
6:  6    700 2002
7:  7    800 2003
8:  8    900 2003
9:  9   1000 2003

While grouping by year, we can use data.table::fcase and the quantile function included in base R:

library(data.table)
setDT(df)

df[, income_cat := fcase(income < quantile(income, 0.33), "low",
                         income < quantile(income, 0.66), "mid",
                         default = "high"), 
   by = year]
df
   id income year income_cat
1:  1    200 2001        low
2:  2    300 2001        mid
3:  3    400 2001       high
4:  4    500 2002        low
5:  5    600 2002        mid
6:  6    700 2002       high
7:  7    800 2003        low
8:  8    900 2003        mid
9:  9   1000 2003       high

Solution – 4

library(data.table)
library(magrittr)

df <- data.table(id = 1:9, income = 100+100*(1:9), year = rep(2000+1:3, each = 3))

df[, res := cut(
  x = income, 
  breaks = c(-Inf, quantile(x = income, probs = seq(0, 1, 1 /3))[2:3], +Inf),
  labels = c("Low", "Medium", "High")), by = year] %>%
  .[]
#>    id income year    res
#> 1:  1    200 2001    Low
#> 2:  2    300 2001 Medium
#> 3:  3    400 2001   High
#> 4:  4    500 2002    Low
#> 5:  5    600 2002 Medium
#> 6:  6    700 2002   High
#> 7:  7    800 2003    Low
#> 8:  8    900 2003 Medium
#> 9:  9   1000 2003   High

Created on 2022-11-30 with reprex v2.0.2

Solution – 5

I understand that year is a grouping factor.
Here is a version with findInterval.
I have tried to recreate the idea of your dataframe and added some more random data points to each group – see data at the end.

library(dplyr)

df %>%
  group_by(year) %>%
  mutate(quantile = findInterval(income,
        quantile(income, probs=c(0.3, .66)))) |> 
  mutate(quantile = factor(quantile, labels = c("low", "medium", "high")))
#> # A tibble: 30 × 3
#> # Groups:   year [3]
#>    income  year quantile
#>     <int> <dbl> <fct>   
#>  1    258  2000 medium  
#>  2    278  2000 high    
#>  3    113  2000 low     
#>  4    294  2000 high    
#>  5    269  2000 medium  
#>  6    149  2000 low     
#>  7    217  2000 medium  
#>  8    142  2000 low     
#>  9    298  2000 high    
#> 10    297  2000 high    
#> # … with 20 more rows

Data

set.seed(123)
income <- c(sample(100:300, 10),
            sample(500:1500,10),
            sample(10000:20000, 10))
year <- c(rep(2000,10), rep(2001,10), rep(2002,10))

df <- data.frame(income, year)
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject