R: Selecting Multiple Columns Based on Conditions

R: Selecting Multiple Columns Based on Conditions

Problem Description:

I have the following data in R:

id = 1:5
col1 = c("john", "henry", "adam", "jenna", "Phone: 222 2222")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("Phone: 111 1111 111", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("matt", "peter", "Phone: 333 333 1113", "Phone: 444 111 1153", "kevin")
my_data = data.frame(id, col1, col2, col3, col4)

      id            col1             col2                col3                col4
1  1            john    river B8C 9L4 Phone: 111 1111 111                matt
2  2           henry Field U9H 5E2 PP               steve               peter
3  3            adam               NA  forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4           jenna ocean A1B 5H1 dd                  NA Phone: 444 111 1153
5  5 Phone: 222 2222             dave                  NA               kevin

I am trying to accomplish the following task – I would like to create a new dataset with the following columns. For each row:

  • Step 1: id (trivial, this is always the first column)
  • Step 2: A column with the phone number
  • Step 3: A column that satisfies the following condition '(([A-Z] ?[0-9]){3})|.', '\1'
  • Step 4: Once Step 1 – Step 3 has been completed, I would like to combine all names into a single column

Here is a sample of the desired output:

  id              name             address                phone
1  1         john matt       river B8C 9L4  Phone: 111 1111 111
2  2 henry steve peter    Field U9H 5E2 PP                   NA
3  3              adam  forest K0Y 1U9 hu2  Phone: 333 333 1113
4  4             jenna  ocean A1B 5H1 dd    Phone: 444 111 1153
5  5        dave kevin                  NA      Phone: 222 2222

Here is the code I have written:

my_data$col1[grep("Phone", my_data$col1)]
my_data$col2[grep("Phone", my_data$col2)]
my_data$col3[grep("Phone", my_data$col3)]
my_data$col4[grep("Phone", my_data$col4)]

my_data$col1[grep( '(([A-Z] ?[0-9]){3})|.', '\1' , my_data$col1)]
my_data$col2[grep('(([A-Z] ?[0-9]){3})|.', '\1', my_data$col2)]
my_data$col3[grep('(([A-Z] ?[0-9]){3})|.', '\1', my_data$col3)]
my_data$col4[grep('(([A-Z] ?[0-9]){3})|.', '\1', my_data$col4)]

Based on the above code, I was thinking on identifying which of the columns meet the condition in each step, and then using the COLASCE statement in dplyr to create the final dataset. But I think this might be a very long way of accomplishing this problem.

Can someone please suggest a faster way to solve this problem?

Thanks!

Solution – 1

Try this:

tmp <- apply(my_data[,-1], 1, function(z) { z <- z[!is.na(z) & z != "NA"]; ind <- dplyr::case_when(grepl("^Phone:", z) ~ "phone", grepl("(([A-Z] ?[0-9]){3})", z) ~ "address", TRUE ~ "name"); data.frame(lapply(split(z, ind), paste, collapse = " ")); })
tmp
# [[1]]
#         address      name               phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# [[2]]
#            address              name
# 1 Field U9H 5E2 PP henry steve peter
# [[3]]
#              address name               phone
# 1 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# [[4]]
#            address  name               phone
# 1 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# [[5]]
#         name           phone
# 1 dave kevin Phone: 222 2222

With this, we can combine them (I prefer dplyr::bind_rows or data.table::rbindlist).

dplyr::bind_rows(tmp)
#              address              name               phone
# 1      river B8C 9L4         john matt Phone: 111 1111 111
# 2   Field U9H 5E2 PP henry steve peter                <NA>
# 3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5               <NA>        dave kevin     Phone: 222 2222
cbind(my_data[,1,drop=FALSE], dplyr::bind_rows(tmp))
#   id            address              name               phone
# 1  1      river B8C 9L4         john matt Phone: 111 1111 111
# 2  2   Field U9H 5E2 PP henry steve peter                <NA>
# 3  3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4  4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5  5               <NA>        dave kevin     Phone: 222 2222

Solution – 2

One solution based on grep

setNames(data.frame(my_data$id, 
sapply(c("^[a-z]+[a-z]+$", "^[^P]+.*[[:digit:]]+", "Phone"), function(srch) 
  apply(my_data[,-1], 1, function(x) 
    paste(grep(srch, x, value=T), collapse=" ")))),
  c("Id", "Name", "Address", "Phone"))
  Id              Name            Address               Phone
1  1         john matt      river B8C 9L4 Phone: 111 1111 111
2  2 henry steve peter   Field U9H 5E2 PP
3  3              adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4             jenna   ocean A1B 5H1 dd Phone: 444 111 1153
5  5        dave kevin                        Phone: 222 2222
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject