Compare multiple character columns dataframe R and create new column based on condition

Problem Description:

I am trying to automate a process in R, if possible, because otherwise I would have to check 5000 rows by hand.

I attach a toy example to make the process I would like to perform clearer.

I have compared 5 methods that classify reads to species.

Consider for example the first 5 cases:

code <- sprintf("sample %d", 1:5)

Specie_methodA <- c("NA", "NA", "NA", "NA", "Escherichia coli")
Specie_methodB <- c("Methanobrevibacter smithii", "NA", "NA", "Blautia faecis", "NA")
Specie_methodC <- c("", "", "", "Blautia faecis", "")
Specie_methodD <- c("NA", "NA", "CAG-41_sp900066215", "NA", "")
Specie_methodE <- c("", "", "", "", "Campylobacter coli")

table <- data.frame(code, Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE)

For each row, I would like to check whether a species was identified and, if so, write its name in a new column (desired_output in table2, see the code below). If two different species are identified within a row across the 5 methods, I want the string "ERROR" as output. If no species is detected by any of the 5 methods, the output should be "NA".

Therefore, from the table above, I would like to obtain the following output:

desired_output <- c("Methanobrevibacter smithii", "NA", "CAG-41_sp900066215", "Blautia faecis", "ERROR")
table2 <- data.frame(code, Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE, desired_output)

Solution – 1

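We can create a user-defined function that collapses the five calls of one row into a single value:

library(dplyr)  # provides %>%, mutate(), across() and rowwise()

get_desired_output <- function(specie1, specie2, specie3, specie4, specie5) {
  species <- c(specie1, specie2, specie3, specie4, specie5)
  # remove empty strings, the literal string "NA", and duplicates
  species <- species[!(species %in% c('NA', ''))] %>% unique()
  # no method reported a species
  if (length(species) == 0) {
    return('NA')
  }
  # the methods disagree
  if (length(species) > 1) {
    return('ERROR')
  }
  # exactly one distinct species was reported
  return(species)
}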
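
The same rule can also be written against a single character vector, so that it does not depend on there being exactly five methods. This is only a sketch in base R: the name collapse_species is hypothetical (it is not part of the answer above), and it additionally drops real missing values, on the assumption that some columns may contain NA rather than the string "NA".

```r
# Hypothetical variant of the helper: takes all calls for one row as one vector.
collapse_species <- function(species) {
  # Drop real missing values, the literal string "NA", and empty strings,
  # then keep each remaining species name once.
  species <- unique(species[!is.na(species) & !(species %in% c("NA", ""))])
  if (length(species) == 0) return("NA")     # no method detected anything
  if (length(species) > 1) return("ERROR")   # the methods disagree
  species                                    # one agreed species
}

collapse_species(c("NA", "Methanobrevibacter smithii", "", "NA", ""))
# [1] "Methanobrevibacter smithii"
collapse_species(c("Escherichia coli", "NA", "", "", "Campylobacter coli"))
# [1] "ERROR"
collapse_species(c("NA", NA, "", "NA", ""))
# [1] "NA"
```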

If dplyr >= 1.0.0:

output <- table %>%
  mutate(across(Specie_methodA:Specie_methodE, as.character)) %>%
  rowwise() %>%
  mutate(desired_output = get_desired_output(Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE))


If dplyr < 1.0.0:

output <- table %>%
  mutate_at(vars(Specie_methodA:Specie_methodE), as.character) %>%
  rowwise() %>%
  mutate(desired_output = get_desired_output(Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE))

> output
Source: local data frame [5 x 7]
Groups: <by row>

# A tibble: 5 x 7
  code     Specie_methodA  Specie_methodB       Specie_methodC Specie_methodD   Specie_methodE   desired_output      
  <fct>    <chr>           <chr>                <chr>          <chr>            <chr>            <chr>               
1 sample ~ NA              Methanobrevibacter ~ ""             NA               ""               Methanobrevibacter ~
2 sample ~ NA              NA                   ""             NA               ""               NA                  
3 sample ~ NA              NA                   ""             CAG-41_sp900066~ ""               CAG-41_sp900066215  
4 sample ~ NA              Blautia faecis       Blautia faecis NA               ""               Blautia faecis      
5 sample ~ Escherichia co~ NA                   ""             ""               Campylobacter c~ ERROR
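
If dplyr is not available, roughly the same result can be obtained with base R alone. The following is a minimal sketch, not the method from the answer above: it rebuilds the toy data, selects the method columns by their shared "Specie_method" prefix (an assumption about how the real columns are named), and applies the rule row by row with apply().

```r
# Toy data from the question, kept as character columns
code <- sprintf("sample %d", 1:5)
table <- data.frame(
  code,
  Specie_methodA = c("NA", "NA", "NA", "NA", "Escherichia coli"),
  Specie_methodB = c("Methanobrevibacter smithii", "NA", "NA", "Blautia faecis", "NA"),
  Specie_methodC = c("", "", "", "Blautia faecis", ""),
  Specie_methodD = c("NA", "NA", "CAG-41_sp900066215", "NA", ""),
  Specie_methodE = c("", "", "", "", "Campylobacter coli"),
  stringsAsFactors = FALSE
)

# Assumption: every method column starts with "Specie_method"
method_cols <- grep("^Specie_method", names(table), value = TRUE)

# apply() hands each row to the function as a character vector
table$desired_output <- unname(apply(table[method_cols], 1, function(row) {
  found <- unique(row[!is.na(row) & !(row %in% c("NA", ""))])
  if (length(found) == 0) "NA" else if (length(found) > 1) "ERROR" else found
}))

table$desired_output
# [1] "Methanobrevibacter smithii" "NA"
# [3] "CAG-41_sp900066215"         "Blautia faecis"
# [5] "ERROR"
```

Because the columns are picked by pattern rather than listed by name, adding a sixth method column would not require changing the function.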