Data matching / data selection with multiple conditions in a long-format data frame in R
I have been struggling with this problem for a while. It is a rather complex data selection with multiple possible outputs, and I can't find the expression I want. I am measuring divorce rates in a colony of birds.
Reproducible data:
    nest <- rep(1:10, 2)
    year <- c(rep(2014, 10), rep(2015, 10))
    pair <- c("th4327_th4317", "2", "th8522_t75390", "4", "tj1704_tj1703",
              "th4335_th4333", "7", "8", "th4337_th4323", "t74703_th1797",
              "th4327_th4317", "12", "th8522_t75550", "14", "tj1704_na",
              "th4335_th4333", "17", "th8715_th8714", "th4388_th4323", "te9639_th9675")
    test <- data.frame(nest, year, pair)
    test$pair <- as.character(test$pair)
    test$year <- as.character(test$year)
The underscore separates the IDs of the 2 members of a pair. When no ID is present, an increasing number is used as a placeholder. The same nests are listed each year. Across 2 consecutive years there are 5 possible scenarios, plus an unknown case (the numbers are nest IDs):
Same pair 2014-2015: nests 1, 6
Empty 2014-2015: nests 2, 4, 7
Empty 2014, occupied 2015: nest 8
Change of pairs in the same nest: nest 10
Change of 1 member of the pair: nests 3, 9
Unknown: nest 5
The results I am after are:
Pairs that stayed together ("same pair 2014-2015"): 2
Pairs in which 1 member changed ("change of 1 member of the pair"): 2
I figured out how to calculate the pairs that stay together...
    same <- test$pair[test$year == "2014"] %in% test$pair[test$year == "2015"]
    table(same)
However, I cannot obtain the information on the pairs that divorce.
I tried several commands, such as which and ifelse, but have not been successful.
I am happy to give further explanation if this is not clear. I know it is quite a messy problem.
Thanks a lot, best.
have fun
Here is an approach using merge. The strategy is as follows. First, split the pairs into p1 and p2 (I did this with tidyr::separate). Then subset the data by year and merge using p1 as the unique identifier. This means there are 2 different p2 columns, 1 for 2014 and 1 for 2015. It is then straightforward to test whether groups stay together or divorce.
If you have many years, this approach will need to be generalized. I will gladly provide such a generalization if need be.
    library(tidyr)
    library(dplyr)  # needed for %>%, filter() and select()

    test <- test %>%
      filter(nchar(pair) > 3) %>%             # getting rid of missing pairs
      separate(pair, c("p1", "p2"), "_") %>%  # splitting pairs into p1 and p2
      select(-nest)                           # getting rid of nest, as it is superfluous

    # merge the two years on p1; all = TRUE keeps birds seen in only one year
    test <- merge(test[test$year == "2014", ], test[test$year == "2015", ],
                  by = "p1", all = TRUE)

    # same group across 2014 and 2015
    na.omit(test[test$p2.x == test$p2.y, grep("p", names(test))])

    # different group across 2014 and 2015
    na.omit(test[test$p2.x != test$p2.y, grep("p", names(test))])
Update
To generalize the code to many years, use the following code. It is a better approach than looping. Note that the code above originally did not work because I forgot to load the dplyr library. Be sure to download and load both dplyr and tidyr. These libraries are great for data manipulation. Here are sources on tidyr and dplyr. Let me know if you have more problems.
    library(tidyr)
    library(dplyr)

    test <- test %>%
      filter(nchar(pair) > 3) %>%             # getting rid of missing pairs
      separate(pair, c("p1", "p2"), "_") %>%  # splitting pairs
      select(-nest)                           # getting rid of nest, as it is superfluous

    test <- split(test, test$year)  # split data into a list by year

    # This step can be omitted. It ensures the final data set looks nice
    # by renaming p2 to p2_<year> in each list element.
    test <- Map(function(d, n) {
      names(d)[grepl("p2", names(d))] <- paste("p2", n, sep = "_")
      d
    }, d = test, n = names(test))

    test <- Reduce(function(...) merge(..., by = "p1", all = TRUE), test)
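To make the Map/Reduce step easier to follow, here is a minimal sketch on made-up data with three years. The data frame `toy` and its bird IDs are hypothetical and not part of the question; the year column is dropped before merging, which is my own simplification so that only the renamed p2 columns remain.

```r
# Hypothetical toy data: three years, pairs already split into p1 and p2
toy <- data.frame(
  year = c("2014", "2015", "2016", "2014", "2015"),
  p1   = c("a", "a", "a", "b", "b"),
  p2   = c("x", "x", "y", "z", "z"),
  stringsAsFactors = FALSE
)

# One data frame per year, keeping only the bird columns
by_year <- split(toy[, c("p1", "p2")], toy$year)

# Rename p2 to p2_<year> so the merged columns are self-describing
by_year <- Map(function(d, n) {
  names(d)[names(d) == "p2"] <- paste("p2", n, sep = "_")
  d
}, by_year, names(by_year))

# Fold the list into one wide data frame; all = TRUE keeps birds
# that are missing in some years (their partner becomes NA)
wide <- Reduce(function(x, y) merge(x, y, by = "p1", all = TRUE), by_year)
wide
```

Each row of `wide` is one p1 bird with one partner column per year, so a partner change shows up as differing values along the row, and a missing year shows up as NA.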
Without packages (i.e. in base R)
If you don't want to use the dplyr and tidyr packages, you can replace the first several lines of code (up until split is called) with a base R approach:
    test <- test[nchar(test$pair) > 3, !names(test) %in% "nest"]
    split_pair <- do.call(rbind, strsplit(test$pair, "_"))
    test$p1 <- split_pair[, 1]
    test$p2 <- split_pair[, 2]
    test <- test[, !names(test) %in% "pair"]
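As a small illustration of what the strsplit and do.call(rbind, ...) combination produces, here is a sketch on two pair strings taken from the example data. The variable names `pairs` and `parts` are only for this illustration.

```r
# Two pair strings from the example data
pairs <- c("th4327_th4317", "tj1704_na")

# strsplit() returns a list with one character vector per pair
parts <- strsplit(pairs, "_")

# rbind() stacks those vectors into a character matrix with 2 columns
split_pair <- do.call(rbind, parts)

split_pair[, 1]  # first member of each pair
split_pair[, 2]  # second member of each pair
```

This relies on every string containing exactly one underscore, which is why the placeholder rows without IDs are filtered out before splitting.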
Final update... hopefully
Have Fun brings up a great point in the comment below. Since I use p1 as the unique identifier, it is not possible to identify when a p2 bird changes partners. To overcome this, do the following...
    test <- split(test, test$year)  # split data into a list by year

    # merge on both p1 and p2 to overcome the previous problem;
    # the pair is now the unique identifier
    test <- Reduce(function(...) merge(..., by = c("p1", "p2"), all = TRUE), test)

    # stayed in the same relationship
    stay <- test$year.x == "2014" & test$year.y == "2015"
    na.omit(test[stay, ])

    # p1 changes couples between year.x and year.y
    tp1 <- test[test$p1 %in% test[duplicated(test$p1), "p1"],
                c("p1", "p2", "year.x", "year.y")]
    is_na <- (is.na(tp1$year.x) & is.na(tp1$year.y))
    stay_tp1 <- tp1$year.x == "2014" & tp1$year.y == "2015"
    stay_tp1[is.na(stay_tp1)] <- FALSE
    tp1 <- tp1[!(stay_tp1 | is_na), ]

    # A similar approach works for p2. Notice that it is best to do this in a
    # function. If you use a function, remember that you need to pass the
    # variables as strings, unless you want to use NSE.
The final bit of code might be a bit confusing. Let me explain. To identify whether a bird changes partners, I identify duplicates, since a bird that moves from 1 pair to another will appear twice. In the case of many years, however, a bird can change pairs in any 1 of several years. To identify the correct year in which the bird changes, you need to use the above code. I suggest you construct a function to deal with this case, since there is a fair bit of typing involved.
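One possible shape for such a function is sketched below. It is not the code from the answer: the name `partner_changes` and its arguments are hypothetical, it works on the long data (one row per pair and year, with p1 and p2 already split), and it counts distinct partners per bird instead of using duplicated(). Column names are passed as strings, as suggested above.

```r
# Hypothetical helper: return the rows of every bird in `id_col`
# that appears with more than one distinct partner in `partner_col`
partner_changes <- function(df, id_col, partner_col, year_col = "year") {
  # number of distinct partners for each bird
  n_partners <- tapply(df[[partner_col]], df[[id_col]],
                       function(p) length(unique(p)))
  changed <- names(n_partners)[n_partners > 1]

  out <- df[df[[id_col]] %in% changed, c(id_col, partner_col, year_col)]
  out[order(out[[id_col]], out[[year_col]]), ]  # sort by bird, then year
}

# Rows for nests 3 and 9 of the example data, already split into p1 and p2
long <- data.frame(
  year = c("2014", "2015", "2014", "2015"),
  p1   = c("th8522", "th8522", "th4337", "th4388"),
  p2   = c("t75390", "t75550", "th4323", "th4323"),
  stringsAsFactors = FALSE
)

changes_p1 <- partner_changes(long, "p1", "p2")  # birds in p1 with a new partner
changes_p2 <- partner_changes(long, "p2", "p1")  # birds in p2 with a new partner
changes_p1
changes_p2
```

Because the result keeps the year column, the year in which the partner differs can be read directly from the sorted rows, which also covers the case of more than two years.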