R - Remove duplicates from a dataset based on criteria
I have a dataset of scores:
id sub score
1  mat 45
2  mat 34
3  mat 67
1  mat 43
2  mat 34
4  mat 22
5  sci 78
6  mat 32
1  mat 56
1  sci 40
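For reproducibility, here is a minimal sketch that rebuilds this data as a data frame; the name df1 is assumed to match the object used in the answer code below:

# Reconstruction of the example data shown above
df1 <- data.frame(
  id = c(1, 2, 3, 1, 2, 4, 5, 6, 1, 1),
  sub = c("mat", "mat", "mat", "mat", "mat", "mat", "sci", "mat", "mat", "sci"),
  score = c(45, 34, 67, 43, 34, 22, 78, 32, 56, 40),
  stringsAsFactors = FALSE
)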
I want to output the top score for each id in each subject. For example, the new list should show:
id sub score
2  mat 34
3  mat 67
4  mat 22
5  sci 78
6  mat 32
1  mat 56
1  sci 40
I can find the duplicated results with:
results[duplicated(results[, c(1,2)]),]
How do I order the results and delete the lowest-scoring ones?
There are many ways to get the expected output. One option is dplyr: group by the 'id' and 'sub' columns, take the top-score observation with top_n(), and, if there are duplicate rows, use distinct().
library(dplyr)
df1 %>%
  group_by(id, sub) %>%
  top_n(1) %>%
  distinct()
#      id   sub score
#   (int) (chr) (int)
#1      2   mat    34
#2      3   mat    67
#3      4   mat    22
#4      5   sci    78
#5      6   mat    32
#6      1   mat    56
#7      1   sci    40
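In current dplyr (1.0.0 or later), slice_max() is the recommended replacement for top_n(); a sketch, assuming that version is installed:

library(dplyr)
# slice_max() keeps the rows with the largest 'score' in each group;
# with_ties = FALSE returns a single row even when scores tie,
# playing the role that distinct() plays above
df1 %>%
  group_by(id, sub) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()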
Or with data.table: convert the 'data.frame' to a 'data.table' with setDT(df1), then, grouped by 'id' and 'sub', order 'score' in descending order and subset the first row of each group combination (.SD[1L] or head(.SD, 1) can be used).
library(data.table)
setDT(df1)[order(-score), .SD[1L], by = .(id, sub)]
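An equivalent data.table idiom that avoids sorting the whole table is to locate the row of the group maximum with .I and which.max(); a sketch under the same setDT(df1) setup:

library(data.table)
setDT(df1)
# .I holds the original row numbers; which.max(score) picks the index
# of the highest score inside each (id, sub) group
df1[df1[, .I[which.max(score)], by = .(id, sub)]$V1]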
Or another option is unique() after ordering the columns, which keeps the first observation of each set of duplicates.
unique(setDT(df1)[order(id, sub, -score)], by = c('id', 'sub'))
Or in base R: order the columns, then use duplicated() to remove the rows that are duplicated in the first two columns.
df2 <- df1[with(df1, order(id, sub, -score)), ]
df2[!duplicated(df2[1:2]), ]
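If only the maximum score per (id, sub) pair is needed, as in the expected output, a one-line base R alternative with aggregate() also works; a sketch:

# Collapse to one row per (id, sub), keeping the maximum score
aggregate(score ~ id + sub, data = df1, FUN = max)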