r - Remove duplicates from a dataset based on criteria
I have a dataset of scores:
id sub score
 1 mat    45
 2 mat    34
 3 mat    67
 1 mat    43
 2 mat    34
 4 mat    22
 5 sci    78
 6 mat    32
 1 mat    56
 1 sci    40

I want to output the top score for each id in each subject. For example, the new list should show:
id sub score
 2 mat    34
 3 mat    67
 4 mat    22
 5 sci    78
 6 mat    32
 1 mat    56
 1 sci    40

I can find the duplicated results with:
results[duplicated(results[, c(1, 2)]), ]

How do I order the results and then delete the lowest-scoring ones?
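For a reproducible example, the data from the question can be constructed as follows (the answer below operates on this df1):

df1 <- data.frame(
  id    = c(1L, 2L, 3L, 1L, 2L, 4L, 5L, 6L, 1L, 1L),
  sub   = c("mat", "mat", "mat", "mat", "mat", "mat", "sci", "mat", "mat", "sci"),
  score = c(45L, 34L, 67L, 43L, 34L, 22L, 78L, 32L, 56L, 40L),
  stringsAsFactors = FALSE
)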
There are many ways to get the expected output. One option is dplyr: group by the 'id' and 'sub' columns, take the top-score observation with top_n(), and, since there can be duplicate rows (id 2 has the same score twice), use distinct() to drop them.
library(dplyr)
df1 %>%
  group_by(id, sub) %>%
  top_n(1) %>%
  distinct()
#      id   sub score
#   (int) (chr) (int)
# 1     2   mat    34
# 2     3   mat    67
# 3     4   mat    22
# 4     5   sci    78
# 5     6   mat    32
# 6     1   mat    56
# 7     1   sci    40
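As a side note, in dplyr 1.0.0 and later top_n() is superseded by slice_max(), which names the ordering column explicitly; a minimal sketch of the equivalent call:

library(dplyr)
# slice_max() keeps the highest-score row(s) per group;
# with_ties = FALSE returns exactly one row per (id, sub) group,
# so no separate distinct() step is needed
df1 %>%
  group_by(id, sub) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup()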
Or with data.table: convert the 'data.frame' to a 'data.table' (setDT(df1)), order by 'score' in descending order, and, grouped by 'id' and 'sub', subset the first row of each group (.SD[1L] or head(.SD, 1) can be used).

library(data.table)
setDT(df1)[order(-score), .SD[1L], by = .(id, sub)]
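An equivalent data.table idiom, assuming the same setDT(df1) conversion, indexes each group with which.max() and so avoids the initial sort:

# pick the row holding the maximum score within each (id, sub) group;
# which.max() returns the first maximum, resolving ties by row order
setDT(df1)[, .SD[which.max(score)], by = .(id, sub)]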
Another option is unique() after ordering the columns; because rows are sorted with the highest score first within each group, unique() keeps that top row for each 'id'/'sub' combination.

unique(setDT(df1)[order(id, sub, -score)], by = c('id', 'sub'))
Or with base R: order the columns, then use duplicated() to remove rows that are duplicated in the first two columns.

df2 <- df1[with(df1, order(id, sub, -score)), ]
df2[!duplicated(df2[1:2]), ]
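If only the maximum score per group matters (the example data has no other columns anyway), base R's aggregate() is a compact alternative; a sketch:

# collapse to one row per (id, sub) holding the maximum score
aggregate(score ~ id + sub, data = df1, FUN = max)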