twitter - Remove "semi-duplicate" rows in R
I have a dataset that looks like this:

  text id                 screenname     retweetcount isretweet retweeted longitude latitude
1 xx   778980737861062656 0504traveller  0            FALSE     FALSE     <NA>      <NA>
2 xx   778967536167559168 iz_azman       0            FALSE     FALSE     <NA>      <NA>
3 yy   778962265298960384 iz_azman       0            FALSE     FALSE     <NA>      <NA>
4 yy   778954988122939392 travelindtoday 2            FALSE     FALSE     <NA>      <NA>
5 zz   778948691969224705 umtn           2            FALSE     FALSE     <NA>      <NA>
6 zz   778942095843135493 flyinsider     0            FALSE     FALSE     <NA>      <NA>
These tweets come from the twitteR package in R. Some tweets have the same text but a different retweetcount. I want to keep only unique tweets (by text), keeping the one with the highest retweetcount amongst duplicates. (In the case above, tweets 1, 4, and 5.)
How do I do that?
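To make the question reproducible, the sample data can be rebuilt roughly as follows. This is a sketch, not the asker's actual object: storing id as character is an assumption (the values are too large to be held exactly as doubles), and the column types of the real twitteR output may differ.

```r
# Sample data from the question. Keeping id as character is an assumption,
# made so that the 18-digit values are not rounded.
df <- data.frame(
  text = c("xx", "xx", "yy", "yy", "zz", "zz"),
  id = c("778980737861062656", "778967536167559168", "778962265298960384",
         "778954988122939392", "778948691969224705", "778942095843135493"),
  screenname = c("0504traveller", "iz_azman", "iz_azman",
                 "travelindtoday", "umtn", "flyinsider"),
  retweetcount = c(0L, 0L, 0L, 2L, 2L, 0L),
  isretweet = FALSE,            # recycled to all six rows
  retweeted = FALSE,
  longitude = NA_character_,
  latitude = NA_character_,
  stringsAsFactors = FALSE
)

str(df)
```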
You can use dplyr:
library(dplyr)

# For each distinct text, keep the row with the largest retweetcount
# (which.max returns the first maximum, so ties keep the earliest row).
df %>%
  group_by(text) %>%
  slice(which.max(retweetcount))
#  text   id           screenname     retweetcount isretweet retweeted longitude latitude
#  (fctr) (dbl)        (fctr)         (int)        (lgl)     (lgl)     (fctr)    (fctr)
#1 xx     7.789807e+17 0504traveller  0            FALSE     FALSE     <NA>      <NA>
#4 yy     7.789550e+17 travelindtoday 2            FALSE     FALSE     <NA>      <NA>
#5 zz     7.789487e+17 umtn           2            FALSE     FALSE     <NA>      <NA>
Another approach in base R, using ave and order:
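# order() alone returns sort positions, not ranks; applying it twice gives each
# row's rank within its text group (1 = highest retweetcount, ties go to the first row).
df[ave(df$retweetcount, df$text, FUN = function(x) order(order(x, decreasing = TRUE))) == 1, ]
#  text id           screenname     retweetcount isretweet retweeted longitude latitude
#1 xx   7.789807e+17 0504traveller  0            FALSE     FALSE     <NA>      <NA>
#4 yy   7.789550e+17 travelindtoday 2            FALSE     FALSE     <NA>      <NA>
#5 zz   7.789487e+17 umtn           2            FALSE     FALSE     <NA>      <NA>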
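A further base R variant, not taken from the answers above, is to sort so that the highest retweetcount comes first within each text and then drop later duplicates with duplicated(). The sketch below uses a small hypothetical data frame (toy) whose first group has three rows, so the maximum is not in first position.

```r
# Hypothetical toy data: group "a" has three rows and its maximum is in the middle.
toy <- data.frame(
  text = c("a", "a", "a", "b", "b"),
  retweetcount = c(1L, 3L, 2L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Sort by text, then by retweetcount in descending order.
sorted <- toy[order(toy$text, -toy$retweetcount), ]

# duplicated() flags every repeat after the first occurrence of a text,
# so negating it keeps only the top row of each group.
result <- sorted[!duplicated(sorted$text), ]
result
```

Note that the result comes back ordered by text rather than in the original row order; the original row names are preserved, so it can be re-sorted if that matters.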