Python pandas: how to drop duplicates selectively
I need to look at the rows in column ['b'] and, if a row is non-empty, go to the corresponding row in column ['c'] and drop the duplicates of that particular index against all other rows in column ['c'], while preserving that particular index. I came across drop_duplicates, but I am unable to find a way to look for duplicates of one highlighted row only, as opposed to all duplicates in the column. I can't use drop_duplicates on the whole column because I want to retain duplicates in that column that may correspond to empty values in column ['b'].
So the possible scenarios would be: if I find a non-empty value in ['b'], I go to the current index in ['c'], find all duplicates of that one index, and drop them. These duplicates may correspond to empty or non-empty values in ['b']. If I find an empty value in ['b'], I skip to the next index. This way, it is possible for empty-value indices in ['b'] to be removed indirectly, because they are duplicates of an index in ['c'] that corresponds to a non-empty ['b'] value.
Edited with sample data:
Preprocessed:
df1 = pd.DataFrame([['', 'ccch'], ['chc', 'ccch'], ['cchcc', 'cnhcc'], ['', 'ccch'], ['cnhcc', 'cnoch'], ['', 'nch'], ['', 'nch']], columns=['b', 'c'])
df1
       b      c
0          ccch
1    chc   ccch
2  cchcc  cnhcc
3          ccch
4  cnhcc  cnoch
5           nch
6           nch
Post-processing, after dropping the correct duplicates:
df2 = pd.DataFrame([['chc', 'ccch'], ['cchcc', 'cnhcc'], ['cnhcc', 'cnoch'], ['', 'nch'], ['', 'nch']], columns=['b', 'c'], index=[1, 2, 4, 5, 6])
df2
       b      c
1    chc   ccch
2  cchcc  cnhcc
4  cnhcc  cnoch
5           nch
6           nch
Above we see that the only rows removed are rows 0 and 3, as they are duplicates in column ['c'] of row 1, which has a non-empty 'b' value. Rows 5 and 6 are kept even though they are duplicates of each other in column ['c'], because neither has a non-empty 'b' value. Rows 2 and 4 are kept because they have no duplicates in column ['c'].
So the logic would be: go through each row in column 'b'; if it is empty, move down a row and continue. If it is not empty, go to the corresponding row in column 'c' and drop the duplicates of that column 'c' row while preserving that index, then continue to the next row until this logic has been applied to all values in column 'b'.
column b value empty --> look at next value in column b
| or, if not empty |
column b not empty --> column c --> drop duplicates of that index in column c while keeping the current index --> look at next value in column b
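The flow above can be written down almost literally as a loop. This is only a minimal sketch of the logic as described in the question, not a recommended solution for large frames; the helper name `drop_selectively` is hypothetical, and it assumes empty values in 'b' are empty strings, as in the sample data.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['', 'ccch'], ['chc', 'ccch'], ['cchcc', 'cnhcc'], ['', 'ccch'],
     ['cnhcc', 'cnoch'], ['', 'nch'], ['', 'nch']],
    columns=['b', 'c'],
)


def drop_selectively(df):
    """Hypothetical helper: row-by-row version of the flow described above."""
    dropped = set()
    for idx, row in df.iterrows():
        # Skip rows already marked for removal and rows with an empty 'b'.
        if idx in dropped or row['b'] == '':
            continue
        # Mark every other row that has the same 'c' value as this row.
        same_c = (df['c'] == row['c']).to_numpy() & (df.index != idx)
        dropped.update(df.index[same_c])
    return df.drop(index=list(dropped))


result = drop_selectively(df1)
print(result)
```

On the sample data this keeps the rows with index 1, 2, 4, 5 and 6. Note that if two rows with a non-empty 'b' share the same 'c' value, this sketch keeps only the first of them, which is one possible reading of the description.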
I'd say group the DataFrame according to the 'c' column, and check each group for the existence of a non-empty entry in the 'b' column:
if there is no such entry, return the entire group;
otherwise, return the group restricted to the non-empty entries in 'b', with duplicates dropped.
In code:
def remove_duplicates(g):
    return g if sum(g.b == '') == len(g) else g[g.b != ''].drop_duplicates(subset='b')

>>> df1.groupby(df1.c).apply(remove_duplicates)['b'].reset_index()[['b', 'c']]
       b      c
0    chc   ccch
1  cchcc  cnhcc
2  cnhcc  cnoch
3           nch
4           nch
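If the original index labels should be preserved, as in the expected df2 from the question, the same grouping idea can be expressed as a boolean mask instead of `apply`. The following is a sketch under the assumption that empty 'b' values are empty strings; the variable names `has_b`, `group_has_b` and `keep` are illustrative only. Unlike the function above, it does not additionally drop duplicate non-empty 'b' values within a group.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['', 'ccch'], ['chc', 'ccch'], ['cchcc', 'cnhcc'], ['', 'ccch'],
     ['cnhcc', 'cnoch'], ['', 'nch'], ['', 'nch']],
    columns=['b', 'c'],
)

# True where the row itself has a non-empty 'b'.
has_b = df1['b'] != ''

# True where any row sharing the same 'c' value has a non-empty 'b'.
group_has_b = has_b.groupby(df1['c']).transform('any')

# Keep non-empty rows, plus empty rows whose 'c' group has no non-empty 'b'.
keep = has_b | ~group_has_b
result = df1[keep]
print(result)
```

On the sample data this keeps the rows with index 1, 2, 4, 5 and 6, with their original labels intact.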