python - Convert pandas DataFrame column with XML data to normalised columns?
I have a DataFrame in pandas, and one of its columns is an XML string. I want to create one column for each of the XML nodes, with the column names in normalised form. For example:
id  xmlcolumn
1   <main attr1='abc' attr2='xyz'><item><prop1>text1</prop1><prop2>text2</prop2></item></main>
2   <main ........</main>
I want to convert this into a data frame like so:
id  main.attr1  main.attr2  main.item.prop1  main.item.prop2
1   abc         xyz         text1            text2
2   .....
How can I do that, while still keeping the existing columns in the DataFrame?
The first step that needs to be done is to convert the XML string into a pandas Series (under the assumption that every row yields the same number of columns in the end). You need a function like:
def convert_xml(raw):
    # etree XML mangling goes here
    ...
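This can be achieved, for example, with an etree package in Python (such as the standard library's xml.etree.ElementTree). The returned Series must have an index in which each entry is the name of a new column that should appear, e.g. for your example: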
pd.Series(['abc', 'xyz'], index=['main.attr1', 'main.attr2'])
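As a rough illustration, here is a minimal sketch of what such a function could look like, assuming the standard library's xml.etree.ElementTree. The helper name `walk` and the dotted naming scheme are assumptions that follow the column names asked for in the question. Repeated sibling tags would overwrite each other in this sketch, so it only suits XML where each path occurs once.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def convert_xml(raw):
    """Flatten one XML string into a Series indexed by dotted node paths."""
    values = {}

    def walk(node, path):
        # Hypothetical helper: attributes become "<path>.<attribute name>".
        for name, value in node.attrib.items():
            values[f"{path}.{name}"] = value
        children = list(node)
        # Leaf nodes contribute their text under the node's own path.
        if not children and node.text and node.text.strip():
            values[path] = node.text.strip()
        for child in children:
            walk(child, f"{path}.{child.tag}")

    root = ET.fromstring(raw)
    walk(root, root.tag)
    return pd.Series(values)


row = convert_xml(
    "<main attr1='abc' attr2='xyz'>"
    "<item><prop1>text1</prop1><prop2>text2</prop2></item>"
    "</main>"
)
print(row)
```

The printed Series has one entry per attribute and per leaf node, with the dotted path as its index label.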
Given such a function, you can do the following with pandas (mocking away the XML mangling):
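import pandas as pd

frame = pd.DataFrame({'keep': [42], 'xml': ['<foo></foo>']})
# Expand each XML string into a row of new columns
temp = frame['xml'].apply(convert_xml)
# Drop the raw XML column and attach the expanded columns
frame = frame.drop('xml', axis=1)
frame = pd.concat([frame, temp], axis=1)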
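To make the whole flow concrete, the following is a sketch that runs those steps on data shaped like the question's. The second row's XML is invented for illustration, since the question elides it, and the compact `convert_xml` here is only one possible implementation, assuming xml.etree.ElementTree and that each node path occurs once per row.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def convert_xml(raw):
    """One possible flattening: dotted paths for attributes and leaf text."""
    values = {}
    # Explicit stack of (node, dotted path) pairs instead of recursion.
    stack = [(ET.fromstring(raw), None)]
    while stack:
        node, parent = stack.pop()
        path = node.tag if parent is None else f"{parent}.{node.tag}"
        for name, value in node.attrib.items():
            values[f"{path}.{name}"] = value
        children = list(node)
        if not children and node.text and node.text.strip():
            values[path] = node.text.strip()
        stack.extend((child, path) for child in children)
    return pd.Series(values)


frame = pd.DataFrame({
    "id": [1, 2],
    "xmlcolumn": [
        "<main attr1='abc' attr2='xyz'><item><prop1>text1</prop1>"
        "<prop2>text2</prop2></item></main>",
        # Hypothetical second row, not taken from the question.
        "<main attr1='def' attr2='uvw'><item><prop1>text3</prop1>"
        "<prop2>text4</prop2></item></main>",
    ],
})

# apply() returns a DataFrame when the function returns a Series per row.
temp = frame["xmlcolumn"].apply(convert_xml)
result = pd.concat([frame.drop("xmlcolumn", axis=1), temp], axis=1)
print(result)
```

The existing `id` column is kept, the raw XML column is dropped, and each dotted path becomes its own column.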