python - Convert pandas DataFrame column with XML data to normalised columns?
I have a DataFrame in pandas, and one of its columns holds an XML string. I want to create one column for each of the XML nodes, with the column names in normalised form. For example:

id  xmlcolumn
1   <main attr1='abc' attr2='xyz'><item><prop1>text1</prop1><prop2>text2</prop2></item></main>
2   <main ........</main>

I want to convert the data frame like so:

id  main.attr1  main.attr2  main.item.prop1  main.item.prop2
1   abc         xyz         text1            text2
2   .....

How can I do that, while still keeping the existing columns in the DataFrame?
The first step is to convert the XML string into a pandas Series (under the assumption that every row ends up with the same set of columns). You need a function like:

    def convert_xml(raw):
        # etree XML mangling goes here
        ...

This can be achieved, e.g., with the etree package in Python. The returned Series must have an index in which each entry is the name of a new column. For the example above:

    pd.Series(['abc', 'xyz'], index=['main.attr1', 'main.attr2'])

Given such a function, you can do the following in pandas (mocking away the XML mangling):

    import pandas as pd

    frame = pd.DataFrame({'keep': [42], 'xml': ['<foo></foo>']})
    # Applying a function that returns a Series yields a DataFrame, one column per index entry
    temp = frame['xml'].apply(convert_xml)
    # Drop the raw XML column and attach the new columns next to the ones being kept
    frame = frame.drop('xml', axis=1)
    frame = pd.concat([frame, temp], axis=1)
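To make the mocked part concrete, here is one possible way to fill in convert_xml with the standard library's xml.etree.ElementTree. Treat it as a rough sketch rather than the definitive answer: the nested walk helper, the result variable, and the second sample row are made up for illustration, and it assumes every row has the same structure with no repeated sibling tags (a repeated tag would overwrite the earlier value under the same column name).

```python
import xml.etree.ElementTree as ET

import pandas as pd


def convert_xml(raw):
    """Flatten one XML string into a Series keyed by dotted node paths."""
    values = {}

    def walk(node, path):  # hypothetical helper, not part of the original answer
        # Attributes become "<path>.<attribute name>".
        for name, value in node.attrib.items():
            values[f"{path}.{name}"] = value
        children = list(node)
        # A leaf element contributes its text under its own path.
        if not children and node.text and node.text.strip():
            values[path] = node.text.strip()
        for child in children:
            walk(child, f"{path}.{child.tag}")

    root = ET.fromstring(raw)
    walk(root, root.tag)
    return pd.Series(values)


frame = pd.DataFrame({
    "id": [1, 2],
    "xmlcolumn": [
        "<main attr1='abc' attr2='xyz'><item><prop1>text1</prop1>"
        "<prop2>text2</prop2></item></main>",
        # Made-up second row; the question elides the real one.
        "<main attr1='def' attr2='uvw'><item><prop1>text3</prop1>"
        "<prop2>text4</prop2></item></main>",
    ],
})

# Same drop-and-concat steps as in the answer, applied to the sample frame.
temp = frame["xmlcolumn"].apply(convert_xml)
result = pd.concat([frame.drop("xmlcolumn", axis=1), temp], axis=1)
print(result)
```

Building the Series from a dict keeps the column order equal to the order in which nodes are visited, so attributes of an element come before its children.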