python - How do I extract attributes from XML tags using Beautiful Soup?


I am trying to use Beautiful Soup in Django to extract XML tags. Here is a sample of the tags I'm using:

    <item>
      <title> title goes here </title>
      <link> link1 goes here </link>
      <description> description goes here </description>
      <media:thumbnail url="image url goes here" height="222" width="300"/>
      <pubDate>Thu, 15 Sep 2016 13:24:48 EDT</pubDate>
      <guid isPermaLink="true"> link2 goes here </guid>
    </item>

I have obtained the strings of the title, link, and description tags, but I'm having trouble obtaining the url attribute of the media:thumbnail tag.

This is the snippet that gets the values of the rest of the tags:

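    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # xmllink and the three lists are defined elsewhere in the view.
    soup = BeautifulSoup(urlopen(xmllink), 'xml')
    for items in soup.find_all('item'):
        listtitle.append(items.title.get_text())
        listurl.append(items.link.get_text())
        listdescription.append(items.description.get_text())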

Any help would be appreciated.

The issue is that not every item has a media:thumbnail, so you need to check for it first:

    In [60]: import requests

    In [61]: from bs4 import BeautifulSoup

    In [62]: soup = BeautifulSoup(requests.get("https://rss.sciencedaily.com/computers_math/computer_programming.xml").content, "xml")

    In [63]: for item in soup.find_all("item"):
       ....:     thumb = item.find("thumbnail")
       ....:     if thumb:
       ....:         print(thumb["url"])
       ....:
    https://images.sciencedaily.com/2016/09/160915132448.jpg
    https://images.sciencedaily.com/2016/09/160915090018.jpg
    https://images.sciencedaily.com/2016/09/160914090327.jpg
    https://images.sciencedaily.com/2016/09/160913134149.jpg
    https://images.sciencedaily.com/2016/09/160909094844.jpg
    https://images.sciencedaily.com/2016/09/160907125004.jpg
    https://images.sciencedaily.com/2016/09/160906085157.jpg
    https://images.sciencedaily.com/2016/08/160831085055.jpg
    https://images.sciencedaily.com/2016/08/160822181811.jpg
    https://images.sciencedaily.com/2016/08/160815134941.jpg
    https://images.sciencedaily.com/2016/08/160815134817.jpg
    https://images.sciencedaily.com/2016/08/160809095640.jpg
    https://images.sciencedaily.com/2016/08/160803140137.jpg
    https://images.sciencedaily.com/2016/07/160722104135.jpg
    https://images.sciencedaily.com/2016/07/160721144139.jpg
    https://images.sciencedaily.com/2016/07/160721103855.jpg
    https://images.sciencedaily.com/2016/07/160720094641.jpg
    https://images.sciencedaily.com/2016/07/160718133206.jpg
    https://images.sciencedaily.com/2016/07/160713105850.jpg
    https://images.sciencedaily.com/2016/07/160711151055.jpg
    https://images.sciencedaily.com/2016/07/160707083258.jpg
    https://images.sciencedaily.com/2016/06/160629125823.jpg
    https://images.sciencedaily.com/2016/06/160627125140.jpg
    https://images.sciencedaily.com/2016/06/160624101050.jpg
    https://images.sciencedaily.com/2016/06/160622104810.jpg
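If you are filling parallel lists as in the question, one option is to append a placeholder when the thumbnail is missing, so every list stays the same length. The following is only a rough sketch: the inline feed, the example.com URLs, and the listthumb name are made up for illustration, and the "xml" parser needs lxml installed.

```python
from bs4 import BeautifulSoup  # the "xml" parser also requires lxml

# Hypothetical two-item feed: only the first item has a media:thumbnail.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>first title</title>
      <link>http://example.com/1</link>
      <description>first description</description>
      <media:thumbnail url="http://example.com/1.jpg" height="222" width="300"/>
    </item>
    <item>
      <title>second title</title>
      <link>http://example.com/2</link>
      <description>second description</description>
    </item>
  </channel>
</rss>"""

soup = BeautifulSoup(SAMPLE_FEED, "xml")

listtitle, listurl, listdescription, listthumb = [], [], [], []
for item in soup.find_all("item"):
    listtitle.append(item.title.get_text())
    listurl.append(item.link.get_text())
    listdescription.append(item.description.get_text())
    # Check for the tag first; store None when the item has no thumbnail.
    thumb = item.find("thumbnail")
    listthumb.append(thumb.get("url") if thumb is not None else None)

print(listthumb)
```

Using None as the placeholder keeps the index of each thumbnail in step with its title and link, which matters if a template later zips the lists together.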

A faster alternative is to use lxml:

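    import requests
    from lxml import etree

    # Assumption: tree is the root element of the same feed, parsed with lxml.
    tree = etree.fromstring(requests.get("https://rss.sciencedaily.com/computers_math/computer_programming.xml").content)

    # Select only the media:thumbnail elements that sit inside an item.
    for item in tree.findall(".//item/media:thumbnail", tree.nsmap):
        parent = item.getparent()
        print(parent.xpath("title/text()")[0])
        print(parent.xpath("link/text()")[0])
        print(item.get("url"))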
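If lxml is not available, roughly the same namespace-aware lookup can be done with the standard library's xml.etree.ElementTree. Note that this swaps in a different library from the one above: ElementTree elements have no getparent(), so this sketch loops over the items and looks for the thumbnail inside each one. The inline feed, the example.com URLs, and the rows name are assumptions for illustration; the media prefix is mapped by hand to the Media RSS namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the second item has no media:thumbnail.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>first title</title>
      <link>http://example.com/1</link>
      <media:thumbnail url="http://example.com/1.jpg" height="222" width="300"/>
    </item>
    <item>
      <title>second title</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

# ElementTree does not expose nsmap, so the prefix mapping is written out.
NS = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(SAMPLE_FEED)

rows = []
for item in root.iter("item"):
    thumb = item.find("media:thumbnail", NS)
    if thumb is None:
        # Skip items without a thumbnail, like the lxml version does.
        continue
    rows.append((item.findtext("title"), item.findtext("link"), thumb.get("url")))

for title, link, url in rows:
    print(title, link, url)
```

Only items that actually carry a thumbnail end up in rows, which matches what the lxml loop prints.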
