Beautiful Soup – Scrape Nested Tags




The tags or elements in an HTML document are arranged hierarchically and can be nested to several levels. For example, the <head> and <body> tags are nested inside the <html> tag. Similarly, one or more <li> tags may sit inside a <ul> tag. In this chapter, we shall find out how to scrape a tag that has one or more child tags nested in it.

Let us consider the following HTML document −


<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

In this case, the two <div> tags and the <p> tag each have one or more child elements nested inside, whereas the <img> and <b> tags do not have any child tags.

The findChildren() method returns a ResultSet of all the tags nested under a tag, at every level of nesting and not just its direct children. So, if a tag doesn't have any nested tags, the ResultSet will be an empty list like [].
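As a minimal sketch of this behaviour (the one-line HTML snippet and the variable names p_children and b_children are illustrative and not part of the chapter's example), the result for a tag with a nested tag can be compared with the result for a tag that holds only text:

```python
from bs4 import BeautifulSoup

# Illustrative one-line document: <p> contains a nested <b>, <b> contains only text
soup = BeautifulSoup("<p>Hello<b>World</b></p>", "html.parser")

p_children = soup.p.findChildren()   # tags nested under <p>
b_children = soup.b.findChildren()   # <b> has no nested tags

print(p_children)        # [<b>World</b>]
print(len(b_children))   # 0
print(bool(b_children))  # False, because an empty ResultSet is falsy
```

Because an empty ResultSet is falsy, a plain `if tag.findChildren():` check is enough to tell whether a tag has anything nested inside it.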

Taking this as a cue, the following code finds the tags nested under each tag in the document tree and displays the list.

Example


from bs4 import BeautifulSoup

html = """
   <div id="outer">
      <div id="inner">
         <p>Hello<b>World</b></p>
         <img src='logo.jpg'>
      </div>
   </div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() with no arguments visits every tag in the document
for tag in soup.find_all():
   print("Tag: {} attributes: {}".format(tag.name, tag.attrs))
   # findChildren() lists every tag nested under the current tag
   print("Child tags: ", tag.findChildren())
   print()

Output


Tag: div attributes: {'id': 'outer'}
Child tags:  [<div id="inner">
<p>Hello<b>World</b></p>
<img src="logo.jpg"/>
</div>, <p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: div attributes: {'id': 'inner'}
Child tags:  [<p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: p attributes: {}
Child tags:  [<b>World</b>]

Tag: b attributes: {}
Child tags:  []

Tag: img attributes: {'src': 'logo.jpg'}
Child tags:  []
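The output above lists tags at every depth, which is why <b> appears under both <div> tags. If only the direct children are wanted, one possible approach is to call find_all(), of which findChildren() is an older alias in current Beautiful Soup releases, with recursive=False. The sketch below reuses the same HTML; the variable names descendants, direct and parents are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# All nested tags at every depth (the default, recursive=True)
descendants = [tag.name for tag in outer.find_all()]

# Only the tags that are direct children of <div id="outer">
direct = [tag.name for tag in outer.find_all(recursive=False)]

print(descendants)  # ['div', 'p', 'b', 'img']
print(direct)       # ['div']

# Tags anywhere in the document that contain at least one nested tag
parents = [tag.name for tag in soup.find_all() if tag.find_all()]
print(parents)      # ['div', 'div', 'p']
```

The last list matches the earlier observation that only the two <div> tags and the <p> tag have tags nested inside them.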
