Beautiful Soup – Navigating by Tags




Tags are among the most important elements of any HTML document. A tag may contain other tags and strings (the tag's children), and Beautiful Soup provides several ways to navigate and iterate over a tag's children.

The easiest way to navigate a parse tree is to refer to a tag by its name.

soup.head

The soup.head attribute returns the <head> element of an HTML page, including everything between <head> and </head>.


Consider the following HTML page to be scraped:
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

The following code extracts the <head> element:

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.head)

Output


<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>

soup.body

Similarly, soup.body returns the <body> element of the HTML page.

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.body)

Output


<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also chain tag names to extract a specific tag, such as the first <h1> tag inside the <body> tag.

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.body.h1)

Output


<h1>Tutorialspoint Online Library</h1>
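
Navigating by tag name always returns the first matching tag, and it returns None when no such tag exists. The following is a minimal sketch of that behavior. It assumes the markup is supplied as a string instead of being read from index.html; the variable name html_doc is only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline copy of the page, so the sketch runs without index.html
html_doc = """
<html>
   <head><title>TutorialsPoint</title></head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_h1 = soup.body.h1     # first <h1> inside <body>
bold = soup.body.p.b        # tag names can be chained to any depth
missing = soup.body.table   # the page has no <table>, so this is None

print(first_h1.name)        # h1
print(bold.string)          # It's all Free
print(missing)              # None
```

Because a missing tag yields None, chaining past it (for example soup.body.table.tr) raises an AttributeError, so it may be worth checking for None before going deeper.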

soup.p

Our HTML file contains a <p> tag. We can extract it with soup.p, which returns the first <p> tag in the document.

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.p)

Output


<p><b>It's all Free</b></p>

Tag.contents

A Tag object may contain one or more PageElement objects (other tags and strings). The Tag object's contents property returns a list of its direct children.

Let us find the elements in the <head> tag of our index.html file.

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.head
print(tag.contents)

Output


['\n',
<title>TutorialsPoint</title>,
'\n',
<script>
document.write("Welcome to TutorialsPoint");
</script>,
'\n']
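
The '\n' entries in this list are the whitespace strings between the tags. When only the child tags matter, one option is to filter the list by type. The sketch below assumes the <head> markup is passed in as a string (html_doc is an illustrative name) and uses the Tag class from bs4.element for the type check.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline copy of the <head> section used above
html_doc = """<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
head = soup.head

# contents is an ordinary list, so len() and indexing work on it
print(len(head.contents))

# Keep only Tag objects, dropping the whitespace strings between them
child_tags = [item for item in head.contents if isinstance(item, Tag)]
print([t.name for t in child_tags])   # ['title', 'script']
```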

Tag.children

The structure of tags in an HTML document is hierarchical: elements are nested one inside another. For example, the top-level <html> tag includes the <head> and <body> tags, each of which may contain other tags.

The Tag object has a children property that returns an iterator over the enclosed PageElement objects, that is, the tag's direct children.

To demonstrate the children property, we shall use the following HTML document (index.html). In the <body> section, a top-level <ul> list contains two department items, and each of them is followed by a nested <ul> list of employees.


<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python code lists all the child elements of the top-level <ul> tag.

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.ul
print(list(tag.children))

Output


['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

Since the children property returns an iterator, we can use a for loop to visit each direct child in turn.

Example


for child in tag.children:
   print(child)

Output


<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
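
The blank lines in this output are the whitespace strings between the tags, and the nested <li> tags appear only as part of their parent <ul>, because children visits direct children only. The sketch below shows one possible way to use that: it skips the strings and pairs each department with the employees in the <ul> that follows it. The html_doc string and the departments dictionary are illustrative names, not part of the original example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline copy of the nested list from index.html
html_doc = """
<ul>
<li>Accounts</li>
   <ul>
   <li>Anand</li>
   <li>Mahesh</li>
   </ul>
<li>HR</li>
   <ul>
   <li>Rani</li>
   <li>Ankita</li>
   </ul>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
top_list = soup.ul   # the first <ul>, i.e. the top-level list

# children yields direct children only; nested <li> tags are not visited
direct = [child.name for child in top_list.children if isinstance(child, Tag)]
print(direct)   # ['li', 'ul', 'li', 'ul']

# Pair each department <li> with the employees in the <ul> that follows it
departments = {}
current = None
for child in top_list.children:
    if not isinstance(child, Tag):
        continue                       # skip whitespace strings
    if child.name == "li":
        current = child.get_text(strip=True)
    elif child.name == "ul" and current is not None:
        departments[current] = [
            li.get_text(strip=True)
            for li in child.children
            if isinstance(li, Tag)
        ]

print(departments)
```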

Tag.find_all()

This method returns a ResultSet (a list-like object) containing all the tags that match the given tag name.

Let us consider the following HTML page (index.html) for this:


<html>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
   </body>
</html>

The following code lists all the <a> elements:

Example


from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

result = soup.find_all("a")
print(result)

Output


[
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
]
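
In practice the matched tags are usually processed further instead of being printed as a list. As a rough sketch, the code below reads the link text and the href attribute from each result, and also uses the limit argument of find_all() to stop after the first match. It assumes a shortened two-link version of the page held in a string; html_doc and pairs are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened copy of the page with two of the links
html_doc = """
<body>
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each item in the result is a Tag; attributes are read like dictionary keys
links = soup.find_all("a")
pairs = [(a.get_text(), a["href"]) for a in links]
for text, href in pairs:
    print(text, href)

# limit stops the search after the given number of matches
first_only = soup.find_all("a", limit=1)
print(first_only[0]["id"])          # link1

# A name with no matches gives an empty result, not None
no_match = soup.find_all("table")
print(len(no_match))                # 0
```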
