Beautiful Soup – Find all Comments




Inserting comments in code is considered good programming practice. Comments help readers understand the logic of a program, and they also serve as documentation. You can put comments in an HTML or XML document, just as in a program written in C, Java, Python, etc. The Beautiful Soup API can be used to identify all the comments in an HTML document.

In HTML and XML, comment text is written between the <!-- and --> delimiters.


<!-- Comment Text -->

The BeautifulSoup package, whose import name is bs4, defines a Comment class. Comment is a special type of NavigableString. Hence, when the only content of a Tag is text written between <!-- and -->, the string property of that Tag returns a Comment object.

Example


from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment, type(comment))

Output


This is a comment text in HTML <class 'bs4.element.Comment'>
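The relationship between Comment and NavigableString described above can be checked directly. The following is a minimal sketch that reuses the same markup; the variable names is_subclass, is_comment and is_navigable are chosen here for illustration only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

# Comment is a subclass of NavigableString, so a comment node
# passes an isinstance() check against both classes.
is_subclass = issubclass(Comment, NavigableString)
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)

print(is_subclass, is_comment, is_navigable)
```

Because every Comment is also a NavigableString, a check against Comment is needed whenever comments must be told apart from ordinary text.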

To search for all occurrences of comments in an HTML document, we use the find_all() method. Without any arguments, find_all() returns all the tags in the parsed HTML document. Its string keyword argument filters on text nodes instead, and it accepts a function. We pass the iscomment() function itself (not its return value); find_all() calls it for each string and keeps the strings for which it returns True.


comments = soup.find_all(string=iscomment)

The iscomment() function checks whether a text node is a Comment object, using the built-in isinstance() function. The Comment class must be imported from bs4.


def iscomment(elem):
   return isinstance(elem, Comment)
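The same check can also be written inline as a lambda instead of a named function. The sketch below assumes a small inline markup string (not the index.html file used in this chapter) so that it runs on its own.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup used only for this illustration
markup = "<div><!-- first --><p>Text</p><!-- second --></div>"
soup = BeautifulSoup(markup, "html.parser")

# The lambda is called for each string in the document;
# only strings that are Comment objects are kept.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

print(comments)
```

A named function such as iscomment() is easier to reuse across several searches, while the lambda keeps a one-off search on a single line.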

The comments variable stores all the comments found in the given HTML document. We shall use the following index.html file in the example code:


<html>
   <head>
      <!-- Title of document -->
      <title>TutorialsPoint</title>
   </head>
   <body>
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <!-- first inner list -->
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul id="HR">
         <!-- second inner list -->
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python program scrapes the above HTML document, and finds all the comments in it.

Example


from bs4 import BeautifulSoup, Comment

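# Parse the HTML file; the with block closes the file afterwards
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Return True only for text nodes that are comments
def iscomment(elem):
    return isinstance(elem, Comment)

comments = soup.find_all(string=iscomment)
print(comments)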

Output


[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']

The above output shows a list of all comments. We can also use a for loop over the collection of comments.

Example


for i, comment in enumerate(comments, start=1):
    print(i, ".", comment)

Output


1 .  Title of document 
2 .  Page heading
3 .  top level list
4 .  first inner list
5 .  second inner list
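Since each comment is a string that also remembers its position in the parse tree, the loop can report more than the raw text. The following sketch, which assumes a shortened inline copy of the document rather than the index.html file, strips the surrounding whitespace and records the name of the tag that contains each comment. The summary variable is introduced here for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, inline stand-in for index.html (an assumption for this sketch)
markup = (
    "<html><head><!-- Title of document --><title>TutorialsPoint</title></head>"
    "<body><!-- Page heading --><h2>Departmentwise Employees</h2></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# Pair each comment's enclosing tag name with its whitespace-stripped text
summary = [(comment.parent.name, comment.strip()) for comment in comments]

for number, (tag_name, text) in enumerate(summary, start=1):
    print(number, tag_name, text)
```

Here strip() is the ordinary string method, and parent is the navigation property available on every node in the tree.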

In this chapter, we learned how to extract all the comment strings in an HTML document.
