Beautiful Soup – Get Tag Position




A Tag object in Beautiful Soup has two properties that report where the tag was found in the HTML source. They are −

sourceline − The line number on which the tag is found, counting from 1.

sourcepos − The index of the tag within that line, counting from 0. Exactly which character it points to depends on the parser.

These properties are supported by html.parser, which is Python's built-in parser, and by the html5lib parser. They are not available when you use the lxml parser.
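Because the position may be missing, code that reads these properties should be prepared for None. The following is a minimal sketch, not a definitive recipe: describe_position is a hypothetical helper written for this illustration, and the short html string is made up. It relies on the store_line_numbers constructor argument described in the Beautiful Soup documentation to switch position tracking off, which stands in for a parser that records no positions.

```python
from bs4 import BeautifulSoup


def describe_position(tag):
    """Hypothetical helper: format a tag's position, or fall back when none was recorded."""
    if tag.sourceline is None or tag.sourcepos is None:
        return "position unavailable"
    return f"line {tag.sourceline}, offset {tag.sourcepos}"


# Made-up sample markup for illustration only
html = "<div>\n<p>Hello</p>\n</div>"

# Default behaviour with html.parser: positions are recorded
tracked = BeautifulSoup(html, "html.parser")

# store_line_numbers=False turns tracking off, so both properties are None
untracked = BeautifulSoup(html, "html.parser", store_line_numbers=False)

with_position = describe_position(tracked.p)
without_position = describe_position(untracked.p)

print(with_position)
print(without_position)
```

With tracking on, the helper reports the second line and offset 0 for the p tag; with tracking off, it returns the fallback text instead of raising an error.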

In the following example, an HTML string is parsed with html.parser, and we print the line number and position of each <p> tag in the string.

Example


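from bs4 import BeautifulSoup

# Tags are kept flush left so that each one starts at offset 0 of its line
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

# Python's built-in parser records the position of every tag
soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')
for p in p_tags:
    # sourceline counts from 1; sourcepos counts from 0 within the line
    print(p.sourceline, p.sourcepos, p.string)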

Output


4 0 Web frameworks
9 0 GUI frameworks

With html.parser, sourcepos is the offset of the tag's opening less-than sign within its line, which is 0 for both tags in this example. The html5lib parser reports the position slightly differently.
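Before moving on to html5lib, here is a small sketch of what html.parser reports when a tag is indented or split across lines. The markup string is an assumption made up for this illustration: the first p tag has its closing bracket on the next line, and the second is indented by four spaces.

```python
from bs4 import BeautifulSoup

# Illustrative markup: first <p is split across two lines,
# second <p> is indented by four spaces on the third line
markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"

soup = BeautifulSoup(markup, "html.parser")

# Collect (line, offset, text) for every <p> tag
positions = [(p.sourceline, p.sourcepos, p.string) for p in soup.find_all("p")]

for entry in positions:
    print(entry)
```

The first tag is reported at line 1, offset 0, where its less-than sign sits, and the second at line 3, offset 4, after the four spaces of indentation.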

Example


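from bs4 import BeautifulSoup

# Tags are kept flush left so that each one starts at offset 0 of its line
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

# html5lib must be installed separately (pip install html5lib)
soup = BeautifulSoup(html, 'html5lib')

li_tags = soup.find_all('li')
for li in li_tags:
    # sourcepos here points at the '>' that ends each opening <li> tag
    print(li.sourceline, li.sourcepos, li.string)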

Output


6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt

With html5lib, sourcepos is the position of the greater-than sign that ends the opening tag. For <li> at the start of a line, that is index 3.
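To compare the two parsers on the same input, a sketch along the following lines may help. The positions function and the markup string are hypothetical names introduced for this illustration. The function returns None when the requested parser is not installed, so the script still runs without html5lib.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup: each <li> is indented by two spaces
markup = "<ul>\n  <li>Django</li>\n  <li>Flask</li>\n</ul>"


def positions(parser):
    """Return (sourceline, sourcepos) for each <li>, or None if the parser is missing."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the named parser is not installed
        return None
    return [(li.sourceline, li.sourcepos) for li in soup.find_all("li")]


for name in ("html.parser", "html5lib"):
    print(name, positions(name))
```

For html.parser, each li is reported at offset 2, the index of its less-than sign. If html5lib is installed, its offsets should be 3 higher for the same tags, since they point at the greater-than sign of the four-character <li> tag.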
