A Tag object in Beautiful Soup has two properties that report where the tag was found in the source HTML document. They are −
sourceline − line number at which the tag is found
sourcepos − starting index of the tag within the line on which it is found
These properties are supported by html.parser, which is Python's built-in parser, and by the html5lib parser. They are not available when you use the lxml parser.
In the following example, an HTML string is parsed with html.parser, and we print the line number and position of each <p> tag in the string.
Example
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Print line number, column offset and text of every <p> tag
p_tags = soup.find_all('p')
for p in p_tags:
    print(p.sourceline, p.sourcepos, p.string)
Output
4 0 Web frameworks
9 0 GUI frameworks
For html.parser, these numbers give the position of the tag's initial less-than sign. The sourcepos value is 0 in this example because each <p> tag starts at the beginning of its line. The result is slightly different when the html5lib parser is used.
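One way to check this behaviour is to index back into the source text with the reported values. The following is a minimal sketch, assuming html.parser and an indented snippet made up for illustration; the names lines and results are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Illustrative markup with indented tags (not the example above)
html = """<html>
  <body>
    <p>Web frameworks</p>
      <p>GUI frameworks</p>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
lines = html.splitlines()

results = []
for p in soup.find_all("p"):
    # sourceline counts from 1, sourcepos is a 0-based offset within that line
    char = lines[p.sourceline - 1][p.sourcepos]
    results.append((p.sourceline, p.sourcepos, char))
    print(p.sourceline, p.sourcepos, repr(char))
```

With html.parser, the character found at each reported position should be the '<' that opens the tag, so sourcepos grows with the indentation of the line.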
Example
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')

# Print line number, column offset and text of every <li> tag
li_tags = soup.find_all('li')
for li in li_tags:
    print(li.sourceline, li.sourcepos, li.string)
Output
6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt
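When using html5lib, the sourcepos property returns the position of the final greater-than sign of the opening tag. Each <li> tag here starts at the beginning of its line, so that sign is at index 3.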
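Because not every tag carries position information, code that reports locations may want to guard against missing values. The following is a minimal sketch; format_location is a hypothetical helper, not part of Beautiful Soup, and it assumes that the two properties evaluate to None when no position was recorded (for example, on a tag created with new_tag rather than parsed from markup).

```python
from bs4 import BeautifulSoup


def format_location(tag):
    """Hypothetical helper: describe where a tag was found, if known."""
    # Assumption: the values are None (or absent) when no position was stored
    line = getattr(tag, "sourceline", None)
    col = getattr(tag, "sourcepos", None)
    if line is None or col is None:
        return "unknown"
    return f"line {line}, column {col}"


html = "<ul>\n<li>Django</li>\n<li>Flask</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# Tags parsed by html.parser carry a line number and a column offset
locations = [format_location(li) for li in soup.find_all("li")]
print(locations)

# A tag built in code was never read from markup, so it has no position
new_li = soup.new_tag("li")
print(format_location(new_li))
```

The same guard would apply to trees built by a parser that does not record positions.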