There are two types of tags in HTML. Most tags come in pairs of opening and closing counterparts. The top-level <html> tag with its corresponding closing </html> tag is the main example. Others are <body> and </body>, <p> and </p>, <h1> and </h1>, and many more. Other tags are self-closing, such as <img> and <br>. Self-closing tags don't have a text part, unlike most tags with opening and closing counterparts (such as <b>Hello</b>). In this chapter, we shall look at how to get the text inside such tags with the help of the Beautiful Soup library.
Beautiful Soup offers more than one method or property for fetching the text associated with a tag object.
Sr.No | Methods & Description |
---|---|
1 | text property: Returns all child strings of a PageElement concatenated into one string. Equivalent to calling get_text() with its default arguments. |
2 | string property: Convenience property that returns the single string inside an element, or None when there is no single string to return. |
3 | strings property: Yields the string parts from all the child objects under the current PageElement. |
4 | stripped_strings property: Same as the strings property, but with leading and trailing whitespace stripped and whitespace-only strings skipped. |
5 | get_text() method: Returns all child strings of this PageElement, concatenated using a separator if specified. |
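As a minimal sketch of how these five accessors differ, the snippet below applies each of them to the same tag. The one-line HTML string and the variable names are assumptions chosen for illustration; they do not come from the chapter's own example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this comparison
snippet = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# text and get_text() both join every child string into one str
text_value = p.text
get_text_value = p.get_text()

# get_text() can also strip each piece and join with a separator
joined = p.get_text(separator="|", strip=True)

# string is None for <p> because it has two children,
# but <b> holds exactly one string
string_of_p = p.string
string_of_b = p.b.string

# strings keeps whitespace as it is; stripped_strings trims each piece
all_strings = list(p.strings)
stripped = list(p.stripped_strings)

print(repr(text_value))
print(repr(joined))
print(string_of_p, string_of_b)
print(all_strings, stripped)
```

Note that string is the only accessor here that can return None, so it is usually worth checking its result before calling string methods on it.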
Consider the following HTML document −
    <div id="outer">
       <div id="inner">
          <p>Hello<b>World</b></p>
          <img src='logo.jpg'>
       </div>
    </div>
If we retrieve the stripped_strings property of each tag in the parsed document tree, we find that the two div tags and the p tag each yield two strings, Hello and World. The <b> tag holds only the World string, while <img> doesn't have a text part.
The following example fetches the text from each of the tags in the given HTML document −
Example
    html = """
    <div id="outer">
    <div id="inner">
    <p>Hello<b>World</b></p>
    <img src='logo.jpg'>
    </div>
    </div>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')

    # find_all() with no arguments returns every tag in the document
    for tag in soup.find_all():
        print("Tag: {} attributes: {}".format(tag.name, tag.attrs))
        # stripped_strings skips the whitespace-only strings between tags
        for txt in tag.stripped_strings:
            print(txt)
        print()
Output
    Tag: div attributes: {'id': 'outer'}
    Hello
    World

    Tag: div attributes: {'id': 'inner'}
    Hello
    World

    Tag: p attributes: {}
    Hello
    World

    Tag: b attributes: {}
    World

    Tag: img attributes: {'src': 'logo.jpg'}
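The same document can also be read with get_text() and the string property. The following is a small sketch rather than a definitive recipe: the variable names are assumptions made for illustration, and the separator and strip arguments are optional parameters of get_text().

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
<img src='logo.jpg'>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
inner = soup.find("div", id="inner")

# strip=True drops the whitespace-only strings between the tags,
# and the separator is placed between the remaining pieces
inner_text = inner.get_text(separator=" ", strip=True)

# <p> has two children (a string and <b>), so string is None
p_string = soup.p.string

# <b> contains exactly one string
b_string = soup.b.string

# <img> has no children: string is None and get_text() is empty
img_string = soup.img.string
img_text = soup.img.get_text()

print(repr(inner_text))
print(p_string, b_string, img_string, repr(img_text))
```

get_text() is convenient when one joined string is wanted, while stripped_strings, as used in the example above, keeps the individual pieces separate.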