Beautiful Soup – Remove Child Elements

An HTML document is a hierarchical arrangement of different tags, where a tag may have one or more tags nested in it at more than one level. How do we remove the child elements of a certain tag? With Beautiful Soup, it is very easy to do.

There are two main methods in the Beautiful Soup library for removing a certain tag: decompose() and extract(). The difference is that extract() returns the tag that was removed, whereas decompose() simply destroys it. Hence, to remove the child elements, call the findChildren() method on a given Tag object, and then call extract() or decompose() on each of them.

Consider the following code segment −

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print (soup)

This will destroy the entire soup object itself, which is the parsed tree of the document. Obviously, we would not like to do that.

Now consider the following code −

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()

for tag in tags:
   for t in tag.findChildren():
      t.extract()

In the document tree, <html> is the first tag, and all other tags are its children. Hence, it will remove all the tags except <html> and </html> in the first iteration of the loop itself.

This technique is more useful when we want to remove the children of a specific tag. For example, you may want to remove the header row of an HTML table. The following HTML script has a table whose first <tr> element contains headers marked by the <th> tag.

<html>
<body>
<h2>Beautiful Soup – Remove Child Elements</h2>
<table border="1">
<tr class="header">
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

We can use the following Python code to remove all the child elements of the <tr> tag that holds the <th> cells.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all("tr", {"class": "header"})

for tag in tags:
   for t in tag.findChildren():
      t.extract()

print (soup)

Output

<html>
<body>
<h2>Beautiful Soup – Remove Child Elements</h2>
<table border="1">
<tr class="header">
</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

It can be seen that the <th> elements have been removed from the parsed tree.
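To make the difference between extract() and decompose() concrete, here is a minimal, self-contained sketch (the markup string is made up for illustration): extract() hands back the element it removed, while decompose() destroys the element and returns None.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

removed = soup.b.extract()    # extract() returns the element it removed
print (removed)               # <b>one</b>

result = soup.i.decompose()   # decompose() destroys the element and returns None
print (result)                # None
print (soup)                  # <p></p>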
Beautiful Soup – Search by text inside a Tag

Beautiful Soup provides different means to search for a certain text in the given HTML document. Here, we use the string argument of the find() method for the purpose.

In the following example, we use the find() method to search for the word "by".

Example

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""

from bs4 import BeautifulSoup

def search(tag):
   if "by" in tag.text:
      return True

soup = BeautifulSoup(html, "html.parser")

tag = soup.find("p", string=search)
print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>

You can find all occurrences of the word with the find_all() method −

tag = soup.find_all("p", string=search)
print (tag)

Output

[<p> DJs flock by when MTV ax quiz prog.</p>, <p> Junk MTV quiz graced by fox whelps.</p>]

There may be a situation where the required text is somewhere in a child tag deep inside the document tree. We need to first locate a tag which has no further elements and then check whether the required text is in it.

Example

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all(lambda tag: len(tag.find_all()) == 0 and "by" in tag.text)

for tag in tags:
   print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
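Instead of a function, the string argument also accepts a compiled regular expression. In that case find_all() matches the text nodes themselves and returns NavigableString objects rather than tags; calling .parent on a result gives the enclosing tag. A short sketch using the same sample markup −

import re
from bs4 import BeautifulSoup

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# string=re.compile(...) matches text nodes, so the results are NavigableStrings;
# .parent gives the <p> tag that contains each matching text
for text in soup.find_all(string=re.compile("by")):
   print (text.parent)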
Beautiful Soup – Specifying the Parser

An HTML document tree is parsed into an object of the BeautifulSoup class. The constructor of this class takes one mandatory argument: the HTML string, or a file object pointing to the HTML file. All other arguments are optional, the most important being features.

BeautifulSoup(markup, features)

Here markup is an HTML string or file object. The features parameter specifies the parser to be used. It may be a specific parser such as "lxml", "lxml-xml", "html.parser", or "html5lib"; or the type of markup to be parsed ("html", "html5", "xml").

If the features argument is not given, BeautifulSoup chooses the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

You can specify one of the following −

The type of markup you want to parse. The values currently supported are "html", "xml", and "html5".

The name of the parser library to be used. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

To install the lxml or html5lib parser, use the command −

pip3 install lxml
pip3 install html5lib

These parsers have their advantages and disadvantages as shown below −

Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages − Batteries included, decent speed, lenient (as of Python 3.2)
Disadvantages − Not as fast as lxml, less lenient than html5lib

Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages − Very fast, lenient
Disadvantages − External C dependency

Parser: lxml's XML parser
Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages − Very fast, the only currently supported XML parser
Disadvantages − External C dependency

Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages − Extremely lenient, parses pages the same way a web browser does, creates valid HTML5
Disadvantages − Very slow, external Python dependency

Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "html.parser")
print (soup)

Output

<a><b></b></a>

An empty <b /> tag is not valid HTML. Hence the parser turns it into a <b></b> tag pair.

The same document is now parsed as XML. Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "xml")
print (soup)

Output

<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>

In the case of a perfectly formed HTML document, all HTML parsers produce a similar parse tree, though one parser will be faster than another. However, if the HTML document is not perfect, different parsers will give different results. See how the results differ when "<a></p>" is parsed with different parsers −

lxml parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "lxml")
print (soup)

Output

<html><body><a></a></body></html>

Note that the dangling </p> tag is simply ignored.

html5lib parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html5lib")
print (soup)

Output

<html><head></head><body><a><p></p></a></body></html>

html5lib pairs the dangling </p> with an opening <p> tag. This parser also adds an empty <head> tag to the document.
Built-in html.parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html.parser")
print (soup)

Output

<a></a>

This parser also ignores the closing </p> tag. But it makes no attempt to create a well-formed HTML document by adding a <body> tag; it doesn't even bother to add an <html> tag.

The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way.
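To see the differences side by side, the comparison can be wrapped in a small loop. This is only a sketch, and it assumes that lxml and html5lib have been installed with the pip commands shown above; for a parser that is not installed, BeautifulSoup raises a FeatureNotFound error, which the loop catches.

from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"

for parser in ("html.parser", "lxml", "html5lib"):
   try:
      soup = BeautifulSoup(markup, parser)
      print (parser, "->", soup)
   except FeatureNotFound:
      # the corresponding parser library is not installed
      print (parser, "-> not installed")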
Beautiful Soup – Remove HTML Tags

In this chapter, let us see how we can remove all tags from an HTML document. HTML is a markup language made up of predefined tags. A tag marks a certain text associated with it so that the browser renders it as per its predefined meaning. For example, the word Hello marked with the <b> tag, as in <b>Hello</b>, is rendered in bold face by the browser.

If we want to filter out the raw text between different tags in an HTML document, we can use either of two methods in the Beautiful Soup library – get_text() or extract().

The get_text() method collects all the raw text parts from the document and returns a string. However, the original document tree is not changed. In the example below, the get_text() method removes all the HTML tags.

Example

html = """
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print (text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Note that the soup object in the above example still contains the parsed tree of the HTML document.

Another approach is to collect the string enclosed in a Tag object before extracting it from the soup object. In HTML, some tags don't have a string property (we can say that tag.string is None for tags such as <html> or <body>). So, we concatenate the strings from all other tags to obtain the plain text out of the HTML document. The following program demonstrates this approach.

Example

html = """
<html>
<body>
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all()
string = ""

for tag in tags:
   #print (tag.name, tag.string)
   if tag.string != None:
      string = string + tag.string + "\n"
      tag.extract()

print ("Document text after removing tags:")
print (string)
print ("Document:")
print (soup)

Output

Document text after removing tags:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Document:
<html>
<body>




</body>
</html>

The clear() method removes the inner string of a tag object but doesn't return it. Similarly, the decompose() method destroys the tag as well as all its child elements. Hence, these methods are not suitable for retrieving the plain text from an HTML document.
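get_text() also accepts two useful arguments: a separator string placed between the text of adjacent tags, and strip, which trims whitespace from each piece of text. The sketch below (with made-up markup, shown only to illustrate the arguments) also uses the stripped_strings generator, which yields each text fragment individually.

from bs4 import BeautifulSoup

html = "<p>The quick, <b>brown</b> fox</p><p>jumps over a lazy dog.</p>"
soup = BeautifulSoup(html, "html.parser")

# the separator is inserted between text fragments; strip=True trims each one
print (soup.get_text(" ", strip=True))

# stripped_strings yields the individual text fragments one by one
print (list(soup.stripped_strings))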
BeautifulSoup – Scraping Link from HTML

While scraping and analysing the content of a website, you are often required to extract all the links that a certain page contains. In this chapter, we shall find out how we can extract links from an HTML document.

HTML has the anchor tag <a> to insert a hyperlink. The href attribute of the anchor tag lets you establish the link. It uses the following syntax −

<a href="web page URL">hypertext</a>

With the find_all() method we can collect all the anchor tags in a document and then print the value of the href attribute of each of them.

In the example below, we extract all the links found on Google's home page. We use the requests library to collect the HTML contents of https://google.com, parse it into a soup object, and then collect all <a> tags. Finally, we print the href attributes.

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)

soup = BeautifulSoup(req.content, "html.parser")

tags = soup.find_all("a")
links = [tag["href"] for tag in tags]

for link in links:
   print (link)

Here's the partial output when the above program is run −

Output

https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-IN&authuser=0
https://www.google.com/url?q=https://io.google/2023/%3Futm_source%3Dgoogle-hpp%26utm_medium%3Dembedded_marketing%26utm_campaign%3Dhpp_watch_live%26utm_content%3D&source=hpp&id=19035434&ct=3&usg=AOvVaw0qzqTkP5AEv87NM-MUDd_u&sa=X&ved=0ahUKEwiPzpjku-z-AhU1qJUCHVmqDJoQ8IcBCAU

However, an HTML document may have hyperlinks of different protocol schemes, such as the mailto: scheme for a link to an email ID, the tel: scheme for a link to a telephone number, or a link to a local file with the file:// URL scheme.

In such a case, if we are interested in extracting only the links with the https:// scheme, we can do so as in the following example. We have an HTML document that consists of hyperlinks of different types, out of which only the ones with the https:// prefix are extracted.

Example

html = """
<p><a href="https://www.tutorialspoint.com">Web page link </a></p>
<p><a href="https://www.example.com">Web page link </a></p>
<p><a href="mailto:[email protected]">Email link</a></p>
<p><a href="tel:+4733378901">Telephone link</a></p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all("a")
links = [tag["href"] for tag in tags]

for link in links:
   if link.startswith("https"):
      print (link)

Output

https://www.tutorialspoint.com
https://www.example.com
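Notice that some of the extracted href values, such as /preferences?hl=en, are relative URLs. If you need absolute URLs, one option (a sketch, not part of the original example) is to resolve each href against the page URL with urllib.parse.urljoin().

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

# href=True keeps only anchor tags that actually carry an href attribute
for tag in soup.find_all("a", href=True):
   # urljoin() turns relative paths into absolute URLs; absolute ones pass through unchanged
   print (urljoin(url, tag["href"]))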
Beautiful Soup – Scrape Nested Tags

The arrangement of tags or elements in an HTML document is hierarchical in nature. The tags may be nested to multiple levels. For example, the <head> and <body> tags are nested inside the <html> tag. Similarly, one or more <li> tags may be inside a <ul> tag. In this chapter, we shall find out how to scrape a tag that has one or more child tags nested in it.

Let us consider the following HTML document −

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>

In this case, the two <div> tags and the <p> tag have one or more child elements nested inside them, whereas the <img> and <b> tags do not have any child tags.

The findChildren() method returns a ResultSet of all the children under a tag. So, if a tag doesn't have any children, the ResultSet will be an empty list, like []. Taking this as a cue, the following code finds the tags under each tag in the document tree and displays the list.

Example

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all():
   print ("Tag: {} attributes: {}".format(tag.name, tag.attrs))
   print ("Child tags: ", tag.findChildren())
   print()

Output

Tag: div attributes: {'id': 'outer'}
Child tags:  [<div id="inner">
<p>Hello<b>World</b></p>
<img src="logo.jpg"/>
</div>, <p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: div attributes: {'id': 'inner'}
Child tags:  [<p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: p attributes: {}
Child tags:  [<b>World</b>]

Tag: b attributes: {}
Child tags:  []

Tag: img attributes: {'src': 'logo.jpg'}
Child tags:  []
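findChildren() (like find_all()) descends through all levels of nesting by default. If you only want the direct children of a tag, one option (a small sketch, not part of the original example) is to pass recursive=False −

from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# recursive=False restricts the search to direct children only,
# so the nested <p>, <b> and <img> tags are not included
print (outer.find_all(recursive=False))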
Beautiful Soup – Parsing Tables

In addition to textual content, an HTML document may also have structured data in the form of HTML tables. With Beautiful Soup, we can extract the tabular data into Python objects such as lists or dictionaries, store it in databases or spreadsheets if required, and perform further processing. In this chapter, we shall parse an HTML table using Beautiful Soup.

Although Beautiful Soup doesn't have any special function or method for extracting table data, we can achieve it with simple scraping techniques. Just like any table, say in SQL or a spreadsheet, an HTML table consists of rows and columns.

HTML has the <table> tag to build a tabular structure. There are one or more nested <tr> tags, one for each row. Each row consists of <td> tags that hold the data in each cell of the row. The first row is usually used for the column headings, and the headings are placed in <th> tags instead of <td>.

The following HTML script renders a simple table in the browser window −

<html>
<body>
<h2>Beautiful Soup – Parse Table</h2>
<table border="1">
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class="data">
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class="data">
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

Note that the appearance of the data rows is customized with a CSS class data, in order to distinguish them from the header row.

We shall now see how to parse the table data. First, we obtain the document tree in the BeautifulSoup object. Then we collect all the column headers in a list.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser")

tbltag = soup.find("table")

headers = []
headings = tbltag.find_all("th")
for h in headings:
   headers.append(h.string)

The data row tags with the class="data" attribute following the header row are then fetched. For each row, a dictionary object with the column header as key and the corresponding cell value as value is formed and appended to a list of dict objects.

rows = tbltag.find_all_next("tr", {"class": "data"})

trows = []
for i in rows:
   row = {}
   data = i.find_all("td")
   n = 0
   for j in data:
      row[headers[n]] = j.string
      n += 1
   trows.append(row)

The list of dictionary objects is collected in trows. You can then use it for different purposes, such as storing it in a SQL table, saving it as JSON, or loading it into a pandas DataFrame.

The complete code is given below −

Example

markup = """
<html>
<body>
<p>Beautiful Soup – Parse Table</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class="data">
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class="data">
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser")

tbltag = soup.find("table")

headers = []
headings = tbltag.find_all("th")
for h in headings:
   headers.append(h.string)
print (headers)

rows = tbltag.find_all_next("tr", {"class": "data"})

trows = []
for i in rows:
   row = {}
   data = i.find_all("td")
   n = 0
   for j in data:
      row[headers[n]] = j.string
      n += 1
   trows.append(row)

print (trows)

Output

['Name', 'Age', 'Marks']
[{'Name': 'Ravi', 'Age': '23', 'Marks': '67'}, {'Name': 'Anil', 'Age': '27', 'Marks': '84'}]
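As a possible follow-up step (a sketch that assumes the pandas library is installed, which the tutorial itself does not require), the list of row dictionaries in trows can be turned into a DataFrame or serialized as JSON −

import json
import pandas as pd

# the same list of dicts produced by the table-parsing code above
trows = [
   {'Name': 'Ravi', 'Age': '23', 'Marks': '67'},
   {'Name': 'Anil', 'Age': '27', 'Marks': '84'}
]

# tabular view for further analysis
df = pd.DataFrame(trows)
print (df)

# or serialize the rows as JSON text
print (json.dumps(trows, indent=3))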
Beautiful Soup – Remove all Scripts

One of the often-used tags in HTML is the <script> tag. It facilitates embedding a client-side script, such as JavaScript code, in HTML. In this chapter, we will use BeautifulSoup to remove script tags from an HTML document.

The <script> tag has a corresponding </script> tag. In between the two, you may either put a reference to an external JavaScript file, or include JavaScript code inline in the HTML itself.

To include an external JavaScript file, the syntax used is −

<head>
   <script src="javascript.js"></script>
</head>

You can then invoke the functions defined in this file from inside the HTML.

Instead of referring to an external file, you can put JavaScript code inside the HTML between the <script> and </script> tags. If it is put inside the <head> section of the HTML document, the functionality is available throughout the document tree. On the other hand, if it is put anywhere in the <body> section, the JavaScript functions are available from that point onwards.

<body>
   <p>Hello World</p>
   <script>
      alert("Hello World")
   </script>
</body>

Removing all script tags with Beautiful Soup is easy. You have to collect the list of all script tags from the parsed tree and extract them one by one.

Example

html = """
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("script"):
   tag.extract()

print (soup)

Output

<html>
<head>

</head>
<body>
<p>Hello World</p>

</body>
</html>

You can also use the decompose() method instead of extract(), the difference being that the latter returns the thing that was removed, whereas the former just destroys it.

For a more concise code, you may also use list comprehension syntax to obtain the soup object with the script tags removed, as follows −

[tag.decompose() for tag in soup.find_all("script")]
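The same pattern extends to any set of unwanted tags. For instance, here is a short sketch (the <style> rule is made up for illustration) that strips both <script> and <style> elements in a single pass −

from bs4 import BeautifulSoup

html = """
<html>
<head>
<script src="javascript.js"></script>
<style>p {color: red;}</style>
</head>
<body>
<p>Hello World</p>
<script>alert("Hello World")</script>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a list of tag names, so both kinds are collected together
for tag in soup.find_all(["script", "style"]):
   tag.decompose()

print (soup)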
Beautiful Soup – Remove Empty Tags

In HTML, many tags have an opening and a closing tag. Such tags are mostly used for defining formatting properties, such as <b> and </b>, <h1> and </h1>, etc. There are also some self-closing tags which don't have a closing tag and carry no textual part, for example <img>, <br>, <input>, etc. However, while composing HTML, tags such as <p></p> without any text may be inadvertently inserted. We need to remove such empty tags with the help of the Beautiful Soup library functions.

Removing textual tags without any text between the opening and closing symbols is easy. You can call the extract() method on a tag if the length of its inner text is 0.

for tag in tags:
   if (len(tag.get_text(strip=True)) == 0):
      tag.extract()

However, this would also remove tags such as <hr>, <img>, <input>, etc. These are all self-closing or singleton tags. You would not like to remove tags that have one or more attributes, even if there is no text associated with them. So, you'll have to check whether a tag has any attributes in addition to checking that its text is empty.

In the following example, both situations are present in the HTML string – an empty textual tag as well as some singleton tags. The code retains the tags that have attributes but removes the ones without any text embedded in them.

Example

html = """
<html>
<body>
<p>Paragraph</p>
<embed type="image/jpg" src="Python logo.jpg" width="300" height="200">
<hr>
<b></b>
<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input type="text" id="fname" name="fname">
<img src="img_orange_flowers.jpg" alt="Flowers">
</body>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all()

for tag in tags:
   if (len(tag.get_text(strip=True)) == 0):
      if len(tag.attrs) == 0:
         tag.extract()

print (soup)

Output

<html>
<body>
<p>Paragraph</p>
<embed height="200" src="Python logo.jpg" type="image/jpg" width="300"/>
<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input id="fname" name="fname" type="text"/>
<img alt="Flowers" src="img_orange_flowers.jpg"/>
</p>
</body>
</html>

Note that the original HTML code has a <p> tag without its closing </p>. The parser automatically inserts the closing tag. The position of the closing tag may change if you change the parser to lxml or html5lib.
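Another way to protect the singleton tags is to check the is_empty_element attribute, which Beautiful Soup sets to True for void elements such as <br> and <img>. A brief sketch of this variant, with made-up markup −

from bs4 import BeautifulSoup

html = "<p></p><br><img src='logo.jpg'><b></b><i>text</i>"
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all():
   # skip void elements; they can never contain text
   if tag.is_empty_element:
      continue
   # remove tags that have neither text nor attributes
   if not tag.get_text(strip=True) and not tag.attrs:
      tag.extract()

print (soup)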
Beautiful Soup – Find all Headings

In this chapter, we shall explore how to find all heading elements in an HTML document with BeautifulSoup. HTML defines six heading styles, from H1 to H6, each with a decreasing font size. Suitable heading tags are used for different page sections, such as the main heading, section headings, topics, etc.

Let us use the find_all() method in two different ways to extract all the heading elements in an HTML document. We shall use the following HTML script (saved as index.html) in the code examples in this chapter −

<html>
<head>
<title>BeautifulSoup – Scraping Headings</title>
</head>
<body>
<h2>Scraping Headings</h2>
<b>The quick, brown fox jumps over a lazy dog.</b>
<h3>Paragraph Heading</h3>
<p>DJs flock by when MTV ax quiz prog.</p>
<h3>List heading</h3>
<ul>
<li>Junk MTV quiz graced by fox whelps.</li>
<li>Bawds jog, flick quartz, vex nymphs.</li>
</ul>
</body>
</html>

Example 1

In this approach, we collect all the tags in the parsed tree and check if the name of each tag is found in a list of all heading tags.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

headings = ["h1", "h2", "h3", "h4", "h5", "h6"]

tags = soup.find_all()
heads = [(tag.name, tag.contents[0]) for tag in tags if tag.name in headings]

print (heads)

Here, headings is a list of all the heading styles h1 to h6. If the name of a tag is any of these, the tag name and its contents are collected in a list named heads.

Output

[('h2', 'Scraping Headings'), ('h3', 'Paragraph Heading'), ('h3', 'List heading')]

Example 2

You can pass a regex expression to the find_all() method. Take a look at the following regex −

re.compile("^h[1-6]$")

This regex finds all tags whose name starts with h, has a single digit from 1 to 6 after the h, and ends right after the digit. Let us use it as an argument to the find_all() method in the code below −

from bs4 import BeautifulSoup
import re

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all(re.compile("^h[1-6]$"))
print (tags)

Output

[<h2>Scraping Headings</h2>, <h3>Paragraph Heading</h3>, <h3>List heading</h3>]
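A third option (a brief sketch using the same index.html file) is the select() method, which accepts a CSS selector group, so all six heading levels can be matched in a single call −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

# a comma-separated CSS selector matches any of the listed heading tags
tags = soup.select("h1, h2, h3, h4, h5, h6")
print (tags)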