Beautiful Soup – Encoding

All HTML or XML documents are written in some specific encoding such as ASCII or UTF-8. However, when you load that HTML/XML document into BeautifulSoup, it is converted to Unicode.

Example

from bs4 import BeautifulSoup

markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.p)
print (soup.p.string)

Output

<p>I will display £</p>
I will display £

The above behavior is because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and then convert it into Unicode. However, Unicode, Dammit does not always guess correctly. As the document is searched byte-by-byte to guess the encoding, this takes a lot of time. You can save some time and avoid mistakes, if you already know the encoding, by passing it to the BeautifulSoup constructor as from_encoding.

Below is one example where BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7 −

Example

from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.h1)
print (soup.original_encoding)

Output

<h1>νεμω</h1>
ISO-8859-7

To resolve the above issue, pass the correct encoding to BeautifulSoup using from_encoding −

Example

from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print (soup.h1)
print (soup.original_encoding)

Output

<h1>םולש</h1>
iso-8859-8

Another feature added in BeautifulSoup 4.4.0 is exclude_encodings. It can be used when you don't know the correct encoding but are sure that Unicode, Dammit is showing a wrong result.

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])

Output encoding

The output from BeautifulSoup is a UTF-8 document, irrespective of the encoding of the document given to BeautifulSoup. Below is a document in which the Polish characters are in ISO-8859-2 format.

Example

markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</BODY>
</HTML>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print (soup.prettify())

Output

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 </head>
 <body>
  ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
 </body>
</html>

In the above example, if you notice, the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8 format. If you don't want the generated output in UTF-8, you can pass the desired encoding to prettify().

print(soup.prettify("latin-1"))

Output

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  ą ć ę ł ń \xf3 ś ź ż Ą Ć Ę Ł Ń \xd3 Ś Ź Ż\n </body>\n</html>\n'

In the above example, we have encoded the complete document; however, you can also encode any particular element in the soup as if it were a Python string −

soup.p.encode("latin-1")
soup.h1.encode("latin-1")

Output

b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references.
Below is one such example −

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("utf-8"))

Output

b'<b>\xe2\x98\x83</b>'

If you try to encode the above in "latin-1" or "ascii", it will generate "&#9731;", indicating that there is no representation for that character in those encodings.

print (tag.encode("latin-1"))
print (tag.encode("ascii"))

Output

b'<b>&#9731;</b>'
b'<b>&#9731;</b>'

Unicode, Dammit

Unicode, Dammit is used mainly when the incoming document is in an unknown encoding (often a foreign language) and we want to convert it into a known format (Unicode). It can also be used on its own, without building a BeautifulSoup parse tree.
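As an illustration, here is a minimal sketch of using Unicode, Dammit directly; the class is exposed as UnicodeDammit in the bs4 package, and the byte string below is an assumed sample, not taken from the tutorial −

from bs4 import UnicodeDammit

# Assumed sample: a byte string whose encoding is not known in advance
data = "Sacré bleu!".encode("utf-8")

dammit = UnicodeDammit(data)
print(dammit.unicode_markup)       # the decoded Unicode string: Sacré bleu!
print(dammit.original_encoding)    # the encoding Unicode, Dammit guessed (typically 'utf-8' here)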

Beautiful Soup – Extract Email IDs

Extracting email addresses from a web page is an important application of a web scraping library such as BeautifulSoup. In any web page, email IDs usually appear in the href attribute of anchor <a> tags, written using the mailto URL scheme. Many times, the email address may also be present in the page content as normal text (without any hyperlink). In this chapter, we shall use the BeautifulSoup library to fetch email IDs from an HTML page with simple techniques.

A typical usage of an email ID in the href attribute is as below (the address is a placeholder for illustration) −

<a href = "mailto:test@example.com">test link</a>

In the first example, we shall consider the following HTML document for extracting the email IDs from the hyperlinks −

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
         <li><a href = "mailto:sales@example.com">Sales Enquiries</a></li>
         <li><a href = "mailto:careers@example.com">Careers</a></li>
         <li><a href = "mailto:partner@example.com">Partner with us</a></li>
      </ul>
   </body>
</html>

Here's the Python code that finds the email IDs. We collect all the <a> tags in the document and check if each tag has an href attribute. If it does, and its value starts with "mailto:", the part of the value after the first 7 characters is the email ID.

from bs4 import BeautifulSoup

fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all("a")

for tag in tags:
   if tag.has_attr("href") and tag['href'][:7] == "mailto:":
      print (tag['href'][7:])

For the given HTML document, the email IDs will be extracted as follows −

sales@example.com
careers@example.com
partner@example.com

In the second example, we assume that the email IDs appear anywhere in the text. To extract them, we use the regex searching mechanism. Regex is a character pattern; Python's re module helps in processing regex (Regular Expression) patterns. The following regex pattern is used for searching the email address −

pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"

For this exercise, we shall use the following HTML document, having email IDs in <li> tags.

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
         <li>Sales Enquiries: sales@example.com</li>
         <li>Careers: careers@example.com</li>
         <li>Partner with us: partner@example.com</li>
      </ul>
   </body>
</html>

Using the email regex, we'll find the appearance of the pattern in each <li> tag string. Here is the Python code −

Example

from bs4 import BeautifulSoup
import re

def isemail(s):
   pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"
   grp = re.findall(pat, s)
   return grp

fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all('li')

for tag in tags:
   emails = isemail(tag.string)
   if emails:
      print (emails)

Output

['sales@example.com']
['careers@example.com']
['partner@example.com']

Using the simple techniques described above, we can use BeautifulSoup to extract email IDs from web pages.
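If the page structure is irregular, another option is to run the same regex over the plain text of the whole document obtained with get_text(). Below is a minimal sketch of this idea, assuming the same contact.html file as above −

from bs4 import BeautifulSoup
import re

# Assumed to be the same contact.html used in the examples above
with open("contact.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

pat = r"[\w.+-]+@[\w-]+\.[\w.-]+"

# Search the plain text of the entire document in one pass
emails = re.findall(pat, soup.get_text())
print(emails)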

Beautiful Soup – Pretty Printing

To display the entire parsed tree of an HTML document or the contents of a specific tag, you can use the print() function or call the str() function.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello World</h1>", "lxml")
print ("Tree:", soup)
print ("h1 tag:", str(soup.h1))

Output

Tree: <html><body><h1>Hello World</h1></body></html>
h1 tag: <h1>Hello World</h1>

The str() function returns a string encoded in UTF-8. To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own line with indentation, allowing you to easily visualize the structure of the parse tree.

Consider the following HTML string.

<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

html = """
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print (soup.prettify())

Output

<html>
 <body>
  <p>
   The quick,
   <b>
    brown fox
   </b>
   jumps over a lazy dog.
  </p>
 </body>
</html>

You can call prettify() on any of the Tag objects in the document.

print (soup.b.prettify())

Output

<b>
 brown fox
</b>

The prettify() method is meant for understanding the structure of the document. However, it should not be used to reformat it, as it adds whitespace (in the form of newlines), which can change the meaning of an HTML document. The prettify() method can optionally be given a formatter argument to specify the formatting to be used.
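As a brief illustration of the formatter argument, here is a small sketch (the markup is an assumed sample). With the default formatter ("minimal"), Unicode characters are kept as they are; with formatter="html", Beautiful Soup converts them to HTML entities where possible −

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I will display &pound; and &copy;</p>", "html.parser")

# Default formatter keeps the characters as Unicode: £ and ©
print(soup.prettify())

# formatter="html" converts characters to HTML entities: &pound; and &copy;
print(soup.prettify(formatter="html"))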

Beautiful Soup – Get Tag Position

The Tag object in Beautiful Soup possesses two useful properties that give information about its position in the HTML document. They are −

sourceline − the line number at which the tag is found.
sourcepos − the starting index of the tag in the line in which it is found.

These properties are supported by html.parser, which is Python's in-built parser, and by the html5lib parser. They are not available when you are using the lxml parser.

In the following example, an HTML string is parsed with html.parser and we find the line number and position of the <p> tags in the HTML string.

Example

html = """
<html>
<body>
   <p>Web frameworks</p>
   <ul>
      <li>Django</li>
      <li>Flask</li>
   </ul>
   <p>GUI frameworks</p>
   <ol>
      <li>Tkinter</li>
      <li>PyQt</li>
   </ol>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')
for p in p_tags:
   print (p.sourceline, p.sourcepos, p.string)

Output

4 0 Web frameworks
9 0 GUI frameworks

For html.parser, these numbers represent the position of the initial less-than sign, which is 0 in this example. It is slightly different when the html5lib parser is used.

Example

html = """
<html>
<body>
   <p>Web frameworks</p>
   <ul>
      <li>Django</li>
      <li>Flask</li>
   </ul>
   <p>GUI frameworks</p>
   <ol>
      <li>Tkinter</li>
      <li>PyQt</li>
   </ol>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')

li_tags = soup.find_all('li')
for l in li_tags:
   print (l.sourceline, l.sourcepos, l.string)

Output

6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt

When using html5lib, the sourcepos property returns the position of the final greater-than sign.
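For comparison, here is a small sketch showing what happens when the same kind of markup is parsed with the lxml parser (assuming lxml is installed); since that parser does not expose position information, the attributes are expected to be None −

from bs4 import BeautifulSoup

html = "<html><body><p>Web frameworks</p></body></html>"

soup = BeautifulSoup(html, "lxml")
p = soup.find("p")

# With the lxml parser these attributes are not populated
print(p.sourceline, p.sourcepos)   # expected: None None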

Beautiful Soup – Copying Objects

To create a copy of any tag or NavigableString, use the copy() function from the copy module in Python's standard library.

Example

from bs4 import BeautifulSoup
import copy

markup = "<p>Learn <b>Python, Java</b>, <i>advanced Python and advanced Java</i>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")

i1 = soup.find('i')
icopy = copy.copy(i1)
print (icopy)

Output

<i>advanced Python and advanced Java</i>

Although the two objects (the original and the copy) contain the same markup, they do not represent the same object.

print (i1 == icopy)
print (i1 is icopy)

Output

True
False

The copied object is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it.

print (icopy.parent)

Output

None
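Because the copy is detached from the tree, changes made to it do not affect the original soup. Below is a minimal sketch of this, reusing the same markup as above −

from bs4 import BeautifulSoup
import copy

markup = "<p>Learn <b>Python, Java</b>, <i>advanced Python and advanced Java</i>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")

i1 = soup.find("i")
icopy = copy.copy(i1)

# Modifying the detached copy leaves the original tree untouched
icopy.string = "modified text"

print(icopy)     # <i>modified text</i>
print(soup.i)    # <i>advanced Python and advanced Java</i>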

Beautiful Soup – Selecting nth Child

HTML is characterized by the hierarchical order of tags. For example, the <html> tag encloses the <body> tag, inside which there may be a <div> tag, which in turn may have <ul> and <li> elements nested in it.

The findChildren() method returns a ResultSet (a list) of child tags, while the .children property returns an iterator over the children directly under an element. By traversing this list, you can obtain the child located at a desired position − the nth child.

The code below uses the children property of a <div> tag in the HTML document. Since the return type of the children property is an iterator, we retrieve a Python list from it. We also need to remove the line-break strings from the iterator. Once done, we can fetch the desired child. Here the child element with index 1 of the <div> tag is displayed.

Example

from bs4 import BeautifulSoup

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

children = tag.children
childlist = [child for child in children if child not in ['\n', ' ']]
print (childlist[1])

Output

<p>Python</p>

To use the findChildren() method instead of the children property, change the statement to

children = tag.findChildren()

There will be no change in the output.

A more efficient approach toward locating the nth child is the select() method. The select() method uses CSS selectors to obtain the required PageElements from the current element. The Soup and Tag objects support CSS selectors through their .css property, which is an interface to the CSS selector API. The selector implementation is handled by the Soup Sieve package, which gets installed along with the bs4 package.

The Soup Sieve package defines different types of CSS selectors, namely simple, compound and complex CSS selectors that are made up of one or more type selectors, ID selectors, or class selectors. These selectors are defined in the CSS language. There are pseudo-class selectors in Soup Sieve as well. A CSS pseudo-class is a keyword added to a selector that specifies a special state of the selected element(s).

We shall use the :nth-child pseudo-class selector in this example. Since we need to select the child of the <div> tag at the 2nd position, we pass :nth-child(2) to the select_one() method.

Example

from bs4 import BeautifulSoup

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

child = tag.select_one(':nth-child(2)')
print (child)

Output

<p>Python</p>

We get the same result as with the findChildren() method. Note that the child numbering starts with 1 and not 0, as in the case of a Python list.
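CSS pseudo-class selectors can also pick an element by its position among siblings of the same type. Below is a small sketch using the :nth-of-type selector with the same markup; the selector string is an illustrative choice −

from bs4 import BeautifulSoup

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# Select the third <p> among the sibling <p> elements
child = soup.select_one("p:nth-of-type(3)")
print(child)   # <p>C++</p>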

Beautiful Soup – Search by text inside a Tag

Beautiful Soup provides different means to search for a certain text in the given HTML document. Here, we use the string argument of the find() method for the purpose. When a function is passed to the string argument, it is called with the string of each tag, and the tag matches if the function returns True.

In the following example, we use the find() method to search for the word "by".

Example

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""
from bs4 import BeautifulSoup

def search(s):
   # s is the tag's string (a NavigableString); it may be None for tags with multiple children
   return s is not None and 'by' in s

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('p', string=search)
print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>

You can find all occurrences of the word with the find_all() method −

tag = soup.find_all('p', string=search)
print (tag)

Output

[<p> DJs flock by when MTV ax quiz prog.</p>, <p> Junk MTV quiz graced by fox whelps.</p>]

There may be a situation where the required text is somewhere in a child tag deep inside the document tree. We need to first locate a tag which has no further elements and then check whether the required text is in it.

Example

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all(lambda tag: len(tag.find_all()) == 0 and "by" in tag.text)

for tag in tags:
   print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
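Alternatively, a compiled regular expression can be passed to the string argument. In that case Beautiful Soup returns the matching strings themselves rather than the enclosing tags. A minimal sketch using the same markup −

from bs4 import BeautifulSoup
import re

html = """
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(string=...) returns the matching NavigableString objects
results = soup.find_all(string=re.compile("by"))
for s in results:
    print(s.strip())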

Beautiful Soup – Specifying the Parser

An HTML document tree is parsed into an object of the BeautifulSoup class. The constructor of this class needs one mandatory argument: the HTML string or a file object pointing to the HTML file. All other arguments are optional, the most important being features.

BeautifulSoup(markup, features)

Here markup is an HTML string or file object. The features parameter specifies the parser to be used. It may be the name of a specific parser such as "lxml", "lxml-xml", "html.parser", or "html5lib", or the type of markup to be parsed ("html", "html5", "xml"). If the features argument is not given, BeautifulSoup chooses the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as the best, then html5lib's, then Python's built-in parser.

You can specify one of the following −

The type of markup you want to parse. The types Beautiful Soup currently supports are "html", "xml", and "html5".
The name of the parser library to be used. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

To install the lxml or html5lib parser, use the commands −

pip3 install lxml
pip3 install html5lib

These parsers have their advantages and disadvantages as shown below −

Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages − Batteries included; decent speed; lenient (as of Python 3.2)
Disadvantages − Not as fast as lxml; less lenient than html5lib

Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages − Very fast; lenient
Disadvantages − External C dependency

Parser: lxml's XML parser
Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages − Very fast; the only currently supported XML parser
Disadvantages − External C dependency

Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages − Extremely lenient; parses pages the same way a web browser does; creates valid HTML5
Disadvantages − Very slow; external Python dependency

Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML −

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a><b /></a>", "html.parser")
print (soup)

Output

<a><b></b></a>

An empty <b /> tag is not valid HTML. Hence the parser turns it into a <b></b> tag pair.

The same document is now parsed as XML. Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a><b /></a>", "xml")
print (soup)

Output

<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>

In the case of a perfectly-formed HTML document, all HTML parsers produce a similar parse tree, though one parser may be faster than another. However, if the HTML document is not perfect, different parsers will give different results. See how the results differ when "<a></p>" is parsed with different parsers −

lxml parser

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "lxml")
print (soup)

Output

<html><body><a></a></body></html>

Note that the dangling </p> tag is simply ignored.

html5lib parser

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html5lib")
print (soup)

Output

<html><head></head><body><a><p></p></a></body></html>

html5lib pairs the dangling </p> with an opening <p> tag. This parser also adds an empty <head> tag to the document.
Built-in html.parser

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a></p>", "html.parser")
print (soup)

Output

<a></a>

This parser also ignores the closing </p> tag. However, it makes no attempt to create a well-formed HTML document by adding a <body> tag; it doesn't even bother to add an <html> tag. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way.
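If you are unsure how the installed parsers will treat a troublesome document, bs4 ships a small diagnostic helper that parses the given markup with every available parser and prints the results along with version information. A minimal sketch −

from bs4.diagnose import diagnose

# Prints what each installed parser makes of the markup,
# which helps in choosing a parser for imperfect documents
diagnose("<a></p>")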

Beautiful Soup – Remove Child Elements

An HTML document is a hierarchical arrangement of different tags, where a tag may have one or more tags nested in it at more than one level. How do we remove the child elements of a certain tag? With BeautifulSoup, it is very easy to do.

There are two main methods in the BeautifulSoup library to remove a certain tag: the decompose() method and the extract() method, the difference being that the latter returns the thing that was removed, whereas the former just destroys it. Hence, to remove the child elements, call the findChildren() method for a given Tag object, and then extract() or decompose() each of them.

Consider the following code segment −

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print (soup)

This will destroy the entire soup object itself, which is the parsed tree of the document. Obviously, we would not like to do that. Now the following code −

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()

for tag in tags:
   for t in tag.findChildren():
      t.extract()

In the document tree, <html> is the first tag, and all other tags are its children; hence it will remove all the tags except <html> and </html> in the first iteration of the loop itself.

A more effective use of this is when we want to remove the children of a specific tag. For example, you may want to remove the header row of an HTML table. The following HTML script has a table whose first <tr> element contains headers marked by the <th> tag.

<html>
<body>
   <h2>Beautiful Soup - Remove Child Elements</h2>
   <table border="1">
      <tr class="header">
         <th>Name</th>
         <th>Age</th>
         <th>Marks</th>
      </tr>
      <tr>
         <td>Ravi</td>
         <td>23</td>
         <td>67</td>
      </tr>
      <tr>
         <td>Anil</td>
         <td>27</td>
         <td>84</td>
      </tr>
   </table>
</body>
</html>

We can use the following Python code to remove all the child elements of the <tr> tag with <th> cells.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all('tr', {'class': 'header'})

for tag in tags:
   for t in tag.findChildren():
      t.extract()

print (soup)

Output

<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">
</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

It can be seen that the <th> elements have been removed from the parsed tree.
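If the intention is to drop the header row altogether, and not just its <th> children, calling decompose() on the <tr> tag itself is enough. Below is a minimal sketch that assumes the same index.html file as above −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

# Remove the entire header row, including the <tr> element itself
header_row = soup.find("tr", {"class": "header"})
if header_row:
    header_row.decompose()

print(soup)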

Beautiful Soup – Remove HTML Tags

In this chapter, let us see how we can remove all tags from an HTML document. HTML is a markup language made up of predefined tags. A tag marks the text associated with it so that the browser renders it as per its predefined meaning. For example, the word Hello marked with the <b> tag (<b>Hello</b>) is rendered in bold face by the browser.

If we want to filter out the raw text between the different tags in an HTML document, we can use either of two methods in the Beautiful Soup library − get_text() or extract().

The get_text() method collects all the raw text parts from the document and returns a string. However, the original document tree is not changed. In the example below, the get_text() method removes all the HTML tags.

Example

html = """
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Note that the soup object in the above example still contains the parsed tree of the HTML document.

Another approach is to collect the string enclosed in each Tag object before extracting the tag from the soup object. In HTML, some tags don't have a string value (we can say that tag.string is None for tags such as <html> or <body> that enclose other tags). So, we concatenate the strings from all the other tags to obtain the plain text out of the HTML document. The following program demonstrates this approach.

Example

html = """
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()

string = ""
for tag in tags:
   if tag.string != None:
      string = string + tag.string + "\n"
   tag.extract()

print ("Document text after removing tags:")
print (string)
print ("Document:")
print (soup)

Output

Document text after removing tags:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Document:

The clear() method removes the inner string of a tag object but doesn't return it. Similarly, the decompose() method destroys the tag as well as all its children. Hence, these methods are not suitable for retrieving the plain text from an HTML document.
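get_text() also accepts separator and strip arguments, which usually give cleaner plain text than the default call. Below is a small sketch using a shortened version of the same markup −

from bs4 import BeautifulSoup

html = """
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
   </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Join each piece of text with a newline and strip surrounding whitespace
text = soup.get_text(separator="\n", strip=True)
print(text)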