Beautiful Soup – encode Method

Method Description

The encode() method in Beautiful Soup renders a bytestring representation of the given PageElement and its contents. The prettify() method, which lets you visualize the structure of the Beautiful Soup parse tree, accepts an encoding argument. The encode() method plays the same role as that argument, but without pretty-printing the output.

Syntax

    encode(encoding, indent_level, formatter, errors)

Parameters

encoding − The destination encoding.
indent_level − Each line of the rendering will be indented this many levels. Used internally in recursive calls while pretty-printing.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − An error handling strategy.

Return Value

The encode() method returns a byte string representation of the tag and its contents.

Example 1

The encoding parameter is utf-8 by default. The following code shows the encoded byte string representation of the soup object.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello “World!”", "html.parser")
    print(soup.encode("utf-8"))

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'

Example 2

The formatter argument accepts the following predefined values −

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − Similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

In the following example, different formatter values are passed to the encode() method.

    from bs4 import BeautifulSoup

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, "html.parser")
    print("minimal: ")
    print(soup.p.encode(formatter="minimal"))
    print("html: ")
    print(soup.p.encode(formatter="html"))
    print("None: ")
    print(soup.p.encode(formatter=None))

Output

    minimal:
    b'<p>Il a dit &lt;&lt;Sacr\xc3\xa9 bleu!&gt;&gt;</p>'
    html:
    b'<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>'
    None:
    b'<p>Il a dit <<Sacr\xc3\xa9 bleu!>></p>'

Example 3

The following example uses Latin-1 as the encoding parameter.

    markup = '''
    <html>
    <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
    </head>
    <body>
    <p>Sacr`e bleu!</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, "lxml")
    print(soup.p.encode("latin-1"))

Output

    b'<p>Sacr`e bleu!</p>'
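The effect of the encoding and errors parameters is easiest to see with characters that the target encoding cannot represent. The following is a minimal sketch, not part of the original tutorial; the markup is a made-up string in which "é" exists in Latin-1 and the snowman character does not. It assumes the default error strategy is "xmlcharrefreplace", which is how Beautiful Soup 4 documents it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: \u00e9 is "é" (available in Latin-1),
# \u2603 is a snowman (not available in Latin-1).
soup = BeautifulSoup("<p>Sacr\u00e9 bleu! \u2603</p>", "html.parser")

# UTF-8 can represent every character directly.
utf8_bytes = soup.p.encode("utf-8")

# Latin-1 cannot encode the snowman; with the default error strategy
# it is written as a numeric character reference instead.
latin1_bytes = soup.p.encode("latin-1")

# A different strategy, passed through to Python's str.encode().
latin1_replaced = soup.p.encode("latin-1", errors="replace")

print(utf8_bytes)
print(latin1_bytes)
print(latin1_replaced)
```

The same tag produces three different byte strings, so the encoding chosen here should match whatever the consumer of the bytes expects.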

Beautiful Soup – previous_siblings Property

Method Description

Elements that share the same parent are called siblings. The previous_siblings property in Beautiful Soup returns a generator object used to iterate over all the tags and strings that come before the current tag under the same parent, starting with the nearest one. Its output is similar to that of the find_previous_siblings() method.

Syntax

    element.previous_siblings

Return type

The previous_siblings property returns a generator of sibling PageElements.

Example 1

The following example parses an HTML string that has a few tags embedded inside the outer <p> tag. The previous siblings of the <u> tag are fetched with the help of the previous_siblings property.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", "html.parser")
    tag1 = soup.u
    print("previous siblings:")
    for tag in tag1.previous_siblings:
        print(tag)

Output

    previous siblings:
    <i>Python</i>
    <b>Excellent</b>

Example 2

The index.html file used in the following example contains an HTML form with three input elements. We find the sibling tags under the <form> tag that come before the one whose id is marks.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
    tag = soup.find("input", {"id": "marks"})
    sibs = tag.previous_siblings
    print("previous siblings:")
    for sib in sibs:
        print(sib)

Output

    previous siblings:
    <input id="age" name="age" type="text"/>
    <input id="nm" name="name" type="text"/>

Example 3

The top level <html> tag normally has two child tags, head and body. Hence, the <body> tag has only one previous sibling tag, i.e. head, as the following code shows. Whitespace between the tags is also returned as sibling strings, which print as blank lines.

    html = '''
    <html>
    <head>
    <title>Hello</title>
    </head>
    <body>
    <p>Excellent</p><p>Python</p><p>Tutorial</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.body.previous_siblings
    print("previous siblings:")
    for tag in tags:
        print(tag)

Output

    previous siblings:
    <head>
    <title>Hello</title>
    </head>
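Because previous_siblings yields strings as well as tags, it can help to compare it with find_previous_siblings(), which returns tags only when called without arguments. The sketch below is an illustration under assumed markup (a variant of Example 1 with spaces between the inline tags); the variable names are made up for this example.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: the spaces between tags become sibling strings.
html = "<p><b>Excellent</b> <i>Python</i> <u>Tutorial</u></p>"
soup = BeautifulSoup(html, "html.parser")
u = soup.u

# Every previous sibling, nearest first: strings and tags alike.
all_siblings = list(u.previous_siblings)

# Keep only the tags by filtering on the Tag class.
tag_names = [s.name for s in u.previous_siblings if isinstance(s, Tag)]

# find_previous_siblings() with no filter already skips the strings.
found_names = [t.name for t in u.find_previous_siblings()]

print(len(all_siblings), tag_names, found_names)
```

The generator is lazy, so it suits early exit in a loop; the list returned by find_previous_siblings() is more convenient when only tags matter.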

Beautiful Soup – prettify Method

Method Description

To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own separate line with indentation, which lets you easily visualize the structure of the parse tree.

Syntax

    prettify(encoding, formatter)

Parameters

encoding − The eventual encoding of the string. If this is None, a Unicode string will be returned.
formatter − A Formatter object, or a string naming one of the standard formatters.

Return Type

The prettify() method returns a Unicode string (if encoding is None) or a bytestring (otherwise).

Example 1

Consider the following HTML string.

    <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

    html = '''
    <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    print(soup.prettify())

Output

    <html>
     <body>
      <p>
       The quick,
       <b>
        brown fox
       </b>
       jumps over a lazy dog.
      </p>
     </body>
    </html>

Example 2

You can call prettify() on any of the Tag objects in the document.

    print(soup.b.prettify())

Output

    <b>
     brown fox
    </b>

The prettify() method is for understanding the structure of the document. However, it should not be used to reformat it, because it adds whitespace (in the form of newlines) and so changes the meaning of an HTML document.

The prettify() method can optionally be given a formatter argument to specify the formatting to be used. The possible values for the formatter are as follows.

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − Similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example 3

    from bs4 import BeautifulSoup

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, "html.parser")
    print("minimal: ")
    print(soup.prettify(formatter="minimal"))
    print("html: ")
    print(soup.prettify(formatter="html"))
    print("None: ")
    print(soup.prettify(formatter=None))

Output

    minimal:
    <p>
     Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    </p>
    html:
    <p>
     Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    </p>
    None:
    <p>
     Il a dit <<Sacré bleu!>>
    </p>
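The warning that prettify() changes the meaning of a document can be checked directly. The following sketch, which is an illustration rather than part of the original tutorial, prettifies a short made-up paragraph, parses the result again, and compares the text content. It also shows the return type switching from str to bytes when an encoding is given.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, parsed with the built-in parser.
soup = BeautifulSoup("<p>The quick, <b>brown fox</b> jumps.</p>", "html.parser")

pretty = soup.prettify()                        # str, because encoding is None
pretty_bytes = soup.prettify(encoding="utf-8")  # bytes, because an encoding is given

# Parse the pretty-printed text again and compare the visible text.
reparsed = BeautifulSoup(pretty, "html.parser")
original_text = soup.get_text()
reparsed_text = reparsed.get_text()

print(repr(original_text))
print(repr(reparsed_text))
print(type(pretty).__name__, type(pretty_bytes).__name__)
```

The reparsed text carries the newlines and indentation that prettify() inserted, which is why the method is best kept for inspection and debugging.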

Beautiful Soup – unwrap Method

Method Description

The unwrap() method is the opposite of the wrap() method. It replaces a tag with whatever is inside that tag. It removes the tag from an element and returns it.

Syntax

    unwrap()

Parameters

The method doesn't require any parameter.

Return Type

The unwrap() method returns the tag that has been removed.

Example 1

In the following example, the <b> tag is removed from the HTML string.

    html = '''
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find("b")
    newtag = tag1.unwrap()
    print(soup)

Output

    <p>The quick, brown fox jumps over a lazy dog.</p>

Example 2

The code below prints the value returned by the unwrap() method. The returned tag is empty because its contents stay in the document.

    html = '''
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find("b")
    newtag = tag1.unwrap()
    print(newtag)

Output

    <b></b>

Example 3

The unwrap() method is useful for stripping out markup, as the following code shows −

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Unwrap every tag so that only the text remains.
    for tag in soup.find_all():
        tag.unwrap()
    print(soup)

Output

    The quick, brown fox jumps over a lazy dog.
    DJs flock by when MTV ax quiz prog.
    Junk MTV quiz graced by fox whelps.
    Bawds jog, flick quartz, vex nymphs.
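unwrap() can also be applied selectively, removing inline formatting while keeping the block structure. The sketch below uses made-up markup and is not taken from the tutorial. It additionally calls smooth(), which is assumed to be available (it was added in Beautiful Soup 4.8.0), to merge the adjacent strings that unwrapping leaves behind.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two inline formatting tags inside a paragraph.
soup = BeautifulSoup("<p>The <b>quick</b>, <i>brown</i> fox</p>", "html.parser")

# Unwrap only <b> and <i>; the surrounding <p> is left alone.
removed = []
for tag in soup.find_all(["b", "i"]):
    removed.append(tag.unwrap())   # unwrap() returns the now-empty tag

result = str(soup)

# The paragraph now holds several adjacent strings, so .string is None.
pieces_before = len(list(soup.p.strings))
string_before = soup.p.string

# smooth() consolidates adjacent strings into one (assumes bs4 >= 4.8.0).
soup.smooth()
string_after = soup.p.string

print(result)
print(pieces_before, string_before, repr(string_after))
```

Serialized output is the same with or without smooth(); the call only matters if later code relies on .string or on navigating between strings.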

Beautiful Soup – find_all Method

Method Description

The find_all() method in Beautiful Soup looks for the elements that match the given criteria among the descendants of this PageElement and returns a list of all matching elements.

Syntax

    soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If this is True, a recursive search will be performed. Otherwise, only the direct children will be considered.
string − A filter on text strings.
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return type

The find_all() method returns a ResultSet object, which behaves like a list.

Example 1

When we pass in a value for name, Beautiful Soup only considers tags with that name. Text strings will be ignored, as will tags whose names don't match. In this example we pass input to the find_all() method.

    from bs4 import BeautifulSoup

    html = open("index.html")
    soup = BeautifulSoup(html, "html.parser")
    obj = soup.find_all("input")
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Example 2

We shall use the following HTML script in this example −

    <html>
    <body>
       <h2>Departmentwise Employees</h2>
       <ul id="dept">
          <li>Accounts</li>
          <ul id="acc">
             <li>Anand</li>
             <li>Mahesh</li>
          </ul>
          <li>HR</li>
          <ol id="HR">
             <li>Rani</li>
             <li>Ankita</li>
          </ol>
       </ul>
    </body>
    </html>

With the string argument you can search for strings instead of tags. As with name, you can pass in a string, a regular expression, a list, a function, or the value True. In this example, a function is passed to the string argument. All the strings starting with 'A' are returned by the find_all() method.

    from bs4 import BeautifulSoup

    # html is a string holding the markup shown above.
    def startingwith(ch):
        return ch.startswith("A")

    soup = BeautifulSoup(html, "html.parser")
    lst = soup.find_all(string=startingwith)
    print(lst)

Output

    ['Accounts', 'Anand', 'Ankita']

Example 3

In this example, we pass the limit=2 argument to the find_all() method. The method returns the first two occurrences of the <li> tag.

    soup = BeautifulSoup(html, "html.parser")
    lst = soup.find_all("li", limit=2)
    print(lst)

Output

    [<li>Accounts</li>, <li>Anand</li>]
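The remaining filter types mentioned above (a list, a regular expression, a keyword attribute filter) and the recursive flag can be tried on the same department markup. This is a minimal sketch using a trimmed copy of that markup; the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the department markup used in Example 2.
html = """
<ul id="dept">
   <li>Accounts</li>
   <ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
   <li>HR</li>
   <ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A list as the name filter matches any of the listed tag names.
list_ids = [t["id"] for t in soup.find_all(["ul", "ol"])]

# A keyword argument filters on an attribute value.
acc_names = [t.name for t in soup.find_all(id="acc")]

# A compiled regular expression as the string filter.
a_strings = [str(s) for s in soup.find_all(string=re.compile("^A"))]

# recursive=False looks only at the direct children of the starting tag.
top_items = [li.get_text() for li in soup.find(id="dept").find_all("li", recursive=False)]

print(list_ids, acc_names, a_strings, top_items)
```

Note that html.parser keeps the nested lists exactly as written, which is why the direct-children search returns only the two department names.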

Beautiful Soup – NavigableString Method

Method Description

NavigableString() in the bs4 package is the constructor of the NavigableString class. A NavigableString represents a piece of text inside a tag in a parsed document. The constructor casts a regular Python string to a NavigableString. Conversely, the built-in str() function converts a NavigableString object to a plain Unicode string.

Syntax

    NavigableString(string)

Parameters

string − An object of Python's str class.

Return Value

NavigableString() returns a NavigableString object.

Example 1

In the code below, the HTML string contains an empty <b> tag. We add a NavigableString object to it.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, "html.parser")
    navstr = NavigableString("Hello World")
    soup.b.append(navstr)
    print(soup)

Output

    <p><b>Hello World</b></p>

Example 2

In this example, two NavigableString objects are appended to an empty <b> tag. Since the tag now has more than one child, its string property is None, so we use the strings property instead. It is a generator of NavigableString objects.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, "html.parser")
    navstr = NavigableString("Hello")
    soup.b.append(navstr)
    navstr = NavigableString("World")
    soup.b.append(navstr)
    for s in soup.b.strings:
        print(s, type(s))

Output

    Hello <class 'bs4.element.NavigableString'>
    World <class 'bs4.element.NavigableString'>

Example 3

If we access the stripped_strings property of the <b> tag object instead of the strings property, we get a generator of Unicode strings, i.e. str objects.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, "html.parser")
    navstr = NavigableString("Hello")
    soup.b.append(navstr)
    navstr = NavigableString("World")
    soup.b.append(navstr)
    for s in soup.b.stripped_strings:
        print(s, type(s))

Output

    Hello <class 'str'>
    World <class 'str'>
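The relationship between NavigableString, str() and the string property can be summarized in a few lines. The sketch below is an illustration with the same empty <b> markup; it also relies on append() accepting a plain Python string and wrapping it in a NavigableString, which is how Beautiful Soup 4 behaves.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p><b></b></p>", "html.parser")

# With a single child string, .string returns that NavigableString.
soup.b.append(NavigableString("Hello"))
single = soup.b.string
knows_parent = single.parent is soup.b    # it is attached to the tree

# str() gives a plain string with no link to the parse tree.
plain = str(single)

# A plain str passed to append() is wrapped in a NavigableString.
soup.b.append("World")
last_type = type(soup.b.contents[-1])

# With two child strings, .string is None and .strings yields both.
string_now = soup.b.string
texts = [str(s) for s in soup.b.strings]

print(plain, knows_parent, last_type.__name__, string_now, texts)
```

Converting to str is useful when text is kept after the soup is discarded, because a NavigableString holds a reference to the whole parse tree.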

Beautiful Soup – find_previous Method

Method Description

The find_previous() method in Beautiful Soup looks backwards in the document from this PageElement and finds the first PageElement that matches the given criteria. It returns the first tag or NavigableString that comes before the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

    find_previous(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous() method returns a Tag or NavigableString object.

Example 1

In the example below, we find the element that comes just before the <body> tag. It happens to be the <title> element.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
    tag = soup.body
    print(tag.find_previous())

Output

    <title>TutorialsPoint</title>

Example 2

There are three input elements in the HTML document used in this example. The following code locates the input element whose name attribute is age and looks for its previous element.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
    tag = soup.find("input", {"name": "age"})
    print(tag.find_previous())

Output

    <input id="nm" name="name" type="text"/>

Example 3

The element before <title> happens to be the <head> element.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
    tag = soup.find("title")
    print(tag.find_previous())

Output

    <head>
    <title>TutorialsPoint</title>
    </head>
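The examples above depend on an index.html file that is not reproduced here. The sketch below uses a small inline document instead (made up for illustration, not the tutorial's file) to show find_previous() with no filter, with a name filter, and with a string filter.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for index.html, written without whitespace between tags.
html = (
    '<html><head><title>Demo</title></head>'
    '<body><form><input id="nm"/><input id="age"/></form></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
age = soup.find("input", id="age")

# No filter: the nearest tag that was opened earlier in the document.
nearest_id = age.find_previous().get("id")

# Name filter: skip backwards until a <title> tag is found.
title_text = age.find_previous("title").get_text()

# string=True: the nearest preceding text node instead of a tag.
nearest_string = str(age.find_previous(string=True))

# Looking back from <body>, the first tag encountered is <title>.
before_body = soup.body.find_previous().name

print(nearest_id, title_text, nearest_string, before_body)
```

Unlike previous_siblings, this search is not limited to the same parent: it walks back through the whole document in parse order.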

Beautiful Soup – decompose Method

Method Description

The decompose() method destroys the current element along with its children. The element is removed from the tree, and it and everything beneath it is wiped out. You can check whether an element has been decomposed with the `decomposed` property. It returns True if the element has been destroyed, False otherwise.

Syntax

    decompose()

Parameters

No parameters are defined for this method.

Return Type

The method doesn't return any object.

Example 1

When we call the decompose() method on the BeautifulSoup object itself, the entire content is destroyed.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    soup.decompose()
    print("decomposed:", soup.decomposed)
    print("document:", soup)

Output

    decomposed: True
    document: Traceback (most recent call last):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
    TypeError: can only concatenate str (not "NoneType") to str

Since the soup object is decomposed, the decomposed property returns True. However, printing the destroyed object raises a TypeError, as shown above.

Example 2

The code below makes use of the decompose() method to remove all occurrences of the <p> tag from the HTML string.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Destroy every <p> tag found in the document.
    for p in soup.find_all("p"):
        p.decompose()
    print("document:", soup)

Output

The rest of the HTML document, after removing all <p> tags, is printed.

    document:
    <html>
    <body>
    </body>
    </html>

Example 3

Here, we find the <body> tag in the HTML document tree and decompose the previous element, which happens to be the <title> tag. The resultant document tree omits the <title> tag.

    html = '''
    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    Hello World
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.body
    tag.find_previous().decompose()
    print("document:", soup)

Output

    document:
    <html>
    <head>
    </head>
    <body>
    Hello World
    </body>
    </html>
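A common use of decompose() is dropping unwanted elements while keeping the rest of the tree usable. The following sketch uses made-up markup and contrasts decompose() with extract(), a related Beautiful Soup method that removes an element but hands it back intact. It assumes a Beautiful Soup version that provides the decomposed property (4.9.0 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph to keep, one to discard.
soup = BeautifulSoup('<div><p>keep</p><p class="ad">drop</p></div>', "html.parser")

ad = soup.find("p", class_="ad")
ad.decompose()                      # destroyed; should not be used afterwards

after_decompose = str(soup)
ad_flag = ad.decomposed             # True for the destroyed tag
kept_flag = soup.p.decomposed       # False for a tag still in the tree

# extract() also removes the tag, but returns it still usable.
kept = soup.p.extract()
extracted_markup = str(kept)
after_extract = str(soup)

print(after_decompose, ad_flag, kept_flag)
print(extracted_markup, after_extract)
```

decompose() is the better choice when the removed content is not needed again, since the destroyed element's contents are cleared rather than kept alive by a stray reference.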

Beautiful Soup – insert_after Method

Method Description

The insert_after() method in Beautiful Soup inserts tags or strings immediately after something else in the parse tree. The inserted element becomes the immediate successor of this one. The inserted element can be a tag or a string.

Syntax

    insert_after(*args)

Parameters

args − One or more elements, each of which may be a tag or a string.

Return Value

The insert_after() method doesn't return any new object.

Example 1

The following code inserts the string "Python" after the first <b> tag.

    from bs4 import BeautifulSoup

    markup = "<p>An <b>Excellent</b> Tutorial <u>from TutorialsPoint</u></p>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b
    tag.insert_after("Python ")
    print(soup.prettify())

Output

    <p>
     An
     <b>
      Excellent
     </b>
     Python
     Tutorial
     <u>
      from TutorialsPoint
     </u>
    </p>

Example 2

You can also insert a tag after another tag. Take a look at this example.

    from bs4 import BeautifulSoup

    markup = "<p>Excellent <b>Tutorial</b> from TutorialsPoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b
    tag1 = soup.new_tag("b")
    tag1.string = "on Python "
    tag.insert_after(tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Tutorial
     </b>
     <b>
      on Python
     </b>
     from TutorialsPoint
    </p>

Example 3

Multiple tags or strings can be inserted after a given tag. Here they are inserted after the <p> tag itself, so they appear after its closing tag.

    from bs4 import BeautifulSoup

    markup = "<p>Excellent <b>Tutorials</b> from TutorialsPoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.p
    tag1 = soup.new_tag("i")
    tag1.string = "and Java"
    tag.insert_after("on Python", tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Tutorials
     </b>
     from TutorialsPoint
    </p>
    on Python
    <i>
     and Java
    </i>
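prettify() output hides the exact spacing of inserted elements, so it can be clearer to inspect the compact markup. The sketch below, with made-up markup, inserts a separating space and a new tag in a single call and confirms that the arguments land in the order they were given.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>An <b>Excellent</b> Tutorial</p>", "html.parser")

# Build the tag to insert; new_tag() creates it detached from the tree.
new_tag = soup.new_tag("i")
new_tag.string = "Python"

# Arguments are inserted in order, directly after <b>.
soup.b.insert_after(" ", new_tag)

result = str(soup)
next_tag_name = soup.b.find_next_sibling().name   # first sibling tag after <b>

print(result)
print(next_tag_name)
```

Passing the space as its own argument keeps the words from running together, since Beautiful Soup inserts exactly what it is given.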

Beautiful Soup – find_parent Method

Method Description

The find_parent() method in the BeautifulSoup package finds the closest parent of this PageElement that matches the given criteria.

Syntax

    find_parent(name, attrs, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_parent() method returns a Tag object, or None if no parent matches.

Example 1

We shall use the following HTML script in this example −

    <html>
    <body>
       <h2>Departmentwise Employees</h2>
       <ul id="dept">
          <li>Accounts</li>
          <ul id="acc">
             <li>Anand</li>
             <li>Mahesh</li>
          </ul>
          <li>HR</li>
          <ol id="HR">
             <li>Rani</li>
             <li>Ankita</li>
          </ol>
       </ul>
    </body>
    </html>

In the following example, we find the name of the tag that is the parent of the string 'HR'.

    from bs4 import BeautifulSoup

    # html is a string holding the markup shown above.
    soup = BeautifulSoup(html, "html.parser")
    obj = soup.find(string="HR")
    print(obj.find_parent().name)

Output

    li

Example 2

The <body> tag is always enclosed within the top level <html> tag. In the following example, we confirm this fact with the find_parent() method −

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    obj = soup.find("body")
    print(obj.find_parent().name)

Output

    html
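The name filter is what distinguishes find_parent() from simply reading the parent attribute. The following sketch, an illustration on a trimmed copy of the department markup, looks for the nearest enclosing <ul>, lists every enclosing <ul> with the related find_parents() method, and shows the result when nothing matches.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the department markup used in Example 1.
html = """
<ul id="dept">
   <li>Accounts</li>
   <ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
   <li>HR</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
anand = soup.find(string="Anand")

# No filter: the immediate parent of the string.
direct_parent = anand.find_parent().name

# Name filter: the closest enclosing <ul>.
closest_ul = anand.find_parent("ul")["id"]

# find_parents() returns every matching ancestor, nearest first.
all_uls = [ul["id"] for ul in anand.find_parents("ul")]

# No matching ancestor gives None.
missing = anand.find_parent("table")

print(direct_parent, closest_ul, all_uls, missing)
```

Checking for None before using the result avoids an AttributeError when the element sits outside the expected container.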