Beautiful Soup – Kinds of objects

When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the HTML page into a tree of Python objects. Below we discuss the four major kinds of objects defined in the bs4 package: Tag, NavigableString, BeautifulSoup and Comment.

Tag Object

An HTML tag is used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.

Example

    from bs4 import BeautifulSoup

    # The lxml parser wraps the fragment in <html><body>, so soup.html exists.
    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    print(type(tag))

Output

    <class 'bs4.element.Tag'>

Tags have many attributes and methods. The two most important features of a tag are its name and its attributes.

Name (tag.name)

Every tag has a name, which can be accessed through the ".name" suffix. tag.name returns the type of tag it is.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    print(tag.name)

Output

    html

If we change the tag name, the change is reflected in the HTML markup generated by Beautiful Soup.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    tag.name = "strong"
    print(tag)

Output

    <strong><body><b class="boldest">TutorialsPoint</b></body></strong>

Attributes (tag.attrs)

A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute "class" whose value is "boldest". An attribute is a name-value pair written inside the opening tag. The "attrs" property returns a dictionary of the attributes and their values. You can also access an individual attribute by using its name as a key on the tag.

In the example below, the string argument for the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by "attrs".
Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    print(tag.attrs)

Output

    {'type': 'text', 'name': 'name', 'value': 'Raju'}

We can add, remove or modify a tag's attributes using dictionary operators or methods. In the following example, the value attribute is updated. The updated HTML string shows the change.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    print(tag.attrs)
    tag['value'] = 'Ravi'
    print(soup)

Output

    {'type': 'text', 'name': 'name', 'value': 'Raju'}
    <html><body><input name="name" type="text" value="Ravi"/></body></html>

Next, we add a new id attribute and delete the value attribute.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    tag['id'] = 'nm'
    del tag['value']
    print(soup)

Output

    <html><body><input id="nm" name="name" type="text"/></body></html>

Multi-valued attributes

Some HTML attributes can have multiple values. The most commonly used is the class attribute, which can hold several CSS class names. Others include "rel", "rev", "headers", "accesskey" and "accept-charset". Beautiful Soup presents the value of a multi-valued attribute as a list.
Example

    from bs4 import BeautifulSoup

    css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
    print("css_soup.p['class']:", css_soup.p['class'])

    css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
    print("css_soup.p['class']:", css_soup.p['class'])

Output

    css_soup.p['class']: ['body']
    css_soup.p['class']: ['body', 'bold']

However, if an attribute looks like it contains more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone and returns it as a single string −

Example

    from bs4 import BeautifulSoup

    id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
    print("id_soup.p['id']:", id_soup.p['id'])
    print("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

    id_soup.p['id']: body bold
    type(id_soup.p['id']): <class 'str'>

NavigableString object

Usually, a string is placed between the opening and closing tags of an element. The browser's HTML engine applies the intended effect to the string while rendering the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags so that it is rendered in bold.

The NavigableString object represents the text content of a tag. It is an object of the bs4.element.NavigableString class. To access the text, use ".string" on the tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
    print(soup.string)
    print(type(soup.string))

Output

    Hello, Tutorialspoint!
    <class 'bs4.element.NavigableString'>

A NavigableString object is similar to a Python Unicode string. In addition, it supports some of the features used for navigating the tree and searching the tree. A NavigableString can be converted to a plain Unicode string with the str() function.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
    tag = soup.h2
    string = str(tag.string)
    print(string)

Output

    Hello, Tutorialspoint!
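The difference between a NavigableString and a plain string can be checked directly. The following is a minimal sketch, assuming the same <h2> markup as above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")
text = soup.h2.string

# NavigableString subclasses str, so ordinary string methods work on it.
is_str_subclass = isinstance(text, str)
upper = text.upper()

# Unlike a plain str, it remembers where it sits in the parse tree.
parent_name = text.parent.name

# str() yields a plain string with no link back to the tree.
plain = str(text)
plain_is_navigable = isinstance(plain, NavigableString)

print(is_str_subclass, upper, parent_name, plain_is_navigable)
```

Converting with str() is useful when the text should outlive the parse tree, since the plain string keeps no reference to the parsed document.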
Just like a Python string, which is immutable, a NavigableString can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
    tag = soup.h2
    tag.string.replace_with("OnLine Tutorials Library")
    print(tag.string)

Output

    OnLine Tutorials Library

BeautifulSoup object

The BeautifulSoup object represents the entire parsed document. It is the object created when we parse a web resource. For most purposes it can be treated like a Tag object, so it supports the functionality required to navigate and search the document tree.

Example

    from bs4 import BeautifulSoup

    # Close the file automatically once it has been parsed.
    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup)
    print(soup.name)
    print('type:', type(soup))

Output

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul>
    <li>Accounts</li>
    <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>
    </ul>
    </body>
    </html>
    [document]
    type: <class 'bs4.BeautifulSoup'>

The name property of a BeautifulSoup object always returns [document].

Two parsed documents can be combined by passing a BeautifulSoup object as an argument to a method such as replace_with().

Example

    from bs4 import BeautifulSoup

    obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
    obj2

Beautiful Soup – find_next_sibling Method

Method Description

The find_next_sibling() method in Beautiful Soup finds the closest sibling of this PageElement, at the same level of the tree, that matches the given criteria and appears later in the document. It is similar to the next_sibling property, except that it accepts filters.

Syntax

    find_next_sibling(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than a tag).
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_sibling() method returns a Tag object or a NavigableString object.

Example 1

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.find('b')
    print("next:", tag1.find_next_sibling())

Output

    next: <i>Python</i>

Example 2

If no following sibling exists, the method returns None.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.find('i')
    print("next:", tag1.find_next_sibling())

Output

    next: None
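The filters described above can be illustrated as follows. This is a minimal sketch on made-up markup (the id value "last" and the variable names are assumptions for illustration); it contrasts the next_sibling property, which returns whatever node comes next, with filtered calls to find_next_sibling().

```python
from bs4 import BeautifulSoup

html = '<p><b>Hello</b> and <i>Python</i><b id="last">again</b></p>'
soup = BeautifulSoup(html, "html.parser")
first_b = soup.find("b")

# next_sibling returns the very next node, which here is a text node.
raw_next = first_b.next_sibling

# A name filter skips every sibling that is not a <b> tag.
next_b = first_b.find_next_sibling("b")

# Keyword arguments filter on attribute values.
by_id = first_b.find_next_sibling(id="last")

# No later sibling matches, so the result is None.
missing = first_b.find_next_sibling("u")

print(repr(raw_next), next_b, by_id is next_b, missing)
```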

Beautiful Soup – prettify Method

Method Description

To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that each tag is on its own line, with indentation. This lets you easily visualize the structure of the parse tree.

Syntax

    prettify(encoding, formatter)

Parameters

encoding − The eventual encoding of the string. If this is None, a Unicode string will be returned.
formatter − A Formatter object, or a string naming one of the standard formatters.

Return Type

The prettify() method returns a Unicode string (if encoding is None) or a bytestring (otherwise).

Example 1

Consider the following HTML string.

    <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

    html = '''
    <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    print(soup.prettify())

Output

    <html>
     <body>
      <p>
       The quick,
       <b>
        brown fox
       </b>
       jumps over a lazy dog.
      </p>
     </body>
    </html>

Example 2

You can call prettify() on any of the Tag objects in the document.

    print(soup.b.prettify())

Output

    <b>
     brown fox
    </b>

The prettify() method is meant for understanding the structure of the document. It should not be used to reformat the document, because it adds whitespace (in the form of newlines), which changes the meaning of an HTML document.

The prettify() method can optionally be given a formatter argument to specify the formatting to be used. The possible values for the formatter are as follows.

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − Similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example 3

    from bs4 import BeautifulSoup

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print("minimal: ")
    print(soup.prettify(formatter="minimal"))
    print("html: ")
    print(soup.prettify(formatter="html"))
    print("None: ")
    print(soup.prettify(formatter=None))

Output

    minimal:
    <p>
     Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    </p>
    html:
    <p>
     Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    </p>
    None:
    <p>
     Il a dit <<Sacré bleu!>>
    </p>
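The effect of the encoding parameter on the return type can be checked with a short sketch. The markup and variable names below are illustrative; str(soup) is included only to show the compact form that prettify() expands.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>The quick, <b>brown fox</b></p>", "html.parser")

as_text = soup.prettify()                   # encoding=None -> str
as_bytes = soup.prettify(encoding="utf-8")  # explicit encoding -> bytes
compact = str(soup)                         # no whitespace added

print(type(as_text).__name__, type(as_bytes).__name__)
print(compact)
```

When the document has to be written back out unchanged, str(soup) is the safer choice, since it does not insert the newlines that prettify() adds.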

Beautiful Soup – unwrap Method

Method Description

The unwrap() method is the opposite of the wrap() method. It replaces a tag with whatever is inside that tag: the tag itself is removed from the tree, its contents stay in place, and the removed tag is returned.

Syntax

    unwrap()

Parameters

The method doesn't take any parameters.

Return Type

The unwrap() method returns the tag that has been removed.

Example 1

In the following example, the <b> tag is removed from the HTML string.

    html = '''
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('b')
    newtag = tag1.unwrap()
    print(soup)

Output

    <p>The quick, brown fox jumps over a lazy dog.</p>

Example 2

The code below prints the value returned by the unwrap() method. The returned tag is empty, because its contents stayed behind in the document.

    html = '''
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('b')
    newtag = tag1.unwrap()
    print(newtag)

Output

    <b></b>

Example 3

The unwrap() method is useful for stripping out markup, as the following code shows −

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # find_all() with no arguments returns every tag in the document.
    for tag in soup.find_all():
        tag.unwrap()
    print(soup)

Output

    The quick, brown fox jumps over a lazy dog.
    DJs flock by when MTV ax quiz prog.
    Junk MTV quiz graced by fox whelps.
    Bawds jog, flick quartz, vex nymphs.
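Stripping markup does not have to be all-or-nothing. The sketch below, on made-up markup, unwraps only the <b> and <i> tags and leaves the link intact; the choice of tags is an assumption for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>The <b>quick</b>, <i>brown</i> <a href='#'>fox</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

# Passing a list to find_all() matches any of the named tags.
for tag in soup.find_all(["b", "i"]):
    tag.unwrap()

cleaned = str(soup)
print(cleaned)
```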

Beautiful Soup – find_all Method

Method Description

The find_all() method in Beautiful Soup looks through the descendants of this PageElement for elements that match the given criteria and returns a list of all matches.

Syntax

    soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If this is True, a recursive search is performed. Otherwise, only the direct children are considered.
string − A filter on the text of a NavigableString.
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return type

The find_all() method returns a ResultSet object, which behaves like a list.

Example 1

When we pass in a value for name, Beautiful Soup only considers tags with that name. Text strings are ignored, as are tags whose names don't match. In this example we pass 'input' to the find_all() method.

    from bs4 import BeautifulSoup

    with open('index.html') as html:
        soup = BeautifulSoup(html, 'html.parser')

    obj = soup.find_all('input')
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Example 2

We shall use the following HTML script in this example −

    <html>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>
    </body>
    </html>

With the string argument you can search for strings instead of tags. As with name, you can pass in a string, a regular expression, a list, a function, or the value True. In this example, a function is passed to the string argument, and find_all() returns all the strings starting with 'A'.

    from bs4 import BeautifulSoup

    def startingwith(ch):
        return ch.startswith('A')

    # html holds the HTML script shown above.
    soup = BeautifulSoup(html, 'html.parser')
    lst = soup.find_all(string=startingwith)
    print(lst)

Output

    ['Accounts', 'Anand', 'Ankita']

Example 3

In this example, we pass the limit=2 argument to the find_all() method. The method returns the first two occurrences of the <li> tag.

    soup = BeautifulSoup(html, 'html.parser')
    lst = soup.find_all('li', limit=2)
    print(lst)

Output

    [<li>Accounts</li>, <li>Anand</li>]
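The remaining filter types and the recursive parameter can be sketched in the same way. The markup, class names and id values below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<ul id="dept">
  <li class="name">Accounts</li>
  <ol id="hr"><li class="name staff">Rani</li><li>Ankita</li></ol>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_list = soup.find_all(["ul", "ol"])             # any of several tag names
by_regex = soup.find_all(re.compile("^[uo]l$"))   # tag names matching a pattern
by_class = soup.find_all("li", class_="name")     # keyword filter on an attribute
by_attrs = soup.find_all(attrs={"id": "hr"})      # the same idea via the attrs dict
direct = soup.ul.find_all("li", recursive=False)  # direct children of <ul> only

print(len(by_list), len(by_regex), len(by_class), len(by_attrs), len(direct))
```

Note that class_="name" also matches the tag whose class attribute is "name staff", because class is a multi-valued attribute and each value is tested separately.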

Beautiful Soup – NavigableString Method

Method Description

NavigableString() is the constructor of the NavigableString class in the bs4 package. A NavigableString represents a piece of text inside a tag, that is, a leaf of the parse tree. The constructor casts a regular Python string to a NavigableString. Conversely, the built-in str() function converts a NavigableString object to a Unicode string.

Syntax

    NavigableString(string)

Parameters

string − an object of Python's str class.

Return Value

The NavigableString() constructor returns a NavigableString object.

Example 1

In the code below, the HTML string contains an empty <b> tag. We add a NavigableString object to it.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, 'html.parser')
    navstr = NavigableString("Hello World")
    soup.b.append(navstr)
    print(soup)

Output

    <p><b>Hello World</b></p>

Example 2

In this example, two NavigableString objects are appended to an empty <b> tag. Because the tag now has more than one child, we read its text through the strings property instead of the string property. The strings property is a generator of NavigableString objects.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, 'html.parser')
    navstr = NavigableString("Hello")
    soup.b.append(navstr)
    navstr = NavigableString("World")
    soup.b.append(navstr)
    for s in soup.b.strings:
        print(s, type(s))

Output

    Hello <class 'bs4.element.NavigableString'>
    World <class 'bs4.element.NavigableString'>

Example 3

If we access the stripped_strings property of the <b> tag object instead of the strings property, we get a generator of Unicode strings, i.e. str objects.

    html = """
    <p><b></b></p>
    """
    from bs4 import BeautifulSoup, NavigableString

    soup = BeautifulSoup(html, 'html.parser')
    navstr = NavigableString("Hello")
    soup.b.append(navstr)
    navstr = NavigableString("World")
    soup.b.append(navstr)
    for s in soup.b.stripped_strings:
        print(s, type(s))

Output

    Hello <class 'str'>
    World <class 'str'>
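The reason for switching from string to strings can be seen in a short sketch: once a tag holds more than one child, its string property is None. The variable names below are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p><b></b></p>", "html.parser")

soup.b.append(NavigableString("Hello"))
single = soup.b.string      # exactly one child string, so it is returned

soup.b.append(NavigableString(" World"))
ambiguous = soup.b.string   # two child strings, so .string is None
parts = list(soup.b.strings)
joined = soup.b.get_text()  # concatenates all the text inside the tag

print(single, ambiguous, parts, joined)
```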

Beautiful Soup – find_previous Method

Method Description

The find_previous() method in Beautiful Soup looks backwards in the document from this PageElement and finds the first PageElement that matches the given criteria. It returns the first tag or NavigableString that comes before the current tag in the document. Like the other find methods, it has the following syntax −

Syntax

    find_previous(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous() method returns a Tag or NavigableString object.

Example 1

In the example below, we find the object that comes before the <body> tag. It happens to be the <title> element.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.body
    print(tag.find_previous())

Output

    <title>TutorialsPoint</title>

Example 2

There are three input elements in the HTML document used in this example. The following code locates the input element whose name attribute is "age" and looks for its previous element.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('input', {'name': 'age'})
    print(tag.find_previous())

Output

    <input id="nm" name="name" type="text"/>

Example 3

The element before <title> happens to be the <head> element that encloses it.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('title')
    print(tag.find_previous())

Output

    <head>
    <title>TutorialsPoint</title>
    </head>
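The name and string filters narrow the backward search. The following is a minimal sketch on a made-up form (the labels and ids are assumptions for illustration), not the index.html file used above.

```python
from bs4 import BeautifulSoup

html = """
<form>
  <label>Name</label><input id="nm" name="name" type="text"/>
  <label>Age</label><input id="age" name="age" type="text"/>
</form>
"""
soup = BeautifulSoup(html, "html.parser")
age = soup.find("input", {"name": "age"})

prev_label = age.find_previous("label")       # nearest earlier <label>
prev_input = age.find_previous("input")       # nearest earlier <input>
prev_text = age.find_previous(string="Name")  # search text nodes instead of tags
prev_form = age.find_previous("form")         # an enclosing tag also comes earlier

print(prev_label, prev_input, repr(prev_text), prev_form.name)
```

As in Example 3 above, an ancestor counts as "previous" because its opening tag appears earlier in the document than the current tag.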

Beautiful Soup – append Method

Method Description

The append() method in Beautiful Soup adds a given string or another tag at the end of the current Tag object's contents. It works like the append() method of Python's list object.

Syntax

    append(obj)

Parameters

obj − any PageElement; may be a string, a NavigableString object or a Tag object.

Return Type

The append() method doesn't return a new object.

Example 1

In the following example, the HTML script has a <p> tag. With append(), additional text is appended.

    from bs4 import BeautifulSoup

    markup = '<p>Hello</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup)
    tag = soup.p
    tag.append(" World")
    print(soup)

Output

    <p>Hello</p>
    <p>Hello World</p>

Example 2

With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method and then pass it to the append() method.

    from bs4 import BeautifulSoup

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag1 = soup.new_tag('i')
    tag1.string = 'World'
    tag.append(tag1)
    print(soup.prettify())

Output

    <b>
     Hello
     <i>
      World
     </i>
    </b>

Example 3

If you have to add a string to the document, you can append a NavigableString object.

    from bs4 import BeautifulSoup, NavigableString

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    new_string = NavigableString(" World")
    tag.append(new_string)
    print(soup.prettify())

Output

    <b>
     Hello
     World
    </b>
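Because append() always adds at the end, calling it in a loop builds up content in order. The sketch below fills an empty list element; the names used for the list items are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

for name in ["Anand", "Mahesh", "Rani"]:
    item = soup.new_tag("li")  # create a fresh <li> for each entry
    item.string = name
    soup.ul.append(item)       # each call adds after the previous item

result = str(soup)
print(result)
```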

Beautiful Soup – replace_with Method

Method Description

Beautiful Soup's replace_with() method replaces a tag or string in an element with the provided tag or string.

Syntax

    replace_with(tag/string)

Parameters

The method accepts a tag object or a string as its argument.

Return Type

The replace_with() method doesn't create a new object. It returns the tag or string that was replaced, which is now detached from the tree.

Example 1

In this example, the <p> tag is replaced by a <b> tag with the use of the replace_with() method.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    txt = tag1.string
    tag2 = soup.new_tag('b')
    tag2.string = txt
    tag1.replace_with(tag2)
    print(soup)

Output

    <html>
    <body>
    <b>The quick, brown fox jumps over a lazy dog.</b>
    </body>
    </html>

Example 2

You can replace the inner text of a tag with another string by calling the replace_with() method on the tag.string object.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    tag1.string.replace_with("DJs flock by when MTV ax quiz prog.")
    print(soup)

Output

    <html>
    <body>
    <p>DJs flock by when MTV ax quiz prog.</p>
    </body>
    </html>

Example 3

The tag object whose content is to be replaced can be obtained with any of the find() methods. Here, we replace the text of the <b> tag that follows the opening <p> tag.

    html = '''
    <html>
    <body>
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    tag1.find_next('b').string.replace_with('black')
    print(soup)

Output

    <html>
    <body>
    <p>The quick, <b>black</b> fox jumps over a lazy dog.</p>
    </body>
    </html>
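The value returned by replace_with() can be kept and reused. In the sketch below, on made-up markup, the replaced <b> tag is moved into an empty <div>; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>The quick, <b>brown</b> fox.</p><div></div>", "html.parser")

new_tag = soup.new_tag("i")
new_tag.string = "black"

# replace_with() hands back the element that was taken out of the tree.
old = soup.b.replace_with(new_tag)

# The detached element can be added back somewhere else.
soup.div.append(old)

result = str(soup)
print(result)
```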

Beautiful Soup – parents Property

Description

The parents property in the Beautiful Soup library retrieves all the ancestors of a PageElement, one level at a time. The value returned by the parents property is a generator, with which we can list the parents in bottom-up order, from the immediate parent to the top of the document.

Syntax

    Element.parents

Return value

The parents property returns a generator object.

Example 1

This example uses .parents to travel from a tag nested inside the document to the very top of the document. In the following code, we track the parents of the first <p> tag in the example HTML string.

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    <p>Hello World</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.p
    for element in tag.parents:
        print(element.name)

Output

    body
    html
    [document]

Note that the last parent is the BeautifulSoup object itself, whose name is [document].

Example 2

In the following example, the <b> tag is enclosed inside a <p> tag. The two div tags above it have an id attribute. We print only those parents that have an id attribute. The has_attr() method is used for this purpose.

    html = """
    <div id="outer">
    <div id="inner">
    <p>Hello<b>World</b></p>
    </div>
    </div>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.b
    for parent in tag.parents:
        if parent.has_attr("id"):
            print(parent["id"])

Output

    inner
    outer
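Since parents is a generator, it combines well with list comprehensions and next(). The sketch below, assuming the same nested div markup as Example 2, collects the ancestor names and picks the nearest ancestor that has an id; find_parent() is shown as a filtered alternative.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><div id="inner"><p>Hello<b>World</b></p></div></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.b

# Names of all ancestors, from the immediate parent up to the document.
names = [parent.name for parent in tag.parents]

# The generator stops being consumed as soon as a match is found.
nearest_with_id = next((p for p in tag.parents if p.has_attr("id")), None)

# find_parent() applies a filter while walking the same chain.
nearest_div = tag.find_parent("div")

print(names, nearest_with_id["id"], nearest_div["id"])
```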