Beautiful Soup – unwrap Method

Description

The unwrap() method is the opposite of the wrap() method. It replaces a tag with whatever is inside that tag: the tag is removed from the tree, its contents stay in place, and the removed tag is returned.

Syntax

    unwrap()

Parameters

The method doesn't require any parameters.

Return Type

The unwrap() method returns the tag that has been removed.

Example 1

In the following example, the <b> tag is removed from the HTML string.

    from bs4 import BeautifulSoup

    html = '<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>'
    soup = BeautifulSoup(html, 'html.parser')

    tag1 = soup.find('b')
    newtag = tag1.unwrap()
    print(soup)

Output

    <p>The quick, brown fox jumps over a lazy dog.</p>

Example 2

The code below prints the value returned by the unwrap() method. The returned <b> tag is empty because its contents were left behind in the document.

    from bs4 import BeautifulSoup

    html = '<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>'
    soup = BeautifulSoup(html, 'html.parser')

    tag1 = soup.find('b')
    newtag = tag1.unwrap()
    print(newtag)

Output

    <b></b>

Example 3

The unwrap() method is useful for stripping out markup, as the following code shows.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    # find_all() with no arguments returns every tag in the document
    for tag in soup.find_all():
        tag.unwrap()
    print(soup)

Output (surrounding blank lines omitted)

    The quick, brown fox jumps over a lazy dog.
    DJs flock by when MTV ax quiz prog.
    Junk MTV quiz graced by fox whelps.
    Bawds jog, flick quartz, vex nymphs.
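Stripping every tag is not always what you want. The following is a minimal sketch, not taken from the original tutorial, that unwraps only selected inline tags and keeps the enclosing <p>. The sample HTML and the variable names are illustrative assumptions; smooth() is a Beautiful Soup method that merges the adjacent text fragments that unwrapping leaves behind.

```python
from bs4 import BeautifulSoup

# Illustrative input, assumed for this sketch
html = "<p>The <b>quick</b>, <i>brown</i> fox</p>"
soup = BeautifulSoup(html, "html.parser")

# Unwrap only the inline formatting tags; <p> is not in the list, so it stays.
for tag in soup.find_all(["b", "i"]):
    tag.unwrap()

result = str(soup)
print(result)

# Unwrapping leaves the text split into several adjacent strings.
pieces_before = len(soup.p.contents)

# smooth() consolidates adjacent strings into a single one.
soup.p.smooth()
pieces_after = len(soup.p.contents)
print(pieces_before, pieces_after)
```

The printed markup is the same before and after smooth(); only the internal tree changes, which matters if later code relies on soup.p.string.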

Beautiful Soup – find_all Method

Description

The find_all() method in Beautiful Soup looks through the descendants of this element and returns all the elements that match the given criteria.

Syntax

    find_all(name, attrs, recursive, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If this is True, a recursive search is performed over all descendants. Otherwise, only the direct children are considered.
string − A filter for a NavigableString with specific text.
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_all() method returns a ResultSet object, which is a list of the matching elements.

Example 1

When you pass a value for name, Beautiful Soup only considers tags with that name. Text strings are ignored, as are tags whose names don't match. In this example we pass input to the find_all() method. The file index.html contains an HTML form with three input elements.

    from bs4 import BeautifulSoup

    with open('index.html') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    obj = soup.find_all('input')
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Example 2

We shall use the following HTML script in this example.

    html = """
    <html>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>
    </body>
    </html>
    """

With the string argument you can search for strings instead of tags. Like name, it accepts a string, a regular expression, a list, a function, or the value True. In this example, a function is passed to the string argument, and find_all() returns all the strings starting with 'A'.

    from bs4 import BeautifulSoup

    def startingwith(ch):
        return ch.startswith('A')

    soup = BeautifulSoup(html, 'html.parser')
    lst = soup.find_all(string=startingwith)
    print(lst)

Output

    ['Accounts', 'Anand', 'Ankita']

Example 3

In this example, we pass the limit=2 argument to the find_all() method, using the same HTML as in Example 2. The method returns the first two occurrences of the <li> tag.

    soup = BeautifulSoup(html, 'html.parser')
    lst = soup.find_all('li', limit=2)
    print(lst)

Output

    [<li>Accounts</li>, <li>Anand</li>]
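The other filter types listed above can be sketched in the same way. The following example is an illustration rather than part of the original tutorial; it reuses a shortened version of the department list, and the variable names are assumptions. It shows a list of tag names, recursive=False, a regular expression passed to string, and an attrs dictionary.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the department list used in Example 2
html = """
<ul id="dept">
<li>Accounts</li>
<ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ol id="HR"><li>Rani</li><li>Ankita</li></ol>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
dept = soup.find("ul", id="dept")

# A list matches any of the given tag names.
list_ids = [tag["id"] for tag in dept.find_all(["ul", "ol"])]

# recursive=False restricts the search to direct children of dept.
top_items = [str(li.string) for li in dept.find_all("li", recursive=False)]

# A compiled regular expression is matched against each text string.
an_names = [str(s) for s in soup.find_all(string=re.compile("^An"))]

# attrs takes a dictionary of attribute filters.
acc_lists = soup.find_all(attrs={"id": "acc"})

print(list_ids)
print(top_items)
print(an_names)
print(len(acc_lists))
```

Converting the results with str() is optional; it is done here only so that the printed lists contain plain Python strings.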

Beautiful Soup – NavigableString Method

Description

NavigableString() is the constructor of the NavigableString class in the bs4 package. A NavigableString represents a piece of text inside a tag, that is, a leaf of the parsed document tree. The constructor casts a regular Python string to a NavigableString. Conversely, the built-in str() function converts a NavigableString object to a plain Unicode string.

Syntax

    NavigableString(string)

Parameters

string − An object of Python's str class.

Return Value

NavigableString() returns a NavigableString object.

Example 1

In the code below, the HTML string contains an empty <b> tag. We add a NavigableString object to it.

    from bs4 import BeautifulSoup, NavigableString

    html = '<p><b></b></p>'
    soup = BeautifulSoup(html, 'html.parser')

    navstr = NavigableString('Hello World')
    soup.b.append(navstr)
    print(soup)

Output

    <p><b>Hello World</b></p>

Example 2

In this example, two NavigableString objects are appended to an empty <b> tag. Because the tag now has more than one child, its string property is None; use the strings property instead, which is a generator of NavigableString objects.

    from bs4 import BeautifulSoup, NavigableString

    html = '<p><b></b></p>'
    soup = BeautifulSoup(html, 'html.parser')

    navstr = NavigableString('Hello')
    soup.b.append(navstr)
    navstr = NavigableString('World')
    soup.b.append(navstr)

    for s in soup.b.strings:
        print(s, type(s))

Output

    Hello <class 'bs4.element.NavigableString'>
    World <class 'bs4.element.NavigableString'>

Example 3

If we access the stripped_strings property of the <b> tag object instead of the strings property, we get a generator of plain Unicode strings, i.e. str objects.

    from bs4 import BeautifulSoup, NavigableString

    html = '<p><b></b></p>'
    soup = BeautifulSoup(html, 'html.parser')

    navstr = NavigableString('Hello')
    soup.b.append(navstr)
    navstr = NavigableString('World')
    soup.b.append(navstr)

    for s in soup.b.stripped_strings:
        print(s, type(s))

Output

    Hello <class 'str'>
    World <class 'str'>
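The difference between a NavigableString and a plain str can be made concrete with a short sketch. This example is an illustration added here, with assumed sample markup and variable names: it shows that a NavigableString is a str subclass that also knows its position in the tree, that str() drops that link, and that replace_with() is the way to change the text, since strings cannot be edited in place.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p><b>Hello</b></p>", "html.parser")
text = soup.b.string

# A NavigableString is a str subclass that stays attached to the tree.
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)
parent_name = text.parent.name

# str() produces a plain string with no link back to the document.
plain = str(text)
plain_has_parent = hasattr(plain, "parent")

# Strings are immutable; swap the node for a new one with replace_with().
text.replace_with(NavigableString("Goodbye"))
result = str(soup)

print(is_navigable, is_str, parent_name, plain_has_parent)
print(result)
```

Converting to str is also worthwhile when you keep text around after parsing, because a NavigableString holds a reference to the whole parse tree.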

Beautiful Soup – parent Property

Description

The parent property in the BeautifulSoup library returns the immediate parent element of the given PageElement. The value returned by the parent property is a Tag object. The parent of a top-level tag is the BeautifulSoup object itself, whose name is [document]; the parent of the BeautifulSoup object is None.

Syntax

    Element.parent

Return Value

The parent property returns a Tag object. For a top-level tag it returns the BeautifulSoup object, and for the BeautifulSoup object it returns None.

Example 1

This example uses the .parent property to find the immediate parent element of the first <p> tag in the example HTML string.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head><title>TutorialsPoint</title></head>
    <body>
    <p>Hello World</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    tag = soup.p
    print(tag.parent.name)

Output

    body

Example 2

In the following example, the <title> tag is enclosed inside a <head> tag. Hence, the parent property of the <title> tag returns the <head> tag.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head><title>TutorialsPoint</title></head>
    <body>
    <p>Hello World</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    tag = soup.title
    print(tag.parent)

Output

    <head><title>TutorialsPoint</title></head>

Example 3

The behaviour of Python's built-in HTML parser is a little different from that of the html5lib and lxml parsers. The built-in parser doesn't try to build a complete document out of the string provided. It doesn't add parent tags like body or html if they don't exist in the string. The html5lib and lxml parsers, on the other hand, add these tags to make the document a complete HTML document. The html5lib package must be installed to run this example.

    from bs4 import BeautifulSoup

    html = '<p><b>Hello World</b></p>'

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.p.parent.name)

    soup = BeautifulSoup(html, 'html5lib')
    print(soup.p.parent.name)

Output

    [document]
    body

As the built-in HTML parser doesn't add any tags, the parent of the <p> tag is the BeautifulSoup object, whose name is [document]. When we use html5lib, however, the parent tag's name is body.
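Following .parent repeatedly walks from an element up to the root of the tree. The sketch below is an added illustration with assumed sample markup and variable names; it collects the names of all ancestors of a <b> tag and stops when parent becomes None. The parents property of Beautiful Soup yields the same sequence.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>Hello World</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upwards until there is no parent left.
node = soup.b
chain = []
while node.parent is not None:
    node = node.parent
    chain.append(node.name)

# The .parents generator performs the same walk.
chain_from_parents = [ancestor.name for ancestor in soup.b.parents]

print(chain)
print(soup.parent)
```

The last entry in the chain is [document], the name of the BeautifulSoup object, and soup.parent itself prints as None.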

Beautiful Soup – wrap Method

Description

The wrap() method in Beautiful Soup encloses an element inside another element. You can wrap an existing tag with another tag, or wrap a tag's string with a tag.

Syntax

    wrap(tag)

Parameters

tag − The tag to wrap the element in.

Return Type

The method returns the new wrapper tag, which now contains the element.

Example 1

In this example, the <b> tag is wrapped in a <div> tag.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    tag1 = soup.find('b')
    newtag = soup.new_tag('div')
    tag1.wrap(newtag)
    print(soup)

Output

    <html>
    <body>
    <p>The quick, <div><b>brown</b></div> fox jumps over a lazy dog.</p>
    </body>
    </html>

Example 2

We wrap the string inside the <p> tag with a wrapper tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>tutorialspoint.com</p>', 'html.parser')
    soup.p.string.wrap(soup.new_tag('b'))
    print(soup)

Output

    <p><b>tutorialspoint.com</b></p>
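Because wrap() returns the wrapper, the result can be used directly, and unwrap() on that wrapper undoes the change. The following sketch is an added illustration; the span tag, its class value and the variable names are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

original = "<p>The quick, <b>brown</b> fox</p>"
soup = BeautifulSoup(original, "html.parser")

# wrap() returns the wrapper tag, which is now part of the tree.
wrapper = soup.b.wrap(soup.new_tag("span", attrs={"class": "hl"}))
wrapped = str(soup)
wrapper_in_tree = soup.span is wrapper

# unwrap() on the wrapper restores the original markup.
wrapper.unwrap()
restored = str(soup)

print(wrapped)
print(restored)
```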

Beautiful Soup – find_previous_siblings Method

Description

The find_previous_siblings() method in the Beautiful Soup package returns all the siblings that appear earlier in the document than this PageElement and match the given criteria.

Syntax

    find_previous_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after finding this many results.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous_siblings() method returns a ResultSet of PageElements, ordered from the nearest sibling to the farthest.

Example 1

Let us use the following HTML snippet for this purpose.

    <p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>

In the code below, we find all the previous siblings of the <u> tag. There are two more tags at the same level in the HTML string.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>', 'html.parser')

    tag1 = soup.find('u')
    print('previous siblings:')
    for tag in tag1.find_previous_siblings():
        print(tag)

Output

    previous siblings:
    <i>Python</i>
    <b>Excellent</b>

Example 2

The web page (index.html) has an HTML form with three input elements. We locate the one whose id attribute is marks and then find its previous siblings.

    from bs4 import BeautifulSoup

    with open('index.html') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('input', {'id': 'marks'})
    sibs = tag.find_previous_siblings()
    print(sibs)

Output

    [<input id="age" name="age" type="text"/>, <input id="nm" name="name" type="text"/>]

Example 3

The HTML string has a <b> tag and two <p> tags nested inside an outer <p> tag. We find the siblings previous to the one whose id attribute is id1.

    from bs4 import BeautifulSoup

    html = '<p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>'
    soup = BeautifulSoup(html, 'html.parser')

    tag = soup.find('p', id='id1')
    ptags = tag.find_previous_siblings()
    for ptag in ptags:
        print('Tag: {}, Text: {}'.format(ptag.name, ptag.text))

Output

    Tag: p, Text: Python
    Tag: b, Text: Excellent
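The name and limit parameters described above can be combined, and the result order is worth seeing explicitly. The sketch below is an added illustration with assumed sample markup and variable names; it shows that siblings come back nearest first, how limit truncates the list, and how to restore document order.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>"
soup = BeautifulSoup(html, "html.parser")
last = soup.find_all("li")[-1]

# Results are ordered from the nearest sibling to the farthest.
all_prev = [str(li.string) for li in last.find_previous_siblings("li")]

# limit stops the search after the given number of matches.
nearest_two = [str(li.string) for li in last.find_previous_siblings("li", limit=2)]

# Reverse the list to get the siblings in document order.
in_document_order = list(reversed(all_prev))

# The singular form returns only the nearest match.
nearest = str(last.find_previous_sibling("li").string)

print(all_prev)
print(nearest_two)
print(in_document_order)
print(nearest)
```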

Beautiful Soup – insert_before Method

Description

The insert_before() method in Beautiful Soup inserts tags or strings immediately before an element in the parse tree. The inserted element becomes the immediate predecessor of this one. The inserted element can be a tag or a string.

Syntax

    insert_before(*args)

Parameters

args − One or more elements, each of which may be a tag or a string.

Return Value

The insert_before() method doesn't return any new object.

Example 1

The following example inserts the text "Here is an " before "Excellent" in the given HTML markup string.

    from bs4 import BeautifulSoup

    markup = '<b>Excellent</b> Python Tutorial <u>from TutorialsPoint</u>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.b
    tag.insert_before('Here is an ')
    print(soup.prettify())

Output

    Here is an
    <b>
     Excellent
    </b>
    Python Tutorial
    <u>
     from TutorialsPoint
    </u>

Example 2

You can also insert a tag before another tag. Take a look at this example.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.b
    tag1 = soup.new_tag('b')
    tag1.string = 'Python '
    tag.insert_before(tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Python
     </b>
     <b>
      Tutorial
     </b>
     from TutorialsPoint
    </p>

Example 3

The following code passes more than one string to be inserted before the <b> tag. The strings are inserted in the order given, and prettify() prints each string on its own line.

    from bs4 import BeautifulSoup

    markup = '<p>There are <b>Tutorials</b> <u>from TutorialsPoint</u></p>'
    soup = BeautifulSoup(markup, 'html.parser')

    tag = soup.b
    tag.insert_before('many ', 'excellent ')
    print(soup.prettify())

Output

    <p>
     There are
     many
     excellent
     <b>
      Tutorials
     </b>
     <u>
      from TutorialsPoint
     </u>
    </p>
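Strings and tags can be mixed in a single insert_before() call. The sketch below is an added illustration with assumed sample markup and variable names; it prints the document with str() rather than prettify(), so the exact position of each inserted piece is visible.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>There are <b>Tutorials</b></p>", "html.parser")

# Build a new tag to insert alongside two plain strings.
emphasis = soup.new_tag("i")
emphasis.string = "excellent"

# Arguments are inserted in the order given, all directly before <b>.
soup.b.insert_before("many ", emphasis, " ")

result = str(soup)
previous_names = [el.name for el in soup.b.previous_siblings]
print(result)
```

Plain Python strings passed to insert_before() become NavigableString objects in the tree, which is why their name is None in previous_names.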

Beautiful Soup – Discussion

In this tutorial, we show how to perform web scraping in Python using Beautiful Soup 4 to get data out of HTML, XML and other markup languages. We scrape pages from several different websites (including IMDB), and cover Beautiful Soup 4 together with the basic Python tools for navigating, searching and parsing HTML pages efficiently and clearly. We have tried to cover almost all the functionality of Beautiful Soup 4 in this tutorial. You can combine several of the features introduced here into one larger program that extracts meaningful data from a website and passes it to another program as input.

Beautiful Soup – next_elements Property

Description

In the Beautiful Soup library, the next_elements property returns a generator that yields the strings and tags that follow this element in the parse tree, in document order.

Syntax

    Element.next_elements

Return Value

The next_elements property returns a generator.

Example 1

The next_elements property returns the tags and NavigableStrings appearing after the <b> tag in the document string below.

    from bs4 import BeautifulSoup

    html = '<p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>'
    soup = BeautifulSoup(html, 'html.parser')

    tag = soup.find('b')
    nexts = tag.next_elements
    print('Next elements:')
    for element in nexts:
        print(element)

Output

    Next elements:
    Excellent
    <p>Python</p>
    Python
    <p id="id1">Tutorial</p>
    Tutorial

Example 2

All the elements appearing after the <p> tag are listed below. The newline characters between the tags appear in the list as separate strings.

    from bs4 import BeautifulSoup

    html = """
    <p>
    <b>Excellent</b><i>Python</i>
    </p>
    <u>Tutorial</u>
    """
    soup = BeautifulSoup(html, 'html.parser')

    tag1 = soup.find('p')
    print('Next elements:')
    print(list(tag1.next_elements))

Output

    Next elements:
    ['\n', <b>Excellent</b>, 'Excellent', <i>Python</i>, 'Python', '\n', '\n', <u>Tutorial</u>, 'Tutorial', '\n']

Example 3

The elements that follow the first input tag in the HTML form of index.html are listed below. The html5lib package must be installed to run this example.

    from bs4 import BeautifulSoup

    with open('index.html') as fp:
        soup = BeautifulSoup(fp, 'html5lib')

    tag = soup.find('input')
    nexts = tag.next_elements
    print('Next elements:')
    for element in nexts:
        print(element)

Output (the exact result depends on the contents of index.html; this is the output when the three inputs are adjacent with no text between or after them)

    Next elements:
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
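Since next_elements yields both tags and strings, and continues past the end of the parent element, it often needs filtering. The following sketch is an added illustration with assumed sample markup and variable names; it keeps only the Tag objects and contrasts the result with next_siblings, which stays at the same level of the tree.

```python
from bs4 import BeautifulSoup, Tag

html = "<p><b>Excellent</b><i>Python</i></p><u>Tutorial</u>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

# next_elements walks the whole rest of the document, strings included.
elements = [str(el) for el in b.next_elements]

# Keep only the tags by checking the element type.
tags_only = [el.name for el in b.next_elements if isinstance(el, Tag)]

# next_siblings stops at the end of the parent <p>.
siblings = [el.name for el in b.next_siblings]

print(elements)
print(tags_only)
print(siblings)
```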

Beautiful Soup – decompose Method

Description

The decompose() method destroys the current element along with its children: the element is removed from the tree, and it and everything beneath it are wiped out. You can check whether an element has been decomposed with the decomposed property. It returns True if the element has been destroyed, False otherwise.

Syntax

    decompose()

Parameters

No parameters are defined for this method.

Return Type

The method doesn't return any object.

Example 1

When we call the decompose() method on the BeautifulSoup object itself, the entire content is destroyed.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    soup.decompose()
    print('decomposed:', soup.decomposed)
    print(soup)

Output

    decomposed: True
    Traceback (most recent call last):
      ...
    TypeError: can only concatenate str (not "NoneType") to str

Since the soup object has been decomposed, the decomposed property returns True. Printing the decomposed object, however, raises the TypeError shown above.

Example 2

The code below uses the decompose() method to remove all occurrences of the <p> tag in the HTML string.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    for p in soup.find_all('p'):
        p.decompose()
    print('document:', soup)

Output

The rest of the HTML document is printed after all the <p> tags have been removed. The newlines that separated the paragraphs remain as blank lines.

    document:
    <html>
    <body>




    </body>
    </html>

Example 3

Here, we find the <body> tag in the HTML document tree and decompose the previous tag, which happens to be the <title> tag. The resultant document tree omits the <title> tag.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    Hello World
    </body>
    </html>
    """
    soup = BeautifulSoup(html, 'html.parser')

    tag = soup.body
    tag.find_previous().decompose()
    print('document:', soup)

Output

    document:
    <html>
    <head>

    </head>
    <body>
    Hello World
    </body>
    </html>
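When a removed element is still needed afterwards, extract() is the non-destructive alternative to decompose(). The sketch below is an added illustration with assumed sample markup and variable names; it removes one element with each method and compares what is left of the removed objects.

```python
from bs4 import BeautifulSoup

html = "<div><p class='ad'>Buy now</p><p>Keep me</p><script>x()</script></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the element and destroys it.
ad = soup.find("p", class_="ad")
ad.decompose()
ad_destroyed = ad.decomposed

# extract() removes the element but returns it intact for reuse.
script = soup.script.extract()
script_destroyed = script.decomposed
script_markup = str(script)

remaining = str(soup)
print(ad_destroyed, script_destroyed)
print(script_markup)
print(remaining)
```

A reasonable rule of thumb is to use decompose() for content you never want to see again and extract() when the element will be inspected or inserted elsewhere.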