Beautiful Soup – find_all_next() Method

Method Description

The find_all_next() method in Beautiful Soup finds all PageElements that match the given criteria and appear after this element in the document. It returns Tag or NavigableString objects and takes the same filter arguments as find_all().

Syntax

    find_all_next(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter on text; used on its own, it matches NavigableString objects rather than tags.
limit − Stop looking after the specified number of matches has been found.
kwargs − Attribute filters given as keyword arguments, for example id="nm".

Return Value

This method returns a ResultSet containing PageElements (Tag or NavigableString objects).

Example 1

Using index.html as the HTML document for this example, we first locate the <form> tag and then collect all the elements after it with the find_all_next() method.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.form
    tags = tag.find_all_next()
    print(tags)

Output

    [<input id="nm" name="name" type="text"/>,
    <input id="age" name="age" type="text"/>,
    <input id="marks" name="marks" type="text"/>]

Example 2

Here, we apply a filter to the find_all_next() method to collect only the tags after <form> whose id is nm or age.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.form
    tags = tag.find_all_next(id=["nm", "age"])
    print(tags)

Output

    [<input id="nm" name="name" type="text"/>,
    <input id="age" name="age" type="text"/>]

Example 3

Starting from the <body> tag, the result contains the <h1> tag, the <form> tag and its three <input> elements. The inputs appear twice in the printed output: once inside <form> and once as matches in their own right.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tag = soup.body
    tags = tag.find_all_next()
    print(tags)

Output

    [<h1>TutorialsPoint</h1>, <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
    </form>, <input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
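The name, limit and string arguments can be combined in the same way as with find_all(). The following is a minimal sketch, assuming an inline HTML string as a stand-in for index.html (the markup is an illustration, not the original file).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for index.html, inlined so the sketch is self-contained.
html = """
<html><body>
<h1>TutorialsPoint</h1>
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="text" id="marks" name="marks">
</form>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and stop after the first two matches.
first_two = soup.form.find_all_next("input", limit=2)
ids = [tag["id"] for tag in first_two]
print(ids)

# string=True matches text nodes instead of tags; drop whitespace-only strings.
texts = [s for s in soup.h1.find_all_next(string=True) if s.strip()]
print(texts)
```

Note that the search starts at the element's first descendant, so the text inside <h1> itself is part of the second result.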
Author: user
Beautiful Soup – smooth() Method

Method Description

After calling several methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. The smooth() method smooths out this element's children by consolidating consecutive strings. This makes pretty-printed output look more natural after a lot of operations that modified the tree.

Syntax

    smooth()

Parameters

This method has no parameters.

Return Type

This method modifies the tag in place and returns None.

Example 1

    from bs4 import BeautifulSoup

    html = """<html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    Some Text
    <div></div>
    <p></p>
    <div>Some more text</div>
    <b></b>
    <i></i>
    # COMMENT
    </body>
    </html>"""

    soup = BeautifulSoup(html, "html.parser")
    soup.find("body").smooth()

    # Remove every tag that has no text, then merge the strings left behind.
    for item in soup.find_all():
        if not item.get_text(strip=True):
            p = item.parent
            item.replace_with("")
            p.smooth()

    print(soup.prettify())

Output

    <html>
     <head>
      <title>
       TutorialsPoint
      </title>
     </head>
     <body>
      Some Text
      <div>
       Some more text
      </div>
      # COMMENT
     </body>
    </html>

Example 2

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Hello</p>", "html.parser")
    soup.p.append(", World")
    soup.smooth()
    print(soup.p.contents)
    print(soup.p.prettify())

Output

    ['Hello, World']
    <p>
     Hello, World
    </p>
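To make the effect of smooth() visible, it can help to count the children of a tag before and after the call. This is a small sketch along the lines of Example 2; the variable names before and after are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
soup.p.append(", World")

before = len(soup.p.contents)  # two adjacent NavigableString objects
soup.p.smooth()
after = len(soup.p.contents)   # consolidated into a single string

print(before, after)
print(soup.p.string)
```

Before smoothing, soup.p.string is None because the tag has more than one child; after smoothing it returns the merged text.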
Beautiful Soup – Useful Resources

The following resources contain additional information on Beautiful Soup. Use them to gain more in-depth knowledge of the library.

Useful Video Courses

Learn Python 3 – Online course (79 lectures, 17.5 hours), Joseph Delgadillo
The Complete Python 3 Course: From Beginner to Advanced (147 lectures, 18 hours), Joseph Delgadillo
Web Scraping using API, Beautiful Soup using Python (39 lectures, 3.5 hours), Chandramouli Jayendran
A-Z Python Bootcamp- Basics To Data Science (50+ Hours) (436 lectures, 46 hours), Chandramouli Jayendran
Beautiful Soup in Action – Web Scraping a Car Dealer Website (7 lectures, 1 hour), AlexanderSchlee
Data Project with Beautiful Soup – Web Scraping E-Commerce (7 lectures, 1 hour), AlexanderSchlee
Beautiful Soup – Installation

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

The BeautifulSoup package is not part of Python's standard library, so it must be installed. Before installing the latest version, let us create a virtual environment, as Python recommends. A virtual environment gives us an isolated working copy of Python for a specific project without affecting the system-wide setup.

We shall use the venv module in Python's standard library to create the virtual environment. pip is included by default in Python 3.4 and later.

Use the following command to create a virtual environment on Windows

    C:\Users\user>python -m venv myenv

On Ubuntu Linux, update the APT repository and install venv, if required, before creating the virtual environment

    mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
    mvl@GNVBGL3:~ $ sudo apt install python3-venv

Then use the following command to create a virtual environment. Run it as your normal user rather than with sudo, so that the environment is not owned by root.

    mvl@GNVBGL3:~ $ python3 -m venv myenv

You need to activate the virtual environment. On Windows, use the commands

    C:\Users\user>cd myenv
    C:\Users\user\myenv>Scripts\activate
    (myenv) C:\Users\user\myenv>

On Ubuntu Linux, use the following commands to activate the virtual environment

    mvl@GNVBGL3:~$ cd myenv
    mvl@GNVBGL3:~/myenv$ source bin/activate
    (myenv) mvl@GNVBGL3:~/myenv$

The name of the virtual environment appears in parentheses at the start of the prompt. Now that it is activated, we can install BeautifulSoup in it.
    (myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
    Collecting beautifulsoup4
      Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.0/143.0 KB 325.2 kB/s eta 0:00:00
    Collecting soupsieve>1.2
      Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
    Installing collected packages: soupsieve, beautifulsoup4
    Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1

Note that the latest version of beautifulsoup4 is 4.12.2 and requires Python 3.8 or later.

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

    (myenv) mvl@GNVBGL3:~/myenv$ python setup.py install

To check that Beautiful Soup is properly installed, enter the following commands in the Python terminal −

    >>> import bs4
    >>> bs4.__version__
    '4.12.2'

If the installation hasn't been successful, the import raises ModuleNotFoundError.

You will also need to install the requests library, an HTTP library for Python.

    pip3 install requests

Installing a Parser

By default, Beautiful Soup uses the HTML parser included in Python's standard library. It also supports several third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the commands

    pip3 install lxml
    pip3 install html5lib

Each parser has its advantages and disadvantages, as shown below −

Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages − Batteries included; decent speed; lenient (as of Python 3.2)
Disadvantages − Not as fast as lxml; less lenient than html5lib

Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages − Very fast; lenient
Disadvantages − External C dependency

Parser: lxml's XML parser
Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages − Very fast; the only currently supported XML parser
Disadvantages − External C dependency

Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages − Extremely lenient; parses pages the same way a web browser does; creates valid HTML5
Disadvantages − Very slow; external Python dependency
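After installation, it can be useful to check which of these parsers Beautiful Soup can actually find in the current environment. The sketch below relies on the FeatureNotFound exception that Beautiful Soup raises when a requested parser is missing; the helper name available_parsers is an assumption made for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the candidate parser names that Beautiful Soup can use here."""
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
        except FeatureNotFound:
            # The parser library is not installed in this environment.
            continue
        found.append(name)
    return found


parsers = available_parsers()
print(parsers)
```

The html.parser entry should always be present, since it ships with Python; the other two appear only if the corresponding packages were installed.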
Beautiful Soup – Kinds of objects

When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the HTML page into a tree of Python objects. Below we discuss the four major kinds of objects defined in the bs4 package.

Tag
NavigableString
BeautifulSoup
Comments

Tag Object

An HTML tag is used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    print(type(tag))

Output

    <class 'bs4.element.Tag'>

Tags have many attributes and methods. The two most important features of a tag are its name and its attributes.

Name (tag.name)

Every tag has a name, which can be accessed with the ".name" suffix. tag.name returns the type of the tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    print(tag.name)

Output

    html

If we change the tag name, the change is reflected in the HTML markup generated by Beautiful Soup.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
    tag = soup.html
    tag.name = "strong"
    print(tag)

Output

    <strong><body><b class="boldest">TutorialsPoint</b></body></strong>

Attributes (tag.attrs)

A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Each attribute is a name and value pair written inside the opening tag. The "attrs" property returns a dictionary of attributes and their values. You can also read an individual attribute by using its name as a key on the tag.

In the example below, the string argument for the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by "attrs".
Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    print(tag.attrs)

Output

    {'type': 'text', 'name': 'name', 'value': 'Raju'}

We can modify a tag's attributes (add, remove or change them) using dictionary operators or methods. In the following example, the value attribute is updated. The regenerated HTML string shows the change.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    print(tag.attrs)
    tag['value'] = 'Ravi'
    print(soup)

Output

    {'type': 'text', 'name': 'name', 'value': 'Raju'}
    <html><body><input name="name" type="text" value="Ravi"/></body></html>

Next, we add a new id attribute and delete the value attribute.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
    tag = soup.input
    tag['id'] = 'nm'
    del tag['value']
    print(soup)

Output

    <html><body><input id="nm" name="name" type="text"/></body></html>

Multi-valued attributes

Some HTML5 attributes can have multiple values. The most commonly used is the class attribute, which can hold several CSS class names. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. Beautiful Soup presents the value of a multi-valued attribute as a list.
Example

    from bs4 import BeautifulSoup

    css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
    print("css_soup.p['class']:", css_soup.p['class'])

    css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
    print("css_soup.p['class']:", css_soup.p['class'])

Output

    css_soup.p['class']: ['body']
    css_soup.p['class']: ['body', 'bold']

However, if an attribute looks like it contains more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone −

Example

    from bs4 import BeautifulSoup

    id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
    print("id_soup.p['id']:", id_soup.p['id'])
    print("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

    id_soup.p['id']: body bold
    type(id_soup.p['id']): <class 'str'>

NavigableString object

Usually, a string is placed between the opening and closing tags of an element. The browser's HTML engine applies the intended effect to the string while rendering the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags, so it is rendered in bold.

The NavigableString object represents the text content of a tag. It is an instance of the bs4.element.NavigableString class. To access the content, use ".string" on the tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', 'html.parser')
    print(soup.string)
    print(type(soup.string))

Output

    Hello, Tutorialspoint!
    <class 'bs4.element.NavigableString'>

A NavigableString object is similar to a Python Unicode string. It also supports some of the features described under Navigating the tree and Searching the tree. A NavigableString can be converted to a plain Unicode string with the str() function.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', 'html.parser')
    tag = soup.h2
    string = str(tag.string)
    print(string)

Output

    Hello, Tutorialspoint!
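The practical difference between a NavigableString and the plain string produced by str() is that only the former keeps its place in the parse tree. The following sketch illustrates this with the same <h2> markup; the variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', "html.parser")
text = soup.h2.string

# A NavigableString is a str subclass that also knows where it sits in the tree.
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)
parent_name = text.parent.name

# str() yields a plain string that no longer refers to the parse tree.
plain = str(text)
plain_has_parent = hasattr(plain, "parent")

print(is_navigable, is_str, parent_name, plain_has_parent)
```

Converting with str() is a reasonable choice when the text is to be kept after the soup object is no longer needed, since the plain string holds no reference to the tree.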
Like a Python string, which is immutable, a NavigableString can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', 'html.parser')
    tag = soup.h2
    tag.string.replace_with("OnLine Tutorials Library")
    print(tag.string)

Output

    OnLine Tutorials Library

BeautifulSoup object

The BeautifulSoup object represents the entire parsed document. It is the object created when we parse a web resource. For most purposes it can be treated like a Tag object, so it supports the functionality required to navigate and search the document tree.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    print(soup)
    print(soup.name)
    print('type:', type(soup))

Output

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul>
    <li>Accounts</li>
    <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>
    </ul>
    </body>
    </html>
    [document]
    type: <class 'bs4.BeautifulSoup'>

The name property of a BeautifulSoup object always returns [document].

Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a method such as replace_with().

Example

    from bs4 import BeautifulSoup

    obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
    obj2
Beautiful Soup – find_next_sibling() Method

Method Description

The find_next_sibling() method in Beautiful Soup finds the closest sibling at the same level as this PageElement that matches the given criteria and appears later in the document. It is similar to the next_sibling property, but accepts filters.

Syntax

    find_next_sibling(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than a tag).
kwargs − Attribute filters given as keyword arguments.

Return Type

The find_next_sibling() method returns a Tag object or a NavigableString object, or None if no sibling matches.

Example 1

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.find('b')
    print("next:", tag1.find_next_sibling())

Output

    next: <i>Python</i>

Example 2

If the next node doesn't exist, the method returns None.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag1 = soup.find('i')
    print("next:", tag1.find_next_sibling())

Output

    next: None
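The name filter lets the search skip siblings that do not match. The sketch below extends the markup of the examples above with a second <b> tag, which is an addition made only for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i><b>World</b></p>", "html.parser")
first_b = soup.find("b")

nearest = first_b.find_next_sibling()        # closest following sibling tag
next_b = first_b.find_next_sibling("b")      # skips <i> and finds the next <b>
missing = first_b.find_next_sibling("span")  # no such sibling

print(nearest)
print(next_b)
print(missing)
```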
Beautiful Soup – append() Method

Method Description

The append() method in Beautiful Soup adds a given string or another tag at the end of the current Tag object's contents. It works like the append() method of Python's list object.

Syntax

    append(obj)

Parameters

obj − Any PageElement; may be a string, a NavigableString object or a Tag object.

Return Type

The append() method doesn't return a new object.

Example 1

In the following example, the HTML script has a <p> tag. With append(), additional text is appended.

    from bs4 import BeautifulSoup

    markup = '<p>Hello</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup)

    tag = soup.p
    tag.append(" World")
    print(soup)

Output

    <p>Hello</p>
    <p>Hello World</p>

Example 2

With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method and then pass it to append().

    from bs4 import BeautifulSoup

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b

    tag1 = soup.new_tag('i')
    tag1.string = 'World'
    tag.append(tag1)
    print(soup.prettify())

Output

    <b>
     Hello
     <i>
      World
     </i>
    </b>

Example 3

If you have to add a string to the document, you can append a NavigableString object.

    from bs4 import BeautifulSoup, NavigableString

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b

    new_string = NavigableString(" World")
    tag.append(new_string)
    print(soup.prettify())

Output

    <b>
     Hello World
    </b>
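One behaviour worth knowing is that appending an element which is already part of the tree moves it rather than copying it. The following is a minimal sketch with made-up markup in which a stray <li> is moved into a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>One</li></ul><li>Two</li>", "html.parser")
ul = soup.ul
stray = soup.find_all("li")[1]  # the <li> that sits outside the list

# The tag is taken out of its old position and placed at the end of <ul>.
ul.append(stray)

result = str(soup)
print(result)
```

If a copy is wanted in both places, the element would have to be duplicated first (for example with copy.copy()) before appending.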
Beautiful Soup – replace_with() Method

Method Description

Beautiful Soup's replace_with() method replaces a tag or string in the tree with the provided tag or string.

Syntax

    replace_with(tag/string)

Parameters

The method accepts a Tag object or a string as its argument.

Return Type

The replace_with() method doesn't create a new object. It returns the tag or string that was replaced, so that it can be examined or added back to another part of the tree.

Example 1

In this example, the <p> tag is replaced by a <b> tag with the replace_with() method.

    from bs4 import BeautifulSoup

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    txt = tag1.string

    tag2 = soup.new_tag('b')
    tag2.string = txt
    tag1.replace_with(tag2)
    print(soup)

Output

    <html>
    <body>
    <b>The quick, brown fox jumps over a lazy dog.</b>
    </body>
    </html>

Example 2

You can replace the inner text of a tag with another string by calling replace_with() on the tag.string object.

    from bs4 import BeautifulSoup

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    tag1.string.replace_with("DJs flock by when MTV ax quiz prog.")
    print(soup)

Output

    <html>
    <body>
    <p>DJs flock by when MTV ax quiz prog.</p>
    </body>
    </html>

Example 3

The tag object to be replaced can be obtained with any of the find() methods. Here, we replace the text of the <b> tag that follows the opening <p> tag.

    from bs4 import BeautifulSoup

    html = '''
    <html>
    <body>
    <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
    </body>
    </html>
    '''

    soup = BeautifulSoup(html, "html.parser")
    tag1 = soup.find('p')
    tag1.find_next('b').string.replace_with('black')
    print(soup)

Output

    <html>
    <body>
    <p>The quick, <b>black</b> fox jumps over a lazy dog.</p>
    </body>
    </html>
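The return value of replace_with() can be kept if the removed element is needed later. This sketch uses a shortened version of the sentence from the examples above; the replacement tag <i>red</i> is invented for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>The quick, <b>brown</b> fox</p>", "html.parser")
old = soup.b

new = soup.new_tag("i")
new.string = "red"

# replace_with() hands back the element that was taken out of the tree.
removed = old.replace_with(new)

print(soup)
print(removed)
print(removed.parent)
```

The removed element is detached from the document, so its parent is None, but it can still be appended or inserted elsewhere.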
Beautiful Soup – parents Property

Method Description

The parents property in the Beautiful Soup library retrieves all the ancestors of a PageElement, one level at a time. The property returns a generator, with which we can list the parents in bottom-up order, from the immediate parent to the top of the document.

Syntax

    Element.parents

Return value

The parents property returns a generator object.

Example 1

This example uses .parents to travel from a tag inside the document to the very top of the document. In the following code, we track the parents of the first <p> tag in the example HTML string.

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    <p>Hello World</p>
    """

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.p
    for element in tag.parents:
        print(element.name)

Output

    body
    html
    [document]

Note that the last parent is the BeautifulSoup object itself, whose name is [document].

Example 2

In the following example, the <b> tag is enclosed inside a <p> tag. The two <div> tags above it have an id attribute. We print only those parents that have an id attribute, using the has_attr() method.

    from bs4 import BeautifulSoup

    html = """
    <div id="outer">
    <div id="inner">
    <p>Hello<b>World</b></p>
    </div>
    </div>
    """

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.b
    for parent in tag.parents:
        if parent.has_attr("id"):
            print(parent["id"])

Output

    inner
    outer
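Because .parents yields the ancestors from the innermost outwards, reversing the names gives a top-down path to the element. A small sketch using the markup of Example 2, written on one line:

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><div id="inner"><p>Hello<b>World</b></p></div></div>'
soup = BeautifulSoup(html, "html.parser")

# Collect ancestor names bottom-up, then reverse them for a top-down path.
names = [parent.name for parent in soup.b.parents]
path = " > ".join(reversed(names))

print(names)
print(path)
```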
Beautiful Soup – select() Method

Method Description

In the Beautiful Soup library, the select() method is an important tool for scraping an HTML/XML document. Like find() and the other find_*() methods, select() locates elements that satisfy given criteria. The elements are selected from the document tree based on the CSS selector passed as an argument.

Beautiful Soup also has a select_one() method. The difference is that select() returns a ResultSet of all the elements under the PageElement that match the CSS selector, whereas select_one() returns only the first matching element.

Prior to Beautiful Soup version 4.7, select() supported only the most common CSS selectors. In version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library, so many more selectors can now be used. Version 4.12 added a .css property alongside the existing convenience methods, select() and select_one().

Syntax

    select(selector, namespaces=None, limit=None, **kwargs)

Parameters

selector − A string containing a CSS selector.
namespaces − A dictionary mapping the namespace prefixes used in the selector to namespace URIs. By default, the prefixes encountered while parsing the document are used.
limit − After finding this number of results, stop looking.
kwargs − Keyword arguments to be passed on to Soup Sieve.

With limit set to 1, select() finds the same element as select_one(), but still returns it inside a ResultSet.

Return Value

The select() method returns a ResultSet of Tag objects. The select_one() method returns a single Tag object.

The Soup Sieve library supports different types of CSS selectors. The basic CSS selectors are −

Type selectors match elements by node name. For example −

    tags = soup.select('div')

The universal selector (*) matches elements of any type. Example −

    tags = soup.select('*')

The ID selector matches an element based on its id attribute. The symbol # denotes the ID selector. Example −

    tags = soup.select("#nm")

The class selector matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −

    tags = soup.select(".submenu")

Example: Type Selector

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    '''

    soup = BeautifulSoup(markup, 'html.parser')
    tags = soup.select('div')
    print(tags)

Output

    [<div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>]

Example: ID selector

    from bs4 import BeautifulSoup

    html = '''
    <form>
    <input type = 'text' id = 'nm' name = 'name'>
    <input type = 'text' id = 'age' name = 'age'>
    <input type = 'text' id = 'marks' name = 'marks'>
    </form>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.select("#nm")
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>]

Example: class selector

    from bs4 import BeautifulSoup

    html = '''
    <ul>
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Rani</li>
    <li class="submenu">Ankita</li>
    </ul>
    </ul>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.select(".mainmenu")
    print(tags)

Output

    [<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]
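The relationship between select(), its limit argument and select_one() can be sketched as follows. The markup is a shortened version of the menu used above, and the combinator selector "ul ul > li" is an extra illustration of what Soup Sieve accepts, not one of the basic selectors listed earlier.

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match as a single Tag.
first = soup.select_one(".submenu")

# select() with limit=1 finds the same element but wraps it in a ResultSet.
limited = soup.select(".submenu", limit=1)

# Descendant and child combinators: <li> children of a nested <ul>.
nested = [li.get_text() for li in soup.select("ul ul > li")]

print(first)
print(limited)
print(nested)
```

When nothing matches, select_one() returns None while select() returns an empty ResultSet, so the two need slightly different checks.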