Beautiful Soup – new_tag() Method

The new_tag() method in the Beautiful Soup library creates a new Tag object that is associated with an existing BeautifulSoup object. It is a factory method: the tag it returns is not part of the document tree until you append or insert it.

Syntax

    new_tag(name, namespace, nsprefix, attrs, sourceline, sourcepos, **kwattrs)

Parameters

name − The name of the new Tag.
namespace − The URI of the new Tag's XML namespace, optional.
nsprefix − The prefix for the new Tag's XML namespace, optional.
attrs − A dictionary of this Tag's attribute values.
sourceline − The line number where this tag was found in its source document.
sourcepos − The character position within `sourceline` where this tag was found.
kwattrs − Keyword arguments for the new Tag's attribute values.

Return Value

This method returns a new Tag object.

Example 1

The following example shows the use of the new_tag() method. A new tag is created for an <a> element. The tag object is given an href attribute and a string, and is then inserted in the document tree before the <b> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Welcome to <b>online Tutorial library</b></p>', 'html.parser')
    tag = soup.new_tag('a')
    tag.attrs['href'] = "www.tutorialspoint.com"
    tag.string = "Tutorialspoint"
    soup.b.insert_before(tag)
    print(soup)

Output

    <p>Welcome to <a href="www.tutorialspoint.com">Tutorialspoint</a><b>online Tutorial library</b></p>

Example 2

In the following example, we have an HTML form with two input elements. We create a new input tag and append it to the form tag.

    html = """
    <form>
    <input type = "text" id = "nm" name = "name">
    <input type = "text" id = "age" name = "age">
    </form>"""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.form
    newtag = soup.new_tag('input', attrs={'type':'text', 'id':'marks', 'name':'marks'})
    tag.append(newtag)
    print(soup)

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/></form>

Example 3

Here we have an empty <p> tag in the HTML string. A new tag is inserted in it.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p></p>', 'html.parser')
    tag = soup.new_tag('b')
    tag.string = "Hello World"
    soup.p.insert(0, tag)
    print(soup)

Output

    <p><b>Hello World</b></p>
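The attrs dictionary and the keyword-argument form described under Parameters can be compared side by side. The sketch below is only an illustration: the URL and the attribute values are made up, and the markup is not taken from the examples above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Welcome</p>", "html.parser")

# Attributes passed as keyword arguments (the **kwattrs parameter).
link = soup.new_tag("a", href="https://example.com")
link.string = "Example"

# "name" is also the first parameter of new_tag(), so an attribute
# called "name" has to go through the attrs dictionary instead.
field = soup.new_tag("input", attrs={"name": "email"})

# Neither tag is in the tree until it is appended or inserted.
soup.p.append(link)
soup.p.append(field)
print(soup)
```

Keyword arguments are convenient for simple attributes, while the attrs dictionary is the safer choice for attribute names that clash with parameter names or are not valid Python identifiers.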
Beautiful Soup – extend() Method

Method Description

The extend() method has been available on the Tag class since Beautiful Soup version 4.7. It appends all the elements of a list to the tag's contents, in order. This method is analogous to the extend() method of a standard Python list.

Syntax

    extend(tags)

Parameters

tags − A list of strings or NavigableString objects to be appended.

Return Type

The extend() method doesn't return any new object; it modifies the tag in place.

Example

    from bs4 import BeautifulSoup

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    vals = ['World.', 'Welcome to ', 'TutorialsPoint']
    tag.extend(vals)
    print(soup.prettify())

Output

    <b>
     Hello
     World.
     Welcome to
     TutorialsPoint
    </b>
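Because extend() appends each item in turn, the list does not have to contain only strings. The following sketch assumes a Tag created with new_tag() can be mixed in with plain strings; the markup is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Hello</b>", "html.parser")
tag = soup.b

# Build a tag to mix in with plain strings.
em = soup.new_tag("i")
em.string = "World"

# Each item is appended to <b> in list order.
tag.extend([", ", em, "!"])
print(soup)
```

After the call, the <b> tag holds four children: the original string, the comma, the <i> tag, and the exclamation mark.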
Beautiful Soup – find_next_siblings() Method

Method Description

The find_next_siblings() method is related to the next_sibling property. Where next_sibling returns only the element immediately following the current one, find_next_siblings() finds all siblings at the same level of this PageElement that match the given criteria and appear later in the document.

Syntax

    find_next_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than tag).
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_siblings() method returns a ResultSet (a list) of Tag or NavigableString objects.

Example 1

Let us use the following HTML snippet for this purpose −

    <p>
       <b>
          Excellent
       </b>
       <i>
          Python
       </i>
       <u>
          Tutorial
       </u>
    </p>

In the code below, we find all the siblings that follow the <b> tag. There are two more tags at the same level in the HTML string used for scraping.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

    tag1 = soup.find('b')
    print("next siblings:")
    for tag in tag1.find_next_siblings():
        print(tag)

Output

The ResultSet returned by find_next_siblings() is iterated with a for loop.

    next siblings:
    <i>Python</i>
    <u>Tutorial</u>

Example 2

If there are no siblings to be found after a tag, this method returns an empty list.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

    tag1 = soup.find('u')
    print("next siblings:")
    print(tag1.find_next_siblings())

Output

    next siblings:
    []
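The name and limit parameters listed above narrow the result. The sketch below uses a slightly extended version of the snippet (a second <i> tag is added purely for illustration) to show both filters next to the next_sibling property.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>Excellent</b><i>Python</i><u>Tutorial</u><i>Series</i></p>",
    "html.parser",
)
first = soup.find("b")

# Only the later siblings whose tag name is "i".
only_i = first.find_next_siblings("i")

# Stop after the first match, whatever its name.
first_one = first.find_next_siblings(limit=1)

# next_sibling gives just the immediately following element.
nearest = first.next_sibling

print([t.get_text() for t in only_i])
print(first_one)
print(nearest)
```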
Beautiful Soup – next_element Property

Method Description

In the Beautiful Soup library, the next_element property returns the Tag or NavigableString that was parsed immediately after the current PageElement, even if it lies outside the current element's parent. There is also a next property which has similar behaviour.

Syntax

    Element.next_element

Return value

The next_element and next properties return the Tag or NavigableString appearing immediately after the current element.

Example 1

In the document tree parsed from the given HTML string, we find the next_element of the <b> tag.

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    tag = soup.b
    print(tag)
    nxt = tag.next_element
    print("Next:", nxt)
    nxt = tag.next_element.next_element
    print("Next:", nxt)

Output

    <b>Excellent</b>
    Next: Excellent
    Next: <p>Python</p>

The output may look surprising: the next element for <b>Excellent</b> is shown to be 'Excellent'. That is because the string inside the tag is parsed right after the tag itself, so it is registered as the next element. To obtain <p>Python</p>, fetch the next_element property of the inner NavigableString object.

Example 2

BeautifulSoup PageElements also support the next property, which is analogous to next_element.

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    tag = soup.b
    print(tag)
    nxt = tag.next
    print("Next:", nxt)
    nxt = tag.next.next
    print("Next:", nxt)

Output

    <b>Excellent</b>
    Next: Excellent
    Next: <p>Python</p>

Example 3

In the next example, we determine the element next to the <body> tag of a file named index.html. As the <body> tag is followed by a line break (\n), we need the next element of the one next to the body tag. It happens to be the <h1> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('body')
    nxt = tag.next_element.next
    print("Next:", nxt)

Output

    Next: <h1>TutorialsPoint</h1>
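The difference between parse order (next_element) and tree position (next_sibling) is easiest to see on a small snippet. The sketch below uses illustrative markup and also walks the rest of the document with the next_elements generator.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i></p>", "html.parser")
b = soup.b

# next_element follows parse order: the string inside <b> comes first.
inner = b.next_element

# next_sibling follows the tree: the <i> tag shares <b>'s parent.
sibling = b.next_sibling

# next_elements yields everything parsed after <b>, in order.
walk = [f"<{el.name}>" if isinstance(el, Tag) else str(el)
        for el in b.next_elements]

print(repr(inner))
print(sibling)
print(walk)
```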
Beautiful Soup – find_parents() Method

Method Description

The find_parents() method in the BeautifulSoup package finds all parents of this Element that match the given criteria.

Syntax

    find_parents(name, attrs, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_parents() method returns a ResultSet consisting of all the matching parent elements, ordered from the nearest parent outward to the top of the document.

Example 1

We shall use the following HTML script in this example −

    <html>
       <body>
          <h2>Departmentwise Employees</h2>
          <ul id="dept">
          <li>Accounts</li>
             <ul id="acc">
             <li>Anand</li>
             <li>Mahesh</li>
             </ul>
          <li>HR</li>
             <ol id="HR">
             <li>Rani</li>
             <li>Ankita</li>
             </ol>
          </ul>
       </body>
    </html>

Assuming the snippet is stored in a string named html, a call of the following form, which is Example 2 without the limit argument, lists every parent of the first <li> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.find('li')
    parents = obj.find_parents()
    for parent in parents:
        print(parent.name)

Output

    ul
    body
    html
    [document]

Note that the name property of the BeautifulSoup object always returns [document].

Example 2

In this example, the limit argument is passed to the find_parents() method to restrict the parent search to two levels up.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.find('li')
    parents = obj.find_parents(limit=2)
    for parent in parents:
        print(parent.name)

Output

    ul
    body
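The name filter from the parameter list restricts the search to ancestors of one kind. The sketch below works on a trimmed-down version of the department list (an assumption made to keep it short) and compares find_parents() with the singular find_parent().

```python
from bs4 import BeautifulSoup

html = """
<ul id="dept">
  <li>Accounts</li>
  <ul id="acc">
    <li>Anand</li>
  </ul>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Start from the <li> whose text is "Anand".
item = soup.find("li", string="Anand")

# Every <ul> ancestor, nearest first.
ids = [ul["id"] for ul in item.find_parents("ul")]

# find_parent() returns only the closest match.
closest = item.find_parent("ul")["id"]

print(ids)
print(closest)
```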
Beautiful Soup – find_next() Method

Method Description

The find_next() method in Beautiful Soup finds the first PageElement that matches the given criteria and appears later in the document. Called without arguments, it returns the first tag that comes after the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

    find_next(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_next() method returns a Tag or a NavigableString.

Example 1

A web page index.html with the following script has been used for this example −

    <html>
       <head>
          <title>TutorialsPoint</title>
       </head>
       <body>
          <h1>TutorialsPoint</h1>
          <form>
             <input type = "text" id = "nm" name = "name">
             <input type = "text" id = "age" name = "age">
             <input type = "text" id = "marks" name = "marks">
          </form>
       </body>
    </html>

We first locate the <h1> tag and then the tag next to it, which is the <form> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.h1
    print(tag.find_next())

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
    </form>

Example 2

In this example, we first locate the <input> tag with name="age" and obtain its next tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('input', {'name':'age'})
    print(tag.find_next())

Output

    <input id="marks" name="marks" type="text"/>

Example 3

The tag next to the <head> tag happens to be the <title> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.head
    print(tag.find_next())

Output

    <title>TutorialsPoint</title>
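The filters in the parameter list let find_next() skip ahead to a particular element instead of stopping at the very next tag. The sketch below inlines a shortened version of the page as a string (so no index.html file is needed) and tries each filter.

```python
from bs4 import BeautifulSoup

html = """<body><h1>TutorialsPoint</h1>
<form><input type="text" name="name"><input type="text" name="age"></form>
</body>"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.h1

# Filter on tag name: the first <input> after <h1>.
first_input = h1.find_next("input")

# Filter on tag name and attribute value.
age = h1.find_next("input", {"name": "age"})

# string=True matches the first string after <h1> in parse order,
# which is the text inside <h1> itself.
text = h1.find_next(string=True)

# No match returns None rather than raising an error.
missing = h1.find_next("table")

print(first_input, age, text, missing, sep="\n")
```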
Beautiful Soup – decode() Method

Method Description

The decode() method in Beautiful Soup returns a Unicode string representation of the parse tree as an HTML or XML document. Its function is the opposite of the encode() method: you call encode() to get a bytestring, and decode() to get Unicode. Let us study the decode() method with some examples.

Syntax

    decode(pretty_print, encoding, formatter, errors)

Parameters

pretty_print − If this is True, indentation will be used to make the document more readable.
encoding − The encoding of the final document. If this is None, the document will be a Unicode string.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − The error handling scheme to use for the handling of decoding errors. Values are 'strict', 'ignore' and 'replace'.

Return Value

The decode() method returns a Unicode string.

Example

In this example, soup.encode() produces a bytes object. The decode() call that follows is made on that bytes object, so it is Python's built-in bytes.decode(), which decodes the bytes using the codec registered for the encoding and gives back the original text.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello “World!”", 'html.parser')
    enc = soup.encode('utf-8')
    print(enc)
    dec = enc.decode()
    print(dec)

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
    Hello “World!”
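To see the parse tree's own decode() next to encode(), the two can be called on the same object. This is a minimal sketch that relies only on the default arguments; the <p> wrapper is added for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello “World!”</p>", "html.parser")

# decode() on the parse tree returns a Unicode string (str).
as_text = soup.decode()

# encode() on the parse tree returns a bytestring (bytes).
as_bytes = soup.encode("utf-8")

# Decoding those bytes with the built-in bytes.decode() gives
# back the same text that soup.decode() produced.
round_trip = as_bytes.decode("utf-8")

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes)
```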
Beautiful Soup – contents Property

Method Description

The contents property is available with the Soup object as well as the Tag object. It returns a list of everything that is contained directly inside the object: all the immediate child elements and text nodes (i.e. NavigableString objects).

Syntax

    Tag.contents

Return value

The contents property returns a list of child elements and strings in the Tag/Soup object.

Example 1

Contents of a tag object −

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.div
    print(tag.contents)

Output

    ['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']

Example 2

Contents of the entire document −

    from bs4 import BeautifulSoup, NavigableString

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.contents)

Output

    ['\n', <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>, '\n']

Example 3

Note that a NavigableString object doesn't have a contents property. An AttributeError is raised if we try to access it.

    from bs4 import BeautifulSoup, NavigableString

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    s = tag.contents[0]
    print(s.contents)

Output

    Traceback (most recent call last):
      File "C:\Users\user\BeautifulSoup\2.py", line 11, in <module>
        print (s.contents)
               ^^^^^^^^^^
      File "C:\Users\user\BeautifulSoup\Lib\site-packages\bs4\element.py", line 984, in __getattr__
        raise AttributeError(
    AttributeError: 'NavigableString' object has no attribute 'contents'
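Since the list returned by contents mixes tags with whitespace strings, a common follow-up is to keep only the Tag objects. The sketch below shows one way to do that with an isinstance() check, using a shortened version of the markup above.

```python
from bs4 import BeautifulSoup, Tag

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# contents includes the newline strings between the <p> tags.
all_children = div.contents

# Keep only real tags; this also avoids touching .contents on a
# NavigableString, which would raise AttributeError.
tags_only = [child for child in all_children if isinstance(child, Tag)]
names = [p.get_text() for p in tags_only]

print(len(all_children), names)
```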
Beautiful Soup – web-scraping

Scraping is simply a process of extracting (by various means), copying and screening data. When we scrape or extract data or feeds from the web (such as from web pages or websites), it is termed web-scraping.

So, web scraping (which is also known as web data extraction or web harvesting) is the extraction of data from the web. In short, web scraping provides a way for developers to collect and analyze data from the internet.

Why Web-scraping?

Web-scraping is a useful tool for automating much of what a human does while browsing. Web-scraping is used in an enterprise in a variety of ways −

Data for Research
Analysts such as researchers and journalists use web scrapers instead of manually collecting and cleaning data from websites.

Products, prices & popularity comparison
There are services which use web scrapers to collect data from numerous online sites and use it to compare the popularity and prices of products.

SEO Monitoring
There are numerous SEO tools such as Ahrefs, Seobility, SEMrush, etc., which are used for competitive analysis and for pulling data from your client's websites.

Search engines
There are some big IT companies whose business solely depends on web scraping.

Sales and Marketing
The data gathered through web scraping can be used by marketers to analyze different niches and competitors, or by sales specialists for selling content marketing or social media promotion services.

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping, as it handles most web-crawling tasks with little effort. Below are some of the reasons to choose Python for web scraping −

Ease of Use
Most developers agree that Python is very easy to code in. We don't have to use curly braces "{ }" or semi-colons ";" anywhere, which makes web scrapers more readable and easier to develop.

Huge Library Support
Python provides a huge set of libraries for different requirements, so it is appropriate for web scraping as well as for data visualization, machine learning, etc.

Easily Explicable Syntax
Python is a very readable programming language, as its syntax is easy to understand. Python is very expressive, and code indentation helps users tell the different blocks or scopes in the code apart.

Dynamically-typed language
Python is a dynamically-typed language, which means the type of a variable is determined by the value assigned to it rather than by a declaration. This saves time when writing code.

Huge Community
The Python community is huge, which helps whenever you get stuck while writing code.
Beautiful Soup – Searching the Tree

In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions: going up and down, sideways, and back and forth.

We shall use the following HTML string in all the examples in this chapter −

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    <p class="title"><b>Online Tutorials Library</b></p>
    <p class="story">TutorialsPoint has an excellent collection of tutorials on:
    <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
    <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
    <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
    Enhance your Programming skills.</p>
    <p class="tutorial">…</p>
    """

The name of the required tag lets you navigate the parse tree. For example, soup.head fetches you the <head> element −

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.head.prettify())

Output

    <head>
     <title>
      TutorialsPoint
     </title>
    </head>

Going down

A tag may contain strings or other tags enclosed in it. The .contents property of a Tag object returns a list of all the child elements belonging to it.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.head
    print(tag.contents)

Output

    [<title>TutorialsPoint</title>]

The returned object is a list, although in this case there is only a single child tag enclosed in the head element.

.children

The .children property gives access to the same enclosed elements, but as an iterator rather than a list. Below, it is wrapped in list() so that all the elements in the body tag are given as a list.
Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.body
    print(list(tag.children))

Output

    ['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n',
    <p class="story">TutorialsPoint has an excellent collection of tutorials on:
    <a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
    <a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
    <a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
    Enhance your Programming skills.</p>, '\n', <p class="tutorial">…</p>, '\n']

Instead of getting them as a list, you can iterate over a tag's children using the .children generator −

Example

    tag = soup.body
    for child in tag.children:
        print(child)

Output

    <p class="title"><b>Online Tutorials Library</b></p>

    <p class="story">TutorialsPoint has an excellent collection of tutorials on:
    <a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
    <a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
    <a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
    Enhance your Programming skills.</p>

    <p class="tutorial">…</p>

.descendants

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on.

The BeautifulSoup object is at the top of the hierarchy of all the tags. Hence its .descendants property includes all the elements in the HTML string.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.descendants)

The .descendants attribute returns a generator, so the print() call above shows only the generator object; the elements themselves are obtained by iterating over it with a for loop. Here, we list out the descendants of the head tag.
Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.head
    for element in tag.descendants:
        print(element)

Output

    <title>TutorialsPoint</title>
    TutorialsPoint

The head tag contains a title tag, which in turn encloses a NavigableString object TutorialsPoint. The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. Similarly, the BeautifulSoup object has the <html> tag as its only child tag, but it has many descendants.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = list(soup.descendants)
    print(len(tags))

Output

    27

Going Up

Just as you navigate downstream in a document with the children and descendants properties, BeautifulSoup offers the .parent and .parents properties to navigate upstream from a tag.

.parent

Every tag and every string has a parent tag that contains it. You can access an element's parent with the parent attribute. In our example, the <head> tag is the parent of the <title> tag.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.title
    print(tag.parent)

Output

    <head><title>TutorialsPoint</title></head>

Since the title tag contains a string (NavigableString), the parent of the string is the title tag itself.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.title
    string = tag.string
    print(string.parent)

Output

    <title>TutorialsPoint</title>

.parents

You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document to the very top of the document. In the following code, we track the parents of the first <a> tag in the example HTML string.
Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.a
    print(tag.string)
    for parent in tag.parents:
        print(parent.name)

Output

    Python
    p
    body
    html
    [document]

Sideways

Tags that are direct children of the same parent are called siblings. Consider the following HTML snippet −

    <p>
       <b>
          Hello
       </b>
       <i>
          Python
       </i>
    </p>

In the outer <p> tag, the <b> and <i> tags sit at the same level, hence they are called siblings. BeautifulSoup makes it possible to navigate between tags at the same level.

.next_sibling and .previous_sibling

These attributes respectively return the next element at the same level and the previous element at the same level.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

    tag1 = soup.b
    print("next:", tag1.next_sibling)

    tag2 = soup.i
    print("previous:", tag2.previous_sibling)

Output

    next: <i>Python</i>
    previous: <b>Hello</b>

Since the <b> tag doesn't have a sibling to its left, and the <i> tag doesn't have a sibling to its right, the corresponding property returns None in both cases.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

    tag1 = soup.b
    print