Beautiful Soup – decode() Method

Method Description

The decode() method in Beautiful Soup returns a Unicode string representation of the parse tree as an HTML or XML document. Its function is the opposite of the encode() method: you call encode() to get a bytestring, and decode() to get a Unicode string. Let us study the decode() method with an example.

Syntax

    decode(pretty_print, encoding, formatter, errors)

Parameters

pretty_print − If this is True, indentation will be used to make the document more readable.
encoding − The encoding of the final document. If this is None, the document will be a Unicode string.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − The error handling scheme to use for decoding errors. Values are "strict", "ignore" and "replace".

Return Value

The decode() method returns a Unicode string.

Example

The following example encodes the parsed document to a bytestring and then decodes it back. Note that decode() is called here on the bytes object returned by encode(), so it is Python's built-in bytes.decode() that decodes the bytes using the codec registered for the encoding.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello “World!”", 'html.parser')
    enc = soup.encode('utf-8')
    print(enc)
    dec = enc.decode()
    print(dec)

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
    Hello “World!”
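The relationship between encode() and decode() described above can also be checked directly on the soup object. The following is a minimal sketch, assuming a small illustrative document (the markup is not from the original example); it calls both methods on the same soup and compares the types of the results.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption for this sketch); any parsed document would do.
soup = BeautifulSoup("<p>Hello “World!”</p>", "html.parser")

text = soup.decode()        # str: the document as Unicode text
raw = soup.encode("utf-8")  # bytes: the same document encoded as UTF-8

print(type(text).__name__)
print(type(raw).__name__)
```

Decoding the bytestring with the same codec should give back the text returned by soup.decode().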
Beautiful Soup – contents Property

Method Description

The contents property is available with the BeautifulSoup object as well as the Tag object. It returns a list of everything that is contained directly inside the object: all the immediate child elements and text nodes (i.e. NavigableString objects).

Syntax

    Tag.contents

Return Value

The contents property returns a list of the child elements and strings in the Tag/BeautifulSoup object.

Example 1

Contents of a Tag object −

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.div
    print(tag.contents)

Output

    ['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']

Example 2

Contents of the entire document −

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.contents)

Output

    ['\n', <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>, '\n']

Example 3

Note that a NavigableString object doesn't have a contents property, because a string cannot contain anything. Accessing it raises AttributeError.

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    s = tag.contents[0]
    print(s.contents)

Output

    Traceback (most recent call last):
      File "C:\Users\user\BeautifulSoup\2.py", line 11, in <module>
        print (s.contents)
               ^^^^^^^^^^
      File "C:\Users\user\BeautifulSoup\Lib\site-packages\bs4\element.py", line 984, in __getattr__
        raise AttributeError(
    AttributeError: 'NavigableString' object has no attribute 'contents'
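Because the contents list mixes tags with whitespace strings such as '\n', it is often useful to keep only the Tag children. The sketch below reuses the markup from the examples above; the variable names child_tags and languages are chosen for illustration only.

```python
from bs4 import BeautifulSoup, Tag

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# Keep only Tag children; whitespace strings such as "\n" are skipped.
child_tags = [child for child in soup.div.contents if isinstance(child, Tag)]

# Convert each tag's single string child to a plain str.
languages = [str(tag.string) for tag in child_tags]
print(languages)
```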
Beautiful Soup – insert() Method

Method Description

The insert() method in Beautiful Soup adds an element at the given position in the list of children of a Tag element. It behaves like insert() on a Python list object.

Syntax

    insert(position, child)

Parameters

position − The position at which the new PageElement should be inserted.
child − A PageElement to be inserted.

Return Type

The insert() method doesn't return any new object.

Example 1

In the following example, a new string is added to the <b> tag at position 1. The prettified document shows the result.

    from bs4 import BeautifulSoup

    markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag.insert(1, "Tutorial ")
    print(soup.prettify())

Output

    <b>
     Excellent
     Tutorial
    </b>
    <u>
     from TutorialsPoint
    </u>

Example 2

In the following example, the insert() method is used to successively insert new <p> tags, one for each string in a list, into a <p> tag in the HTML markup.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent Tutorials from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    langs = ['Python', 'Java', 'C']
    # Position 0 holds the existing text, so the new tags start at position 1.
    for i, lang in enumerate(langs, start=1):
        tag = soup.new_tag('p')
        tag.string = lang
        soup.p.insert(i, tag)
    print(soup.prettify())

Output

    <p>
     Excellent Tutorials from TutorialsPoint
     <p>
      Python
     </p>
     <p>
      Java
     </p>
     <p>
      C
     </p>
    </p>
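Since insert() works like insert() on a Python list, position 0 puts the new child first and a position equal to the current number of children puts it last. The following is a small sketch with made-up markup (an assumption, not taken from the examples above) showing both ends.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b><i>three</i></p>", "html.parser")

# Position 0 places the new string before every existing child.
soup.p.insert(0, "one ")

# A position equal to the number of children places the new child last.
soup.p.insert(len(soup.p.contents), " four")

result = str(soup)
print(result)
```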
Beautiful Soup – clear() Method

Method Description

The clear() method in the Beautiful Soup library removes the inner content of a tag, keeping the tag itself intact. If there are any child elements, the extract() method is called on them. If the decompose argument is set to True, the decompose() method is called instead of extract().

Syntax

    clear(decompose=False)

Parameters

decompose − If this is True, decompose() (a more destructive method) will be called instead of extract().

Return Value

The clear() method doesn't return any object.

Example 1

When the clear() method is called on the soup object that represents the entire document, all the content is removed, leaving the document blank.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    soup.clear()
    print(soup)

Output

The output is blank, since nothing remains in the document.

Example 2

In the following example, we find all the <p> tags and call the clear() method on each of them.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all('p')
    for tag in tags:
        tag.clear()
    print(soup)

Output

The contents of each <p> .. </p> are removed, while the tags themselves are retained.

    <html>
    <body>
    <p></p>
    <p></p>
    <p></p>
    <p></p>
    </body>
    </html>

Example 3

Here we clear the contents of the <body> tag with the decompose argument set to True.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find('body')
    tag.clear(decompose=True)
    print(soup)

Output

    <html>
    <body></body>
    </html>
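The practical difference between the default behaviour and decompose=True is what happens to the removed children. With the default, the children are only extracted, so a reference held before the call remains usable. The sketch below illustrates this with a small made-up document; the variable name kept is an assumption for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")

# Hold a reference to a child before clearing its parent.
kept = soup.b

# Default clear(): the children are extracted, not destroyed.
soup.p.clear()

after_clear = str(soup)
kept_markup = str(kept)
print(after_clear)
print(kept_markup)
```

The extracted tag is detached from the tree (its parent is None) but still holds its own content, so it could be inserted elsewhere. With decompose=True the children are destroyed and should not be reused.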
Beautiful Soup – Convert HTML to Text

One of the most frequently required applications of a web scraping library such as Beautiful Soup is extracting the text from an HTML document. You may need to discard all the tags, along with any attributes associated with them, and separate out the raw text in the document. The get_text() method in Beautiful Soup is suitable for this purpose.

Here is a basic example demonstrating the usage of the get_text() method. It returns all the text in the HTML document with the tags removed.

Example

    html = '''
    <html>
    <body>
    <p> The quick, brown fox jumps over a lazy dog.</p>
    <p> DJs flock by when MTV ax quiz prog.</p>
    <p> Junk MTV quiz graced by fox whelps.</p>
    <p> Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text()
    print(text)

Output

     The quick, brown fox jumps over a lazy dog.
     DJs flock by when MTV ax quiz prog.
     Junk MTV quiz graced by fox whelps.
     Bawds jog, flick quartz, vex nymphs.

The get_text() method has an optional separator argument. The separator is placed between every pair of adjacent strings in the document, including the newline strings between the tags. In the following example, we specify the separator argument of the get_text() method as '#'.

    html = '''
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator='#')
    print(text)

Output

    #The quick, brown fox jumps over a lazy dog.#
    #DJs flock by when MTV ax quiz prog.#
    #Junk MTV quiz graced by fox whelps.#
    #Bawds jog, flick quartz, vex nymphs.#

The get_text() method has another argument, strip, which can be True or False. By default it is False. When it is set to True, leading and trailing whitespace is removed from each string and whitespace-only strings are dropped, so the sentences run together. Let us check the effect of the strip parameter when it is set to True.

    html = '''
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(strip=True)
    print(text)

Output

    The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.
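The separator and strip arguments can be combined, which is a common way to get one clean line of text per element. The following is a minimal sketch using two of the sentences from above; splitting the result into a list named lines is only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p> The quick, brown fox jumps over a lazy dog. </p>
<p> DJs flock by when MTV ax quiz prog. </p>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each string and drops whitespace-only strings;
# the separator is then placed between the strings that remain.
text = soup.get_text(separator="\n", strip=True)
lines = text.split("\n")
print(lines)
```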
Beautiful Soup – new_tag() Method

The new_tag() method in the Beautiful Soup library creates a new Tag object that is associated with an existing BeautifulSoup object. You can use this factory method to create a tag and then append or insert it into the document tree.

Syntax

    new_tag(name, namespace, nsprefix, attrs, sourceline, sourcepos, **kwattrs)

Parameters

name − The name of the new Tag.
namespace − The URI of the new Tag's XML namespace, optional.
nsprefix − The prefix for the new Tag's XML namespace, optional.
attrs − A dictionary of this Tag's attribute values.
sourceline − The line number where this tag was found in its source document.
sourcepos − The character position within `sourceline` where this tag was found.
kwattrs − Keyword arguments for the new Tag's attribute values.

Return Value

This method returns a new Tag object.

Example 1

The following example shows the use of the new_tag() method to create a new <a> element. The tag object is given an href attribute and a string, and is then inserted in the document tree.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Welcome to <b>online Tutorial library</b></p>', 'html.parser')
    tag = soup.new_tag('a')
    tag.attrs['href'] = "www.tutorialspoint.com"
    tag.string = "Tutorialspoint"
    soup.b.insert_before(tag)
    print(soup)

Output

    <p>Welcome to <a href="www.tutorialspoint.com">Tutorialspoint</a><b>online Tutorial library</b></p>

Example 2

In the following example, we have an HTML form with two input elements. We create a new input tag and append it to the form tag.

    html = '''
    <form>
    <input type = 'text' id = 'nm' name = 'name'>
    <input type = 'text' id = 'age' name = 'age'>
    </form>'''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.form
    newtag = soup.new_tag('input', attrs={'type':'text', 'id':'marks', 'name':'marks'})
    tag.append(newtag)
    print(soup)

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/></form>

Example 3

Here we have an empty <p> tag in the HTML string. A new tag is inserted in it.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p></p>', 'html.parser')
    tag = soup.new_tag('b')
    tag.string = "Hello World"
    soup.p.insert(0, tag)
    print(soup)

Output

    <p><b>Hello World</b></p>
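The kwattrs parameter listed above means that simple attributes can be passed to new_tag() as keyword arguments instead of through an attrs dictionary. The sketch below shows this form; the markup and the example.com URL are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Visit </p>", "html.parser")

# Attributes given as keyword arguments become attributes of the new tag.
link = soup.new_tag("a", href="https://www.example.com")
link.string = "Example"

# The new tag is not part of the tree until it is appended or inserted.
soup.p.append(link)

result = str(soup)
print(result)
```

The attrs dictionary remains the safer choice for attribute names that are not valid Python keyword names, or that clash with the method's own parameters such as name.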
Beautiful Soup – extend() Method

Method Description

The extend() method in Beautiful Soup has been available in the Tag class from version 4.7 onwards. It adds all the elements in a list to the tag, in order. This method is analogous to the extend() method of a standard Python list: it takes a list of elements and appends each of them to the tag's contents.

Syntax

    extend(tags)

Parameters

tags − A list of strings or NavigableString objects to be appended.

Return Type

The extend() method doesn't return any new object.

Example

    from bs4 import BeautifulSoup

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    vals = ['World.', 'Welcome to ', 'TutorialsPoint']
    tag.extend(vals)
    print(soup.prettify())

Output

    <b>
     Hello
     World.
     Welcome to
     TutorialsPoint
    </b>
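The list passed to extend() is not limited to strings. As a sketch, assuming an empty <ul> element as the target (illustrative markup, not from the example above), a list of newly created Tag objects can be added in one call, which has the same effect as calling append() for each item in turn.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

# Build a list of new <li> tags, one per name.
items = []
for name in ["Python", "Java"]:
    li = soup.new_tag("li")
    li.string = name
    items.append(li)

# Add all of them to the <ul> in order.
soup.ul.extend(items)

result = str(soup)
print(result)
```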
Beautiful Soup – find_next_siblings() Method

Method Description

The find_next_siblings() method is related to the next_sibling property. It finds all the siblings at the same level as this PageElement that match the given criteria and appear later in the document.

Syntax

    find_next_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than tag).
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_siblings() method returns a list of Tag or NavigableString objects.

Example 1

Let us use the following HTML snippet for this purpose −

    <p>
       <b>
          Excellent
       </b>
       <i>
          Python
       </i>
       <u>
          Tutorial
       </u>
    </p>

In the code below, we find all the siblings that follow the <b> tag. There are two more tags at the same level in the HTML string used for scraping.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
    tag1 = soup.find('b')
    print("next siblings:")
    for tag in tag1.find_next_siblings():
        print(tag)

Output

The ResultSet returned by find_next_siblings() is iterated with a for loop.

    next siblings:
    <i>Python</i>
    <u>Tutorial</u>

Example 2

If there are no siblings to be found after a tag, this method returns an empty list.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
    tag1 = soup.find('u')
    print("next siblings:")
    print(tag1.find_next_siblings())

Output

    next siblings:
    []
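The name and limit parameters listed above narrow the search. The sketch below extends the snippet from the examples with one extra <i> tag (an assumption made so that the filter has two matches) and shows both parameters in use.

```python
from bs4 import BeautifulSoup

# Same snippet as above, plus a second <i> tag added for illustration.
soup = BeautifulSoup(
    "<p><b>Excellent</b><i>Python</i><u>Tutorial</u><i>Guide</i></p>",
    "html.parser",
)
start = soup.find("b")

# Only the following siblings named <i>.
italics = [str(tag) for tag in start.find_next_siblings("i")]

# Stop after the first matching sibling.
first_italic = [str(tag) for tag in start.find_next_siblings("i", limit=1)]

print(italics)
print(first_italic)
```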
Beautiful Soup – Navigating by Tags

Tags are among the most important elements of any HTML document. A tag may contain other tags and strings (the tag's children). Beautiful Soup provides different ways to navigate and iterate over a tag's children.

The easiest way to navigate the parse tree is to refer to a tag by its name.

soup.head

The soup.head attribute returns the <head> .. </head> element of an HTML page, together with its contents.

Consider the following HTML page (index.html) to be scraped:

    <html>
       <head>
          <title>TutorialsPoint</title>
          <script>
             document.write("Welcome to TutorialsPoint");
          </script>
       </head>
       <body>
          <h1>Tutorialspoint Online Library</h1>
          <p><b>It's all Free</b></p>
       </body>
    </html>

The following code extracts the <head> element.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    print(soup.head)

Output

    <head>
    <title>TutorialsPoint</title>
    <script>
             document.write("Welcome to TutorialsPoint");
          </script>
    </head>

soup.body

Similarly, to return the body part of the HTML page, use soup.body.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    print(soup.body)

Output

    <body>
    <h1>Tutorialspoint Online Library</h1>
    <p><b>It's all Free</b></p>
    </body>

You can also extract a specific tag (such as the first <h1> tag) inside the <body> tag.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    print(soup.body.h1)

Output

    <h1>Tutorialspoint Online Library</h1>

soup.p

Our HTML file contains a <p> tag. We can extract this tag by its name. If the document had several <p> tags, soup.p would return only the first one.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    print(soup.p)

Output

    <p><b>It's all Free</b></p>

Tag.contents

A Tag object may contain one or more PageElements. The Tag object's contents property returns a list of all the elements directly included in it.

Let us find the elements in the <head> tag of our index.html file.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.head
    print(tag.contents)

Output

    ['\n', <title>TutorialsPoint</title>, '\n', <script>
             document.write("Welcome to TutorialsPoint");
          </script>, '\n']

Tag.children

The structure of tags in an HTML document is hierarchical: the elements are nested one inside the other. For example, the top level <html> tag includes the <head> and <body> tags, each of which may have other tags in it.

The Tag object has a children property that returns an iterator over the enclosed PageElements.

To demonstrate the children property, we shall use the following HTML document (index.html). In the <body> section, there is a top level <ul> list whose items are each followed by another <ul> list nested inside the top level one.

    <html>
       <head>
          <title>TutorialsPoint</title>
       </head>
       <body>
          <h2>Departmentwise Employees</h2>
          <ul>
             <li>Accounts</li>
             <ul>
                <li>Anand</li>
                <li>Mahesh</li>
             </ul>
             <li>HR</li>
             <ul>
                <li>Rani</li>
                <li>Ankita</li>
             </ul>
          </ul>
       </body>
    </html>

The following Python code gives a list of all the child elements of the top level <ul> tag.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.ul
    print(list(tag.children))

Output

    ['\n', <li>Accounts</li>, '\n', <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>, '\n', <li>HR</li>, '\n', <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>, '\n']

Since the children property returns an iterator, we can use a for loop to traverse the immediate children.

Example

    for child in tag.children:
        print(child)

Output (the blank lines printed for the newline strings are omitted)

    <li>Accounts</li>
    <ul>
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ul>
    <li>Rani</li>
    <li>Ankita</li>
    </ul>

Tag.find_all()

This method returns a result set containing all the tags that match the tag name given as the argument.

Let us consider the following HTML page (index.html) for this −

    <html>
       <body>
          <h1>Tutorialspoint Online Library</h1>
          <p><b>It's all Free</b></p>
          <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
          <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
          <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
          <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
          <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
       </body>
    </html>

The following code lists all the <a> elements.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    result = soup.find_all("a")
    print(result)

Output

    [
       <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
       <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
       <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
       <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
       <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
    ]
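The tags returned by find_all() are ordinary Tag objects, so their attributes and text can be read one by one. The following sketch parses a shortened copy of the page above from a string rather than from index.html (an assumption made to keep it self-contained) and collects the link text and href of each <a> tag.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page above, given inline instead of read from a file.
html = """
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each link's text to its href attribute.
links = {a.get_text(): a["href"] for a in soup.find_all("a")}

# limit stops the search after the given number of matches.
first_ids = [a["id"] for a in soup.find_all("a", limit=1)]

print(links)
print(first_ids)
```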
Beautiful Soup – Find Elements by Attribute

The find() and find_all() methods find one tag or all the tags in the document that match the arguments passed to them. You can pass the attrs parameter to these methods. The value of attrs must be a dictionary with one or more tag attributes and their values.

To check the behaviour of these methods, we shall use the following HTML document (index.html):

    <html>
       <head>
          <title>TutorialsPoint</title>
       </head>
       <body>
          <form>
             <input type = 'text' id = 'nm' name = 'name'>
             <input type = 'text' id = 'age' name = 'age'>
             <input type = 'text' id = 'marks' name = 'marks'>
          </form>
       </body>
    </html>

Using find_all()

The following program returns a list of all the tags that have the attribute type="text".

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    obj = soup.find_all(attrs={"type":"text"})
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using find()

The find() method returns the first tag in the parsed document that has the given attributes.

    obj = soup.find(attrs={"name":"marks"})

Using select()

The select() method takes a CSS selector. To match on an attribute, write the attribute name in square brackets. The method returns a list of all the tags that match the selector.

In the following code, the select() method returns all the tags that have a type attribute.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    obj = soup.select("[type]")
    print(obj)

Output

    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using select_one()

The select_one() method is similar, except that it returns only the first tag satisfying the given selector.

    obj = soup.select_one("[name='marks']")

Output

    <input id="marks" name="marks" type="text"/>
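Since the attrs dictionary may hold more than one attribute, a tag has to match all of them to be returned. The sketch below parses the same form from an inline string (an assumption, to avoid depending on index.html) and shows the same two-attribute filter written once with find_all() and once as a CSS selector for select().

```python
from bs4 import BeautifulSoup

# The form from index.html above, given inline to keep the sketch self-contained.
html = """
<form>
<input type='text' id='nm' name='name'>
<input type='text' id='age' name='age'>
<input type='text' id='marks' name='marks'>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# A tag must match every key/value pair in the attrs dictionary.
by_attrs = soup.find_all(attrs={"type": "text", "name": "age"})

# The same filter as a CSS selector with two attribute conditions.
by_css = soup.select('input[type="text"][name="age"]')

ids_by_attrs = [tag["id"] for tag in by_attrs]
ids_by_css = [tag["id"] for tag in by_css]
print(ids_by_attrs)
print(ids_by_css)
```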