Beautiful Soup – next_elements Property

Method Description

In the Beautiful Soup library, the next_elements property returns a generator that yields every string and tag parsed after the current element, in document order. The sequence starts with the element's own children.

Syntax

    Element.next_elements

Return value

The next_elements property returns a generator.

Example 1

The next_elements property returns the tags and NavigableStrings that appear after the <b> tag in the document string below −

    html = '''
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('b')
    nexts = tag.next_elements
    print("Next elements:")
    for element in nexts:
        print(element)

Output

    Next elements:
    Excellent
    <p>Python</p>
    Python
    <p id="id1">Tutorial</p>
    Tutorial

Example 2

All the elements appearing after the <p> tag are listed below −

    from bs4 import BeautifulSoup

    html = '''
    <p>
    <b>Excellent</b><i>Python</i>
    </p>
    <u>Tutorial</u>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    tag1 = soup.find('p')
    print("Next elements:")
    print(list(tag1.next_elements))

Output

    Next elements:
    ['\n', <b>Excellent</b>, 'Excellent', <i>Python</i>, 'Python', '\n', '\n', <u>Tutorial</u>, 'Tutorial', '\n']

Example 3

The elements that follow the first <input> tag in the HTML form of index.html are listed below −

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html5lib')
    tag = soup.find('input')
    # Iterate forward from the tag that was found, not backward from the soup.
    nexts = tag.next_elements
    print("Next elements:")
    for element in nexts:
        print(element)

Output

    Next elements:
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
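Because next_elements mixes tags and strings, it is often filtered by type. The following is a minimal sketch of that idea; the markup and the variable names following_tags and following_text are chosen here for illustration only.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<p><b>Excellent</b><i>Python</i></p><u>Tutorial</u>"
soup = BeautifulSoup(html, "html.parser")
start = soup.find("b")

# Keep only the Tag objects that come after <b>.
following_tags = [el.name for el in start.next_elements if isinstance(el, Tag)]

# Keep only the text nodes (NavigableString objects) that come after <b>.
following_text = [str(el) for el in start.next_elements
                  if isinstance(el, NavigableString)]

print(following_tags)   # ['i', 'u']
print(following_text)   # ['Excellent', 'Python', 'Tutorial']
```

Each access to next_elements creates a fresh generator, so the property can be read twice as shown.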
Author: user
Beautiful Soup – find_next() Method

Method Description

The find_next() method in Beautiful Soup finds the first PageElement that matches the given criteria and appears later in the document. It returns the first tag or NavigableString that comes after the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

    find_next(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_next() method returns a Tag or a NavigableString, or None if nothing matches.

Example 1

A web page index.html with the following markup has been used for this example −

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h1>TutorialsPoint</h1>
    <form>
    <input type = "text" id = "nm" name = "name">
    <input type = "text" id = "age" name = "age">
    <input type = "text" id = "marks" name = "marks">
    </form>
    </body>
    </html>

We first locate the <h1> tag and then the tag next to it, which is the <form> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.h1
    print(tag.find_next())

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
    </form>

Example 2

In this example, we first locate the <input> tag whose name is "age" and obtain the tag next to it.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.find('input', {'name': 'age'})
    print(tag.find_next())

Output

    <input id="marks" name="marks" type="text"/>

Example 3

The tag next to the <head> tag happens to be the <title> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.head
    print(tag.find_next())

Output

    <title>TutorialsPoint</title>
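The examples above call find_next() without arguments. The sketch below shows how the name and attrs filters narrow the search, and what happens when nothing matches. It uses an inline copy of part of the form instead of index.html so that it runs on its own; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h1>TutorialsPoint</h1>
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
</form>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.h1

# Filter on tag name: the first <input> that appears after <h1>.
first_input = h1.find_next("input")

# Filter on tag name and attribute value.
age_input = h1.find_next("input", {"name": "age"})

# No <table> follows <h1>, so the result is None.
missing = h1.find_next("table")

print(first_input["id"])  # nm
print(age_input["id"])    # age
print(missing)            # None
```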
Beautiful Soup – decode() Method

Method Description

The decode() method in Beautiful Soup returns a Unicode string representation of the parse tree as an HTML or XML document. Its function is the opposite of the encode() method: you call encode() to get a bytestring, and decode() to get Unicode. Let us study the decode() method with an example.

Syntax

    decode(pretty_print, encoding, formatter, errors)

Parameters

pretty_print − If this is True, indentation will be used to make the document more readable.
encoding − The encoding of the final document. If this is None, the document will be a Unicode string.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − The error handling scheme to use for the handling of decoding errors. Values are 'strict', 'ignore' and 'replace'.

Return Value

The decode() method returns a Unicode string.

Example

In this example, encode() turns the document into a UTF-8 bytestring. The decode() call is then made on that bytes object, so it is Python's built-in bytes.decode(), which converts the bytes back to a Unicode string using the given codec (UTF-8 by default).

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello “World!”", 'html.parser')
    enc = soup.encode('utf-8')
    print(enc)
    dec = enc.decode()
    print(dec)

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
    Hello “World!”
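To complement the example above, here is a minimal sketch that calls decode() directly on the soup object and compares it with encode(). The markup and the variable names text and raw are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello “World!”</p>", "html.parser")

# decode() on the soup object renders the tree as a Unicode string (str).
text = soup.decode()

# encode() renders the same tree as a bytestring in the requested encoding.
raw = soup.encode("utf-8")

print(type(text).__name__)          # str
print(type(raw).__name__)           # bytes
print(raw.decode("utf-8") == text)  # True
```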
Beautiful Soup – contents Property

Method Description

The contents property is available with the BeautifulSoup object as well as the Tag object. It returns a list of everything that is directly contained inside the object: all the immediate child elements and text nodes (i.e. NavigableString objects).

Syntax

    Tag.contents

Return value

The contents property returns a list of the child elements and strings in the Tag/BeautifulSoup object.

Example 1

Contents of a tag object −

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.div
    print(tag.contents)

Output

    ['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']

Example 2

Contents of the entire document −

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.contents)

Output

    ['\n', <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>, '\n']

Example 3

Note that a NavigableString object doesn't have a contents property. An AttributeError is raised if we try to access it.

    from bs4 import BeautifulSoup

    markup = '''
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    '''
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    s = tag.contents[0]
    print(s.contents)

Output

    Traceback (most recent call last):
      File "C:\Users\user\BeautifulSoup2.py", line 11, in <module>
        print (s.contents)
               ^^^^^^^^^^
      File "C:\Users\user\BeautifulSoup\Lib\site-packages\bs4\element.py", line 984, in __getattr__
        raise AttributeError(
    AttributeError: 'NavigableString' object has no attribute 'contents'
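As the outputs above show, contents includes the whitespace strings between tags. A common follow-up is to keep only the child tags. The sketch below is one way to do that, with variable names chosen for illustration.

```python
from bs4 import BeautifulSoup, Tag

markup = """
<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# contents holds 7 items: 3 <p> tags and the 4 newline strings around them.
print(len(div.contents))  # 7

# Keep only the Tag children and read their text.
child_tags = [child for child in div.contents if isinstance(child, Tag)]
languages = [tag.get_text() for tag in child_tags]
print(languages)          # ['Java', 'Python', 'C++']
```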
Beautiful Soup – insert() Method

Method Description

The insert() method in Beautiful Soup adds an element at the given position in the list of children of a Tag element. It behaves like insert() on a Python list object.

Syntax

    insert(position, child)

Parameters

position − The position at which the new PageElement should be inserted.
child − A PageElement to be inserted.

Return Type

The insert() method doesn't return any new object.

Example 1

In the following example, a new string is added to the <b> tag at position 1. The prettified document shows the result.

    from bs4 import BeautifulSoup

    markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b

    tag.insert(1, "Tutorial ")
    print(soup.prettify())

Output

    <b>
     Excellent
     Tutorial
    </b>
    <u>
     from TutorialsPoint
    </u>

Example 2

In the following example, the insert() method is used to successively insert new <p> tags, one for each string in a list, into the <p> tag of the HTML markup.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent Tutorials from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    langs = ['Python', 'Java', 'C']
    # Position 0 holds the existing text, so the new tags start at position 1.
    for i, lang in enumerate(langs, start=1):
        tag = soup.new_tag('p')
        tag.string = lang
        soup.p.insert(i, tag)

    print(soup.prettify())

Output

    <p>
     Excellent Tutorials from TutorialsPoint
     <p>
      Python
     </p>
     <p>
      Java
     </p>
     <p>
      C
     </p>
    </p>
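The comparison with list.insert() can be made concrete with a short sketch. The markup and names below are illustrative: a new tag is placed between two existing children by giving its target index.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "two"

# Index 1 places the new tag between the two existing children,
# just as list.insert(1, item) would.
soup.p.insert(1, new_tag)

print(soup)  # <p><b>one</b><b>two</b><b>three</b></p>
```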
Beautiful Soup – clear() Method

Method Description

The clear() method in the Beautiful Soup library removes the inner content of a tag, keeping the tag itself intact. If there are any child elements, the extract() method is called on them. If the decompose argument is set to True, the decompose() method is called instead of extract().

Syntax

    clear(decompose=False)

Parameters

decompose − If this is True, decompose() (a more destructive method) will be called instead of extract().

Return Value

The clear() method doesn't return any object.

Example 1

When the clear() method is called on the soup object that represents the entire document, all the content is removed, leaving the document blank.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    soup.clear()
    print(soup)

Output

The output is empty, because the document no longer has any content.

Example 2

In the following example, we find all the <p> tags and call the clear() method on each of them.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all('p')
    for tag in tags:
        tag.clear()
    print(soup)

Output

The contents of each <p> .. </p> are removed; the tags themselves are retained.

    <html>
    <body>
    <p></p>
    <p></p>
    <p></p>
    <p></p>
    </body>
    </html>

Example 3

Here we clear the contents of the <body> tag with the decompose argument set to True.

    html = '''
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    body = soup.find('body')
    body.clear(decompose=True)
    print(soup)

Output

    <html>
    <body></body>
    </html>
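With the default decompose=False, the children are extracted rather than destroyed, so a reference taken beforehand remains usable. The following minimal sketch illustrates this; the markup and variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>keep me</p></div>", "html.parser")
paragraph = soup.p  # keep a reference to the child before clearing

# Default clear(): children are extracted from the tree, not destroyed.
soup.div.clear()

print(soup)              # <div></div>
print(paragraph)         # <p>keep me</p>
print(paragraph.parent)  # None
```

The extracted tag could be inserted elsewhere afterwards. With decompose=True, the children should be treated as unusable after the call.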
Beautiful Soup – Convert HTML to Text

One of the most frequent uses of a web scraping library such as Beautiful Soup is to extract text from an HTML document. You may need to discard all the tags, along with any attributes associated with them, and separate out the raw text in the document.

The get_text() method in Beautiful Soup is suitable for this purpose. Here is a basic example demonstrating its usage. You get all the text from the HTML document with all the HTML tags removed.

Example

    html = '''
    <html>
    <body>
    <p> The quick, brown fox jumps over a lazy dog.</p>
    <p> DJs flock by when MTV ax quiz prog.</p>
    <p> Junk MTV quiz graced by fox whelps.</p>
    <p> Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text()
    print(text)

Output

    The quick, brown fox jumps over a lazy dog.
    DJs flock by when MTV ax quiz prog.
    Junk MTV quiz graced by fox whelps.
    Bawds jog, flick quartz, vex nymphs.

The get_text() method has an optional separator argument. The separator is placed between every pair of adjacent strings in the document, including the newline strings between tags. In the following example, we specify the separator argument of the get_text() method as '#'.

    html = '''
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator='#')
    print(text)

Output

    #The quick, brown fox jumps over a lazy dog.#
    #DJs flock by when MTV ax quiz prog.#
    #Junk MTV quiz graced by fox whelps.#
    #Bawds jog, flick quartz, vex nymphs.#

The get_text() method has another argument, strip, which can be True or False. By default it is False. When it is True, whitespace is stripped from the beginning and end of each string, and strings that become empty are dropped. Let us check the effect of the strip parameter when it is set to True.

    html = '''
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(strip=True)
    print(text)

Output

    The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.
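The separator and strip arguments are often combined so that the stripped strings do not run together. A minimal sketch with shortened, illustrative markup:

```python
from bs4 import BeautifulSoup

html = """
<p>The quick, brown fox.</p>
<p>DJs flock by.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True drops the whitespace-only strings and trims the rest;
# the separator is then placed between the strings that remain.
text = soup.get_text(separator=" ", strip=True)
print(text)  # The quick, brown fox. DJs flock by.

# stripped_strings yields the same pieces one by one.
pieces = list(soup.stripped_strings)
print(pieces)  # ['The quick, brown fox.', 'DJs flock by.']
```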
Beautiful Soup – new_tag() Method

The new_tag() method in the Beautiful Soup library creates a new Tag object that is associated with an existing BeautifulSoup object. You can use this factory method to create a tag and then append or insert it into the document tree.

Syntax

    new_tag(name, namespace, nsprefix, attrs, sourceline, sourcepos, **kwattrs)

Parameters

name − The name of the new Tag.
namespace − The URI of the new Tag's XML namespace, optional.
nsprefix − The prefix for the new Tag's XML namespace, optional.
attrs − A dictionary of this Tag's attribute values.
sourceline − The line number where this tag was found in its source document.
sourcepos − The character position within `sourceline` where this tag was found.
kwattrs − Keyword arguments for the new Tag's attribute values.

Return Value

This method returns a new Tag object.

Example 1

The following example shows the use of the new_tag() method. A new tag is created for an <a> element. Its href attribute and its string are set, and it is then inserted into the document tree before the <b> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Welcome to <b>online Tutorial library</b></p>', 'html.parser')
    tag = soup.new_tag('a')
    tag.attrs['href'] = "www.tutorialspoint.com"
    tag.string = "Tutorialspoint"
    soup.b.insert_before(tag)
    print(soup)

Output

    <p>Welcome to <a href="www.tutorialspoint.com">Tutorialspoint</a><b>online Tutorial library</b></p>

Example 2

In the following example, we have an HTML form with two input elements. We create a new input tag and append it to the form tag.

    html = '''
    <form>
    <input type = 'text' id = 'nm' name = 'name'>
    <input type = 'text' id = 'age' name = 'age'>
    </form>'''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.form
    newtag = soup.new_tag('input', attrs={'type': 'text', 'id': 'marks', 'name': 'marks'})
    tag.append(newtag)
    print(soup)

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/></form>

Example 3

Here we have an empty <p> tag in the HTML string. A new tag is inserted in it.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p></p>', 'html.parser')
    tag = soup.new_tag('b')
    tag.string = "Hello World"
    soup.p.insert(0, tag)
    print(soup)

Output

    <p><b>Hello World</b></p>
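The examples above set attributes through attrs. Attributes can also be passed as keyword arguments (the kwattrs parameter). The sketch below shows this, with an example URL and markup that are placeholders only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Visit </p>", "html.parser")

# Keyword arguments become attributes of the new tag.
link = soup.new_tag("a", href="https://example.com")
link.string = "Example"

soup.p.append(link)
print(soup)  # <p>Visit <a href="https://example.com">Example</a></p>
```

The attrs dictionary remains the safer choice for attribute names that are not valid Python keyword arguments, such as class or data-id.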
Beautiful Soup – extend() Method

Method Description

The extend() method in Beautiful Soup has been available in the Tag class from version 4.7 onwards. It adds all the elements in a list to the tag. This method is analogous to the extend() method of a standard Python list: it takes a list of strings and appends them to the tag's content.

Syntax

    extend(tags)

Parameters

tags − A list of strings or NavigableString objects to be appended.

Return Type

The extend() method doesn't return any new object.

Example

    from bs4 import BeautifulSoup

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    vals = ['World.', 'Welcome to ', 'TutorialsPoint']
    tag.extend(vals)
    print(soup.prettify())

Output

    <b>
     Hello
     World.
     Welcome to
     TutorialsPoint
    </b>
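The list passed to extend() may also hold Tag objects. The following is a minimal sketch, assuming Beautiful Soup 4.7 or later, that builds a few list items with new_tag() and appends them in one call; the markup and names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

# Build the new <li> tags first, then append them all at once.
items = []
for name in ["Java", "Python"]:
    li = soup.new_tag("li")
    li.string = name
    items.append(li)

soup.ul.extend(items)
print(soup)  # <ul><li>Java</li><li>Python</li></ul>
```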
Beautiful Soup – find_next_siblings() Method

Method Description

The find_next_siblings() method is similar to the next_sibling property. It finds all the siblings at the same level as this PageElement that match the given criteria and appear later in the document.

Syntax

    find_next_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than tag).
limit − Stop looking after the specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_siblings() method returns a list of Tag or NavigableString objects.

Example 1

Let us use the following HTML snippet for this purpose −

    <p>
       <b>
          Excellent
       </b>
       <i>
          Python
       </i>
       <u>
          Tutorial
       </u>
    </p>

In the code below, we find all the siblings that follow the <b> tag. There are two more tags at the same level in the HTML string used for scraping.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

    tag1 = soup.find('b')
    print("next siblings:")
    for tag in tag1.find_next_siblings():
        print(tag)

Output

The ResultSet returned by find_next_siblings() is iterated with a for loop.

    next siblings:
    <i>Python</i>
    <u>Tutorial</u>

Example 2

If there are no siblings to be found after a tag, this method returns an empty list.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

    tag1 = soup.find('u')
    print("next siblings:")
    print(tag1.find_next_siblings())

Output

    next siblings:
    []
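The examples above do not pass any filters. The sketch below shows the name and limit parameters in use, on markup extended with a second <i> tag for illustration.

```python
from bs4 import BeautifulSoup

markup = "<p><b>Excellent</b><i>Python</i><u>Tutorial</u><i>Guide</i></p>"
soup = BeautifulSoup(markup, "html.parser")
b = soup.find("b")

# name filter: only the <i> siblings that follow <b>.
italics = b.find_next_siblings("i")
print([tag.get_text() for tag in italics])  # ['Python', 'Guide']

# limit: stop after the first match.
first_italic = b.find_next_siblings("i", limit=1)
print(first_italic)  # [<i>Python</i>]
```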